I'm working on a machine learning project that requires me to split my data into 2 groups (as is common in machine learning): A training set (90% of the original data) and a test set (10% of the data).
My data come in replicates, however: each sample is measured 3 times (the sample name is appended with _R1/_R2/_R3). It is critical that the replicates remain together when the data are split: if xxx_R1 is in the test set, then xxx_R2 and xxx_R3 also need to be in the test set.
In a previous post, I made a minimal reprex for this problem, and @FJCC solved it! The data could be split into 90% and 10% sets while keeping the replicates together:
library(tidyverse)
# Build sample names: each mtcars row name with a replicate suffix
R1 <- paste0(rownames(mtcars), "_R1")
R2 <- paste0(rownames(mtcars), "_R2")
R3 <- paste0(rownames(mtcars), "_R3")
mydf <- data.frame(samples = c(R1, R2, R3))
# Randomly shuffle the rows to simulate my 'real' data
mydf <- data.frame(samples = mydf[sample(nrow(mydf)), ])
# Split each name into the sample root and the replicate label
mydf <- mydf %>% separate(samples, into = c("Root", "Repeat"),
                          remove = FALSE, sep = "_")
# Sample 90% of the unique roots, so all replicates of a root stay together
Roots <- unique(mydf$Root)
group1_number <- ceiling(0.9 * length(Roots))
Group1 <- sample(Roots, group1_number)
group1_df <- mydf %>% filter(Root %in% Group1)   # training set (90%)
group2_df <- mydf %>% filter(!Root %in% Group1)  # test set (10%)
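As a quick sanity check on a split like the one above, the two properties that matter (no root shared between the sets, and every root in the test set carrying all 3 replicates) can be verified directly. This is only a sketch: it rebuilds the example in base R so it runs on its own, and the `set.seed()` call and the `sub()` pattern for extracting the root are assumptions added for illustration.

```r
# Rebuild the example data in base R (assumption: seed only for reproducibility)
set.seed(1)
roots <- rownames(mtcars)
mydf <- data.frame(samples = paste0(rep(roots, each = 3), "_R", 1:3))
mydf$Root <- sub("_R[0-9]+$", "", mydf$samples)  # strip the replicate suffix

# Same 90/10 split by root as above
Roots <- unique(mydf$Root)
Group1 <- sample(Roots, ceiling(0.9 * length(Roots)))
group1_df <- mydf[mydf$Root %in% Group1, ]
group2_df <- mydf[!mydf$Root %in% Group1, ]

# Check 1: no root appears in both the training and the test set
no_overlap <- length(intersect(group1_df$Root, group2_df$Root)) == 0
# Check 2: every root in the test set has all 3 replicates
replicates_complete <- all(table(group2_df$Root) == 3)
```

Both `no_overlap` and `replicates_complete` should be `TRUE` if the replicates were kept together.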
My question now: how do I make 10 different iterations of this, where a different 10% is held out as the test set each time (and the remaining 90% is used as the training set)? This is also known as 10-fold cross-validation.
I want to add 10 columns to the data: FOLD1 through FOLD10. In these columns the values will be "training" or "test".
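One possible way to get such columns, sketched here under some assumptions, is to assign each unique root to exactly one of 10 folds and then derive the FOLD columns from that assignment. The example rebuilds the data in base R so it is self-contained. The names `k` and `fold_of_root` and the `set.seed()` call are illustrative and not part of the original solution.

```r
# Rebuild the shuffled example data (assumption: seed only for reproducibility)
set.seed(42)
roots <- rownames(mtcars)
mydf <- data.frame(samples = paste0(rep(roots, each = 3), "_R", 1:3))
mydf <- mydf[sample(nrow(mydf)), , drop = FALSE]
mydf$Root <- sub("_R[0-9]+$", "", mydf$samples)  # strip the replicate suffix

k <- 10
Roots <- unique(mydf$Root)

# Shuffle the roots and deal out fold ids 1..k in turn, so that
# fold sizes differ by at most one root
fold_of_root <- setNames(rep_len(1:k, length(Roots)), sample(Roots))

# For fold i, rows whose root belongs to fold i are "test", all others "training"
for (i in 1:k) {
  in_test <- unname(fold_of_root[mydf$Root]) == i
  mydf[[paste0("FOLD", i)]] <- ifelse(in_test, "test", "training")
}
```

Because folds are assigned per root rather than per row, all three replicates of a sample always share the same label within a fold, and each sample is in the test set in exactly one of the 10 folds. With 32 roots the folds cannot be exactly 10% each, so each test set holds 3 or 4 roots.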