I'm following one of the fantastic Tidymodels tutorials on XGBoost from @julia, and everything works as expected for my own data except one thing: the split of training and test sets.
My data have 3 replicates, i.e. each sample was measured 3 times.
I don't want to split the data such that sampleA_R1 and sampleA_R2 end up in the training set, but sampleA_R3 ends up in the test set.
How can I prevent this while still using the same simple syntax from {tidymodels} and {rsample}?
Minimal reprex (mtcars with 3 replicates):
library(dplyr)

# Make a version of mtcars that has 3 replicates for each row
# (as if each car were measured 3 times rather than once)
mtcars_R1 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R1"))

mtcars_R2 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R2")) %>%
  mutate_all(function(x) x * 2)

mtcars_R3 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R3")) %>%
  mutate_all(function(x) x * 3)

# Stack the three replicates into one data frame
mtcars_new <- rbind(mtcars_R1, mtcars_R2, mtcars_R3)
The group_initial_split() function from rsample should take care of this use case. Check out the "Grouped Resampling" section of this article on the rsample website for more information.
Thank you! I looked at the group_initial_split() function and it seems that's what I need. I can't seem to make it work with the reprex I posted, though.
Can you help me with what the code should look like here?
initial_split() has the strata argument, which I still want to use. Does group_initial_split() need a grouping variable built from the part of each name before "_R1", "_R2", or "_R3"? Is that right?
Thanks for the help here to you and @nirgrahamuk!
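For reference, here is a minimal sketch of how the grouping could look with the reprex data. It assumes the replicate suffix always has the form `_R<number>` at the end of the row name. The helper `make_rep()` and the column `sample_id` are hypothetical names introduced only for this illustration, and stratification is left out here.

```r
library(rsample)

# Rebuild the reprex data in base R: row names end in _R1, _R2, _R3
# (make_rep() is a hypothetical helper, not part of any package)
make_rep <- function(r) {
  out <- mtcars * r
  rownames(out) <- paste0(rownames(mtcars), "_R", r)
  out
}
mtcars_new <- rbind(make_rep(1), make_rep(2), make_rep(3))

# Hypothetical grouping column: the sample name with the replicate suffix removed
mtcars_new$sample_id <- sub("_R[0-9]+$", "", rownames(mtcars_new))

set.seed(123)
car_split <- group_initial_split(mtcars_new, group = sample_id, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Samples that appear on both sides of the split (expected to be none)
intersect(car_train$sample_id, car_test$sample_id)
```

Because the split is made on `sample_id` rather than on individual rows, all three replicates of a car should land on the same side.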
I want all replicates of a given sample to be together in either the training set or the test set.
For example, I want sampleA_R1, sampleA_R2, and sampleA_R3 to all be in the test set, or all be in the training set.
I tried using the 'root' sample name (sampleA, for example, rather than sampleA_R1) as the group argument of group_initial_split(), but that doesn't seem to be how it's used:
Error in `check_grouped_strata()`:
! `strata` must be constant across all members of each `group`.
Run `rlang::last_error()` to see where the error occurred.
How can I keep all replicates of a sample together this way?
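The error message says that the strata column must have a single value within each group. In the reprex every column is scaled by the replicate number, so no existing column satisfies that. One possible workaround, sketched below, is to derive a group-level stratification column first. It assumes the installed rsample version accepts `strata` in `group_initial_split()` (the error above suggests it does). `make_rep()`, `sample_id`, and `cyl_group` are hypothetical names used only for this illustration.

```r
library(rsample)

# Rebuild the reprex data in base R (make_rep() is a hypothetical helper)
make_rep <- function(r) {
  out <- mtcars * r
  rownames(out) <- paste0(rownames(mtcars), "_R", r)
  out
}
mtcars_new <- rbind(make_rep(1), make_rep(2), make_rep(3))

# Grouping column: the sample name without the replicate suffix
mtcars_new$sample_id <- sub("_R[0-9]+$", "", rownames(mtcars_new))

# Group-level strata: one value per sample, here the smallest cyl
# across that sample's replicates, stored as a factor
mtcars_new$cyl_group <- factor(
  ave(mtcars_new$cyl, mtcars_new$sample_id, FUN = min)
)

set.seed(123)
car_split <- group_initial_split(
  mtcars_new,
  group  = sample_id,
  strata = cyl_group,
  prop   = 0.75
)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Samples that appear on both sides of the split (expected to be none)
intersect(car_train$sample_id, car_test$sample_id)
```

With real data, the same idea would be to summarise the stratification variable once per sample (for example a per-sample mean, binned if needed) so that every replicate of a sample carries the same value.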