Split entries into 2 groups, where there is no overlap?

cwright1 · April 7, 2022, 1:17am

I have a list of sample names each repeated 3 times, identified by "_R1" or "_R2" or "_R3".
How can I split them such that:

90% of samples go to Group1, the rest go to Group2
and
There is no overlap of the 'original' sample names between the two groups (i.e. avoid a scenario where Group1 contains samplex_R1 and samplex_R2 while Group2 has samplex_R3)?

Minimal reprex:


#Make a dataframe with a list of names in triplicate, identified by _R1/2/3
R1 <- paste0(rownames(mtcars),"_R1")
R2 <- paste0(rownames(mtcars),"_R2")
R3 <- paste0(rownames(mtcars),"_R3")

mydf <- data.frame("samples" = Reduce(union, c(R1,R2,R3)))

#Randomly shuffle the rows to simulate my 'real' data
mydf <- data.frame("samples"=mydf[sample(1:nrow(mydf)),])


#########################################################################
#----Separate 90% of the samples into group1, put the rest in group2----#
#########################################################################

#First get the numbers of samples going into each group
group1_number <- ceiling(0.9 * nrow(mydf)) #90%
group2_number <- (nrow(mydf)) -  group1_number #the rest

#Get the names that will go into group1/2
group1_names <- mydf[1:group1_number,c("samples")]
group2_names <- mydf[(group1_number+1):nrow(mydf),c("samples")]

#Place samples in group1 and group2
group1 <- data.frame("samples"=mydf[mydf$samples %in% group1_names,])
group2 <- data.frame("samples"=mydf[mydf$samples %in% group2_names,])

#How do I avoid overlap of 'base' sample names in these groups?

FJCC · April 7, 2022, 2:37am

Does this give you what you want?

library(tidyverse)
R1 <- paste0(rownames(mtcars),"_R1")
R2 <- paste0(rownames(mtcars),"_R2")
R3 <- paste0(rownames(mtcars),"_R3")

mydf <- data.frame("samples" = Reduce(union, c(R1,R2,R3)))

#Randomly shuffle the rows to simulate my 'real' data
mydf <- data.frame("samples"=mydf[sample(1:nrow(mydf)),])
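#Split each name into its base name (Root) and replicate tag (Repeat).
#sep = "_" assumes the base name itself contains no underscore,
#which holds for the mtcars row names.
mydf <- mydf |> separate(samples,into = c("Root","Repeat"),
                         remove = FALSE, sep = "_")

#Sample 90% of the unique base names rather than 90% of the rows
Roots <- unique(mydf$Root)
group1_number <- ceiling(0.9 * length(Roots))
Group1 <- sample(Roots, group1_number)

#All replicates of a base name land in the same group
group1_df <- mydf |> filter(Root %in% Group1)
group2_df <- mydf |> filter(!(Root %in% Group1))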
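
If you would rather not load the tidyverse, or your real base names might contain underscores, the same idea can be sketched in base R. This is only a sketch under assumptions: strip_replicate() is a hypothetical helper name, the replicate tag is assumed to always be a trailing "_R" followed by digits, and set.seed() is there only to make the split reproducible. The stopifnot() calls at the end check that no base name ended up in both groups.

```r
# Sketch only: strip_replicate() is a hypothetical helper, and the trailing
# "_R<digits>" pattern is an assumption about how replicates are labelled.
strip_replicate <- function(x) sub("_R[0-9]+$", "", x)

set.seed(42)  # only so the random split is reproducible

# Same toy data as above: 32 base names x 3 replicates, rows shuffled
samples <- as.vector(outer(rownames(mtcars), c("_R1", "_R2", "_R3"), paste0))
mydf <- data.frame(samples = sample(samples))

# Base name for each row, with only the trailing replicate tag removed
mydf$Root <- strip_replicate(mydf$samples)

# Sample 90% of the unique base names (not 90% of the rows)
roots <- unique(mydf$Root)
group1_roots <- sample(roots, ceiling(0.9 * length(roots)))

# Every replicate follows its base name into the same group
group1_df <- mydf[mydf$Root %in% group1_roots, , drop = FALSE]
group2_df <- mydf[!(mydf$Root %in% group1_roots), , drop = FALSE]

# Checks: no base name appears in both groups, and no rows were lost
stopifnot(length(intersect(group1_df$Root, group2_df$Root)) == 0)
stopifnot(nrow(group1_df) + nrow(group2_df) == nrow(mydf))
```

Because the split is done on base names, the row proportions are only approximately 90/10. With the 32 mtcars names, 29 base names (87 rows) go to group 1 and 3 base names (9 rows) go to group 2.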
