Suppose I have the following vectors of factor levels:
factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")
I made the following data frame, which contains all 3 * 2 * 4 * 3 = 72 combinations of these factor levels:
data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)
data_exp$id = 1:nrow(data_exp)
> head(data_exp)
Var1 Var2 Var3 Var4 id
1 A1 B1 C1 D1 1
2 A2 B1 C1 D1 2
3 A3 B1 C1 D1 3
4 A1 B2 C1 D1 4
5 A2 B2 C1 D1 5
6 A3 B2 C1 D1 6
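As a quick sanity check on the arithmetic, the row count can be compared with the product of the number of levels. This is only a minimal sketch; `n_expected` and `n_unique` are names introduced here for illustration.

```r
factor_1 <- c("A1", "A2", "A3")
factor_2 <- c("B1", "B2")
factor_3 <- c("C1", "C2", "C3", "C4")
factor_4 <- c("D1", "D2", "D3")

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)

# Expected number of combinations: product of the number of levels per factor
n_expected <- prod(lengths(list(factor_1, factor_2, factor_3, factor_4)))

# Number of distinct rows actually produced by expand.grid()
n_unique <- nrow(unique(data_exp))

# Stops with an error if the counts disagree
stopifnot(nrow(data_exp) == n_expected, n_unique == n_expected)
```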
I want to randomly split this data frame (data_exp) into 3 datasets such that each row appears in exactly one of them. The 3 datasets do not have to be the same size. I tried to do this with the following code.
First, I generate 3 random integers, one for the number of rows in each dataset, such that they sum to 72:
# https://stackoverflow.com/questions/24845909/generate-n-random-integers-that-sum-to-m-in-r
rand_vect <- function(N, M, sd = 1, pos.only = TRUE) {
  # Draw N values centred on M/N, then rescale and round so they sum to about M
  vec <- rnorm(N, M/N, sd)
  if (abs(sum(vec)) < 0.01) vec <- vec + 1
  vec <- round(vec / sum(vec) * M)
  # Correct any rounding error one unit at a time at random positions
  deviation <- M - sum(vec)
  for (. in seq_len(abs(deviation))) {
    vec[i] <- vec[i <- sample(N, 1)] + sign(deviation)
  }
  # Optionally remove negative entries while keeping the sum at M
  if (pos.only) while (any(vec < 0)) {
    negs <- vec < 0
    pos <- vec > 0
    vec[negs][i] <- vec[negs][i <- sample(sum(negs), 1)] + 1
    vec[pos][i] <- vec[pos ][i <- sample(sum(pos ), 1)] - 1
  }
  vec
}
r = rand_vect(3, 72)
> r
[1] 26 23 23
Next, I tried to create these datasets using these random numbers:
data_1 = data_exp[sample(nrow(data_exp), r[1]), ]
data_2 = data_exp[sample(nrow(data_exp), r[2]), ]
data_3 = data_exp[sample(nrow(data_exp), r[3]), ]
The problem with this approach is that data_1, data_2, and data_3 can have rows in common, and not all the rows from data_exp are necessarily present in data_1, data_2, and data_3.
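To make the problem concrete, here is a minimal sketch of how the overlap and the missing rows could be counted, assuming the id column uniquely identifies each row. The seed is arbitrary, and `all_ids`, `n_duplicated`, and `n_missing` are names introduced here for illustration.

```r
# Rebuild the setup from above so this block runs on its own
data_exp <- expand.grid(
  c("A1", "A2", "A3"),
  c("B1", "B2"),
  c("C1", "C2", "C3", "C4"),
  c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1)          # arbitrary seed, only for reproducibility
r <- c(26, 23, 23)   # the sizes shown above

# The three independent draws from my attempt
data_1 <- data_exp[sample(nrow(data_exp), r[1]), ]
data_2 <- data_exp[sample(nrow(data_exp), r[2]), ]
data_3 <- data_exp[sample(nrow(data_exp), r[3]), ]

all_ids <- c(data_1$id, data_2$id, data_3$id)

# Number of draws that repeat an id already drawn
n_duplicated <- sum(duplicated(all_ids))

# Number of rows of data_exp that were never drawn
n_missing <- length(setdiff(data_exp$id, all_ids))
```

For a valid split, both n_duplicated and n_missing would have to be 0.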
Is there a way to fix this problem?
Thank you!