Making Combinations of Items

omario · April 20, 2022, 3:29am

Suppose I have the following factors:

factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")

I made the following data frame that contains all 3 * 2 * 4 * 3 = 72 combinations of these factors:

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4) 
data_exp$id = 1:nrow(data_exp)

> head(data_exp)
  Var1 Var2 Var3 Var4 id
1   A1   B1   C1   D1  1
2   A2   B1   C1   D1  2
3   A3   B1   C1   D1  3
4   A1   B2   C1   D1  4
5   A2   B2   C1   D1  5
6   A3   B2   C1   D1  6

I want to randomly split this data (data_exp) into 3 datasets such that each row appears in exactly one of them. The 3 datasets do not have to be the same size. I tried to do this with the following code.

First, I generate 3 random numbers, one for the number of rows in each dataset, such that they add up to 72:

# https://stackoverflow.com/questions/24845909/generate-n-random-integers-that-sum-to-m-in-r

rand_vect <- function(N, M, sd = 1, pos.only = TRUE) {
  vec <- rnorm(N, M/N, sd)
  if (abs(sum(vec)) < 0.01) vec <- vec + 1
  vec <- round(vec / sum(vec) * M)
  deviation <- M - sum(vec)
  for (. in seq_len(abs(deviation))) {
    vec[i] <- vec[i <- sample(N, 1)] + sign(deviation)
  }
  if (pos.only) while (any(vec < 0)) {
    negs <- vec < 0
    pos  <- vec > 0
    vec[negs][i] <- vec[negs][i <- sample(sum(negs), 1)] + 1
    vec[pos][i]  <- vec[pos ][i <- sample(sum(pos ), 1)] - 1
  }
  vec
}

r = rand_vect(3, 72)
[1] 26 23 23

Next, I tried to create these datasets using these random numbers:

data_1 = data_exp[sample(nrow(data_exp), r[1]), ]
data_2 = data_exp[sample(nrow(data_exp), r[2]), ]
data_3 = data_exp[sample(nrow(data_exp), r[3]), ]

The problem with this approach is that each sample() call draws from all 72 rows independently, so data_1, data_2 and data_3 have rows in common, and not all the rows from data_exp are necessarily present in data_1, data_2 and data_3.
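
One way to see the overlap is to compare the row numbers drawn by the three calls. This is only a minimal sketch: the seed and the names ids_1, ids_2, ids_3 and all_ids are made up for illustration, and the sizes 26, 23, 23 are copied from the rand_vect() output above.

```r
factor_1 <- c("A1", "A2", "A3")
factor_2 <- c("B1", "B2")
factor_3 <- c("C1", "C2", "C3", "C4")
factor_4 <- c("D1", "D2", "D3")

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)
data_exp$id <- 1:nrow(data_exp)

set.seed(1)          # arbitrary seed, for reproducibility only
r <- c(26, 23, 23)   # sizes copied from the rand_vect() output above

# Each call samples from ALL 72 rows, independently of the other calls
ids_1 <- sample(nrow(data_exp), r[1])
ids_2 <- sample(nrow(data_exp), r[2])
ids_3 <- sample(nrow(data_exp), r[3])

all_ids <- c(ids_1, ids_2, ids_3)
sum(duplicated(all_ids))                # how many draws repeat an earlier row
length(setdiff(data_exp$id, all_ids))   # how many rows were never drawn
```

Because 72 row numbers are drawn in total from 72 rows, the two counts are always equal: every repeated row leaves another row out.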

Is there a way to fix this problem?

Thank you!

ganapap1 · April 20, 2022, 5:25am

Hope this is of some use to you

factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)
data_exp$id = 1:nrow(data_exp)

set.seed(1234)
idx <- sample(3, size = nrow(data_exp), replace = TRUE, prob = c(0.33, 0.33, 0.34))
df1 <- data_exp[idx == 1,]
df2 <- data_exp[idx == 2,]
df3 <- data_exp[idx == 3,]
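
To check that the three pieces really partition data_exp, their id columns can be compared with the original. The sketch below repeats the code above so that it runs on its own; split_ids is just an illustrative name.

```r
factor_1 <- c("A1", "A2", "A3")
factor_2 <- c("B1", "B2")
factor_3 <- c("C1", "C2", "C3", "C4")
factor_4 <- c("D1", "D2", "D3")

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)
data_exp$id <- 1:nrow(data_exp)

set.seed(1234)
idx <- sample(3, size = nrow(data_exp), replace = TRUE, prob = c(0.33, 0.33, 0.34))
df1 <- data_exp[idx == 1, ]
df2 <- data_exp[idx == 2, ]
df3 <- data_exp[idx == 3, ]

# Every row gets exactly one label in idx, so the pieces cannot overlap
split_ids <- c(df1$id, df2$id, df3$id)
anyDuplicated(split_ids) == 0      # TRUE: no row appears twice
setequal(split_ids, data_exp$id)   # TRUE: no row is missing
table(idx)                         # group sizes, which depend on the seed
```

The group sizes here are random too, since each row is assigned to a group independently with the probabilities given in prob.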

omario · April 20, 2022, 3:16pm

Thank you so much! Ideally I would like the number of rows in df1, df2, and df3 to be fully random and still add up to nrow(data_exp). Is this possible? Thank you so much!

ganapap1 · April 20, 2022, 3:28pm

Hi
I checked before posting: the split is random, and the rows in all three add up to the number of rows in the original dataset.
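
If the group sizes themselves should be drawn at random rather than steered by prob, one possible alternative (not taken from the code above, just a sketch) is to shuffle the row numbers once and cut the shuffled sequence at two random points. The seed and the names shuffled, cuts, sizes, group and parts are assumptions made for this example.

```r
factor_1 <- c("A1", "A2", "A3")
factor_2 <- c("B1", "B2")
factor_3 <- c("C1", "C2", "C3", "C4")
factor_4 <- c("D1", "D2", "D3")

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4)
data_exp$id <- 1:nrow(data_exp)

set.seed(42)                       # arbitrary seed, for reproducibility only
n <- nrow(data_exp)

shuffled <- sample(n)              # random permutation of the row numbers
cuts <- sort(sample(n - 1, 2))     # two distinct cut points in 1..(n - 1)
sizes <- diff(c(0, cuts, n))       # three positive sizes that sum to n
group <- rep(1:3, times = sizes)   # group label for each shuffled position

# Split the shuffled rows by label; each row lands in exactly one piece
parts <- split(data_exp[shuffled, ], group)
data_1 <- parts[["1"]]
data_2 <- parts[["2"]]
data_3 <- parts[["3"]]

sizes                              # the random group sizes
```

Since the cut points are distinct and lie strictly inside 1..(n - 1), every piece has at least one row. Note that uniformly drawn cut points can produce much more uneven sizes than the rand_vect() approach in the question.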
