RStudio: splitting data while setting a seed

R_Chhabra · March 3, 2025, 10:17pm

Hi All,

I am having trouble splitting the data for 10 datasets using 100 seeds. I need to create a for loop. However, my code isn't accepting the set.seed(Sd[i]) statement. Here's my code for creating 100 splits for one dataset, mod_M:

for (i in 1:length(Sd)){

set.seed(Sd[i])
M_train_s_Sd[i]=
mod_M[sample(nrow(mod_M), size=nrow(mod_M)*0.7, replace=F),]
M_test_s_Sd[i]=filter (mod_M, !subject.id %in% M_train_s_Sd[i]$subject.id)

i++

}

Could anyone please help? Thank you!

Radhika

AlexisW · March 4, 2025, 12:36am

There are a few potential problems here.

First, are you sure you want to manually set a seed within a loop? The random number generator is meant to produce a stream of random numbers starting from one seed. If you keep changing the seed, I don't think you still have good guarantees about the randomness across iterations. In general it's a better idea to set the seed once at the beginning, and then let the random number generator do its thing.

Second problem: i++ does not exist in R; to increment a variable you would write i <- i + 1. (In C, i++ on its own does increment i in place, but R has no such operator.)
Either way, you don't need it here: in R the for loop already advances i automatically, so that line can simply be deleted.
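
To illustrate the second point, here is a small sketch (the variable name visited is made up for the example). It shows that R's for loop steps through its sequence on its own, and that reassigning i inside the body does not change which value comes next:

```r
visited <- integer(0)

for (i in 1:3) {
  visited <- c(visited, i)  # record the value the loop handed us
  i <- i + 10               # only affects the rest of this iteration
}

print(visited)
#> [1] 1 2 3
```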

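Third problem:

 mod_M[sample(nrow(mod_M), size=nrow(mod_M)*0.7, replace=F),]

On its own, this line would compute a value without saving it. In your code it continues the assignment started on the previous line, so the value is assigned, but M_train_s_Sd[i] uses single brackets. To store a whole data frame as one element of a list, create the list before the loop (M_train_s_Sd <- list()) and assign with double brackets. Maybe you wanted to write something like this?

M_train_s_Sd[[i]] <- mod_M[sample(nrow(mod_M), size=nrow(mod_M)*0.7, replace=F),]
M_test_s_Sd[[i]] <- filter(mod_M, !subject.id %in% M_train_s_Sd[[i]]$subject.id)

I'm not sure, though; I don't fully understand what the code is attempting to do.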

Speaking of, if your goal is to split a dataset, taking a dataframe and creating a list of subsets, you might be interested in the split() function. If your goal is to create random subsamples, maybe you're better off with something like:

n_groups <- 10
groups <- sample(1:n_groups, size = nrow(mod_M), replace = TRUE)

# check the groups
table(groups)

# create list of subsamples
list_of_subsamples <- split(mod_M, f = groups)

or other variation on this.
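
Going back to the first point above, here is a minimal sketch of setting the seed once and then drawing several 70/30 splits from the same random stream. The data frame dat, the seed value 2025, and n_splits are all made up for illustration:

```r
set.seed(2025)  # arbitrary seed, set once before any random draws

# Made-up example data; replace dat with your own data frame
dat <- data.frame(id = 1:100, x = rnorm(100))

n_splits <- 5
n_train <- floor(0.7 * nrow(dat))

# Each call to sample() continues the same random stream,
# so every split differs but the whole run is reproducible
train_rows <- replicate(n_splits, sample(nrow(dat), n_train), simplify = FALSE)

train_sets <- lapply(train_rows, function(rows) dat[rows, ])
test_sets <- lapply(train_rows, function(rows) dat[-rows, ])

sapply(train_sets, nrow)
sapply(test_sets, nrow)
```

The trade-off is that reproducing, say, the third split means rerunning everything from the set.seed() call, whereas one seed per split lets you regenerate any single split on its own.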

R_Chhabra · March 4, 2025, 7:28pm

Hi Alexis,

Thank you for your advice. I want to split my data into 70% training and 30% test datasets. I want to set seed by index. For example, the list for seeds is:

Sd<- list(42,97,123,......)

I want to pull these specific numbers from the list and pass them to set.seed() automatically, one seed per split, for 100 seeds using a for loop. But set.seed() is not accepting a value indexed from this list.

If you could please help me with that, that would be great! Thanks,
radhika

AlexisW · March 4, 2025, 7:50pm

Are you trying to get several 70-30 splits, or just a single one?

I think this does what you want?

dat <- data.frame(x = rnorm(100))

seeds <- 1:10

train_sets <- list()
test_sets <- list()

for(i in seq_along(seeds)){
  
  set.seed( seeds[[i]] )
  
  all_indices <- seq_len(nrow(dat))
  
  train_indices <- sample(all_indices, .7 * length(all_indices))
  test_indices <- setdiff(all_indices, train_indices)
  
  train_sets[[i]] <- dat[train_indices,]
  test_sets[[i]] <- dat[test_indices,]
}

str(train_sets)
str(test_sets)


# double check seed was used as expected

set.seed(6)
all_indices <- seq_len(nrow(dat))

train_indices <- sample(all_indices, .7 * length(all_indices))
test_indices <- setdiff(all_indices, train_indices)

train_6 <- dat[train_indices,]
test_6 <- dat[test_indices,]

all.equal(train_sets[[6]], train_6)
all.equal(test_sets[[6]], test_6)
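
A note on the seeds[[i]] in the loop above, since the seeds in the question are stored in a list: single brackets on a list return a one-element list, while double brackets return the number itself. As far as I can tell, set.seed() rejects a list, which would explain the original problem. A small sketch, using a shortened three-element version of the Sd list:

```r
Sd <- list(42, 97, 123)  # shortened stand-in for the list of seeds

class(Sd[1])    # "list": single brackets return a sub-list
class(Sd[[1]])  # "numeric": double brackets return the element itself

# Passing the sub-list to set.seed() fails; try() captures the error
list_attempt <- try(set.seed(Sd[1]), silent = TRUE)
inherits(list_attempt, "try-error")

# Passing the extracted number works and is reproducible
set.seed(Sd[[1]])
first_draw <- sample(10, 3)
set.seed(Sd[[1]])
second_draw <- sample(10, 3)
identical(first_draw, second_draw)
```

Double brackets also work on a plain numeric vector such as c(42, 97, 123), so seeds[[i]] is safe either way.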

Scarletios · March 5, 2025, 10:07am

Your issue is likely due to incorrect indexing and the use of i++, which is not valid in R. Try this corrected version:

M_train_s_Sd <- list()
M_test_s_Sd <- list()

for (i in seq_along(Sd)) {
  # [[i]] extracts the seed itself; Sd[i] on a list would return a one-element list
  set.seed(Sd[[i]])
  train_indices <- sample(nrow(mod_M), size = nrow(mod_M) * 0.7, replace = FALSE)
  M_train_s_Sd[[i]] <- mod_M[train_indices, ]
  M_test_s_Sd[[i]] <- mod_M[!mod_M$subject.id %in% M_train_s_Sd[[i]]$subject.id, ]
}

This ensures that M_train_s_Sd and M_test_s_Sd store the splits correctly in lists while fixing the indexing issue.
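
If it helps, splits like the ones above can be sanity-checked with a small helper. This is only a sketch: check_split is a hypothetical function written for this example, and the data frame below is made-up data standing in for mod_M.

```r
# Hypothetical helper: TRUE when train and test share no subject ids
# and together contain every subject id in the full data
check_split <- function(train, test, full, id = "subject.id") {
  no_overlap <- length(intersect(train[[id]], test[[id]])) == 0
  full_cover <- setequal(union(train[[id]], test[[id]]), full[[id]])
  no_overlap && full_cover
}

# Made-up data standing in for mod_M
set.seed(1)
mod_M <- data.frame(subject.id = 1:50, y = rnorm(50))

train <- mod_M[sample(nrow(mod_M), floor(nrow(mod_M) * 0.7)), ]
test <- mod_M[!mod_M$subject.id %in% train$subject.id, ]

check_split(train, test, mod_M)
c(train = nrow(train), test = nrow(test))
```

Note that if subject.id repeats across rows, excluding by id removes every row for a sampled subject, so the test set can end up smaller than 30% of the rows.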

R_Chhabra · March 12, 2025, 1:42pm

Thank you Alexis for the code, it works!

R_Chhabra · March 12, 2025, 1:43pm

Thank you Scarletios, your code works!
