How to randomly split a gene array 50/50

Hi,
I have a gene array in the form of a matrix [y, x], where y is a vector of outcomes and x is a data frame with the predictors as columns. It is a matrix because it will be used in glmnet() for ridge and lasso. However, before I use it (glmnet(y~x)…), I need to split it randomly 50/50 into training and testing datasets. How do I do that?
Many thanks!

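Suppose the matrix is called m. You can give each row a 50/50 chance of going into the training set with a logical index. (Index with TRUE/FALSE rather than 0/1: as a numeric index, 0 selects nothing and 1 selects only the first row.)

inTraining <- sample(c(TRUE, FALSE), size = nrow(m), replace = TRUE)
training <- m[inTraining, ]
testing <- m[!inTraining, ]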
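
Giving each row an independent 50% chance produces two groups that are only roughly equal in size. If an exact half-and-half split is wanted, one option is to draw half of the row numbers without replacement. This is a minimal sketch: the matrix m below is a made-up stand-in, and trainRows is just an illustrative name.

```r
# Hypothetical stand-in for the real matrix: 10 rows, 3 columns.
m <- matrix(rnorm(30), nrow = 10)

# Draw exactly half of the row numbers, without replacement.
trainRows <- sample(nrow(m), size = floor(nrow(m) / 2))

# drop = FALSE keeps the result a matrix even if only one row is selected.
training <- m[trainRows, , drop = FALSE]   # the rows that were drawn
testing  <- m[-trainRows, , drop = FALSE]  # every remaining row
```

With an odd number of rows, floor() puts the extra row in the testing set.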

My matrix named 'data' looks like this:
image
@startz, when I put size = nrow(data), I get an error that size argument is invalid.

I wonder whether R knows it's a matrix. Try nrow(as.matrix(data))

What has worked best so far is using nrow(x), since x is the data frame part. However, now y is too long for the models. Any advice is much appreciated.

That's helpful. I see now that data is not a matrix, it is a list of a matrix and a vector. Try something like

inTraining <- sample(c(TRUE, FALSE), size = nrow(data$x), replace = TRUE)
training <- list(x = data$x[inTraining, ], y = data$y[inTraining])
testing <- list(x = data$x[!inTraining, ], y = data$y[!inTraining])
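
Once the training and testing lists exist, they can be handed to glmnet. Note that glmnet() takes the predictor matrix and the outcome as separate arguments rather than a formula. The sketch below uses simulated data in place of the real gene array and assumes the glmnet package is installed. The sizes n and p, the fixed seed, and the choice of lambda.min are illustrative assumptions, not something from this thread. If x is a data frame, wrap it in as.matrix() first.

```r
library(glmnet)

set.seed(1)
# Simulated stand-in for the real data: 100 samples, 20 predictors.
n <- 100
p <- 20
data <- list(x = matrix(rnorm(n * p), nrow = n), y = rnorm(n))

# Random split using a logical index.
inTraining <- sample(c(TRUE, FALSE), size = n, replace = TRUE)
training <- list(x = data$x[inTraining, ], y = data$y[inTraining])
testing  <- list(x = data$x[!inTraining, ], y = data$y[!inTraining])

# Fit on the training half only; cv.glmnet() picks lambda by cross-validation.
ridge <- cv.glmnet(training$x, training$y, alpha = 0)  # alpha = 0: ridge
lasso <- cv.glmnet(training$x, training$y, alpha = 1)  # alpha = 1: lasso

# Predict the held-out half and compute the test mean squared error.
ridgePred <- predict(ridge, newx = testing$x, s = "lambda.min")
lassoPred <- predict(lasso, newx = testing$x, s = "lambda.min")
ridgeMSE <- mean((testing$y - ridgePred)^2)
lassoMSE <- mean((testing$y - lassoPred)^2)
```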

Thank you so much! The code you suggested worked. One more thing - do I need to add set.seed(1) before the code to make this partition reproducible?

Yes. Doesn't have to be 1. All that matters is that set.seed() always gets the same argument.
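
A quick way to confirm this is to draw the same sample twice after resetting the seed. The seed value 42 and the vector length 10 here are arbitrary choices for illustration.

```r
set.seed(42)   # any fixed integer works
first <- sample(c(TRUE, FALSE), size = 10, replace = TRUE)

set.seed(42)   # reset to the same seed before drawing again
second <- sample(c(TRUE, FALSE), size = 10, replace = TRUE)

identical(first, second)  # TRUE: same seed, same draw
```

The call to set.seed() has to come before sample() each time the script runs; any other random draws made in between will change the result.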

