How to split a dataset with tidyverse functions

aos · August 26, 2020, 1:29pm

I'd like to split a data set into a train and a test set. The slice_sample function lets me sample by n or prop and respects groups, which is great. anti_join then gives me the rest of the data, given the sampled rows. However, anti_join removes every row that matches a sampled row, so if the data contains duplicates, it removes all copies of a sampled row rather than only the sampled one.

library(dplyr)
data_test <- iris %>% slice_sample(n = 40)
# joins on all columns, so duplicates of sampled rows are dropped too
data_train <- iris %>% anti_join(data_test)
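
To make the duplicate problem concrete, here is a minimal sketch on a made-up three-row tibble (toy, toy_test and toy_train are hypothetical names, not part of the original data). The "sample" is taken with slice(1) instead of slice_sample so the result is deterministic:

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are identical
toy <- tibble(x = c(1, 1, 2), y = c("a", "a", "b"))

# Pretend the sampled test set is just the first row
toy_test <- toy %>% slice(1)

# anti_join() drops every row of toy that matches toy_test,
# including the duplicate that was never sampled
toy_train <- toy %>% anti_join(toy_test, by = c("x", "y"))

nrow(toy_test)   # 1
nrow(toy_train)  # 1, not 2: the unsampled duplicate is gone as well
```

One row ends up in the test set and only one in the train set, so the two parts no longer add up to the original three rows.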

Is there a tidyverse way to fix this?

nirgrahamuk · August 26, 2020, 1:50pm

library(dplyr)
set.seed(42)

# A possible tidy way: tag each row with its row number,
# then drop the sampled rows by position
myiris <- iris %>% mutate(rn = row_number())
data_test <- myiris %>% slice_sample(n = 40)
data_train <- myiris %>% slice(-pull(data_test, rn))

# Not sure it's much better than the base R way
test_index <- sample.int(n = nrow(iris), size = 40)
dtest <- iris[test_index, ]
dtrain <- iris[-test_index, ]
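
Since the question mentions groups, the row-number idea can probably be combined with a grouped slice_sample and an anti_join on the id column alone. This is only a sketch: iris_id is a hypothetical name, and 10 rows per species is an arbitrary choice.

```r
library(dplyr)

set.seed(42)

# Add a unique row id so duplicated measurements stay distinguishable
iris_id <- iris %>% mutate(rn = row_number())

# Sample 10 rows per species for the test set
data_test <- iris_id %>%
  group_by(Species) %>%
  slice_sample(n = 10) %>%
  ungroup()

# Join on the id only, so exactly the sampled rows are removed
data_train <- iris_id %>% anti_join(data_test, by = "rn")

nrow(data_test)   # 30
nrow(data_train)  # 120
```

Because the join key is unique per row, duplicated measurements in iris no longer cause extra rows to be dropped.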

phiggins · August 27, 2020, 12:30pm

The {rsample} package (part of tidymodels) seems to be designed for this use case, with the initial_split() function.
Does this help?

library(tidymodels)
set.seed(42)
iris_split <- initial_split(iris, prop = 0.7)
train_data <- training(iris_split)
test_data <- testing(iris_split)
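
If the split should also respect groups, as in the original question, initial_split() takes a strata argument. A minimal sketch, assuming Species is the grouping variable of interest and loading only {rsample} rather than all of tidymodels:

```r
library(rsample)

set.seed(42)

# Stratify on Species so each species is split in roughly the same proportion
iris_split <- initial_split(iris, prop = 0.7, strata = Species)
train_data <- training(iris_split)
test_data <- testing(iris_split)

# Per-species counts in the training set
table(train_data$Species)

# The two parts partition the original rows, duplicates included
nrow(train_data) + nrow(test_data)  # 150
```

The split is done on row indices rather than row contents, so duplicated rows are not an issue here.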
