Creating a user-specified (non-random) validation set

dario-rod · May 8, 2021, 10:23pm

I have an "rsplit" object created by

rsample::initial_time_split()

Now I want to create just one validation set based on a column or on row order. I tried "validation_split()", but it only allows random sampling. I then tried "group_vfold_cv()", which gave the appropriate grouping but, as the name says, performs cross-validation and therefore gives me 2 resamples.

folds = group_vfold_cv(training(df_split), group = 'column')
# Group 2-fold cross-validation 
# A tibble: 2 x 2
  splits                 id       
  <list>                 <chr>    
1 <rsplit [40912/72608]> Resample1
2 <rsplit [72608/40912]> Resample2

I would like to do something like this:

folds = group_vfold_cv(training(df_split), group = 'column') %>%
          filter(id == "Resample2")

But this breaks its class and converts it to a tibble that will not be recognized by the tuning function (tune_grid()).
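
The class change can be checked directly. This is a minimal sketch, assuming a recent rsample together with dplyr 1.0 or later (older versions may behave differently); the data are made up for illustration:

```r
library(rsample)
library(dplyr)
library(tibble)

# Illustrative data with a two-level grouping column
df <- tibble(x = runif(100), y = runif(100), group_column = rep(c(1, 0), 50))

folds    <- group_vfold_cv(df, group = "group_column")
filtered <- folds %>% filter(id == "Resample2")

# Compare the classes before and after filtering rows
class(folds)
class(filtered)
inherits(filtered, "rset")
```

If the last call returns FALSE, the filtered object is a plain tibble rather than an rset, which is consistent with tune_grid() rejecting it.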

Does anyone know a way to accomplish this?

Here is a reprex of what I would like to do:

library(tidymodels)

df = tibble( x = runif(100, 0 ,1), y = runif(100, 0,1), group_column = rep(c(1,0), 50))

df_split = initial_split(df, prop = 3/4)

#the filter changes the class that is needed for the tune_grid function
folds = group_vfold_cv(training(df_split), group = 'group_column') %>%
  filter(id == "Resample2")

boost_spec <- parsnip::boost_tree(
  trees = tune(),
  tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
  
recipe <- recipe(y ~ ., data = head(training(df_split)))

boost_workflow = workflow() %>% 
  add_recipe(recipe) %>%
  add_model(boost_spec)

set.seed(123)
boost_grid <- grid_max_entropy(
  trees(),
  tree_depth(),
  size = 2)

boost_res = boost_workflow %>%
  tune_grid(resamples = folds,
            grid = boost_grid,
            metrics = metric_set(rmse))

Thanks a lot!

GreyMerchant · May 8, 2021, 11:12pm

Can you make a simple dummy version with data? A reprex (FAQ: How to do a minimal reproducible example ( reprex ) for beginners) makes it easier for me to create the exact objects on my side.

dario-rod · May 8, 2021, 11:28pm

Thanks, I just added a reprex.

GreyMerchant · May 9, 2021, 9:48am

Hi,

It looks as if other people have asked for the ability to split their data manually based on a column. See if the code below works for you. Also have a look here: feature request - manual split creation · Issue #158 · tidymodels/rsample · GitHub

library(tidymodels)

df = tibble( x = runif(100, 0 ,1), y = runif(100, 0,1), group_column = rep(c(1,0), 50))


df <- df %>% 
  arrange(group_column) %>% 
  mutate(.row = row_number())


split_prop <- (last(which(df$group_column == 1))) / nrow(df)

indices <-
  list(analysis   = df$.row[df$group_column == 1], 
       assessment = df$.row[df$group_column ==  0]
  )

split <- make_splits(indices, df %>% select(-.row))
training(split)
#> # A tibble: 50 x 3
#>         x     y group_column
#>     <dbl> <dbl>        <dbl>
#>  1 0.684  0.958            1
#>  2 0.469  0.304            1
#>  3 0.870  0.535            1
#>  4 0.107  0.899            1
#>  5 0.537  0.212            1
#>  6 0.0980 0.553            1
#>  7 0.0834 0.257            1
#>  8 0.0133 0.790            1
#>  9 0.0419 0.888            1
#> 10 0.0560 0.576            1
#> # ... with 40 more rows

testing(split)
#> # A tibble: 50 x 3
#>         x      y group_column
#>     <dbl>  <dbl>        <dbl>
#>  1 0.977  0.802             0
#>  2 0.839  0.0102            0
#>  3 0.979  0.0793            0
#>  4 0.0670 0.815             0
#>  5 0.573  0.287             0
#>  6 0.152  0.672             0
#>  7 0.203  0.373             0
#>  8 0.587  0.635             0
#>  9 0.709  0.446             0
#> 10 0.0289 0.198             0
#> # ... with 40 more rows

Created on 2021-05-09 by the reprex package (v2.0.0)

dario-rod · May 9, 2021, 2:21pm

Thank you very much for your response, but I am not looking to split the data into training and testing sets. I want to make a validation set from an already created training split. Does this make sense?
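
One possible way to get there is to apply the make_splits() idea to the training data only and wrap the result with manual_rset(). This is a sketch and not a confirmed answer from the thread: it assumes rsample 0.1.0 or later (where manual_rset() is available), and the choice of group 1 for fitting and group 0 for validation is arbitrary.

```r
library(rsample)
library(tibble)

set.seed(123)
df <- tibble(
  x = runif(100, 0, 1),
  y = runif(100, 0, 1),
  group_column = rep(c(1, 0), 50)
)

df_split <- initial_split(df, prop = 3/4)
train    <- training(df_split)

# Row positions within the training data, chosen by the grouping column
# rather than at random: group 1 for fitting, group 0 for validation.
indices <- list(
  analysis   = which(train$group_column == 1),
  assessment = which(train$group_column == 0)
)

# Build a single rsplit from the explicit indices
val_split <- make_splits(indices, train)

# Wrap the split in an rset with exactly one resample
folds <- manual_rset(list(val_split), ids = "Validation")

class(folds)
folds
```

Because folds keeps the rset class, it should be usable as tune_grid(resamples = folds, ...) in the reprex above, with the test portion of df_split left untouched.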
