Combining Rolling Origin Forecast Resampling and Group V-Fold Cross-Validation in rsample

RichiW · August 20, 2018, 7:44am

I would like to use the R package rsample to generate resamples of my data.

The package offers the function rolling_origin to produce resamples that keep the time series structure of the data. This means that the training data (called analysis in the package) always precede the test data (assessment).

On the other hand, I would like to sample the data in blocks, meaning that groups of rows are kept together during resampling. This can be done with the function group_vfold_cv. Months would be a natural choice of group: say we want to do time series cross-validation while always keeping each month together.

Is there a way to combine the two approaches in rsample?

I give examples for each procedure on its own:

## generate some data
library(tidyverse)
library(lubridate)
library(rsample)
my_dates = seq(as.Date("2018/1/1"), as.Date("2018/8/20"), "days")
some_data = tibble(dates = my_dates)  # data_frame() is deprecated in favour of tibble()
some_data$values = runif(length(my_dates))
some_data = some_data %>% mutate(month = as.factor(month(dates)))

This gives data of the following form:

# A tibble: 232 x 3
   dates      values month 
   <date>      <dbl> <fctr>
 1 2018-01-01 0.235  1     
 2 2018-01-02 0.363  1     
 3 2018-01-03 0.146  1     
 4 2018-01-04 0.668  1     
 5 2018-01-05 0.0995 1     
 6 2018-01-06 0.163  1     
 7 2018-01-07 0.0265 1     
 8 2018-01-08 0.273  1     
 9 2018-01-09 0.886  1     
10 2018-01-10 0.239  1

Then we can, for example, produce resamples that train on 20 weeks of data and test on the following 5 weeks (the skip parameter thins out the resamples: with skip = 7 the origin advances 8 rows from one resample to the next):

rolling_origin_resamples <- rolling_origin(
  some_data,
  initial    = 7*20,
  assess     = 7*5,
  cumulative = TRUE,
  skip       = 7
)

We can inspect the first split with the following code and see that the analysis and assessment sets do not overlap:

rolling_origin_resamples$splits[[1]] %>% analysis %>% tail
# A tibble: 6 x 3
  dates       values month 
  <date>       <dbl> <fctr>
1 2018-05-15 0.678   5     
2 2018-05-16 0.00112 5     
3 2018-05-17 0.339   5     
4 2018-05-18 0.0864  5     
5 2018-05-19 0.918   5     
6 2018-05-20 0.317   5 

### test data of first split:
rolling_origin_resamples$splits[[1]] %>% assessment %>% head
# A tibble: 6 x 3
  dates      values month 
  <date>      <dbl> <fctr>
1 2018-05-21  0.912 5     
2 2018-05-22  0.403 5     
3 2018-05-23  0.366 5     
4 2018-05-24  0.159 5     
5 2018-05-25  0.223 5     
6 2018-05-26  0.375 5
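
To check every split rather than only the first, something like the following sketch could be used. It rebuilds the example data so that it runs on its own; the name no_overlap and the seed are only illustrative choices, not anything from rsample.

```r
library(dplyr)
library(purrr)
library(rsample)

set.seed(42)  # arbitrary seed, only so the random values are reproducible

# Same shape of data as above: one row per day
some_data <- tibble(
  dates = seq(as.Date("2018-01-01"), as.Date("2018-08-20"), by = "day")
) %>%
  mutate(values = runif(n()))

rolling_origin_resamples <- rolling_origin(
  some_data,
  initial    = 7 * 20,
  assess     = 7 * 5,
  cumulative = TRUE,
  skip       = 7
)

# For each split: is the last analysis date strictly before
# the first assessment date?
no_overlap <- map_lgl(rolling_origin_resamples$splits, function(split) {
  max(analysis(split)$dates) < min(assessment(split)$dates)
})

all(no_overlap)
```

If all(no_overlap) is TRUE, no assessment row is dated on or before an analysis row in any of the splits.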

Alternatively, we can split by month:

## sampling by month:
gcv_resamples = group_vfold_cv(some_data, group = "month", v = 5)
gcv_resamples$splits[[1]]  %>% analysis %>% select(month) %>% summary
gcv_resamples$splits[[1]] %>% assessment %>% select(month) %>% summary
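
For what it is worth, one way the two ideas might be combined is to nest the data so that each month becomes a single row, run rolling_origin over those rows, and then unnest inside each split. The following is only a sketch under assumptions: it assumes tidyr 1.0 or later for the nest(data = ...) and unnest(cols = ...) syntax, and the helpers analysis_days and assessment_days are hypothetical names introduced here, not rsample functions. The choice of 4 training months and 1 test month is arbitrary.

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(rsample)

set.seed(1)  # arbitrary seed, only for reproducibility

some_data <- tibble(
  dates = seq(as.Date("2018-01-01"), as.Date("2018-08-20"), by = "day")
) %>%
  mutate(
    values = runif(n()),
    month  = as.factor(month(dates))
  )

# One row per month; the daily rows live in the list-column `data`
by_month <- some_data %>%
  nest(data = -month) %>%
  arrange(month)

# Rolling origin over months instead of days:
# train on the first 4 months, assess on the next one, then grow the window
monthly_resamples <- rolling_origin(
  by_month,
  initial    = 4,
  assess     = 1,
  cumulative = TRUE
)

# Hypothetical helpers that expand a split back to daily rows
analysis_days   <- function(split) analysis(split) %>% unnest(cols = data)
assessment_days <- function(split) assessment(split) %>% unnest(cols = data)

first_train <- analysis_days(monthly_resamples$splits[[1]])
first_test  <- assessment_days(monthly_resamples$splits[[1]])

range(first_train$dates)
range(first_test$dates)
```

Because the resampling unit is now a month, a month can never be split between analysis and assessment, and the time ordering is kept as long as the nested rows are sorted chronologically. For data spanning several years the grouping key would have to include the year as well, for example via floor_date(dates, "month"). Newer rsample releases also appear to offer sliding_period() for this kind of calendar-based resampling, which may be worth a look.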

PS: I admit this is cross-posted. It just occurred to me that this forum is the better place for this question ...
This is the link to the SO post with an answer that offers a solution but does not directly use rsample.

jcblum · August 21, 2018, 5:32pm

Thanks for admitting it, at least. Per this community’s guidelines on cross-posting, can you please also link to your other post? That way nobody wastes effort if your question gets a good answer elsewhere.

RichiW · August 21, 2018, 6:14pm

Thank you! I edited the post and included the link to the Stack Overflow post.