How to use rsample for multilevel resampling

tjmahr · June 20, 2022, 7:14pm

In multilevel modeling, we have observations nested in grouping variables. For example, the lme4::sleepstudy dataset has 10 observations each from 18 subjects. When bootstrapping this data for modeling, it makes sense to resample whole subjects. The best workflow for this procedure using rsample, as far as I know, is the following:

library(rsample)
library(tidyverse)

lme4::sleepstudy |> 
  # resample unique ids
  distinct(Subject) |> 
  bootstraps(times = 10) |> 
  # attach the original data to the ids
  mutate(
    analysis = lapply(
      splits, 
      function(x) left_join(analysis(x), lme4::sleepstudy, by = "Subject")
    )
  )

Note that this copies the original data several times and is wasteful.

I have tried to make a function that does low-level manipulation of the rset object (replacing the data and in_id fields) but this feels like cheating.

Is there a better way to use bootstraps() to bootstrap chunks of data where the units being resampled may represent multiple rows of data?
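One possible middle ground, sketched here under assumptions rather than offered as the recommended approach, is to build the splits from row indices with rsample's exported make_splits() and manual_rset() instead of editing the rset fields directly. Each split then stores row indices plus a reference to the one data frame, so no joined copies are created. The toy data and the helper name group_boot_split() are hypothetical and only stand in for lme4::sleepstudy.

```r
library(rsample)

# Toy stand-in for lme4::sleepstudy (hypothetical data): 6 subjects, 3 rows each
dat <- data.frame(
  Subject = factor(rep(paste0("s", 1:6), each = 3)),
  Days = rep(0:2, times = 6),
  Reaction = 250 + 10 * rep(0:2, times = 6) + rep(c(0, 20, -15, 5, 30, -10), each = 3)
)

# Hypothetical helper: build one split by resampling whole groups
group_boot_split <- function(data, group) {
  # row indices belonging to each level of the grouping variable
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  # draw as many groups as there are, with replacement
  picked <- sample(names(rows_by_group), replace = TRUE)
  # analysis rows: every row of every drawn group (repeats kept)
  in_id <- unlist(rows_by_group[picked], use.names = FALSE)
  # assessment rows: rows of groups that were never drawn
  out_id <- setdiff(seq_len(nrow(data)), in_id)
  make_splits(list(analysis = in_id, assessment = out_id), data = data)
}

set.seed(1)
splits <- lapply(1:5, function(i) group_boot_split(dat, "Subject"))
boots <- manual_rset(splits, ids = sprintf("Bootstrap%02d", 1:5))
boots
```

The result behaves like any other rset, so analysis() and assessment() work on each split as usual.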

mattwarkentin · June 20, 2022, 8:06pm

I don't have a more elegant solution than what you've done. This is basically the same thing I've done in the past when doing resampling on a multi-level data set. I am only chiming in to say that I would love for {rsample} (or an adjacent package) to support multi-level resampling in a similar way to how spatial resampling is supported in {spatialsample}.

This type of hierarchical resampling occurs a lot for me, and some tidymodels-friendly functions would be a great addition to the ecosystem. Just throwing in my 2 cents in case @Max appears.

Relevant:

github.com/tidymodels/rsample

more group-based splitting methods

opened 02:33PM - 14 Jan 21 UTC

closed 12:16AM - 30 Jun 22 UTC

topepo

feature

It would be good to have an `initial_group_split(data, group, strata, prop)` method that can split the data when there are groups (perhaps patients). The `strata` option might be difficult when the outcome (or other stratification variable) is not constant within each group. We could also use the median or mode on the stratification variable and use that. Similarly, a `mc_group_cv()` function would also be a good idea (using the splitting method as above).

mattwarkentin · June 20, 2022, 8:13pm

Oh, and my only other contribution is that group_vfold_cv() can moonlight for multi-level loo_cv(), if you group on the multi-level grouping variable. But this doesn't help us for other forms of resampling, such as bootstraps.
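A minimal sketch of that trick, assuming a toy data frame in place of lme4::sleepstudy (the data and the held_out name are hypothetical): when v is left at its default, group_vfold_cv() uses one fold per group, so each assessment set holds exactly one subject.

```r
library(rsample)

# Toy stand-in for lme4::sleepstudy (hypothetical data): 6 subjects, 3 rows each
dat <- data.frame(
  Subject = factor(rep(paste0("s", 1:6), each = 3)),
  Days = rep(0:2, times = 6),
  Reaction = 250 + 10 * rep(0:2, times = 6) + rep(c(0, 20, -15, 5, 30, -10), each = 3)
)

# With v unspecified, the number of folds equals the number of groups,
# which amounts to leave-one-subject-out cross-validation
folds <- group_vfold_cv(dat, group = Subject)

# Which subject is held out in each fold (exactly one per fold)
held_out <- vapply(
  folds$splits,
  function(s) as.character(unique(assessment(s)$Subject)),
  character(1)
)
held_out
```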

Max · June 20, 2022, 8:55pm

I agree that we need more functions like these in rsample. I would go add a thumbs up to the GH issues in those repos.

I suspect that group_vfold_cv() is the best that we have at the moment.

smouksassi · June 26, 2022, 4:07pm

@Devin_Pastoor has a nice function here:

github.com

metrumresearchgroup/PKPDmisc/blob/master/R/resampling_functions.R#L94

#' # check to see that stratification is maintained
#' rep_dat %>% group_by(Gender) %>% tally
#' resample_df(rep_dat, key_cols=c("ID", "REP"), strat_cols="Gender") %>%
#'   group_by(Gender) %>% tally
#'
#' rep_dat %>% group_by(Gender, Race) %>% tally
#'
#' resample_df(rep_dat, key_cols=c("ID", "REP"), strat_cols=c("Gender", "Race")) %>%
#'   group_by(Gender, Race) %>% tally
#' @export
resample_df <- function(df, 
                        key_cols,
                        strat_cols = NULL, 
                        n = NULL,
                        key_col_name = "KEY",
                        replace = TRUE) {
  # checks
  if (is.numeric(strat_cols)) {
    message("It looks you are trying to give a numeric value for strat_cols,
 perhaps you were trying to specify the number to sample instead? 
 If no strat_cols are specified you must explicitly specify 'n = ...'

It handles IDs/keys and strata.

It is not tidymodels-compatible, but it gets the job done.

MikeMahoney218 · June 30, 2022, 1:24pm

Hi all!

I just wanted to share that this is now in the development version of rsample:

library(rsample)
library(tidyverse)

set.seed(123)
boot1 <- lme4::sleepstudy |> 
  group_bootstraps(times = 10, Subject)

boot1
#> # Bootstrap sampling 
#> # A tibble: 10 × 2
#>    splits           id         
#>    <list>           <chr>      
#>  1 <split [180/60]> Bootstrap01
#>  2 <split [180/70]> Bootstrap02
#>  3 <split [180/80]> Bootstrap03
#>  4 <split [180/80]> Bootstrap04
#>  5 <split [180/70]> Bootstrap05
#>  6 <split [180/60]> Bootstrap06
#>  7 <split [180/60]> Bootstrap07
#>  8 <split [180/70]> Bootstrap08
#>  9 <split [180/60]> Bootstrap09
#> 10 <split [180/60]> Bootstrap10

unique(analysis(boot1$splits[[1]])$Subject)
#>  [1] 308 309 330 333 335 337 349 350 352 369 370 372
#> 18 Levels: 308 309 310 330 331 332 333 334 335 337 349 350 351 352 369 ... 372
unique(assessment(boot1$splits[[1]])$Subject)
#> [1] 310 331 332 334 351 371
#> 18 Levels: 308 309 310 330 331 332 333 334 335 337 349 350 351 352 369 ... 372

Created on 2022-06-30 by the reprex package (v2.0.1)

This won't be on CRAN for a few months, but is in the GitHub version.
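As a rough sketch of how these resamples might then be used for modeling, the example below refits a model on each analysis set and summarizes the bootstrap distribution of a slope. It assumes an installed rsample version that includes group_bootstraps(). The simulated data are hypothetical, and lm() is used only to keep the example self-contained; a mixed model such as lme4::lmer() could be swapped in.

```r
library(rsample)

# Simulated stand-in for lme4::sleepstudy (hypothetical data):
# 10 subjects, 4 rows each
set.seed(2022)
dat <- data.frame(
  Subject = factor(rep(sprintf("s%02d", 1:10), each = 4)),
  Days = rep(0:3, times = 10)
)
dat$Reaction <- 250 + 10 * dat$Days + rnorm(nrow(dat), sd = 20)

# Resample whole subjects
boots <- group_bootstraps(dat, group = Subject, times = 20)

# Refit the model on each analysis set and keep the Days slope
slopes <- vapply(
  boots$splits,
  function(s) coef(lm(Reaction ~ Days, data = analysis(s)))[["Days"]],
  numeric(1)
)

# Percentile interval for the slope across bootstrap resamples
quantile(slopes, c(0.025, 0.975))
```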

tjmahr · July 7, 2022, 2:00pm

Woo hoo! Thanks everyone!
