In multilevel modeling, we have observations nested in grouping variables. For example, the lme4::sleepstudy dataset has 10 observations each from 18 subjects. When bootstrapping this data for modeling, it makes sense to resample whole subjects. The best workflow for this procedure using rsample, as far as I know, is the following:
library(dplyr)
library(rsample)

lme4::sleepstudy |>
  # resample unique ids
  distinct(Subject) |>
  bootstraps(times = 10) |>
  # attach the original data to the ids
  mutate(
    analysis = lapply(
      splits,
      function(x) left_join(analysis(x), lme4::sleepstudy, by = "Subject")
    )
  )
Note that this copies the original data several times and is wasteful.
I have tried to make a function that does low-level manipulation of the rset object (replacing the data and in_id fields) but this feels like cheating.
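For concreteness, here is a rough sketch of the kind of thing I mean, written with rsample's make_splits() and manual_rset() rather than by overwriting fields directly. The helper name subject_boot_split() is made up for illustration, and I am assuming that each split keeps a reference to the original data frame plus row indices, so the data itself is not duplicated per resample.

```r
library(rsample)

data(sleepstudy, package = "lme4")
set.seed(1)

# Hypothetical helper: build one bootstrap split that resamples whole subjects
subject_boot_split <- function(data, group) {
  ids <- unique(data[[group]])
  sampled <- sample(ids, length(ids), replace = TRUE)
  # Row indices of every sampled subject; a subject drawn twice appears twice
  in_id <- unlist(lapply(sampled, function(s) which(data[[group]] == s)))
  # Rows from subjects that were never drawn form the assessment set
  out_id <- which(!data[[group]] %in% sampled)
  make_splits(list(analysis = in_id, assessment = out_id), data)
}

splits <- replicate(10, subject_boot_split(sleepstudy, "Subject"), simplify = FALSE)
boots <- manual_rset(splits, sprintf("Bootstrap%02d", seq_along(splits)))
```

Calling analysis() on any element of boots$splits should then return all rows for the resampled subjects, with repeated subjects contributing their rows more than once.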
Is there a better way to use bootstraps() to bootstrap chunks of data where the units being resampled may represent multiple rows of data?
I don't have a more elegant solution than what you've done. This is basically the same thing I've done in the past when resampling a multi-level data set. I am only chiming in to say that I would love for {rsample} (or an adjacent package) to support multi-level resampling in a similar way to how spatial resampling is supported in {spatialsample}.
This type of hierarchical resampling occurs a lot for me, and some tidymodels-friendly functions would be a great addition to the ecosystem. Just throwing in my 2 cents in case @Max appears.
Oh, and my only other contribution is that group_vfold_cv() can moonlight as a multi-level loo_cv() if you group on the multi-level grouping variable. But this doesn't help us with other forms of resampling, such as bootstraps.
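A minimal sketch of that group_vfold_cv() trick on the sleepstudy data, assuming the default behaviour where leaving v unset gives one fold per group:

```r
library(rsample)

data(sleepstudy, package = "lme4")

# With v left at its default, each fold holds out one whole subject
folds <- group_vfold_cv(sleepstudy, group = Subject)

# Each assessment set contains all rows for a single held-out subject
held_out <- assessment(folds$splits[[1]])
unique(held_out$Subject)
```

Each analysis set then contains the remaining 17 subjects, which is effectively leave-one-subject-out cross-validation.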