My understanding is that each analysis split contains as many indices as there are rows in the dataset (n = 7043), sampled from it with replacement. The assessment split contains indices for the unused rows (in bootstrap 1, there are 2590 of those). But don't most (all?) uses of the bootstrap involve calculating the statistic on the bootstrapped (analysis) dataset? When is the assessment dataset relevant in a bootstrapping framework?

Thank you!

PS: I'd also love an example of a summary statistic that requires the use of apparent = TRUE as that one is also new to me. Thanks!

Originally, the method was used to estimate the sampling distribution of some statistic without having to make specific (parametric) distributional assumptions about the data. For that application, the assessment data are not used. However, in this application, some of the confidence interval techniques rely on the value of the statistic from the source data set (e.g. the BCa method). That's where the apparent flag comes into play.
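To make the role of the source-data statistic concrete, here is a minimal base-R sketch (not rsample's implementation) of a percentile interval next to a BCa interval for a mean. The data and all object names (`x`, `theta_hat`, `theta_boot`, and so on) are made up for illustration. The point is that the percentile interval only needs the bootstrap replicates, while the BCa bias correction `z0` compares the replicates against the estimate from the original data, which is the quantity the apparent sample carries.

```r
set.seed(1)
x <- rexp(200)                       # made-up, skewed sample
theta_hat <- mean(x)                 # statistic on the source data (the "apparent" estimate)

# Bootstrap replicates of the statistic
theta_boot <- replicate(2000, mean(sample(x, replace = TRUE)))

# Percentile interval: uses only the replicates
pctl <- quantile(theta_boot, c(0.025, 0.975))

# BCa interval: the bias correction needs theta_hat
z0 <- qnorm(mean(theta_boot < theta_hat))

# Acceleration from jackknife (leave-one-out) estimates
theta_jack <- sapply(seq_along(x), function(i) mean(x[-i]))
d <- mean(theta_jack) - theta_jack
a <- sum(d^3) / (6 * sum(d^2)^1.5)

# Adjusted quantile levels, then read them off the bootstrap distribution
z   <- qnorm(c(0.025, 0.975))
adj <- pnorm(z0 + (z0 + z) / (1 - a * (z0 + z)))
bca <- quantile(theta_boot, adj)

rbind(percentile = unname(pctl), bca = unname(bca))
```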

The main application of the bootstrap where the assessment data are used is for model performance estimation. Here, the model is fit with the analysis data and the assessment data are used to compute some measure of model efficacy (e.g. RMSE, accuracy etc).
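As a rough illustration of that workflow, the sketch below does by hand what the analysis and assessment sets provide: fit on rows drawn with replacement, score on the rows that were never drawn. It uses base R and the built-in `mtcars` data; the model formula and the helper `rmse_one()` are arbitrary choices for the example, not anything from rsample or tune.

```r
set.seed(2)
n <- nrow(mtcars)

# One bootstrap resample: fit on the analysis rows, score on the assessment rows
rmse_one <- function() {
  in_id  <- sample(n, replace = TRUE)        # analysis rows (drawn with replacement)
  out_id <- setdiff(seq_len(n), in_id)       # assessment rows (never drawn)
  fit  <- lm(mpg ~ wt + hp, data = mtcars[in_id, ])
  pred <- predict(fit, newdata = mtcars[out_id, ])
  sqrt(mean((mtcars$mpg[out_id] - pred)^2))
}

rmse <- replicate(30, rmse_one())

# Point estimate and standard error across the 30 resamples
res <- c(estimate = mean(rmse), std_err = sd(rmse) / sqrt(length(rmse)))
res
```

Each resample yields one RMSE, so the average over resamples comes with a standard error for free.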

Similarly, in some tree ensembles, the bootstrap is used to create the ensembles and the assessment sets are used to generate the out-of-bag error estimates.
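For a sense of how large those out-of-bag (assessment) sets are, here is a small base-R check, using the n = 7043 from the question. The chance that a given row is never drawn in n draws with replacement is (1 - 1/n)^n, which is close to exp(-1), so roughly 37% of rows end up out of bag. That is in line with the 2590 rows reported for bootstrap 1. The variable names are made up for the example.

```r
n <- 7043

# Expected number of rows never drawn in n draws with replacement
expected_oob <- n * (1 - 1 / n)^n            # close to n * exp(-1)

# One simulated bootstrap sample for comparison
set.seed(4)
in_id <- sample(n, replace = TRUE)
observed_oob <- n - length(unique(in_id))

c(expected = expected_oob, observed = observed_oob, fraction = expected_oob / n)
```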

Thank you so much for your detailed reply! I did not know how shallow my understanding of the bootstrap was until your reply. I've since read more from FES and the tidymodels documentation (especially the learn section) and have a few follow-up questions. For reference, I was really only familiar with the original method prior to your reply, although not in any depth. My questions about the bootstrap are about its use in model performance estimation.

A. I've heard things like "use the bootstrap to estimate training AUC". This is my understanding: to get stable estimates of model performance metrics, you can use the bootstrap. If you use the bootstraps(times = x) as the resample in tidymodels tuning, then the model is built on x bootstrapped analysis datasets for each parameter combination. The model performance metric is then calculated on x assessment sets, so the performance metric for each parameter combination has both a point estimate and a standard error. This results in more stability in choosing the "best model" when tuning and also more precision in reporting a training AUC (or other metric).

Questions:

Are there other resamples that give you a standard error? I.e., if you have enough folds of CV, do you get a std_err when looking at training performance? For example, show_best(formula_res, metric = "roc_auc") for a tune_results object formula_res, which used bootstrap resamples (times = 30), reports both a point estimate of roc_auc and a std_err.

If you are using CV to tune an, eg, xgb model, is it possible to get a bootstrapped estimate of the performance metrics of interest? Do you do that by simply fitting the model using the best parameters from CV to bootstrapped datasets?

What numbers are reasonable for "times" for bootstrapped model performance estimations? I'm used to seeing numbers like 1,000-10,000 for bootstrapped estimates of statistics in the non-model performance estimation world. That seems implausible for many models, and in Learn - Model tuning via grid search, y'all use times = 30, although in Learn - Bootstrap resampling and tidy regression models, y'all use times = 2000. I tend to work in computationally limited environments so 2000 is implausible but 30 is not.

Do you ever do bootstrapping on the test data (ie final_fit) to estimate model performance with a CI?

B. Back to apparent = TRUE. Thank you for the link.

beer_boot <- bootstraps(brewing_materials, times = 1e3, apparent = TRUE)

even though the application doesn't seem to rely on the value of the statistics from the source data set either, as far as I can tell. Is there a reason why y'all like keeping the source data set? It seems like the times when it is actually used are probably infrequent? Or am I overlooking something?

For Q1: you can compute them but AFAIK the bootstrap is the only resampling method that has been theoretically shown to have the type of statistical properties that allow you to do those computations with validity.

Q2: You can "pick the winner" using the resampling statistics. That is what I typically do. There is the issue of optimization bias that may lead to some overfitting. IMO it is a real but negligible bias in most data sets. Your mileage may vary.

Q3: If you are trying to get percentile intervals for statistics, you'll need to use a very large number of resamples (since we are trying to estimate the area of the distribution's tail). For model tuning, I usually use 25-50, depending on the characteristics of the data set. In this application, we are trying to get an estimate of the distribution mean with reasonable uncertainty.
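A quick way to see the difference is to repeat a whole bootstrap many times and watch how much each summary moves between repeats. The base-R sketch below does that for a made-up sample; the helpers `boot_means()` and `spread()` are hypothetical names for this example only. With 30 resamples the mean of the replicates is already fairly stable, while the 2.5% quantile is much noisier and only settles down with a far larger number of resamples.

```r
set.seed(3)
x <- rnorm(100)                               # made-up sample

# B bootstrap replicates of the sample mean
boot_means <- function(B) replicate(B, mean(sample(x, replace = TRUE)))

# Repeat the whole bootstrap 50 times; report how much the summary f() varies
spread <- function(B, f) sd(replicate(50, f(boot_means(B))))

q025 <- function(v) quantile(v, 0.025)

res <- c(
  mean_B30   = spread(30, mean),              # centre of the distribution, 30 resamples
  q025_B30   = spread(30, q025),              # tail quantile, 30 resamples
  q025_B2000 = spread(2000, q025)             # tail quantile, 2000 resamples
)
res
```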

Q4: Never. Leave the test set alone until you really need it at the end. Do not train or optimize with it.

Q5: For intervals, we do compute it (the apparent estimate). Certain intervals need it, and we can get rid of it if we want. Keeping it means we wouldn't have to re-run the resamples with the option changed just to get a different type of interval.

TBH there are a lot of resources out there regarding model tuning. Here are some online ones that may help: FES (chapter 3) and TMwR. There are many others.