Case Weights - Details & Documentation

Hi Posit Community.

I'm currently evaluating case weights in tidymodels, and my review of the publicly available documentation has led to the following questions:

  • I've used show_model_info() thus far to investigate individual model types and whether they support case weights. Is there a comprehensive document that lists all models supporting case weights? Rather than checking each model type individually, I want to make sure I'm not missing such a resource.
  • Secondly, how does importance_weights() affect model estimation? I haven't been able to uncover this detail yet; most of the documentation I've reviewed describes why case weights should be used, while I'm most interested in how they affect the model construction process. I assume the implementation varies slightly by model type, but what is happening under the hood with importance_weights()? For example, how do observations with higher weight values affect splits in tree-based methods?
  • Lastly, the answer above will likely influence this question, but what advantages are there to keeping observations with low case weights? Depending on how importance_weights() are used during model construction, I'm wondering if it would be wiser to simply drop the observations with low weights.

For added context, the data I am working with is still developing. That is, there is a time component similar to what is outlined here. The importance weights we plan to use would emphasize fully developed observations and minimize the impact of newer records. In summary, case weights would allow our data science team to leverage as much data as possible while also emphasizing observations that have fully matured.

Any thoughts from the community and/or the contributors to tidymodels would be greatly appreciated.


I don't think so. We will add it to this page.

In the meantime, here's some code that will make the list.

require(parsnip)
require(dplyr)
require(purrr)

# For one model type, check each engine/mode registration for a
# protected "weights" argument.
has_weights <- function(model) {
  x <- get_from_env(paste0(model, "_fit"))
  x %>% 
    mutate(
      protect = map(value, ~ .x$protect),
      has_weights = map_lgl(protect, ~ any(.x == "weights")),
      model = model
    ) %>% 
    select(model, engine, mode, has_weights)
}

all_case_weights <- function() {
  get_from_env("models") %>% 
    map_dfr(has_weights) %>% 
    arrange(model)
}

all_case_weights()

#> Loading required package: parsnip
#> Loading required package: dplyr
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union
#> Loading required package: purrr
#> # A tibble: 25 × 4
#>    model            engine  mode           has_weights
#>    <chr>            <chr>   <chr>          <lgl>      
#>  1 boost_tree       xgboost regression     TRUE       
#>  2 boost_tree       xgboost classification TRUE       
#>  3 boost_tree       C5.0    classification TRUE       
#>  4 decision_tree    rpart   regression     TRUE       
#>  5 decision_tree    rpart   classification TRUE       
#>  6 decision_tree    C5.0    classification TRUE       
#>  7 gen_additive_mod mgcv    regression     TRUE       
#>  8 gen_additive_mod mgcv    classification TRUE       
#>  9 linear_reg       lm      regression     TRUE       
#> 10 linear_reg       glm     regression     TRUE       
#> # … with 15 more rows

Created on 2023-01-04 with reprex v2.0.2

This really depends on the model. Most of them use the weights in their objective function.

For example, linear regression tries to minimize the sum of squared errors (error = observed - predicted). With case weights, each squared error is multiplied by its case weight before the summation. The higher the weight, the more influence that observation has on the objective function (and thus on the resulting parameter estimates).
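As a small, hedged illustration outside of tidymodels, base R's lm() exposes this mechanism directly through its weights argument, which scales each squared error in the least-squares objective (the data and weights below are made up for demonstration):

```r
# Sketch of weighted least squares with base R's lm().
# The weights argument scales each observation's squared error in the
# objective, so heavily weighted rows pull the fit toward themselves.
set.seed(123)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 4)
w <- c(rep(0.1, 10), rep(1.0, 10))  # down-weight the first half

unweighted <- lm(y ~ x)
weighted   <- lm(y ~ x, weights = w)

coef(unweighted)
coef(weighted)

# The objective being minimized, computed by hand:
sum(w * residuals(weighted)^2)
```

The two coefficient vectors differ because the fit trades off error on the down-weighted first half against error on the fully weighted second half.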

For trees, the weights affect the objective function, too. For CART, the tree is grown to maximize the purity of the classes in the candidate partitions, and the cross-tabulation counts that measure purity are inflated or deflated by the case weights.
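A hedged sketch of this with rpart (the CART implementation that tidymodels wraps): with integer case weights, the weighted class counts at the root node equal the counts you would get by physically replicating the rows, so the root split should be chosen the same way. This uses rpart's built-in kyphosis data purely for illustration:

```r
library(rpart)

# Sketch: in rpart, integer case weights scale the class counts used
# in the purity (Gini) calculation -- a row with weight 2 contributes
# to candidate splits roughly as if it appeared twice.
data(kyphosis)
w <- ifelse(kyphosis$Kyphosis == "present", 2, 1)

ctrl <- rpart.control(xval = 0)  # skip cross-validation for determinism

fit_weighted <- rpart(Kyphosis ~ Age + Number + Start,
                      data = kyphosis, weights = w, control = ctrl)

# The same model on data with the "present" rows physically duplicated:
dup <- kyphosis[rep(seq_len(nrow(kyphosis)), times = w), ]
fit_duplicated <- rpart(Kyphosis ~ Age + Number + Start,
                        data = dup, control = ctrl)

# Compare the split variables chosen by each fit:
fit_weighted$frame$var
fit_duplicated$frame$var
```

Deeper in the tree the two fits can diverge slightly (stopping rules count observations differently than weights), but the purity calculations driving each split use the weighted counts in both cases.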

You could; for me, it feels like I'm breaking the law when I completely remove an observation. If there is a way to algorithmically increase or decrease weights, that is probably the best way to do it. This happens with some boosting methods (like C5.0), causal inference, and some robust regression tools. Your mileage may vary.

That's not a bad strategy and is probably better than wholesale removing new data. Again, see if you get good results. That's what is so great about empirical validation of models.

Thanks, @Max. I appreciate the prompt reply and the detail. Thus far, I've implemented importance_weights() on a sample dataset only, but I'm looking forward to incorporating your feedback on training data in the coming weeks. Thanks again.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
