tidymodels custom metric for multi class classification yardstick machine learning

nealec · November 27, 2021, 3:30pm

Hi.

I am struggling somewhat with implementing a custom metric within the tidymodels environment.

This project aims to train several ML models to predict the outcome of football games. The 'truth' is a three-class factor: H = home team win, D = draw, A = away team win. There are ~20 variables being used as predictors. The end-to-end workflow runs fine using accuracy and roc_auc as metrics within a 'stack' collection of models.

Unfortunately, 'accuracy' and 'roc_auc' are not ideal measures in this instance. In this workflow, the class probability predicted by the model needs to be compared to the 'odds' available from bookmakers. If the model's predicted probability is greater than the implied probability from the odds, we bet, and the outcome of that bet determines the PnL. So I'd like to train the models on a metric that maximises profit and loss. This means the metric function also needs to consider two other variables: the odds for a home win and the odds for an away win. None of the online material I can find on custom metrics within the tidymodels environment includes an example of, or guidance on, a metric that uses additional variables beyond truth and estimate.
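
To make the betting rule concrete, here is a small base R sketch of a single bet. The decimal odds of 2.50, the model probability of 0.45, and all variable names are made up for illustration and are not taken from the project data.

```r
decimal_odds <- 2.50               # hypothetical bookmaker price for a home win
implied_prob <- 1 / decimal_odds   # implied probability from the odds (0.4)
model_prob   <- 0.45               # hypothetical model probability of a home win

# Bet only when the model probability exceeds the implied probability.
place_bet <- model_prob > implied_prob

# Profit per unit stake: decimal odds minus 1 on a win, -1 on a loss, 0 if no bet.
pnl_if_win  <- if (place_bet) decimal_odds - 1 else 0
pnl_if_loss <- if (place_bet) -1 else 0
```

Because the columns in this project hold implied probabilities rather than decimal odds, the winning profit is written as `1 / implied_prob - 1` in the metric code.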

Code so far. Create the recipe: the truth is 'labelWin', and the 'pOddsUsed...' columns are the numeric (0 to 1) implied probabilities for the respective outcomes.

> trainData.recipe <- trainLgDataWin.split %>%
>   recipe(labelWin ~ .) %>%
>   step_normalize(all_predictors(), -c(pOddsUsedHomeWin, pOddsUsedDraw, pOddsUsedAwayWin)) %>%
>   step_upsample(labelWin, over_ratio = 1)
>

Next I begin to build the custom metric. The function is meant to work as follows: if the class estimate is "H" and predH is greater than the implied probability from the odds, then bet, with the outcome being the decimal odds minus 1 (profit) for a win and -1 for a loss. If predH is less than the implied probability, then no bet, and return 0. If the class estimate is not H, do the same for A.
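
The rule described above can be sketched in base R, independent of yardstick. This is only an illustration under stated assumptions: `odds_pnl()` and its argument names are hypothetical, the four games are invented, and the predicted probabilities are passed in explicitly (in tidymodels output these would presumably be the `.pred_H` and `.pred_A` columns).

```r
# Hypothetical helper: per-game profit and loss for a unit stake.
odds_pnl <- function(truth, estimate, pred_h, pred_a, p_odds_h, p_odds_a) {
  truth <- as.character(truth)
  estimate <- as.character(estimate)

  # PnL of a home bet and of an away bet, if each were placed.
  home_pnl <- ifelse(truth == "H", 1 / p_odds_h - 1, -1)
  away_pnl <- ifelse(truth == "A", 1 / p_odds_a - 1, -1)

  # Bet only on the predicted class, and only when the model probability
  # exceeds the implied probability from the odds.
  bet_home <- estimate == "H" & pred_h > p_odds_h
  bet_away <- estimate == "A" & pred_a > p_odds_a

  ifelse(bet_home, home_pnl, ifelse(bet_away, away_pnl, 0))
}

# Invented example data: four games.
games <- data.frame(
  truth    = c("H", "A", "D", "A"),
  estimate = c("H", "H", "A", "D"),
  pred_h   = c(0.55, 0.50, 0.20, 0.30),
  pred_a   = c(0.20, 0.25, 0.45, 0.30),
  p_odds_h = c(0.50, 0.40, 0.30, 0.35),
  p_odds_a = c(0.25, 0.30, 0.50, 0.40)
)

pnl <- with(games, odds_pnl(truth, estimate, pred_h, pred_a, p_odds_h, p_odds_a))
total_pnl <- sum(pnl)  # a metric must reduce to a single number
```

Here the first game is a winning home bet, the second a losing home bet, and the last two are no-bets. A metric has to return one value per resample, so the per-game vector is summed; taking the mean would be an equally reasonable choice.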


> oddsScore_vec <-
>   function(truth,
>            estimate,
>            pOddsUsedHomeWin,
>            pOddsUsedAwayWin,
>            estimator = NULL,
>            event_level = "first",
>            ...)  {
>
>     estimator <- finalize_estimator(truth, estimate)
>     oddsScore_impl <-
>       function(truth,
>                estimate,
>                pOddsUsedHomeWin,
>                pOddsUsedAwayWin) {
>         ifelse(
>           estimate == "H",
>           ifelse(predH > pOddsUsedHomeWin,
>                  ifelse(truth == "H", (1 / pOddsUsedHomeWin) - 1, -1),
>                  0),
>           ifelse(
>             estimate == "A",
>             ifelse(predA > pOddsUsedAwayWin,
>                    ifelse(truth == "A", (1 / pOddsUsedAwayWin) - 1, -1),
>                    0),
>             0
>           )
>         )
>       }
>     metric_vec_template(
>       metric_impl = oddsScore_impl,
>       truth = truth,
>       estimate = estimate,
>       cls = "factor",
>       estimator = estimator,
>       ...
>     )
>   }
> 
> oddsScore <- function(data) {
>   UseMethod("oddsScore")
> }
> 
> oddsScore <- new_class_metric(oddsScore, direction = "maximize")
> 
> oddsScore_df <- function(data, truth, estimate, na_rm = TRUE, ...) {
>   
>   metric_summarizer(
>     metric_nm = "oddsScore",
>     metric_fn = oddsScore_vec,
>     data = data,
>     truth = !! enquo(truth),
>     estimate = !! enquo(estimate),
>     na_rm = na_rm
>   )
> }
> 
> #folds 
> trainData.folds <-
>   vfold_cv(trainLgDataWin.split,
>            v = 5,
>            repeats = 2)
>
>
> # build models and workflows 
>
> xGTrainData.model <-
>   parsnip::boost_tree(
>     mode = "classification",
>     trees = tune(),
>     min_n = tune(),
>     tree_depth = tune(),
>     learn_rate = tune(),
>     loss_reduction = NULL,
>     stop_iter = NULL
>   ) %>%
>   set_engine("xgboost")
> 
> # create workflow
> xGTrainData.wflow <-
>   workflow() %>% add_recipe(trainData.recipe) %>% add_model(xGTrainData.model)
>

When I run the model, I get the following error:

> x Fold5, Repeat2: internal: Error: In metric: `oddsScore`
> unused arguments (truth = ~labelWin, estimate = ~.pred_class, estimator = ~estimator, na_rm = ~na_rm, event_level = ~event_level)
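
One plausible reading of this error, assuming `new_class_metric()` only attaches a class and attributes to the function it is given: the `oddsScore` generic is declared as `function(data)`, so any call that also supplies `truth`, `estimate`, and so on fails during argument matching, before dispatch happens. The base R sketch below reproduces that behaviour with hypothetical generics `score_narrow()` and `score_wide()`; it does not use yardstick.

```r
# Generic declared with only `data`, mirroring the shape of `oddsScore` above.
score_narrow <- function(data) {
  UseMethod("score_narrow")
}
score_narrow.data.frame <- function(data, truth, ...) nrow(data)

# Same generic, but with `...` so extra arguments reach the method.
score_wide <- function(data, ...) {
  UseMethod("score_wide")
}
score_wide.data.frame <- function(data, truth, ...) nrow(data)

games <- data.frame(labelWin = c("H", "D", "A"))

# The narrow generic rejects `truth` before any method is looked up.
narrow_msg <- tryCatch(
  score_narrow(games, truth = "labelWin"),
  error = function(e) conditionMessage(e)
)

# The wide generic forwards `truth` to the data.frame method.
wide_result <- score_wide(games, truth = "labelWin")

print(narrow_msg)
print(wide_result)
```

If this is what is happening, declaring the generic as `function(data, ...)` would be the first thing to try. Separately, S3 dispatch looks for a method named `<generic>.<class>`, so a function called `oddsScore_df` would not be found by `UseMethod("oddsScore")`, whereas one named `oddsScore.data.frame` would.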

I appreciate this is a fairly in-depth question, but I have been unable to find anything online that helps. If there is a resource you can recommend, I'd be very happy to take a look. Otherwise, please let me know what additional information you need in order to assist.

Many thanks
Chris

nealec · November 27, 2021, 3:32pm

Apologies for my formatting
