Bouncing off of tidymodels

arthur.t · May 25, 2021, 4:58am

I’m sad to say I’ve made a few attempts to start bringing tidymodels into my work, but it’s been a frustrating and disappointing experience. I’ve used caret for years and while it has a few flaws perhaps, I continue to find it much simpler and intuitive.

I hope that my criticism is constructive. While I haven’t been able to develop much expertise in tidymodels, I can offer some feedback.

I make this post not to complain but to try to understand if I the only one who feels this way. Is there a dialog about whether tidymodels should be the dominant supported platform for ML and cross validation in R? Or whether this is a good way to teach modeling in R? Sorry if these questions seem inflammatory, but I’ve been along for much of the ride with R and the tidyverse, made a whole career on it, and it’s been amazing. This is the first time I’ve felt left behind or had so many questions about the evolution.

The recipe combines both a formula and data preprocessing. I can’t see the benefit of combining these two mostly independent things. Recipes also abstracts the pre-processing in a potentially dangerous way and streamlines something that in my experience always requires a custom treatment. Pre-processing is something I have to collaborate on with my clients, and always requires something specific to the problem. (Outlier screening within subgroups, etc.). dplyr is an amazing tool for pre-processing and I don’t think it makes sense to create a more limited and abstract alternative.
The delayed execution of the recipe and the prep and the juice in my opinion just makes it more difficult to work with and inflates the number of functions a user has to juggle...
tidymodels simply requires too many functions. It’s a burden to have to keep them all in your head and it’s difficult to understand what they do individually. Many of these functions set_engine, set_mode, set_args just seem more naturally arguments than functions, like they are in caret.
When the operation of creating a tidymodel is spread across so many functions, it is difficult to consult documentation for help. Contrast this with caret::train. You may need to refer to the caret::trainControl documentation as well, but between those two functions and their documentation, you have the whole workflow right in front of you. With tidymodels, the operation is spread across so many functions, I have to keep them all in my head, perhaps look up lots of documentation, try to wrap my head around what options are available. I find myself reading the vignettes again and again to understand the workflow. It’s just too scattered. It doesn’t stick in my head. I would struggle to have confidence that I see all my engines and options correctly.
tidymodels is taught with pipes and I love pipes. But when the intermediate output between the pipes is abstracted, the sequence of steps feels like something you just have to memorize rather than a coherent sequence of individual operations.

What’s the alternative? Well, I do understand what tidymodels is trying to do, and there is a need for a tool that can handle train test split, cross validation, best model selection, prediction on test set...the whole workflow. I just think that the workflow in tidymodels is too abstract and scattered across too many new functions to learn.

And there is one important thing missing from both caret and tidymodels: so many times, I have to create not just one model, but dozens. Different response variables, different cross validation strategies, different pre-processing. With a tibble and map functions, it’s possible and extremely powerful to create one row per model, columns for test set, train set, trControl, train settings (method, tuneLength, ...). Probably many of us have our own version of this approach. I just think there’s an opportunity for a new package to bring the model workflow into tibbles with purrr functions and modelr and caret (leveraging the existing tidyverse capabilities) without having to learn a new abstract framework.

Emilhvitfeldt · May 25, 2021, 6:29am

I have a couple of notes for some of the points raised.

Fitting many models:
Have you taken a look at workflowsets? If I read your last paragraph correctly this package should do what you want. You can read more about it in 15 Screening many models | Tidy Modeling with R.

Recipes:
The use of {recipes} to do preprocessing is by no means a requirement, but a tool to help you ensure that the same transformation is happening on the training and testing data set to avoid the data leakage that would happen if you are not careful otherwise.
{recipes} also provides {dplyr} recipe steps if you need to do non-standard transformations.

Many functions:
I feel this is a common concern for new users. And it can feel overwhelming if you are coming from functions with many arguments. You will end up writing a lot of boilerplate code. Have you taken a look at usemodels? It provides general boilerplate code for common models.

Max · May 25, 2021, 4:41pm

Sorry that you feel this way. It's great feedback for us.

Comments are below are more for the rationale for why it was designed this way along with some helpful tips.

tl;dr You should use what works best for you. I'm not really thinking about dominance; I just want a more modern framework for fitting models.

We/I don't want to make that obsolete so, if that what works for you, then keep going with it. tidymodels came about for two reasons:

caret, due to its design, was becoming extremely difficult when it came to adding new features (like survival models as an example).
The api is not great if you are used to more modular approaches (e.g. the tidyverse).

For people who find base R kinda kludgy, tidymodels would elicit a more favorable reaction that caret would. For the more pythonic among us, mlr3 would be a great fit. R is the opposite of the pythonic idea that "There should be one – and preferably only one – obvious way to do it.". Use what you like most; R has a lot of options.

About the first bullet point above, some of the tidymodels complexity allows use to do a lot more than caret would. The tradeoff is slightly more verbose code to get more features.

Emil mentioned workflowsets and, with caret, the caretEnsemble package might fill this gap for you.

You, and anyone reading this, might find the complete list of tidymodels functions helpful.

It is worth noting that trainControl() has 27 arguments that complement the 11 arguments in train(). The complexity is there too, it is just in two places.

To have a more extensible system, there are going to be more packages and functions. Take a look at the mlr3 package diagram as a second data point.

You should rarely have to use prep() and bake() in the same way that you probably never have to use caret::preProcess().

The workflow objects do this all behind the scenes in the same way as caret does. As you probably know, caret can take recipes too and that also avoids the intermediate functions.

arthur.t · May 27, 2021, 10:24am

Thanks both for your comments. I don't want to belabor the point, but there are three things I would like to add.

First of all, I think it's a lot easier to learn how to use new arguments for a function you're familiar with than it is to learn to use a new function. The arguments are listed in the function documentation the user is already familiar with, and they're approachable to the user as modifications to the operations that he knows. For example, I used caret for a long time until I found the index argument to trainControl, which is a simple and powerful way to control the cv folds. Once I found it though, its use was immediately obvious.

Second, I appreciate your assurance that caret will remain supported, but honestly as a user, it feels like it's been superseded, and it feels like the developers' effort has gone into creating a new framework instead of growing the methods available. It's been a few years since lightGBM was introduced and python programmers are using it to win every single kaggle competition, but we don't have a lightGBM method in base caret yet. And in general there are not a lot of options for neural networks with more than one hidden layer.

Finally, I think the tidyverse and caret already provide the functionality to fit models and create a workflow in an intuitive and customizable way. Here's a vignette of how I and my colleagues (at a chemical company) use tibbles and functional programming to fit any number of different models with any number of different methods. expand_grid to define all sorts of input settings and fit any number of models in a single pmap call and store them in a column. This is a minimal example, but it's not that much more to add train/test splits, pre-processing, bootstrapping etc. I'd argue in this example, the rich set of input arguments to trainControl and train actually empower functional programming. Those arguments work so well with the pmap!

Thanks again.

library(tidyverse)
library(caret)
#> Loading required package: lattice
#> 
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#> 
#>     lift

Many <- 
  # define tibble of trainControl and train settings
  expand_grid(
    # data sets
    data = list(iris),
    # formulas
    form = list(Sepal.Length ~ Sepal.Width + Petal.Length, 
                Sepal.Width ~ Petal.Length + Sepal.Length),
    # trainControl arguments
    cvmethod = "repeatedcv",
    number = 4,
    # train arguments
    method = c("glmnet", "rpart", "rf"),
    tuneLength = 2,
  ) %>%
  mutate(
    # create trControl objects 
    trControl = select(., method = cvmethod, number) %>% pmap(trainControl)
  ) %>% 
  mutate(
    # fit models
    model = select(., form, data, method, tuneLength, trControl) %>% pmap(train)
  ) %>%
  # add quality of fit on validation set
  bind_cols(
    map_dfr(.$model, ~left_join(.$bestTune, .$result, by = names(.$bestTune))
    )
  )
#> Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
#> There were missing values in resampled performance measures.
#> note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
#> Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
#> There were missing values in resampled performance measures.
#> note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

Many %>% select_if(negate(is.list)) %>% print()
#> # A tibble: 6 x 14
#>   cvmethod number method tuneLength alpha   lambda  RMSE Rsquared   MAE RMSESD
#>   <chr>     <dbl> <chr>       <dbl> <dbl>    <dbl> <dbl>    <dbl> <dbl>  <dbl>
#> 1 repeate~      4 glmnet          2   1    3.10e-3 0.341    0.847 0.275 0.0223
#> 2 repeate~      4 rpart           2  NA   NA       0.494    0.647 0.397 0.0792
#> 3 repeate~      4 rf              2  NA   NA       0.335    0.843 0.269 0.0247
#> 4 repeate~      4 glmnet          2   0.1  8.02e-4 0.329    0.446 0.259 0.0433
#> 5 repeate~      4 rpart           2  NA   NA       0.338    0.410 0.269 0.0822
#> 6 repeate~      4 rf              2  NA   NA       0.322    0.519 0.253 0.0370
#> # ... with 4 more variables: RsquaredSD <dbl>, MAESD <dbl>, cp <dbl>,
#> #   mtry <dbl>

^{Created on 2021-05-27 by the reprex package (v1.0.0)}

Max · May 28, 2021, 12:49am

Good points. Thanks for the thoughtful response.

Regarding things like lightGBM, I usually wait until the R package is stable. Take a look at this thread and this comment. For a very long time I couldn't get it installed. Also, their api's are designed in a very un-R-like way, making the transition to caret or treesnip difficult.

I remember spending a lot of time trying to get mxnet to work (multiple times). That never even made it to CRAN.

My development for that package is more cutting edge and less bleeding edge. If someone wants to add models to caret, I would encourage them. That's not hard to merge (usually).

I like your caret example. I know that workflowsets is new but it is designed to do almost everything that your example does and has some nice api's for plotting, ranking, and using Bayesian analysis to compare models. I don't allow varying of the resampling method; I don't think it is a great idea.

system · June 18, 2021, 12:49am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.