predict() and fit() giving different model performance results when applied to the same dataset

Hi there! Thanks for all your work creating the tidymodels framework, it's been invaluable to my research!

I'm getting different predictions and different results for model performance when using fit() and predict(), despite applying them to the same (training) dataset, and I'm struggling to understand why. I'm sure it relates to nuances between the two that I've not understood, but I'm kind of stumped - any help would be massively appreciated!

Below is my attempt at a reproducible example. I'm using the cells dataset and training a random forest on the data (rf_fit). The object rf_fit$fit$predictions is one of the two sets of predictions whose accuracy I assess. I then use rf_fit to make predictions on the same data via the predict() function, yielding rf_training_pred, the other set of predictions I assess.

My question is - why are these sets of predictions different from each other? And why are they so different?

I presume something must be going on under the hood that I'm not aware of, but I'd expected these to be identical. I'd assumed that fit() trains a model (and stores some predictions associated with that trained model) and that predict() then takes that exact model and re-applies it to (in this case) the same data, so both sets of predictions should be identical.

What am I missing? Any suggestions or help in understanding would be hugely appreciated - thank you! :)

# Load required libraries 
library(tidymodels); library(modeldata) 
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

# Set seed 
set.seed(123)

# Load the cells data (no train/test split here; the full dataset is used)
data(cells, package = "modeldata")

# Define Model
rf_mod <- rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

# Fit the model to training data and then predict on same training data
rf_fit <- rf_mod %>% 
  fit(class ~ ., data = cells)
rf_training_pred <- rf_fit %>%
  predict(cells, type = "prob")

# Evaluate ROC AUC, first for the predictions stored in the fit object, then for predict()
data.frame(rf_fit$fit$predictions) %>%
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.903

rf_training_pred %>%   
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, .pred_PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary          1.00

Created on 2021-09-25 by the reprex package (v2.0.1)

A small point of clarification, as I cross-posted this on Stack Overflow ("tidymodels - predict() and fit() giving different model performance results when applied to the same dataset") and there appears to be a point of confusion.

The perfect accuracy in the 2nd model performance evaluation (of rf_training_pred) is example-specific. I have a longer regression example (not suitable for a reprex) where the performance of the predict() output is still better than that of the predictions stored by fit(), but isn't perfect.

Wanted to post that in here in case it helps clarify my question!

See the post which I marked your SO question as a duplicate of.

TL;DR: in the specific case of a random forest model, getting fitted values for the training data requires special treatment. You can't just run the predict method on the training data, as you would for any other dataset, and use the results.
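To make that concrete, here is a minimal sketch that calls ranger directly instead of going through parsnip (my simplification, not the original poster's code). The 500 trees and the squared-error summary are illustrative choices only. It compares the out-of-bag probabilities that ranger stores on the fitted object with what predict() returns for the same rows.

```r
# Sketch only: assumes the ranger and modeldata packages are installed.
library(ranger)

data(cells, package = "modeldata")
set.seed(123)

# probability = TRUE asks ranger for class probabilities
# (num.trees = 500 is an arbitrary choice for this sketch)
rf <- ranger(class ~ ., data = cells, num.trees = 500, probability = TRUE)

# Stored on the fit: out-of-bag probabilities, i.e. each row is scored only
# by trees whose bootstrap sample did not contain that row
oob_prob <- rf$predictions[, "PS"]

# predict() on the training data: each row is scored by every tree,
# including the trees that were grown on that row
all_prob <- predict(rf, data = cells)$predictions[, "PS"]

# Mean squared error of the predicted PS probability against the observed class
obs_ps <- as.numeric(cells$class == "PS")
oob_err <- mean((oob_prob - obs_ps)^2)
all_err <- mean((all_prob - obs_ps)^2)

c(out_of_bag = oob_err, predict_on_training = all_err)
```

The out-of-bag error should come out clearly higher than the error from predict() on the training rows, which is the same pattern as the two ROC AUC values in the question.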


They are not the same predictions. ?ranger::ranger has:

predictions: Predicted classes/values, based on out of bag samples (classification and regression only).

(I added the emphasis)

In other words, predict() runs each sample through every tree in the forest, including the trees whose bootstrap sample contained that sample. The predictions stored inside the ranger object use, for each sample, only the trees whose bootstrap sample did not contain it. That makes them more similar to the assessment set predictions you get when you resample.
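For comparison, here is a rough sketch of what resampling would look like for the model in the question. The 5-fold cross-validation, the stratification, and the 500 trees are my own illustrative choices, not something from the original post. Each fold's model is scored only on rows it was not fit on, which is the same idea as the out-of-bag predictions.

```r
# Sketch only: assumes tidymodels, ranger, and modeldata are installed.
library(tidymodels)

data(cells, package = "modeldata")
set.seed(123)

# Same kind of model spec as in the question (fewer trees to keep it quick)
rf_mod <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# 5-fold cross-validation, stratified on the outcome (arbitrary choices)
folds <- vfold_cv(cells, v = 5, strata = class)

# Fit on each analysis set and compute ROC AUC on the matching assessment set
rf_res <- fit_resamples(
  rf_mod,
  class ~ .,
  resamples = folds,
  metrics = metric_set(roc_auc)
)

# Average the assessment-set ROC AUC across folds
cv_metrics <- collect_metrics(rf_res)
cv_metrics
```

I would expect the resampled ROC AUC to land much closer to the out-of-bag figure than to the near-perfect value from predict() on the training data.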


Thanks very much both, this makes complete sense (and was super clear)! Much appreciated :)
