Hi there! Thanks for all your work creating the tidymodels framework, it's been invaluable to my research!
I'm getting different predictions and different results for model performance when using fit() and predict(), despite applying them to the same (training) dataset, and I'm struggling to understand why. I'm sure it relates to nuances between the two that I've not understood, but I'm kind of stumped - any help would be massively appreciated!
Below is my attempt at a reproducible example. I'm using the cells dataset and training a random forest on the data (rf_fit). The object rf_fit$fit$predictions is one of the two sets of predictions I evaluate (via ROC AUC). I then use rf_fit to make predictions on the same data via the predict() function, yielding rf_training_pred, the other set of predictions I evaluate.
My question is - why are these sets of predictions different from each other? And why are they so different?
I presume something must be going on under the hood that I'm not aware of, but I'd expected these to be identical, as I'd assumed that fit() trained a model (and has some predictions associated with this trained model) and then predict() takes that exact model and just re-applies it to (in this case) the same data - hence the predictions of both should be identical.
What am I missing? Any suggestions or help in understanding would be hugely appreciated - thank you!
# Load required libraries
library(tidymodels); library(modeldata)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
# Set seed
set.seed(123)
# Load the cells data (the full dataset is used; no train/test split is made)
data(cells, package = "modeldata")
# Define Model
rf_mod <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Fit the model to training data and then predict on same training data
rf_fit <- rf_mod %>%
  fit(class ~ ., data = cells)
rf_training_pred <- rf_fit %>%
  predict(cells, type = "prob")
# Evaluate ROC AUC for the predictions stored in the fitted ranger object
data.frame(rf_fit$fit$predictions) %>%
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.903
# Evaluate ROC AUC for the predictions returned by predict()
rf_training_pred %>%
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, .pred_PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary          1.00
Created on 2021-09-25 by the reprex package (v2.0.1)
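In case it helps anyone digging into this, the two sets of probabilities can also be lined up row by row rather than only compared through ROC AUC. This is just a minimal sketch that refits the same model as above. The names fit_preds and comparison are my own, introduced for illustration, and it assumes rf_fit$fit$predictions has a PS column (as the roc_auc() call above relies on):

```r
# Sketch only: refit the same model and compare the two sets of
# PS probabilities row by row.
library(tidymodels)
library(modeldata)

set.seed(123)
data(cells, package = "modeldata")

rf_mod <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_fit <- rf_mod %>%
  fit(class ~ ., data = cells)

# Predictions from predict() on the same data used for fitting
rf_training_pred <- rf_fit %>%
  predict(cells, type = "prob")

# Predictions stored inside the fitted ranger object
# (fit_preds is a hypothetical helper name)
fit_preds <- as.data.frame(rf_fit$fit$predictions)

# One row per cell: both PS probabilities and their absolute difference
comparison <- tibble(
  class      = cells$class,
  stored_PS  = fit_preds$PS,
  predict_PS = rf_training_pred$.pred_PS
) %>%
  mutate(abs_diff = abs(stored_PS - predict_PS))

summary(comparison$abs_diff)
```

The summary of abs_diff shows how far apart the two probabilities are for each cell, which might make it easier to see where the gap in ROC AUC comes from.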