Hi there! Thanks as ever for all the incredible work that's gone into creating the tidymodels framework; I can't convey how useful it's been to my research!
My question is about using xgboost: specifically, how can I access the predictions/fit on the training data of the underlying model being trained (without using `predict()`)?

To clarify what I mean: when fitting a Random Forest model, I can explore the fitted model (`rf_fit` in the reprex below) and its predictions on the training data in two ways.
- Method 1: using `predict()`, i.e. calling `predict(rf_fit, cells, type = "prob")`.
- Method 2: getting predictions from `rf_fit` directly (`rf_fit$fit$predictions`).
These result in different predictions, for reasons that have been clarified here.

In this case, I'm particularly interested in the equivalent of `rf_fit$fit$predictions` (i.e. Method 2) for boosted regression trees and my `xgb_fit` object. My questions are two-fold:
- Where in `xgb_fit` are the predictions from the trained model (i.e. where is the equivalent of the `rf_fit$fit$predictions` that we get for random forest models)? Or, what do I need to add to get those predictions output?
- If the above is possible, how should I interpret these predictions? Are they different from calling `predict()`? If so, what do they represent (I gather out-of-bag estimates are non-trivial for boosted regression trees)?
(Basically, I'd like the predictions from the model that produced the `training_logloss` error at iteration 1000 of `xgb_fit$fit$evaluation_log`.)
# Load required libraries
library(tidymodels)
library(modeldata)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
# Set seed
set.seed(123)
# Load in data
data(cells, package = "modeldata")
# Define Random Forest Model
rf_mod <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")
# Define BRT Model
xgb_mod <- boost_tree(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("xgboost",
             objective = "binary:logistic",
             eval_metric = "logloss")
# Fit the models to training data
rf_fit <- rf_mod %>%
  fit(class ~ ., data = cells)
xgb_fit <- xgb_mod %>%
  fit(class ~ ., data = cells)
xgb_fit$fit$evaluation_log
#> iter training_logloss
#> 1: 1 0.542353
#> 2: 2 0.443275
#> 3: 3 0.382232
#> 4: 4 0.333377
#> 5: 5 0.303415
#> ---
#> 996: 996 0.001918
#> 997: 997 0.001917
#> 998: 998 0.001917
#> 999: 999 0.001916
#> 1000: 1000 0.001915
# Examine output predictions on training data for RANDOM FOREST Model
rf_whole <- predict(rf_fit, cells, type = "prob") # predictions based on whole fitted model
rf_oob <- head(rf_fit$fit$predictions) # predictions based on out of bag samples
## these differ from each other, as we would expect
rf_whole$.pred_PS[1]
#> [1] 0.9229111
rf_oob[1, "PS"]
#> PS
#> 0.8503902
# Examine output predictions on training data for BOOSTED REGRESSION TREE Model
xgb_whole <- predict(xgb_fit, cells, type = "prob")
Created on 2021-10-05 by the reprex package (v2.0.1)