Hi,
I'm trying to understand what I'm doing wrong in a simple v-fold cross-validation of a tree model with tune_grid. My problem is the very long running time of the grid search compared to what I get with a direct call to rpart (roughly 50 seconds compared to 1 second).
Here is a simple example:
library(tidymodels)
library(mlbench)
data(PimaIndiansDiabetes)
my_grid <- expand.grid(min_n = 2:50)
cv_folds <- vfold_cv(PimaIndiansDiabetes, v = 5, strata = "diabetes")
my_model <- decision_tree(cost_complexity = 0, min_n = tune()) %>%
  set_engine("rpart", xval = 0) %>%
  set_mode("classification")
## 51 seconds on my hardware
tune_results <- my_model %>%
  tune_grid(diabetes ~ .,
            resamples = cv_folds,
            grid = my_grid,
            metrics = metric_set(accuracy))
library(rpart)
## 1 second on my hardware
accuracies <- matrix(NA, ncol = length(my_grid$min_n), nrow = 5)
for (k in 1:5) {
  train <- analysis(cv_folds$splits[[k]])
  test <- assessment(cv_folds$splits[[k]])
  for (mi in seq_along(my_grid$min_n)) {
    dt <- rpart(diabetes ~ ., data = train,
                control = rpart.control(xval = 0, minsplit = my_grid$min_n[mi], cp = 0))
    pred <- predict(dt, test, type = "class")
    accuracies[k, mi] <- accuracy_vec(test %>% pull(diabetes), pred)
  }
}
I know of course that my direct call to rpart does far less than tune_grid, but the difference in run time is very large and, since the results are the same, I have the impression that I am missing something. I'm new to tidymodels, so it could be something obvious.
I've tested with and without setting xval to 0 in set_engine("rpart", xval = 0) in the tidymodels part (and similarly in the direct call). In both cases, leaving xval at its default increases the total running time by roughly 2 seconds on my computer. That suggests to me that tune_grid is spending most of its time doing something other than fitting the model, which again tends to point to a mistake on my part.
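One way to check that impression would be to time repeated single fits through parsnip against the same number of direct rpart calls, outside of tune_grid altogether. The following is only a minimal sketch: the fixed min_n of 20, the 50 repetitions and the names spec, n_rep and elapsed are arbitrary choices for illustration, not part of the example above.

```r
library(tidymodels)
library(mlbench)
library(rpart)
data(PimaIndiansDiabetes)

# Same model as above, but with a fixed min_n (20 is an arbitrary value)
spec <- decision_tree(cost_complexity = 0, min_n = 20) %>%
  set_engine("rpart", xval = 0) %>%
  set_mode("classification")

n_rep <- 50  # arbitrary number of repetitions, to get a measurable time

# Repeated fits through the parsnip interface
t_parsnip <- system.time(
  for (i in seq_len(n_rep)) fit(spec, diabetes ~ ., data = PimaIndiansDiabetes)
)

# The same number of direct rpart fits with matching control settings
t_rpart <- system.time(
  for (i in seq_len(n_rep)) {
    rpart(diabetes ~ ., data = PimaIndiansDiabetes,
          control = rpart.control(xval = 0, minsplit = 20, cp = 0))
  }
)

# Elapsed wall-clock seconds for each approach
elapsed <- c(parsnip = t_parsnip[["elapsed"]], rpart = t_rpart[["elapsed"]])
print(elapsed)
```

If the two timings turn out to be close, the extra time would have to come from what tune_grid does around each fit (prediction, metric computation, bookkeeping) rather than from the parsnip wrapper itself.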
Thanks for any advice!