Hello everyone. I used the tidymodels framework to fit a CART model and tuned the hyperparameters available for the rpart engine (tree depth, cost complexity). Tuning selected a decision stump (tree depth = 1) with a very small cost-complexity parameter (1e-10).
When I extract the importance scores, though, I get scores for 6 of the 8 available predictors. Apart from the splitting variable, how is it possible for the other predictors to have importance scores when they were never selected as the splitting variable? Shouldn't their scores be zero?
CART sets up surrogate splits by default. These are used when the primary split variable is missing for an observation at prediction time, and each surrogate also contributes to the importance scores. You can turn them off, though:
library(rpart)
with_ss <- rpart(mpg ~ ., data = mtcars)
# Should only use cyl and hp
with_ss$variable.importance
#>       cyl      disp        hp        wt      qsec        vs      carb      gear
#> 724.18935 721.08062 702.29023 573.72817 442.05737 395.01237  31.36333  15.68167
ctrl <- rpart.control(maxsurrogate = 0)
without_ss <- rpart(mpg ~ ., data = mtcars, control = ctrl)
without_ss$variable.importance
#>      cyl       hp
#> 724.1894 109.7717
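To see the same thing for a stump like the one in the question, here is a minimal sketch. It assumes mtcars as stand-in data, and the object names (stump, stump_no_ss) are made up for illustration. The frame of the fitted tree records how many competing and surrogate splits were stored at each node, and the splits matrix lists them, so you can check where the extra importance scores come from.

```r
library(rpart)

# Assumption: a stump like the one described in the question
# (maxdepth = 1, tiny cp), fit on mtcars purely for illustration.
stump <- rpart(
  mpg ~ .,
  data = mtcars,
  control = rpart.control(maxdepth = 1, cp = 1e-10)
)

# Number of competing and surrogate splits stored at the root node
stump$frame[1, c("var", "ncompete", "nsurrogate")]

# One row per stored split: the primary split first, then the
# competing splits, then the surrogates
stump$splits

# Importance is reported for the primary variable and its surrogates
stump$variable.importance

# Same stump with surrogates turned off
stump_no_ss <- rpart(
  mpg ~ .,
  data = mtcars,
  control = rpart.control(maxdepth = 1, cp = 1e-10, maxsurrogate = 0)
)

# Only the variable actually used in the split is left
stump_no_ss$variable.importance
```

The default is maxsurrogate = 5, so a stump can report importance for up to six variables: the splitting variable plus five surrogates.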
So the predictors that are strongly correlated with the splitting variable are used as surrogates, and that is why I get importance scores for predictors other than the splitting one?