Importance scores in CART algorithm through tidymodels and rpart engine

Hello everyone. I am using the tidymodels framework to fit a CART model with the rpart engine, tuning the available hyperparameters (tree depth, cost complexity). Tuning ended up preferring a decision stump (tree depth = 1) with a negligible cost complexity parameter (approximately 1e-10).

When I extract the importance scores, though, I get scores for several (6) of the 8 available predictors. Apart from the splitting variable, how is it possible for the other predictors to have importance scores when they were never selected as the splitting variable? Shouldn't their scores be zero?

Thanks in advance.

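CART sets up surrogate split variables by default. These get used if the splitting predictor has missing values, for example in future data. rpart also credits surrogates in its variable importance scores, which is why predictors that were never chosen for a split can still have nonzero importance. You can turn them off, though: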

library(rpart)

with_ss <- rpart(mpg ~ ., data = mtcars)
# The fitted tree should only split on cyl and hp
with_ss$variable.importance
#>       cyl      disp        hp        wt      qsec        vs      carb      gear 
#> 724.18935 721.08062 702.29023 573.72817 442.05737 395.01237  31.36333  15.68167

ctrl <- rpart.control(maxsurrogate = 0)
without_ss <- rpart(mpg ~ ., data = mtcars, control = ctrl)
without_ss$variable.importance
#>      cyl       hp 
#> 724.1894 109.7717

Created on 2024-02-20 with reprex v2.0.2
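
To connect this to the stump in the original question, here is a minimal sketch (not the poster's actual model) that fits a depth-1 tree on mtcars and compares the variable actually used for the split with the variables that receive importance scores. The object names stump_ctrl, stump, primary, and surrogate_only are made up for illustration. Since rpart.control() defaults to maxsurrogate = 5, a stump can report at most one primary variable plus five surrogates, which would be consistent with the six scores mentioned in the question, although that is only an inference.

```r
library(rpart)

# A stump similar in spirit to the tuned model: depth 1, tiny cost complexity
stump_ctrl <- rpart.control(maxdepth = 1, cp = 1e-10)
stump <- rpart(mpg ~ ., data = mtcars, control = stump_ctrl)

# Variables used for actual splits (every non-leaf row of the frame)
node_vars <- as.character(stump$frame$var)
primary <- unique(node_vars[node_vars != "<leaf>"])
primary

# Variables that receive an importance score
names(stump$variable.importance)

# Variables scored only because they act as surrogates at the root
surrogate_only <- setdiff(names(stump$variable.importance), primary)
surrogate_only
```

Calling summary(stump) prints the primary and surrogate splits for the root node, which shows where the extra scores come from.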


Hello Max. So the importance scores will also involve the predictors that are (highly?) correlated with the splitting variable?

I'm not sure what you mean.

I mean that the predictors which are strongly correlated with the splitting variable are used as surrogates. Is that why I get importance scores for predictors other than the splitting one?

Yes, that is correct.
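
For the tidymodels workflow in the original question, the same switch can be sketched as below. This assumes that parsnip forwards extra set_engine() arguments that match rpart.control() argument names (here maxsurrogate) to the rpart fit, so treat it as a sketch to verify rather than a definitive recipe. The names stump_spec, stump_fit, and imp are hypothetical, and mtcars stands in for the poster's data.

```r
library(parsnip)

# Stump specification with surrogate splits switched off at the engine level
# (assumption: maxsurrogate is passed through to rpart.control())
stump_spec <- decision_tree(tree_depth = 1, cost_complexity = 1e-10) |>
  set_engine("rpart", maxsurrogate = 0) |>
  set_mode("regression")

stump_fit <- fit(stump_spec, mpg ~ ., data = mtcars)

# Pull out the underlying rpart object and inspect its importance scores
imp <- extract_fit_engine(stump_fit)$variable.importance
imp
```

With surrogates disabled, only the variable used for the single split should appear in the importance vector.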

