Understanding the classification tree from the tree package

I am trying to understand how the tree package in R works. The following code is from the textbook An Introduction to Statistical Learning.

library(tree)
library(ISLR2)
attach(Carseats)
set.seed(32603)
High <- factor(ifelse(Sales <= 8, "No", "Yes"))
Carseats <- data.frame(Carseats, High)

Fit a model on all variables except Sales:

tree.carseats <- tree(High ~ . - Sales, Carseats)
summary(tree.carseats)
tree.carseats
plot(tree.carseats)
text(tree.carseats, pretty = 0)

My question is: how does the algorithm decide when to stop splitting? I see there are 5 observations in some of the bottommost (terminal) nodes. Is there a threshold such that, once a node reaches that number of observations, the algorithm stops splitting it?
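For context, this is how I checked the leaf sizes (assuming the fitted object stores its node table in $frame, with terminal nodes flagged "<leaf>" and their case counts in the n column):

# Inspect the fitted tree's node table; terminal nodes are flagged "<leaf>"
leaves <- tree.carseats$frame[tree.carseats$frame$var == "<leaf>", ]
nrow(leaves)   # number of terminal nodes
min(leaves$n)  # smallest number of observations in any terminal node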

The help for the tree() function shows the following arguments, which include control:

tree(formula, data, weights, subset,
na.action = na.pass, control = tree.control(nobs, ...),
method = "recursive.partition",
split = c("deviance", "gini"),
model = FALSE, x = FALSE, y = TRUE, wts = TRUE, ...)
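
For what it's worth, here is a minimal sketch of how a control object is passed through that argument. The values shown are just the documented defaults (nobs you have to supply yourself), so this should reproduce the original fit:

library(tree)
# Build the control list explicitly and hand it to tree() via the control argument;
# the values below are the documented defaults.
ctrl <- tree.control(nobs = nrow(Carseats), mincut = 5, minsize = 10, mindev = 0.01)
tree.carseats.ctrl <- tree(High ~ . - Sales, data = Carseats, control = ctrl)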

The description of control is:

control    A list as returned by tree.control

The help for tree.control shows:

Usage
tree.control(nobs, mincut = 5, minsize = 10, mindev = 0.01)

Arguments
nobs   The number of observations in the training set.

mincut   The minimum number of observations to include in either child node. This is a
         weighted quantity; the observational weights are used to compute the ‘number’.
         The default is 5.

minsize  The smallest allowed node size: a weighted quantity. The default is 10.

mindev   The within-node deviance must be at least this times that of the root node for the
         node to be split.

So, it seems that splitting stops when a further split would produce a child node with fewer than mincut observations (5 by default), when a node is smaller than minsize (10 by default), or when a node's deviance falls below mindev times the root-node deviance (0.01 by default). You can adjust those thresholds by passing your own tree.control() to the control argument of tree().
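
If you want to see the effect of those knobs, here is a rough sketch. The values are arbitrary, just for illustration; I believe mincut cannot exceed half of minsize, so I scale both together:

# Stricter stopping rules: larger minimum leaf sizes and a higher deviance threshold
stricter <- tree.control(nobs = nrow(Carseats), mincut = 25, minsize = 50, mindev = 0.05)
tree.carseats.small <- tree(High ~ . - Sales, data = Carseats, control = stricter)
summary(tree.carseats.small)  # should report fewer terminal nodes than the default fit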

