When using C5.0 Rules in tidymodels, I know I can use tidy(rules) %>% unnest(statistic)
to view the rules developed in the tuning process. Is there a way to view which rule was used when predicting new data?
library(tidymodels)
library(stringr)
library(tidyverse)
library(finetune)
library(rules)
data(parabolic)
# prep ----
set.seed(1)
split <- initial_split(parabolic)
train_set <- training(split)
test_set <- testing(split)
set.seed(2)
train_resamples <- vfold_cv(train_set, v = 10)
# recipes ----
rec <-
recipe(class ~ ., data = train_set)
# model specifications ----
rules_spec <-
C5_rules(
trees = tune(),
min_n = tune()
) %>%
# set_engine("C5.0") %>%
set_engine("C5.0",
rules = TRUE, # should the tree be decomposed into a rule-based model?
control = C5.0Control(
earlyStopping = TRUE # logical to toggle whether the internal method for stopping boosting should be used.
)) %>%
set_mode("classification")
# workflows ----
workflow <-
workflow_set(
preproc = list(rec = rec),
models = list(rules = rules_spec),
cross = TRUE
)
# tune workflow ----
sim_anneal_ctrl <-
control_sim_anneal(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE,
restart = 5L,
verbose = FALSE
)
tune_results <-
workflow %>%
workflow_map(
seed = 3566,
verbose = FALSE,
resamples = train_resamples,
control = sim_anneal_ctrl,
fn = "tune_sim_anneal",
iter = 10,
metrics = yardstick::metric_set(accuracy, roc_auc)
)
#> Optimizing accuracy
#> Initial best: 0.90128
#> 1 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 2 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 3 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 4 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 5 ✖ restart from best accuracy=0.90128 (+/-0.009847)
#> 6 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 7 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 8 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 9 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 10 ✖ restart from best accuracy=0.90128 (+/-0.009847)
# finalize
best_rules <-
tune_results %>%
extract_workflow_set_result("rec_rules") %>%
select_best(metric = "accuracy")
final_rules <-
finalize_workflow(
tune_results %>%
extract_workflow("rec_rules"),
best_rules
)
rules <-
final_rules %>%
fit(data = train_set)
# examine rules
tidy(rules) %>% unnest(statistic)
#> # A tibble: 44 × 7
#> trial rule_num rule num_conditions coverage lift class
#> <int> <int> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 1 ( X2 > 1.1817154 ) 1 80 1.98 Clas…
#> 2 1 2 ( X1 < -1.2845802 ) 1 72 1.94 Clas…
#> 3 1 3 ( X1 < 1.302755 ) & ( X2 … 2 74 1.68 Clas…
#> 4 1 4 ( X1 > 1.302755 ) & ( X2 … 2 37 1.82 Clas…
#> 5 1 5 ( X1 > -1.2845802 ) & ( X… 2 179 1.82 Clas…
#> 6 2 1 ( X2 > 1.1817154 ) 1 64.6 1.86 Clas…
#> 7 2 2 ( X1 < -0.9625622 ) 1 102. 1.71 Clas…
#> 8 2 3 ( X1 > -0.9625622 ) & ( X… 2 209. 1.61 Clas…
#> 9 3 1 ( X2 > 0.79147011 ) 1 139. 1.40 Clas…
#> 10 3 2 ( X1 < -0.726008 ) 1 127. 1.25 Clas…
#> # ℹ 34 more rows
# apply model to test set - how can I see the rule that determines each .pred_class?
head(cbind(test_set, predict(rules, test_set)))
#> X1 X2 class .pred_class
#> 1 2.16500329 3.1655701 Class1 Class1
#> 2 -0.58820180 -0.9772997 Class2 Class2
#> 3 -0.95104979 1.3986329 Class1 Class1
#> 4 0.27469415 0.3704497 Class2 Class2
#> 5 -1.12750589 -1.1395844 Class1 Class1
#> 6 0.08351285 0.5854921 Class2 Class2
Max
December 28, 2023, 3:15pm
You can convert the character string that defines the rule into an expression. From there, you can evaluate it on any data set (see additions below).
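The core trick, as a minimal standalone sketch (the rule string below is illustrative, written in the same form as the rule column from tidy(rules)):
library(rlang)
# An illustrative rule string in the same form as tidy(rules)$rule
rule_chr <- "( X1 > -1.2845802 ) & ( X2 > 1.1817154 )"
# Parse the string into an unevaluated R expression...
rule_expr <- parse_expr(rule_chr)
# ...then evaluate it with a data frame as the data mask: the result is
# one logical per row, indicating whether the rule covers that row
dat <- data.frame(X1 = c(-2, 0, 2), X2 = c(0, 1, 3))
eval_tidy(rule_expr, dat)
#> [1] FALSE FALSE  TRUE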
# (libraries, data prep, model specification, tuning, and fit are identical
# to the code in the question above)
# We'll do this for each point in the test set: convert the text of each
# rule into a usable expression and evaluate it on the new data
rule_info_per_sample <-
tidy(rules) %>%
mutate(
rule_expr = map(rule, rlang::parse_expr),
is_active = map(rule_expr, ~ rlang::eval_tidy(.x, test_set)),
# convert to a data frame
actives = map(is_active, ~ tibble(active = .x) %>% add_rowindex())
) %>%
select(committee = trial, rule_num, actives) %>%
unnest(actives)
# very few rules per sample
rule_info_per_sample %>%
summarize(pct_active = mean(active), .by = c(.row)) %>%
summary()
#> .row pct_active
#> Min. : 1 Min. :0.2273
#> 1st Qu.: 32 1st Qu.:0.2500
#> Median : 63 Median :0.2500
#> Mean : 63 Mean :0.2573
#> 3rd Qu.: 94 3rd Qu.:0.2727
#> Max. :125 Max. :0.3636
rule_info_per_sample %>%
summarize(num_active = mean(active), .by = c(committee)) %>%
mutate(cumulative = cumsum(num_active)) %>%
ggplot(aes(committee, cumulative)) +
geom_line()
Created on 2023-12-28 with reprex v2.0.2
This is great; I'm not sure I ever would have come up with this, so thank you!
Am I correct that the following line of code would tell me how many rules are active for each predicted sample?
rule_info_per_sample %>%
filter(active == TRUE) %>%
group_by(.row) %>%
tally(name = 'num_active_rules')
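An equivalent one-step version with count() (a sketch, using the rule_info_per_sample object built above):
# count() collapses the group_by()/tally() pair into a single call
rule_info_per_sample %>%
  filter(active) %>%
  count(.row, name = "num_active_rules")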
Also, I assume the way to get the rule string is to join rule_info_per_sample with the output of tidy(rules), something like:
tidy(rules) %>%
  inner_join(rule_info_per_sample %>% filter(active == TRUE),
             by = c("rule_num" = "rule_num", "trial" = "committee")) %>%
  arrange(.row)
Sfdude
January 4, 2024, 3:20pm
Hi Max and Brendan,
Thanks for the great answer, and to you too, Brendan Graham, for the great question.
I tried Max's code (above) in my RStudio, but using iris (I have no access to the "parabolic" data in the example...) to determine the rules for each Species column. Runs flawlessly!
Q: But how do I simply display the final ("most decisive"/generic/single) rule that mostly determines each Species? Output similar to:
Rule#  Condition(s)                 TargetClass
1      Petal.Length < 1.9 & ....    Setosa
2      Sepal.Length > 0.5 & ....    Versicolor
3      Sepal.Width > 0.1 & ....     Virginica
Hope my question makes sense. If not, please ignore me or set me straight...
SFdude
San Francisco
latest RStudio & R
Ubuntu Linux 20.04
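One possible starting point (not from the thread, and only a heuristic, since C5.0 does not designate a single decisive rule per class) would be to rank the tidied rules by lift within each class and keep the top one:
# A sketch: highest-lift rule per class, assuming the fitted rules
# workflow from the code above (the by = argument needs dplyr >= 1.1)
tidy(rules) %>%
  unnest(statistic) %>%
  slice_max(lift, n = 1, by = class) %>%
  select(rule_num, rule, class)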
I had this same question, and I think because boosting is used, one or more rules can be "active" per sample across each boosting trial. I suppose if you turned boosting off that would result in one rule per observation, but I'm not sure.
Based on this documentation, I think you can turn boosting off by setting trials = 1. However, when I tried this I got a message that the trials parameter cannot be altered, so I'm not entirely sure how to disable boosting in the C5.0 implementation in tidymodels.
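If, as Max explains below, the engine's trials argument is harmonized to the parsnip argument trees, then a no-boosting spec would presumably be set at the parsnip level; a sketch:
# Hypothetical single-trial spec: trees = 1 should translate to the
# engine's trials = 1, i.e. no boosting
no_boost_spec <-
  C5_rules(trees = 1) %>%
  set_engine("C5.0", rules = TRUE) %>%
  set_mode("classification")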
system
Closed
January 11, 2024, 9:16pm
Sorry for the lag (the post is closed), but apparently I have "powers" to reply.
I had to look this up! In APM (Applied Predictive Modeling) we wrote:
Each boosted model calculates the confidence values for each class as described above and a simple average of these values is calculated. The class with the largest confidence value is selected.
So it is not a simple vote on classes. I don't know that you will be able to reliably recompute the class predictions from the rule information.
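The averaged confidence values themselves are easy to inspect, though, since they are what predict() returns as class probabilities (a sketch, using the fitted workflow rules from the code above):
# The class with the larger averaged confidence becomes .pred_class
head(predict(rules, new_data = test_set, type = "prob"))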
brendan.graham:
I had this same question, and I think because boosting is used, one or more rules can be "active" per sample across each boosting trial. I suppose if you turned boosting off that would result in one rule per observation, but I'm not sure.
You can still have multiple active rules with a single trial (= boosting iteration). C5.0 can simplify the rules so that they are not the same as the rules from the original tree (which are unique). For that reason, one or more rules can be active for a particular sample.
If you are using the tidymodels interface, we have "harmonized" the parameters. The main argument for this parameter is trees, and that gets translated to trials. The manual pages for each engine show the translation between the parsnip argument names and the ones used by the underlying engine package.
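You can also see that mapping programmatically with parsnip's translate(); a sketch (the exact fit template printed depends on your parsnip/rules versions):
# translate() prints the spec along with the engine fit template,
# showing the parsnip-level trees argument mapped to the engine's trials
C5_rules(trees = 1) %>%
  set_engine("C5.0") %>%
  translate()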