When using C5.0 Rules in tidymodels, I know I can use tidy(rules) %>% unnest(statistic)
to view the rules developed in the tuning process. Is there a way to view which rule was used when predicting new data?
library(tidymodels)
library(stringr)
library(tidyverse)
library(finetune)
library(rules)
data(parabolic)
# prep ----
set.seed(1)
split <- initial_split(parabolic)
train_set <- training(split)
test_set <- testing(split)
set.seed(2)
train_resamples <- vfold_cv(train_set, v = 10)
# recipes ----
rec <-
recipe(class ~ ., data = train_set)
# model specifications ----
rules_spec <-
C5_rules(
trees = tune(),
min_n = tune()
) %>%
# set_engine("C5.0") %>%
set_engine("C5.0",
rules = TRUE, # should the tree be decomposed into a rule-based model?
control = C5.0Control(
earlyStopping = TRUE # logical to toggle whether the internal method for stopping boosting should be used.
)) %>%
set_mode("classification")
# workflows ----
workflow <-
workflow_set(
preproc = list(rec = rec),
models = list(rules = rules_spec),
cross = TRUE
)
# tune workflow ----
sim_anneal_ctrl <-
control_sim_anneal(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE,
restart = 5L,
verbose = FALSE
)
tune_results <-
workflow %>%
workflow_map(
seed = 3566,
verbose = FALSE,
resamples = train_resamples,
control = sim_anneal_ctrl,
fn = "tune_sim_anneal",
iter = 10,
metrics = yardstick::metric_set(accuracy, roc_auc)
)
#> Optimizing accuracy
#> Initial best: 0.90128
#> 1 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 2 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 3 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 4 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 5 ✖ restart from best accuracy=0.90128 (+/-0.009847)
#> 6 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 7 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 8 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 9 ◯ accept suboptimal accuracy=0.90128 (+/-0.009847)
#> 10 ✖ restart from best accuracy=0.90128 (+/-0.009847)
# finalize
best_rules <-
tune_results %>%
extract_workflow_set_result("rec_rules") %>%
select_best(metric = "accuracy")
final_rules <-
finalize_workflow(
tune_results %>%
extract_workflow("rec_rules"),
best_rules
)
rules <-
final_rules %>%
fit(data = train_set)
# examine rules
tidy(rules) %>% unnest(statistic)
#> # A tibble: 44 × 7
#> trial rule_num rule num_conditions coverage lift class
#> <int> <int> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 1 ( X2 > 1.1817154 ) 1 80 1.98 Clas…
#> 2 1 2 ( X1 < -1.2845802 ) 1 72 1.94 Clas…
#> 3 1 3 ( X1 < 1.302755 ) & ( X2 … 2 74 1.68 Clas…
#> 4 1 4 ( X1 > 1.302755 ) & ( X2 … 2 37 1.82 Clas…
#> 5 1 5 ( X1 > -1.2845802 ) & ( X… 2 179 1.82 Clas…
#> 6 2 1 ( X2 > 1.1817154 ) 1 64.6 1.86 Clas…
#> 7 2 2 ( X1 < -0.9625622 ) 1 102. 1.71 Clas…
#> 8 2 3 ( X1 > -0.9625622 ) & ( X… 2 209. 1.61 Clas…
#> 9 3 1 ( X2 > 0.79147011 ) 1 139. 1.40 Clas…
#> 10 3 2 ( X1 < -0.726008 ) 1 127. 1.25 Clas…
#> # ℹ 34 more rows
# apply model to test set - how can I see the rule that determines each .pred_class?
head(cbind(test_set, predict(rules, test_set)))
#> X1 X2 class .pred_class
#> 1 2.16500329 3.1655701 Class1 Class1
#> 2 -0.58820180 -0.9772997 Class2 Class2
#> 3 -0.95104979 1.3986329 Class1 Class1
#> 4 0.27469415 0.3704497 Class2 Class2
#> 5 -1.12750589 -1.1395844 Class1 Class1
#> 6 0.08351285 0.5854921 Class2 Class2
Max
December 28, 2023, 3:15pm
You can convert the character string that defines the rule into an expression. From there, you can evaluate it on any data set (see additions below).
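The core trick, as a minimal standalone sketch (the rule string below is illustrative, written in the same form as the rule column from tidy(rules)):
library(rlang)
# An illustrative rule string in the same form as tidy(rules)$rule
rule_chr <- "( X1 > -1.2845802 ) & ( X2 > 1.1817154 )"
# Parse the string into an unevaluated R expression...
rule_expr <- parse_expr(rule_chr)
# ...then evaluate it with a data frame as the data mask: the result is
# one logical per row, indicating whether the rule covers that row
dat <- data.frame(X1 = c(-2, 0, 2), X2 = c(0, 1, 3))
eval_tidy(rule_expr, dat)
#> [1] FALSE FALSE  TRUE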
# (libraries, data prep, model specification, tuning, and fit are identical
# to the code in the question above)
# We'll do this for each point in the test set: convert the text of each
# rule into a usable expression and evaluate it on the new data
rule_info_per_sample <-
tidy(rules) %>%
mutate(
rule_expr = map(rule, rlang::parse_expr),
is_active = map(rule_expr, ~ rlang::eval_tidy(.x, test_set)),
# convert to a data frame
actives = map(is_active, ~ tibble(active = .x) %>% add_rowindex())
) %>%
select(committee = trial, rule_num, actives) %>%
unnest(actives)
# very few rules per sample
rule_info_per_sample %>%
summarize(pct_active = mean(active), .by = c(.row)) %>%
summary()
#> .row pct_active
#> Min. : 1 Min. :0.2273
#> 1st Qu.: 32 1st Qu.:0.2500
#> Median : 63 Median :0.2500
#> Mean : 63 Mean :0.2573
#> 3rd Qu.: 94 3rd Qu.:0.2727
#> Max. :125 Max. :0.3636
rule_info_per_sample %>%
summarize(num_active = mean(active), .by = c(committee)) %>%
mutate(cumulative = cumsum(num_active)) %>%
ggplot(aes(committee, cumulative)) +
geom_line()
Created on 2023-12-28 with reprex v2.0.2
This is great; I'm not sure I ever would have come up with this, so thank you!
Am I correct that the following line of code would tell me how many rules are active for each predicted sample?
rule_info_per_sample %>%
filter(active == TRUE) %>%
group_by(.row) %>%
tally(name = 'num_active_rules')
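An equivalent one-step version with count() (a sketch, using the rule_info_per_sample object built above):
# count() collapses the group_by()/tally() pair into a single call
rule_info_per_sample %>%
  filter(active) %>%
  count(.row, name = "num_active_rules")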
Also, I assume the way to get the rule string is to join rule_info_per_sample with the output of tidy(rules), something like:
tidy(rules) %>%
  inner_join(rule_info_per_sample %>% filter(active == TRUE),
             by = c("rule_num" = "rule_num", "trial" = "committee")) %>%
  arrange(.row)
Sfdude
January 4, 2024, 3:20pm
Hi Max and Brendan,
Thanks for the great answer, and to you too, Brendan Graham, for the great question.
I tried Max's code (above) in my RStudio, but using iris (I have no access to the "parabolic" data in the example...) to determine the rules for each Species column. Runs flawlessly!
Q: But how do I simply display the final ("most decisive"/generic/single) rule that mostly determines each Species? Output similar to:
Rule#  Condition(s)                 TargetClass
1      Petal.Length < 1.9 & ....    Setosa
2      Sepal.Length > 0.5 & ....    Versicolor
3      Sepal.Width > 0.1 & ....     Virginica
Hope my question makes sense. If not, please ignore me or set me straight...
SFdude
San Francisco
latest RStudio & R
Ubuntu Linux 20.04
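One possible starting point (not from the thread, and only a heuristic, since C5.0 does not designate a single decisive rule per class) would be to rank the tidied rules by lift within each class and keep the top one:
# A sketch: highest-lift rule per class, assuming the fitted rules
# workflow from the code above (the by = argument needs dplyr >= 1.1)
tidy(rules) %>%
  unnest(statistic) %>%
  slice_max(lift, n = 1, by = class) %>%
  select(rule_num, rule, class)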
I had this same question, and I think because boosting is used, one or more rules can be "active" per sample across each boosting trial. I suppose if you turned boosting off that would result in one rule per observation, but I'm not sure.
Based on this documentation, I think you can turn boosting off by setting trials = 1. However, when I tried this I got a message that the trials parameter cannot be altered, so I'm not entirely sure how to disable boosting in the C5.0 implementation in tidymodels.
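If, as Max explains below, the engine's trials argument is harmonized to the parsnip argument trees, then a no-boosting spec would presumably be set at the parsnip level; a sketch:
# Hypothetical single-trial spec: trees = 1 should translate to the
# engine's trials = 1, i.e. no boosting
no_boost_spec <-
  C5_rules(trees = 1) %>%
  set_engine("C5.0", rules = TRUE) %>%
  set_mode("classification")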
system
Closed
January 11, 2024, 9:16pm
Sorry for the lag (the post is closed), but apparently I have "powers" to reply.
I had to look this up! In APM (Applied Predictive Modeling) we wrote:
Each boosted model calculates the confidence values for each class as described above and a simple average of these values is calculated. The class with the largest confidence value is selected.
So it is not a simple vote on classes. I don't know that you will be able to reliably recompute the class predictions from the rule information.
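The averaged confidence values themselves are easy to inspect, though, since they are what predict() returns as class probabilities (a sketch, using the fitted workflow rules from the code above):
# The class with the larger averaged confidence becomes .pred_class
head(predict(rules, new_data = test_set, type = "prob"))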
brendan.graham:
I had this same question, and I think because boosting is used, one or more rules can be "active" per sample across each boosting trial. I suppose if you turned boosting off that would result in one rule per observation, but I'm not sure.
You can still have multiple active rules with a single trial (= boosting iteration). C5.0 can simplify the rules so that they are not the same as the rules from the original tree (which are unique). For that reason, one or more rules can be active for a particular sample.
If you are using the tidymodels interface, we have "harmonized" the parameters. The main argument for this parameter is trees, and that gets translated to trials. The manual pages for each engine show the translation between the parsnip argument names and the ones used by the underlying engine package.
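You can also see that mapping programmatically with parsnip's translate(); a sketch (the exact fit template printed depends on your parsnip/rules versions):
# translate() prints the spec along with the engine fit template,
# showing the parsnip-level trees argument mapped to the engine's trials
C5_rules(trees = 1) %>%
  set_engine("C5.0") %>%
  translate()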