Interpreting MARs for classification

john.smith · September 27, 2019, 5:24pm

Hi,

I am trying to interpret a MARs model for classification.
Below is an example taken from the recipes package where I'm trying to predict the factor Status
Taking the field Expenses as an example, how do I read/interpret the co-efficent attached to Expenses-113 - 0.0102 and where in my diagram can I see this? Is this everything to the right of my red vertical line (set at 113).

The next question is related to the intercept. If my model below the intercept is set to 0.985
Does this mean on average someone in my dataset is 98.5% likely to be a good customer?
In that case are the co-efficents in reference to the intercept or am I talking out of my hat

Thank you very much for your time. Have a nice weekend

suppressPackageStartupMessages(library(tidyverse))
#> Warning: package 'tidyverse' was built under R version 3.5.2
#> Warning: package 'ggplot2' was built under R version 3.5.3
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'readr' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'dplyr' was built under R version 3.5.2
#> Warning: package 'stringr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
suppressPackageStartupMessages(library(earth))
#> Warning: package 'earth' was built under R version 3.5.2
#> Warning: package 'plotmo' was built under R version 3.5.2
#> Warning: package 'plotrix' was built under R version 3.5.2
suppressPackageStartupMessages(library(recipes))
#> Warning: package 'recipes' was built under R version 3.5.1

mydf <- credit_data %>% 
  na.omit()

table(mydf$Status)
#> 
#>  bad good 
#> 1026 3013

mars_mod <- earth(Status~., mydf)

coefficients(mars_mod) %>% 
  enframe() %>% 
  arrange(desc(abs(value)))
#> # A tibble: 19 x 2
#>    name                  value
#>    <chr>                 <dbl>
#>  1 (Intercept)       0.985    
#>  2 Recordsyes       -0.283    
#>  3 Jobpartime       -0.243    
#>  4 Maritalseparated -0.159    
#>  5 Homepriv         -0.0838   
#>  6 Homeparents       0.0618   
#>  7 h(4-Seniority)   -0.0328   
#>  8 h(Expenses-113)   0.0102   
#>  9 h(Seniority-18)  -0.00984  
#> 10 h(Seniority-4)    0.00840  
#> 11 h(Expenses-75)   -0.00375  
#> 12 h(130-Income)    -0.00319  
#> 13 h(Age-19)        -0.00137  
#> 14 h(113-Expenses)   0.00133  
#> 15 h(Amount-475)    -0.000276 
#> 16 h(1175-Price)    -0.000221 
#> 17 h(Price-1175)     0.0000745
#> 18 h(2432-Debt)      0.0000368
#> 19 h(5500-Assets)   -0.0000291

# Now we build up the PDP Plots
yhat_earth <- function(object, newdata) as.numeric(predict(object, newdata, type = "response"))

# TO confirm we are predicting the Yes label
predict(mars_mod, mydf, type = "response") %>% head(10)
#>            good
#>  [1,] 0.8289485
#>  [2,] 0.9477466
#>  [3,] 0.4210298
#>  [4,] 0.8097659
#>  [5,] 0.7307951
#>  [6,] 0.9320309
#>  [7,] 0.9131152
#>  [8,] 0.9330565
#>  [9,] 0.6851392
#> [10,] 0.2086033

# Expenses-113 - 0.0102, 113-Expenses - 0.00133
pdp::partial(mars_mod, pred.var = "Expenses", grid.resolution = 50) %>% 
  autoplot() +
  geom_vline(xintercept = 113, colour = "red")

^{Created on 2019-09-27 by the reprex package (v0.2.1)}

Max · September 27, 2019, 6:44pm

The coefficients are on the log-odds scale (basically logistic regression with MARS features). So an intercept of 0.985 translate to a probability of binomial()$linkinv(0.985) = 0.7280992.

The plot is also on the log-odds scale (otherwise the lines would be curved).

john.smith · September 27, 2019, 7:21pm

Hi @Max,

Thank you for the quick reply. If i may just follow up with two questions

The coefficient Expenses-113 0.0102 says that where the Expenses are greater than 113, the log odds of being a good customer are increased by 0.0102 and this is represented by the right hand side of the graph? Where does the coefficent 14 h(113-Expenses) 0.00133 sit on the graph?
I have set yhat for the y axis, but if i wanted to set the probability instead of the log odds, I could set that in my yhat_earth formula? I assume this holds for all other models I use as well?

Thank you very much for your time

Max · September 27, 2019, 10:42pm

So before we talk about that, you need to fit this model differently for a qualitative outcome. I just noticed that you should have used the option glm = list(family = binomial). The splits are about the same but the coefficients are different:

suppressPackageStartupMessages(library(earth))
suppressPackageStartupMessages(library(tidymodels))

data(credit_data)

mydf <- credit_data %>% na.omit()

mars_mod <- earth(Status ~ ., mydf, glm = list(family = binomial))

tibble(term = names(coef(mars_mod)), value = unname(coef(mars_mod)))
#> # A tibble: 19 x 2
#>    term                 value
#>    <chr>                <dbl>
#>  1 (Intercept)       3.14    
#>  2 h(Seniority-4)    0.0822  
#>  3 h(4-Seniority)   -0.186   
#>  4 Recordsyes       -1.76    
#>  5 h(130-Income)    -0.0218  
#>  6 Jobpartime       -1.29    
#>  7 h(Amount-475)    -0.00244 
#>  8 h(Price-1175)     0.000918
#>  9 h(1175-Price)    -0.00184 
#> 10 h(5500-Assets)   -0.000222
#> 11 h(Expenses-113)   0.0894  
#> 12 h(113-Expenses)   0.00845 
#> 13 Maritalseparated -1.07    
#> 14 h(2432-Debt)      0.000314
#> 15 Homeparents       0.431   
#> 16 h(Seniority-18)  -0.0891  
#> 17 Homepriv         -0.597   
#> 18 h(Expenses-75)   -0.0261  
#> 19 h(Age-19)        -0.0102

# hold everything constant except Expenses and vary that over its range
profile_data <- 
  recipe(Status ~ ., data = credit_data) %>% 
  step_profile(-Status, -Expenses, profile = vars(Expenses)) %>% 
  prep() %>% 
  juice()

probs <- 
  predict(mars_mod, profile_data, type = "response") %>% 
  as_tibble() %>% 
  setNames("probability")
lin_pred <- 
  predict(mars_mod, profile_data, type = "link") %>% 
  as_tibble() %>% 
  setNames("linear_predictor")

profile_data <- 
  profile_data %>% 
  bind_cols(probs) %>% 
  bind_cols(lin_pred)

ggplot(profile_data, aes(x = Expenses, y = probability)) + geom_path()

ggplot(profile_data, aes(x = Expenses, y = linear_predictor)) + geom_path()

^{Created on 2019-09-27 by the reprex package (v0.2.1)}

That gets complicated. The profile for Expenses uses the intercept and three of the terms (you have two breakpoints at 113 and 75). The different segments are combinations of different slopes for the features.

h(Expenses-75) has a negative slope of -0.0261 which affects values > 75
h(Expenses-113) has a positive slope of 0.0894 which affects values > 113
h(113-Expenses) has a positive slope of 0.00845 which affects values < 113

so that :

< 75 slope is -0.0261
75 - 113 slope is 0.00845 - 0.0261 = -0.01765
> 113 slope is 0.0894 - 0.0261 = 0.0633

Finally, note that the intercept term in these plots is arbitrary and would change for different values of the other predictors. The slopes would not change though (since this is an additive model - no interactions).

john.smith · September 28, 2019, 4:42am

Thank you so much for taking the time out to explain this. I knew i was over simplifying it once i started charting it

I did not realize recipes does profiling now. Is this analogous with Partial Dependency Plots? Do ALE plots also exist within recipes?

Thanks again for your time

Max · September 29, 2019, 12:21pm

All step_profile() does is to isolate the effect of a single variable while holding the others constant. It doesn't do any plots or anything fancy.

system · October 6, 2019, 12:21pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.