Permutation Variable Importance using AUC

Sort of. If you are measuring predictors outside of the model (which is what my code does), then there is no sense of a "drop" in performance, just a measure of how rare the observed effect is.
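As a rough sketch of that idea (illustrative only, not the code from earlier in the thread): use a predictor's AUC as the effect measure, shuffle the predictor many times, and see how often the shuffled versions do at least as well. The `abs(est - 0.5)` below is just a direction-free effect size.

library(tidyverse)
library(yardstick)

perm_dat <- iris %>%
  mutate(dep = factor(ifelse(Species == "setosa", "setosa", "other"),
                      levels = c("setosa", "other")))

# Effect size for one predictor: how far its AUC is from the non-informative 0.5
auc_effect <- function(x, y) {
  abs(roc_auc_vec(truth = y, estimate = x) - 0.5)
}

observed <- auc_effect(perm_dat$Petal.Length, perm_dat$dep)

set.seed(100)
shuffled <- map_dbl(
  1:1000,
  ~ auc_effect(sample(perm_dat$Petal.Length), perm_dat$dep)
)

# Approximate p-value: how rare is the observed effect under shuffling?
mean(shuffled >= observed)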

For a model-related permutation approach, you need a holdout sample that is not used to fit the model. Random forests get this for free since each tree is based on a bootstrap sample, so the out-of-bag data act as the holdout. Many models, including MARS, do not use any resampling when fitting, so there is no internal way to do this.
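To make the out-of-bag idea concrete (a small illustration, not part of the original post): each bootstrap split in rsample has an analysis set used to fit and an assessment set of out-of-bag rows that the fit never sees.

library(rsample)

set.seed(1)
bt <- bootstraps(mtcars, times = 3)

# rows sampled (with replacement) to fit the model for the first resample
nrow(analysis(bt$splits[[1]]))

# the out-of-bag rows, never used by that fit, available as a holdout
nrow(assessment(bt$splits[[1]]))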

For a model like MARS, we can still do something similar. Here's an example that doesn't use permutations; instead, it fits a full model and compares the drop in performance against models that each leave out one predictor.
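A hedged sketch of that leave-one-predictor-out comparison (not the original example; the setosa-vs-other outcome matches the reprex further down):

library(tidymodels)

loo_dat <- iris %>%
  mutate(dep = factor(ifelse(Species == "setosa", "setosa", "other"),
                      levels = c("setosa", "other"))) %>%
  dplyr::select(-Species)

set.seed(10)
split_obj <- initial_split(loo_dat, strata = dep)
train_dat <- training(split_obj)
test_dat  <- testing(split_obj)

mars_spec <- mars(mode = "classification") %>% set_engine("earth")

# Fit with a given formula and return the holdout ROC AUC
holdout_auc <- function(form) {
  fit(mars_spec, form, data = train_dat) %>%
    predict(new_data = test_dat, type = "prob") %>%
    bind_cols(test_dat %>% dplyr::select(dep)) %>%
    roc_auc(truth = dep, .pred_setosa) %>%
    pull(.estimate)
}

full_auc   <- holdout_auc(dep ~ .)
predictors <- setdiff(names(loo_dat), "dep")

# Drop in AUC when each predictor is left out of the model
drops <- map_dbl(
  predictors,
  ~ full_auc - holdout_auc(reformulate(setdiff(predictors, .x), response = "dep"))
)
tibble(predictor = predictors, auc_drop = drops) %>% arrange(desc(auc_drop))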

If you want something analogous to what random forests do, there is code below. It leans fairly heavily on the parsnip, rsample, and recipes packages, but it's a good example to work through.

Yes. That standardizes the results, and sorting from high to low would help identify the best predictors.
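For instance, after running the reprex below, something along these lines (a sketch, not part of the original code) would divide each predictor's mean drop by its standard error across the resamples and sort it:

boot_samples %>%
  pull(metric_values) %>%
  bind_rows() %>%
  mutate(drop = .original - .estimate) %>%
  group_by(.metric, .estimator) %>%
  # standardized importance; predictors the model never uses have sd(drop) = 0
  # and come out as NaN here
  summarize(importance = mean(drop) / (sd(drop) / sqrt(n()))) %>%
  arrange(.metric, desc(importance))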

Here's some code:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────── tidymodels 0.0.2 ──
#> ✔ broom     0.5.1     ✔ purrr     0.2.5
#> ✔ dials     0.0.2     ✔ recipes   0.1.4
#> ✔ dplyr     0.7.8     ✔ rsample   0.0.3
#> ✔ ggplot2   3.1.0     ✔ tibble    1.4.2
#> ✔ infer     0.4.0     ✔ yardstick 0.0.2
#> ✔ parsnip   0.0.1
#> ── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ rsample::fill()  masks tidyr::fill()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()

# Some more functions ----------------------------------------------------------

# For a specific resample, fit the model to the in-sample
# data and return the fitted object
get_model <- function(split, ...) {
  # Setup the MARS model using the parsnip package
  mars_mod <-
    mars(mode = "classification") %>%
    set_engine("earth", glm = list(family = binomial, maxit = 100))

  # Fit using the bootstrap sample
  model_fit <-
    fit_xy(
      mars_mod,
      x = analysis(split) %>%  dplyr::select(-dep),
      y = analysis(split) %>%  dplyr::select(dep) %>% pull(1)
    )
  model_fit
}

# Get predictions for the data held out from the model fit (aka the "out-of-bag"
# data). Optionally, shuffle any predictor whose name is given to the
# `permute` argument; otherwise do things as normal.
# This returns the metrics that are requested.
metric_per_resample <- function(split, model, metrics, permute = NA) {
  rec <- recipe(dep ~ ., data = assessment(split))
  if (!is.na(permute)) {
    rec <- rec %>% step_shuffle(!!permute)
    est <- permute
  } else est <- "original"
  rec <- prep(rec, training = analysis(split), fresh = TRUE)
  holdout_data <- bake(rec, new_data = assessment(split))

  # Compute performance
  predict(model, new_data = holdout_data, type = "prob") %>%
    bind_cols(holdout_data %>% dplyr::select(dep)) %>%
    metrics(truth = dep, .pred_setosa) %>%
    # spread(.metric, .estimate) %>%
    mutate(
      .estimator = est,
      .resample = labels(split) %>% pull(id)
    )
}

# Do the same as `metric_per_resample` but reorder the argument names to make
# it easier to map over.
by_variable <- function(variable, split, model, metrics) {
  metric_per_resample(split, model, metrics, variable)
}


# For each resample and model, get the original and permuted estimates of
# performance.
get_metrics <- function(split, model, metrics) {
  # Get predictor names
  predictors <- assessment(split) %>% dplyr::select(-dep) %>% names()

  # performance without permuting
  original <-
    metric_per_resample(split, model, metrics) %>%
    dplyr::select(-.estimator) %>%
    dplyr::rename(.original = .estimate)

  # merge both sets of estimates together
  map_dfr(predictors, by_variable, split, model, metrics) %>%
    full_join(original, by = c(".metric", ".resample"))
}

# Do the computations ----------------------------------------------------------


mydf <- iris %>%
  mutate(
    # Changed this to a factor representation
    dep = ifelse(Species == 'setosa', "setosa", "other"),
    dep = factor(dep, levels = c("setosa", "other"))
  ) %>%
  dplyr::select(-Species)

# Show the results for more than one metric. I'll use the areas under the ROC
# and PR curves as examples
auc_metrics <- metric_set(roc_auc, pr_auc)

# This may take a while:
set.seed(452)
boot_samples <-
  bootstraps(mydf, times = 50) %>%
  mutate(
    models = map(splits, get_model),
    metric_values = map2(splits, models, get_metrics, metrics = auc_metrics)
  )
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

<snip>

# Compute the performance loss then take averages
boot_samples %>%
  pull(metric_values) %>%
  bind_rows() %>%
  mutate(drop = .original - .estimate) %>%
  group_by(.metric, .estimator) %>%
  summarize(mean_drop = mean(drop)) %>%
  spread(.metric, mean_drop)
#> # A tibble: 4 x 3
#>   .estimator       pr_auc roc_auc
#>   <chr>             <dbl>   <dbl>
#> 1 Petal.Length  0.335      0.271 
#> 2 Petal.Width   0.0751     0.0444
#> 3 Sepal.Length  0.0000368  0     
#> 4 Sepal.Width  -0.000452   0

Created on 2018-12-30 by the reprex package (v0.2.1)

Note that the zeros occur when MARS doesn't use the predictor in the model at all.
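One way to check which predictors a given fit actually used (not shown in the original post) is to pull the underlying earth object out of the parsnip fit and look at its variable importance:

library(earth)

# evimp() only lists the predictors that made it into the final MARS model
evimp(boot_samples$models[[1]]$fit)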
