I'm working on a binary text classification problem using the packages tagged in this post, and it turns out the scrappy linear Support Vector Machine (SVM) is doing well.
As a next step I am hoping to extract the most "predictive" ngram features based on the positive and negative weights the linear SVM has assigned to them, as per Chang & Lin (2008, link below), and then manually annotate a set of topic labels based on the ngrams that appear.
Ideally this would be a query that extracts the top X (e.g. 20) ngrams ordered by their SVM weights, both positive and negative.
I've figured it out thanks to a few posts; I'm particularly grateful for Rebecca Barter's detailed tutorial on variable importance for another model class, random forests (see ref 2 below).
I'm outlining the main steps here, but please review the links at the end for the detail on why it was done this way.
1. Get Your Final Model
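The snippets below assume a tuneable workflow (model_wf), cross-validation folds (train_folds), a parameter grid (param_grid) and a metric set (classification_measure) already exist. As a rough, illustrative sketch of that setup (the column names text/label, the preprocessing steps and the tuning choices here are my assumptions, not necessarily what the original pipeline used, beyond a linear SVM on the kernlab engine):

library(tidymodels)
library(textrecipes)

# Illustrative recipe: tokenise, build ngrams, keep the most frequent tokens, tf-idf weight them
text_rec <- recipe(label ~ text, data = data) %>%
  step_tokenize(text) %>%
  step_ngram(text, num_tokens = 2, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = 1000) %>%
  step_tfidf(text)

# Linear SVM with the kernlab engine (matches the ksvm object pulled out later), cost to be tuned
svm_spec <- svm_linear(cost = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

model_wf <- workflow() %>%
  add_recipe(text_rec) %>%
  add_model(svm_spec)

train_folds <- vfold_cv(data, v = 10, strata = label)
param_grid <- grid_regular(cost(), levels = 10)
classification_measure <- metric_set(accuracy, roc_auc)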
set.seed(2020)
# Assuming kernlab linear SVM
# Grid Search Parameters
tune_rs <- tune_grid(
  model_wf,
  train_folds,
  grid = param_grid,
  metrics = classification_measure,
  control = control_grid(save_pred = TRUE)
)
# Finalise workflow with the parameters for best accuracy
best_accuracy <- select_best(tune_rs, "accuracy")
svm_wf_final <- finalize_workflow(
  model_wf,
  best_accuracy
)
# Fit your final model on all available data at the end of the experiment
final_model <- fit(svm_wf_final, data)
# fit() executes the model fitting routine (parsnip under the hood)
# here it takes the finalised workflow plus the data; with a bare model spec you would also supply a formula
ksvm_obj <- pull_workflow_fit(final_model)$fit
# pull_workflow_fit() returns the parsnip model fit object
# $fit returns the object produced by the underlying fitting function (which is what we need, and is engine dependent; here a kernlab ksvm object)
coefs <- ksvm_obj@coef[[1]]
# the first piece of information we need: the coefficients from the linear fit (one per support vector)
mat <- ksvm_obj@xmatrix[[1]]
# the xmatrix slot holds the support vectors, which we matrix-multiply against
var_impt <- coefs %*% mat
# variable importance: a 1 x n_features vector of weights, one per ngram feature
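From here, the "top X ngrams" query from the original question is just a matter of tidying var_impt and slicing it. A minimal sketch, assuming the recipe carried the feature names through to colnames(mat) (e.g. tfidf_text_<ngram>); note that which sign corresponds to which class depends on how kernlab coded the outcome factor levels, so sanity-check a few ngrams against the raw data:

library(dplyr)
library(tibble)

weights_tbl <- tibble(
  ngram = colnames(mat),        # feature (ngram) names from the support vector matrix
  weight = as.numeric(var_impt) # corresponding linear SVM weights
)

# Top 20 ngrams pulling predictions towards one class...
top_positive <- weights_tbl %>% slice_max(weight, n = 20)
# ...and the top 20 pulling towards the other
top_negative <- weights_tbl %>% slice_min(weight, n = 20)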