I'm working on a binary text classification problem using the packages tagged in this post, and it turns out the scrappy linear Support Vector Machine (SVM) is doing well.
As a next step I am hoping to extract the most "predictive" ngram features based on the positive and negative weights the linear SVM has assigned to them, as per Chang & Lin (2008, link below), and then manually annotate a set of topic labels based on the ngrams that appear.
Ideally this would be a query that extracts the top X (e.g. 20) ngrams ordered by their SVM weights, both positive and negative.
I've figured it out thanks to a few posts; I'm particularly grateful for Rebecca Barter's detailed tutorial on variable importance for another model class, random forests (see ref 2 below).
I'm outlining the main steps here, but please review the links at the end for the detail on why it was done this way.
1. Get Your Final Model
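The snippets below assume a tuneable workflow (model_wf), cross-validation folds (train_folds), a parameter grid (param_grid) and a metric set (classification_measure) already exist. As a rough, illustrative sketch of that setup (the column names text/label, the preprocessing steps and the tuning choices here are my assumptions, not necessarily what the original pipeline used, beyond a linear SVM on the kernlab engine):

library(tidymodels)
library(textrecipes)

# Illustrative recipe: tokenise, build ngrams, keep the most frequent tokens, tf-idf weight them
text_rec <- recipe(label ~ text, data = data) %>%
  step_tokenize(text) %>%
  step_ngram(text, num_tokens = 2, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = 1000) %>%
  step_tfidf(text)

# Linear SVM with the kernlab engine (matches the ksvm object pulled out later), cost to be tuned
svm_spec <- svm_linear(cost = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

model_wf <- workflow() %>%
  add_recipe(text_rec) %>%
  add_model(svm_spec)

train_folds <- vfold_cv(data, v = 10, strata = label)
param_grid <- grid_regular(cost(), levels = 10)
classification_measure <- metric_set(accuracy, roc_auc)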
set.seed(2020)
# Assuming kernlab linear SVM
# Grid Search Parameters
tune_rs <- tune_grid(
  model_wf,
  train_folds,
  grid = param_grid,
  metrics = classification_measure,
  control = control_grid(save_pred = TRUE)
)
# Finalise workflow with the parameters for best accuracy
best_accuracy <- select_best(tune_rs, "accuracy")
svm_wf_final <- finalize_workflow(
  model_wf,
  best_accuracy
)
# Fit your final model on all available data at the end of the experiment
final_model <- fit(svm_wf_final, data)
# fit() executes the model fitting routine (parsnip under the hood)
# here it takes the finalised workflow plus the data; with a bare model spec you would also supply a formula
ksvm_obj <- pull_workflow_fit(final_model)$fit
# pull_workflow_fit() returns the parsnip model fit object
# $fit returns the object produced by the underlying fitting function (which is what we need, and is engine dependent; here a kernlab ksvm object)
coefs <- ksvm_obj@coef[[1]]
# the first piece of information we need: the coefficients from the linear fit (one per support vector)
mat <- ksvm_obj@xmatrix[[1]]
# the xmatrix slot holds the support vectors, which we matrix-multiply against
var_impt <- coefs %*% mat
# variable importance: a 1 x n_features vector of weights, one per ngram feature
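From here, the "top X ngrams" query from the original question is just a matter of tidying var_impt and slicing it. A minimal sketch, assuming the recipe carried the feature names through to colnames(mat) (e.g. tfidf_text_<ngram>); note that which sign corresponds to which class depends on how kernlab coded the outcome factor levels, so sanity-check a few ngrams against the raw data:

library(dplyr)
library(tibble)

weights_tbl <- tibble(
  ngram = colnames(mat),        # feature (ngram) names from the support vector matrix
  weight = as.numeric(var_impt) # corresponding linear SVM weights
)

# Top 20 ngrams pulling predictions towards one class...
top_positive <- weights_tbl %>% slice_max(weight, n = 20)
# ...and the top 20 pulling towards the other
top_negative <- weights_tbl %>% slice_min(weight, n = 20)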