Hi all,
I've had real success using tidy principles in my text classification project. Following some guides, I've been able to produce a classification model with fairly strong performance (> 0.8 on sensitivity/recall, specificity, and precision). I've gotten as far as creating my predictors / features, putting them into a recipe, and juicing it. I've been using these resources:
Here is a truncated version of the R code that shows the model definition:
library(tidymodels)

# cross-validation object
folds <- vfold_cv(train)

# declare a random forest classification model
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")
rf_spec

# build a 'workflow' by passing the recipe and the model
rf_wf <- workflow() %>%
  add_recipe(preprocessing_recipe) %>%
  add_model(rf_spec)
rf_wf

# fit the workflow on each resample and compute the metrics
rf_rs <- fit_resamples(
  rf_wf,
  folds,
  metrics = metric_set(recall, precision, sensitivity, specificity, accuracy),
  control = control_resamples(save_pred = TRUE)
)
rf_rs
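Because save_pred = TRUE keeps the out-of-fold predictions, they can be pulled out one row per record with collect_predictions(). Here is a minimal sketch of that step. It uses the two_class_dat example data from the modeldata package as a stand-in for my text features, so demo_train, demo_folds, demo_wf, and demo_rs are illustrative names, not objects from my project. I also added roc_auc to the metric set, on the assumption that class-probability columns are only saved when a probability-based metric is requested.

```r
library(tidymodels)

set.seed(123)
data(two_class_dat, package = "modeldata")
demo_train <- two_class_dat                 # stand-in for the real training set
demo_folds <- vfold_cv(demo_train, v = 5)

demo_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(
    rand_forest(trees = 100) %>%
      set_mode("classification") %>%
      set_engine("ranger")
  )

demo_rs <- fit_resamples(
  demo_wf,
  demo_folds,
  metrics = metric_set(accuracy, roc_auc),  # roc_auc is a probability-based metric
  control = control_resamples(save_pred = TRUE)
)

# one row per held-out record; .row indexes back into demo_train
oof_preds <- collect_predictions(demo_rs)
head(oof_preds)
```

Each record appears once, predicted by a model that never saw it during training, so these rows are a fairer basis for per-record checks than predictions on the training data itself.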
So after defining this model and fitting it, I am able to use it to classify my text! I feel great about the performance metrics so far and am working on tuning my model. But here's what I really want to know:
How can I report the 'goodness of fit' for each record? In other words, is there a way to know how well a record matches its assigned class?
For example, if the model labels a text record as "positive" based on the features / predictors, how can I describe this particular record's fit to the "positive" class? In conventional statistics there are confidence values, intervals, p-values, and so on. Any advice or resources would be helpful. Thank you.
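To make the question concrete: the closest thing I have found so far is the predicted class probability for each record. Below is a minimal sketch on the two_class_dat example data from modeldata rather than my text data (demo_split, demo_fit, scored, and the confidence column are names I made up for illustration). I realise this probability is a model-based estimate, not a p-value or a confidence interval, and I don't know how well calibrated it is.

```r
library(tidymodels)

set.seed(123)
data(two_class_dat, package = "modeldata")
demo_split <- initial_split(two_class_dat, strata = Class)

demo_fit <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(
    rand_forest(trees = 100) %>%
      set_mode("classification") %>%
      set_engine("ranger")
  ) %>%
  fit(data = training(demo_split))

# augment() appends .pred_class plus one .pred_<level> probability column per class
scored <- augment(demo_fit, new_data = testing(demo_split)) %>%
  mutate(confidence = pmax(.pred_Class1, .pred_Class2))  # highest class probability per record

scored %>%
  select(.pred_class, .pred_Class1, .pred_Class2, confidence) %>%
  head()
```

A record with a confidence near 1 would be one the model assigns firmly to its class, while a value near 0.5 would be a borderline case. Whether that is a defensible way to report per-record fit is exactly what I'm unsure about.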