I currently have 4 supervised text classifiers: Naive Bayes, Semi-Naive Bayes, Bernoulli Naive Bayes, and SVM. As a performance metric I am using the F1 score. Since the F1 scores of all the classifiers fall within the same margin, I would also like to determine the best model statistically. In the literature I have found 3 types of tests:
ANOVA test
Friedman test
McNemar's test (only applies to comparing 2 classifiers)
Which test would you suggest I use, or is there a better-suited test for my task?
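For context, the Friedman test compares the classifiers across matched resamples (e.g. the same cross-validation folds). A minimal sketch using `scipy.stats.friedmanchisquare`, with made-up placeholder F1 scores for the four classifiers:

```python
# Illustrative sketch: Friedman test on per-fold F1 scores for four
# classifiers. The scores below are randomly generated placeholders,
# not real results.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_folds = 10

# Hypothetical per-fold F1 scores, one array per classifier.
f1 = {
    "naive_bayes":      0.80 + 0.02 * rng.standard_normal(n_folds),
    "semi_naive_bayes": 0.81 + 0.02 * rng.standard_normal(n_folds),
    "bernoulli_nb":     0.79 + 0.02 * rng.standard_normal(n_folds),
    "svm":              0.82 + 0.02 * rng.standard_normal(n_folds),
}

# A small p-value suggests at least one classifier's ranking differs.
stat, p = friedmanchisquare(*f1.values())
print(f"Friedman chi-square = {stat:.3f}, p = {p:.3f}")
```

A significant Friedman result is usually followed by a post-hoc pairwise test (e.g. Nemenyi) to find which classifiers actually differ.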
McNemar's test isn't great here since it only uses the off-diagonal entries of the paired 2x2 table (the cases where the two classifiers disagree) and tests whether those counts are about equal.
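To make that concrete, here is a minimal sketch of McNemar's test for two classifiers scored on the same test examples, using `statsmodels` (the labels and predictions are hypothetical):

```python
# McNemar's test on paired predictions: only the two disagreement
# cells (A right / B wrong, and A wrong / B right) drive the test.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1])  # hypothetical
pred_a = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1])  # hypothetical
pred_b = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1])  # hypothetical

a_ok = pred_a == y_true
b_ok = pred_b == y_true

# 2x2 table: rows = A correct/incorrect, cols = B correct/incorrect.
table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]

# Exact binomial test on the off-diagonal disagreement counts.
result = mcnemar(table, exact=True)
print(f"McNemar p-value = {result.pvalue:.3f}")
```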
The tests listed are typically applied to the test set, and you should only touch that data at the very end, after you've already narrowed things down to one or two models to keep.
You might want to test the resampled performance metrics using tidyposterior. It fits a Bayesian model to the resampling statistics, and you can use that to get probabilities of superiority or of practical equivalence. Take a look at Section 11.4 of Tidy Modeling with R.
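tidyposterior itself is an R package; to illustrate the underlying idea in Python (this is NOT tidyposterior's actual model, just a crude analogue), one can put a posterior on the mean paired difference in resampled F1 between two models, then read off the probability of superiority and of practical equivalence within a small margin:

```python
# Crude analogue of the tidyposterior idea: assume the per-resample
# F1 differences (model A minus model B) are roughly normal; with a
# flat prior, the posterior for the mean difference is Student-t.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
# Hypothetical per-resample F1 differences, A - B.
diffs = 0.01 + 0.02 * rng.standard_normal(30)

n = diffs.size
mean = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)
posterior = t(df=n - 1, loc=mean, scale=se)

# Probability that A beats B on average.
p_superior = posterior.sf(0.0)
# Probability the mean difference lies inside a small "practical
# equivalence" region; the margin is a judgment call, not a default.
rope = 0.005
p_equiv = posterior.cdf(rope) - posterior.cdf(-rope)
print(f"P(superiority) = {p_superior:.3f}, "
      f"P(practical equivalence) = {p_equiv:.3f}")
```

tidyposterior fits a richer hierarchical model across all resamples and models at once, so treat this only as a sketch of the kind of quantity it reports.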