Which machine learning method should I choose to predict a binary outcome based on several binary predictors?

The data has an outcome variable (healthy or cancer) and several binary predictors (yes or no). I tried logistic regression, SVM, KNN, xgboost, lightGBM, and random forest, and found that the best model was logistic regression. The AUC and accuracy of xgboost and lightGBM were only close to those of logistic regression, even after I tuned their parameters. So which machine learning method should I choose to predict a binary outcome based on several binary predictors?

Is it suitable to use SVM, KNN, xgboost, lightGBM, or random forest in this case? Or is logistic regression the only method?

Hi Liang_Z.

Given you have several binary predictors, you may wish to try something like a logistic LASSO?
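For reference, here is a minimal sketch of a logistic LASSO fitted directly with glmnet; the data frame `dat` and outcome column `outcome` are placeholders for your own data:

```r
library(glmnet)

# x must be a numeric matrix; binary yes/no predictors become 0/1 dummies
x <- model.matrix(outcome ~ ., data = dat)[, -1]
y <- dat$outcome  # factor: healthy / cancer

# Cross-validated LASSO (alpha = 1) for a binomial outcome,
# tuned on AUC rather than the default deviance
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    type.measure = "auc")

coef(cv_fit, s = "lambda.1se")  # sparse coefficients at the 1-SE penalty
```

The 1-SE rule gives a slightly sparser model than the AUC-optimal penalty, which is often preferable with a small set of binary predictors.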

I have tried lasso using tidymodels:

lasso_model <- 
  logistic_reg(mode = "classification",
               penalty = tune(), 
               mixture = 1,
               engine = "glmnet")

lasso_wf <-
  workflow() %>%
  add_model(lasso_model) %>% 
  add_formula(outcome ~ .)  # placeholder; substitute your own formula or recipe

lasso_results <-
  lasso_wf %>% 
  tune_grid(resamples = dat_cv,
            control = control_grid(save_pred = TRUE),
            grid = tibble(penalty = 10 ^ seq(-5, 0, length.out = 50)),
            metrics = metric_set(accuracy, roc_auc))

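As a follow-up, one way to pick the penalty and finalise the workflow, assuming the objects above plus a training set (`dat_train` is a placeholder):

```r
# Pick the penalty with the highest cross-validated AUC
best_penalty <- select_best(lasso_results, metric = "roc_auc")

# Lock the chosen penalty into the workflow and refit on the training data
final_fit <-
  lasso_wf %>%
  finalize_workflow(best_penalty) %>%
  fit(data = dat_train)
```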
The roc_auc and accuracy values equal those obtained from logistic regression.
What I mean is: are other machine learning methods feasible for this kind of data?

Hi @Liang_Z.

Fair point. It seems you've pretty much exhausted the options if your primary concern is AUC.

I went through a similar exploration journey and found very little discrepancy in AUCs. But I ended up settling on something like an optimal decision tree for good visualisation/prediction, i.e. a more white-box method.
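For what it's worth, a minimal sketch of a single interpretable tree with rpart; the data frame `dat` and outcome column `outcome` are placeholders:

```r
library(rpart)
library(rpart.plot)

# A shallow tree is easy to read off as clinical-style decision rules
tree_fit <- rpart(outcome ~ ., data = dat, method = "class",
                  control = rpart.control(maxdepth = 3))

rpart.plot(tree_fit)   # white-box visualisation of the splits
rpart.rules(tree_fit)  # the same tree printed as plain-language rules
```

With only binary predictors, each split is a single yes/no question, so a depth-limited tree reads almost like a checklist.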


Thanks for your reply. @AC3112

The results calculated by several ML methods are listed:

| algorithm     | auc   | accuracy | f_meas | precision | recall |
|---------------|-------|----------|--------|-----------|--------|
| lasso         | 0.880 | 0.846    | 0.789  | 0.789     | 0.789  |
| knn           | 0.825 | 0.808    | 0.722  | 0.765     | 0.684  |
| svm           | 0.835 | 0.827    | 0.743  | 0.812     | 0.684  |
| random forest | 0.864 | 0.827    | 0.743  | 0.812     | 0.684  |
| naive bayes   | 0.875 | 0.865    | 0.788  | 0.929     | 0.684  |
| decision trees| 0.860 | 0.827    | 0.743  | 0.812     | 0.684  |
| bag trees     | 0.883 | 0.846    | 0.789  | 0.789     | 0.789  |
| mlp           | 0.880 | 0.846    | 0.789  | 0.789     | 0.789  |
| xgboost       | 0.875 | 0.827    | 0.769  | 0.750     | 0.789  |
| lightgbm      | 0.870 | 0.846    | 0.789  | 0.789     | 0.789  |

It seems that the metrics of lasso, bag trees, mlp, and lightgbm are similar. I don't know which one I should choose for the final ML model.
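When headline metrics tie like this, one option is to compare the candidates on the same resamples before defaulting to the simplest model. A sketch assuming both models were tuned on the same `dat_cv` folds; `bag_results` is a hypothetical tuning result for the bagged trees:

```r
library(dplyr)

# Per-fold roc_auc for two candidate models (summarize = FALSE keeps
# one row per resample instead of averaging)
lasso_auc <- collect_metrics(lasso_results, summarize = FALSE) %>%
  filter(.metric == "roc_auc")
bag_auc <- collect_metrics(bag_results, summarize = FALSE) %>%
  filter(.metric == "roc_auc")

# If the paired differences straddle zero, the models are statistically
# indistinguishable and the simpler, more interpretable one (the lasso)
# is the natural choice
t.test(lasso_auc$.estimate, bag_auc$.estimate, paired = TRUE)
```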
