KSVM model in R (tidymodels) that uses leave one out? Help!

Hi Everyone,

I am a pretty low-level R user, but I have become very interested in classifiers recently. I'd like to learn how to properly use KSVMs with leave-one-out cross-validation. Specifically, I want to end up being able to see how many times the model was correct (and create a % accuracy value), so I can tell whether the data is useful for prediction.

Thus far, I've been able to make a model with a 75:25 training:testing split, but I am now stuck: I cannot seem to use that model with fit_resamples() and a dataset made from loo_cv(). Any advice on how to make this work? Script below. Thank you in advance!

packages

library(tidyverse)
library(tidymodels)

Random Data

set.seed(10131)
X = matrix(rnorm(40), 60, 2)  # note: rnorm(40) is recycled to fill the 60 x 2 matrix; rnorm(120) may have been intended
Y = rep(c(-1, 1), c(30, 30))
X[Y == 1, ] = X[Y == 1, ] + 1
plot(X, col = Y + 3, pch = 19)

Dataframe

dat = data.frame(X, Y = as.factor(Y))
dat_split = initial_split(dat, strata = Y)
trainD = training(dat_split)
testD = testing(dat_split)
data_loo = loo_cv(dat)

define model

loo_kSVM_test <-
  svm_linear(cost = 1) %>%
  # This model can be used for classification or regression, so set the mode
  set_mode("classification") %>%
  set_engine("kernlab")

loo_kSVM_test

Fit model

set.seed(1)
Ksvm_fit <- loo_kSVM_test %>% fit(Y ~ ., data = trainD)
Ksvm_fit

using LOO

Ksvm_refit <-
  loo_kSVM_test %>%
  fit_resamples(data_loo)

When you call functions like fit_resamples(), you need to add a pre-processor like a recipe or a formula:

Ksvm_refit <-
  loo_kSVM_test %>%
  fit_resamples(Y ~ ., resamples = data_loo)  # the data frame has columns X1 and X2, so use Y ~ .

Unfortunately, we don't support LOOCV; it has fairly bad properties (both statistical and computational) and would require its own infrastructure. If you try to use it, you get the error:

! Leave-one-out cross-validation is not currently supported with tune.

I would use the bootstrap instead.
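For example, here is a minimal sketch of that route, reusing the loo_kSVM_test spec and dat from above (times = 25 is just an illustrative choice, not a recommendation):

set.seed(2)
# Bootstrap resamples: each analysis set draws n rows with replacement,
# and the out-of-bag rows form the assessment set
data_boot <- bootstraps(dat, times = 25)

boot_res <-
  loo_kSVM_test %>%
  fit_resamples(Y ~ ., resamples = data_boot)

# Accuracy and ROC AUC averaged over the resamples
collect_metrics(boot_res)

The accuracy row of collect_metrics() is the overall "% correct" estimate you described wanting.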

If you really need LOO, you could do it yourself though:

library(tidyverse)
library(tidymodels)

set.seed(10131)
X = matrix(rnorm(40), 60, 2)  # as above: rnorm(40) is recycled; rnorm(120) may have been intended
Y = rep(c(-1, 1), c(30, 30))
X[Y == 1, ] = X[Y == 1, ] + 1
plot(X, col = Y + 3, pch = 19)

dat = data.frame(X, Y = as.factor(Y))
dat_split = initial_split(dat, strata = Y)
trainD = training(dat_split)
testD = testing(dat_split)
data_loo = loo_cv(dat)


loo_kSVM_test <-
  svm_linear(cost = 1) %>%
  # This model can be used for classification or regression, so set the mode
  set_mode("classification") %>%
  set_engine("kernlab")


data_loo_res <- 
  data_loo %>% 
  mutate(
    # Fit on the n - 1 analysis rows of each split...
    fits = map(splits, ~ fit(loo_kSVM_test, Y ~ ., data = analysis(.x))),
    # ...then predict class probabilities for the single held-out row
    predicted = map2(splits, fits, ~ predict(.y, assessment(.x), type = "prob"))
  )
#>  Setting default kernel parameters  (printed once per fit; repeated for all 60 resamples)

# Collate the held-out predictions and attach the true classes
all_pred <- 
  bind_rows(!!!data_loo_res$predicted) %>% 
  bind_cols(dat %>% select(Y))

roc_auc(all_pred, Y, `.pred_-1`)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.512

Created on 2023-02-24 by the reprex package (v2.0.1)
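Since the original goal was a % accuracy value, the same loop can also produce hard class predictions. Here is a small extension of the reprex above (same objects, and the same row-order assumption as the roc_auc() step):

data_loo_class <-
  data_loo_res %>%
  mutate(
    # type = "class" gives the predicted label for each held-out row
    pred_class = map2(splits, fits, ~ predict(.y, assessment(.x), type = "class"))
  )

all_class <-
  bind_rows(!!!data_loo_class$pred_class) %>%
  bind_cols(dat %>% select(Y))

# Proportion of the 60 held-out rows classified correctly
accuracy(all_class, Y, .pred_class)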

Hi Max, thank you so much for your response; I really appreciate your input. Would you mind describing the differences between LOO and bootstrapping a KSVM? My overall goal is to see how well one or more variables in my research data predict the condition the data came from; do both of these methods achieve this? I seem to recall that bootstrapping only uses a subset of the data. Thank you again!

They both only use a subset of the data.

The bootstrap leaves more out (about 36% of the rows on average in each resample) and has higher bias but very low variance. LOO has almost no bias but very large variance, and there is no way to replicate it to drive down the variance. If the bias is an issue for you, I would do repeated V-fold cross-validation with a large V and many repeats.
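For example, a sketch of that setup (v = 10 and repeats = 5 are illustrative choices):

set.seed(3)
# 10-fold CV repeated 5 times: every row is held out once per repeat,
# and averaging across the 50 resamples drives down the variance
data_cv <- vfold_cv(dat, v = 10, repeats = 5, strata = Y)

cv_res <-
  loo_kSVM_test %>%
  fit_resamples(Y ~ ., resamples = data_cv)

collect_metrics(cv_res)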

We don't know how much data you have; people usually want LOO when they have very little data.

For what you want, LOO makes sense as a diagnostic tool, but it is still not very good at measuring the performance of models. You can get good estimates of accuracy on future samples by avoiding it and using another resampling method.
