In order to properly train a support vector regression model, I'd like to normalize the numeric predictors. Since the kernlab engine can also scale predictors internally (using the scaled arg), I am wondering: what is the optimal/most efficient way to normalize?
IIRC kernlab will fail if there are any zero variance predictors so step_zv() may be helpful either way.
We often use step_normalize() in training materials even though the underlying model may do the same. This is mostly because we don't want to have a ton of model-specific recipes.
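As a minimal sketch of that kind of model-agnostic recipe (the name rec_sketch is only illustrative, and mtcars stands in for real data), a zero-variance filter can be placed ahead of the normalization step:

```r
library(recipes)

# Sketch only: drop zero-variance columns first, then center and scale
# the remaining numeric predictors.
rec_sketch <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Estimate the steps on the training data and return the processed data.
baked <- rec_sketch %>% prep() %>% bake(new_data = NULL)

# Each numeric predictor should now have mean ~0 and sd ~1.
round(colMeans(baked[, setdiff(names(baked), "mpg")]), 10)
round(apply(baked[, setdiff(names(baked), "mpg")], 2, sd), 10)
```

mtcars has no zero-variance columns, so step_zv() removes nothing here; it is included only to show where the step would go.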
Even though you say both approaches lead to the same result, I observe that the predictions look different, in particular if I run the last chunk in my previous post. Any idea why that is?
While I don't have an answer, this is pretty interesting.
First, always set the seed when running ksvm with an RBF kernel and no specific value of sigma, because it uses sigest() to estimate it. I thought that this would be the issue but it is not.
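To illustrate why the seed matters, here is a small sketch that calls sigest() directly on the mtcars predictors. The object names (x, est_a, and so on) are only illustrative, and this calls sigest() by hand rather than reproducing exactly how ksvm invokes it internally:

```r
library(kernlab)

# Predictors only; mpg is the first column of mtcars.
x <- as.matrix(mtcars[, -1])

# sigest() works from a random subsample of the rows, so its result
# depends on the state of the random number generator.
set.seed(1)
est_a <- sigest(x, scaled = TRUE)

set.seed(1)
est_b <- sigest(x, scaled = TRUE)

# No seed reset before this call, so it may differ from the two above.
est_c <- sigest(x, scaled = TRUE)

identical(est_a, est_b)  # same seed, same estimate
est_a
est_c
```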
Here is my detective work that makes me think that it is a kernlab issue:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
tidymodels_prefer()
theme_set(theme_bw())
rec_svr <-
recipe(mpg ~ ., data = mtcars)
spec_svr <-
svm_rbf(mode = "regression") %>%
set_engine(engine = "kernlab", scaled = TRUE)
rec_svr2 <-
recipe(mpg ~ ., data = mtcars) %>%
step_normalize(all_numeric_predictors())
spec_svr2 <-
svm_rbf(mode = "regression") %>%
set_engine(engine = "kernlab", scaled = FALSE)
set.seed(1)
res_1 <-
workflow(rec_svr, spec_svr) %>%
fit(mtcars)
set.seed(1)
res_2 <-
workflow(rec_svr2, spec_svr2) %>%
fit(mtcars)
# Was the issue different sigma estimates?
# These should be the same now that we set the seeds:
res_1$fit$fit$fit@kernelf@kpar$sigma
#> [1] 0.07258094
res_2$fit$fit$fit@kernelf@kpar$sigma
#> [1] 0.07258094
# Are predictions equal?
all.equal(
predict(res_1, mtcars),
predict(res_2, mtcars)
)
#> [1] "Component \".pred\": Mean relative difference: 0.07870634"
# Nope!
# The data that go into the model have different rows.
res_1 %>% extract_fit_engine() %>% pluck("xmatrix") %>% dim()
#> [1] 28 10
res_2 %>% extract_fit_engine() %>% pluck("xmatrix") %>% dim()
#> [1] 32 10
# waldo::compare() shows that res_1 (the 28-row matrix) is missing rows
# 1, 2, 7, and 9 of the original data.
# Here are the data coming out of the recipe:
rec_svr2 %>% prep() %>% bake(new_data = NULL, all_predictors()) %>% dim()
#> [1] 32 10
# The data in the workflow objects have the same number of rows:
dim(res_1$pre$mold$predictors)
#> [1] 32 10
dim(res_2$pre$mold$predictors)
#> [1] 32 10
## At this point ¯\_(ツ)_/¯
Thanks for looking further into the issue; very interesting indeed. Based on your insights, I would be more confident going with the second approach, as observations do not appear to be dropped when scaled = FALSE.
By the way, this variant also leads to the four observations being excluded:
If I understand correctly, this is essentially the workflow you use in most training/doc materials, right (since scaled = TRUE is the kernlab default)? However, it produces the same odd behavior as workflow 1, effectively dropping four rows from the data...
Any final recommendation what is the best recipe and model spec from your point of view?
In general it doesn't matter if the data are rescaled after having been scaled. If you want to have model-specific recipes (which we do to some extent) then you can omit step_normalize().
In this particular case, it is likely a bug so I would use scaled = FALSE.
EDIT: It appears that xmatrix simply returns the matrix of support vectors of the fitted model, not the actual training data points (ksvm-class: Class "ksvm" in kernlab: Kernel-Based Machine Learning Lab). That is, res1 has 28 support vectors while res2 has 32, so no rows are actually being dropped. The model errors also look vastly different:
# res1
Support Vector Machine object of class "ksvm"
SV type: eps-svr (regression)
parameter : epsilon = 0.1 cost C = 1
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.0716163705770673
Number of Support Vectors : 28
Objective Function Value : -8.4211
Training error : 0.134343
# res2
Support Vector Machine object of class "ksvm"
SV type: eps-svr (regression)
parameter : epsilon = 0.1 cost C = 1
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.0716163705770673
Number of Support Vectors : 32
Objective Function Value : -100.672
Training error : 14.141889
As a result, the second model with scaled = F performs much worse on mtcars and on my data as well. Super confused right now; I think I will revert to the res1 workflow with scaled = T...
EDIT2: ymatrix, by contrast, should return the same number of observations. However, with scaled = T the response variable is scaled as well. Could this be why we obtain different predictions, since step_normalize(all_numeric_predictors()) clearly does not scale the response?
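One way to probe this hypothesis is sketched below. It calls kernlab directly rather than going through a workflow. The fixed sigma of 0.07 is an arbitrary illustrative value that takes sigest() out of the picture. All object names are made up for the example. The assumption being tested is that epsilon and C act on whatever outcome scale the solver receives, so standardizing the outcome by hand with scaled = FALSE should behave like scaled = TRUE.

```r
library(kernlab)

x <- as.matrix(mtcars[, -1])  # predictors only
y <- mtcars$mpg
sigma_fixed <- 0.07           # arbitrary value, fixed to avoid sigest()

# Variant A: let kernlab do all of the scaling internally.
fit_internal <- ksvm(x, y, type = "eps-svr", kernel = "rbfdot",
                     kpar = list(sigma = sigma_fixed), scaled = TRUE)

# Variant B: standardize predictors AND outcome by hand, then switch
# kernlab's own scaling off.
x_std <- scale(x)
y_std <- scale(y)
fit_manual <- ksvm(x_std, as.vector(y_std), type = "eps-svr",
                   kernel = "rbfdot",
                   kpar = list(sigma = sigma_fixed), scaled = FALSE)

# Inspect what kernlab stored about its internal scaling.
str(scaling(fit_internal))

# Variant A predicts on the original mpg scale; variant B predicts in
# standardized units and has to be transformed back by hand.
pred_internal <- as.vector(predict(fit_internal, x))
pred_manual <- as.vector(predict(fit_manual, x_std)) *
  attr(y_std, "scaled:scale") + attr(y_std, "scaled:center")

# If the hypothesis holds, these should agree closely.
all.equal(pred_internal, pred_manual)
c(internal = nSV(fit_internal), manual = nSV(fit_manual))
```

If the two variants do agree, that would suggest the gap between res1 and res2 comes from the unscaled outcome rather than from dropped rows: with mpg on its original scale, epsilon = 0.1 is a very narrow tube relative to the spread of the response, which would also fit with every observation becoming a support vector.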