I am using caret to build a logistic regression model that classifies a binary outcome; ultimately I want the predicted probability of a success. I am using rfe because I want automated feature selection. I have worked through several errors along the way, but I am stuck on the one shown at the end of this post. I suspect the problem is in how I define the rank function. Below is a reproducible example using the mtcars dataset. Thank you in advance for any advice.
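library(caret)    # createDataPartition, rfe, rfeControl, varImp, twoClassSummary
library(recipes)  # recipe, step_*, prep, juice, bake, %>%

set.seed(2624)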
percent <- 0.80
in_train <- createDataPartition(mtcars$am, p = percent, list = FALSE)
train_data <- mtcars[in_train,]
test_data <- mtcars[-in_train,]
log_recipe <- recipe(formula = am ~ ., data = train_data) %>%
  step_other(all_nominal(), -all_outcomes(), threshold = 0.02, other = "other_assigned ") %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = nrow(train_data)) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_numeric()) %>%
  step_lincomb(all_numeric()) %>%
  step_naomit(all_predictors())
log_prepped <- prep(log_recipe, training = train_data, verbose = TRUE, retain = TRUE)
train_prepped_log <- juice(log_prepped)
am <- as.factor(train_data$am)
train_prepped_log <- cbind(am, train_prepped_log)
test_prepped_log <- bake(log_prepped, new_data = test_data)
am <- as.factor(test_data$am)
test_prepped_log <- cbind(am, test_prepped_log)
log_model_info <- getModelInfo("glm")[[3]]
set.seed(2624)
log.glmRFE <- list(
  summary = twoClassSummary,
  type = "Classification",
  fit = function(x, y, data, first, last, ...) {
    tmp_x <- as.data.frame(x)
    regressors <- colnames(x)
    equation <- paste0(regressors, collapse = '+')
    full_equation <- paste0('y ~ ', equation)
    glm(as.formula(full_equation),
        data = tmp_x,
        family = "binomial")
  },
  pred = function(object, x) {
    predict(object, newdata = x, type = "response")
  },
  # rank = log_model_info$varImp,
  rank = function(object, x, y) {
    vimp <- varImp(object)
    vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
    vimp$var <- rownames(vimp)
    vimp
  },
  selectSize = pickSizeBest,
  selectVar = pickVars
)
log_ctrl <- rfeControl(functions = log.glmRFE,
                       method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       saveDetails = TRUE,
                       verbose = TRUE)
log_model <- rfe(x = train_prepped_log[, 2:ncol(train_prepped_log)],
                 y = train_prepped_log$am,
                 rfeControl = log_ctrl)
The call to rfe fails with the following error:
Error in { : task 1 failed - "undefined columns selected"
I have also tried replacing the custom rank function with rank = log_model_info$varImp (the commented-out line in the list above), which produced a different error:
Error in { : task 1 failed - "'match' requires vector arguments"
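For comparison, here is a minimal base-R sketch of the shape a rank function is generally expected to return: caret's rfe documentation describes it as a data frame with a column named var, ordered from most to least important. The helper name rank_by_abs_coef is hypothetical, and ranking by absolute coefficient is an assumption chosen for illustration (it is only sensible when predictors are on comparable scales). Coefficients that come back as NA are given an importance of 0 so that no predictor silently drops out of the ranking. Whether a missing row is actually what triggers "undefined columns selected" here is not confirmed.

```r
# Hypothetical helper (not part of caret): rank glm predictors by |coefficient|.
# It has the (object, x, y) signature used for the rank element above;
# x and y are accepted but not used.
rank_by_abs_coef <- function(object, x, y) {
  coefs <- abs(coef(object))
  coefs <- coefs[names(coefs) != "(Intercept)"]
  coefs[is.na(coefs)] <- 0  # assumption: NA (aliased) terms rank last
  vimp <- data.frame(Overall = unname(coefs),
                     var = names(coefs),
                     stringsAsFactors = FALSE)
  # Most important predictor first
  vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
}

# Stand-alone check on two mtcars predictors, without caret or recipes
dat <- data.frame(am = factor(mtcars$am), hp = mtcars$hp, wt = mtcars$wt)
fit <- glm(am ~ hp + wt, data = dat, family = binomial)
ranked <- rank_by_abs_coef(fit, dat[, c("hp", "wt")], dat$am)
print(ranked)
```

The result has one row per predictor, so selecting the top-ranked names from it should never index a column that does not exist in x.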