rfe error - logistic classification - "undefined columns selected" and "'match' requires vector arguments"

I am using caret to build a logistic regression that classifies a binary outcome, and ultimately I would like the predicted probability of a success. I am using rfe because I want automated feature selection. I ran into several errors and was able to troubleshoot them until I reached this point. I suspect the problem is in how I define the rank function. Below is a reproducible example using the mtcars dataset. Thank you in advance for any advice.

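library(caret)    # createDataPartition, rfe, rfeControl, varImp, getModelInfo
library(recipes)  # recipe, step_*, prep, juice, bake, %>%

set.seed(2624)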
percent <- 0.80
in_train <- createDataPartition(mtcars$am, p = percent, list = FALSE)
train_data <- mtcars[in_train,]
test_data <- mtcars[-in_train,]

log_recipe <- recipe(formula = am ~ ., 
                     data = train_data) %>% 
  step_other(all_nominal(), -all_outcomes(), threshold = 0.02, other = "other_assigned") %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  step_pca(all_numeric(), num_comp = nrow(train_data)) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_nzv(all_predictors()) %>% 
  step_corr(all_numeric()) %>% 
  step_lincomb(all_numeric()) %>% 
  step_naomit(all_predictors())

log_prepped <- prep(log_recipe, training = train_data, verbose = TRUE, retain = TRUE)

train_prepped_log <- juice(log_prepped)
am <- as.factor(train_data$am)
train_prepped_log <- cbind(am, train_prepped_log)

test_prepped_log <- bake(log_prepped, new_data = test_data)
am <- as.factor(test_data$am)
test_prepped_log <- cbind(am, test_prepped_log)

log_model_info <- getModelInfo("glm")[[3]]
set.seed(2624)
log.glmRFE <-  list(summary = twoClassSummary,
                    type = "Classification",
                    fit = function(x, y, data, first, last, ...){
                      tmp_x <- as.data.frame(x)
                      regressors <- colnames(x)
                      equation <- paste0(regressors, collapse = '+')
                      full_equation <- paste0('y ~ ', equation)
                      glm(as.formula(full_equation),
                          data = tmp_x,
                          family = "binomial")
                      },
                    pred = function(object, x){
                      predict(object, newdata = x, type = "response")
                      },
#                    rank = log_model_info$varImp,
                    rank = function(object, x, y) {
                      vimp <- varImp(object)
                      vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
                      vimp$var <- rownames(vimp)                  
                      vimp
                      },
                    selectSize = pickSizeBest,
                    selectVar = pickVars)

log_ctrl <- rfeControl(functions = log.glmRFE,
                       method = "repeatedcv", 
                       number = 10,
                       repeats = 5,
                       saveDetails = TRUE,
                       verbose= TRUE)

log_model <- rfe(x = train_prepped_log[,2:ncol(train_prepped_log)], y = train_prepped_log$am, rfeControl = log_ctrl)

The rfe call fails with the following error:

Error in { : task 1 failed - "undefined columns selected"
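
One possible source of this message, offered as an unconfirmed guess rather than a diagnosis: if glm drops a collinear (aliased) term, a ranking built from its coefficient table has fewer rows than there are predictors, and selecting the top N names then pads the vector with NA. The sketch below reproduces that situation with base R only. The column wt2 is a hypothetical duplicate added purely for illustration, and the ranking step is a simplified stand-in for what a rank function returns, not caret's internal code.

```r
# Hypothetical illustration with base R only; wt2 is a made-up copy of wt.
d <- mtcars[, c("am", "hp", "wt")]
d$wt2 <- d$wt * 2                      # perfectly collinear with wt
x <- d[, c("hp", "wt", "wt2")]

fit <- glm(am ~ hp + wt + wt2, data = d, family = "binomial")

# The coefficient table omits the aliased term, so a ranking built
# from it has fewer rows than x has columns.
coef_table <- coef(summary(fit))
ranked <- setdiff(rownames(coef_table), "(Intercept)")
length(ranked)   # number of terms that can be ranked
ncol(x)          # number of predictors supplied

# Asking for as many names as there are predictors pads the vector
# with NA, and indexing a data frame by an NA column name fails.
top_n <- ranked[1:ncol(x)]
msg <- tryCatch(x[, top_n, drop = FALSE],
                error = function(e) conditionMessage(e))
msg
```

If this is what happens inside the resampling loop, comparing rownames(varImp(object)) with colnames(x) inside the rank function should show the mismatch.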

I also tried replacing the custom rank function with rank = log_model_info$varImp (the commented-out line above) and received a different error: Error in { : task 1 failed - "'match' requires vector arguments"
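
Separately from the rank function, the pred function above returns a bare numeric vector of probabilities. As far as I understand, twoClassSummary looks for a factor column named pred plus one probability column per class level, which is the shape caret's built-in lrFuncs$pred produces. The sketch below builds that shape with base R only. It is a minimal illustration under that assumption, not a confirmed fix for either error.

```r
# Base R only; the column layout follows my reading of caret's lrFuncs
# and twoClassSummary, which is an assumption rather than a confirmed fix.
d <- mtcars[, c("am", "hp", "wt")]
d$am <- factor(d$am)                   # levels "0" and "1"
lvl <- levels(d$am)

fit <- glm(am ~ hp + wt, data = d, family = "binomial")

# With a factor response, glm models the probability of the second level.
p <- predict(fit, newdata = d, type = "response")

# One probability column per class level, plus a factor of predicted classes.
out <- data.frame(1 - p, p)
colnames(out) <- lvl
out$pred <- factor(ifelse(p > 0.5, lvl[2], lvl[1]), levels = lvl)
head(out)
```

A pred function written this way would return out instead of the raw vector p.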
