Error when passing a recipe to caret::train

Dat · April 5, 2020, 11:04am

Hi

I get an error when I pass a recipe to caret::train().

Something seems to be wrong with the NA imputation, but I don't know how to fix it. If I remove the cross-validation, everything works fine.

Thanks in advance,

library(caret)
library(tidyverse)
library(rsample)
library(moments) 
library(visdat)
library(recipes)

data(airquality)
set.seed(33) 
air_split <- initial_split(airquality, prop = 0.7) 
air_train <- training(air_split)
air_test <- testing(air_split)

# Feature engineering - final recipe
air_recipe <- recipe(Ozone ~ ., data = air_train) %>% 
  step_zv(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_knnimpute(all_numeric(), neighbors = 6) %>% 
  step_log(Ozone, Wind) %>%
  step_other(Day, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal(), -all_outcomes())


# Validation
cv5 <- trainControl( method = "repeatedcv", 
                     number = 5,
                     repeats = 5, allowParallel = TRUE)

# Fit an lm model
set.seed(12) 
lm_fit <- train(
  air_recipe,
  data = air_train, 
  method = "lm", 
  trControl = cv5, 
  metric = "RMSE")

Error message
Error in quantile.default(y, probs = seq(0, 1, length = cuts)) : missing values and NaN's not allowed if 'na.rm' is FALSE

R.version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 6.1
year 2019
month 07
day 05
svn rev 76782
language R
version.string R version 3.6.1 (2019-07-05)
nickname Action of the Toes

Abderrahim · April 5, 2020, 7:45pm

It seems that step_knnimpute() does not work well here. My suggestion is to use missForest() to handle the missing values instead.

library(moments)
library(visdat)
library(recipes)
library(rsample)  # provides initial_split(), training(), testing()
library(caret)
library(missForest)

data(airquality)
set.seed(33) 

# Replace the missing values with missForest()
imputation_result <- missForest(airquality, verbose = TRUE)
# New data without missing values
air_quality_2 <- imputation_result$ximp
# Check whether any missing values remain
skimr::skim(air_quality_2)

air_split <- initial_split(air_quality_2 , prop = 0.7) 

air_train <- training(air_split)
air_test <- testing(air_split)

# Feature engineering - final recipe
air_recipe <- recipe(Ozone ~ ., data = air_train) %>% 
  # step_knnimpute(all_numeric(), neighbors = 6) %>% 
  step_log(Ozone, Wind) %>%
  step_other(Day, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>% 
  step_nzv(all_predictors())


# Validation
cv5 <- trainControl( method = "repeatedcv", 
                     number = 5,
                     repeats = 5, allowParallel = TRUE)

# Fit an lm model
set.seed(12) 
lm_fit <- train(
  air_recipe,
  data = air_train, 
  method = "lm", 
  trControl = cv5, 
  metric = "RMSE")

lm_fit
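If you would rather keep the imputation inside the recipe, here is a minimal sketch of another option. It assumes (not confirmed in this thread) that the error comes from the missing values in the outcome Ozone rather than from the kNN step itself, so it drops the rows with a missing Ozone before splitting and imputes the predictors only. The `knn_step` helper is hypothetical: it just picks whichever name the installed recipes version exports, since newer releases call this step step_impute_knn(). step_other() and step_dummy() are left out because every column in airquality is numeric.

```r
library(caret)
library(recipes)
library(rsample)

data(airquality)

# Rows with a missing outcome cannot be used to fit or score the model,
# so drop them before splitting
air_complete <- airquality[!is.na(airquality$Ozone), ]

set.seed(33)
air_split <- initial_split(air_complete, prop = 0.7)
air_train <- training(air_split)
air_test  <- testing(air_split)

# Hypothetical helper: use the step name the installed recipes version exports
knn_step <- if ("step_impute_knn" %in% getNamespaceExports("recipes")) {
  recipes::step_impute_knn
} else {
  recipes::step_knnimpute
}

# Impute predictors only, then apply the same transformations as before
air_recipe <- recipe(Ozone ~ ., data = air_train) %>%
  knn_step(all_predictors(), neighbors = 6) %>%
  step_log(Ozone, Wind) %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())

cv5 <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

set.seed(12)
lm_fit <- train(
  air_recipe,
  data = air_train,
  method = "lm",
  trControl = cv5,
  metric = "RMSE")

lm_fit
```

Because the recipe is passed to train(), the imputation is re-estimated inside each resample, which avoids imputing on the full data before the split.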

technocrat · April 5, 2020, 10:10pm

Hi, and welcome!

Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? A reprex, complete with representative data, will attract quicker and more numerous answers. It's much easier to cut and paste a single block.
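As an illustration of how small such a block can be, here is a base-R sketch that reproduces the same message without caret or recipes. It rests on an assumption that nobody in this thread has confirmed: that the quantile() call in the error is applied to the outcome Ozone, which contains missing values. The value of `cuts` is arbitrary and chosen only for the illustration.

```r
data(airquality)

# Count missing values per column
na_counts <- colSums(is.na(airquality))
na_counts

# The outcome itself has missing values
y <- airquality$Ozone

# Same call shape as in the error message; `cuts` is an arbitrary value here
cuts <- 5
res <- tryCatch(
  quantile(y, probs = seq(0, 1, length = cuts)),
  error = function(e) conditionMessage(e)
)
res

# With the missing outcomes dropped, the same call succeeds
quantile(y[!is.na(y)], probs = seq(0, 1, length = cuts))
```

If that assumption holds, the missing values in the outcome would be the thing to look at first.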

If @Abderrahim's suggestion of an alternative toolkit is an obstacle, you may be able to fix this (I haven't checked) by taking all_numeric() out of step_knnimpute() and using it in the recipe() formula instead:

recipe(Ozone ~ all_numeric() ... )

This is untested by me; it is just something suggested by the documentation for all_numeric().
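For comparison, here is a minimal sketch of selectors used where recipes documents them, as arguments to a step. As far as I know the recipe() formula does not accept in-line functions, so this keeps the plain `Ozone ~ .` formula and only changes the selector from all_numeric() to all_predictors(), so that the outcome is not among the imputed columns. The `knn_step` helper is hypothetical and only picks the step name that the installed recipes version exports.

```r
library(recipes)

data(airquality)

# Hypothetical helper: use the step name the installed recipes version exports
knn_step <- if ("step_impute_knn" %in% getNamespaceExports("recipes")) {
  recipes::step_impute_knn
} else {
  recipes::step_knnimpute
}

# The selector goes inside the step, not inside the formula
rec <- recipe(Ozone ~ ., data = airquality) %>%
  knn_step(all_predictors(), neighbors = 6)

prepped <- prep(rec, training = airquality)
baked <- bake(prepped, new_data = airquality)

# Missing values per column after the imputation step
na_after <- colSums(is.na(baked))
na_after
```

Comparing `na_after` with `colSums(is.na(airquality))` shows which columns the step actually touched.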

Dat · April 6, 2020, 8:06am

Thanks @Abderrahim for your help

Dat · April 6, 2020, 8:07am

Thanks @technocrat for your suggestion & idea
