Dat
April 5, 2020, 11:04am
1
Hi,
I run into a problem when I pass a recipe to caret::train().
Something seems to be wrong with the NA imputation, but I don't know how to fix it. If I remove the cross-validation, everything works fine.
Thanks in advance,
library(caret)
library(tidyverse)
library(rsample)
library(moments)
library(visdat)
library(recipes)
data(airquality)
set.seed(33)
air_split <- initial_split(airquality, prop = 0.7)
air_train <- training(air_split)
air_test <- testing(air_split)
# Feature engineering - final recipe
air_recipe <- recipe(Ozone ~ ., data = air_train) %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_knnimpute(all_numeric(), neighbors = 6) %>%
  step_log(Ozone, Wind) %>%
  step_other(Day, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal(), -all_outcomes())
# Validation
cv5 <- trainControl(method = "repeatedcv",
                    number = 5,
                    repeats = 5,
                    allowParallel = TRUE)
# Fit an lm model
set.seed(12)
lm_fit <- train(
  air_recipe,
  data = air_train,
  method = "lm",
  trControl = cv5,
  metric = "RMSE")
Error message
Error in quantile.default(y, probs = seq(0, 1, length = cuts)) : missing values and NaN's not allowed if 'na.rm' is FALSE
R.version
               _
platform       x86_64-apple-darwin15.6.0
arch           x86_64
os             darwin15.6.0
system         x86_64, darwin15.6.0
status
major          3
minor          6.1
year           2019
month          07
day            05
svn rev        76782
language       R
version.string R version 3.6.1 (2019-07-05)
nickname       Action of the Toes
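For context on the error text: it is raised by quantile(), which caret's createFolds() and createMultiFolds() call on a numeric outcome in order to stratify the resampling folds. The minimal sketch below, which assumes only that caret is installed, triggers the same kind of failure directly from the raw Ozone column. Whether this is the exact code path hit inside train() for the recipe above is an assumption, not something confirmed in this thread, but it would be consistent with the model fitting once cross-validation is removed, since no folds are built in that case.

```r
library(caret)

data(airquality)

# Count the missing values in the outcome and in each predictor
colSums(is.na(airquality))

# createFolds() bins a numeric outcome with quantile() before sampling,
# so a missing outcome value stops fold creation with an error
fold_attempt <- try(createFolds(airquality$Ozone, k = 5), silent = TRUE)

# TRUE when fold creation failed
inherits(fold_attempt, "try-error")

# Print the captured error message
cat(conditionMessage(attr(fold_attempt, "condition")), "\n")
```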
It seems the step_knnimpute() function does not work well. My suggestion is to use the missForest() function to deal with the missing values.
library(moments)
library(visdat)
library(recipes)
library(rsample)  # provides initial_split(), training() and testing()
library(caret)
library(missForest)
data(airquality)
set.seed(33)
# Replace the missing values with the missForest() function
imputation_result <- missForest(airquality, verbose = TRUE)
# New data without missing values
air_quality_2 <- imputation_result$ximp
# Check whether any missing values remain in the data
skimr::skim(air_quality_2)
air_split <- initial_split(air_quality_2, prop = 0.7)
air_train <- training(air_split)
air_test <- testing(air_split)
# Feature engineering - final recipe
air_recipe <- recipe(Ozone ~ ., data = air_train) %>%
  # step_knnimpute(all_numeric(), neighbors = 6) %>%
  step_log(Ozone, Wind) %>%
  step_other(Day, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())
# Validation
cv5 <- trainControl(method = "repeatedcv",
                    number = 5,
                    repeats = 5,
                    allowParallel = TRUE)
# Fit an lm model
set.seed(12)
lm_fit <- train(
  air_recipe,
  data = air_train,
  method = "lm",
  trControl = cv5,
  metric = "RMSE")
lm_fit
Hi, and welcome!
Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? A reprex, complete with representative data, will attract quicker and more numerous answers. It's much easier to cut and paste a single block.
If @Abderrahim's suggestion of an alternative toolkit is an obstacle, you may be able to correct this (I haven't checked it) by taking all_numeric() out of step_knnimpute and using it as an argument to recipe:
recipe(Ozone ~ all_numeric() ... )
This is untested by me, just something suggested by the documentation for all_numeric.
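A different way to stay within recipes can be sketched as follows. Selectors such as all_numeric() and all_predictors() are documented for use inside step functions, so this version keeps the selector in the imputation step but restricts it to predictors, and drops rows with a missing Ozone value before splitting. Dropping those rows is an assumption of this sketch rather than something proposed in the thread, as is the small impute_knn helper, which is a hypothetical name that picks step_impute_knn() when the installed recipes version provides it and falls back to step_knnimpute() otherwise. The nominal steps are left out because airquality contains only numeric columns, and only Wind is log-transformed to keep the example short.

```r
library(caret)
library(recipes)
library(rsample)

data(airquality)

# Assumption: rows with a missing outcome are removed rather than imputed
air_complete <- airquality[!is.na(airquality$Ozone), ]

set.seed(33)
air_split <- initial_split(air_complete, prop = 0.7)
air_train <- training(air_split)
air_test <- testing(air_split)

# Hypothetical helper: use the newer step name when it is available
impute_knn <- if (exists("step_impute_knn",
                         envir = asNamespace("recipes"),
                         inherits = FALSE)) {
  recipes::step_impute_knn
} else {
  recipes::step_knnimpute
}

# Impute predictors only, inside the recipe, so that the imputation is
# re-estimated on each resample
air_recipe <- recipe(Ozone ~ ., data = air_train) %>%
  impute_knn(all_predictors(), neighbors = 6) %>%
  step_log(Wind) %>%
  step_zv(all_predictors())

cv5 <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

set.seed(12)
lm_fit <- train(
  air_recipe,
  data = air_train,
  method = "lm",
  trControl = cv5,
  metric = "RMSE")
lm_fit
```

Keeping the imputation inside the recipe means the held-out rows of each fold do not inform the imputation model, which is the main reason to prefer it over imputing the full data set up front.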
Dat
April 6, 2020, 8:06am
4
Thanks @Abderrahim for your help
Dat
April 6, 2020, 8:07am
5
Thanks @technocrat for your suggestion & idea