I created a tidymodels pipeline that works perfectly fine until I change the method of creating the rsplit object from initial_split() to initial_time_split(). When I do this, the last_fit() function breaks and shows the error:
x : internal: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.
I need to use initial_time_split() because I have a specific set for training and another set for testing (they are different time periods), so a random draw of rows doesn't work. Does anyone know how to fix this problem, or a better way to do this? Thank you very much for your time!
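For context: initial_time_split() keeps the first prop fraction of rows, in the order they appear, as the training set, while initial_split() samples rows at random. A minimal sketch on toy data (the toy data frame and its day column are made up for illustration):
library(rsample)
toy <- data.frame(day = 1:10, value = rnorm(10))
time_split <- initial_time_split(toy, prop = 0.7)
training(time_split)$day #always the earliest rows: 1 to 7
testing(time_split)$day #always the latest rows: 8 to 10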
Here is the code that produces the error. Note that if you swap the df_split line after the ATTENTION comment for the commented-out alternative, everything works perfectly fine.
library(tidymodels)
library(dplyr)
library(fastDummies)
#download the data
data = read.csv("https://uc4f84ae07955bebed2c3804d381.dl.dropboxusercontent.com/cd/0/get/A-PcNiWAKII0M8OlwmYxE1fYXFhtTUPnLw2x_AvL3IlUR_HE8_IPdTVPaYj1mtQwByPgcq2qpj-bfb4O8-wgW4rqgAnff4cLbNSSGe44FewPUmxenJZBpvxXikDQUyVVXXY/file?_download_id=89899030331414668534005769335371927551009313297349360886651029534&_notify_domain=www.dropbox.com&dl=1")
data = data %>%
select(-X) %>%
dummy_cols(remove_selected_columns = TRUE) %>%
mutate(label = as.factor(label))
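#optional: check the structure after dummy coding (label should be a factor,
#train a 0/1 flag)
glimpse(data)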
#Creation of train and test splits
set.seed(123)
proportion = sum(data$train)/nrow(data) #proportion of train observations in the data
# ATTENTION #
df_split = initial_time_split(data, prop = proportion) #IT SEEMS THAT THIS CREATES THE PROBLEM
#df_split = initial_split(data, prop = 3/4, strata = label) #swapping this line in for the previous one fixes the problem
df_train <- training(df_split)
df_test <- testing(df_split)
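#optional sanity check: initial_time_split() keeps the first rows as training,
#so this matches the intended periods only if the data is sorted with the
#training period first; compare against the existing train flag
table(df_train$train)
table(df_test$train)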
#Recipe
recipe <-
recipe(label ~ ., data = df_train) %>%
update_role(x, y, train, new_role = "ID")
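#optional: confirm that x, y and train now carry the "ID" role and the
#rest are predictors
summary(recipe)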
#model specification
cores = parallel::detectCores()
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost", nthread = cores) %>%
set_mode("classification")
#Hyperparameters
set.seed(123)
xgb_grid <- grid_max_entropy(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), df_train),
learn_rate(),
size = 7
)
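#optional: print the candidate grid; it should have 7 rows and one column
#per tuned parameter
xgb_grid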
#Cross validation
set.seed(123)
folds <- vfold_cv(df_train, v = 2, strata = label)
#Model workflow
xgb_workflow <-
workflow() %>%
add_model(xgb_spec) %>%
add_recipe(recipe)
### Training ###
set.seed(123)
xgb_res = xgb_workflow %>%
tune_grid(resamples = folds,
grid = xgb_grid,
metrics = metric_set(roc_auc))
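#optional: inspect the resampled roc_auc of every candidate before selecting
collect_metrics(xgb_res)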
#selection of best model
best_auc <- select_best(xgb_res, "roc_auc")
#adding best model to workflow
final_xgb <- finalize_workflow(
xgb_workflow,
best_auc)
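#optional: print the finalized workflow to confirm the tune() placeholders
#were replaced by the selected values
final_xgb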
#Last fit
final_res <- last_fit(object = final_xgb, split = df_split)
x : internal: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.
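In case it helps with diagnosing: even when all models fail, last_fit() still returns its results tibble, so the underlying error can usually be read from the .notes column mentioned in the warning (a minimal sketch; the exact structure of the notes differs between tune versions):
final_res$.notes[[1]]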
Session Info
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] fastDummies_1.6.1 xgboost_1.1.1.1 vip_0.2.2 themis_0.1.2 yardstick_0.0.7 workflows_0.1.3 tune_0.1.1 tidyr_1.1.1 tibble_3.0.3 rsample_0.0.7 recipes_0.1.13 purrr_0.3.4 parsnip_0.1.3 modeldata_0.0.2 infer_0.5.3 ggplot2_3.3.2 dplyr_1.0.2 dials_0.0.8 scales_1.1.1 broom_0.7.0 tidymodels_0.1.1