I have data that happens to be sequential through time. I’m using {tidymodels} to build a simple classifier using a random forest.
How I’m modeling
I’m only trying to classify the next step based on the immediately prior step, so I’m treating the classifier like a snapshot, analogous to classifying a photo; i.e., I’m ignoring any temporal structure or trends.
The data is being divided as follows:
- Isolate the first 3/4 of the observations by time (“FIRST”)
- Reserve the complementary last 1/4 of the observations by time (“LAST”)
FIRST is then split further at random (with initial_split()) into TRAIN and TEST sets, giving the following sets to use:
i. TRAIN (for modeling and cross validation)
ii. TEST (to test)
iii. LAST (an additional test set)
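The time-ordered part of this scheme (FIRST vs. LAST) can also be written with rsample's `initial_time_split()`, which takes the first `prop` of the rows in order rather than sampling them. A minimal sketch, assuming the rows are already sorted by date; `toy` is a made-up stand-in for the real data:

```r
library(rsample)

# Hypothetical toy data, already sorted by date (stand-in for the real data)
toy <- data.frame(
  date  = as.Date("2018-01-01") + 0:99,
  value = seq_len(100)
)

# Time-ordered split: first 3/4 of the rows vs. last 1/4, no shuffling
time_split <- initial_time_split(toy, prop = 3/4)
first_part <- training(time_split)  # plays the role of FIRST
last_part  <- testing(time_split)   # plays the role of LAST

# Every date in the held-out part comes after every date in the first part
max(first_part$date) < min(last_part$date)
```

By contrast, `initial_split()` samples rows at random, so the TRAIN and TEST sets it produces from FIRST are interleaved in time.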
What I’m finding, in brief:
Cross-validation accuracy on TRAIN > TEST accuracy >> LAST accuracy
I’ve searched for an explanation without success. Can someone with more knowledge diagnose what’s going on? Is this specific to random forests?
I’m obviously missing something basic, something fundamental about information leakage, but here’s a reprex in case it’s helpful:
library(tidymodels)
library(tidyverse)
library(ranger)
library(lubridate)
#set.seed(123)
## some ugly code just to generate some sample data
DATA_OBJECT <- tibble(date = as.Date("2018-01-01") + 0:750) %>%
  mutate(
    sunshine      = jitter(6 * sin(pi * as.numeric(yday(date)) / 365) + 5, amount = 3),
    temperature   = jitter(62 * sin(pi * as.numeric(yday(date)) / 365) + 30, amount = 12),
    temp_change   = log(temperature / lag(temperature)),
    TARGET_change = lead(temp_change),
    TARGET_bin    = as_factor(if_else(TARGET_change > 0, "up", "down"))
  ) %>%
  filter(!is.na(temp_change))
## take the first 3/4
FIRST <- DATA_OBJECT %>%
  slice_head(prop = .75)
## reserve last 1/4
LAST <- DATA_OBJECT %>%
  filter(date > max(FIRST$date))
## tidymodels package
split_data <- initial_split(FIRST)
train_data <- training(split_data)
test_data <- testing(split_data)
### first a recipe
model_rec <- recipe(TARGET_bin ~ ., data = train_data) %>%
  update_role(date, new_role = "the_date") %>%
  step_normalize(sunshine) %>%
  step_normalize(temperature) %>%
  update_role(TARGET_change, new_role = "TARGET_temp")
rand_forest_model <- rand_forest(trees = 5) %>%
  set_engine("ranger",
             importance = "impurity",
             num.threads = 10) %>%
  set_mode("classification")
rand_forest_workflow <- workflow() %>%
  add_recipe(model_rec) %>% # same recipe
  add_model(rand_forest_model)
model_fit <- rand_forest_workflow %>%
  fit(data = train_data)
#### Test using cross validation
folds <- vfold_cv(train_data, v = 10, repeats = 5)
## resample the unfitted workflow; it is refit on each fold
rf_resamp <- fit_resamples(rand_forest_workflow,
                           resamples = folds,
                           control = control_resamples(save_pred = TRUE,
                                                       verbose = TRUE))
rf_resamp %>%
  collect_metrics(summarize = TRUE)
#### Try TEST data
test_predictions <- predict(model_fit, test_data) %>%
  bind_cols(test_data) %>%
  bind_cols(predict(model_fit, test_data, type = "prob"))
test_predictions %>%
  roc_curve(truth = TARGET_bin, .pred_up) %>%
  autoplot()
## extract_fit_parsnip() replaces the deprecated pull_workflow_fit()
model_fit %>%
  extract_fit_parsnip() %>%
  vip::vip()
test_predictions %>%
  conf_mat(truth = TARGET_bin, .pred_class)
test_predictions %>%
  metrics(truth = TARGET_bin, .pred_class)
#### Try LAST data
test_predictions_LAST <- predict(model_fit, LAST) %>%
  bind_cols(LAST)
test_predictions_LAST %>%
  metrics(truth = TARGET_bin, .pred_class)
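For comparison with the random `vfold_cv()` folds above, resamples that respect time order can be built with rsample's `rolling_origin()`, where each assessment set lies strictly after its analysis set. This is only a sketch on made-up data: `toy` and the window sizes (`initial = 60`, `assess = 10`, `skip = 9`) are arbitrary assumptions, and it is not a claim about what explains the accuracy gap.

```r
library(rsample)

# Hypothetical toy data, already sorted by date (stand-in for train_data)
toy <- data.frame(
  date  = as.Date("2018-01-01") + 0:99,
  value = seq_len(100)
)

# Expanding-window resamples: fit on everything up to a cutoff,
# assess on the 10 rows that follow it
time_folds <- rolling_origin(
  toy,
  initial    = 60,    # rows in the first analysis set (assumed size)
  assess     = 10,    # rows in each assessment set (assumed size)
  cumulative = TRUE,  # analysis set grows with each resample
  skip       = 9      # move the cutoff forward 10 rows at a time
)

# Confirm that no assessment row precedes its analysis rows
ordered_ok <- vapply(
  time_folds$splits,
  function(s) max(analysis(s)$date) < min(assessment(s)$date),
  logical(1)
)
all(ordered_ok)
```

An object like `time_folds` can be passed as `resamples` to `fit_resamples()` in place of `folds`, which gives a cross-validation estimate that never trains on rows later than the ones it is scored on.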