Classifier behavior on time-ordered data (random forest, tidymodels)

I have data that is sequential through time, and I’m using {tidymodels} to build a simple random forest classifier.

How I’m modeling

I’m only trying to classify the next step from the immediately prior step, so I’m treating the classifier like a snapshot, analogous to classifying a photo; i.e. I’m ignoring any sequential structure or trends.
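To make that framing concrete, each row is meant to look roughly like this (toy columns purely for illustration, not my real data):

library(dplyr)

toy <- tibble(value = c(10, 12, 11, 13, 14)) %>%
  mutate(change      = value - lag(value),                      # what just happened (the "snapshot")
         next_change = lead(change),                            # what happens next
         target      = if_else(next_change > 0, "up", "down"))  # the thing I'm classifying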

The data is being divided as follows:

  • Isolate first 3/4 of the observations (by time) (“FIRST”)

  • Reserve the complementary last 1/4 of the observations (by time) (“LAST”)

We then further split FIRST into TRAIN and TEST sets (a code sketch of the splits follows this list), giving us the following sets to use:

i. TRAIN (for modeling and cross validation)

ii. TEST (a held-out test set from within FIRST)

iii. LAST (an additional test set)
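A sketch of that splitting scheme (using rsample’s initial_time_split() purely to illustrate; DATA_OBJECT is the sample data built in the reprex below, and the reprex itself does the first split manually with slice_head() and filter()):

library(rsample)

## time-ordered split: first 3/4 of rows -> FIRST, last 1/4 -> LAST
time_split <- initial_time_split(DATA_OBJECT, prop = 0.75)
FIRST <- training(time_split)
LAST  <- testing(time_split)

## random split within FIRST -> TRAIN / TEST
inner_split <- initial_split(FIRST)
TRAIN <- training(inner_split)
TEST  <- testing(inner_split)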

What I’m finding, in brief:

Cross-validation accuracy on TRAIN > TEST accuracy >> LAST accuracy. (I gather all three numbers side by side at the end of the reprex below.)

I’ve searched around for this without finding a clear explanation. Can someone diagnose what’s going on? Is this behavior specific to random forests?

I suspect I’m missing something basic about information leakage. Here’s a reprex in case it’s helpful:

library(tidymodels)
library(tidyverse)
library(ranger)
library(lubridate)

set.seed(123) # seed so the reprex is reproducible

## some ugly code just to generate some sample data

DATA_OBJECT <- tibble(date = as.Date("2018-01-01") + 0:750) %>%
                  mutate(sunshine = jitter(6*sin(pi*as.numeric(yday(date))/365) + 5, amount = 3),
                         temperature = jitter(62*sin(pi*as.numeric(yday(date))/365) + 30, amount = 12),
                         temp_change = log(temperature/ lag(temperature)),
                         TARGET_change = lead(temp_change),
                         TARGET_bin = as_factor(if_else(TARGET_change > 0, "up", "down"))) %>%
  filter(!is.na(temp_change), !is.na(TARGET_bin)) # drop the rows where lag()/lead() produce NAs

## take the first 3/4
FIRST <- DATA_OBJECT %>%
  slice_head(prop = .75)

## reserve last 1/4
LAST <- DATA_OBJECT %>%
  filter(date > max(FIRST$date))

## random train/test split of FIRST (rsample)
split_data <- initial_split(FIRST)

train_data <- training(split_data)

test_data <- testing(split_data)


### first a recipe

model_rec <- recipe(TARGET_bin ~ ., data = train_data) %>%
  update_role(date, new_role = "the_date") %>%          # keep date in the data but out of the predictors
  step_normalize(sunshine) %>%
  step_normalize(temperature) %>%
  update_role(TARGET_change, new_role = "TARGET_temp")  # likewise: retained for reference, not used as a predictor

rand_forest_model <- rand_forest(trees = 5) %>%
  set_engine("ranger",
             importance = "impurity",
             num.threads = 10) %>%
  set_mode("classification")

rand_forest_workflow <- workflow() %>%
  add_recipe(model_rec) %>% # same recipe
  add_model(rand_forest_model)

model_fit <- rand_forest_workflow %>%
  fit(data = train_data)

#### Test using cross validation

folds <- vfold_cv(train_data, v = 10, repeats = 5)

rf_resamp <- fit_resamples(rand_forest_workflow, # resample the workflow (refit on each fold)
                           resamples = folds, 
                           control = control_resamples(save_pred = TRUE, 
                                                       verbose = TRUE))


rf_resamp %>%
  collect_metrics(summarize = TRUE)

#### Try TEST data

test_predictions <- predict(model_fit, test_data) %>%
  bind_cols(test_data) %>%
  bind_cols(predict(model_fit, test_data, type = "prob"))

test_predictions %>%
  roc_curve(truth = TARGET_bin, .pred_up) %>%
  autoplot()

model_fit %>%
  extract_fit_parsnip() %>%
  vip::vip()

test_predictions %>%
  conf_mat(truth = TARGET_bin, .pred_class)

test_predictions %>%
  metrics(truth = TARGET_bin, .pred_class)



#### Try LAST data

test_predictions_LAST <- predict(model_fit, LAST) %>%
  bind_cols(LAST)

test_predictions_LAST %>%
  metrics(truth = TARGET_bin, .pred_class)
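
And for reference, roughly how I’m putting the three accuracies side by side (just a convenience block tacked onto the end of the reprex; it assumes the rf_resamp, test_predictions, and test_predictions_LAST objects created above):

bind_rows(
  rf_resamp %>%
    collect_metrics() %>%
    filter(.metric == "accuracy") %>%
    transmute(set = "TRAIN (cross validation)", accuracy = mean),
  test_predictions %>%
    accuracy(truth = TARGET_bin, estimate = .pred_class) %>%
    transmute(set = "TEST", accuracy = .estimate),
  test_predictions_LAST %>%
    accuracy(truth = TARGET_bin, estimate = .pred_class) %>%
    transmute(set = "LAST", accuracy = .estimate)
)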
