Hierarchical Forecasting with a Binary Regressor (Public Holiday)

Hi,

I am working for an F&B multi-national company.

I am trying to forecast the sales of each outlet with hierarchical ARIMA forecasting, using a binary variable 'holiday_flag' as a regressor (1 = public holiday or weekend, 0 = otherwise). This is the code I used:

sales_data_agg_fit <- train_data |>
  fill_gaps() |>
  model(ARIMA(gross_sales ~ holiday_flag))

When I run the model, I get a message that says "Provided exogenous regressors are rank deficient, removing regressors: holiday_flag". The binary variable consists only of 1s and 0s, and the proportion of 1s is around 33%. Is it not possible to run a dynamic hierarchical forecasting model with a binary regressor? If not, any recommendations on how to proceed?

Thanks!


Referred here by Forecasting: Principles and Practice, by Rob J Hyndman and George Athanasopoulos

It is definitely possible, provided you have a mix of 1s and 0s for the binary variable for each series. Perhaps some of your time series contain missing values when weekends or holidays occur, so that holiday_flag is zero whenever there is data available.

If you want further help, you will need to post a reproducible example of the problem.
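To illustrate that check, here is a minimal self-contained sketch (toy data and column names modelled on the question, not the actual series): count, for each store, how many distinct values holiday_flag takes on days where gross_sales is observed. Any series with only one distinct value makes the regressor constant, and hence rank deficient, for that series.

```r
library(dplyr)

# Toy data standing in for train_data: two stores, one of which
# has sales recorded only on non-holidays (holiday_flag always 0)
toy <- tibble::tibble(
  store_code   = rep(c("A", "B"), each = 4),
  gross_sales  = c(10, 12, 11, 9, 8, NA, 7, NA),
  holiday_flag = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# For each store, count distinct holiday_flag values among observed sales
flag_check <- toy |>
  filter(!is.na(gross_sales)) |>
  group_by(store_code) |>
  summarise(n_levels = n_distinct(holiday_flag))

flag_check
# Store "B" has n_levels == 1: holiday_flag is constant where its sales
# are observed, so it would be dropped as rank deficient for that series
```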

Having thoroughly studied Chapter 11 of your book "Forecasting: Principles and Practice" (on forecasting hierarchical time series), and having followed your suggestion to check for missing values in holiday_flag, I have already excluded missing values from the data.
However, I encountered another issue that I would like to ask about. My time series data comprises columns such as business_date, store_code, gross_sales, and holiday_flag. The challenge is that each store_code has a different start date, so when I create models, some data points end up as NA because certain store_codes were not yet operational within the selected date range. For instance, store_code "XXX1" opened on 1 April 2022 and "XXX2" opened on 1 March 2022, while the dataset I used runs from 1 March 2023 until 30 Nov 2023.
This is the code that I used:

fit_1 <- train_data |>
  fill_gaps() |>
  model(
    ARIMA(gross_sales ~ PDQ(0, 0, 0) + pdq(d = 0) + holiday_flag)
  )
fc_1 <- forecast(fit_1, new_data = test_data) |>
  fill_gaps()
head(fc_1)

Error message:
Error in `mutate()`:
Caused by error in `check_gaps()`:
! .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.

Any recommendations on how to proceed?

Thanks!

Please provide a reproducible example of the problem. It is hard to provide help when we can't replicate it. If you can't share the actual train_data object, then share some small version of it with the non-missing gross_sales values replaced by random numbers.

At a guess, I would think you need to apply fill_gaps() to the test_data object, not to the forecasts.
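To make that concrete, here is a minimal self-contained sketch (toy data, not the actual series) of what fill_gaps() does to a tsibble: it turns an implicit gap in the index into an explicit row of missing values, which is the shape forecast() expects from its new_data.

```r
library(tsibble)

# Toy daily series with an implicit gap on Jan 3
toy <- tsibble(
  business_date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-04")),
  holiday_flag  = c(0, 0, 1),
  index = business_date
)

# fill_gaps() makes the missing day explicit as an NA row,
# so the series has a complete, regular daily index
toy_filled <- toy |> fill_gaps()
nrow(toy_filled)  # 4 rows: Jan 3 is now an explicit NA row
```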

Hi,
Thank you for the suggestion! I have now applied fill_gaps() to both my test_data and train_data before creating the forecasts, ensuring there is no blank data in the dataset, and it works. While my first model, dynamic regression with holiday and promotion, is working well, I encountered a warning message in sqrt(diag(best$var.coef)): "NaNs produced". Additionally, auto ARIMA is not working and displays an error message.

Could you please help explain these errors:

  1. What is the meaning and the cause of the warning in sqrt(diag(best$var.coef)): "NaNs produced"? Does it affect the forecast results when this message appears?
  2. Why didn't it work with auto ARIMA?
  3. Do you have any other recommended models for forecasting cases that involve combining exogenous regressors, such as holiday_flag, promotion_flag, and economic_issues?
# Load the packages
library(fpp3)
library(googleAuthR)
library(bigrquery)
library(scales)
library(GGally)
library(forecast)
library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)
library(urca)
library(hts)
library(lubridate)
library(tidyr)
library(readxl)

#Get data 
train_data = read_excel("train_data.xlsx")
test_data = read_excel("test_data.xlsx")
public_holiday = read_excel("regressor.xlsx")
head(public_holiday)

#Set business_date as date format
train_data <- train_data |>
              mutate(business_date = as.Date(business_date))
test_data <- test_data |>
              mutate(business_date = as.Date(business_date))
public_holiday <- public_holiday |>
              mutate(business_date = as.Date(business_date))

#List Existing Store
list_existing_stores <- test_data |> 
                        filter(business_date =='2023-12-31')
list_existing_stores <- unique(list_existing_stores$store_code) 
length(list_existing_stores)
head(list_existing_stores)

# Convert character date to tsibble
train_data_ts <- train_data |>
  filter(store_code %in% list_existing_stores) |>
  as_tsibble(
    index = business_date,
    key = store_code
  )
head(train_data_ts)

# Convert character date to tsibble
test_data_ts <- test_data |>
  filter(store_code %in% list_existing_stores) |>
  as_tsibble(
    index = business_date,
    key = store_code
  )
head(test_data_ts)

#Fill the gap/blank data
train_data_1 <- train_data_ts |>
                fill_gaps() |>
                fill(gross_sales, .direction = 'down')
head(train_data_1)

test_data_1 <-  test_data_ts |>
                fill_gaps() |>
                fill(gross_sales, .direction = 'down')
head(test_data_1)

# Convert character date to tsibble for public_holiday
public_holiday_1 <- public_holiday |>
  as_tsibble(index = business_date)
head(public_holiday_1)

#Combine the main data with exogenous regressor(holiday_flag + promotion_flag)
train_data_2 <- inner_join(train_data_1, public_holiday_1, by='business_date')
head(train_data_2)
test_data_2 <- inner_join(test_data_1, public_holiday_1, by='business_date')
head(test_data_2)

# Test and Train Data 
# Define the date ranges for training and test sets
train_start <- as.Date("2022-02-28")
train_end <- as.Date("2023-12-01")  
test_start <- as.Date("2023-11-30")
test_end <- as.Date("2024-01-01")

# Divide the data into training and test sets
train_data_2<- subset(train_data_2,business_date > train_start & business_date < train_end)
test_data_2<- subset(test_data_2,business_date > test_start & business_date < test_end)

# Convert character date to tsibble
train_data_ts2 <- train_data_2 |>
  as_tsibble(
    index = business_date,
    key = store_code
  )
head(train_data_ts2)

# Convert character date to tsibble
test_data_ts2 <- test_data_2 |>
  as_tsibble(
    index = business_date,
    key = store_code
  )
head(test_data_ts2)

# 1. Dynamic regression with promotion and holiday
fit_1 <- train_data_ts2 |>
  model(
    ARIMA(gross_sales ~ PDQ(0, 0, 0) + pdq(d = 0) + holiday_flag + promotion_flag)
  )
fc_1 <- forecast(fit_1, new_data = test_data_ts2)
head(fc_1)

# 2. Auto ARIMA
fit_2 <- train_data_ts2 |>
  model(
    ARIMA(gross_sales ~ holiday_flag + promotion_flag)
  )
fc_2 <- forecast(fit_2, new_data = test_data_ts2)
head(fc_2)

I have attached the code and data: click here

Thank you

Your code had a lot of duplication, and loaded many unnecessary packages. I've re-written it to make it simpler and clearer.

# Load the packages
library(fpp3)

# Get data
train_data <- readxl::read_excel("forecast_data.xlsx", sheet = "train_data")
test_data <- readxl::read_excel("forecast_data.xlsx", sheet = "test_data")
public_holiday <- readxl::read_excel("forecast_data.xlsx", sheet = "regressor")

complete_data <- bind_rows(train_data, test_data) |>
  inner_join(public_holiday, by = "business_date") |>
  mutate(business_date = as.Date(business_date))

# Existing Stores
existing_stores <- complete_data |>
  filter(business_date == max(business_date)) |>
  pull(store_code) |>
  unique()

# Convert to tsibble
complete_data <- complete_data |>
  filter(store_code %in% existing_stores) |>
  as_tsibble(index = business_date, key = store_code) |>
  fill_gaps() |>
  fill(gross_sales, .direction = "down")

# Split back into train and test data
train_data <- complete_data |>
  filter(business_date <= as.Date(max(train_data$business_date)))
test_data <- complete_data |>
  filter(business_date > as.Date(max(train_data$business_date)))

# Fit a dynamic regression model
fit <- train_data |>
  model(
    dynreg = ARIMA(gross_sales ~ holiday_flag + promotion_flag),
    ets = ETS(gross_sales)
  ) |>
  # Add combination of the two models
  mutate(
    comb = (dynreg + ets)/2
  )

# Produce forecasts
fc <- fit |>
  forecast(new_data = test_data)

# Compare model accuracy
accuracy(fc, complete_data) |>
  group_by(.model) |>
  summarise(
    RMSE = sqrt(mean(RMSE^2)),
    MAE = mean(MAE),
    RMSSE = sqrt(mean(RMSSE^2)),
    MASE = mean(MASE)
  ) |>
  arrange(RMSSE)

Note that I also included an ETS model, and a combination (or ensemble) of the dynamic regression and ETS model. While the ETS model is not particularly good, and doesn't take account of the covariates, the combination performs slightly better than the dynamic regression model alone.

You were fitting two dynamic regression models: one where you specified no seasonality, and one where you allowed seasonality. As the data does appear to be seasonal, the first model is not particularly good. The second model worked well.

The reason for the two warnings was that it struggled to fit models to two of the series, so the resulting standard errors were missing. Possibly this was due to having too many missing values at the same time as promotions or public holidays, making it hard to estimate the corresponding coefficients. Or it may have been due to some weird outliers, making the model difficult to estimate. I have not investigated this issue further, and I don't know which two series were causing the problems. You can check which series are producing the worst forecasts using

# Compare model accuracy
accuracy(fc, complete_data) |>
  group_by(store_code) |>
  summarise(
    RMSE = sqrt(mean(RMSE^2)),
    MAE = mean(MAE),
    RMSSE = sqrt(mean(RMSSE^2)),
    MASE = mean(MASE)
  ) |>
  arrange(desc(RMSSE))
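As for the "NaNs produced" warning itself: sqrt(diag(best$var.coef)) is how the underlying stats::arima() machinery turns the estimated coefficient covariance matrix into standard errors, and when a fit is poorly identified a diagonal entry of that matrix can come out negative, so its square root is NaN. A minimal base-R sketch with a made-up covariance matrix:

```r
# Made-up coefficient covariance matrix from a poorly identified fit:
# the second diagonal entry is slightly negative, which can happen when
# the Hessian is not positive definite at the reported optimum
var_coef <- matrix(c(0.04, 0,
                     0, -1e-6), nrow = 2, byrow = TRUE)

std_errs <- sqrt(diag(var_coef))  # Warning: NaNs produced
std_errs  # 0.2 for the first coefficient, NaN for the second
```

The NaN affects only the reported standard error for that coefficient; the point forecasts still come from the coefficient estimates themselves, though the warning does signal that the model was hard to estimate for that series.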

Then try looking at the data for the worst 2 series to see if there are issues there.
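One way to do that, assuming the per-store accuracy summary has been assigned to an object (called acc here): pull the two store codes with the highest RMSSE and plot their sales history, looking for outliers or long runs of missing values.

```r
library(fpp3)

# Two stores with the worst forecast accuracy (highest RMSSE);
# acc is assumed to hold the per-store accuracy summary above
worst_two <- acc |>
  slice_max(RMSSE, n = 2) |>
  pull(store_code)

# Inspect their raw history for outliers or long gaps
complete_data |>
  filter(store_code %in% worst_two) |>
  autoplot(gross_sales)
```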

You could also consider using a hierarchical forecasting approach -- by looking at aggregate sales across all stores, and then reconciling the forecasts of the individual stores with the forecasts of the aggregates. See Chapter 11 Forecasting hierarchical and grouped time series | Forecasting: Principles and Practice (3rd ed) for how to do this.
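A sketch of that approach, following the fabletools workflow from Chapter 11 (object and column names taken from the code earlier in this thread). It fits plain ARIMA models without the regressors, since exogenous variables would need to be defined at every node of the hierarchy as well:

```r
library(fpp3)

# Add the all-stores aggregate as an extra series
sales_hts <- train_data |>
  aggregate_key(store_code, gross_sales = sum(gross_sales))

# Fit a base model to every node, then reconcile so the store-level
# forecasts sum to the forecast of the aggregate
fit_hts <- sales_hts |>
  model(base = ARIMA(gross_sales)) |>
  reconcile(coherent = min_trace(base, method = "mint_shrink"))

fc_hts <- fit_hts |>
  forecast(h = "31 days")
```

min_trace() with "mint_shrink" is one reconciliation choice; bottom_up() and min_trace(method = "ols") are simpler alternatives covered in the same chapter.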
