Plotting Multiple Models in R

  • Suppose there is a farm in which there are two variables: Crop Yield and Fertilizer Cost
  • I am interesting in making a time series model to model the relationship between Crop Yield and Fertilizer Cost (I.e how much Fertilizer I choose to buy, assuming I fix the amount of money I am willing to spend on Fertilizer as constant)
  • Specifically, I believe that Crop Yield is influenced by the (lag-1) and (lag-2) values of Crop Yield and Fertilizer Cost.
  • After doing some research, it seems to me that an ARIMAX model is a good choice for this problem.
  • In the end, I want to answer the following questions:
    • What is the (estimated) impact of each extra dollar of fertilizer cost on crop yield?
    • Suppose I have 3 options over the next 12 months : Invest 1000 dollars each month on fertilizer, Invest 2000 dollars each month on fertilizer, Invest 3000$ each month on fertilizer. Which of these options will result in the largest (estimated) cumulative crop yield over 12 months?

I tried to express this in mathematical terms:

Let Y_t represent the crop yield at time t , and let F_t represent the fertilizer cost at time t . We assume that the price of fertilizer remains constant, so variations in F_t reflect the quantity of fertilizer purchased.

The crop yield Y_t is influenced by:

  • The crop yield at lag-1 ( Y_{t-1} ) and lag-2 ( Y_{t-2} ),
  • The fertilizer cost at lag-1 ( F_{t-1} ) and lag-2 ( F_{t-2} ).

The relationship between crop yield and fertilizer cost can be modeled using an ARIMAX model as follows:

Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \beta_1 F_{t-1} + \beta_2 F_{t-2} + \epsilon_t

where: \alpha_0 is the intercept term. \alpha_1, \alpha_2 are the coefficients for the lagged crop yields, \beta_1, \beta_2 are the coefficients for the lagged fertilizer costs, \epsilon_t is the error term.

Thus, it seems to me that \beta would represent the estimated impact of fertilizer on crop yield ... and I could add the estimated 12 month forecasts for each option to find out the net benefit.

I tried to make an R simulation to represent my problem.

First I simulate data:

library(forecast)
library(ggplot2)
library(dplyr)
library(tsibble)

set.seed(123)
n <- 100
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = n)
fertilizer <- runif(n, 500, 1500)
seasonal_effect <- sin(2 * pi * (1:n) / 12) * 200
trend <- seq(0, 500, length.out = n)
crop_yield <- 0.5 * fertilizer + seasonal_effect + trend + arima.sim(list(ar = c(0.7, 0.2)), n) + rnorm(n, 0, 50)

data <- tibble(
    date = dates,
    crop_yield = crop_yield,
    fertilizer = fertilizer
) %>%
    as_tsibble(index = date)

Then, I fit the model:

model <- auto.arima(data$crop_yield, xreg = cbind(data$fertilizer))

From there, I prepared the data for the forecasting:

future_dates <- seq(from = max(dates), by = "month", length.out = 13)

generate_forecast <- function(cost) {
    future_fertilizer <- c(tail(data$fertilizer, 1), rep(cost, 12))
    forecast_result <- forecast(model, xreg = future_fertilizer, h = 13)
    
    list(
        date = future_dates,
        crop_yield = as.numeric(forecast_result$mean),
        fertilizer = future_fertilizer,
        lower = as.numeric(forecast_result$lower[, 2]),
        upper = as.numeric(forecast_result$upper[, 2]),
        scenario = paste0("Fertilizer $", cost)
    )
}

Then, I fit the forecasts and plotted them:

forecast_1000 <- generate_forecast(1000)
forecast_2000 <- generate_forecast(2000)
forecast_3000 <- generate_forecast(3000)

ggplot() +
    geom_line(data = data, aes(x = date, y = crop_yield, color = "Historical")) +
    geom_point(data = data, aes(x = date, y = crop_yield, color = "Historical")) +
    geom_line(data = tibble(date = forecast_1000$date, crop_yield = forecast_1000$crop_yield), aes(x = date, y = crop_yield, color = "Fertilizer $1000")) +
    geom_line(data = tibble(date = forecast_2000$date, crop_yield = forecast_2000$crop_yield), aes(x = date, y = crop_yield, color = "Fertilizer $2000")) +
    geom_line(data = tibble(date = forecast_3000$date, crop_yield = forecast_3000$crop_yield), aes(x = date, y = crop_yield, color = "Fertilizer $3000")) +
    geom_ribbon(data = tibble(date = forecast_1000$date, ymin = forecast_1000$lower, ymax = forecast_1000$upper), aes(x = date, ymin = ymin, ymax = ymax, fill = "Fertilizer $1000"), alpha = 0.2) +
    geom_ribbon(data = tibble(date = forecast_2000$date, ymin = forecast_2000$lower, ymax = forecast_2000$upper), aes(x = date, ymin = ymin, ymax = ymax, fill = "Fertilizer $2000"), alpha = 0.2) +
    geom_ribbon(data = tibble(date = forecast_3000$date, ymin = forecast_3000$lower, ymax = forecast_3000$upper), aes(x = date, ymin = ymin, ymax = ymax, fill = "Fertilizer $3000"), alpha = 0.2) +
    scale_color_manual(values = c("Historical" = "black", 
                                  "Fertilizer $1000" = "blue", 
                                  "Fertilizer $2000" = "red", 
                                  "Fertilizer $3000" = "green")) +
    scale_fill_manual(values = c("Fertilizer $1000" = "blue", 
                                 "Fertilizer $2000" = "red", 
                                 "Fertilizer $3000" = "green")) +
    labs(title = "Crop Yield Forecast - Fertilizer Cost Scenarios",
         x = "Date",
         y = "Crop Yield",
         color = "Scenario",
         fill = "Forecast Interval") +
    theme_minimal() +
    theme(legend.position = "bottom")

Does my coding make sense?