Create a regression model from a time series dataset

Hi everyone!

I have a basic time series dataset named "lynx", which is included in R. This dataset shows the number of catches of lynxes per year, over a period of 114 years.

Although this is a time series, my teacher has asked me to use this dataset to build a regression model that predicts the number of catches in a given year from only the catches of the previous two or three years.

Unfortunately, I have no idea how to proceed. I do not even know how to process the data so that I can apply, for example, a basic linear regression model to it.
Can someone tell me how to proceed?

Thank you so much.

Your professor might mean a super-simple regression model, in which case your model's formula would look like this:

L_t = \beta_0 + \beta_1 L_{t-1} + \beta_2 L_{t - 2}

where L_t is the number of lynx caught in year t.
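
As a minimal sketch of that formula, assuming the built-in `lynx` series and base R only, `embed()` can line up each year with its two predecessors so that `lm()` can estimate the coefficients. The names `lagged`, `L_t1`, `L_t2`, and `fit_ar2` are just illustrative choices, not anything your assignment requires.

```r
# Each row of embed(x, 3) holds (L_t, L_{t-1}, L_{t-2}),
# so the first two years are dropped (114 - 2 = 112 rows).
lagged <- as.data.frame(embed(as.numeric(lynx), 3))
names(lagged) <- c("L_t", "L_t1", "L_t2")

# Ordinary least squares fit of L_t on the two previous years.
fit_ar2 <- lm(L_t ~ L_t1 + L_t2, data = lagged)
coef(fit_ar2)  # estimates of beta_0, beta_1, beta_2
```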

But if you want to do a "proper" regression model by accounting for autocorrelation, use the arima() function from the stats package. Rob J. Hyndman, author of the popular time-series modeling package forecast, coauthored an online guide for time series analysis and forecasting. There's a section on autoregressive models.
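
For illustration only, here is roughly what the `arima()` route could look like on the built-in `lynx` series. The choice of two autoregressive lags (`order = c(2, 0, 0)`) is my assumption to match the formula above, not something the data forces on you, and `fit_arima` is just a placeholder name.

```r
# AR(2) fit: order = c(p, d, q) with p = 2 lags, no differencing, no MA terms.
fit_arima <- arima(lynx, order = c(2, 0, 0))
fit_arima

# Forecast the three years after the series ends (1935 to 1937).
predict(fit_arima, n.ahead = 3)$pred
```

Note that the value `arima()` labels "intercept" is the estimated mean of the series, so it is not directly comparable to \beta_0 from `lm()`.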

Because this is for an assignment, remember that "proper" here means "Best practice for the real world." Your teacher probably considers it to mean "only what I asked for."


The "Model" chapter of R4DS walks through some examples of setting up a regression and exploring it.

For your particular problem, I'd suggest two steps:

  1. Transform your data set to get the variables you want. In this case, you may want variables that show a few of the prior years' values.
library(dplyr) # provides %>%, mutate(), and the vector version of lag()

fake_lynx_data <-
  data.frame(year = 1900:2017,
             lynx = runif(118, 500, 1000)) # 118 years of random fake counts

# New data frame with more variables we can refer to later
fake_lynx_data_addl_vars <-
  fake_lynx_data %>%
  mutate(lynx_1yr_prior = lag(lynx, 1),
         lynx_2yr_prior = lag(lynx, 2),
         lynx_3yr_prior = lag(lynx, 3))
  2. Fit a model to those variables. Since I used totally random data, this model doesn't predict well:
# This uses the 'lm' function to fit a linear model that predicts 'lynx' from
# separate (additive) terms for each of the 3 prior years' counts.
lm(lynx ~ lynx_1yr_prior + lynx_2yr_prior + lynx_3yr_prior,
   data = fake_lynx_data_addl_vars)

# Because each year's count partly depends on the counts in prior years, you
# could also try a model that includes the interactions between those prior counts.
lm(lynx ~ lynx_1yr_prior * lynx_2yr_prior * lynx_3yr_prior,
   data = fake_lynx_data_addl_vars)
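
If it helps, the same two steps might look like this on the real `lynx` data instead of fake counts. This is only a sketch: it assumes the dplyr package is installed, uses two lags rather than three, and the names `lynx_df`, `lynx_lagged`, and `fit_real` are made up for the example. The built-in `lynx` object is a `ts`, so it is converted to a plain data frame first because `dplyr::lag()` works on ordinary vectors.

```r
library(dplyr)

# Convert the built-in ts object into a plain data frame (years 1821 to 1934).
lynx_df <- data.frame(year = as.numeric(time(lynx)),
                      lynx = as.numeric(lynx))

# Step 1: add the prior years' counts as new columns.
lynx_lagged <- lynx_df %>%
  mutate(lynx_1yr_prior = lag(lynx, 1),
         lynx_2yr_prior = lag(lynx, 2))

# Step 2: fit the model. lm() drops the first two rows, whose lags are NA.
fit_real <- lm(lynx ~ lynx_1yr_prior + lynx_2yr_prior, data = lynx_lagged)
summary(fit_real)

# Predict 1935 from the last two observed years (1934 and 1933).
n <- nrow(lynx_df)
predict(fit_real,
        newdata = data.frame(lynx_1yr_prior = lynx_df$lynx[n],
                             lynx_2yr_prior = lynx_df$lynx[n - 1]))
```

`summary()` reports the coefficients and R-squared, which is where you would see how well the prior years explain the current count.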

Thank you so much :wink:

Thank you so much too :wink: