Hi, I am relatively new to RStudio and I am a data analyst working with an apprenticeship company. Does anyone know of a good model I could use to forecast apprentice retention? Thank you!
Depends on the variables you are working with (continuous, categorical, binary) , the number of previous outcomes available and how outcome is measured (say no. months, finished or not, etc.)
so the binary variable is retained or not, there is about 2 years worth of historical data to use for modelling and there are factors that affect retention such as age, salary, industry etc.
OK.
If these were not time series data, the obvious model is logistic regression, because the response (dependent) variable is binary. In simplest form
fit <- glm(retention ~ ., data = mydata, family = "binomial"
And this may work, provided that there is no autocorrelation (imagine apprentices quitting each winter to become ski bums, for example). In that case, you'd benefit from reviewing this text, on time series in general and, in particular, on dealing with autocorrelation.
If the data permit, expressing the response variable as continuous, such as months enrolled, would allow the use of the TSLM()
model discussed in the text. If not, the logistic case for time varying data can be handled in various advanced packages. Many of these are described as "panel data" models, which is how the econometricians characterize them.
Here's a worked example using your data with some of the tools described in the Hyndman text. It's a useful template, beginning with an initial visualization, look at how the three provider series differ, and for one of them, A
, some more detailed work, including a comparison of years, a look at autocorrelation, decomposition into seasonal, trend and residual components, division into training and test sets, development of baseline models, a forecast, evaluation and diagnostics. Then one of the many more advanced models, ARIMA, is selected to see if it can do better.
Going through the text following the example with your own data will help understand all the pieces, and if you get stuck, come back in a new thread with a reprex
like I've used and questions.
Despite the equations, the core of forecasting is that the future looks like the past (until it doesn't, because something in the processes that created the historical data has changed).
library(fpp3)
#> ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
#> ✔ tibble 3.2.1 ✔ tsibble 1.1.3
#> ✔ dplyr 1.1.1 ✔ tsibbledata 0.4.1
#> ✔ tidyr 1.3.0 ✔ feasts 0.3.1
#> ✔ lubridate 1.9.2 ✔ fable 0.3.3
#> ✔ ggplot2 3.4.1 ✔ fabletools 0.3.2
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> ✖ lubridate::date() masks base::date()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tsibble::intersect() masks base::intersect()
#> ✖ tsibble::interval() masks lubridate::interval()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tsibble::setdiff() masks base::setdiff()
#> ✖ tsibble::union() masks base::union()
s_data <- structure(list(
provider_name = c(
"A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C"
), start_month = c(
"2021 Jan",
"2021 Feb", "2021 Mar", "2021 Apr", "2021 May", "2021 Jun", "2021 Jul",
"2021 Aug", "2021 Sep", "2021 Oct", "2021 Nov", "2021 Dec", "2022 Jan",
"2022 Feb", "2022 Mar", "2022 Apr", "2022 May", "2022 Jun", "2022 Jul",
"2022 Aug", "2022 Sep", "2022 Oct", "2022 Nov", "2022 Dec", "2023 Jan",
"2023 Feb", "2021 Jan", "2021 Feb", "2021 Mar", "2021 Apr", "2021 May",
"2021 Jun", "2021 Jul", "2021 Aug", "2021 Sep", "2021 Oct", "2021 Nov",
"2021 Dec", "2022 Jan", "2022 Feb", "2022 Mar", "2022 Apr", "2022 May",
"2022 Jun", "2022 Jul", "2022 Aug", "2022 Sep", "2022 Oct", "2022 Nov",
"2022 Dec", "2023 Jan", "2023 Feb", "2021 Jan", "2021 Feb", "2021 Mar",
"2021 Apr", "2021 May", "2021 Jun", "2021 Jul", "2021 Aug", "2021 Sep",
"2021 Oct", "2021 Nov", "2021 Dec", "2022 Jan", "2022 Feb", "2022 Mar",
"2022 Apr", "2022 May", "2022 Jun", "2022 Jul", "2022 Aug", "2022 Sep",
"2022 Oct", "2022 Nov", "2022 Dec", "2023 Jan", "2023 Feb"
),
volume = c(
122L, 222L, 300L, 216L, 173L, 177L, 174L, 196L,
201L, 214L, 227L, 218L, 211L, 214L, 269L, 228L, 288L, 319L,
296L, 323L, 320L, 331L, 295L, 237L, 325L, 298L, 52L, 74L,
69L, 46L, 44L, 49L, 45L, 52L, 68L, 79L, 61L, 56L, 64L, 58L,
75L, 47L, 63L, 78L, 64L, 93L, 68L, 54L, 78L, 55L, 106L, 85L,
135L, 231L, 278L, 246L, 289L, 404L, 304L, 288L, 270L, 298L,
334L, 274L, 362L, 278L, 315L, 310L, 328L, 413L, 369L, 430L,
442L, 429L, 439L, 336L, 413L, 383L
)
), class = c(
"tbl_df",
"tbl", "data.frame"
), row.names = c(NA, -78L))
ts_data <- s_data %>%
mutate(start_month = tsibble::yearmonth(start_month)) %>%
as_tsibble(index = start_month,
key = provider_name)
# first step is always to eyeball the data
autoplot(ts_data) + theme_minimal()
#> Plot variable not specified, automatically selected `.vars = volume`
ts_data |>
pivot_wider(values_from=volume, names_from=provider_name) |>
GGally::ggpairs(columns = 2:4)
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
# pick provider A to begin
tsA <- ts_data %>% filter(provider_name == "A")
tsA$month_no <- 1:26
autoplot(tsA) + theme_minimal()
#> Plot variable not specified, automatically selected `.vars = volume`
tsA |>
gg_tsdisplay(difference(volume, 12),
plot_type='partial', lag=12) +
labs(title="Seasonally differenced", y="")
#> Warning: Removed 12 rows containing missing values (`geom_line()`).
#> Warning: Removed 12 rows containing missing values (`geom_point()`).
# comparison of years
tsA %>% gg_season(volume) + theme_minimal()
# lags
tsA |> gg_lag(volume, geom = "point") +
labs(x = "lag(volume, k)") +
theme_minimal()
# autocorrelation
tsA %>%
ACF(volume) %>%
autoplot() + labs(title="Autocorrelation analysis") +
theme_minimal()
# decomposition
tsA |>
model(
classical_decomposition(volume, type = "additive")
) |>
components() |>
autoplot() +
labs(title = "Classical additive decomposition") +
theme_minimal()
#> Warning: Removed 6 rows containing missing values (`geom_line()`).
# STL decomposition
tsA |>
model(
STL(volume ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) |>
components() |>
autoplot() ;+
theme_minimal()
#> Error in `e1 %+% e2`:
#> ! Cannot use `+` with a single argument
#> ℹ Did you accidentally put `+` on a new line?
#> Backtrace:
#> ▆
#> 1. └─GGally:::`+.gg`(theme_minimal())
#> 2. └─e1 %+% e2
#> 3. └─cli::cli_abort(...)
#> 4. └─rlang::abort(...)
# example features summary
tsA |> features(volume, quantile)
#> # A tibble: 1 × 6
#> provider_name `0%` `25%` `50%` `75%` `100%`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 122 212. 228. 298. 331
# baseline models
# Set training data from 2021 Jan to 2022 Oct
train <- tsA|> filter_index("2021 Jan" ~ "2022 Oct")
# test data from 2022 Nov to 2023 Feb
test <- tsA|> filter_index("2022 Nov" ~ "2023 Feb")
# Fit the models
fit <- train |>
model(
Mean = MEAN(volume),
`Naïve` = NAIVE(volume),
`Seasonal naïve` = SNAIVE(volume)
)
# Generate forecasts for 4 months
fc <- fit |> forecast(h = 4)
# Plot forecasts against actual values
fc |>
autoplot(train, level = NULL) +
autolayer(
filter_index(tsA, "2023 Feb" ~ .),
colour = "black"
) +
labs(
y = "volume",
title = "Forecasts for volume"
) +
guides(colour = guide_legend(title = "Forecast")) +
theme_minimal()
#> Plot variable not specified, automatically selected `.vars = volume`
#> `geom_line()`: Each group consists of only one observation.
#> ℹ Do you need to adjust the group aesthetic?
augment(fit)
#> # A tsibble: 66 x 7 [1M]
#> # Key: provider_name, .model [3]
#> provider_name .model start_month volume .fitted .resid .innov
#> <chr> <chr> <mth> <int> <dbl> <dbl> <dbl>
#> 1 A Mean 2021 Jan 122 238. -116. -116.
#> 2 A Mean 2021 Feb 222 238. -16.1 -16.1
#> 3 A Mean 2021 Mar 300 238. 61.9 61.9
#> 4 A Mean 2021 Apr 216 238. -22.1 -22.1
#> 5 A Mean 2021 May 173 238. -65.1 -65.1
#> 6 A Mean 2021 Jun 177 238. -61.1 -61.1
#> 7 A Mean 2021 Jul 174 238. -64.1 -64.1
#> 8 A Mean 2021 Aug 196 238. -42.1 -42.1
#> 9 A Mean 2021 Sep 201 238. -37.1 -37.1
#> 10 A Mean 2021 Oct 214 238. -24.1 -24.1
#> # ℹ 56 more rows
# the naive model tests best on more measures
accuracy(fc, test)[c(1,4:8)]
#> # A tibble: 3 × 6
#> .model ME RMSE MAE MPE MAPE
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mean 50.6 59.9 51.2 16.4 16.6
#> 2 Naïve -42.2 53.0 42.2 -16.2 16.2
#> 3 Seasonal naïve 71.2 79.1 71.2 23.6 23.6
fit <- train |> model(naif = NAIVE(volume))
report(fit)
#> Series: volume
#> Model: NAIVE
#>
#> sigma^2: 1728.7476
# fit some ARIMA models
fit <- tsA |>
model(arima210 = ARIMA(volume ~ pdq(2,1,0)),
arima013 = ARIMA(volume ~ pdq(0,1,3)),
stepwise = ARIMA(volume),
search = ARIMA(volume, stepwise=FALSE))
# diagnostics
glance(fit) |> arrange(AICc) |> select(.model:BIC)
#> # A tibble: 4 × 6
#> .model sigma2 log_lik AIC AICc BIC
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 stepwise 1992. -130. 263. 263. 264.
#> 2 search 1992. -130. 263. 263. 264.
#> 3 arima210 1771. -128. 263. 264. 266.
#> 4 arima013 1758. -128. 263. 265. 268.
fit |>
select(search) |>
gg_tsresiduals()
augment(fit) |>
filter(.model=='search') |>
features(.innov, ljung_box, lag = 10, dof = 3)
#> # A tibble: 1 × 4
#> provider_name .model lb_stat lb_pvalue
#> <chr> <chr> <dbl> <dbl>
#> 1 A search 5.78 0.566
fit |>
forecast(h=4) |>
filter(.model=='search') |>
autoplot(tsA) +
theme_minimal()
Created on 2023-04-03 with reprex v2.0.2
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
If you have a query related to it or one of the replies, start a new topic and refer back with a link.