Which Predictive analytics model to use to predict client volume in R

mlsops · March 14, 2023, 4:55pm

Hello,

I am new to predictive analytics. I am trying to predict client volume (how many clients will enter a particular education program) over the next 5 years based on client volume data that I have for the past 2 years (2021, 2022).

Here are the data I have available - Client ID, Client Start Date (format: 2021-12-26), Client End Date(same format as start date), Education Location, Education Provider (there are 3 education providers), Start fiscal year, Start fiscal quarter

I have done some research and the seasonally naive model was one of the first one that jumped out but I wanted to ask your expert advice on how to go about doing this kind of work. Based on the data I have available, how can I go about doing the predictive analytics and which model would be the best to use?

Thank you

technocrat · March 14, 2023, 11:09pm

Review Hyndman & Athanasopoulos for an introduction to forecast techniques. A forecast extrapolates past data assuming that future patterns will follow those past patterns. A prediction assumes some change to those patterns.

Most forecasts amount to

the near future will be like the recent past only a little more or a little less so

The statistical nature of forecasts of non-deterministic events is that they are subject to random variation making forecasts also subject to confidence intervals that widen as the forecast horizon is extended. As a consequence, depending on the data, there may be surprises such as a lower confidence level becoming negative within the forecast horizon or the upper bound becoming positive in excess of some hard ceiling.

Depending on data and purpose, initial modeling should usually include four baseline models: mean, naive, seasonally naive and random walk. These would be run on a 75/25 split into training/test sets and [pre]selecting a metric, such as RMSE to select a benchmark against which ARIMA and other models will be judged.

mlsops · March 15, 2023, 5:07pm

@technocrat thank you for providing the link. I would also like to predict duration of the program (from client start date to client end date) the problem with the mean data is that since the current durations can range from 2 days to 600 days, I have to take a look at the median data instead of the mean data. I might have to look at the median client volume instead of the mean since Out of the 3 education providers, one of them has had significantly less clients over the past two fiscal years. Based on your expertise would it be possible to look at the median, naive, seasonally naive and random walk?
So in terms of modelling, would you suggest the ARIMA model to be a good place to start? I would really appreciate your feedback. Thank you

mlsops · March 23, 2023, 6:00pm

if someone could please recommend some models to look into, I would really appreciate that. Please let me know if you would like to view a sample of the data

technocrat · March 23, 2023, 7:17pm

Sample data (concrete is easier than abstract sometimes)

mlsops · March 23, 2023, 7:54pm

would aggregate of the data work help? For example having the data in this format:

please let me know and I appreciate the help

technocrat · March 24, 2023, 2:47am

f(x) = y

where

x is the data frame with variables Provider, Start_month, Start_qr and client_volume

y is the

over some forecast horizon.

f is the function or composite function to convert x to y

The first requirement for f is to calculate the duration of each completed program (A,B,C\dots) from the variables contained in x.

x is clearly insufficient to make that calculation because the only date information provided is a variable with an open ended range—there is a start date but no end date.

The purpose of a reprex. See the FAQ is to provide information sufficient to reflect the problem. The data doesn't have to be all the data at hand, all of the variables, or even the same data. Representative data from a built-in data set is fine or even synthetic data. But it must be complete to the extent needed to set up the problem. Here it is not.

mlsops · March 31, 2023, 8:20pm

@technocrat thank you, I have used the reprex function to reproduce my full data. Right now, I am trying to forecast client volume only based on start date. Here is the full aggregated client volume data:

s_data <-tibble::tribble(
  ~provider_name, ~start_month, ~volume,
             "A",   "2021 Jan",    122L,
             "A",   "2021 Feb",    222L,
             "A",   "2021 Mar",    300L,
             "A",   "2021 Apr",    216L,
             "A",   "2021 May",    173L,
             "A",   "2021 Jun",    177L,
             "A",   "2021 Jul",    174L,
             "A",   "2021 Aug",    196L,
             "A",   "2021 Sep",    201L,
             "A",   "2021 Oct",    214L,
             "A",   "2021 Nov",    227L,
             "A",   "2021 Dec",    218L,
             "A",   "2022 Jan",    211L,
             "A",   "2022 Feb",    214L,
             "A",   "2022 Mar",    269L,
             "A",   "2022 Apr",    228L,
             "A",   "2022 May",    288L,
             "A",   "2022 Jun",    319L,
             "A",   "2022 Jul",    296L,
             "A",   "2022 Aug",    323L,
             "A",   "2022 Sep",    320L,
             "A",   "2022 Oct",    331L,
             "A",   "2022 Nov",    295L,
             "A",   "2022 Dec",    237L,
             "A",   "2023 Jan",    325L,
             "A",   "2023 Feb",    298L,
             "B",   "2021 Jan",     52L,
             "B",   "2021 Feb",     74L,
             "B",   "2021 Mar",     69L,
             "B",   "2021 Apr",     46L,
             "B",   "2021 May",     44L,
             "B",   "2021 Jun",     49L,
             "B",   "2021 Jul",     45L,
             "B",   "2021 Aug",     52L,
             "B",   "2021 Sep",     68L,
             "B",   "2021 Oct",     79L,
             "B",   "2021 Nov",     61L,
             "B",   "2021 Dec",     56L,
             "B",   "2022 Jan",     64L,
             "B",   "2022 Feb",     58L,
             "B",   "2022 Mar",     75L,
             "B",   "2022 Apr",     47L,
             "B",   "2022 May",     63L,
             "B",   "2022 Jun",     78L,
             "B",   "2022 Jul",     64L,
             "B",   "2022 Aug",     93L,
             "B",   "2022 Sep",     68L,
             "B",   "2022 Oct",     54L,
             "B",   "2022 Nov",     78L,
             "B",   "2022 Dec",     55L,
             "B",   "2023 Jan",    106L,
             "B",   "2023 Feb",     85L,
             "C",   "2021 Jan",    135L,
             "C",   "2021 Feb",    231L,
             "C",   "2021 Mar",    278L,
             "C",   "2021 Apr",    246L,
             "C",   "2021 May",    289L,
             "C",   "2021 Jun",    404L,
             "C",   "2021 Jul",    304L,
             "C",   "2021 Aug",    288L,
             "C",   "2021 Sep",    270L,
             "C",   "2021 Oct",    298L,
             "C",   "2021 Nov",    334L,
             "C",   "2021 Dec",    274L,
             "C",   "2022 Jan",    362L,
             "C",   "2022 Feb",    278L,
             "C",   "2022 Mar",    315L,
             "C",   "2022 Apr",    310L,
             "C",   "2022 May",    328L,
             "C",   "2022 Jun",    413L,
             "C",   "2022 Jul",    369L,
             "C",   "2022 Aug",    430L,
             "C",   "2022 Sep",    442L,
             "C",   "2022 Oct",    429L,
             "C",   "2022 Nov",    439L,
             "C",   "2022 Dec",    336L,
             "C",   "2023 Jan",    413L,
             "C",   "2023 Feb",    383L
  )

I have 3 providers and I am trying to forecast the client volume for the next 1 year. I have used these forecast model and used this function to forecast client volume for the next 1 year, however, I am interested in knowing how you would forecast the data since I am having doubts about the accuracy of the model and I am getting errors about missing data:

data_stretch <- s_data %>%
  stretch_tsibble(.init = 12, .step = 1) %>%
  filter(.id !=max(.id))


fit <- data_stretch %>%
  model(
    #seasonal_naive = SNAIVE(volume),
    naive_a = NAIVE(volume),
    drift_a = RW(volume ~ drift()),
    #rw_dr = RW(log(volume) ~ drift()),
    #mean_a = MEAN(volume),
    tslm_a = TSLM(volume ~ trend()), 
    ets = ETS(volume),
    arima = ARIMA(volume)
  )  %>%
  forecast(h = "1 year")

fit %>%
  accuracy(s_data)

technocrat · March 31, 2023, 8:48pm

This is a good reprex except s_data should be a tsibble rather than a tibble. You would have converted s_data$start_month from typeof character to typeof some time type. And you cut and paste the output of dput(s_data) so that there's the tsibble to continue following along.

mlsops · March 31, 2023, 9:06pm

@technocrat can you tell me how I can convert it in reprex? because when I look at the tibble in R for s_data, start_month is shown as mth:

technocrat · March 31, 2023, 11:19pm

At some point s_data must have been a tsibble object, because otherwise stretch_tsibble() would have failed like this

library(tsibble)
#> 
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, union
s_data <-tibble::tribble(
  ~provider_name, ~start_month, ~volume,
  "A",   "2021 Jan",    122,
  "A",   "2021 Feb",    222,
  "A",   "2021 Mar",    300,
  "A",   "2021 Apr",    216,
  "A",   "2021 May",    173,
  "A",   "2021 Jun",    177,
  "A",   "2021 Jul",    174,
  "A",   "2021 Aug",    196,
  "A",   "2021 Sep",    201,
  "A",   "2021 Oct",    214,
  "A",   "2021 Nov",    227,
  "A",   "2021 Dec",    218,
  "A",   "2022 Jan",    211,
  "A",   "2022 Feb",    214,
  "A",   "2022 Mar",    269,
  "A",   "2022 Apr",    228,
  "A",   "2022 May",    288,
  "A",   "2022 Jun",    319,
  "A",   "2022 Jul",    296,
  "A",   "2022 Aug",    323,
  "A",   "2022 Sep",    320,
  "A",   "2022 Oct",    331,
  "A",   "2022 Nov",    295,
  "A",   "2022 Dec",    237,
  "A",   "2023 Jan",    325,
  "A",   "2023 Feb",    298,
  "B",   "2021 Jan",     52,
  "B",   "2021 Feb",     74,
  "B",   "2021 Mar",     69,
  "B",   "2021 Apr",     46,
  "B",   "2021 May",     44,
  "B",   "2021 Jun",     49,
  "B",   "2021 Jul",     45,
  "B",   "2021 Aug",     52,
  "B",   "2021 Sep",     68,
  "B",   "2021 Oct",     79,
  "B",   "2021 Nov",     61,
  "B",   "2021 Dec",     56,
  "B",   "2022 Jan",     64,
  "B",   "2022 Feb",     58,
  "B",   "2022 Mar",     75,
  "B",   "2022 Apr",     47,
  "B",   "2022 May",     63,
  "B",   "2022 Jun",     78,
  "B",   "2022 Jul",     64,
  "B",   "2022 Aug",     93,
  "B",   "2022 Sep",     68,
  "B",   "2022 Oct",     54,
  "B",   "2022 Nov",     78,
  "B",   "2022 Dec",     55,
  "B",   "2023 Jan",    106,
  "B",   "2023 Feb",     85,
  "C",   "2021 Jan",    135,
  "C",   "2021 Feb",    231,
  "C",   "2021 Mar",    278,
  "C",   "2021 Apr",    246,
  "C",   "2021 May",    289,
  "C",   "2021 Jun",    404,
  "C",   "2021 Jul",    304,
  "C",   "2021 Aug",    288,
  "C",   "2021 Sep",    270,
  "C",   "2021 Oct",    298,
  "C",   "2021 Nov",    334,
  "C",   "2021 Dec",    274,
  "C",   "2022 Jan",    362,
  "C",   "2022 Feb",    278,
  "C",   "2022 Mar",    315,
  "C",   "2022 Apr",    310,
  "C",   "2022 May",    328,
  "C",   "2022 Jun",    413,
  "C",   "2022 Jul",    369,
  "C",   "2022 Aug",    430,
  "C",   "2022 Sep",    442,
  "C",   "2022 Oct",    429,
  "C",   "2022 Nov",    439,
  "C",   "2022 Dec",    336,
  "C",   "2023 Jan",    413,
  "C",   "2023 Feb",    383
)


s_data |> tsibble::stretch_tsibble(.init = 12, .step = 2) 
#> Error in UseMethod("key_data"): no applicable method for 'key_data' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

^{Created on 2023-03-31 with reprex v2.0.2}

Run the following command

dput(s_data)

and cut-and-paste the console output.

mlsops · April 1, 2023, 10:02pm

@technocrat thank you so much.

I ran this code below first to turn it into a tsibble so that it is easier for you to copy and paste, please let me know if this works for you

s_data %>%
  mutate(start_month = tsibble::yearmonth(start_month)) %>%
  as_tsibble(index = start_month,
             key = provider_name)

> dput(s_data)
structure(list(provider_name = c("A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", 
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", 
"C", "C", "C", "C", "C", "C", "C"), start_month = c("2021 Jan", 
"2021 Feb", "2021 Mar", "2021 Apr", "2021 May", "2021 Jun", "2021 Jul", 
"2021 Aug", "2021 Sep", "2021 Oct", "2021 Nov", "2021 Dec", "2022 Jan", 
"2022 Feb", "2022 Mar", "2022 Apr", "2022 May", "2022 Jun", "2022 Jul", 
"2022 Aug", "2022 Sep", "2022 Oct", "2022 Nov", "2022 Dec", "2023 Jan", 
"2023 Feb", "2021 Jan", "2021 Feb", "2021 Mar", "2021 Apr", "2021 May", 
"2021 Jun", "2021 Jul", "2021 Aug", "2021 Sep", "2021 Oct", "2021 Nov", 
"2021 Dec", "2022 Jan", "2022 Feb", "2022 Mar", "2022 Apr", "2022 May", 
"2022 Jun", "2022 Jul", "2022 Aug", "2022 Sep", "2022 Oct", "2022 Nov", 
"2022 Dec", "2023 Jan", "2023 Feb", "2021 Jan", "2021 Feb", "2021 Mar", 
"2021 Apr", "2021 May", "2021 Jun", "2021 Jul", "2021 Aug", "2021 Sep", 
"2021 Oct", "2021 Nov", "2021 Dec", "2022 Jan", "2022 Feb", "2022 Mar", 
"2022 Apr", "2022 May", "2022 Jun", "2022 Jul", "2022 Aug", "2022 Sep", 
"2022 Oct", "2022 Nov", "2022 Dec", "2023 Jan", "2023 Feb"), 
    volume = c(122L, 222L, 300L, 216L, 173L, 177L, 174L, 196L, 
    201L, 214L, 227L, 218L, 211L, 214L, 269L, 228L, 288L, 319L, 
    296L, 323L, 320L, 331L, 295L, 237L, 325L, 298L, 52L, 74L, 
    69L, 46L, 44L, 49L, 45L, 52L, 68L, 79L, 61L, 56L, 64L, 58L, 
    75L, 47L, 63L, 78L, 64L, 93L, 68L, 54L, 78L, 55L, 106L, 85L, 
    135L, 231L, 278L, 246L, 289L, 404L, 304L, 288L, 270L, 298L, 
    334L, 274L, 362L, 278L, 315L, 310L, 328L, 413L, 369L, 430L, 
    442L, 429L, 439L, 336L, 413L, 383L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -78L))

mlsops · April 12, 2023, 10:32pm

wondering if anyone could help me with this analysis, I would appreciate the help

nirgrahamuk · April 13, 2023, 8:08am

some googling around tells me quantreg package has function dynrq() for quantile regression over time-series

system · May 25, 2023, 8:08am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.