I am getting started with comparing time series forecasting model performance using tidymodels and have limited experience so please excuse my ignorance.

In my job I need to provide time series forecasts for hundreds of thousands of products (prescription drug claims) to estimate what the cost will be to the business for the coming year.

I wish to compare the performance of a variety of models namely Prophet, ARIMA, and TSLM to name a few. Any of the examples I've seen or read have compared the performance of each of these models on a single time series.

Given the large number of products I need to forecast, I assume I will need to compare the performance of a large sample, say 1000 time series across each of the different models. In this case do I just get average values for the performance metrics for each model and compare or is there a more scientific approach?

A time series model attempts to understand and fit the data-generating process of your series in the best possible way it knows. Then it uses this knowledge for forecasting future values. This is the reason why the examples you came across used several models on a single time series. The authors of these examples were attempting to determine which model understood the particular series the best because they wanted to use it for determining more trustworthy forecasts.

In your case, you have thousands of time series. There are several ways to go about it:

if there is a model you trust (for good reasons), you may want to use it on all your series

you may want test the performance of your models on a couple dozen series and if there is a model that stands out, you may use it (this option makes a huge assumption... that all the series are qualitatively not too different from one another)

run all the models on all series and determine, which model works best for each series, then use that model to perform the forecasting

run all the models on all series and compute the average of all the forecasts obtained.

Is that a single series of 10^5 claims of all types of drugs, 10^5 series of all types of drugs, 10^5 series of claims by drug? Are there exogenous variables, such as price to the claimant, tiering, cost to the company? Is the objective to forecast costs given a mix or to forecast mix given the cost?

@technocrat At the moment, the data consists of daily claim counts for > 100K products over 3 years i.e. three columns consisting of Date, Product and sum of prescription count. I have access to additional predictive features such as tiering info, out of pocket amounts, prior authorization details, drug acquisition costs etc. The object is to forecast > 6 months the expected prescription count for each product. I could eventually include additional predictor variables but for now the plan is to keep it to a univariate prediction for the short term.

If I were ever tasked with this task, here's how I'd approach it after the tranquilizers kicked in.

Measure twice, cut once. In this context that means thorough data cleansing and data preservation. I'd approach this through checking in advance that all series are:

a. Numeric
b. Have the same length
c. Have the same start date and end date (or set up process for handling differing time periods)
d. Have no Nulls, NAs, negative values
e. Have no missing values (else, decide on imputation)
f. Have reasonable valuesâideally, a separate source for monthly, quarterly or annual values to check against. This will avoid data transcription errors such as 1 for 100 or 10000 for 100.

Once cleansed, consider storing it in a SQL or other database with audit capability. This may also help with the next consideration. Also, see the archivist package

2. Build to scale. If it takes 10 seconds to run one model, it will take about 28 hours to run 10^5. Some models take longer. Consider error-trapping routines to shuttle off problem series for exception handling. Consider a containerize and dispatch approach and design to that. Think about how to preserve the model output for diagnostics and judgments about outcome.

Think functionally: f(x) = y. The argument x should be named identically and the return value y similarly. Differentiating the series and results should be encapsulated completely through the job id. Parameterize everything within the model.

Consider the fpp3 package infrastructure,E which has facilities for batch processing models specifically geared to time series.

Consider whether to use time series count methods. See Time series count data. I haven't made my own mind up about the necessity for this.

Exploratory data analysis with a random sample of say, 20 series.

a. Aggregate the daily data to weekly or monthly, because a forecast horizon of six months is h=180 for daily, which will show unusably wide prediction intervals, possibly showing negative forecasts. Weekly, h=26 is a push, but monthly, h=6 should be tractable.
b. Look at 2020 and pre-2020 separately because of the possibility of a cyclic trend last year.
c. For each series: plot the series, plot the ACF and test for stationarity.
d. Run base naive, mean, random walk with drift and seasonal naive models.
e. Choose a metric, e.g.,RMSE or MAE.
f. Divide the series into upto 200 training set observations and run predict with h=6.
g. Test the residuals for autocorrleation of the residuals; consider differencing or transformation if required.
h. See which model is superior. Assess whether it is consistently superior. If it varies by series, decide on a single model as the benchmark. The ARIMA and other models will have to beat this bogey.