I've been working on a project modelling and forecasting time series data. I've found a great new set of tools in tsibble/fable/feasts, which make it far easier to write code that generates and compares multiple models.
What I'm finding, though, is that model processing time is now an even more significant bottleneck: I've scaled up my code, but not my actual processing power.
A few methods seem immediately useful, but at the cost of model detail: aggregating on the time index (processing daily data is cheaper than half-hourly) and shortening the training set (modelling one year of data is cheaper than two). These are compromises that may invalidate the goals of the model, though (I don't want a daily model, etc.).
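For concreteness, here is a minimal sketch of both compromises, assuming `my_data` is a tsibble with a half-hourly datetime index. The column name `timestamp` and the aggregation functions are placeholders, not part of my actual code:

```r
library(dplyr)
library(tsibble)
library(lubridate)

# Compromise 1: aggregate the half-hourly index up to daily before modelling.
# sum()/mean() are illustrative only; use whatever aggregation suits the data.
daily_data <- my_data %>%
  index_by(date = as_date(timestamp)) %>%
  summarise(
    my_value = sum(my_value),
    my_variable = mean(my_variable)
  )

# Compromise 2: keep the half-hourly resolution but train on the last year only.
recent_data <- my_data %>%
  filter(timestamp >= max(timestamp) - years(1))
```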
I've moved over to a cloud instance and am running my code on a bigger machine, but I'm not seeing the performance improvements I had hoped for. I had assumed that more CPU and RAM would be a simple (if perhaps not cost-effective or optimised) solution, but I'm not certain the extra resources are even being used by what I'm running.
At the moment I am working with half-hourly data for about a year, with a few extra variables alongside my prediction target. My code looks something like this:
my_data %>%
  model(
    ARIMA_xmpl = fable::ARIMA(
      my_value ~ pdq(1, 0, 1) + PDQ(1, 1, 1, period = "day") + my_variable
    ),
    ARIMA_xmpl_2 = fable::ARIMA(
      my_value ~ pdq(1, 0, 1) + PDQ(1, 1, 1, period = "day") + my_variable + my_other_variable
    )
  )
What is available to me to:
- check whether my process is memory-bound or CPU-bound? (see the first sketch below)
- ensure I'm utilising all my CPUs? (see the second sketch below)
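For the first question, the rough starting point I'm aware of (not fable-specific) is to compare CPU time with wall-clock time for a single fit, and to watch the process with an OS tool like `top` or `htop` while it runs: steadily climbing memory use or heavy swapping points to memory pressure, while one core pinned at 100% with elapsed roughly equal to user time points to a single-threaded, CPU-bound fit. A sketch, reusing one of the models above:

```r
# CPU time vs wall-clock time for one model fit.
# user + sys close to elapsed  -> one core doing all the work (CPU-bound)
# elapsed much larger than CPU -> time spent swapping, on I/O, or waiting
timing <- system.time(
  fit <- my_data %>%
    model(
      ARIMA_xmpl = fable::ARIMA(
        my_value ~ pdq(1, 0, 1) + PDQ(1, 1, 1, period = "day") + my_variable
      )
    )
)
print(timing)

# Line-level profile of time and memory allocation (needs the profvis package).
profvis::profvis({
  my_data %>%
    model(
      ARIMA_xmpl = fable::ARIMA(
        my_value ~ pdq(1, 0, 1) + PDQ(1, 1, 1, period = "day") + my_variable
      )
    )
})

# How many cores does R actually see on the cloud instance?
parallel::detectCores()
```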
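For the second question, my understanding (worth verifying against the fabletools documentation for your version) is that `model()` can estimate in parallel via the future framework when a plan is set, with the speed-up coming from distributing key/model-definition combinations across workers rather than from making a single ARIMA fit multi-threaded. A sketch, assuming that behaviour:

```r
library(future)

# Spread estimation across local worker processes; leave one core for the OS.
plan(multisession, workers = max(1, parallel::detectCores() - 1))

fits <- my_data %>%
  model(
    ARIMA_xmpl = fable::ARIMA(
      my_value ~ pdq(1, 0, 1) + PDQ(1, 1, 1, period = "day") + my_variable
    ),
    ARIMA_xmpl_2 = fable::ARIMA(
      my_value ~ pdq(1, 0, 1) + PDQ(1, 1, 1, period = "day") + my_variable + my_other_variable
    )
  )

plan(sequential)  # switch back to single-process execution afterwards
```

If that is how the parallelism works, then with a single series and only two model definitions the gains would be limited, which might explain why the bigger machine isn't helping: the expensive part would be each individual ARIMA fit, running on one core. Is that reading correct, and is there anything else I should be doing?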