Predict future revenue from cohorts

(Open ended, more discussion based question)

I've tried to tackle this problem from several angles and none are entirely 'great'.

We have an app and many people download and install it each day. We call this a cohort and the cumulative revenue from a cohort follows a growth curve that looks somewhat logarithmic in shape (but not completely). E.g.

X axis is days since the cohort installed the app, y axis is the log of growth rate, e.g. log(CUMULATIVE_AMOUNT) - log(CUMULATIVE_AMOUNT_AT_DAY 7)

I tried a regression model of the form GROWTH_RATE ~ log(TENURE) and this actually did OK when testing on out of sample new data. 'OK' is subjective here. The problem is/was that, as you can see in the chart, variance widens with tenure and when plotting with plot(my.mod) residuals Vs. fitted were megaphone shaped - hetroskedasticity.

We want to be able to observe each cohort for 7 days, specifically their total spend in this first week and then use that to predict out to day 365.

Using the model above, we'd take the observed amount after day 7 and multiply by the log growth rate predicted by the model (After necessary exp() transformations).

I did some research due to the hetroskedasticity and attempted a box-cox transformation but this didn't help much.

Then, the 'hint' I got from my research was that I should be using a time series model. I did some self study using Forecasting Principles and Practice book and got as far as ARIMA modeling. But, remembering that we want to observe to day 7 for each cohort before predicting out to year 1, time series approach, as far as I can see, does not facilitate this. Instead you train a model on historic data and then just predict h time periods ahead. E.g. from the book, a typical prediction block might look like this:

fit %>% forecast(h=10) %>%
  autoplot(global_economy) +
  labs(y = "% of GDP", title = "Egyptian exports")

Producing in this case (via linked page in book above):

For my case, I both want to fit a model based on historical data AND use the benefit of watching the first 7 days of spend data to inform any prediction.

I'm not sure what other info to provide, appreciate this is more discussion based.

What would be a 'good' approach to this problem? We have years worth of historic data to train a model on but we also want to be able to use the first 7 days of revenue to inform our predictions. Linear model with log tenure suffers from hetroskedasticity and as far as I can see time series would not make use of the first 7 days of revenue to inform the prediction.

If you want to predict revenue at 365 days from revenue at 7 days, then I think every other "day" of data (1-6, 8-364) is perhaps not useful.

I would be inclined to create the data frame with one row per cohort and columns

  • id
  • start_date
  • revenue7
  • revenue365
  • revenue365_7

I imagine that the app might get more or less popular over time, and there could be a weekly seasonality where the app is used more on weekends, so a time series could be a good approach applied to revenue365 with time series order defined by start_date. You can also try revenue7 as an exogenous variable in an ARIMAX for revenue365.

As @arthur.t suggests, an ARMAX model could be appropriate.

Another thing to consider is using a GARCH_M model, which uses observed variance to help predict the future mean.

Thanks for your suggestion Arthur. What is the goal of only including days 7 and 365 while ignoring everything in between? I don't understand our objective in doing this? To reduce the hetroskedasticity?

Something that went down well with stakeholders was the ability to predict to any interval between days 7 and 365, I just used 365 as an example since it's the most common one.

Your suggested fields:

  • revenue7 - is the total cumulative revenue as of day 7? i.e. days 1:7?
  • revenue365 - is the total cumulative revenue as of day 365? i.e. days 1:365
  • revenue365_7 - Is this total cumulative revenue between days 7:365?

I did a search for ARIMAX in the book I linked to by Rob Hyndman but nothing was returned. I also searched on fable documentation page for 'arimax' but no result.

For clarity then, my model would just be an arima with an added covariate of revenue7?

Regular arima model per the book using fable with default stepwise search to find best pdq:

mod.arima <- my_tsibble %>%
  model( stepwise = ARIMA(revenue365))

I found an SO post that the book author responded to with an example of arimax:

fit_arimax <- data %>% 
  filter(YearMonth <= yearmonth("2019 Aug")) %>%
    ARIMA(value ~ reg1 + reg2 + PDQ(0,0,0))

I guess then one must already know the pdq without relying on the automatic stepwise algorithm to determine pdq :confused:

But underlining, it would be good to be able to have a method that allows for predicting from day 7 all the way through to 365. These looked good on plots showing predicted vs. actual.

Thanks again for your suggestion!

Thank you for this suggestion too. Since I'm just learning I'm going to stick within R's fable framework for the time being

Sticking to ARMAX models is perfectly sensible.

Since it appears that you have a lot of data, let me point out that in an ARMAX model you can include things like revenue365^2, revenue365^3, etc.

Also, the basic ARMA model is linear and you've shown some evidence that the process is nonlinear. Might want to keep this in mind.

1 Like

You can fit ARIMAX model with something like

model <- forecast::auto.arima(
  ts(df$revenue365, frequency = 7),  # weekly seasonality
  xreg = df$revenue7

If you want to model the full profile of revenue over time, I can understand that.

Here are some thoughts though.

  1. It's always good to try a simpler approach in addition to a complex one. You'll have a benchmark to compare to and something you can fall back on if the complex approach doesn't work out.

  2. Fitting a linear model to time series data across cohorts introduces a wrinkle. The observations are not statistically independent. Each point is related to the previous in time and the next. Therefore all your p-vales and prediction errors etc. are going to be completely wrong. In order to get correct statistics, you'll need a mixed model with random effects term for cohort. Additionally, you'll have a categorical predictor variable for "cohort", but when making a true prediction on a new cohort, you won't have a way to specify the value for this term. So your quality of fit will also be misleading.

  3. The above approach will also miss out on the useful elements of the ARIMAX approach I described for revenue365 - trend and weekly seasonality across time.

  4. I would not recommend pursuing a time series approach within cohort because then you'll be trying to fit a time series model to the first 7 days of data individually for each cohort? It doesn't seem like it will produce a good model.

1 Like

Thanks for this info. Am going to read and digest it

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.