What would a training and prediction workflow look like with cohorted nonlinear regression?

dougfir · April 30, 2021, 4:20pm

Context is app installs and cumulative revenue.

On a cohort basis (daily or weekly), cumulative revenue is logarithmic in shape. I'd like to develop a model that, a week or two after install, predicts a cohort's cumulative revenue many weeks out.

Before diving in I'm trying to visualize in my mind what a training and prediction workflow could look like.

Which historic data do I use to fit a model? Let's say our app has 5 years of historic data. For each new weekly cohort, I'd like to predict what the cohort's cumulative revenue might look like out into the future, e.g. at 3 months, 6 months, 12 months, etc. Do I use the 7-day spend behavior of each cohort to inform a custom prediction per cohort? Or do I just train a model on all historic data and make a uniform prediction that every new cohort will have $x, $y, $z cumulative revenue after 3, 6 and 12 months?

Assuming variation in cohort behavior, presumably I want to be able to use the first e.g. 7 days of cumulative revenue for the cohort to be able to inform a future prediction, as opposed to ignoring that and using a 'main' model that is trained on all historic data.

If we want to predict out as far as 6 months or a year, then any model must surely have at least 6 months or a year's worth of historic data to train with. That is, I could not fit a new model just for a specific cohort with 7 days of revenue data and then attempt to predict what 6 months of revenue look like. So how can I combine data unique to the cohort with historic data to make a prediction?
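One possible way to combine the two sources, sketched here only as an illustration: use mature historical cohorts (those that already have 180 days of data) to learn how day-7 cumulative revenue relates to day-180 cumulative revenue, then feed a new cohort's own day-7 figure into that fitted relationship. The data below are simulated, and the names `mature`, `day7`, `day180` and `new_cohorts`, as well as the assumed multiplier of 2.5, are hypothetical and not taken from the thread.

```r
set.seed(7)  # arbitrary seed so the simulation is reproducible

# Hypothetical mature cohorts: one row per cohort with 180+ days of history
n <- 200
day7 <- runif(n, 20, 200)                       # cumulative revenue at day 7
day180 <- 2.5 * day7 * exp(rnorm(n, sd = 0.1))  # assumed relationship plus noise
mature <- data.frame(day7 = day7, day180 = day180)

# Learn the day-7 to day-180 relationship on the log scale
fit <- lm(log(day180) ~ log(day7), data = mature)

# New cohorts have only their first 7 days observed
new_cohorts <- data.frame(day7 = c(50, 120))

# Back-transform the prediction and its 95% prediction interval
pred <- exp(predict(fit, newdata = new_cohorts, interval = "prediction"))
pred
```

The training data here is the set of mature cohorts, and the cohort-specific input is the day-7 value. The same fit could be repeated for other horizons (90 or 365 days), provided enough cohorts are old enough to have reached them.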

Within the above context, what are some good cohort based approaches to cumulative revenue prediction? What's my training data? Do I use the first 7 days of spend behavior to inform my prediction?

technocrat · May 1, 2021, 1:02am

This is a problem in the time series domain. We would not consider observations at different frequencies to be cohorts in the sense used in panel data or other methods, where a cohort signifies a category such as male/female/other. It is possible to do cohorts by app or app type, though.

With your historical data, there is a choice of the frequency of observations: hourly, daily, weekly, monthly, etc. As the frequency grows smaller, the potential accuracy of the forecast horizon increases. Confidence bands for time series projections quickly become so wide as to encompass negative values on the one hand, and unrealistically great positive values on the other.

A thorough and accessible text is Hyndman.

dougfir · May 3, 2021, 4:21pm

Thank you for the information. Time series makes sense to me. Nevertheless, I've been re-learning about log models lately and almost want to force this problem into a log regression script. I was thinking about how to do that over the weekend. The thought I'm wrangling with is this: if cumulative revenue is logarithmic and fits a logarithmic regression well, how could I use new data from 7 days of spend behavior to inform the model's prediction? I was thinking that I could model the day-to-day growth rate since install and then, for a given cohort of installs (installs on a specific day or week), apply those growth rates to the existing cumulative revenue at day 7 to 'predict' out to day 180, 365, etc.
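A rough sketch of that growth-rate idea, under strong simplifying assumptions: the historical cohorts are simulated, every cohort follows the same logarithmic shape and differs only in scale, and the names `hist_cum`, `growth`, `project` and `new_cohort_day7` are made up for illustration. Real cohorts would be noisier, so the pooled factors would only be approximate.

```r
set.seed(1)  # arbitrary seed

# Hypothetical history: rows are cohorts, columns are days since install.
# Each cohort follows scale * log(day + 1), so day 1 is already positive.
n_cohorts <- 50
days <- 1:180
cohort_scale <- runif(n_cohorts, 50, 150)
hist_cum <- outer(cohort_scale, log(days + 1))

# Pooled growth factor from day d to day d + 1, for d = 1..179
growth <- colSums(hist_cum[, -1]) / colSums(hist_cum[, -length(days)])

# Roll a cohort's day-7 cumulative revenue forward to a later horizon
project <- function(cum_day7, horizon, growth) {
  cum_day7 * prod(growth[7:(horizon - 1)])
}

new_cohort_day7 <- 80 * log(7 + 1)  # hypothetical new cohort observed to day 7
projected_180 <- project(new_cohort_day7, 180, growth)
projected_180
```

Because the growth factors are pooled across all historical cohorts while the starting value comes from the new cohort, this combines historic shape with cohort-specific level. Projecting to day 365 would need at least 365 days of history in `hist_cum`.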

Ignoring the fact that ts is more applicable here, just to satisfy my own curiosity, is there some approach where I could still use a log regression model in this scenario, such as what I describe above?

technocrat · May 3, 2021, 11:37pm

Suppose a variable grows strictly as a constant times the log of the period:

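C <- 1.001    # constant multiplier
obs <- 1:100  # period index

# deterministic growth: a constant times the log of the period
grow <- function(x) C * log(x)
grow(obs)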
#>   [1] 0.0000000 0.6938403 1.0997109 1.3876807 1.6110474 1.7935512 1.9478561
#>   [8] 2.0815210 2.1994218 2.3048877 2.4002932 2.4873916 2.5675143 2.6416964
#>  [15] 2.7107583 2.7753613 2.8360466 2.8932621 2.9473834 2.9987280 3.0475670
#>  [22] 3.0941335 3.1386297 3.1812319 3.2220947 3.2613546 3.2991327 3.3355367
#>  [29] 3.3706631 3.4045986 3.4374212 3.4692016 3.5000041 3.5298869 3.5589034
#>  [36] 3.5871025 3.6145288 3.6412237 3.6672252 3.6925683 3.7172856 3.7414073
#>  [43] 3.7649613 3.7879738 3.8104692 3.8324700 3.8539977 3.8750722 3.8957121
#>  [50] 3.9159350 3.9357575 3.9551950 3.9742622 3.9929730 4.0113405 4.0293770
#>  [57] 4.0470943 4.0645035 4.0816150 4.0984389 4.1149847 4.1312615 4.1472779
#>  [64] 4.1630420 4.1785617 4.1938444 4.2088973 4.2237272 4.2383406 4.2527437
#>  [71] 4.2669426 4.2809428 4.2947499 4.3083692 4.3218056 4.3350641 4.3481492
#>  [78] 4.3610655 4.3738173 4.3864087 4.3988436 4.4111260 4.4232594 4.4352476
#>  [85] 4.4470939 4.4588016 4.4703740 4.4818142 4.4931250 4.5043095 4.5153704
#>  [92] 4.5263104 4.5371321 4.5478381 4.5584308 4.5689125 4.5792857 4.5895524
#>  [99] 4.5997150 4.6097754

# collect the series in a data frame and plot it
dat <- data.frame(obs = obs, amt = grow(obs))
plot(dat$obs, dat$amt)

From any seven periods, the future can be projected indefinitely if that's the case.
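To make that concrete, here is a small sketch that fits a log regression to only the first seven periods of the same noise-free series and projects the rest. The names `early`, `future` and `projected` are introduced only for this illustration.

```r
C <- 1.001
obs <- 1:100
grow <- function(x) C * log(x)

# Use only the first seven periods to fit amt = a + b * log(obs)
early <- data.frame(obs = obs[1:7], amt = grow(obs[1:7]))
fit <- lm(amt ~ log(obs), data = early)

# Project periods 8 to 100 from that fit
future <- data.frame(obs = 8:100)
projected <- predict(fit, newdata = future)

# Largest gap between the projection and the true curve
max_err <- max(abs(projected - grow(future$obs)))
max_err

# The fitted slope recovers the constant C
coef(fit)
```

With no noise in the series, the projection error is down at the level of floating-point rounding, which is why seven periods are enough in this idealized case.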

More likely is that the log is a central tendency with variability around it and subsequent observations are taken to assess the current confidence bands. Can you gin up some data to explore this?
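As a placeholder until real data is available, the idea can be explored with simulated data: the same log curve plus random noise, refit as more observations arrive. The noise level `sigma`, the seed and the checkpoints 7, 14, 28 and 56 are arbitrary assumptions, and `widths` is a name introduced just for this sketch.

```r
set.seed(42)  # arbitrary seed

C <- 1.001
obs <- 1:100
sigma <- 0.15  # assumed noise level around the log curve
dat <- data.frame(obs = obs,
                  amt = C * log(obs) + rnorm(length(obs), sd = sigma))

# Refit with the first n observations and record the width of the
# 95% prediction interval for period 100
n_seen <- c(7, 14, 28, 56)
widths <- sapply(n_seen, function(n) {
  fit <- lm(amt ~ log(obs), data = dat[1:n, ])
  p <- predict(fit, newdata = data.frame(obs = 100), interval = "prediction")
  unname(p[, "upr"] - p[, "lwr"])
})

data.frame(n_seen = n_seen, width_at_100 = widths)
```

The interval for a far-out period would typically narrow as more of the curve is observed, though with noisy data this is a tendency and not a guarantee at every step.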

dougfir · May 4, 2021, 8:53pm

Thanks for your feedback and contribution.

"Can you gin up some data to explore this?"

Here you go. Will ping you a password to access assuming I can message people on this forum.

I sampled my data and transformed the sample so it's anonymous, but the shape is still the same. File attached. When you take the log of tenure and plot it against cumamt (cumulative amount), it's pretty straight. What do you reckon?
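For anyone without the attachment, a check along these lines can be sketched on stand-in data. The data frame `cohort` below is simulated and is not the attached file; only the column names `tenure` and `cumamt` come from the post, and the coefficients and noise level are invented.

```r
set.seed(123)  # arbitrary seed

# Hypothetical stand-in for the attached file
cohort <- data.frame(tenure = 1:180)
cohort$cumamt <- 40 + 25 * log(cohort$tenure) + rnorm(180, sd = 2)

# If cumamt is roughly linear in log(tenure), a simple linear fit should do well
fit <- lm(cumamt ~ log(tenure), data = cohort)
r2 <- summary(fit)$r.squared
r2

# Visual check: points should hug the fitted line
plot(log(cohort$tenure), cohort$cumamt,
     xlab = "log(tenure)", ylab = "cumamt")
abline(fit)
```

A high R-squared and a straight-looking scatter would support the log shape. A plot of the residuals against `log(tenure)` is a useful follow-up, since systematic curvature there would suggest the log form is only an approximation.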
