Simple regression observations are not independant. So what?

I have a regression model that models log(GrowthRate) ~ log(day) where GrowthRate is the revenue growth from a cohort of customers day to day and day is the number of days ago since the cohort installed the app.

e.g. On Jan 1st 2020 a hundred people install our app spend $10 on this day. On Jan 2 2020, day 2, this same group of people (cohort) who installed on Jan 1st spent $9. On day 3 the cohort spent $8 and perhaps by day 365 this cohort only spent $0.01.

My data then looks like this:

Screenshot from 2022-03-04 11-15-50

I've shown the first 3 days of each cohort, but in reality I have years of historic data and many cohorts have history going back day 730 or more.

I'm happy with the model. It does well on our evaluation metric and the relationship between log(GrowthRate) and log(day) is pretty linear.

But, my observations are not independent. Each row is a combination of cohort and day, so cohort like a group. Is it? I don't want to use cohort as a variable, like my research suggested, since my use case is to predict GrowthRate on new cohorts, so the new cohorts would not exist in the training data.

My observations are not independant since I'm monitoring the same cohort over time (day) alongside other cohort day combinations. I'm otherwise happy with the model which predicts well on new unseen data. So what? Why does this matter and what should I do, if anything?

It depends on what your goal is. If it's pure prediction, then you probably don't need to worry. But if you are worried about things like standard errors and confidence intervals then the correlations among observations likely do matter. In that case you would want to cluster observations by cohort. Doing that would not change the predictions.

1 Like

The statistical approach for this type of problem would be a hierarchical model with a random effect for cohort and fixed effect for day. That way you can model differences between cohort without exploding the degrees of freedom of the model. And be able to generate a prediction for a as-yet-untested cohort. I'm not that much of an expert in it, but the model would be something like

lmer(revenue ~ day + (day | cohort)


I looked at a mixed effect model here but don't understand how I could predict on new cohorts if cohort is one of the effects

Also, I would be inclined to model revenue instead of growth rate. Especially if growth rate is calculated by finite differencing. Calculating the derivative by finite differencing is notoriously noisy. But, just a thought.

Thanks for the info here. When you say 'cluster' by cohort, do you mean e.g. a mixed effects model as suggested by Arthur below? Or something else?

If it's pure prediction

With this model I wanted the best of both worlds - a reasonably accurate prediction that is also explainable!

That's a good question that I don't have the answer to. But I feel like it should be possible given that mixed model fits a distribution for cohort effects, and that intuitively seems like it has the information required to produce a prediction confidence interval.

But I feel like it should be possible

I do too! I'm optimistic there's a prescribed or conventional approach to this. I'm continuing my searching...

No, my suggestion is only an adjustment to standard errors. A quick intro is at Clustered standard errors - Wikipedia.

I do agree (without seeing the data) with @arthur.t about thinking about other functional forms. Although if you're happy with what you have that might not be important.

I believe that, for a new cohort that has no information in the fitting part of the analysis, the prediction would just be based on the fixed effects part of your formula (intercept and day), as the cohort specific intercept and slope for day would get the expected value: zero.

Hi Ron. OK, but then would the model not just end up being the same as a regular linear regression? i.e. what would be the benefit of using a mixed effects model here?

Accounting for the within-cohort covariance between observations. And if you get intermittent updates, you'd gradually include info on that cohort (re-running lmer to give estimates of the cohort level intercept and slope), and predict later days better.

Of course, if there's some correlation structure between cohorts, which would have to be incorporated into the analysis, then info on past cohorts would provide info on the new cohort. For example, my background is in quantitative genetics of livestock and the correlation structure between animals would be controlled through the pedigree - so a newborn animal's genetic merit would be a function of it's mum's and dad's genetic merit (genetic merit being a random effect solution) and they've presumably been included in the last analysis, so it wouldn't necessarily be zero. I don't know whether lmer supports that kind of thing, or just an identity matrix corr structure.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.