Fit multiple models in a loop or somehow fit just one model: varying inputs and target

I have been given a task by a stakeholder who has created multiple models. I am trying to figure out whether there's a clever way to use one single model, adding the relevant predictors, to get a similar result. After all, the various models all come from one single data frame.

I'll start with a data frame similar to my real one and then explain the context:

# Set up
library(tidyverse)
library(lubridate)


# Create data
mydf <- data.frame(
  cohort = seq(ymd('2019-01-01'), ymd('2019-12-31'), by = '1 days'),
  n = rnorm(365, 1000, 50) %>% round,
  cohort_cost = rnorm(365, 800, 50)
) %>% 
  crossing(tenure_days = 0:365) %>% 
  mutate(activity_date = cohort + days(tenure_days)) %>% 
  mutate(daily_revenue = rnorm(nrow(.), 20, 1)) %>% 
  group_by(cohort) %>% 
  arrange(activity_date) %>% 
  mutate(cumulative_revenue = cumsum(daily_revenue)) %>% 
  arrange(cohort, activity_date) %>% 
  mutate(payback_velocity = round(cumulative_revenue / cohort_cost, 2)) %>% 
  select(cohort, n, cohort_cost, activity_date, tenure_days, everything())

mydf %>% glimpse
Rows: 133,590
Columns: 8
Groups: cohort [365]
$ cohort             <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-…
$ n                  <dbl> 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 99…
$ cohort_cost        <dbl> 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, …
$ activity_date      <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-05, 2019-01-06, 2019-01-07, 2019-01-08, 2019-01-09, 2019-01-10, 2019-01-11, 2019-01-12, 2019-01-13, 2019-…
$ tenure_days        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, …
$ daily_revenue      <dbl> 18.57664, 20.23128, 19.29523, 20.05705, 20.91953, 20.82540, 20.08343, 18.92085, 19.13451, 21.70503, 20.03755, 20.38968, 19.27403, 18.97484, 21.30330, 21.07885, 20…
$ cumulative_revenue <dbl> 18.57664, 38.80792, 58.10314, 78.16019, 99.07972, 119.90512, 139.98855, 158.90939, 178.04390, 199.74894, 219.78649, 240.17616, 259.45019, 278.42502, 299.72832, 32…
$ payback_velocity   <dbl> 0.02, 0.05, 0.07, 0.09, 0.12, 0.14, 0.16, 0.19, 0.21, 0.23, 0.26, 0.28, 0.30, 0.33, 0.35, 0.38, 0.40, 0.42, 0.45, 0.47, 0.49, 0.52, 0.54, 0.56, 0.58, 0.61, 0.63, …
mydf %>% head
# A tibble: 6 x 8
# Groups:   cohort [1]
  cohort         n cohort_cost activity_date tenure_days daily_revenue cumulative_revenue payback_velocity
  <date>     <dbl>       <dbl> <date>              <int>         <dbl>              <dbl>            <dbl>
1 2019-01-01   996        851. 2019-01-01              0          18.6               18.6             0.02
2 2019-01-01   996        851. 2019-01-02              1          20.2               38.8             0.05
3 2019-01-01   996        851. 2019-01-03              2          19.3               58.1             0.07
4 2019-01-01   996        851. 2019-01-04              3          20.1               78.2             0.09
5 2019-01-01   996        851. 2019-01-05              4          20.9               99.1             0.12
6 2019-01-01   996        851. 2019-01-06              5          20.8              120.              0.14

The context is app installs. The goal is to predict payback velocity: for each day of tenure, how much revenue we have received in total for the cohort since install, as a % of how much we spent on advertising to acquire them.

E.g. take the head(mydf) output above: the last row shows that we spent 851 on advertising for installs that occurred on 2019-01-01, and that by day 6, 2019-01-06, we had received back 120 in revenue for that cohort, which is 14% of our spend (120/851).

The spreadsheet that I inherited, and eventually have to turn into a Shiny app, contains simple linear models for various "day from" and "day to" velocity paybacks.

Example, the following simple linear model:

day30_velocity_payback ~ day15_velocity_payback

I.e. can we predict what % payback we will have by day 30 based on the payback we had by day 15? And so on:

day60_velocity_payback ~ day30_velocity_payback
day120_velocity_payback ~ day60_velocity_payback
day180_velocity_payback ~ day120_velocity_payback

In my eventual model(s) that will live inside a Shiny App, I would like the user to be able to select a 'day from' and 'day to' from drop downs and then have the model run corresponding predictions.

This means I would need to create a model for each combination of 'day from' and 'day to', with the condition that 'day to' is always greater than 'day from'. I.e. fit a model to predict velocity payback in the future based on velocity payback in the past.
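Just to gauge the scale, here's a quick sketch of enumerating those combinations (assuming the drop downs could offer every tenure day from 0 to 365, as in the simulated data; in practice it would probably be a much smaller set of checkpoint days):

# Sketch: every valid ('day from', 'day to') pair over a 0-365 day range
day_pairs <- crossing(day_from = 0:365, day_to = 0:365) %>% 
  filter(day_to > day_from)

nrow(day_pairs) # ~67,000 candidate from/to combinations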

I can pivot my data frame:

## wider data
mydf_wide <- mydf %>% 
  select(cohort, n, cohort_cost, tenure_days, payback_velocity) %>% 
  group_by(cohort, n, cohort_cost) %>% 
  pivot_wider(names_from = tenure_days,
              values_from = payback_velocity,
              names_prefix = 'velocity_day_')

mydf_wide %>% glimpse
Rows: 365
Columns: 369
Groups: cohort, n, cohort_cost [365]
$ cohort           <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-05, 2019-01-06, 2019-01-07, 2019-01-08, 2019-01-09, 2019-01-10, 2019-01-11, 2019-01-12, 2019-01-13, 2019-01…
$ n                <dbl> 996, 967, 1002, 977, 1087, 1006, 1035, 987, 931, 1018, 1031, 1020, 990, 1019, 999, 959, 1001, 989, 1061, 966, 968, 1003, 939, 1008, 994, 1016, 944, 917, 1021, 990, …
$ cohort_cost      <dbl> 851.1910, 843.7070, 748.6089, 854.7634, 824.4729, 901.4816, 875.9514, 829.4255, 795.9618, 775.9743, 748.6413, 762.3107, 830.0364, 840.3704, 864.2655, 875.1085, 822.…
$ velocity_day_0   <dbl> 0.02, 0.02, 0.03, 0.02, 0.03, 0.02, 0.02, 0.03, 0.02, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.03, 0.…
$ velocity_day_1   <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.04, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.…
$ velocity_day_2   <dbl> 0.07, 0.07, 0.08, 0.07, 0.08, 0.07, 0.07, 0.07, 0.07, 0.08, 0.08, 0.08, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.08, 0.07, 0.07, 0.07, 0.07, 0.08, 0.…
$ velocity_day_3   <dbl> 0.09, 0.09, 0.11, 0.09, 0.10, 0.09, 0.09, 0.10, 0.10, 0.11, 0.11, 0.11, 0.10, 0.09, 0.09, 0.09, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.09, 0.10, 0.11, 0.…
$ velocity_day_4   <dbl> 0.12, 0.12, 0.13, 0.12, 0.12, 0.11, 0.12, 0.12, 0.12, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.11, 0.13, 0.12, 0.12, 0.12, 0.12, 0.13, 0.12, 0.12, 0.12, 0.12, 0.13, 0.…
$ velocity_day_5   <dbl> 0.14, 0.14, 0.16, 0.14, 0.15, 0.13, 0.14, 0.14, 0.15, 0.16, 0.16, 0.16, 0.15, 0.14, 0.14, 0.14, 0.15, 0.15, 0.14, 0.14, 0.15, 0.15, 0.15, 0.15, 0.14, 0.15, 0.16, 0.…
$ velocity_day_6   <dbl> 0.16, 0.16, 0.19, 0.16, 0.18, 0.16, 0.16, 0.17, 0.17, 0.19, 0.18, 0.19, 0.17, 0.17, 0.16, 0.16, 0.18, 0.18, 0.17, 0.17, 0.17, 0.18, 0.17, 0.17, 0.16, 0.17, 0.18, 0.…
$ velocity_day_7   <dbl> 0.19, 0.18, 0.21, 0.18, 0.20, 0.18, 0.19, 0.19, 0.19, 0.21, 0.21, 0.22, 0.19, 0.19, 0.18, 0.18, 0.20, 0.20, 0.20, 0.19, 0.20, 0.21, 0.20, 0.20, 0.19, 0.20, 0.21, 0.…
$ velocity_day_8   <dbl> 0.21, 0.21, 0.24, 0.21, 0.22, 0.20, 0.21, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.21, 0.21, 0.20, 0.22, 0.23, 0.22, 0.22, 0.22, 0.23, 0.22, 0.22, 0.22, 0.22, 0.23, 0.…
$ velocity_day_9   <dbl> 0.23, 0.23, 0.27, 0.23, 0.25, 0.22, 0.23, 0.24, 0.24, 0.26, 0.26, 0.27, 0.24, 0.24, 0.23, 0.23, 0.25, 0.25, 0.24, 0.24, 0.25, 0.26, 0.25, 0.25, 0.24, 0.25, 0.26, 0.…
$ velocity_day_10  <dbl> 0.26, 0.26, 0.29, 0.26, 0.27, 0.24, 0.26, 0.27, 0.27, 0.29, 0.29, 0.29, 0.27, 0.26, 0.26, 0.25, 0.27, 0.28, 0.27, 0.26, 0.27, 0.28, 0.27, 0.27, 0.26, 0.27, 0.29, 0.…


mydf_wide %>% head
# A tibble: 6 x 369
# Groups:   cohort, n, cohort_cost [6]
  cohort         n cohort_cost velocity_day_0 velocity_day_1 velocity_day_2 velocity_day_3 velocity_day_4 velocity_day_5 velocity_day_6 velocity_day_7 velocity_day_8 velocity_day_9
  <date>     <dbl>       <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
1 2019-01-01   996        851.           0.02           0.05           0.07           0.09           0.12           0.14           0.16           0.19           0.21           0.23
2 2019-01-02   967        844.           0.02           0.05           0.07           0.09           0.12           0.14           0.16           0.18           0.21           0.23
3 2019-01-03  1002        749.           0.03           0.05           0.08           0.11           0.13           0.16           0.19           0.21           0.24           0.27
4 2019-01-04   977        855.           0.02           0.05           0.07           0.09           0.12           0.14           0.16           0.18           0.21           0.23
5 2019-01-05  1087        824.           0.03           0.05           0.08           0.1            0.12           0.15           0.18           0.2            0.22           0.25
6 2019-01-06  1006        901.           0.02           0.05           0.07           0.09           0.11           0.13           0.16           0.18           0.2            0.22

With my data in this format I could manually fit the models one by one:

# Models

## day 1 to day 2 velocity model
mod_velocity_d1_d2 <- lm(velocity_day_2 ~ velocity_day_1, mydf_wide)

## day 1 to day 3 velocity model
mod_velocity_d1_d3 <- lm(velocity_day_3 ~ velocity_day_1, mydf_wide)

## day 1 to day 4 velocity model
mod_velocity_d1_d4 <- lm(velocity_day_4 ~ velocity_day_1, mydf_wide)

## ...

## day 2 to day 3 velocity model
mod_velocity_d2_d3 <- lm(velocity_day_3 ~ velocity_day_2, mydf_wide)

## ...

## day 100 to day 365 velocity model
mod_velocity_d100_d365 <- lm(velocity_day_365 ~ velocity_day_100, mydf_wide)

This would take a very long time.

Option 1: Is there a more elegant way of doing this with less code? Still having many models, but perhaps fitting them in a loop?

Option 2: Or, is there some way to define one single model that takes tenure_days as an input and outputs the corresponding predictions? I.e. the target would be payback_velocity and the input would be tenure_days at a given point, where the predictor variable is itself variable. E.g. given tenure_days = 30 and payback_velocity_day[15], the model would output the expected payback velocity at day 30 based on the payback velocity at day 15. I put 15 inside brackets to denote that it is variable. In this imaginary model, which I hope somehow exists, the input could be variable, e.g.

tenure_days=30 and payback_velocity_day[15] # predict day 30 payback based on day 15 payback
tenure_days=30 and payback_velocity_day[20] # predict day 30 payback based on day 20 payback
tenure_days=60 and payback_velocity_day[18] # predict day 60 payback based on day 18 payback

Is this possible?

What's a good way to approach this problem? My option 1, using many models in a loop? Or does an option 2 approach exist?
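To make option 2 a bit more concrete, the kind of layout I imagine a single pooled model could take is sketched below. This is purely illustrative (I don't know whether it is statistically sound), and the checkpoint days are arbitrary examples:

# Sketch only: pair up each cohort's payback at a 'from' day with its
# payback at every later 'to' day, then let the tenure values themselves
# be predictors. Checkpoint days are arbitrary examples.
days_of_interest <- c(15, 30, 60, 120, 180)

velocity_long <- mydf %>% 
  ungroup() %>% 
  filter(tenure_days %in% days_of_interest) %>% 
  select(cohort, tenure_days, payback_velocity)

pairs_df <- inner_join(velocity_long, velocity_long,
                       by = "cohort", suffix = c("_from", "_to")) %>% 
  filter(tenure_days_to > tenure_days_from)

mod_pooled <- lm(
  payback_velocity_to ~ payback_velocity_from + tenure_days_from + tenure_days_to,
  data = pairs_df
)

# e.g. predict day-30 payback from an observed day-15 payback of 0.40
predict(mod_pooled, newdata = data.frame(
  payback_velocity_from = 0.40,
  tenure_days_from      = 15,
  tenure_days_to        = 30
))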

I'm going to answer this in two parts. First, I am going to suggest you consider doing something completely different, after which I'll give you some possible solutions to do what you're asking if you (or the stakeholder) remain committed to it.

So, first, I don't think linear models are a good fit for this type of analysis; I think you should really be looking at time series analysis. Unfortunately, this is not an area I'm well versed in, so I can't give you anything approximating good advice here, but I can point you to some resources I found which I think are enough to get you started down that road.

I think these would be sufficient for you to start down the path of doing some time series analysis on your data.

If you need more beyond this, a book which I can recommend by proxy is
Time Series Analysis and Its Applications - Shumway & Stoffer,
which you might be able to download for free if you have access to university Wi-Fi.

Now, the second part, a solution to what you are asking about...

  • Option 1
    If you just want to create many simple linear regression models and test them all out, the following code should serve as a template you can play with:
dvars <- LETTERS[1:3] # dependent variables
ivars <- letters[1:10] # independent variables
df <- as.data.frame(array(sample(10, 13 * 100, TRUE),
                          dim = c(100, 13),
                          dimnames = list(1:100, c(dvars, ivars))))
# build every "dependent ~ independent" formula string
models <- outer(dvars, ivars, function(l, r) paste(l, r, sep = " ~ "))
# fit one lm() per formula string, keeping the results in a named list
all_ols <- sapply(models, lm, df, simplify = FALSE)
head(all_ols)
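For instance, applied to the wide data from the question, the same template might look something like this (a sketch only; the checkpoint days are arbitrary examples, and the velocity_day_* columns come from the pivot_wider step above):

# Sketch: the same template applied to mydf_wide
day_from <- c(1, 15, 30)
day_to   <- c(60, 120, 180)

to_vars   <- paste0("velocity_day_", day_to)    # dependent variables
from_vars <- paste0("velocity_day_", day_from)  # independent variables

vel_formulas <- outer(to_vars, from_vars,
                      function(l, r) paste(l, r, sep = " ~ "))
velocity_models <- sapply(vel_formulas, lm, mydf_wide, simplify = FALSE)

# pull out one fitted model by its formula string
velocity_models[["velocity_day_60 ~ velocity_day_15"]]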

If you want a more function-based approach, you can play with these:

run_model <- Vectorize(function(y, x, d) {
  f <- as.formula(paste(y, x, sep = " ~ "))
  list(f = lm(f, d))
}, vectorize.args = c("x", "y"))

run_models <- function(y, x, d) {
  structure(outer(y, x, run_model, d), dimnames = list(y, x))
}

# Imagine these as inputs from the user of your Shiny app,
y <- "A"
x <- "a"
# Then either run individual models as needed:
run_model(y, x, df)

# or, pre-run all models and just extract them as needed:
all_models <- run_models(dvars, ivars, df)
all_models[y, x]
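Once a model has been extracted, it can be used for predictions in the usual way. A small sketch (the predictor value 5 is just an arbitrary input):

# [[ ]] pulls the lm object itself out of the list-matrix returned by
# run_models(); the newdata column must be named after the chosen predictor
fit <- all_models[[y, x]]
predict(fit, newdata = setNames(data.frame(5), x))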

A solid introductory time series text is the one by Hyndman. The reason an ARIMA or other time series approach is needed is the autocorrelation of residuals, which violates a linear regression assumption.

The function-based approach by @elmstedt should be considered after the choice of model.


Yes, I should have been much clearer about that.

You never want to select a model solely based on which of several possible predictors gives the best fit, pretty much ever, but especially in simple linear regression. The problem is that, depending on your significance level (let's assume 5%), you would expect about 1 model in every 20 to come out statistically significant even when there is no real relationship. So, in theory, I could generate a random response variable and 20 random predictor variables and, on average, we would expect about one of them to have a p-value under 0.05.

Here's an example with 2,500 completely random models.

# extract the p-value of the slope coefficient from a fitted lm
get_p <- function(ols) {
  summary(ols)$coefficients[2, 4]
}

dvars <- LETTERS[1] # dependent variables
ivars <- outer(letters[1:25], seq(100), paste0)
n <- 100
set.seed(1123)
df <- as.data.frame(array(sample(10, n * length(c(dvars, ivars)), TRUE),
                          dim = c(n, length(c(dvars, ivars))),
                          dimnames = list(seq_len(n), c(dvars, ivars))))

models <- outer(dvars, ivars, function(l, r) paste(l, r, sep = " ~ "))
all_ols <- sapply(models, lm, df, simplify = FALSE)
pvalues <- sapply(all_ols, get_p)
prop.table(table(pvalues > 0.05))
hist(pvalues, breaks = seq(0, 1, 0.05), freq = FALSE)

The study and practice of model selection is a whole other beast unto itself and you should approach it with caution. It's very easy to find things which aren't really there.


Thanks for the answer and feedback @technocrat and @elmstedt, much appreciated. I am working on integrating it into my script just now. Agree re: time series. For this stakeholder, my strategy is to first provide exactly what they asked for and then suggest what they actually need.

