A stakeholder has handed me a task involving multiple models they created. I am trying to figure out whether there is a clever way to use one single model, with the relevant predictors added, to get a similar result. After all, the various models all come from one single data frame.
I'll start with a data frame similar to my real one and then explain the context:
# Set up
library(tidyverse)
library(lubridate)
# Create data
mydf <- data.frame(
  cohort = seq(ymd('2019-01-01'), ymd('2019-12-31'), by = '1 days'),
  n = rnorm(365, 1000, 50) %>% round,
  cohort_cost = rnorm(365, 800, 50)
) %>%
  crossing(tenure_days = 0:365) %>%
  mutate(activity_date = cohort + days(tenure_days)) %>%
  mutate(daily_revenue = rnorm(nrow(.), 20, 1)) %>%
  group_by(cohort) %>%
  arrange(cohort, activity_date) %>%
  mutate(cumulative_revenue = cumsum(daily_revenue)) %>%
  mutate(payback_velocity = round(cumulative_revenue / cohort_cost, 2)) %>%
  select(cohort, n, cohort_cost, activity_date, tenure_days, everything())
mydf %>% glimpse
Rows: 133,590
Columns: 8
Groups: cohort [365]
$ cohort <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-…
$ n <dbl> 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 996, 99…
$ cohort_cost <dbl> 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, 851.191, …
$ activity_date <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-05, 2019-01-06, 2019-01-07, 2019-01-08, 2019-01-09, 2019-01-10, 2019-01-11, 2019-01-12, 2019-01-13, 2019-…
$ tenure_days <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, …
$ daily_revenue <dbl> 18.57664, 20.23128, 19.29523, 20.05705, 20.91953, 20.82540, 20.08343, 18.92085, 19.13451, 21.70503, 20.03755, 20.38968, 19.27403, 18.97484, 21.30330, 21.07885, 20…
$ cumulative_revenue <dbl> 18.57664, 38.80792, 58.10314, 78.16019, 99.07972, 119.90512, 139.98855, 158.90939, 178.04390, 199.74894, 219.78649, 240.17616, 259.45019, 278.42502, 299.72832, 32…
$ payback_velocity <dbl> 0.02, 0.05, 0.07, 0.09, 0.12, 0.14, 0.16, 0.19, 0.21, 0.23, 0.26, 0.28, 0.30, 0.33, 0.35, 0.38, 0.40, 0.42, 0.45, 0.47, 0.49, 0.52, 0.54, 0.56, 0.58, 0.61, 0.63, …
mydf %>% head
# A tibble: 6 x 8
# Groups: cohort [1]
cohort n cohort_cost activity_date tenure_days daily_revenue cumulative_revenue payback_velocity
<date> <dbl> <dbl> <date> <int> <dbl> <dbl> <dbl>
1 2019-01-01 996 851. 2019-01-01 0 18.6 18.6 0.02
2 2019-01-01 996 851. 2019-01-02 1 20.2 38.8 0.05
3 2019-01-01 996 851. 2019-01-03 2 19.3 58.1 0.07
4 2019-01-01 996 851. 2019-01-04 3 20.1 78.2 0.09
5 2019-01-01 996 851. 2019-01-05 4 20.9 99.1 0.12
6 2019-01-01 996 851. 2019-01-06 5 20.8 120. 0.14
Context is app installs. The goal is to predict payback velocity. That is, for each day of tenure, how much revenue we have received in total for the cohort since install, as a % of how much we spent on advertising to acquire them.
For example, in the head(mydf) output above, the last row shows that we spent 851 on advertising for installs that occurred on 2019-01-01, and that by 2019-01-06 we had received 120 back in revenue for that cohort, which is 14% of our spend (120/851).
The spreadsheet that I inherited, and eventually have to turn into a Shiny app, contains simple linear models for various "day from" and "day to" velocity paybacks.
For example, the simple linear model:
day30_velocity_payback ~ day15_velocity_payback
predicts what % payback we will have by day 30 based on the payback we had by day 15. Likewise:
day60_velocity_payback ~ day30_velocity_payback
day120_velocity_payback ~ day60_velocity_payback
day180_velocity_payback ~ day120_velocity_payback
In my eventual model(s) that will live inside a Shiny App, I would like the user to be able to select a 'day from' and 'day to' from drop downs and then have the model run corresponding predictions.
This means I would need a model for each combination of 'day from' and 'day to', with the condition that 'day to' is always greater than 'day from'. I.e. fit a model to predict payback velocity in the future based on payback velocity in the past.
I can pivot my data frame:
## wider data
mydf_wide <- mydf %>%
  select(cohort, n, cohort_cost, tenure_days, payback_velocity) %>%
  group_by(cohort, n, cohort_cost) %>%
  pivot_wider(names_from = tenure_days,
              values_from = payback_velocity,
              names_prefix = 'velocity_day_')
mydf_wide %>% glimpse
Rows: 365
Columns: 369
Groups: cohort, n, cohort_cost [365]
$ cohort <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-05, 2019-01-06, 2019-01-07, 2019-01-08, 2019-01-09, 2019-01-10, 2019-01-11, 2019-01-12, 2019-01-13, 2019-01…
$ n <dbl> 996, 967, 1002, 977, 1087, 1006, 1035, 987, 931, 1018, 1031, 1020, 990, 1019, 999, 959, 1001, 989, 1061, 966, 968, 1003, 939, 1008, 994, 1016, 944, 917, 1021, 990, …
$ cohort_cost <dbl> 851.1910, 843.7070, 748.6089, 854.7634, 824.4729, 901.4816, 875.9514, 829.4255, 795.9618, 775.9743, 748.6413, 762.3107, 830.0364, 840.3704, 864.2655, 875.1085, 822.…
$ velocity_day_0 <dbl> 0.02, 0.02, 0.03, 0.02, 0.03, 0.02, 0.02, 0.03, 0.02, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.03, 0.…
$ velocity_day_1 <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.04, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.…
$ velocity_day_2 <dbl> 0.07, 0.07, 0.08, 0.07, 0.08, 0.07, 0.07, 0.07, 0.07, 0.08, 0.08, 0.08, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.08, 0.07, 0.07, 0.07, 0.07, 0.08, 0.…
$ velocity_day_3 <dbl> 0.09, 0.09, 0.11, 0.09, 0.10, 0.09, 0.09, 0.10, 0.10, 0.11, 0.11, 0.11, 0.10, 0.09, 0.09, 0.09, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.09, 0.10, 0.11, 0.…
$ velocity_day_4 <dbl> 0.12, 0.12, 0.13, 0.12, 0.12, 0.11, 0.12, 0.12, 0.12, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.11, 0.13, 0.12, 0.12, 0.12, 0.12, 0.13, 0.12, 0.12, 0.12, 0.12, 0.13, 0.…
$ velocity_day_5 <dbl> 0.14, 0.14, 0.16, 0.14, 0.15, 0.13, 0.14, 0.14, 0.15, 0.16, 0.16, 0.16, 0.15, 0.14, 0.14, 0.14, 0.15, 0.15, 0.14, 0.14, 0.15, 0.15, 0.15, 0.15, 0.14, 0.15, 0.16, 0.…
$ velocity_day_6 <dbl> 0.16, 0.16, 0.19, 0.16, 0.18, 0.16, 0.16, 0.17, 0.17, 0.19, 0.18, 0.19, 0.17, 0.17, 0.16, 0.16, 0.18, 0.18, 0.17, 0.17, 0.17, 0.18, 0.17, 0.17, 0.16, 0.17, 0.18, 0.…
$ velocity_day_7 <dbl> 0.19, 0.18, 0.21, 0.18, 0.20, 0.18, 0.19, 0.19, 0.19, 0.21, 0.21, 0.22, 0.19, 0.19, 0.18, 0.18, 0.20, 0.20, 0.20, 0.19, 0.20, 0.21, 0.20, 0.20, 0.19, 0.20, 0.21, 0.…
$ velocity_day_8 <dbl> 0.21, 0.21, 0.24, 0.21, 0.22, 0.20, 0.21, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.21, 0.21, 0.20, 0.22, 0.23, 0.22, 0.22, 0.22, 0.23, 0.22, 0.22, 0.22, 0.22, 0.23, 0.…
$ velocity_day_9 <dbl> 0.23, 0.23, 0.27, 0.23, 0.25, 0.22, 0.23, 0.24, 0.24, 0.26, 0.26, 0.27, 0.24, 0.24, 0.23, 0.23, 0.25, 0.25, 0.24, 0.24, 0.25, 0.26, 0.25, 0.25, 0.24, 0.25, 0.26, 0.…
$ velocity_day_10 <dbl> 0.26, 0.26, 0.29, 0.26, 0.27, 0.24, 0.26, 0.27, 0.27, 0.29, 0.29, 0.29, 0.27, 0.26, 0.26, 0.25, 0.27, 0.28, 0.27, 0.26, 0.27, 0.28, 0.27, 0.27, 0.26, 0.27, 0.29, 0.…
mydf_wide %>% head
# A tibble: 6 x 369
# Groups: cohort, n, cohort_cost [6]
cohort n cohort_cost velocity_day_0 velocity_day_1 velocity_day_2 velocity_day_3 velocity_day_4 velocity_day_5 velocity_day_6 velocity_day_7 velocity_day_8 velocity_day_9
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 996 851. 0.02 0.05 0.07 0.09 0.12 0.14 0.16 0.19 0.21 0.23
2 2019-01-02 967 844. 0.02 0.05 0.07 0.09 0.12 0.14 0.16 0.18 0.21 0.23
3 2019-01-03 1002 749. 0.03 0.05 0.08 0.11 0.13 0.16 0.19 0.21 0.24 0.27
4 2019-01-04 977 855. 0.02 0.05 0.07 0.09 0.12 0.14 0.16 0.18 0.21 0.23
5 2019-01-05 1087 824. 0.03 0.05 0.08 0.1 0.12 0.15 0.18 0.2 0.22 0.25
6 2019-01-06 1006 901. 0.02 0.05 0.07 0.09 0.11 0.13 0.16 0.18 0.2 0.22
With my data in this format I could fit the models manually, one by one:
# Models
## day 1 to day 2 velocity model
mod_velocity_d1_d2 <- lm(velocity_day_2 ~ velocity_day_1, mydf_wide)
## day 1 to day 3 velocity model
mod_velocity_d1_d3 <- lm(velocity_day_3 ~ velocity_day_1, mydf_wide)
## day 1 to day 4 velocity model
mod_velocity_d1_d4 <- lm(velocity_day_4 ~ velocity_day_1, mydf_wide)
## ...
## day 2 to day 3 velocity model
mod_velocity_d2_d3 <- lm(velocity_day_3 ~ velocity_day_2, mydf_wide)
## ...
## day 100 to day 365 velocity model
mod_velocity_d100_d365 <- lm(velocity_day_365 ~ velocity_day_100, mydf_wide)
Fitting these by hand would take a very long time.
Option 1: Is there a more elegant, less code-heavy way of doing this? Still many models, but perhaps fit in a loop?
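To make Option 1 concrete, here is a sketch of what I imagine, using a made-up toy stand-in for mydf_wide (not my real frame) and only days 1 to 3: one lm() per valid (from, to) pair, stored in a lookup tibble.

```r
library(tidyverse)

# Toy stand-in for mydf_wide: 10 cohorts, payback velocity at days 1-3
set.seed(1)
toy_wide <- tibble(
  velocity_day_1 = runif(10, 0.04, 0.06),
  velocity_day_2 = runif(10, 0.06, 0.08),
  velocity_day_3 = runif(10, 0.09, 0.11)
)

# Every valid (from, to) pair, then one lm() per pair via map2();
# reformulate() builds e.g. velocity_day_3 ~ velocity_day_1 from strings
models <- crossing(day_from = 1:3, day_to = 1:3) %>%
  filter(day_to > day_from) %>%
  mutate(model = map2(day_from, day_to, ~ lm(
    reformulate(paste0("velocity_day_", .x),
                response = paste0("velocity_day_", .y)),
    data = toy_wide
  )))

# Look up the fitted model for a chosen pair, e.g. day 1 -> day 3
mod_d1_d3 <- models %>%
  filter(day_from == 1, day_to == 3) %>%
  pull(model) %>%
  pluck(1)
```

With the full 0:365 range that would be roughly 67,000 pairs, so in the Shiny app I suspect I would fit only the selected pair on demand rather than precompute them all.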
Option 2: Or, is there some way to define one single model that takes tenure_days as an input and outputs the corresponding prediction? I.e. the target would be payback_velocity and the inputs would be the tenure day being predicted plus the payback velocity observed at some earlier tenure day, where that earlier day is itself variable. I put the day inside brackets below to denote that it varies. In this imaginary model, which I hope somehow exists, the input could be variable, e.g.
tenure_days=30 and payback_velocity_day[15] # predict day 30 payback based on day 15 payback
tenure_days=30 and payback_velocity_day[20] # predict day 30 payback based on day 20 payback
tenure_days=60 and payback_velocity_day[18] # predict day 60 payback based on day 18 payback
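Here is a toy sketch of the shape I imagine such a single pooled model could take (made-up data again, not my real frame): a self-join turns each cohort's series into every earlier-day/later-day observation pair, and then one lm() is fit over all the pairs.

```r
library(tidyverse)

# Toy stand-in for mydf: 20 cohorts, 30 tenure days, roughly linear payback
set.seed(2)
toy <- crossing(cohort = 1:20, tenure_days = 0:30) %>%
  mutate(payback_velocity = 0.025 * tenure_days + rnorm(n(), 0, 0.005))

# Self-join: every (earlier day -> later day) pair within each cohort
# (relationship = "many-to-many" needs dplyr >= 1.1; drop it on older versions)
pairs_df <- inner_join(toy, toy, by = "cohort", suffix = c("_from", "_to"),
                       relationship = "many-to-many") %>%
  filter(tenure_days_to > tenure_days_from)

# One pooled model: future velocity from past velocity plus both day indices
mod_all <- lm(payback_velocity_to ~ payback_velocity_from +
                tenure_days_from + tenure_days_to,
              data = pairs_df)

# Predict day-30 payback from an observed day-15 payback of 0.38
predict(mod_all, tibble(payback_velocity_from = 0.38,
                        tenure_days_from = 15, tenure_days_to = 30))
```

One model then serves every (from, to) combination the drop-downs could produce, at the cost of assuming a single functional form across all pairs.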
Is this possible?
What's a good way to approach this problem? My Option 1, many models fit in a loop? Or does an Option 2 approach exist?