This post is conversational, hope that's allowed here.

I'm unable to share data for this. I would like to be able to generate example data but that would imply I already understand the underlying relationship between predictor and target, whereas that's what I'm trying to figure out.

I'm doing some exploratory analysis as part of a project to create a simple logistic regression to predict renewal of users on annual plans in a subscription context.

For each subscriber, we track login activity on each day of the plan (1:365). Using historical data, I made a box plot of total login count by renewal status (1 = renewed).

Users who renew do indeed log in more, but the separation isn't great. I wondered whether logins that happen closer to renewal time are more significant than those just after sign-up, so I applied a decay function to logins based on day of plan 1:365:

```
library(tidyverse)

lambdas <- c(0.01, 0.02, 0.03)

pdata |>
  # drop day-level rows more than 3 SD from the mean login count
  filter(abs(Logins - mean(Logins)) <= 3 * sd(Logins)) |>
  # weight grows with Tenure, i.e. it decays with days remaining until renewal
  mutate(
    data.frame(Logins * exp(outer(Tenure, lambdas))) |>
      setNames(str_c("lambda_logins_", lambdas))
  ) |>
  group_by(UserId, renewed_year1) |>
  # contains() ignores case, so this sums raw Logins and every lambda_logins_* column
  summarise(across(contains("Logins"), sum, .names = "sum_{.col}"), .groups = "drop") |>
  pivot_longer(cols = contains("Logins"), names_to = "name", values_to = "value") |>
  ggplot(aes(x = as.factor(renewed_year1), y = value, fill = as.factor(renewed_year1))) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~name, scales = "free_y", ncol = 2) +
  scale_y_log10() +
  theme_minimal() +
  labs(title = "Decayed Login Activity Lambda", x = "Renewed", y = "Decayed Logins")
```

Just eyeballing the box plots, a decay lambda of 0.02 seems to improve separation a little.
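A minimal sketch of how separation could be quantified rather than eyeballed, assuming the same columns as the code above (`UserId`, `Tenure`, `Logins`, `renewed_year1`). The helper names `auc_rank` and `auc_for_lambda` are made up for illustration. It scores each lambda by a rank-based AUC (the Mann-Whitney statistic scaled to 0-1), where 0.5 means no separation and 1 means perfect separation:

```r
# Rank-based AUC: probability that a random renewer scores higher than a
# random churner (ties count as half).
# score: one numeric feature value per user; label: 0/1 renewal flag per user.
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Weighted login total per user for a given lambda, then its AUC.
# Assumes one row per user per day, with Tenure as the day of plan.
auc_for_lambda <- function(df, lambda) {
  df$weighted <- df$Logins * exp(lambda * df$Tenure)
  per_user <- aggregate(weighted ~ UserId + renewed_year1, data = df, FUN = sum)
  auc_rank(per_user$weighted, per_user$renewed_year1)
}

# Hypothetical usage on the real data (lambda = 0 is the raw login count):
# sapply(c(0, 0.01, 0.02, 0.03), auc_for_lambda, df = pdata)
```

Comparing against lambda = 0 gives a baseline, so any gain from weighting later logins shows up as a number rather than a visual impression.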

I then looked at the logins for each user per day expressed as a % of their logins throughout the year:

```
pdata |>
  group_by(UserId) |>
  # each day's logins as a share of that user's total logins for the year
  mutate(Logins_Pct = Logins / sum(Logins)) |>
  ungroup() |>
  mutate(renewed_year1 = factor(renewed_year1)) |>
  group_by(Tenure, renewed_year1) |>
  # average share per day, separately for renewers and churners
  summarise(Avg_Pct_Logins = mean(Logins_Pct), .groups = "drop") |>
  ggplot(aes(x = Tenure, y = Avg_Pct_Logins, color = renewed_year1)) +
  geom_line() +
  theme_minimal() +
  labs(title = "Trend of Login % Distribution in Year 1", x = "Day", y = "Login %")
```

I thought I had found something when viewing this plot. The red line (churners) has higher peaks, while renewers log in more regularly throughout the year, which makes sense.

To try to use this finding, I computed each user's standard deviation of daily login % over the year and box-plotted it by renewal status, hoping to improve separation further:

```
pdata |>
  # drop day-level rows more than 3 SD from the mean login count
  filter(abs(Logins - mean(Logins)) <= 3 * sd(Logins)) |>
  group_by(UserId) |>
  mutate(Logins_Pct = Logins / sum(Logins)) |>
  group_by(renewed_year1, .add = TRUE) |>
  # one SD of daily login share per user
  summarise(SD_Logins_Pct = sd(Logins_Pct), .groups = "keep") |>
  ggplot(aes(x = as.factor(renewed_year1), y = SD_Logins_Pct, fill = as.factor(renewed_year1))) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Standard Deviation of Login % Over Year 1", x = "Renewed", y = "Standard Deviation Login %")
```

Disappointment. I thought I was onto something, but the standard deviation of each user's daily login % over the year doesn't separate renewers from churners well. This seems to contradict what I saw in the previous plot of login % trend. Maybe I'm missing something.
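One thing that might be worth checking here, offered as a sketch and not a diagnosis: `sd()` does not care about the order of the days, so a per-user SD cannot tell early-heavy activity from late-heavy activity. The trend plot, by contrast, averages across users day by day, so its peaks reflect many users being active on the same days. The two users below are made up purely to illustrate the point, and `mean_day` is a hypothetical timing-aware summary:

```r
# Two made-up users with the same 12 logins, placed early vs late in the year.
days  <- 1:365
early <- as.numeric(days <= 12)    # one login per day on days 1-12
late  <- as.numeric(days >= 354)   # one login per day on days 354-365

# Daily logins as a share of the user's yearly total
pct <- function(x) x / sum(x)

# Same SD of daily login share, because sd() ignores ordering
sd_early <- sd(pct(early))
sd_late  <- sd(pct(late))

# A timing-aware summary: login-weighted mean day of the plan
mean_day <- function(x) sum(days * pct(x))

c(sd_early = sd_early, sd_late = sd_late,
  mean_day_early = mean_day(early), mean_day_late = mean_day(late))
```

Both users get an identical SD even though one is active only at sign-up and the other only near renewal, while the weighted mean day differs between them.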

Has anyone 'conquered' this kind of analysis before? What are some other approaches? I'm still convinced that not all logins are equal, and that recent logins are more meaningful (the lambda of 0.02 seems to hint at this). The trend chart of daily login % also shows more peaks, i.e. less evenly distributed logins over the year, for churners, but I seem unable to turn that into a feature that separates the two groups.

Suggestions welcome