Please help me bug fix

naja · November 28, 2019, 1:04pm

Hi
So I have some final bug fixing for a specific part of my code that I am trying to figure out. I have attached an anonymized subset of the data I am working with so you can generate the problem yourselves. The data can be downloaded from this link:

I run the following code to create a retention mixed panel for my data. The task is to see how many donors to my company drop out of their monthly installment plans.

As you see in my final panel (see photo attached) there are two problems:

Cohort nr. 6 (I.e. customers signed up in the month of June) is missing from the count. I suspect the bug in this problem is under the section where I create the cohort.
At the end of each row the final number adds up. I can’t figure out why as it is measuring the dropout rate. How do I remove this final count in each row?

I really appreciate the help as these two bugs have been bothering me for days now.

Many Thanks,
Naja

My lines of code are:

#Question to rstudio community 

#loading packages

library(foreign) 
library(DT)
library(tidyverse)
library(reshape2)
library(ggplot2)
library(readxl)
library(stringr)
library(dplyr)
library(lubridate)
library(sqldf)

#loading data

donations2018 <- read_xlsx ("trial_sample2.xlsx")
###################################################
#Preprocessing dates
# Converting the Payment.date column into a date object
donations2018$Payment.date <- as.Date(donations2018$Payment.date, format = "%Y/%m/%d")

#Converting Start.date Column into a date object
donations2018$Start.date <- as.Date(donations2018$Start.date, format = "%Y/%m/%d")

# Setting up a column for the year data to make it easier to focus on specific year
donations2018$Year <- as.numeric(format(donations2018$Payment.date, '%Y'))

#Setting up a colum for only month
donations2018$Month <- as.numeric(format(donations2018$Payment.date, '%m'))




####################################################################################################
#Creating the cohorts

# Getting the purchase frequencies per Contact_number
order.frequency <- as.data.frame(table(donations2018$Contact_number)) 
colnames(order.frequency)[1] <- "Cust.Id"


# Getting the first transaction dates for each customer
join.date <- aggregate(Payment.date~Contact_number,donations2018,min, na.rm = TRUE)


# Changing the name of the column Payment.date to First_Payment
# since this is the first transaction date for each customer
colnames(join.date)[2] <- "First_Payment"

# Merge the Join date data to the donations2018 data frame
donations2018 <- merge(donations2018, join.date, by.x = "Contact_number",by.y = "Contact_number", all.x = TRUE)

# Merge the frequency data to the donations2018 data frame
donations2018 <- merge(donations2018, order.frequency, by.x = "Contact_number",by.y = "Cust.Id", all.x = TRUE) 

# Creating the groups/Cohorts based on the join date month
donations2018$Cohort <- as.numeric(format(donations2018$First_Payment, "%m"))

rm(join.date)

####################################################################################################
#Cohort Age
# Calculating the difference in days between the Payment.date column by join date column
# There is no option for month, but getting the month from the days is simple division
donations2018$Age_by_Day <- as.numeric(difftime(donations2018$Payment.date,donations2018$First_Payment, units = c("days")))

#age by month
#Transform dates to date class
donations2018 = donations2018 %>% mutate(Payment.date= as.Date(Payment.date, format = "%Y-%m-%d"), 
                                         First_Payment = as.Date(First_Payment, format = "%Y-%m-%d"))

#Calculate the difference in months
donations2018 = donations2018 %>% mutate(Age_by_Month = (year(Payment.date) - year(First_Payment))*12 +
                                           (month(Payment.date) - month(First_Payment)))

# Dumping the day element from the join date column
donations2018$First_Payment <- format(donations2018$First_Payment, "%Y-%m")

# Now we remove the day element from the Payment.date data since
# this Cohort Analysis is based on monthly activity.
donations2018$Payment.date <- format(donations2018$Payment.date, "%Y-%m")

# We relabel the cohort column data to something more intuitive for the sake
# of the report consumers, then factor them since these are sequential
groups <- c("Jan Cohorts",
            "Feb Cohorts",
            "Mar Cohorts",
            "Apr Cohorts",
            "May Cohorts",
            "Jun Cohorts",
            "Jul Cohorts",
            "Aug Cohorts",
            "Sep Cohorts",
            "Oct Cohorts",
            "Nov Cohorts",
            "Dec Cohorts")

for(i in 1:12){
  donations2018[donations2018$Cohort==i,"Cohort"] <- groups[i]
}
rm(i,groups)
donations2018

donations2018$Cohort <- factor(donations2018$Cohort,ordered = T,levels =c("Jan Cohorts",
                                                                          "Feb Cohorts",
                                                                          "Mar Cohorts",
                                                                          "Apr Cohorts",
                                                                          "May Cohorts",
                                                                          "Jun Cohorts",
                                                                          "Jul Cohorts",
                                                                          "Aug Cohorts",
                                                                          "Sep Cohorts",
                                                                          "Oct Cohorts",
                                                                          "Nov Cohorts",
                                                                          "Dec Cohorts"))



####################################################################################################
# Creating rows for each cohort group
#Creating columns for each value in the Age_by_Month column;0-11
# The default aggregation setup for dcast is, fun.aggregate = length
cohorts.wide <- reshape2::dcast(donations2018,Cohort~Age_by_Month,
                                value.var="Contact_number",
                                fun.aggregate = length)


#removing january variable: 
#cohorts.wide <- cohorts.wide [-c(1),] 

# Cloning the output for retention and churn mixpanels
# to be used later
cw.retention <- cohorts.wide
cw.churn <- cohorts.wide

#making cohorts numeric

cohorts.wide$Cohort <- as.numeric(cohorts.wide$Cohort)


#cohort 1 = January Cohort 12 = december

# Creating 19 breaks and 20 rgb color values ranging from blue to white
breaks <- quantile(cohorts.wide[,], probs = seq(.05, .95, .05), na.rm = TRUE)

colors <- sapply(round(seq(155, 80, length.out = length(breaks) + 1), 0),
                 function(x){ rgb(x,x,155, maxColorValue = 155) } )


# The Retention Mixpanel with counts
DT::datatable(cohorts.wide,
              class = 'cell-border stripe',
              rownames = FALSE,
              options = list(
                ordering=F,
                dom = 't',
                pageLength = 12) ) %>%
  formatStyle("0",
              backgroundColor = 'lightgrey',
              fontWeight = 'bold') %>%
  formatStyle(names(cohorts.wide[c(-1,-2)]),fontWeight = 'bold',color = 'white', backgroundColor = styleInterval(breaks,colors))

###############################################################################################

andresrcs · November 28, 2019, 1:17pm

Where the sample data is? I can't manage to find it.
To help us help you, could you please turn this into a minimal self-contained reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

naja · November 28, 2019, 1:30pm

Hi thank you!
I tried to attach an excel file but maybe that's not allowed. My data frame is quite large as i am working with matrix data and need at least a few observations in each cell. I've distilled it to 3000 rows, but i guess that's too big to paste in here. What do people normally do in these situations?
Many Thanks,
Naja

andresrcs · November 28, 2019, 1:32pm

You could upload your file to a cloud storage service and make the reprex reading the data from a URL link.

naja · November 28, 2019, 1:35pm

Awesome my reprex can be downloaded from this link:

andresrcs · November 28, 2019, 10:32pm

I think you have to check your expectations because the results are consistent with your data

There is no Cohort 6 because there are no customers that signed up in the month of June

That is not a final count, that is the number of clients by Cohort and Age_by_Month

To help you improve your chances of getting help, I'm going to make a smaller self-contained reprex for you.

library(tidyverse)
library(readxl)
library(lubridate)

url <- "https://drive.google.com/uc?authuser=0&id=1rV12XIUnMy6P5tuKn2cNOXuQBaneP_d-&export=download"
download.file(url, destfile = "raw_data.xlsx")

read_xlsx("raw_data.xlsx") %>%
    group_by(Contact_number) %>%
    mutate(First_Payment = min(Payment.date),
           Cohort = factor(str_to_title(paste(month(First_Payment, label = TRUE, abbr = TRUE), "Cohorts")),
                           ordered = T, levels =c("Jan Cohorts",
                                                  "Feb Cohorts",
                                                  "Mar Cohorts",
                                                  "Apr Cohorts",
                                                  "May Cohorts",
                                                  "Jun Cohorts",
                                                  "Jul Cohorts",
                                                  "Aug Cohorts",
                                                  "Sep Cohorts",
                                                  "Oct Cohorts",
                                                  "Nov Cohorts",
                                                  "Dec Cohorts")),
           Age_by_Month = (year(Payment.date) - year(First_Payment)) * 12 +
               (month(Payment.date) - month(First_Payment)),
           First_Payment = format(First_Payment, "%Y-%m"),
           Payment.date = format(Payment.date, "%Y-%m")) %>%
    pivot_wider(id_cols = Cohort,
                names_from = Age_by_Month,
                values_from = Contact_number,
                values_fn = list(Contact_number = length)) %>%
    select(Cohort, num_range("", 0:20)) %>% 
    arrange(Cohort)
#> # A tibble: 10 x 22
#>    Cohort   `0`   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
#>    <ord>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1 Feb C…    11    11    11    10    10    10    10     9     8     8     8
#>  2 Mar C…    30    29    26    23    23    22    21    20    20    19    19
#>  3 Apr C…    15    14    13    13    13    13    13    13    13    13    13
#>  4 May C…    47    47    43    41    40    34    29    28    27    26    26
#>  5 Jul C…    12    11    10    10    10    10    10     9     9     9     9
#>  6 Aug C…    33    30    31    29    28    26    26    26    26    26    25
#>  7 Sep C…    30    30    29    25    24    24    24    24    23    23    22
#>  8 Oct C…    16    16    15    14    13    13    13    12    12    12    12
#>  9 Nov C…    23    23    23    21    20    18    17    17    17    16    16
#> 10 Dec C…    16    16    16    16    16    15    15    14    14    14    26
#> # … with 10 more variables: `11` <int>, `12` <int>, `13` <int>, `14` <int>,
#> #   `15` <int>, `16` <int>, `17` <int>, `18` <int>, `19` <int>, `20` <int>

^{Created on 2019-11-28 by the reprex package (v0.3.0.9000)}

system · December 19, 2019, 10:32pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.