dplyr mutate gives NA values

sueb · December 2, 2018, 2:22pm

I'm trying to calculate the dates between purchases and then the next expected date of purchase. I'm not sure how to deal with cases when it's the first purchase, the code currently gives NA which is accurate as you can't work out previous purchase if it's the first one. Instead of NA I'd prefer it to give the date of purchase. I've been trying to use case when but got stuck. Any suggestions?

dashboard <- dashboard %>%
  mutate(Date_Purchased =  as.Date(dashboard$Date_Purchased)) %>%
  group_by(Customer_Email) %>%
  arrange(Date_Purchased) %>%
  mutate(previousPurchaseDate = lag(Date_Purchased,1), 
         difference = Date_Purchased - previousPurchaseDate, 
         expectedNextPurchase = Date_Purchased + difference)

Huge thanks if you can help

mara · December 2, 2018, 4:01pm

So, here's a small reprex with dummy data, since (obviously) I don't have yours (FFR, making a self-contained reproducible example is super helpful), which is one way you could do this (I'm sure there are many more).

I created an intermediary lpurchase with lag(), and then used case_when() using is.na() as the logical predicate.

You don't actually need to separate out the mutate() calls (you can do multiple mutations inside of one mutate), I just wanted to give it a bit more visual clarity.

library(tidyverse)

dat <- tibble::tribble(
          ~purchase_date, ~product,
            "2017-12-17",  "apple",
            "2017-12-22", "banana",
            "2017-12-21", "banana",
            "2017-12-21", "carrot",
            "2017-11-29", "banana",
            "2017-12-18", "carrot",
            "2017-12-05",  "apple",
            "2017-12-20", "banana",
            "2017-12-19", "carrot"
          )

dat <- dat %>%
  mutate(purchase_date = lubridate::ymd(purchase_date))

dat %>%
  group_by(product) %>%
  arrange(purchase_date, .by_group = TRUE) %>%
  mutate(lpurchase = dplyr::lag(purchase_date)) %>%
  mutate(
    mod_last = case_when(
      is.na(lpurchase) ~ purchase_date,
      TRUE ~ dplyr::lag(purchase_date)
    )
  )
#> # A tibble: 9 x 4
#> # Groups:   product [3]
#>   purchase_date product lpurchase  mod_last  
#>   <date>        <chr>   <date>     <date>    
#> 1 2017-12-05    apple   NA         2017-12-05
#> 2 2017-12-17    apple   2017-12-05 2017-12-05
#> 3 2017-11-29    banana  NA         2017-11-29
#> 4 2017-12-20    banana  2017-11-29 2017-11-29
#> 5 2017-12-21    banana  2017-12-20 2017-12-20
#> 6 2017-12-22    banana  2017-12-21 2017-12-21
#> 7 2017-12-18    carrot  NA         2017-12-18
#> 8 2017-12-19    carrot  2017-12-18 2017-12-18
#> 9 2017-12-21    carrot  2017-12-19 2017-12-19

^{Created on 2018-12-02 by the reprex package (v0.2.1.9000)}

sueb · December 2, 2018, 4:20pm

That's brilliant maria and thank you. I take on board your point about creating a reprex with dummy data, will do next time. Still quite new to this

What does the .by_group = TRUE mean? is that instead of writing group_by and then arrange?

mara · December 2, 2018, 4:46pm

From the arrange() docs:

.by_group If TRUE, will sort first by grouping variable. Applies to grouped data frames only.

So, this way, I'm arranging within the grouped variable (of product).

On the contrary, it only works with a grouped data frame!

sueb · December 2, 2018, 5:06pm

Oh I see, I thought as it was piped in from the group_by it would automatically do that, should I add this to my code to ensure this.
Really appreciate the help,
Sue

system · December 23, 2018, 5:20pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.