repeating operations with date

help · August 17, 2022, 11:25pm

hi,
below is my code. I create several ages based on variables, and then I create a new mutated variable based on "buckets". I am wondering is there a more efficient way to do this than I currently do. thank uu!

data %>%
mutate(current_age =  birthdate - Sys.Date(),
signed_age = birthdate - signed_date,
married_age = birthdate - married_date,
prom_age = birthdate - prom_date) %>% 
mutate(current_age_bucket = 
case_when(current_age < 20 ~ "Under 20",  
current_age >= 20  & current_age < 30 ~ "20-29",
...
...
..),
prom_age_bucket =  prom_age < 20 ~ "Under 20",  
prom_age >= 20  & prom_age < 30 ~ "20-29",
prom_age >= 30 & prom_age < 40 ~ "30-39,
...
..
)

FactOREO · August 18, 2022, 9:13am

Hello,

just out of curiosity:

How can your code be meaningful in this way? E.g. current_age = birthdate - Sys.Date(), unless your birthdate entry is in the future, your value would be negative.

Looking at this post from SO: r - How to reconcile purrr::map with case_when - Stack Overflow, it seems better to use cut for your request. I will use positive date values for the reprex, since negative Dates don't seem logical for me.

library(tidyverse)

data <- data.frame(
  id = 1:100,
  current_age = sample(11:100,size = 100, replace = TRUE),
  signed_age = sample(11:100, size = 100, replace = TRUE))
# to get a [0,20] category
breaks <- c(0,seq.default(20,100,10))
# define labels
labels <- c('Under 20','20-29','30-39','40-49','50-59','60-69','70-79','80-89','Over 90')
# longer way (but preferable I think if you really need the wide format)
# is faster as well, since no pivot_longer or pivot_wider happens
data |>
  mutate(current_age_bucket = cut(x = current_age,
                                  breaks = breaks,
                                  labels = labels,
                                  right = FALSE,
                                  include.lowest = TRUE),
         signed_age_bucket = cut(x = signed_age,
                                 breaks = breaks,
                                 labels = labels,
                                 right = FALSE,
                                 include.lowest = TRUE)) |>
  head()
#>   id current_age signed_age current_age_bucket signed_age_bucket
#> 1  1          60         25              60-69             20-29
#> 2  2          50         69              50-59             60-69
#> 3  3          80         53              80-89             50-59
#> 4  4          48         16              40-49          Under 20
#> 5  5          74         44              70-79             40-49
#> 6  6          33         30              30-39             30-39
# probably better way regarding coding, but transforming into wide format is somewhat bad
data |>
  pivot_longer(cols = c('current_age','signed_age'),
               values_to = 'age', names_to = 'type_age') |>
  mutate(
    bucket = cut(x = age, breaks = breaks, labels = labels, right = FALSE, include.lowest = TRUE)
  ) |>
  select(id, type_age, bucket) |>
  pivot_wider(names_from = 'type_age', values_from = 'bucket') |>
  head()
#> # A tibble: 6 × 3
#>      id current_age signed_age
#>   <int> <fct>       <fct>     
#> 1     1 60-69       20-29     
#> 2     2 50-59       60-69     
#> 3     3 80-89       50-59     
#> 4     4 40-49       Under 20  
#> 5     5 70-79       40-49     
#> 6     6 30-39       30-39

rbenchmark::benchmark(
  option1 = data |>
    mutate(current_age_bucket = cut(x = current_age,
                                    breaks = breaks,
                                    labels = labels,
                                    right = FALSE,
                                    include.lowest = TRUE),
           signed_age_bucket = cut(x = signed_age,
                                   breaks = breaks,
                                   labels = labels,
                                   right = FALSE,
                                   include.lowest = TRUE)),
  option2 = data |>
    pivot_longer(cols = c('current_age','signed_age'),
                 values_to = 'age', names_to = 'type_age') |>
    mutate(
      bucket = cut(x = age, breaks = breaks, labels = labels, right = FALSE, include.lowest = TRUE)
    ) |>
    select(id, type_age, bucket) |>
    pivot_wider(names_from = 'type_age', values_from = 'bucket')
)
#>      test replications elapsed relative user.self sys.self user.child sys.child
#> 1 option1          100    0.44    1.000      0.28     0.06         NA        NA
#> 2 option2          100    4.44   10.091      3.62     0.15         NA        NA

^{Created on 2022-08-18 by the reprex package (v2.0.1)}

As you see, option 1 is faster but also more typing. Option 2 has the benefit of just writing cut once. But you will have trouble retaining the original age you created from the first mutate call, I think.

Hope this answers your question.

Kind regards

Edit:

I was assuming you have valid age entries, but if you only have time differences in days (as expected with date2 - date1 in the mutate call), you will need to do somethink like

lubridate::time_length(difftime(Sys.Date(), birthdate), "years")

to get a valid numeric age output first, which can than be handled by the above cut specification.

system · September 8, 2022, 9:14am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.