Hello,
just out of curiosity:
How can your code be meaningful in this way? E.g. current_age = birthdate - Sys.Date()
, unless your birthdate
entry is in the future, your value would be negative.
Looking at this post from SO: r - How to reconcile purrr::map with case_when - Stack Overflow, it seems better to use cut
for your request. I will use positive date values for the reprex, since negative Dates don't seem logical for me.
library(tidyverse)
data <- data.frame(
id = 1:100,
current_age = sample(11:100,size = 100, replace = TRUE),
signed_age = sample(11:100, size = 100, replace = TRUE))
# to get a [0,20] category
breaks <- c(0,seq.default(20,100,10))
# define labels
labels <- c('Under 20','20-29','30-39','40-49','50-59','60-69','70-79','80-89','Over 90')
# longer way (but preferable I think if you really need the wide format)
# is faster as well, since no pivot_longer or pivot_wider happens
data |>
mutate(current_age_bucket = cut(x = current_age,
breaks = breaks,
labels = labels,
right = FALSE,
include.lowest = TRUE),
signed_age_bucket = cut(x = signed_age,
breaks = breaks,
labels = labels,
right = FALSE,
include.lowest = TRUE)) |>
head()
#> id current_age signed_age current_age_bucket signed_age_bucket
#> 1 1 60 25 60-69 20-29
#> 2 2 50 69 50-59 60-69
#> 3 3 80 53 80-89 50-59
#> 4 4 48 16 40-49 Under 20
#> 5 5 74 44 70-79 40-49
#> 6 6 33 30 30-39 30-39
# probably better way regarding coding, but transforming into wide format is somewhat bad
data |>
pivot_longer(cols = c('current_age','signed_age'),
values_to = 'age', names_to = 'type_age') |>
mutate(
bucket = cut(x = age, breaks = breaks, labels = labels, right = FALSE, include.lowest = TRUE)
) |>
select(id, type_age, bucket) |>
pivot_wider(names_from = 'type_age', values_from = 'bucket') |>
head()
#> # A tibble: 6 × 3
#> id current_age signed_age
#> <int> <fct> <fct>
#> 1 1 60-69 20-29
#> 2 2 50-59 60-69
#> 3 3 80-89 50-59
#> 4 4 40-49 Under 20
#> 5 5 70-79 40-49
#> 6 6 30-39 30-39
rbenchmark::benchmark(
option1 = data |>
mutate(current_age_bucket = cut(x = current_age,
breaks = breaks,
labels = labels,
right = FALSE,
include.lowest = TRUE),
signed_age_bucket = cut(x = signed_age,
breaks = breaks,
labels = labels,
right = FALSE,
include.lowest = TRUE)),
option2 = data |>
pivot_longer(cols = c('current_age','signed_age'),
values_to = 'age', names_to = 'type_age') |>
mutate(
bucket = cut(x = age, breaks = breaks, labels = labels, right = FALSE, include.lowest = TRUE)
) |>
select(id, type_age, bucket) |>
pivot_wider(names_from = 'type_age', values_from = 'bucket')
)
#> test replications elapsed relative user.self sys.self user.child sys.child
#> 1 option1 100 0.44 1.000 0.28 0.06 NA NA
#> 2 option2 100 4.44 10.091 3.62 0.15 NA NA
Created on 2022-08-18 by the reprex package (v2.0.1)
As you see, option 1 is faster but also more typing. Option 2 has the benefit of just writing cut
once. But you will have trouble retaining the original age you created from the first mutate
call, I think.
Hope this answers your question.
Kind regards
Edit:
I was assuming you have valid age entries, but if you only have time differences in days (as expected with date2 - date1 in the mutate
call), you will need to do somethink like
lubridate::time_length(difftime(Sys.Date(), birthdate), "years")
to get a valid numeric
age output first, which can than be handled by the above cut
specification.