Does anyone have any recommendations for efficiently recoding data in a tibble?
I regularly work with coded values that need to be converted into human-readable names. I like the explicitness that dplyr::recode()
provides, but it becomes cumbersome if there are many different coded values and it can be slow with large datasets (see the benchmark()
tests below).
An alternative I sometimes use is to create a named vector of the human-readable values and assign the code values to the names attribute. This works well if there's a 1:1 relationship between codes and human-readable values and it's faster than the recode
method, but it's still slow enough that it has become a bottleneck for my daily workflow.
How do other folks tackle this type of problem?
Reprex
suppressWarnings(library(tidyverse))
suppressWarnings(library(rbenchmark))
# object to decode and the key
tbl <- tibble(CODE = sample(letters[1:3], 1e+06, replace = TRUE))
key <- tribble(~CODE, ~FRUIT,
"a", "apple",
"b", "banana",
"c", "cherry"
)
# Alternative One: dplyr::recode
mutate(tbl, FRUIT = recode(CODE, a = "apple", b = "banana", c = "cherry"))
#> # A tibble: 1,000,000 x 2
#> CODE FRUIT
#> <chr> <chr>
#> 1 c cherry
#> 2 b banana
#> 3 a apple
#> 4 b banana
#> 5 c cherry
#> 6 c cherry
#> 7 b banana
#> 8 c cherry
#> 9 a apple
#> 10 a apple
#> # ... with 999,990 more rows
# Alternative Two: named vector
# create a named vector from the key object and use it to recode
named_vector <-
key %>%
pmap(~set_names(..2, ..1)) %>%
unlist()
mutate(tbl, FRUIT = named_vector[CODE])
#> # A tibble: 1,000,000 x 2
#> CODE FRUIT
#> <chr> <chr>
#> 1 c cherry
#> 2 b banana
#> 3 a apple
#> 4 b banana
#> 5 c cherry
#> 6 c cherry
#> 7 b banana
#> 8 c cherry
#> 9 a apple
#> 10 a apple
#> # ... with 999,990 more rows
# Speed Tests
# Alt 1
benchmark(mutate(tbl, FRUIT = recode(CODE, a = "apple", b = "banana", c = "cherry")),
columns = c("replications", "elapsed", "relative", "user.self", "sys.self"))
#> replications elapsed relative user.self sys.self
#> 1 100 27.17 1 23.44 3.65
# Alt 2
benchmark(mutate(tbl, FRUIT = named_vector[CODE]), columns = c("replications",
"elapsed", "relative", "user.self", "sys.self"))
#> replications elapsed relative user.self sys.self
#> 1 100 9.67 1 8.43 1.25