Quickly recoding large(ish) vectors in a tibble

tiernan · November 3, 2017, 4:27pm

Does anyone have any recommendations for efficiently recoding data in a tibble?

I regularly work with coded values that need to be converted into human-readable names. I like the explicitness that dplyr::recode() provides, but it becomes cumbersome if there are many different coded values and it can be slow with large datasets (see the benchmark() tests below).

An alternative I sometimes use is to create a named vector of the human-readable values and assign the code values to the names attribute. This works well if there's a 1:1 relationship between codes and human-readable values and it's faster than the recode method, but it's still slow enough that it has become a bottleneck for my daily workflow.

How do other folks tackle this type of problem?

Reprex

suppressWarnings(library(tidyverse))  
suppressWarnings(library(rbenchmark))

# object to decode and the key

tbl <- tibble(CODE = sample(letters[1:3], 1e+06, replace = TRUE))

key <- tribble(~CODE,   ~FRUIT, 
                 "a",  "apple", 
                 "b", "banana", 
                 "c",  "cherry"
               )

# Alternative One: dplyr::recode

mutate(tbl, FRUIT = recode(CODE, a = "apple", b = "banana", c = "cherry")) 
#> # A tibble: 1,000,000 x 2
#>     CODE  FRUIT
#>    <chr>  <chr>
#>  1     c cherry
#>  2     b banana
#>  3     a  apple
#>  4     b banana
#>  5     c cherry
#>  6     c cherry
#>  7     b banana
#>  8     c cherry
#>  9     a  apple
#> 10     a  apple
#> # ... with 999,990 more rows

# Alternative Two: named vector

# create a named vector from the key object and use it to recode
named_vector <- 
    key %>% 
    pmap(~set_names(..2, ..1)) %>% 
    unlist()

mutate(tbl, FRUIT = named_vector[CODE])
#> # A tibble: 1,000,000 x 2
#>     CODE  FRUIT
#>    <chr>  <chr>
#>  1     c cherry
#>  2     b banana
#>  3     a  apple
#>  4     b banana
#>  5     c cherry
#>  6     c cherry
#>  7     b banana
#>  8     c cherry
#>  9     a  apple
#> 10     a  apple
#> # ... with 999,990 more rows

# Speed Tests

# Alt 1
benchmark(mutate(tbl, FRUIT = recode(CODE, a = "apple", b = "banana", c = "cherry")), 
  columns = c("replications", "elapsed", "relative", "user.self", "sys.self"))
#>   replications elapsed relative user.self sys.self
#> 1          100   27.17        1     23.44     3.65

# Alt 2

benchmark(mutate(tbl, FRUIT = named_vector[CODE]), columns = c("replications", 
  "elapsed", "relative", "user.self", "sys.self"))
#>   replications elapsed relative user.self sys.self
#> 1          100    9.67        1      8.43     1.25

Frank · November 3, 2017, 5:07pm

I always do left_join(tbl, key) or the data.table equivalent.

I don't know if it wins as far as efficiency, but I find the code easy to follow. Also it beats named vectors and dplyr::recode for generality, since those require your code be a string, while it might instead be an integer or even a pair of columns of mixed types.

tiernan · November 3, 2017, 5:18pm

Good idea - I can't believe I never thought to use left_join for this purpose.

And it looks like it outperforms the named vector approach in this toy example:

benchmark(left_join(tbl, key, by = "CODE"), 
          columns = c("replications", "elapsed", "relative", "user.self", "sys.self"))
#>  replications elapsed relative user.self sys.self
#>1          100   6.575        1       6.5    0.072

martin.R · November 3, 2017, 6:14pm

If speed is of the utmost performance then you should consider using data.table (as already mentioned by Frank), particularly with keys. For very large data sets the performance is highly efficient and it has many non-equi join options should they be required.

pgensler · November 4, 2017, 5:25am

I think dplyr's case_when is more readable, than recode, but still just as slow:


suppressWarnings(library(tidyverse))  
suppressWarnings(library(rbenchmark))


# object to decode and the key

tbl <- tibble(CODE = sample(letters[1:3], 1e+06, replace = TRUE))

benchmark(
rev <- tbl %>%
  mutate(type =
           case_when(
             CODE == "a" ~ "apple",
             CODE == "b" ~ "banana",
             CODE == "c" ~ "cherry",
             FALSE ~ "no fruit"
           )
)
)

test
1 rev <- tbl %>% mutate(type = case_when(CODE == "a" ~ "apple", CODE == "b" ~ "banana", CODE == "c" ~ "cherry", FALSE ~ "no fruit"))
  replications elapsed relative user.self sys.self user.child sys.child
1          100  26.552        1    22.443    3.947          0         0

Maybe even using purr's map_df to apply a custom function like recode could be faster?

davis · November 4, 2017, 12:51pm

I think you'll get a 2x speed up or so from using factors. At least in the case of using a recode, it should only have to change the 3 labels on the levels of the factors, rather than altering every element of the vector.

suppressWarnings(library(tidyverse))  
suppressWarnings(library(microbenchmark))

# object to decode and the key

tbl <- tibble(CODE = sample(letters[1:3], 1e+06, replace = TRUE))

key <- tribble(~CODE,   ~FRUIT, 
               "a",  "apple", 
               "b", "banana", 
               "c",  "cherry"
)

tbl_with_factor <- mutate(tbl, CODE = as.factor(CODE))
key_with_factor <- mutate(key, CODE = as.factor(CODE))

# Speed tests

microbenchmark::microbenchmark(
  # Fastest method from previous discussion
  left_join(tbl, key, by = "CODE"),
  
  # Try a recode with factors
  mutate(tbl_with_factor, FRUIT = recode(CODE, a = "apple", b = "banana", c = "cherry")),
  
  # Try the same thing but with recode from forcats
  mutate(tbl_with_factor, FRUIT = forcats::fct_recode(CODE, apple = "a", banana = "b", cherry = "c")),
  
  # Use base R and change the levels
  mutate(tbl_with_factor, FRUIT = `levels<-`(CODE, list("apple" = "a", "banana" = "b", "cherry" = "c"))),
  
  # Left join when everything is a factor
  left_join(tbl_with_factor, key_with_factor, by = "CODE")
)

This gives some nice results

Unit: milliseconds
                                                                                                     expr      min       lq
                                                                         left_join(tbl, key, by = "CODE") 50.89433 55.58770
              mutate(tbl_with_factor, FRUIT = recode(CODE, a = "apple", b = "banana",      c = "cherry")) 22.98268 29.49390
 mutate(tbl_with_factor, FRUIT = forcats::fct_recode(CODE, apple = "a",      banana = "b", cherry = "c")) 24.25023 28.93651
    mutate(tbl_with_factor, FRUIT = `levels<-`(CODE, list(apple = "a",      banana = "b", cherry = "c"))) 22.41556 26.74911
                                                 left_join(tbl_with_factor, key_with_factor, by = "CODE") 36.87323 41.05750
     mean   median       uq      max neval
 64.41534 58.53341 61.98173 177.3578   100
 38.77766 31.69801 35.41894 154.4150   100
 36.24761 31.47930 34.62983 155.7492   100
 35.36276 29.22803 32.72580 156.9444   100
 46.23981 43.02139 47.31327 162.5166   100

Looking at median values, the mutate + recode() with factor runs all seem to score the best, at about 2x what you were doing before.

stanwood · November 16, 2017, 2:59pm

The OP use case seems tailored for factors as indicated above. Coming from SAS I'm new to factors and not entirely comfortable with them yet. I tend to use the left_join() idiom that @Frank mentioned. In particular, I often have many things I wish to keep track of regarding each factor. I keep these in a table and join in the things I need or loop over a subset of rows in the table to take action on each factor.