Is it normal for renaming factors to take a long time?

Dong · September 26, 2018, 12:49am

It is puzzling to me that recoding the factors could take so long. I thought only the levels are stored and the character representation of the factor is not repeated. Is there a faster way to achieve the following?

library(bench)
library(tidyverse)

df <- data.frame("y" = rnorm(3E7), "Grp" = rep(c("A_something", "B_something", "C_something"), each = 1E7))

bench::mark(
  mutate(df, grp = str_replace(Grp, "_something", ""))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 "mutate(d~ 22.4s 22.4s  22.4s 22.4s    0.0446     573MB     1     1
#> # ... with 1 more variable: total_time <bch:tm>

Created on 2018-09-25 by the reprex package (v0.2.0).

rensa · September 26, 2018, 3:35am

Hey @Dong! I think part of the problem here is that stringr::str_replace() takes 'Either a character vector, or something coercible to one.'

A factor is essentially an integer vector where the possible labels are stored once, separately. By using str_replace(), you're converting your factor to character (essentially causing the entire vector to be re-written), searching and replacing every value, and then converting the whole thing back. The same is happening with the creation: you create a character column and then data.frame converts it to a factor automatically.
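To make that concrete, here is a minimal base R sketch on a toy factor (six values standing in for the 3E7-row column; the toy data is my own, not from the benchmark above). It only illustrates how the codes and the levels are stored separately:

```r
# Toy factor standing in for the large Grp column (illustrative data only)
f <- factor(rep(c("A_something", "B_something", "C_something"), each = 2))

typeof(f)    # storage type of the underlying vector
unclass(f)   # the integer codes, with the levels kept as an attribute
levels(f)    # the labels, stored only once

# Converting to character writes out the full label for every element,
# which is the expensive step when the vector has 3E7 entries
chr <- as.character(f)
length(chr)
```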

I think both your factor creation and releveling would go a lot faster this way, using the forcats package to change the levels without touching the values:

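library(forcats)
library(stringr)   # for str_replace()
library(magrittr)  # for %>%

df <- data.frame(
  "y" = rnorm(3E7),
  # integer codes 1:3, with the labels supplied once via `labels`
  "Grp" = factor(rep(1:3, each = 1E7), levels = 1:3,
                 labels = c("A_something", "B_something", "C_something")))

df$grp = df$Grp %>% fct_relabel(str_replace, "_something", "")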

The original releveling took about a minute on my fairly new laptop; using fct_relabel took a fraction of a second. Creating the original data frame column directly as a factor also helps a bit; it took 2–3 seconds versus about 10 using a character vector!

rensa · September 26, 2018, 4:12am

One thing I forgot to mention explicitly is that forcats::fct_relabel() causes str_replace to operate on the set of factor labels (length 3), not on the vector values (length 3E7)!
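A quick way to check this for yourself is sketched below. The helper strip_suffix and the vector seen are hypothetical names of mine; the helper just records how many strings it is handed before doing the replacement:

```r
library(forcats)

# Toy factor with 3000 values but only 3 levels (illustrative data only)
f <- factor(rep(c("A_something", "B_something", "C_something"), each = 1000))

seen <- integer(0)

# Hypothetical helper: notes the length of its input, then strips the suffix
strip_suffix <- function(x) {
  seen <<- c(seen, length(x))
  sub("_something", "", x, fixed = TRUE)
}

g <- fct_relabel(f, strip_suffix)

seen        # lengths of the inputs the helper received
levels(g)   # relabelled levels
length(g)   # the values themselves are untouched
```

If seen reports the number of levels rather than the number of values, the function only ever saw the levels.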

Dong · September 26, 2018, 5:50am

Thanks @rensa for the clear explanation. I was trying to use str_replace to do the work of fct_relabel and got exactly what I deserved.

Again, thanks for introducing this forcats function to me.

rensa · September 26, 2018, 6:02am

That's okay! As a long-time user of factors, I'm ashamed to say that I've only just started using forcats myself.

Dong · September 26, 2018, 7:28pm

By the way, I noticed that @rensa's method also works on a data.table, but about 10x slower than on a data.frame. I wonder if some conversion is going on.

I have been using data.table for performance and memory reasons. If any readers have a solution for relabeling factors in a data.table, please share it as well.

cderv · September 26, 2018, 8:00pm

Having the column as a factor will help you relabel it. You can do it with a base function, and since it works on a data.frame it also works on a data.table and a tibble.
levels() gives you a character vector of the level values, which you can modify and assign back to replace the levels. There are far fewer values in it than in your Grp character column.

library(data.table)

df <- data.table("y" = rnorm(3E7), "Grp" = rep(c("A_something", "B_something", "C_something"), each = 1E7))
# transform into factor
df[, Grp := as.factor(Grp)]

levels(df$Grp) <- gsub("_something", "", levels(df$Grp))
df
#>                     y Grp
#>        1: -1.61195065   A
#>        2:  0.98342872   A
#>        3: -1.55122757   A
#>        4:  1.17911409   A
#>        5: -2.24083948   A
#>       ---                
#> 29999996:  0.89209690   C
#> 29999997: -0.14506757   C
#> 29999998:  0.57133525   C
#> 29999999: -0.01521659   C
#> 30000000:  0.17231753   C

Created on 2018-09-26 by the reprex package (v0.2.1)

I'll let you bench::mark() whatever you want to compare.

Dong · September 26, 2018, 8:52pm

Thanks for teaching me how to use levels. The timings I got from tictoc are now roughly comparable.

df$Grp = df$Grp %>% fct_relabel(str_replace, "_something", "")   # 0.59 sec
dt$Grp = dt$Grp %>% fct_relabel(str_replace, "_something", "")   # 0.72 sec
levels(dt$Grp) <- gsub("_something", "", levels(dt$Grp))         # 0.97 sec

So my previous "10x" observation was not correct. Sorry for the confusion.

cderv · September 28, 2018, 5:56am

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems.

Dong · March 7, 2019, 8:04pm

Is it possible to do this in a dplyr pipeline?

This is an old thread now, but df is often the result of tidyr::gather, and I don't feel like stopping the pipe and naming the result just to do fct_relabel. Any suggestions?

Thanks!