It is puzzling to me that recoding the factors could take so long. I thought only the levels were stored and the character representation of the factor was not repeated. Is there a faster way to achieve the result below?
library(bench)
library(tidyverse)
df <- data.frame("y" = rnorm(3E7), "Grp" = rep(c("A_something", "B_something", "C_something"), each = 1E7))
bench::mark(
mutate(df, grp = str_replace(Grp, "_something", ""))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch> <bch> <bch:> <bch> <dbl> <bch:byt> <dbl> <int>
#> 1 "mutate(d~ 22.4s 22.4s 22.4s 22.4s 0.0446 573MB 1 1
#> # ... with 1 more variable: total_time <bch:tm>
Hey @Dong! I think part of the problem here is that stringr::str_replace() takes 'Either a character vector, or something coercible to one.'
A factor is essentially an integer vector where the possible labels are stored once, separately. By using str_replace(), you're converting your factor to character (essentially causing the entire vector to be re-written), searching and replacing every value, and then converting the whole thing back. The same is happening with the creation: you create a character column and then data.frame converts it to a factor automatically (this was the default before R 4.0, where stringsAsFactors = TRUE).
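To make that storage model concrete, here is a small base R sketch. The vector `f` is made up for illustration and is not from the original post; it only shows that a factor holds integer codes plus a short set of levels.

```r
# A small illustrative factor (hypothetical data, not from the thread)
f <- factor(c("A_something", "B_something", "A_something"))

typeof(f)    # underlying storage type of a factor
unclass(f)   # the integer codes, with the labels kept in the "levels" attribute
levels(f)    # the unique labels, stored once

# as.character() materialises one string per element; this is the
# coercion that str_replace() triggers when it is handed a factor
as.character(f)
```

On a vector with 3E7 elements, that coercion step touches every element, whereas `levels(f)` stays at length 3.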
I think both your factor creation and releveling would go a lot faster this way, using the forcats package to change the levels without touching the values:
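A minimal sketch of this approach, assuming dplyr, forcats and stringr are installed. It uses a smaller data frame than the 3E7 rows in the question (an assumption made only to keep it quick to run), and it may differ in detail from the code originally posted here.

```r
library(dplyr)
library(forcats)
library(stringr)

# Smaller than the 3E7 rows in the question so the sketch runs quickly (assumption)
n <- 1e5

# Create the grouping column as a factor up front
df <- data.frame(
  y   = rnorm(3 * n),
  Grp = factor(rep(c("A_something", "B_something", "C_something"), each = n))
)

# fct_relabel() passes only the levels (3 strings here) to str_replace(),
# so the integer codes for the individual rows are left untouched
df <- mutate(df, Grp = fct_relabel(Grp, str_replace, "_something", ""))

levels(df$Grp)
```

Because the relabeling happens inside mutate(), the same call can also sit in the middle of a pipeline.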
The original releveling took about a minute on my fairly new laptop; using fct_relabel() took a fraction of a second. Creating the original data frame column directly as a factor also helps a bit; it took 2–3 seconds versus about 10 using a character vector!
One thing I forgot to mention explicitly is that forcats::fct_relabel() causes str_replace to operate on the set of factor labels (length 3), not on the vector values (length 3E7)!
Getting the column as a factor will help you relabel it. You can do it with base functions, and since this works on a data.frame, it works on a data.table and a tibble too. levels() returns a character vector of the level values, which you can edit and assign back to rename the levels. There are far fewer values there than in your Grp character column.
library(data.table)
df <- data.table("y" = rnorm(3E7), "Grp" = rep(c("A_something", "B_something", "C_something"), each = 1E7))
# transform into factor
df[, Grp := as.factor(Grp)]
levels(df$Grp) <- gsub("_something", "", levels(df$Grp))
df
#> y Grp
#> 1: -1.61195065 A
#> 2: 0.98342872 A
#> 3: -1.55122757 A
#> 4: 1.17911409 A
#> 5: -2.24083948 A
#> ---
#> 29999996: 0.89209690 C
#> 29999997: -0.14506757 C
#> 29999998: 0.57133525 C
#> 29999999: -0.01521659 C
#> 30000000: 0.17231753 C
This is an old thread now, but df is often the result of tidyr::gather(), and I don't feel like stopping the pipe and naming the result just to call fct_relabel(). Any suggestions?