The most efficient way to split each row of a tibble into multiple rows

junghoonshin · May 21, 2020, 7:03am

Let's say I have a tibble named my_data.

my_data = tibble(row_number = c("row1,row2", "row3,row4,row5", "row6,row7"),
                 variable1 = rnorm(3),
                 variable2 = rnorm(3))

I want to split each row of my_data into multiple rows so that each row has only one row index in the row_number column. Below is my code to do this.

my_data2 = vector(mode = "list", length = nrow(my_data))
for (i in 1:nrow(my_data)) {
  my_data2[[i]] = tibble(row_number = my_data$row_number[i] %>% str_split(",") %>% flatten_chr,
                         variable1 = my_data$variable1[i],
                         variable2 = my_data$variable2[i])
}
my_data2 %<>% bind_rows

Although this code does exactly what I want, it takes too long when my_data is very large (my actual data frame has hundreds of thousands of rows). Is there a more efficient way to do this?

junghoonshin · May 21, 2020, 7:29am

I've come up with one.

my_data = tibble(row_number = c("row1,row2", "row3,row4,row5", "row6,row7"),
                 variable1 = rnorm(3),
                 variable2 = rnorm(3))

split_row = function(row_number, ...) {
  tibble(row_number = str_split(row_number, ",") %>% flatten_chr, ...)
}

my_data2 = my_data %>% pmap_dfr(split_row)

According to my test, this code runs much faster than the original one while doing the exactly same job.

siddharthprabhu · May 21, 2020, 7:34am

Use tidyr::separate_rows().

library(tidyr)

my_data <- tibble(row_number = c("row1,row2", "row3,row4,row5", "row6,row7"), 
                  variable1 = rnorm(3), 
                  variable2 = rnorm(3))

separate_rows(my_data, row_number)
#> # A tibble: 7 x 3
#>   row_number variable1 variable2
#>   <chr>          <dbl>     <dbl>
#> 1 row1         -0.475      0.903
#> 2 row2         -0.475      0.903
#> 3 row3         -0.367     -0.388
#> 4 row4         -0.367     -0.388
#> 5 row5         -0.367     -0.388
#> 6 row6          0.0367     0.671
#> 7 row7          0.0367     0.671

^{Created on 2020-05-21 by the reprex package (v0.3.0)}

system · May 28, 2020, 7:34am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.