Let's say I have a tibble named my_data
.
my_data = tibble(row_number = c("row1,row2", "row3,row4,row5", "row6,row7"),
variable1 = rnorm(3),
variable2 = rnorm(3))
I want to split each row of my_data
into multiple rows so that each row has only one row index in the row_number
column. Below is my code to do this.
my_data2 = vector(mode = "list", length = nrow(my_data))
for (i in 1:nrow(my_data)) {
my_data2[[i]] = tibble(row_number = my_data$row_number[i] %>% str_split(",") %>% flatten_chr,
variable1 = my_data$variable1[i],
variable2 = my_data$variable2[i])
}
my_data2 %<>% bind_rows
Although this code does exactly what I want, it takes too long when my_data
is very large (my actual data frame has hundreds of thousands of rows). Is there a more efficient way to do this?
I've come up with one.
my_data = tibble(row_number = c("row1,row2", "row3,row4,row5", "row6,row7"),
variable1 = rnorm(3),
variable2 = rnorm(3))
split_row = function(row_number, ...) {
tibble(row_number = str_split(row_number, ",") %>% flatten_chr, ...)
}
my_data2 = my_data %>% pmap_dfr(split_row)
According to my test, this code runs much faster than the original one while doing the exactly same job.
Use tidyr::separate_rows()
.
library(tidyr)
my_data <- tibble(row_number = c("row1,row2", "row3,row4,row5", "row6,row7"),
variable1 = rnorm(3),
variable2 = rnorm(3))
separate_rows(my_data, row_number)
#> # A tibble: 7 x 3
#> row_number variable1 variable2
#> <chr> <dbl> <dbl>
#> 1 row1 -0.475 0.903
#> 2 row2 -0.475 0.903
#> 3 row3 -0.367 -0.388
#> 4 row4 -0.367 -0.388
#> 5 row5 -0.367 -0.388
#> 6 row6 0.0367 0.671
#> 7 row7 0.0367 0.671
Created on 2020-05-21 by the reprex package (v0.3.0)
2 Likes
system
Closed
May 28, 2020, 7:34am
4
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.