Tidy way to collapse rows when a spread/pivot won't work

ejlatour · June 25, 2018, 7:20pm

I want to combine duplicate rows into a one with multiple columns for the unique info. I was able to figure out a couple of ways using the tidyverse, but I'm wondering if there is a better way than what I've come up with.

I could use another set of eyes on this problem.

It is sort of the reverse of what was done in Tidy way to split a column.

data.frame(
  id = c(1, 1, 2, 2, 3), 
  code = c("A", "B", "C", "D", "E")
)
#>   id code
#> 1  1    A
#> 2  1    B
#> 3  2    C
#> 4  2    D
#> 5  3    E

library(tidyverse)

(df <- tibble::tibble(
  id = c(1, 1, 2, 2, 3, 4, 4), 
  code = c("A", "B", "C", "D", "E", NA, NA)))
#> # A tibble: 7 x 2
#>      id code 
#>   <dbl> <chr>
#> 1     1 A    
#> 2     1 B    
#> 3     2 C    
#> 4     2 D    
#> 5     3 E    
#> 6     4 <NA> 
#> 7     4 <NA>

(max_dups <- df %>% 
  dplyr::count(id) %>% 
  dplyr::pull(n) %>% 
  max(.))
#> [1] 2

df %>% 
  group_by(id) %>% 
  summarise(combo_code = paste0(code, collapse = "~~~~")) %>% 
  tidyr::separate(., 
                  col = combo_code, 
                  into = paste("new_code", 1:max_dups, sep = "_"), 
                  sep = "~~~~", 
                  remove = TRUE, 
                  fill = "right") %>% 
  mutate_at(.vars = vars(paste("new_code", 1:max_dups, sep = "_")), 
            .funs = funs(dplyr::na_if(., "NA")))
#> # A tibble: 4 x 3
#>      id new_code_1 new_code_2
#>   <dbl> <chr>      <chr>     
#> 1     1 A          B         
#> 2     2 C          D         
#> 3     3 E          <NA>      
#> 4     4 <NA>       <NA>

Created on 2018-06-25 by the reprex package (v0.2.0).

I figured out a way to do it with tidyr::spread() but it does't seem as fool-proof.

## Another way ---------------- 

(df <- tibble::tibble(
  id = c(1, 1, 2, 2, 3, 4, 4), 
  code = c("A", "B", "C", "D", "E", NA, NA)))
#> # A tibble: 7 x 2
#>      id code 
#>   <dbl> <chr>
#> 1     1 A    
#> 2     1 B    
#> 3     2 C    
#> 4     2 D    
#> 5     3 E    
#> 6     4 <NA> 
#> 7     4 <NA>

library(tidyverse)

df %>% 
  group_by(id) %>% 
  mutate(dup_count = sequence(dplyr::n()), 
         key = paste("new_code", dup_count, sep = "_")) %>% 
  tidyr::spread(., 
                key = key,
                value = code) %>% 
  tidyr::fill(dplyr::starts_with("new_code"), .direction = "up") %>% 
  dplyr::distinct(., id, .keep_all = TRUE)
#> # A tibble: 4 x 4
#> # Groups:   id [4]
#>      id dup_count new_code_1 new_code_2
#>   <dbl>     <int> <chr>      <chr>     
#> 1     1         1 A          B         
#> 2     2         1 C          D         
#> 3     3         1 E          <NA>      
#> 4     4         1 <NA>       <NA>

Created on 2018-06-25 by the reprex package (v0.2.0).

alistaire · June 26, 2018, 1:17am

Your last approach is close, but by adding two columns (dup_count and key), you're stopping it from collapsing. Add a single column and it's clean:

library(tidyverse)

df <- tibble::tibble(
    id = c(1, 1, 2, 2, 3, 4, 4), 
    code = c("A", "B", "C", "D", "E", NA, NA)
)

df %>% 
    group_by(id) %>% 
    mutate(key = paste0('new_code_', row_number())) %>% 
    spread(key, code)
#> # A tibble: 4 x 3
#> # Groups:   id [4]
#>      id new_code_1 new_code_2
#>   <dbl> <chr>      <chr>     
#> 1     1 A          B         
#> 2     2 C          D         
#> 3     3 E          <NA>      
#> 4     4 <NA>       <NA>

See more examples here:

ejlatour · June 26, 2018, 4:43am

@alistaire Thank you so much! The method you show is very direct and much cleaner than what I was trying. I think that my logic was there, but wasn't quite there with the code yet. You examples in the link are very helpful also; I did not come across that in my searching into this problem. Thanks again for the help with this one!

cderv · June 26, 2018, 5:37am

Hi,

If your question has been answered, can you mark your question as solved

It will help other knows the status of the question right from the list of question or by searching for related topic.

Thank you.