Merge rows with the same value within a column and put different values in the same row

IsabelHx · April 20, 2022, 1:33pm

Hey everyone,

I'm very new with Rstudio and still struggling a lot. I have a dataframe that looks like this:

Patient ID - Disease number - Treatment
p001 - d001 - treatment A
p001 -d001 - treatment B
p002 -d002 - treatment C

So to clarify, patient 1 got two treatments (treatment A and treatment B) for the same disease (d001) and patient 2 got one treatment (treatment C) for another disease (d002).

How do I get this dataframe to only have 1 disease number in a column and then different treatments in the same row. So a bit more like this:

Disease number - patient ID - treatment
d001 - p001 - treatment A, treatment B
d002 - p002 - treatment C

Can anyone help me? Thanks in advance!

nirgrahamuk · April 20, 2022, 2:57pm

if you want all the treatments listed together in the same cell, then you would use dplyr to group by and summarise, the summarisation involing a paste with collapse options. If you want the treatement info spread along the row each to its own column then instead use tidyr pivot_wider.

IsabelHx · April 24, 2022, 12:28pm

Thank you for your comment. I would like to have the different treatments in different columns. I tried the pivot wider option but I don't get what I need. When I use pivot_wider(names_from=treatment, values_from=Disease number) I get a weird dataframe like this:

Patient ID - Treatment A - Treatment B - Treatment C
p001. - d001 - d001 - na
p002 - na - na - d002

What do I do wrong? I would like to have the dataframe look more like this:

Disease number - patient ID - treatment 1 - treatment 2
d001 - p001 -treatment A - treatment B
d002 - p002 - treatment C

nirgrahamuk · April 24, 2022, 2:39pm

library(tidyverse)

(start_df <- tribble(~pid,~dn,~t,
                     "p001","d001" ,"treatment A",
                     "p001","d001","treatment B",
                     "p002","d002" , "treatment C"))

(end_df <- start_df %>%
    pivot_wider(id_cols=c("pid","dn"),
                names_from = "t",
                values_from="t",
                values_fn = function(x){TRUE},
                values_fill = FALSE))

IsabelHx · April 24, 2022, 3:26pm

Thanks for your fast reply but this is still not what I'm looking for unfortunately.
I want to merge the double data so there is only one row per disease number and I want the different treatments next to each other in different columns so treatment no1, treatment no2 etc.
So I end up with only one row per disease number instead of multiple rows.

So is there a way to merge two or more rows that have the same values in all columns except one (treatment) and have the values that are different be added in a new column (so treatment no2) so there is only one row for one patient?

nirgrahamuk · April 24, 2022, 3:47pm

I don't understand your comment on the output of the code I provided against your example.
The result is

# A tibble: 2 x 5
  pid   dn    `treatment A` `treatment B` `treatment C`
  <chr> <chr> <lgl>         <lgl>         <lgl>        
1 p001  d001  TRUE          TRUE          FALSE        
2 p002  d002  FALSE         FALSE         TRUE

i.e. the original 3 rows summarised to 2 rows.

andresrcs · April 24, 2022, 4:15pm

You can get the desired output but you would need to find out the max number of treatments per patient in advance and, honestly, the result would be messy and less useful for future processing.

library(tidyverse)

start_df <- tribble(~patient_id,~disease_number,~treatment,
                     "p001","d001" ,"treatment A",
                     "p001","d001","treatment B",
                     "p002","d002" , "treatment C")

start_df %>% 
    group_by(patient_id, disease_number) %>% 
    summarise(treatment = paste(treatment, collapse = ",")) %>% 
    separate(treatment, into = c("treatment_1", "treatment_2"), sep = ",")
#> `summarise()` has grouped output by 'patient_id'. You can override using the
#> `.groups` argument.
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> # A tibble: 2 × 4
#> # Groups:   patient_id [2]
#>   patient_id disease_number treatment_1 treatment_2
#>   <chr>      <chr>          <chr>       <chr>      
#> 1 p001       d001           treatment A treatment B
#> 2 p002       d002           treatment C <NA>

^{Created on 2022-04-24 by the reprex package (v2.0.1)}

Even merging the treatments into a character string would make more sense

start_df %>% 
    group_by(patient_id, disease_number) %>% 
    summarise(treatment = paste(treatment, collapse = ","))
#> `summarise()` has grouped output by 'patient_id'. You can override using the
#> `.groups` argument.
#> # A tibble: 2 × 3
#> # Groups:   patient_id [2]
#>   patient_id disease_number treatment              
#>   <chr>      <chr>          <chr>                  
#> 1 p001       d001           treatment A,treatment B
#> 2 p002       d002           treatment C

Note: Next time please provide a proper REPRoducible EXample (reprex) illustrating your issue.