Hey everyone,
I'm very new with Rstudio and still struggling a lot. I have a dataframe that looks like this:
Patient ID - Disease number - Treatment
p001 - d001 - treatment A
p001 -d001 - treatment B
p002 -d002 - treatment C
So to clarify, patient 1 got two treatments (treatment A and treatment B) for the same disease (d001) and patient 2 got one treatment (treatment C) for another disease (d002).
How do I get this dataframe to only have 1 disease number in a column and then different treatments in the same row. So a bit more like this:
Disease number - patient ID - treatment
d001 - p001 - treatment A, treatment B
d002 - p002 - treatment C
Can anyone help me? Thanks in advance!
if you want all the treatments listed together in the same cell, then you would use dplyr to group by and summarise, the summarisation involing a paste with collapse options. If you want the treatement info spread along the row each to its own column then instead use tidyr pivot_wider.
Thank you for your comment. I would like to have the different treatments in different columns. I tried the pivot wider option but I don't get what I need. When I use pivot_wider(names_from=treatment, values_from=Disease number) I get a weird dataframe like this:
Patient ID - Treatment A - Treatment B - Treatment C
p001. - d001 - d001 - na
p002 - na - na - d002
What do I do wrong? I would like to have the dataframe look more like this:
Disease number - patient ID - treatment 1 - treatment 2
d001 - p001 -treatment A - treatment B
d002 - p002 - treatment C
library(tidyverse)
(start_df <- tribble(~pid,~dn,~t,
"p001","d001" ,"treatment A",
"p001","d001","treatment B",
"p002","d002" , "treatment C"))
(end_df <- start_df %>%
pivot_wider(id_cols=c("pid","dn"),
names_from = "t",
values_from="t",
values_fn = function(x){TRUE},
values_fill = FALSE))
Thanks for your fast reply but this is still not what I'm looking for unfortunately.
I want to merge the double data so there is only one row per disease number and I want the different treatments next to each other in different columns so treatment no1, treatment no2 etc.
So I end up with only one row per disease number instead of multiple rows.
So is there a way to merge two or more rows that have the same values in all columns except one (treatment) and have the values that are different be added in a new column (so treatment no2) so there is only one row for one patient?
I don't understand your comment on the output of the code I provided against your example.
The result is
# A tibble: 2 x 5
pid dn `treatment A` `treatment B` `treatment C`
<chr> <chr> <lgl> <lgl> <lgl>
1 p001 d001 TRUE TRUE FALSE
2 p002 d002 FALSE FALSE TRUE
i.e. the original 3 rows summarised to 2 rows.
You can get the desired output but you would need to find out the max number of treatments per patient in advance and, honestly, the result would be messy and less useful for future processing.
library(tidyverse)
start_df <- tribble(~patient_id,~disease_number,~treatment,
"p001","d001" ,"treatment A",
"p001","d001","treatment B",
"p002","d002" , "treatment C")
start_df %>%
group_by(patient_id, disease_number) %>%
summarise(treatment = paste(treatment, collapse = ",")) %>%
separate(treatment, into = c("treatment_1", "treatment_2"), sep = ",")
#> `summarise()` has grouped output by 'patient_id'. You can override using the
#> `.groups` argument.
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> # A tibble: 2 × 4
#> # Groups: patient_id [2]
#> patient_id disease_number treatment_1 treatment_2
#> <chr> <chr> <chr> <chr>
#> 1 p001 d001 treatment A treatment B
#> 2 p002 d002 treatment C <NA>
Created on 2022-04-24 by the reprex package (v2.0.1)
Even merging the treatments into a character string would make more sense
start_df %>%
group_by(patient_id, disease_number) %>%
summarise(treatment = paste(treatment, collapse = ","))
#> `summarise()` has grouped output by 'patient_id'. You can override using the
#> `.groups` argument.
#> # A tibble: 2 × 3
#> # Groups: patient_id [2]
#> patient_id disease_number treatment
#> <chr> <chr> <chr>
#> 1 p001 d001 treatment A,treatment B
#> 2 p002 d002 treatment C
Note: Next time please provide a proper REPRoducible EXample (reprex) illustrating your issue.