I'm working with a lot of biological data in VCF format, a tab-separated text format in which each observation has 4 or 5 fixed columns. However, there are also two final columns of variable fields. The first of these lists which fields are present, in a format like
Field1:Field2:Field4
Field1:Field2:Field3
and the second column has the value for each field, in the same order.
The actual data looks like this when I convert it to a tibble:
head(vcf1@gt) %>% as_tibble()
# A tibble: 6 × 2
  FORMAT                   PATIENT
  <chr>                    <chr>
1 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:237,4:0.02:241:105,0:132,4:121,116,2,2
2 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:158,4:0.039:162:78,0:80,4:77,81,1,3
3 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:2,2:0.5:4:1,0:1,2:2,0,2,0
4 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1,2:0.5:3:1,2:0,0:1,0,1,1
5 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:38,4:0.12:42:23,4:15,0:20,18,2,2
6 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:38,4:0.12:42:23,4:15,0:20,18,2,2
The problem is that not all observations have each field. This example does, but you could easily have a situation where one line is missing SB and the corresponding field in the PATIENT column.
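To make the missing-field case concrete, here is a minimal sketch with made-up values (the object name gt_toy and the numbers are hypothetical, not taken from my real file). The second row lacks SB in both FORMAT and PATIENT, so the two rows split into different numbers of pieces:

```r
library(tibble)

# Hypothetical toy data (values invented for illustration):
# row 2 has no SB entry in FORMAT and no matching value in PATIENT
gt_toy <- tribble(
  ~FORMAT,       ~PATIENT,
  "GT:AD:DP:SB", "0/1:237,4:241:121,116,2,2",
  "GT:AD:DP",    "0/1:158,4:162"
)

# Number of ":"-separated pieces per row in each column
n_format  <- lengths(strsplit(gt_toy$FORMAT,  ":", fixed = TRUE))
n_patient <- lengths(strsplit(gt_toy$PATIENT, ":", fixed = TRUE))

n_format   # 4 3
n_patient  # 4 3
```

Within a row the two counts agree, but they differ between rows, which is why I can't just split PATIENT into a fixed set of columns.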
To convert it to a tibble with one column per field, I run:
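vcf1@gt %>%
  tibble::as_tibble() %>%                                    # @gt is a character matrix in vcfR
  dplyr::mutate(row = dplyr::row_number(), .before = 1) %>%  # row id so each variant stays on its own row
  tidyr::separate_longer_delim(cols = -row, delim = ":") %>% # split FORMAT and the sample column in parallel
  dplyr::mutate(FORMAT = stringr::str_c("gt_", FORMAT)) %>%
  tidyr::pivot_wider(names_from = FORMAT, values_from = dplyr::last_col())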
Is there a better (hopefully faster) way to do this? Can I use vroom somehow?
Does the answer change if all fields are always present?
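For that simpler case, this is roughly what I have in mind: a sketch only, assuming every row shares the first row's FORMAT string and that tidyr >= 1.3.0 is available for separate_wider_delim(). The toy data and the names gt_same, field_names and gt_wide are hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data in which every row has the same FORMAT
gt_same <- tribble(
  ~FORMAT,    ~PATIENT,
  "GT:AD:DP", "0/1:237,4:241",
  "GT:AD:DP", "0/1:158,4:162"
)

# Guard the assumption: exactly one distinct FORMAT string
stopifnot(length(unique(gt_same$FORMAT)) == 1)

# Derive the column names once, from the first row only
field_names <- paste0("gt_", strsplit(gt_same$FORMAT[1], ":", fixed = TRUE)[[1]])

# Split PATIENT straight into wide columns, with no pivoting step
gt_wide <- separate_wider_delim(
  gt_same,
  cols  = PATIENT,
  delim = ":",
  names = field_names
)
gt_wide
```

This avoids the longer-then-wider round trip, but I don't know whether it is actually faster on real data, and it obviously breaks as soon as one row has a different FORMAT.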
Thank you,
Uri David