Extracting data from columar lists

I'm dealing with a lot of biological data in VCF format, which has text that is tab-separated with 4 or 5 columns that each observation has. However, there are two final columns of variable fields - the first column has which fields are present in the format
Field1;Field2;Field4
Filed1:Field2:Field3

and the second column has the value for each field

The acutal data looks like this if I transform it to a tibble

head(vcf1@gt) %>% as_tibble()
# A tibble: 6 × 2
  FORMAT                   PATIENT           
  <chr>                    <chr>                                     
1 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:237,4:0.02:241:105,0:132,4:121,116,2,2
2 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:158,4:0.039:162:78,0:80,4:77,81,1,3   
3 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:2,2:0.5:4:1,0:1,2:2,0,2,0             
4 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1,2:0.5:3:1,2:0,0:1,0,1,1             
5 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:38,4:0.12:42:23,4:15,0:20,18,2,2      
6 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:38,4:0.12:42:23,4:15,0:20,18,2,2   

The problem is that not all observations have each field. This example does, but you could easily have a situation where one line is missing SB and the corresponding field in the PATIENT column.

In order to convert it to a tibble that has all the data, I run the code

vcf1@gt %>%
    tidyr::separate_longer_delim(cols = everything(), delim = ":") %>%
    dplyr::mutate(FORMAT = stringr::str_c("gt_", FORMAT)) %>%
    tidyr::pivot_wider(names_from = FORMAT, values_from = dplyr::last_col())

Is there a better (hopefully faster) way to do this? Can I use Vroom somehow?
Does the answer change if all fields are always present?

Thank you,

Uri David

I assume you are using vcfR; it provides a vcfR2tidy function that might be right for you to use.

Hi.
I'm actually using vcfR. The vcfR::vcfR2tidy() function uses a lot of do.call and apply, and takes about 2.5 times longer than the tidy code I've written (while getting equivalent results).
Is there any tidyverse way to rephrase or speed up what I'm processing?

Uri David

In general I would not begin to attempt to optimise rewriting any code without the benefit of suitable data to test and benchmark with. Can you suggest or provide suitable data to test against ?

Let me see if I can find publicly available VCFs to look at. My actual data is not something I can share with you.
This project (Python based) seems to have various examples, but they might be deliberatly buggy

This should have a few examples of VCF files
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07//exon/snps

Thank you,

Uri David

PS
I only gave some of my code before, the bit I thought takes more time. The entire function looks like this.
Thank you for looking at this!!

potentially_faster_vcf_tidyr <- function(vcfR_datastructure) {
  base <- as_tibble(vcfR_datastructure@fix) %>%
    dplyr::mutate(
      POS = as.integer(POS),
      QUAL = as.numeric(QUAL),
      ChromKey = as.integer(factor(CHROM),
                            levels = unique(CHROM))
    )
  
  info_fields_df <-
    vcfR::vcf_field_names(vcfR_datastructure) %>% 
    dplyr::mutate(
      convert_type=case_when(
        Type == "Integer" & Number==1 ~ "integer", 
        (Type=="Float" | Type=="Numeric") & Number==1 ~ "numeric", 
        Type=="Flag" & Number==0 ~ "flag", 
        .default = "character"))
  info_flag_fields <- info_fields_df %>% 
    dplyr::filter(convert_type == "flag") %>%
    pull(ID)
  info_integer_fields <- info_fields_df %>% 
    dplyr::filter(convert_type == "integer") %>%
    pull(ID)
  info_numeric_fields <- info_fields_df %>% 
    dplyr::filter(convert_type == "numeric") %>%
    pull(ID)
  info_string_fields <- info_fields_df %>% 
    dplyr::filter(convert_type == "character") %>%
    pull(ID)
  
  fix_tbl <- base %>%
    tidyr::separate_longer_delim(cols = "INFO", delim = ";") %>%
    dplyr::mutate(INFO = if_else(
      stringr::str_detect(
        string = INFO,
        pattern = paste(info_flag_fields, collapse = "|")
      ) &
        !stringr::str_detect(string = INFO, "="),
      stringr::str_c(INFO, "=TRUE"),
      INFO
    )) %>%
    tidyr::separate_wider_delim(delim = "=",
                                cols="INFO",
                                names = c("INFO_field", "INFO_value")) %>%
    tidyr::pivot_wider(names_from=INFO_field, values_from=INFO_value) %>%
    dplyr::mutate(across(any_of(info_flag_fields), ~ !is.na(.x)),
                  across(any_of(info_integer_fields), ~ as.integer(.x)),
                  across(any_of(info_numeric_fields), ~ as.numeric(.x))
                  )
  
  gt_integer_fields <- vcfR::vcf_field_names(vcfR_datastructure, 
                                             tag = "FORMAT") %>%
    dplyr::mutate(convert_type = case_when(
      Type == "Integer" & Number == 1 ~ "integer"
    )) %>%
    dplyr::filter(convert_type == "integer") %>%
    pull(ID) 
  gt_integer_fields <- stringr::str_c("gt_", gt_integer_fields)
  
  
  gt_tbl <-
    bind_cols(
      fix_tbl %>% dplyr::select(ChromKey, POS),
      as_tibble(vcfR_datastructure@gt)
    ) %>%
    tidyr::separate_longer_delim(cols = everything(), delim = ":") %>%
    dplyr::mutate(FORMAT = stringr::str_c("gt_", FORMAT)) %>%
    # This probably will not work right when there are multiple individuals in a VCF file
    tidyr::pivot_wider(names_from = FORMAT, values_from = dplyr::last_col()) %>%
    dplyr::mutate(
    Indiv = rep(
      colnames(vcfR_datastructure@gt)[-1],
      each = nrow(vcfR_datastructure@fix)
    ),
    .after=POS) %>%
    dplyr::mutate(across(any_of(gt_integer_fields), ~ as.integer(.x))) %>%
    dplyr::bind_cols(
      vcfR::extract.gt(vcfR_datastructure, element = "GT", return.alleles = T) %>%
        as_tibble() %>%
        dplyr::rename(gt_GT_alleles = dplyr::last_col())
    )
  
  return(list(fix=fix_tbl, gt=gt_tbl))
}

I appreciate, that it takes effort to provide a reprex; a minimal reproducible example; but the alternative of providing a somewhat bloated start point is it would require someone else like me to go to additional effort to go through and remove irrelevant components to zoom in on the core issue.

I would think that you should prioritise efforts to make a an approach that generally works (like vcrftidy does) , and not which fails under common conditions , only working when data is 'just so'.

I will give you one tip though; it can often be an improvement to improve a tidyverse approach by using data.table and dtplyr;

I see your point.
What is the dtplyr/data.table equivalent of tidyr::pivot_longer and tidyr::pivot_wider? Do you happen to know by any chance?

Checking - those seem to be implemented by dtplyr, so great. Do you know the equivalent of tidyr::separate_longer_delim in dtplyr or data.table? It looks like dtplyr only implements the older version tidyr::separate.

Thank you

dtplyr I believe does implement the basic versions of pivot_wider, pivot_longer, and seperate

Thank you - one final question
Is there a way to incoporate data.table commands directly/explicitly into a dtplyr pipeline?
For example, it looks like there is no dtplyr translation of tidyr::separate_delim_longer, but assuming I have a data.table, I can follow the commands on stackoverflow and run

data.table::as.data.table(base)[,strsplit(as.character(INFO), ";", fixed=TRUE), by = .(CHROM, POS, QUAL, INFO)][,.(INFO = V1, CHROM, POS, QUAL)]

Is there a way I can copy this command and incoporate it into the translated pipeline without running it? I'm hoping that when incorporated, the entire pipelin will speed up.
I've tried searching for documentation for this question, but haven't found it yet.

Thank you

make your own data.table pipeline, using dtplyr to translate one step at a time for you; or using a chatgpt/bingchat variant to help you translate between modalities.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.