Unifying (genomic) segments/intervals

uridavid.akavia · January 8, 2024, 9:59am

I have a genomic segments file that is basically a modified BED file. I want to unify two files that have segment borders in different places.
I can read the files as tibbles, and I'll demonstrate what I want. It is possible this can be done via bedtools, but I've read the documentation and I can't find it.

A simple reprex

A_tbl = tibble::tibble(chromosome=c(1, 1), chr_start=c(50, 151), chr_end=c(150, 250), cn=c(1, 2))
B_tbl = tibble::tibble(chromosome=c(1, 1), chr_start=c(50, 101), chr_end=c(100, 250), cn=c(1, 2))

I want to end up with

A_B_tbl = tibble::tibble(chromsome=c(1, 1, 1), chr_start=c(50, 101, 151), chr_end=c(100, 150, 250), A_cn=c(1, 1, 2), B_cn=c(1, 2, 2))

I can't figure out how to do this. I've tried working with GenomicRanges, using overlapping join, but none of them get it right. It feels like it is something obvious I'm missing - is there maybe an option to do it via data.table or SQL, but not dplyr?
Is there a way to do this without looping over the intervals?

Thank you,

Uri David

technocrat · January 8, 2024, 10:28am

My Bayesian prior on selecting tools in R is that if an existing, popular tool doesn't do what I'm looking for it is more likely that I can get to where I need to go by figuring out why than trying to do it on my own. Try this guide:

Merging two genomic segment files with different segment borders involves several steps: reading the files, harmonizing the data formats, and then deciding how to reconcile the differences in segment borders. This process can be complex, depending on the specifics of how the segment borders differ. Here's a step-by-step guide on how to approach this in R, especially using the GenomicRanges package from Bioconductor, which is well-suited for handling genomic ranges and overlaps.

Step 1: Install and Load Necessary Packages

You'll need the rtracklayer package for importing BED files and GenomicRanges for handling genomic range operations. If you haven't installed these packages, you can do so using Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("rtracklayer")
BiocManager::install("GenomicRanges")

Then, load the packages:

library(rtracklayer)
library(GenomicRanges)

Step 2: Import the BED Files

Use rtracklayer::import() to read the BED files into R. This function will create a GRanges object directly.

segments1 <- rtracklayer::import("path/to/your/firstfile.bed")
segments2 <- rtracklayer::import("path/to/your/secondfile.bed")

Step 3: Reconcile Segment Borders

Decide how to unify the segments. If you want to combine overlapping segments, consider these options:

Union of segments: Create segments that represent the combined range of any overlapping segments.
Intersection of segments: Create segments only where the two files overlap.

For union or intersection, use functions from the GenomicRanges package:

# Union of segments
union_segments <- reduce(c(segments1, segments2))

# Intersection of segments
intersect_segments <- intersect(segments1, segments2)

Step 4: Handle Non-Overlapping Segments

If there are segments in one file that don't overlap with the other, decide how to handle them. You might want to include them as-is or ignore them.

Step 5: Create the Unified File

Once you've processed the segments, you can export the unified data back to a BED file:

rtracklayer::export(union_segments, "unified_segments.bed")

Replace union_segments with intersect_segments or whatever object represents your final unified segments.

Notes:

The specific approach may vary depending on your data and needs.
This example assumes that the segment files are in a format compatible with rtracklayer::import(). If your files are heavily modified from the standard BED format, you might need to read them using different methods and manually convert them into GRanges objects.
Handling genomic data can get complex, especially when dealing with edge cases in segment overlaps. Ensure to verify the results.

uridavid.akavia · January 8, 2024, 12:16pm

Thank you for your reply. My problem with GenomicRanges operations is that they drop the additional data columns for some reason. I haven't been able to prevent them. I did figure out Step 3, how to Reconcile Segment Borders, and then I can get the value later.

However, does purrr::pmap work well on grouped data?

Step 3: Reconcile Segment Borders

chr1_segments <-
  dplyr::bind_rows(
    seg1_tbl  |>
      dplyr::filter(chrom == 1) |>
      dplyr::select(ID, chrom, loc.start, loc.end),
    seg2_tbl |>
      dplyr::filter(chrom == 1) |>
      dplyr::select(ID, chrom, loc.start, loc.end)
  ) |> dplyr::distinct()

merged_segments_chr1 <- tidyr::expand(chr1_segments, loc.start, loc.end) |>
  dplyr::mutate(loc.width = loc.end - loc.start) |>
  dplyr::filter(loc.width > 0) |>
  dplyr::slice_min(order_by = loc.width, by = loc.start, n = 1) |>
  dplyr::mutate(chrom = 1, .before = dplyr::everything())

Is there a way I can run this on each chromsome, so grouped purrr::pmap?
If not, I'm thinking I should do group_split on each file, and then run pmap. Does that make sense? Can you suggest a better way to do this, and/or a way to re-add the chromosome number, which expand removes?

Once I have the main file, I can get the data columns using

dplyr::left_join(
  merged_segments_chr1,
  seg1_tbl,
  by = dplyr::join_by(chrom, overlaps(loc.start, loc.end, loc.start, loc.end))
)

Thank you!

technocrat · January 9, 2024, 7:37am

In GenomicRanges, the reduce function is used to merge overlapping ranges. However, when using reduce on combined GRanges objects (like reduce(c(segments1, segments2))), the metadata (such as gene names or other annotations) is not retained by default. This is because reduce is designed to simplify the range information, and it's not always clear how to combine the metadata from overlapping ranges.

If you want to keep the metadata, you need to handle it separately. One approach is to identify which ranges are being merged and then decide how to combine their corresponding metadata. Here's a general strategy to do this:

Identify Overlapping Ranges: Before reducing, find out which ranges in segments1 and segments2 overlap.

Aggregate Metadata for Overlapping Ranges: Create a strategy to combine metadata for these overlapping ranges. This might involve concatenating gene names, taking the union of attributes, etc.

Reduce the Ranges: Use reduce to merge overlapping ranges.

Assign Aggregated Metadata to Reduced Ranges: After reducing, assign the aggregated metadata back to the reduced ranges.

Here's an example illustrating these steps:

# Load the rtracklayer package
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
# Sample data for a BED file

bed_data_1 <- "
chr1\t213941196\t213942363\tGene1\t0\t+
chr1\t213942363\t213943530\tGene2\t0\t-
chr2\t158364697\t158365864\tGene3\t0\t+
"

# Write the original sample data to a file
writeLines(bed_data_1, con = "overlap1.bed")

# Overlapping bed_data
bed_data_2 <- "
chr1\t213941500\t213942500\tGene1_overlap\t0\t+
chr1\t213943000\t213944000\tGene2_overlap\t0\t-
chr2\t158364800\t158365900\tGene3_overlap\t0\t+
"

# Write the overlapping sample data to a file
writeLines(bed_data_2, con = "overlap2.bed")

# Import the original BED file
segments1 <- import("overlap1.bed")
#> Error in import("overlap1.bed"): could not find function "import"

# Import the overlapping BED file
segments2 <- import("overlap2.bed")
#> Error in import("overlap2.bed"): could not find function "import"

# View the contents
segments1
#> Error in eval(expr, envir, enclos): object 'segments1' not found
segments2
#> Error in eval(expr, envir, enclos): object 'segments2' not found

union_segments <- reduce(c(segments1,segments2))
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'reduce': object 'segments1' not found

union_segments
#> Error in eval(expr, envir, enclos): object 'union_segments' not found


combined_segments <- c(segments1, segments2)
#> Error in eval(expr, envir, enclos): object 'segments1' not found

# Identify overlaps
overlaps <- findOverlaps(combined_segments)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'query' in selecting a method for function 'findOverlaps': object 'combined_segments' not found

# Create a data frame to store aggregated metadata (this will depend on your specific metadata)
metadata_agg <- data.frame(matrix(ncol = ncol(mcols(combined_segments)), nrow = length(combined_segments)))
#> Error in eval(expr, envir, enclos): object 'combined_segments' not found
colnames(metadata_agg) <- colnames(mcols(combined_segments))
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colnames': error in evaluating the argument 'x' in selecting a method for function 'mcols': object 'combined_segments' not found

# Loop through overlaps and aggregate metadata
for (i in seq_along(combined_segments)) {
  overlapping_indices <- subjectHits(overlaps[queryHits(overlaps) == i])
  
  # Aggregate metadata, e.g., by concatenating names (adjust according to your metadata)
  if (length(overlapping_indices) > 0) {
    metadata_agg[i,] <- apply(mcols(combined_segments)[overlapping_indices,], 2, paste, collapse = ";")
  } else {
    metadata_agg[i,] <- mcols(combined_segments)[i,]
  }
}
#> Error in eval(expr, envir, enclos): object 'combined_segments' not found

# Now reduce the ranges
reduced_segments <- reduce(combined_segments)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'reduce': object 'combined_segments' not found

# Re-attach the aggregated metadata
# This is a simple approach and may need adjustments based on your data and how overlaps are handled
mcols(reduced_segments) <- metadata_agg[seq_along(reduced_segments),]
#> Error in eval(expr, envir, enclos): object 'metadata_agg' not found

# View
mcols(reduced_segments)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mcols': object 'reduced_segments' not found

^{Created on 2024-01-08 with reprex v2.0.2}

The key point is that union() discards the metadata on purpose because it does not always make sense to simply slap them together.

uridavid.akavia · January 16, 2024, 10:04am

The way I ended up doing this, which may be useful for other people is

#' Merge two segmentation dataframes/tibbles
#'
#' @param segment_tbl1 tibble of segmentation, containing the columns: ID, chrom,
#'                      loc.start, loc.end, ID, seg.mean, and additional columns
#' @param segment_tbl2 tibble of segmentation, in the same format as segment_tbl2
#' @param suffix Should be a character vector of length 2. These suffixes will
#'              be added to the metadata column names to disambiguate them.
#'
#' @return A tibble containing the segments, where for each chromosome, the
#'        new segment start points are Union(start1, start2) and the new
#'        segment end points ar Union(end1, end2)
#'        The metadata is processed to match the new segmentation.
#' @export
#'
#' @examples
merge_segments <-
  function(segment_tbl1,
           segment_tbl2,
           suffix = c('.x', '.y')) {
    # Some segments have bp location with 0.5, and then that location appears
    # in both start and end locations (of different segments). Removing, since
    # we don't split bps into parts.
    segment_tbl1 <-
      segment_tbl1 |> dplyr::mutate(loc.start = ceiling(loc.start),
                                    loc.end = floor(loc.end))
    segment_tbl2 <-
      segment_tbl2 |> dplyr::mutate(loc.start = ceiling(loc.start),
                                    loc.end = floor(loc.end))
    segment_borders_tbl <- dplyr::bind_rows(
      segment_tbl1 |>
        dplyr::select(dplyr::any_of(
          c("ID", "chrom", "loc.start", "loc.end")
        )),
      segment_tbl2 |>
        dplyr::select(dplyr::any_of(
          c("ID", "chrom", "loc.start", "loc.end")
        ))
    ) |> dplyr::distinct()
    
    # Make sure that the segments are positive values, and take the minimum
    # width (so smallest possible segment) for each starting point.
    merged_segments <-
      tidyr::expand_grid(
        seg.start = unique(segment_borders_tbl$loc.start),
        seg.end = unique(segment_borders_tbl$loc.end)
      ) |>
      dplyr::mutate(seg.width = seg.end - seg.start) |>
      dplyr::filter(seg.width > 0) |>
      dplyr::slice_min(order_by = seg.width, by = seg.start, n = 1)
    
    segment_merged1 <- dplyr::left_join(merged_segments,
                                        segment_tbl1,
                                        by = dplyr::join_by(overlaps(seg.start,
                                                                     seg.end,
                                                                     loc.start,
                                                                     loc.end))) |>
      dplyr::select(-loc.start, -loc.end, -num.mark) |>
      dplyr::select(-dplyr::any_of("seg.num")) |>
      dplyr::rename_with( ~ paste0(.x, suffix[1], recycle0 = T),
                          seg.mean:dplyr::last_col())
    
    segment_merged2 <- dplyr::left_join(merged_segments,
                                        segment_tbl2,
                                        by = dplyr::join_by(overlaps(seg.start,
                                                                     seg.end,
                                                                     loc.start,
                                                                     loc.end))) |>
      dplyr::select(-loc.start, -loc.end, -num.mark) |>
      dplyr::select(-dplyr::any_of("seg.num")) |>
      dplyr::rename_with( ~ paste0(.x, suffix[2], recycle0 = T),
                          seg.mean:dplyr::last_col())
    
    return(
      dplyr::bind_cols(
        # A bit of a kludge to avoid duplicated segments
        segment_merged1 |>
          dplyr::distinct(seg.start, seg.end, .keep_all = T),
        segment_merged2 |>
          dplyr::distinct(seg.start, seg.end, .keep_all = T) |>
          dplyr::select(
            -dplyr::any_of(c("ID", "chrom", "seg.start", "seg.end", "seg.width"))
          )
      ) |>
        dplyr::relocate(dplyr::any_of(c("ID", "chrom")), .before = seg.start)
    )
    
  }```

system · February 6, 2024, 10:04am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.