In GenomicRanges, the reduce function is used to merge overlapping ranges. However, when using reduce on combined GRanges objects (like reduce(c(segments1, segments2))), the metadata (such as gene names or other annotations) is not retained by default. This is because reduce is designed to simplify the range information, and it's not always clear how to combine the metadata from overlapping ranges.
If you want to keep the metadata, you need to handle it separately. One approach is to identify which ranges are being merged and then decide how to combine their corresponding metadata. Here's a general strategy to do this:
Identify Overlapping Ranges: Before reducing, find out which ranges in segments1 and segments2 overlap.
Aggregate Metadata for Overlapping Ranges: Create a strategy to combine metadata for these overlapping ranges. This might involve concatenating gene names, taking the union of attributes, etc.
Reduce the Ranges: Use reduce to merge overlapping ranges.
Assign Aggregated Metadata to Reduced Ranges: After reducing, assign the aggregated metadata back to the reduced ranges.
Here's an example illustrating these steps:
# Load the rtracklayer package
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#> colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#> get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#> match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
#> table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
# Sample data for a BED file
bed_data_1 <- "
chr1\t213941196\t213942363\tGene1\t0\t+
chr1\t213942363\t213943530\tGene2\t0\t-
chr2\t158364697\t158365864\tGene3\t0\t+
"
# Write the original sample data to a file
writeLines(bed_data_1, con = "overlap1.bed")
# Overlapping bed_data
bed_data_2 <- "
chr1\t213941500\t213942500\tGene1_overlap\t0\t+
chr1\t213943000\t213944000\tGene2_overlap\t0\t-
chr2\t158364800\t158365900\tGene3_overlap\t0\t+
"
# Write the overlapping sample data to a file
writeLines(bed_data_2, con = "overlap2.bed")
# Import the original BED file
segments1 <- import("overlap1.bed")
#> Error in import("overlap1.bed"): could not find function "import"
# Import the overlapping BED file
segments2 <- import("overlap2.bed")
#> Error in import("overlap2.bed"): could not find function "import"
# View the contents
segments1
#> Error in eval(expr, envir, enclos): object 'segments1' not found
segments2
#> Error in eval(expr, envir, enclos): object 'segments2' not found
union_segments <- reduce(c(segments1,segments2))
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'reduce': object 'segments1' not found
union_segments
#> Error in eval(expr, envir, enclos): object 'union_segments' not found
combined_segments <- c(segments1, segments2)
#> Error in eval(expr, envir, enclos): object 'segments1' not found
# Identify overlaps
overlaps <- findOverlaps(combined_segments)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'query' in selecting a method for function 'findOverlaps': object 'combined_segments' not found
# Create a data frame to store aggregated metadata (this will depend on your specific metadata)
metadata_agg <- data.frame(matrix(ncol = ncol(mcols(combined_segments)), nrow = length(combined_segments)))
#> Error in eval(expr, envir, enclos): object 'combined_segments' not found
colnames(metadata_agg) <- colnames(mcols(combined_segments))
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colnames': error in evaluating the argument 'x' in selecting a method for function 'mcols': object 'combined_segments' not found
# Loop through overlaps and aggregate metadata
for (i in seq_along(combined_segments)) {
overlapping_indices <- subjectHits(overlaps[queryHits(overlaps) == i])
# Aggregate metadata, e.g., by concatenating names (adjust according to your metadata)
if (length(overlapping_indices) > 0) {
metadata_agg[i,] <- apply(mcols(combined_segments)[overlapping_indices,], 2, paste, collapse = ";")
} else {
metadata_agg[i,] <- mcols(combined_segments)[i,]
}
}
#> Error in eval(expr, envir, enclos): object 'combined_segments' not found
# Now reduce the ranges
reduced_segments <- reduce(combined_segments)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'reduce': object 'combined_segments' not found
# Re-attach the aggregated metadata
# This is a simple approach and may need adjustments based on your data and how overlaps are handled
mcols(reduced_segments) <- metadata_agg[seq_along(reduced_segments),]
#> Error in eval(expr, envir, enclos): object 'metadata_agg' not found
# View
mcols(reduced_segments)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mcols': object 'reduced_segments' not found
Created on 2024-01-08 with reprex v2.0.2
The key point is that union()
discards the metadata on purpose because it does not always make sense to simply slap them together.