Find Overlapping Intervals When Data Frames are Different Lengths

Astoriana · February 21, 2024, 8:57pm

Hello all,

I am looking to compare pollution episodes to find events that overlap. I have 3 lists of pollution episodes for different species, all of different lengths.

I want to be able to pick out the date times where the overlap occurs.

Here is the code that I have so far (I've no idea how to even attempt this and previous attempts revealed a long list of true/false answers that I didn't know how to parse.

Files can be found here.

library(tidyverse)
library(purrr)
library(dplyr)
library(lubridate)
library(intervals)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

CO <- read.csv("file1.csv")
CO$CO.Time.Start<- ymd_hm(CO$CO.Time.Start)
CO$CO.Time.End <- ymd_hm(CO$CO.Time.End)
CO <- CO %>% mutate(interval = interval(start = CO.Time.Start, end = CO.Time.End))

NO <- read.csv("file2.csv")
NO$NO.Start.Time <- ymd_hm(NO$NO.Start.Time)
NO$NO.Stop.Time <- ymd_hm(NO$NO.Stop.Time)
NO <- NO %>% mutate(interval = interval(start = NO.Start.Time, end = NO.Stop.Time))

PM <- read.csv("C:\\file3.csv")
PM$PM.Start.Time <- ymd_hm(PM$PM.Start.Time)
PM$PM.Stop.Time <- ymd_hm(PM$PM.Stop.Time)
PM <- PM %>% mutate(interval = interval(start = PM.Start.Time, end = PM.Stop.Time))

CO / NO Episodes


NO$Event[which(CO$interval %within% NO$interval) ]

CO / PM_2.5 Episodes

PM$Event[which(CO$interval %within% PM$interval)]

NO / PM_2.5 Episodes

PM$Event[which(NO$interval %within% PM$interval)]

CO / NO / PM_2.5 Episodes

PM$Event[which(NO$interval %within% PM$interval & CO$interval %within% PM$interval)]

There should be overlap with PM and CO on July 5th, for example, and that doesn't get reported.

Number of episodes for each species, if it matters:
CO = 67
NO = 76
PM = 86

jrkrideau · February 21, 2024, 10:42pm

I think you are misreading what the {intervals} package is doing. See Overview of the intervals package.

I think what you may need is the "interval" and "int_overlaps" commands in {lubridate} package though, at the moment I an not sure what the unequal N's are doing.

Here is what I am getting so far. Note I have drastically shortened your file names and renamed your variables to cut down on typing and to remove the spaces in your variable names. R really does not like something like CO Time Start as a variable name.
Does this look like any you want?

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))

CO <- fread("CO.csv")
NO <- fread("NO.csv")
PM <- fread("PM.csv")

# Rename
names(CO) <- names(NO) <-  names(PM) <- c("event", "begn", "fin")

# Convert character data to POSIXct format
CO[, c("begn", "fin") := lapply(.SD, ymd_hm),
      .SDcols = c("begn", "fin")]

NO[, c("begn", "fin") := lapply(.SD, ymd_hm),
      .SDcols = c("begn", "fin")]

PM[, c("begn", "fin") := lapply(.SD, ymd_hm),
      .SDcols = c("begn", "fin")]

 # Get intervals
COInt <-  CO[, interval(begn, fin)]
NOInt <-  NO[, interval(begn, fin)]
PMInt <-  PM[, interval(begn, fin)]

# Look for overlaps
int_overlaps(COInt, NOInt)
int_overlaps(COInt, PMInt)

Astoriana · February 22, 2024, 3:10pm

Sort of - I had tried this method beforehand and was gifted a very long list of TRUE/FALSE.

To save time/manually counting, I want to be able to pick out the event numbers only.

My other gripe with int_overlaps is that it can only check between two intervals - I'm looking for the intersection of all 3 as well. Saving the result to a variable isn't helpful in this instance because it's a logical vector and doesn't check again.

jrkrideau · February 22, 2024, 3:26pm

Yes, it is returning a logical vector so it's doing what it is intended to do. I think I have a bit better idea of what you want though I am not sure of how to go about it. I'll poke around a bit and see what I can do.

With luck someone with a better grasp of {lubridate} data formats may be able to come up with some ideas.

FJCC · February 22, 2024, 3:58pm

The join_by() function of dplyr with its within() and overlaps() helpers may be what you need.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

NO <- read.csv("~/R/Play/NO.csv")
CO <- read.csv("~/R/Play/CO.csv")
PM <- read.csv("~/R/Play/PM.csv")

NO <- NO |> mutate(across(2:3, ymd_hm))
CO <- CO |> mutate(across(2:3, ymd_hm))
PM <- PM |> mutate(across(2:3, ymd_hm))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(2:3, ymd_hm)`.
#> Caused by warning:
#> !  2 failed to parse.

#which NO envents started and ended during a CO event?
NO_CO <- inner_join(NO, CO, join_by(within(NO_Start, NO_End, CO_Start, CO_End)))
head(NO_CO)
#>   Event.x            NO_Start              NO_End Event.y            CO_Start
#> 1       9 2023-07-22 07:00:00 2023-07-22 13:00:00      12 2023-07-21 06:00:00
#> 2      12 2023-08-06 05:00:00 2023-08-06 13:00:00      19 2023-08-05 09:00:00
#> 3      15 2023-08-12 21:00:00 2023-08-13 02:00:00      23 2023-08-12 12:00:00
#> 4      20 2023-08-28 03:00:00 2023-08-28 13:00:00      33 2023-08-27 17:00:00
#> 5      23 2023-09-07 09:00:00 2023-09-07 14:00:00      37 2023-09-06 15:00:00
#> 6      24 2023-09-07 20:00:00 2023-09-08 02:00:00      37 2023-09-06 15:00:00
#>                CO_End
#> 1 2023-07-22 23:00:00
#> 2 2023-08-07 02:00:00
#> 3 2023-08-13 10:00:00
#> 4 2023-08-28 13:00:00
#> 5 2023-09-08 11:00:00
#> 6 2023-09-08 11:00:00

#which NO events overlap CO events
NO_CO_2 <- inner_join(NO, CO, join_by(overlaps(NO_Start, NO_End, CO_Start, CO_End)))
head(NO_CO_2)
#>   Event.x            NO_Start              NO_End Event.y            CO_Start
#> 1       5 2023-07-09 02:00:00 2023-07-09 13:00:00       8 2023-07-08 07:00:00
#> 2       9 2023-07-22 07:00:00 2023-07-22 13:00:00      12 2023-07-21 06:00:00
#> 3      10 2023-07-24 06:00:00 2023-07-24 13:00:00      13 2023-07-23 03:00:00
#> 4      12 2023-08-06 05:00:00 2023-08-06 13:00:00      19 2023-08-05 09:00:00
#> 5      13 2023-08-07 02:00:00 2023-08-07 11:00:00      19 2023-08-05 09:00:00
#> 6      15 2023-08-12 21:00:00 2023-08-13 02:00:00      23 2023-08-12 12:00:00
#>                CO_End
#> 1 2023-07-09 10:00:00
#> 2 2023-07-22 23:00:00
#> 3 2023-07-24 06:00:00
#> 4 2023-08-07 02:00:00
#> 5 2023-08-07 02:00:00
#> 6 2023-08-13 10:00:00

^{Created on 2024-02-22 with reprex v2.0.2}

Astoriana · February 22, 2024, 4:08pm

Thank you! This worked. Been breaking my head with this all week.

jrkrideau · February 22, 2024, 4:15pm

Lovely, I was completely forgetting a join in this situation.

system · February 29, 2024, 4:15pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

Find Overlapping Intervals When Data Frames are Different Lengths

CO / NO Episodes

CO / PM2.5 Episodes

NO / PM2.5 Episodes

CO / NO / PM2.5 Episodes

CO / PM_2.5 Episodes

NO / PM_2.5 Episodes

CO / NO / PM_2.5 Episodes