Tidy way to range join tables, on an interval of dates

AJF · April 30, 2018, 7:41pm

Hi all,

I am wondering if there is a "tidy" way to join two data frames, where the joining variable will not necessarily be an exact match (I will give an example below, but look at this and this to see similar questions, and non-tidy ways of working with it.

suppressPackageStartupMessages(library(tidyverse))

df1 <- tibble::tribble(
  ~id, ~category,       ~date,
   1L,       "a",  "7/1/2000",
   2L,       "b", "11/1/2000",
   3L,       "c",  "7/1/2002"
  ) %>% mutate(date = as.Date(date, format = "%m/%d/%Y"))

df2 <- tibble::tribble(
         ~category, ~other_info,     ~start,         ~end,
               "a",         "x", "1/1/2000", "12/31/2000",
               "b",         "y", "1/1/2001", "12/31/2001",
               "c",         "z", "1/1/2002", "12/31/2002"
         ) %>% mutate_at(vars(start, end), as.Date, format = "%m/%d/%Y")

df1
#> # A tibble: 3 x 3
#>      id category date      
#>   <int> <chr>    <date>    
#> 1     1 a        2000-07-01
#> 2     2 b        2000-11-01
#> 3     3 c        2002-07-01
df2
#> # A tibble: 3 x 4
#>   category other_info start      end       
#>   <chr>    <chr>      <date>     <date>    
#> 1 a        x          2000-01-01 2000-12-31
#> 2 b        y          2001-01-01 2001-12-31
#> 3 c        z          2002-01-01 2002-12-31

Since the rows for id 1 and 3 have a category that matches AND a date in between its corresponding start and end, I want the join to work. Since the row for id 2 does not have a date within the range of the corresponding category, I want it to not match. My final table should be a "LEFT JOIN" and output the following (using the non-tidy package sqldf)

sqldf::sqldf("SELECT a.id, a.category, b.other_info, a.date, b.start, b.end
      FROM df1 a
      LEFT JOIN df2 b on a.category = b.category AND
      a.date >= b.start AND a.date <= b.end") %>%
  as_tibble()
#> # A tibble: 3 x 6
#>      id category other_info date       start      end       
#>   <int> <chr>    <chr>      <date>     <date>     <date>    
#> 1     1 a        x          2000-07-01 2000-01-01 2000-12-31
#> 2     2 b        <NA>       2000-11-01 NA         NA        
#> 3     3 c        z          2002-07-01 2002-01-01 2002-12-31

There also seems to be a way to do this through data.table.

Is there a way with dplyr or one of its tidy friends? The following does not work.

left_join(df1, df2,  by = c("category", "date" >= "start", "date" <= "end"))
#> Error: `by` can't contain join column `FALSE`, `TRUE` which is missing from LHS

Thanks!

mfherman · April 30, 2018, 8:30pm

You can do it with the fuzzyjoin package, which implements various not quite exact matching joins in dplyr syntax.

library(tidyverse)
library(fuzzyjoin)

df1 <- tibble::tribble(
  ~id, ~category,       ~date,
  1L,       "a",  "7/1/2000",
  2L,       "b", "11/1/2000",
  3L,       "c",  "7/1/2002"
  ) %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y"))

df2 <- tibble::tribble(
  ~category, ~other_info,     ~start,         ~end,
  "a",         "x", "1/1/2000", "12/31/2000",
  "b",         "y", "1/1/2001", "12/31/2001",
  "c",         "z", "1/1/2002", "12/31/2002"
  ) %>%
  mutate_at(vars(start, end), as.Date, format = "%m/%d/%Y")

fuzzy_left_join(
  df1, df2,
  by = c(
    "category" = "category",
    "date" = "start",
    "date" = "end"
    ),
  match_fun = list(`==`, `>=`, `<=`)
  ) %>%
  select(id, category = category.x, other_info, date, start, end)
#> # A tibble: 3 x 6
#>      id category other_info date       start      end       
#>   <int> <chr>    <chr>      <date>     <date>     <date>    
#> 1     1 a        x          2000-07-01 2000-01-01 2000-12-31
#> 2     2 b        <NA>       2000-11-01 NA         NA        
#> 3     3 c        z          2002-07-01 2002-01-01 2002-12-31

Created on 2018-04-30 by the reprex package (v0.2.0).

The extra argument, in the fuzzy_left_join() function, match_fun, allows you to define the matching criterion for each pair of columns as a function. In this case, we want category == category, date >= start, and date <= end.