How to subset a dataframe based on mulitple key identifiers in another dataframe.

I need to subset a dataframe which has common combination of values of a certain select number of columns in another dataframe.
So lets say:
df1<- data.frame(
key1= c('a', 'b', 'c', 'd'),
key2=c(1,2,3,4)
var3=........
var4=.......
)

df2<-data.frame(
key1= c('a', 'b', 'e', 'f'),
key2=c(1,2,4,5)
var3=........
var4=.......
)
So i need only those values observations of df2 which have common combinations of key1 and key2 in df1, so like common_df<-data.frame(
key1=c('a','b'),
key2=c(1,2),
var3=....,
var4=......
)
How do i do that.

merge/join your sets based on the common key.
with an inner join which preserves only rows that match in both sets your issue is solved in one step, plus some cleanup to say where they have variables in common like var3 , you only want to keep what came from one side

library(tidyverse)

df1 <- data.frame(
  key1 = c("a", "b", "c", "d"),
  key2 = c(1, 2, 3, 4),
  var3 = c(1, 2, 3, 4)
)

df2 <- data.frame(
  key1 = c("a", "b", "e", "f"),
  key2 = c(1, 2, 4, 5),
  var3 = c(1, 2, 5, 6)
)
# So i need only those values observations of df2 which have common combinations of key1 and key2 in df1, so like
common_df <- data.frame(
  key1 = c("a", "b"),
  key2 = c(1, 2),
  var3 = c(1, 2)
)


result <- inner_join(df1,
  df2,
  by = join_by(key1, key2),
  suffix = c("", "_y")
) |>
  select(-ends_with("_y"))

identical(common_df,result)
# TRUE
1 Like

For this I often concatenate multiple key values to a single key value with paste0 in both tables, and use that to find the matching rows. The join approach also is good, though this would be a little less code, and does not copy any data in memory. This just subsets the rows of df2 in which that row's key value also exists in the df1 key values.

Base R:
common_df <- df2[which(paste0(df2$key1,df2$key2) %in% paste0(df1$key1,df1$key2)), ]

Cheers

1 Like

It sounds like you want what's called a semijoin: The rows of df2 that satisfy the match you want with df1:

library(dplyr)

df1<-
  data.frame(
    key1 = c('a', 'b', 'c', 'd'),
    key2 = c(1, 2, 3,  4),
    var3 = 1,
    var4 = 2
  )

df2<-
  data.frame(
    key1 = c('a', 'b', 'e', 'f'),
    key2 = c(1, 2, 4, 5),
    var3 = 3,
    var4 = 4
  )

df2 |> 
  semi_join(df1, by = join_by(key1, key2))
#>   key1 key2 var3 var4
#> 1    a    1    3    4
#> 2    b    2    3    4

Created on 2024-07-14 with reprex v2.0.2

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.