Merging two data sets based on the id and another variable

Khundiman · October 3, 2018, 3:29pm

Below is my attempt which works. But this can't seem to scale to my project data sets that have records of 4500 and 3000 respectively.

I have two questions:

How do I make the code efficient
How do I return a dataset instead of a list.

library(dplyr)

df1 <- data.frame(id=1:8, ch= c("a","b","c","f","r","i","k","o"), p=9:16)

df2 <- data.frame(id=c(1,2,9,3,8,6,9,10), ch= c("a","b","c","c","r","i","k","o"), p=17:24)


data <- vector("list")

for(i in 1:nrow(df1)){
  
  for(k in 1:nrow(df2)){
    #Merge eligibility form with exit for but only when the date of interview match
    if((df1$id[i]== df2$id[k] & df1$ch[i]== df2$ch[k]) &
       !is.na(df1$id[i]== df2$id[k] & df1$ch[i]== df2$ch[k])){
      data[[i]] <-inner_join(df1[i,],df2[k,], by = "id") }
  }
}

data
#> [[1]]
#>   id ch.x p.x ch.y p.y
#> 1  1    a   9    a  17
#> 
#> [[2]]
#>   id ch.x p.x ch.y p.y
#> 1  2    b  10    b  18
#> 
#> [[3]]
#>   id ch.x p.x ch.y p.y
#> 1  3    c  11    c  20
#> 
#> [[4]]
#> NULL
#> 
#> [[5]]
#> NULL
#> 
#> [[6]]
#>   id ch.x p.x ch.y p.y
#> 1  6    i  14    i  22

^{Created on 2018-10-03 by the reprex package (v0.2.1)}

joels · October 3, 2018, 3:34pm

You can join on id and ch directly:

inner_join(df1, df2, by=c("id","ch"))

  id ch p.x p.y
1  1  a   9  17
2  2  b  10  18
3  3  c  11  20
4  6  i  14  22

You can also explicitly mark which data frame p came from:

inner_join(df1, df2, by=c("id","ch"), suffix=c("_df1", "_df2"))

  id ch p_df1 p_df2
1  1  a     9    17
2  2  b    10    18
3  3  c    11    20
4  6  i    14    22

Khundiman · October 3, 2018, 3:43pm

Thanks so much @joels