Extract data from large dataframes

Hi, I am a recovering SAS addict. Please forgive if my questions seem very basic, but I am trying to grasp the fundementals of R.
One of my usual tasks is to extract data on individuals of interest from large datasets. Let us assume that I have two dataframes, one includes my population:


and the other is a dataset which includes the data I am interested in:


Note that df2 includes some of the individuals from df1, it also includes other individuals, and each id in df1 may appear between zero and several times in df2.

What I want is to create a new variable (df1$newvar) which will assume the value 1 if the following conditions are met (and 0 if they are not):
There exists a record in df2 with

  1. the same value on id
    and 2) any of the variables v1-v6 has the value "a", "b", or "c".

In reality df2 would obviously hold millions of records from thousands of individuals, but if R can do it in a tiny scale , it can also do it in a large scale.

Hi @Catharantus , welcome to the community.
Here's a solution below, using the tidyverse.


#enter your conditions in a string, each condition separate by the pipe
#below it means you're looking for "a", "b" or "c"
condition <- c("a|b|c")

df1 %>% 
  #join df1 and df2
  left_join(df2) %>% 
  #"merge" all columns into a unique column, if you have more columns in your real data, make sure you adapt v1:v6 
  unite("pasted_cols", v1:v6, sep="", na.rm = TRUE) %>% 
  group_by(id) %>% 
  #summarize to have one row per id, `pasted_cols` wil show all possible values of each id from df2
  summarise(pasted_cols = paste(pasted_cols, collapse = "")) %>% 
  ungroup() %>% 
  #detect if any of your condition (a, b or c) is present in the values of each id, put 1 if yes, 0 if no
  mutate(newvar = ifelse(str_detect(pasted_cols, condition), 1, 0)) %>% 

I hope it works on your real data, let me know if any question.

Thank you @xvalda ! It works perfectly! And thanks for the stepwise approach with explications! I feel welcome in the R community!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.