Extract data from large dataframes

Catharantus · April 9, 2022, 8:00am

Hi, I am a recovering SAS addict. Please forgive if my questions seem very basic, but I am trying to grasp the fundementals of R.
One of my usual tasks is to extract data on individuals of interest from large datasets. Let us assume that I have two dataframes, one includes my population:

df1<-data.frame(id=c(1,2,3,4,5))

and the other is a dataset which includes the data I am interested in:

id<-c(2,2,2,2,3,4,4,5,5,5,7,7,8,9,9)
v1<-c("w","h","y","d","o","i","h","a","v","e","t","o","l","e","a")
v2<-c("r","n","r","w","h","y","d","o","e","s","s","a","s","h","a")
v3<-c("v","e","t","o","b","e","s","o","c","o","n","s","u","m","e")
v4<-c("r","u","n","f","r","i","e","n","d","l","y","a","n","d","u")
v5<-c("s","e","a","r","o","t","t","e","n","b","u","s","i","n","e")
v6<-c("s","s","m","o","d","e","l","w","h","y","w","h","y","w","h")
df2<-data.frame(id,v1,v2,v3,v4,v5,v6)

Note that df2 includes some of the individuals from df1, it also includes other individuals, and each id in df1 may appear between zero and several times in df2.

What I want is to create a new variable (df1$newvar) which will assume the value 1 if the following conditions are met (and 0 if they are not):
There exists a record in df2 with

the same value on id
and 2) any of the variables v1-v6 has the value "a", "b", or "c".

In reality df2 would obviously hold millions of records from thousands of individuals, but if R can do it in a tiny scale , it can also do it in a large scale.

xvalda · April 9, 2022, 10:27am

Hi @Catharantus , welcome to the community.
Here's a solution below, using the tidyverse.

library(tidyverse)

#enter your conditions in a string, each condition separate by the pipe
#below it means you're looking for "a", "b" or "c"
condition <- c("a|b|c")


df1 %>% 
  #join df1 and df2
  left_join(df2) %>% 
  #"merge" all columns into a unique column, if you have more columns in your real data, make sure you adapt v1:v6 
  unite("pasted_cols", v1:v6, sep="", na.rm = TRUE) %>% 
  group_by(id) %>% 
  #summarize to have one row per id, `pasted_cols` wil show all possible values of each id from df2
  summarise(pasted_cols = paste(pasted_cols, collapse = "")) %>% 
  ungroup() %>% 
  #detect if any of your condition (a, b or c) is present in the values of each id, put 1 if yes, 0 if no
  mutate(newvar = ifelse(str_detect(pasted_cols, condition), 1, 0)) %>% 
  select(-pasted_cols)

I hope it works on your real data, let me know if any question.

Catharantus · April 11, 2022, 6:57am

Thank you @xvalda ! It works perfectly! And thanks for the stepwise approach with explications! I feel welcome in the R community!

system · April 18, 2022, 6:57am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.