match rows from two data sets and add new variable to correct rows

SveinungA · May 28, 2021, 1:10pm

Hi all,

I am reaching out for tips on data management in tidyverse:

I have three data sets. Datasets 2a and 2b comprise each a randomized half of dataset 1. Dataset 1 contains an extra variable which I want to add to the correct rows in datasets 2a and 2b.
The order of the rows are different between the datasets, and must remain so. Does anyone have a tip about how that can be done? It is a data set with 4000 observations.

library(tidyverse)
library(knitr)

dataset1 <- tibble(name = c('Jane', 'Joe', 'Janet', 'George'), surname = c('Doe', 'Doe', 'Doh', 'Costanza'), phone = c(55512, 55513, 55514, NA))
dataset2a <- tibble(name = c('George', 'Janet'), surname = c('Costanza', 'Doh'))
dataset2b <- tibble(name = c('Joe', 'Jane'), surname = c('Doe', 'Doe'))

I am thinking there must be a way to identify identical rows across the datasets, but am not sure how to go about executing the operation. (I do have more variables than names and surnames, so all rows are uniquely identifiable within each dataset).

I hope that was clear enough, and appreciate any advice!

nirgrahamuk · May 28, 2021, 1:19pm

names would have to be unique for this sort of thing I think

library(tidyverse)

(dataset1 <- tibble(name = c('Jane', 'Joe', 'Janet', 'George'), surname = c('Doe', 'Doe', 'Doh', 'Costanza'), phone = c(55512, 55513, 55514, NA)))
(dataset2a <- tibble(name = c('George', 'Janet'), surname = c('Costanza', 'Doh')))
(dataset2b <- tibble(name = c('Joe', 'Jane'), surname = c('Doe', 'Doe')))

(dataset2ax <- left_join(dataset2a,
                       dataset1))
(dataset2bx <- left_join(dataset2b,
                       dataset1))

of course, you could add a unique row identified to dataset1, and preserve it so it propogates when you sample dataset1 and make datasets 2a and 2b, but if you are doing that, then you may as well sample the entire rows ?

SveinungA · May 28, 2021, 1:41pm

Thanks for the quick reply, it is much appreciated. I think this works I will try it on the actual dataset over the weekend.

SveinungA · May 31, 2021, 9:44am

Update: When trying the solution on my actual data set, the 'phone' variable returns only NA's in the dataset2ax dataset. The variable is there, but no values. Any ideas?

nirgrahamuk · May 31, 2021, 10:42am

That would mean that dataset2a doesn't have name pairs in common with dataset1

system · June 7, 2021, 10:42am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.