match rows from two data sets and add new variable to correct rows

Hi all,

I am reaching out for tips on data management in tidyverse:

I have three data sets. Datasets 2a and 2b comprise each a randomized half of dataset 1. Dataset 1 contains an extra variable which I want to add to the correct rows in datasets 2a and 2b.
The order of the rows are different between the datasets, and must remain so. Does anyone have a tip about how that can be done? It is a data set with 4000 observations.

library(tidyverse)
library(knitr)

dataset1 <- tibble(name = c('Jane', 'Joe', 'Janet', 'George'), surname = c('Doe', 'Doe', 'Doh', 'Costanza'), phone = c(55512, 55513, 55514, NA))
dataset2a <- tibble(name = c('George', 'Janet'), surname = c('Costanza', 'Doh'))
dataset2b <- tibble(name = c('Joe', 'Jane'), surname = c('Doe', 'Doe'))

I am thinking there must be a way to identify identical rows across the datasets, but am not sure how to go about executing the operation. (I do have more variables than names and surnames, so all rows are uniquely identifiable within each dataset).

I hope that was clear enough, and appreciate any advice!

names would have to be unique for this sort of thing I think

library(tidyverse)

(dataset1 <- tibble(name = c('Jane', 'Joe', 'Janet', 'George'), surname = c('Doe', 'Doe', 'Doh', 'Costanza'), phone = c(55512, 55513, 55514, NA)))
(dataset2a <- tibble(name = c('George', 'Janet'), surname = c('Costanza', 'Doh')))
(dataset2b <- tibble(name = c('Joe', 'Jane'), surname = c('Doe', 'Doe')))

(dataset2ax <- left_join(dataset2a,
                       dataset1))
(dataset2bx <- left_join(dataset2b,
                       dataset1))

of course, you could add a unique row identified to dataset1, and preserve it so it propogates when you sample dataset1 and make datasets 2a and 2b, but if you are doing that, then you may as well sample the entire rows ?

Thanks for the quick reply, it is much appreciated. I think this works :slight_smile: I will try it on the actual dataset over the weekend.

Update: When trying the solution on my actual data set, the 'phone' variable returns only NA's in the dataset2ax dataset. The variable is there, but no values. Any ideas?

That would mean that dataset2a doesn't have name pairs in common with dataset1

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.