Look up values by comparing two datasets

Hi everyone,

I have two datasets with the same structure: the population, and the sample, which is a subset of the population drawn without replacement using the sample_n() function. Both datasets have a unique ID for every observation, assigned before the sample was drawn.
My problem is that I want to go back to the population dataset and create a new column, which we may call respondent category (a binary variable taking the value population or sample). I would like to look up each unique ID in the population dataset and check whether it also appears in the sample dataset. If the ID is found in the sample data frame, that row is assigned sample, indicating that the observation was drawn into our sample; otherwise it is assigned population.

Normally I do this in Excel using the XLOOKUP function, but I would appreciate it if someone could help with code to do this in R, in a tidyverse context. The attached datasets help illustrate my problem.

Cohort 3 Population.pdf (79.6 KB)
cohort_3_sample.pdf (78.9 KB)

Hi @rigs,

Thank you for your question.

Here's how I would go about this, using the %in% operator to check whether each population$Participant_ID is found in sample$Participant_ID:

population <- tibble::tribble(
  ~Participant_ID, ~Gender, ~County,
  25400101259, 'Female', 'Mombasa',
  25402701260, 'Female', 'Uasin Gishu',
  25404701261, 'Male', 'Nairobi',
  25404701262, 'Female', 'Nairobi',
  25402201263, 'Female', 'Kiambu',
  25403901264, 'Female', 'Bungoma',
  25402701265, 'Male', 'Uasin Gishu',
  25400101266, 'Female', 'Mombasa',
  25401901267, 'Female', 'Nyeri',
  25402701268, 'Female', 'Uasin Gishu',
)

# take a sample of 5 people from the population
# (sample_n() draws without replacement by default)
sample <-
  population |> 
  dplyr::sample_n(size = 5)

# update the population data to reflect which people have been sampled
population <-
  population |> 
  dplyr::mutate(
    Category = dplyr::if_else(
      condition = Participant_ID %in% sample$Participant_ID,
      true = 'sample',
      false = 'population'
    )
  )

# see the result
population
#> # A tibble: 10 × 4
#>    Participant_ID Gender County      Category  
#>             <dbl> <chr>  <chr>       <chr>     
#>  1    25400101259 Female Mombasa     population
#>  2    25402701260 Female Uasin Gishu population
#>  3    25404701261 Male   Nairobi     sample    
#>  4    25404701262 Female Nairobi     population
#>  5    25402201263 Female Kiambu      sample    
#>  6    25403901264 Female Bungoma     sample    
#>  7    25402701265 Male   Uasin Gishu population
#>  8    25400101266 Female Mombasa     sample    
#>  9    25401901267 Female Nyeri       population
#> 10    25402701268 Female Uasin Gishu sample

Created on 2025-01-24 with reprex v2.1.1
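
For anyone coming from XLOOKUP, a join-based version of the same lookup may feel more familiar. The following is only a sketch under assumptions: the miniature data, the seed, and the names `sampled`, `sample_flags`, and `population_flagged` are made up for illustration and are not taken from the attached files. It also uses `slice_sample()`, which dplyr documents as the replacement for the superseded `sample_n()`.

```r
library(dplyr)

# Hypothetical miniature data, not the poster's attached files
population <- tibble::tibble(
  Participant_ID = c(101, 102, 103, 104, 105, 106),
  County = c("Mombasa", "Nairobi", "Kiambu", "Nyeri", "Bungoma", "Nairobi")
)

set.seed(42)  # assumed seed, only so the draw is repeatable
sampled <- population |>
  slice_sample(n = 3)  # draws without replacement by default

# Build a small lookup table: one row per sampled ID, flagged "sample"
sample_flags <- sampled |>
  distinct(Participant_ID) |>
  mutate(Category = "sample")

# Look each population ID up in the lookup table;
# IDs with no match come back as NA and are filled with "population"
population_flagged <- population |>
  left_join(sample_flags, by = "Participant_ID") |>
  mutate(Category = coalesce(Category, "population"))

population_flagged
```

Both approaches should give the same Category column. The `%in%` version is shorter when only a yes/no flag is needed, while the join is probably the better fit if you also want to bring extra columns over from the sample data frame.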

Thanks for the help, this is exactly what I was looking for.