Finding and subsetting duplicates from two sources

lyrieger · April 1, 2021, 10:39pm

I'm new to R, and need some help. Basically what I have is a list of library call numbers - some come from a list of documents received from elsewhere (Source = State List), and some from our catalog (Source = Alma) - and I need to determine which numbers from the first list are already represented in our catalog.

I've compiled both lists into a dataframe with the source of call number identified:

[Image: sample dataframe]

This code gets me all the duplicated call numbers in the dataframe:

    govdoc_compare[duplicated(govdoc_compare$DocNum) | duplicated(govdoc_compare$DocNum, fromLast = TRUE),]

but it includes duplicates that are ONLY in the Alma source and don't have a matching call number in the State List Source (like the highlighted lines in my example dataframe above).

Is there a way I can filter out duplicates that only occur in the Alma source, but keep duplicates in Alma that match a call number in the State List source?

Thank you!

technocrat · April 1, 2021, 11:47pm

This can be done with the set operation functions (see help(setops)):

    intersect(1:5, 4:8)
    #> [1] 4 5
    union(1:5, 4:8)
    #> [1] 1 2 3 4 5 6 7 8

    setdiff(1:5, 4:8)
    #> [1] 1 2 3
    setdiff(4:8, 1:5)
    #> [1] 6 7 8

If you have trouble with this, see the FAQ "How to do a minimal reproducible example (reprex) for beginners" and we can circle back.
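
Applied to the question, the idea could be sketched roughly as follows. The column names `DocNum` and `Source` and the two source labels come from the original post; the call numbers themselves are made up for illustration, and the real data frame may need adjusting.

```r
# Hypothetical sample data: only the column names and source labels
# are taken from the post, the call numbers are invented.
govdoc_compare <- data.frame(
  DocNum = c("A 1.2", "A 1.2", "B 3.4", "B 3.4", "C 5.6", "D 7.8"),
  Source = c("State List", "Alma", "Alma", "Alma", "State List", "Alma"),
  stringsAsFactors = FALSE
)

# Split the call numbers by source
state_nums <- govdoc_compare$DocNum[govdoc_compare$Source == "State List"]
alma_nums  <- govdoc_compare$DocNum[govdoc_compare$Source == "Alma"]

# Call numbers that occur in both sources
in_both <- intersect(state_nums, alma_nums)

# Keep every row, from either source, whose call number is in both
matched <- govdoc_compare[govdoc_compare$DocNum %in% in_both, ]
print(matched)
```

In this toy data, "B 3.4" is duplicated only within Alma, so it is dropped, while both "A 1.2" rows are kept. Swapping `intersect` for `setdiff(state_nums, alma_nums)` would instead give the State List numbers that are not yet in the catalog.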

lyrieger · April 2, 2021, 5:10pm

Thank you! I was able to use union to get rid of the dups in individual sources, and then run my original code line to remove anything left that didn't match across sources. This is perfect!
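
For anyone following along, the two-step workflow described above might look something like this sketch. It is not the exact code that was run: base `unique()` stands in for whichever `union` call was used, it assumes there are only the two sources, and the sample call numbers are invented (only `DocNum`, `Source`, and the source labels come from the thread).

```r
# Hypothetical sample data: only the column names and source labels
# are taken from the thread, the call numbers are invented.
govdoc_compare <- data.frame(
  DocNum = c("A 1.2", "A 1.2", "B 3.4", "B 3.4", "C 5.6", "D 7.8"),
  Source = c("State List", "Alma", "Alma", "Alma", "State List", "Alma"),
  stringsAsFactors = FALSE
)

# Step 1: drop repeated (DocNum, Source) rows so each call number
# appears at most once per source
deduped <- unique(govdoc_compare[, c("DocNum", "Source")])

# Step 2: with two sources, a call number that is still duplicated
# must appear in both of them
is_dup <- duplicated(deduped$DocNum) |
  duplicated(deduped$DocNum, fromLast = TRUE)
cross_source <- deduped[is_dup, ]
print(cross_source)
```

Note that step 1 collapses repeated Alma rows into one. If every matching Alma row is needed, filtering the original data frame with `govdoc_compare$DocNum %in% cross_source$DocNum` would keep them all.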

I'm going to run it on a slightly larger test set to make sure I didn't miss anything, but you have saved me HOURS.
