Sidenote: those ID numbers are reeeeaaaaaally long (19 digits, by my count). So long, in fact, that R's integer type can't hold them (it tops out a bit above 2.1 billion), and the double type (which isn't really appropriate for storing ID numbers anyway) can only represent them to about 15 to 17 significant figures:
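For example, with a made-up 19-digit ID:

```r
# A made-up 19-digit ID, stored as a double because it's far too big for an integer
id <- 1234567890123456789

# Doubles only keep ~15-17 significant digits, so the stored value has already drifted:
sprintf("%.0f", id)
#> [1] "1234567890123456768"

# Worse for duplicate-hunting: IDs that differ only in the last digit or two
# can collapse to the exact same double
1234567890123456789 == 1234567890123456788
#> [1] TRUE
```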
If read_csv is reading these ID numbers in as doubles, you're likely losing some of those figures in the import process, which could explain why you're getting so few duplicates. I would've thought you'd get an error or warning if that was happening, but it's hard to tell without seeing what the data frame looks like once it's read into R.
If you can't make these ID numbers smaller and you need them to find duplicates, maybe have read_csv read them in as strings using the col_types argument.
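Something along these lines should keep them intact (I'm guessing at the file name and the ID column name, so adjust both to match your data):

```r
library(readr)

# Force the ID column in as character so no digits are lost;
# every other column is still guessed as usual
tweets <- read_csv(
  "tweets.csv",
  col_types = cols(status_id = col_character())
)

# Duplicate checks then compare the full 19-digit strings
sum(duplicated(tweets$status_id))
```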
EDIT: Excel also uses double precision, so if you read your data into Excel and let it treat this column as numeric, it'll probably mangle the numbers there too.
Correct, these are Twitter IDs and so can't be changed, at least not without creating other issues. Thanks for the tip on strings; I'll try that. I think the workflow without Excel makes sense, it's just that I originally collected 3 months' worth of tweets in Excel format. If I start collecting again, I'll collect 7 days' worth in a .csv, then use purrr to combine them all into a tibble.
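Roughly what I have in mind for combining the weekly files (the folder path and column name are placeholders):

```r
library(readr)
library(purrr)

# All of the weekly .csv files in one folder
files <- list.files("data/weekly_tweets", pattern = "\\.csv$", full.names = TRUE)

# Read each file with the IDs kept as character, then row-bind into one tibble
all_tweets <- map_dfr(files, read_csv,
                      col_types = cols(status_id = col_character()))

# Drop duplicate tweets across the whole collection period
all_tweets <- all_tweets[!duplicated(all_tweets$status_id), ]
```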
Having to collect your data again sucks, but hopefully if you stick to handling the IDs as strings, you'll have fewer problems. The inconsistency between Excel and R re. duplicate counts might also be related to this, if you were inspecting the data in Excel, saving it, and then analysing further in R! It's hard to test this in a consistent way, because loss of precision tends not to be very consistent.
The workflow works fine with .csv and purrr now, and I've also automated the script with Windows Task Scheduler to run weekly. Thanks everyone for the assistance.