Sidenote: those ID numbers are reeeeaaaaaally long (19 digits, by my count). So long, in fact, that R's integer type can't hold them (it tops out a bit above 2.1 billion), and the double type (which isn't really appropriate for storing ID numbers anyway) can only represent them to about 15 to 17 significant figures:
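For example, with a made-up 19-digit ID:

```r
# A made-up 19-digit ID, stored as a double because it's far too big for an integer
id <- 1234567890123456789

# Doubles only keep ~15-17 significant digits, so the stored value has already drifted:
sprintf("%.0f", id)
#> [1] "1234567890123456768"

# Worse for duplicate-hunting: IDs that differ only in the last digit or two
# can collapse to the exact same double
1234567890123456789 == 1234567890123456788
#> [1] TRUE
```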
If read_csv is reading these ID numbers in as doubles, you're likely losing some of those figures in the import process, which could explain why you're getting so few duplicates. I would've thought you'd get an error or warning if that was happening, but it's hard to tell without seeing what the data frame looks like once it's read into R.
If you can't make these ID numbers smaller and you need them to find duplicates, maybe have read_csv read them in as strings using the col_types argument.
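Something along these lines should keep them intact (I'm guessing at the file name and the ID column name, so adjust both to match your data):

```r
library(readr)

# Force the ID column in as character so no digits are lost;
# every other column is still guessed as usual
tweets <- read_csv(
  "tweets.csv",
  col_types = cols(status_id = col_character())
)

# Duplicate checks then compare the full 19-digit strings
sum(duplicated(tweets$status_id))
```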
EDIT: Excel also uses double precision, so if you read your data into Excel and let it treat this column as numeric, it'll probably mangle the numbers there too.
Correct, these are Twitter IDs and so can't be changed, at least not without creating other issues. Thanks for the tip on strings; I'll try that. I think the workflow without Excel makes sense, it's just that I originally collected 3 months' worth of tweets in Excel format. If I start collecting again, I'll collect 7 days' worth in a .csv, then use purrr to combine them all into a tibble.
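Roughly what I have in mind for combining the weekly files (the folder path and column name are placeholders):

```r
library(readr)
library(purrr)

# All of the weekly .csv files in one folder
files <- list.files("data/weekly_tweets", pattern = "\\.csv$", full.names = TRUE)

# Read each file with the IDs kept as character, then row-bind into one tibble
all_tweets <- map_dfr(files, read_csv,
                      col_types = cols(status_id = col_character()))

# Drop duplicate tweets across the whole collection period
all_tweets <- all_tweets[!duplicated(all_tweets$status_id), ]
```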
Having to collect your data again sucks, but hopefully if you stick to handling the IDs as strings, you'll have fewer problems. The inconsistency between Excel and R re. duplicate counts might also be related to this, if you were inspecting the data in Excel, saving it, and then analysing further in R! It's hard to test this in a consistent way, because loss of precision tends not to be very consistent.
The workflow works fine with .csv and purrr now, and I've also automated the script with Windows Task Scheduler to run weekly. Thanks everyone for the assistance.