Part of my data science project maintains a dated archive of our dataset, but something caused the file sizes to balloon in September 2024:
-rw-r--r--@ 1 bjorkjcr staff 54747 Nov 19 08:12 data/deaths-entries-2024-08-12.rds
-rw-r--r--@ 1 bjorkjcr staff 49324 Nov 19 08:12 data/deaths-entries-2024-08-12s.rds
-rw-r--r--@ 1 bjorkjcr staff 55075 Nov 19 08:12 data/deaths-entries-2024-09-06.rds
-rw-r--r--@ 1 bjorkjcr staff 511146 Nov 19 08:12 data/deaths-entries-2024-09-15.rds
-rw-r--r--@ 1 bjorkjcr staff 512710 Nov 19 08:12 data/deaths-entries-2024-11-12.rds
-rw-r--r--@ 1 bjorkjcr staff 512939 Dec 4 20:52 data/deaths-entries-2024-12-04.rds
-rw-r--r--@ 1 bjorkjcr staff 513706 Dec 15 22:23 data/deaths-entries-2024-12-11.rds
Investigating the files, I find no differences in the corresponding dataframes before and after the shift. Loading the two files from September 6 and September 15 shows their contents to be identical, as verified with diffdf. Even weirder, if I load the smaller, older file and save it again today, I get a much larger file:
-rw-r--r--@ 1 bjorkjcr staff 511146 Nov 19 08:12 data/deaths-entries-2024-09-15.rds
-rw-r--r--@ 1 bjorkjcr staff 511146 Mar 3 15:37 data/deaths-entries-2024-09-small.rds
I’m left to suspect that something changed in write_rds(), the function that saves the dataframe, likely with a package update to readr, but I can’t figure out what it is.
Is there a good way to inspect the two files and see differences in format? Are there any settings in write_rds() that could plausibly produce a 10x difference in file size for the same data?
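One way to look for a format difference is to read the first few bytes of each file and compare them against the well-known signatures of gzip, bzip2, and xz. The following is only a minimal sketch, assuming compression is what differs between the files: `detect_compression()` is a hypothetical helper (not part of readr or base R), and the demo writes the built-in `mtcars` dataset with base `saveRDS()` to temporary files instead of touching the real archive.

```r
# Hypothetical helper: guess how an .rds file was compressed by
# inspecting its leading "magic" bytes.
detect_compression <- function(path) {
  magic <- readBin(path, what = "raw", n = 6)
  if (identical(magic[1:2], as.raw(c(0x1f, 0x8b)))) {
    return("gzip")
  }
  if (identical(magic[1:3], charToRaw("BZh"))) {
    return("bzip2")
  }
  if (identical(magic[1:6], as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))) {
    return("xz")
  }
  # An uncompressed binary (XDR) serialization starts with "X\n".
  if (identical(magic[1:2], charToRaw("X\n"))) {
    return("none (XDR)")
  }
  "unknown"
}

# Demo on temporary files: same data, saved with and without gzip.
gz_path  <- tempfile(fileext = ".rds")
raw_path <- tempfile(fileext = ".rds")
saveRDS(mtcars, gz_path)                    # base R default: gzip
saveRDS(mtcars, raw_path, compress = FALSE) # no compression

print(detect_compression(gz_path))
print(detect_compression(raw_path))
print(file.size(gz_path))
print(file.size(raw_path))

# The contents are the same object either way.
print(identical(readRDS(gz_path), readRDS(raw_path)))
```

Pointing `detect_compression()` at `data/deaths-entries-2024-09-06.rds` and `data/deaths-entries-2024-09-15.rds` would show whether the two were written differently. If the older files turn out to be gzip-compressed and the newer ones are not, the `compress` argument would be a plausible explanation: readr's `write_rds()` defaults to `compress = "none"`, while base `saveRDS()` uses gzip by default, so `write_rds(df, path, compress = "gz")` should bring the size back down.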