File size of RDS files suddenly grew x10 with write_rds

One part of my data science project maintains a dated archive of our dataset, but something caused the file sizes to balloon in September 2024…

-rw-r--r--@ 1 bjorkjcr  staff   54747 Nov 19 08:12 data/deaths-entries-2024-08-12.rds
-rw-r--r--@ 1 bjorkjcr  staff   49324 Nov 19 08:12 data/deaths-entries-2024-08-12s.rds
-rw-r--r--@ 1 bjorkjcr  staff   55075 Nov 19 08:12 data/deaths-entries-2024-09-06.rds
-rw-r--r--@ 1 bjorkjcr  staff  511146 Nov 19 08:12 data/deaths-entries-2024-09-15.rds
-rw-r--r--@ 1 bjorkjcr  staff  512710 Nov 19 08:12 data/deaths-entries-2024-11-12.rds
-rw-r--r--@ 1 bjorkjcr  staff  512939 Dec  4 20:52 data/deaths-entries-2024-12-04.rds
-rw-r--r--@ 1 bjorkjcr  staff  513706 Dec 15 22:23 data/deaths-entries-2024-12-11.rds

Investigating the files, I find no differences in the corresponding dataframe before and after the shift: loading the files from September 6 and September 15 shows them to be identical, as verified with diffdf. Even weirder, if I load the smaller, older file and save it again today, I get a much larger file:

-rw-r--r--@ 1 bjorkjcr  staff  511146 Nov 19 08:12 data/deaths-entries-2024-09-15.rds
-rw-r--r--@ 1 bjorkjcr  staff  511146 Mar  3 15:37 data/deaths-entries-2024-09-small.rds

I’m left to suspect that something changed in the function that saves the dataframe, write_rds(), likely with a package update to readr, but I can’t figure out what it is.

Is there a good way to inspect the two files and see differences in format? Are there any plausible settings in write_rds() that could result in a 10x difference in filesize for the same data?

It doesn't look like the source code of readr::write_rds() has changed since at least 2022, so is it possible that your saving script has changed somehow? For example, write_rds() does not compress the data by default, while saveRDS() does (gzip). Maybe the function being called is not the same as before?

As a test, if you use write_rds(..., compress = "gz"), do you obtain the same file size as September 6?
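On the question of inspecting the two files: one rough way to tell whether an .rds file was written compressed is to look at its first few bytes. The sketch below assumes a hypothetical helper, rds_compression(), that is not part of readr or base R; it only compares the leading bytes against the well-known gzip, bzip2 and xz signatures and the "X\n" / "A\n" header that R writes for uncompressed serialized data. The demo uses base saveRDS() on temporary files rather than the archive files from this thread.

```r
# Hypothetical helper (not from readr or base R): guess how an .rds file
# was compressed by looking at its leading "magic" bytes.
rds_compression <- function(path) {
  magic <- readBin(path, what = "raw", n = 6L)

  # TRUE if the file starts with the given raw signature
  starts_with <- function(sig) {
    length(magic) >= length(sig) && all(magic[seq_along(sig)] == sig)
  }

  if (starts_with(as.raw(c(0x1f, 0x8b)))) {
    "gzip"
  } else if (starts_with(charToRaw("BZh"))) {
    "bzip2"
  } else if (starts_with(as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))) {
    "xz"
  } else if (starts_with(charToRaw("X\n")) || starts_with(charToRaw("A\n"))) {
    # Uncompressed R serialization: binary ("X\n") or ASCII ("A\n")
    "none"
  } else {
    "unknown"
  }
}

# Demo on throwaway files (made-up data, not the dataset from this thread)
toy <- data.frame(id = 1:1000, value = rep(c("a", "b"), 500))
plain_path <- tempfile(fileext = ".rds")
gzip_path <- tempfile(fileext = ".rds")
saveRDS(toy, plain_path, compress = FALSE) # no compression
saveRDS(toy, gzip_path)                    # saveRDS() default: gzip

print(rds_compression(plain_path))
print(rds_compression(gzip_path))
```

Running the helper on the September 6 and September 15 files should show whether the size jump coincides with a switch from compressed to uncompressed output.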

Okay, well that intervention works:

> write_rds(de.small, here_filename("data/deaths-entries-2024-09-small.rds"), compress = "gz")
> write_rds(de.small, here_filename("data/deaths-entries-2024-09-small-nocompression.rds"))

produces…

-rw-r--r--@ 1 bjorkjcr  staff  511146 Mar  3 23:33 data/deaths-entries-2024-09-small-nocompression.rds
-rw-r--r--@ 1 bjorkjcr  staff   55066 Mar  3 22:16 data/deaths-entries-2024-09-small.rds

However, there really is no change in the code calling write_rds(), so I can't quite puzzle out what happened here. At least I can add the compression preference to the code and proceed.
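For anyone landing here later, a minimal sketch of pinning the compression explicitly so the archive no longer depends on any function's default. The save_snapshot() wrapper and the toy data are hypothetical stand-ins; it uses base saveRDS() with compress = "gzip", which should behave like write_rds(..., compress = "gz") for this purpose.

```r
# Hypothetical wrapper: always write gzip-compressed .rds snapshots,
# regardless of which saving function's default would otherwise apply.
save_snapshot <- function(x, path) {
  saveRDS(x, file = path, compress = "gzip")
  invisible(path)
}

# Made-up stand-in for the archived dataframe
snapshot <- data.frame(id = seq_len(5000), cause = rep(c("a", "b"), 2500))

raw_path <- tempfile(fileext = ".rds")
gz_path <- tempfile(fileext = ".rds")

saveRDS(snapshot, raw_path, compress = FALSE) # uncompressed, for comparison
save_snapshot(snapshot, gz_path)              # explicitly gzip-compressed

# Sizes differ, but the data read back from both files is the same
print(file.size(c(raw_path, gz_path)))
print(identical(readRDS(raw_path), readRDS(gz_path)))
```

The compressed and uncompressed files read back as the same dataframe, which matches what was seen above: identical data, very different file sizes.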
