Hi there!
I'm trying to optimize the code in my reproducible research (an .Rmd notebook) to reduce the time it takes to re-knit.
One thing I've done is a kind of "caching" of some computationally expensive steps. In my case, building an initial tibble with all the raw data plus several columns derived from it (I call it "extended data") took a long time, so I introduced an if() statement that saves the extended data as an .rds file on the first run and reads from that file on all subsequent runs (unless I explicitly want to re-build the tibble). This turned out to be quite effective: the step went from several minutes down to a couple of seconds, at the cost of ~500 Mb of hard drive space. Note: the extended data is a simple table with character, double, and logical column types.
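To be concrete, the pattern looks roughly like this (a minimal sketch; `build_extended_data()`, `raw_data`, the file name, and the `rebuild` flag are placeholders for my actual code):

```r
extended_data_file <- "extended_data.rds"  # placeholder path
rebuild <- FALSE                           # set to TRUE to force a re-build

if (!file.exists(extended_data_file) || rebuild) {
  # Expensive step: build the raw data plus derived columns (placeholder function)
  extended_data <- build_extended_data(raw_data)
  saveRDS(extended_data, extended_data_file)
} else {
  # Cheap step on subsequent knits: just read the cached result
  extended_data <- readRDS(extended_data_file)
}
```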
Next, I tried to repeat this trick with a tibble containing a list-column of glm models, created like this:
```r
lm_system <- data %>%
  group_by(`variables`) %>%
  nest() %>%
  mutate(log_reg = map(data, ~ glm(.x$Var1 ~ .x$Var2, family = "binomial")),
         log_sum = map(log_reg, ~ summary(.x))) %>%
  select(-data) %>%
  ungroup() %>%
  mutate(row_num = 1:n(),
         intercept        = map_dbl(log_sum, ~ .x$coefficients["(Intercept)", "Estimate"]),
         intercept_signif = map_dbl(log_sum, ~ .x$coefficients["(Intercept)", "Pr(>|z|)"]),
         slope            = map_dbl(log_sum, ~ .x$coefficients[".x$Var2", "Estimate"]),
         slope_signif     = map_dbl(log_sum, ~ .x$coefficients[".x$Var2", "Pr(>|z|)"]))
```
From here, things got strange (for me). I checked the object size with `object.size(lm_system)`: it was ~2.5 Gb. But when I wrote it out as .rds, it occupied over 36 Gb on the hard drive. I then tried to drop the models themselves and keep only the summaries (adding `select(-data, -log_reg)` to the pipeline above). That reduced the value returned by `object.size()` almost 10-fold, to ~300 Mb, but barely affected the size of the .rds file (~35 Gb).
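For reference, this is roughly how I compared the in-memory and on-disk sizes (the file path is just an example):

```r
format(object.size(lm_system), units = "Mb")   # in-memory size estimate
saveRDS(lm_system, "lm_system.rds")            # write the cached copy
file.size("lm_system.rds") / 1024^3            # on-disk size in Gb
```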
Can someone explain this behavior? Why does the extended data (`object.size()` ~300 Mb) occupy 500 Mb on the hard drive, while a tibble with approximately the same `object.size()` occupies tens of Gb? And why did removing the column with the models significantly reduce `object.size()` but not the size of the .rds file?
Thanks in advance!