Hi there!
I'm trying to optimize the code in my reproducible research (an .Rmd notebook) to reduce the time it takes to re-knit.
One thing I've done is a kind of "caching" of some computationally expensive steps. In my case, building an initial tibble with all the raw data plus several columns derived from it (I call it "extended data") took a long time, so I introduced an if() statement that saves the extended data as an .rds file on the first run and reads from that file on all subsequent runs (unless I explicitly want to re-build the tibble). This turned out to be quite effective: the step went from several minutes down to a couple of seconds, at the cost of ~500 Mb of hard drive space. Note: the extended data is a simple table with character, double, and logical column types.
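To be concrete, the pattern looks roughly like this (a minimal sketch; `build_extended_data()`, `raw_data`, the file name, and the `rebuild` flag are placeholders for my actual code):

```r
extended_data_file <- "extended_data.rds"  # placeholder path
rebuild <- FALSE                           # set to TRUE to force a re-build

if (!file.exists(extended_data_file) || rebuild) {
  # Expensive step: build the raw data plus derived columns (placeholder function)
  extended_data <- build_extended_data(raw_data)
  saveRDS(extended_data, extended_data_file)
} else {
  # Cheap step on subsequent knits: just read the cached result
  extended_data <- readRDS(extended_data_file)
}
```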
Next, I tried to repeat this trick with a tibble containing a list-column of glm models, created like this:
```r
lm_system <- data %>%
  group_by(`variables`) %>%
  nest() %>%
  mutate(log_reg = map(data, ~ glm(.x$Var1 ~ .x$Var2, family = "binomial")),
         log_sum = map(log_reg, ~ summary(.x))) %>%
  select(-data) %>%
  ungroup() %>%
  mutate(row_num = 1:n(),
         intercept        = map_dbl(log_sum, ~ .x$coefficients["(Intercept)", "Estimate"]),
         intercept_signif = map_dbl(log_sum, ~ .x$coefficients["(Intercept)", "Pr(>|z|)"]),
         slope            = map_dbl(log_sum, ~ .x$coefficients[".x$Var2", "Estimate"]),
         slope_signif     = map_dbl(log_sum, ~ .x$coefficients[".x$Var2", "Pr(>|z|)"]))
```
From here, things got strange (for me). I checked the object size with `object.size(lm_system)`: it was ~2.5 Gb. But when I wrote it out as .rds, it occupied over 36 Gb on the hard drive. I then tried to drop the models themselves and keep only the summaries (adding `select(-data, -log_reg)` to the pipeline above). That reduced the value returned by `object.size()` almost 10-fold, to ~300 Mb, but barely affected the size of the .rds file (~35 Gb).
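For reference, this is roughly how I compared the in-memory and on-disk sizes (the file path is just an example):

```r
format(object.size(lm_system), units = "Mb")   # in-memory size estimate
saveRDS(lm_system, "lm_system.rds")            # write the cached copy
file.size("lm_system.rds") / 1024^3            # on-disk size in Gb
```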
Can someone explain this behavior? Why does the extended data (`object.size()` ~300 Mb) occupy 500 Mb on the hard drive, while a tibble with approximately the same `object.size()` occupies tens of Gb? And why did removing the column with the models significantly reduce `object.size()` but not the size of the .rds file?
Thanks in advance!