Exploding file sizes for plots created and saved within drake plan

Hlynur · August 12, 2020, 1:28pm

Edit: The issue seems to be with saving .rds files within a drake pipeline, and unrelated to ggplotly. Apologies.

Original post (where I mistook this for a plotly issue):

Hi there,

I have a project where I've been using plotly::ggplotly() to turn ggplots into interactive graphics for online use in Shiny apps. However, it soon became evident that this produced some outrageously large files and objects, which made the Shiny app both unnecessarily slow and unnecessarily large.

As we have dozens and dozens of graphs that we display in the Shiny app, a plot object is only read from an .rds file when needed, on the fly. This makes the ggplotly objects pretty much unusable: a simple barchart that is 20KB as an .rds file when created using plotly can be 250MB as an .rds file when created using ggplot2 and then ggplotly. Below is an example of this behaviour with a heatmap. In short, the average plotly-created file/object is roughly 1/1000 the size of the corresponding ggplotly-created file/object.

Edit: I now realise this also happens for regular `plotly` objects. The two examples below differ in the sense that the smaller is not created within a `drake` pipeline, while the larger is created within a `drake` pipeline.

library(readr)
library(lobstr)
library(fs)

## plotly .rds file size
fs::file_size("app_cache/atvl_aldur_fjoldi_plot.rds")
# 61.1K

## ggplotly .rds file size
fs::file_size("app_cache/atvl_aldur_tala_plot.rds")
# 212M

my_plotly_heatmap <- read_rds("app_cache/atvl_aldur_fjoldi_plot.rds")
my_plotly_heatmap

my_ggplotly_heatmap <- read_rds("app_cache/atvl_aldur_tala_plot.rds")
my_ggplotly_heatmap

lobstr::obj_size(my_plotly_heatmap)
# 125,840 B
lobstr::obj_size(my_ggplotly_heatmap)
# 145,767,184 B

I'd love to be able to use the ggplotly approach as a lot of these graphics have already been prepared as ggplots for a report. So my question is whether there are some obvious things that I can access in the ggplotly object to cut down the size? (Font information comes to mind, are there raster graphics being stored but not printed? What explains these massive objects?). Are there any guides on this or how-to blogs? I can't seem to find any.

Many thanks in advance,
Hlynur

Hlynur · August 12, 2020, 5:18pm

Additional info:

I'm starting to get the sense this might have to do with ggplot2 plot environments, and the fact that these files are created within a massive drake pipeline. But I honestly don't know; I'm only just learning about plot environments while googling this. I tried recreating the ggplotly heatmap in a fresh session and that results in a 2.6MB .rds file. So that's ~1/50th of the files created within the drake pipeline.

So this is something I likely have to fix before turning the ggplot plot into a plotly object using plotly::ggplotly(), right?
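A minimal sketch of how one might check the plot-environment suspicion, assuming a ggplot2 3.x release, where a ggplot object keeps the environment it was created in as `plot_env`. The helper `make_plot()` and the `big` vector are hypothetical stand-ins for a pipeline step that happens to have large objects in scope; this is not what drake does internally.

```r
library(ggplot2)

# Hypothetical helper: builds a tiny plot inside a function that also
# holds a large local vector the plot never uses.
make_plot <- function() {
  big <- rnorm(1e6)  # roughly 8 MB of doubles, unrelated to the plot
  df <- data.frame(name = c("a", "b"), value = c(1, 2))
  ggplot(df, aes(x = name, y = value)) + geom_col()
}

p <- make_plot()

# The plot remembers the frame it was created in, including `big`.
big_is_attached <- exists("big", envir = p$plot_env, inherits = FALSE)

# Serialized size in bytes, i.e. roughly what an uncompressed .rds would hold.
plot_bytes <- length(serialize(p, NULL))
```

If `big_is_attached` is `TRUE`, the serialized plot carries the unrelated vector along with it, which would be consistent with the size difference between a clean session and a large pipeline.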

nirgrahamuk · August 13, 2020, 10:11am

Have you asked or raised an issue here? https://github.com/ropensci/plotly/issues

Hlynur · August 13, 2020, 11:21am

Thanks for the reply. I should probably raise an issue. But I don't know if this is something that should be looked at from the drake point of view, or from the plotly point of view.

@cpsievert and @wlandau (I hope it's ok if I tag you here), could you possibly weigh in on whether this should be considered an issue, and if so, whether it is possibly a drake issue or a plotly issue.

Additional info:

I should add that this is not just an issue with ggplotly but also other plotly objects. A simple plotly barchart (screenshot not included here) was 20KB when I plotted and saved it in a clean session, but is now over 180MB when created inside the drake pipeline.

Edit: This seems to be a drake issue. Exploding file sizes happen for `ggplot2` plots too. Will edit original post.

wlandau · August 13, 2020, 2:00pm

In drake, this came up at https://github.com/ropensci/drake/issues/882 and https://github.com/ropensci/drake/issues/1258. Unfortunately, there is nothing drake can do at this point to reliably reduce the size of objects from ggplot2 and plotly that hang onto large objects from calling environments. If it ever becomes possible to decouple large datasets without breaking the plot object (maybe with a specialized serialization method) I will consider specialized storage formats for plots, e.g. target(format = "ggplot2") and target(format = "plotly").
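To illustrate what "hanging onto large objects from calling environments" can look like, here is a small base R sketch using a formula, the same kind of object that `plot_ly(x = ~name)` receives. The function `make_formula()` and the `big` vector are hypothetical; swapping the environment as done here is only a demonstration on a bare formula, not a claim that it is safe for plot objects.

```r
# Hypothetical helper: returns a one-sided formula created next to a
# large local vector that the formula never mentions.
make_formula <- function() {
  big <- rnorm(1e6)  # roughly 8 MB of doubles
  ~ name
}

fm_heavy <- make_formula()

# Same formula, but with its environment swapped for the global
# environment, which R serializes as a reference rather than by value.
fm_light <- fm_heavy
environment(fm_light) <- globalenv()

size_heavy <- length(serialize(fm_heavy, NULL))  # bytes, includes `big`
size_light <- length(serialize(fm_light, NULL))  # bytes, formula only
```

The two formulas print identically, yet `size_heavy` includes the whole function frame because serialization writes out the attached environment.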

Hlynur · August 13, 2020, 2:19pm

Thanks so much for replying Will (and thanks once again for an incredible package). Is the solution for me then, at least for the time being, just saving the .rds files in question outside of the drake_plan as the calling environment is close to empty with a clean session?

Comparisons I was doing (reprex below).

#> # A tibble: 6 x 2
#>   file                              size
#>   <fs::path>                 <fs::bytes>
#> 1 temp/my_ggplot.rds              285.6K
#> 2 temp/my_ggplot_drake.rds        152.9M
#> 3 temp/my_ggplotly.rds            824.7K
#> 4 temp/my_ggplotly_drake.rds      153.4M
#> 5 temp/my_plotly.rds                9.1K
#> 6 temp/my_plotly_drake.rds        153.4M

Reprex:

# Packages
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(drake))
suppressPackageStartupMessages(library(fs))
suppressPackageStartupMessages(library(here))

# Create data
my_tbl = tibble(a = rnorm(1e7, mean = 1),
                b = rnorm(1e7, mean = 2))

my_summary = my_tbl %>%
  summarize(across(a:b, mean)) %>% 
  pivot_longer(everything())

# Plot data
my_ggplot = my_summary %>% 
  ggplot(aes(x = name, y = value)) +
  geom_col()

my_ggplotly = my_ggplot %>% 
  ggplotly()

my_plotly = my_summary %>%
  plot_ly(x = ~name, y = ~value) %>% 
  add_bars()

# Create temp folder
fs::dir_create("temp")

# Saving objects outside of drake plan (to temp folder)
my_ggplot %>% 
  write_rds(here::here("temp", "my_ggplot.rds"))

my_ggplotly %>% 
  write_rds(here::here("temp", "my_ggplotly.rds"))

my_plotly %>% 
  write_rds(here::here("temp", "my_plotly.rds"))

###################
# drake plan starts
###################
## Same as above ##
my_drake_plan <- drake_plan(
my_tbl = tibble(a = rnorm(1e7, mean = 1),
                b = rnorm(1e7, mean = 2)),

my_summary = my_tbl %>%
  summarize(across(a:b, mean)) %>% 
  pivot_longer(everything()),

my_ggplot = my_summary %>% 
  ggplot(aes(x = name, y = value)) +
  geom_col(),

my_ggplotly = my_ggplot %>% 
  ggplotly(),

my_plotly = my_summary %>%
  plot_ly(x = ~name, y = ~value) %>% 
  add_bars(),

my_ggplot_save = my_ggplot %>% 
  write_rds(file_out(!!here("temp", "my_ggplot_drake.rds"))),

my_ggplotly_save = my_ggplotly %>% 
  write_rds(file_out(!!here("temp", "my_ggplotly_drake.rds"))),

my_plotly_save = my_plotly %>% 
  write_rds(file_out(!!here("temp", "my_plotly_drake.rds")))
)

make(my_drake_plan)
#> ▶ target my_tbl
#> ▶ target my_summary
#> ▶ target my_ggplot
#> ▶ target my_plotly
#> ▶ target my_ggplot_save
#> ▶ target my_ggplotly
#> ▶ target my_plotly_save
#> ▶ target my_ggplotly_save

# Checking the file sizes of plots created and saved as .rds outside of drake plan
# and comparing to the same exact plots created and saved within a drake plan
tibble(file = fs::dir_ls("temp")) %>% 
  mutate(size = fs::file_size(file))
#> # A tibble: 6 x 2
#>   file                              size
#>   <fs::path>                 <fs::bytes>
#> 1 temp/my_ggplot.rds              285.6K
#> 2 temp/my_ggplot_drake.rds        152.9M
#> 3 temp/my_ggplotly.rds            824.7K
#> 4 temp/my_ggplotly_drake.rds      153.4M
#> 5 temp/my_plotly.rds                9.1K
#> 6 temp/my_plotly_drake.rds        153.4M

Created on 2020-08-13 by the reprex package (v0.3.0)

cpsievert · August 13, 2020, 2:36pm

@wlandau FWIW I'm pretty sure this removes all environments kept on the plotly object and seems like a fairly safe thing to do pre-serialization

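# Build the plot, then drop the pieces that (as far as I can tell)
# keep environments attached to the plotly object
plotly_build2 <- function(...) {
  p <- plotly_build(...)
  # Assigning NULL removes these elements from the p$x list
  p$x[c("attrs", "visdat", "cur_data")] <- NULL
  p
}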
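A rough way to see the effect of this helper, sketched with plotly and base R serialization. `make_plot()` and `big` are hypothetical stand-ins for a pipeline step with large objects in scope, and the sizes will vary by plotly version, so no specific numbers are claimed here.

```r
library(plotly)

# Same helper as in the post above.
plotly_build2 <- function(...) {
  p <- plotly_build(...)
  p$x[c("attrs", "visdat", "cur_data")] <- NULL
  p
}

# Hypothetical helper: a tiny bar chart created next to a large,
# unrelated local vector.
make_plot <- function() {
  big <- rnorm(1e6)  # roughly 8 MB of doubles
  df <- data.frame(name = c("a", "b"), value = c(1, 2))
  add_bars(plot_ly(df, x = ~name, y = ~value))
}

p_full <- plotly_build(make_plot())   # built, nothing stripped
p_slim <- plotly_build2(make_plot())  # built, then stripped

# Serialized sizes in bytes, before any .rds compression.
size_full <- length(serialize(p_full, NULL))
size_slim <- length(serialize(p_slim, NULL))
```

Comparing `size_full` with `size_slim` gives a quick check of how much the stripped components contributed in a given setup.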

Hlynur · August 13, 2020, 3:18pm

THANK YOU!!

This just turned the plotly .rds outputs of our project, which used to be 3.5GB, to a slender 46MB, without any apparent downsides!

Thank you!

wlandau · August 14, 2020, 2:46am

Thanks, Carson! I will recommend this to drake users who run into this problem. From there, they can choose a serialization format drake already supports, e.g. plotly_object = target(get_plot(), format = "qs") inside drake_plan().
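A sketch of how these two pieces might be combined in a plan, assuming the qs package is installed and a drake version that supports `format = "qs"`. The target names and the toy `my_summary` data frame are made up for illustration.

```r
library(drake)
library(plotly)

# Helper from Carson's post above.
plotly_build2 <- function(...) {
  p <- plotly_build(...)
  p$x[c("attrs", "visdat", "cur_data")] <- NULL
  p
}

plan <- drake_plan(
  # Toy stand-in for a real summary table.
  my_summary = data.frame(name = c("a", "b"), value = c(1, 2)),
  # Strip the plotly object before drake stores it, and store it with qs.
  my_plotly = target(
    my_summary %>%
      plot_ly(x = ~name, y = ~value) %>%
      add_bars() %>%
      plotly_build2(),
    format = "qs"
  )
)
```

Running `make(plan)` and then `readd(my_plotly)` should give back the stripped plot; the plan itself is only a data frame of commands until `make()` is called.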

wlandau · August 14, 2020, 2:56am

@Hlynur, another workaround could have been htmlwidgets::saveWidget(my_plotly, drake::file_out("my_plotly.html"), selfcontained = TRUE). For ggplot2 objects that get too big, an equivalent workaround with ggsave() and PNG files will still allow you to have plots as targets in the plan. In other cases like lm() objects, you can swap out stats::lm() for biglm::biglm().
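A sketch of what those file-based workarounds could look like in a plan. The target names, file names and toy data are assumptions for illustration; each plot is built and written inside one command so that only the output file, not the plot object, is tracked. `selfcontained = TRUE` needs pandoc to be available.

```r
library(drake)
library(ggplot2)
library(plotly)

plan <- drake_plan(
  # Toy stand-in for a real summary table.
  my_summary = data.frame(name = c("a", "b"), value = c(1, 2)),
  # Static plot written straight to a PNG file tracked by drake.
  my_ggplot_png = {
    p <- ggplot(my_summary, aes(x = name, y = value)) + geom_col()
    ggsave(file_out("my_ggplot.png"), plot = p, width = 6, height = 4)
  },
  # Interactive plot written to a standalone HTML file tracked by drake.
  my_plotly_html = {
    w <- plot_ly(my_summary, x = ~name, y = ~value) %>% add_bars()
    htmlwidgets::saveWidget(w, file_out("my_plotly.html"), selfcontained = TRUE)
  }
)
```

Calling `make(plan)` would then produce the two files in the working directory, and a Shiny app could embed them instead of reading plot objects from .rds files.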

I think https://github.com/tidyverse/ggplot2/issues/3619#issuecomment-552340703 is related. Apparently ggplot objects are trickier to downsize than plotly objects.

system · August 21, 2020, 2:56am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
