How to set up RStudio for multi-threaded reading/loading and writing/saving

Hi folks,

To set some background: I'm a sysadmin, not familiar with R but still responsible for deploying RStudio. I've gathered that large workspaces (ergo: large .RData files) cause slowdowns when opening RStudio, and perhaps when closing it as well. I never verified the closing part, as I ran out of time with the affected researcher, who is fluent in R. But I've been told saving the workspace is even slower.

I can't force our users to keep their workspaces small and tidy. Despite all the gentle reminders and prodding, we still end up with 50+ GB .RData files which expand to 100+ GB workspaces. One such project/workspace took 15+ minutes to open.

We tried moving those files to NVMe SSDs (Intel P6400s). It made zero difference. I took a look at Task Manager (Windows) and saw that only 1 out of 64 threads was being utilized, with memory usage ticking upwards while RStudio was opening. That's how I arrived at the conclusion that the loading process is single-threaded.

So my question: any way to speed this up? Ideally parallelized loading/saving. Typical high compute servers don't exactly have the best single core performance compared to HEDT/consumer chips, but they have a bunch of cores instead.

And no, I can't test things independently because we have security policies in place where server admins don't have access to data.

For reference: RStudio 2022.7.2, Windows Server 2019, Dual Xeon Gold 3160, 768GB RAM

Probably futile, but consider turning off reading .Rdata files at startup perhaps?

Tools > Global Options... > General > Basic > Workspace: uncheck "Restore .RData into workspace at startup"
Project Options > General > Workspace: set "Restore .RData into workspace at startup" to No

Ask your R users to check out the {qs} package, which does "quick serialisation" of R objects, is multithreaded, and can replace the use of .RData files.
See: https://cran.r-project.org/web/packages/qs/vignettes/vignette.html
If we had more info on what type of objects are being stored in the huge .RData files, we might be able to suggest a better approach, e.g. using a database which is only read as required, and perhaps a piece at a time.
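As a rough illustration of what that could look like for a single large object, here is a minimal sketch. It assumes the {qs} package is installed; the object name, file path, and thread count are made up for the example and would need adjusting to the server.

```r
# Sketch: store one large object in its own {qs} file instead of in .RData.
# `big_table`, `qs_path`, and `n_threads` are hypothetical, for illustration only.
big_table <- data.frame(id = seq_len(1e5), value = rnorm(1e5))
qs_path <- file.path(tempdir(), "big_table.qs")
n_threads <- 4L  # assumption: pick a value that suits the machine

if (requireNamespace("qs", quietly = TRUE)) {
  # Both calls accept `nthreads`, which is where the parallelism comes from.
  qs::qsave(big_table, qs_path, nthreads = n_threads)
  restored <- qs::qread(qs_path, nthreads = n_threads)
} else {
  message("The {qs} package is not installed; skipping the example.")
}
```

This works per object, so it does not change what RStudio does when it saves or restores the whole workspace.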

consider turning off reading .Rdata files at startup

Yup, one of the first changes made. It helps when starting anew, but folks are used to double-clicking the .Rproj file, which opens RStudio right into their 100GB workspace. (edit: Just heard back from a user: it seems that with this setting, even clicking their .Rproj file opens a blank RStudio.)

check out the {qs} package

I'll float this to my users, but it seems to need implementing on a per-object basis. From what little I've seen, my users load everything into their workspace and rely on the IDE (RStudio) keeping values in its environment, instead of loading specific files within each R script.

For example: if they use {qs} to read in a dataframe, sure, it'll be read quickly into memory, but then they click "close RStudio -> save workspace" and RStudio saves the entire workspace :sweat:. Then they open it again and we're back to square one: a massive .RData file.

{qs} is definitely useful within fully self-contained R scripts. But our researchers rely on the convenience of the RStudio IDE being able to provide values to scripts via the workspace.

There is no database. These are all files being loaded, most likely CSV or similar.

Have there been any improvements in the two intervening years of RStudio updates?

Your researchers could likely save an enormous number of human-hours by adopting best practices that encourage reproducibility: The loading and saving of .RData files on startup and close is a legacy from when there was no IDE for R, but RStudio makes that practice obsolescent if not obsolete. (An image that comes to mind is having to carry your house around every time you travel instead of finding local lodging.)

Reusing objects over time is a highly error-prone process: Users can easily forget changes they may have made the previous session, let alone in the previous week, which can easily lead to having to restart from scratch.

The default behavior of RStudio is to "Restore most recently opened project at startup" and "Restore previously open source documents at startup", so it is likely that either those global options were changed, or that an already open project was opened again, which leads to an editor that has no open files.

Every object in a user's environment was built by executing a sequence of statements, so an alternative to carrying an object around indefinitely is to write a script that recreates that sequence — and if the object is large, that also saves that object — and any further changes in the object should be reflected in changes to the script.

In other words, when a project is opened, it could already have 1) pre-saved files corresponding to any individual large objects that are needed, as well as 2) a startup script that either recreates small objects or loads large objects needed by the user. Importantly, every object should be reproducible by running the correct script or sequence of scripts, which both protects the user from the consequences of their inevitable errors and serves as documentation that can be referred to once enough time has passed to make memory unreliable.
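A minimal sketch of such a startup script, using only base R, might look like the following. All names here (`cache_dir`, `load_or_build`, `thresholds`, `measurements`) are hypothetical, and the data frame stands in for whatever import step the project really uses.

```r
# Sketch of a project startup script. All names and paths are hypothetical.
# In a real project, cache_dir would be a folder inside the project, not tempdir().
cache_dir <- file.path(tempdir(), "cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Load an object from its own .rds file if it exists;
# otherwise build it, save it for next time, and return it.
load_or_build <- function(name, build) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build()
  saveRDS(obj, path)
  obj
}

# Small objects: cheap to recreate, so recreate them every session.
thresholds <- c(low = 0.1, high = 0.9)

# Large objects: built once by the script, then reloaded from disk on later runs.
measurements <- load_or_build("measurements", function() {
  # Stand-in for the real import and cleaning steps.
  data.frame(id = seq_len(1000), value = seq_len(1000) / 1000)
})
```

Running the script at the start of a session then takes the place of restoring .RData, and only the objects a user actually needs get loaded.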

I'm not sure what you mean by "improvements", but I expect that adopting practices that encourage and take advantage of reproducibility (including preventing automatic loading or saving of .RData files) would likely lead to big improvements for both you and your researchers.

The loading and saving of .RData files on startup and close is a legacy from when there was no IDE for R, but RStudio makes that practice obsolescent if not obsolete.

I have no idea what you're trying to convey. If .RData files are legacy, then RStudio is still operating with that model. So what's the replacement functionality?

Toss a Python analogy at me: Moving an entire Jupyter notebook?

Is the point "move away from using the workspace, and directly source data within code (à la qs, per above)"?

I'm not sure what you mean by "improvements"

From the POV of someone dropped into the RStudio IDE who probably learned it organically, without formal instruction, the workspace functionality is very convenient. As long as the data is in my workspace, I can reference it and massage it all I want, and I don't need to go back to the source file (which, for example, someone else may have updated or changed). So if this is a "noob trap", shouldn't steps be taken to reduce its impact, i.e. reducing the time it takes to save/load workspaces?

Keep in mind that R and RStudio are two different entities. The R programming language is the one where .RData files come from. RStudio (now Posit PBC) created the RStudio IDE, which aims to make using R as pleasant as possible. But in the end, it's still just a front-end to the R programming language, which controls how .RData files are loaded.

Perhaps. One of the solutions was already suggested by @mduvekot, which is to default to turning off restoring data into the workspace at startup. The other solution would be for the R Core Team (the group that maintains the R programming language) to add parallelization to base R.

In either case, Posit and the RStudio IDE cannot change how the underlying R language works; we can only provide a convenient setting to discourage workspace usage.

Best,
Randy

Ah, so here's the lightbulb moment. RStudio treats the project workspace as a special R runtime, but as you wrote, it's still ultimately beholden to the R mechanism of writing all the dataframes to .RData. Ergo: there's no workaround.

Turning off data restore seems to have gone a step further and stopped restoring even when it's intended, i.e. when opening the .Rproj file. Not a big issue; users can still use File -> Open Project.

we can only make a convenient setting to discourage the workspace usage.

If you're open to it, I'd suggest some form of tooltip popup when the workspace exceeds X GB, or some other notification when users Save/Exit after crossing the same X GB threshold.
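In the meantime, a user-side approximation of that idea could be sketched in plain R, as below. This is not an RStudio feature: the function names and the 10 GB default are made up, and `object.size()` only gives an estimate of memory use.

```r
# Sketch: warn when the objects in an environment exceed a size threshold.
# Function names and the default limit are hypothetical.
workspace_size_gb <- function(env = globalenv()) {
  sizes <- vapply(
    ls(env, all.names = TRUE),
    function(nm) as.numeric(object.size(get(nm, envir = env))),
    numeric(1)
  )
  sum(sizes) / 1024^3  # bytes to GiB
}

warn_if_large <- function(limit_gb = 10, env = globalenv()) {
  size_gb <- workspace_size_gb(env)
  if (size_gb > limit_gb) {
    warning(
      sprintf(
        "Workspace is about %.1f GB (limit: %.1f GB); consider saving large objects to their own files.",
        size_gb, limit_gb
      ),
      call. = FALSE
    )
  }
  invisible(size_gb)
}
```

Users could call `warn_if_large()` by hand before closing a session. Wiring it into startup or shutdown, for example through an `.Rprofile`, would need testing against how RStudio handles those hooks.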

And to wrap up this thread: @dromano's response would be the accepted answer. There are no workarounds for bad practice.

Thanks for the speedy responses all!

What you're describing is how I started using R and RStudio (and accumulated the painful experiences that led to my recommendation above!).

And just to be clear, if the commands typed out to reference and massage the data are simply typed out in a script instead of at the command prompt, that script effectively becomes the "source" you refer to, and would be saved (by default) in the same location as an .RData file. So the same permissions would apply and the user would be just as vulnerable to someone else updating or changing the script file as they would to someone else updating or changing the .RData file.

Finally, if your researchers have any questions or would like advice about how to adapt their workflows (or about anything else related to R, RStudio, or Posit more generally), I'm sure folks here would be happy to help them.

Not my expertise, but I believe R processes use copy-on-modify. So if the researchers are creating a workspace with lots of data frames that refer to each other, it is less intensive in computing power and memory than opening an .RData file, which no longer has the copy-on-modify references. I.e., opening an .RData file means each frame is getting its own space in memory (or at least trying to).

While it can be convenient to save multiple data frames in one file, I don't consider it a best practice. It may have its use cases, but I prefer running my final R scripts from top to bottom to ensure they are reproducible, and then saving the outputs I need individually. Using the {data.table} package's fread and fwrite makes these relatively quick operations.
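For what it's worth, a minimal sketch of that pattern is below. It assumes the {data.table} package is installed, and the table and file names are made up for the example.

```r
# Sketch: save one output per file with {data.table}. Names are hypothetical.
if (requireNamespace("data.table", quietly = TRUE)) {
  results <- data.table::data.table(id = 1:1000, score = (1:1000) / 10)
  out_path <- file.path(tempdir(), "results.csv")

  # fwrite() and fread() use multiple threads by default where the
  # installed build of data.table supports it.
  data.table::fwrite(results, out_path)
  reloaded <- data.table::fread(out_path)

  # Number of threads data.table is currently set to use.
  print(data.table::getDTthreads())
} else {
  message("The {data.table} package is not installed; skipping the example.")
}
```

Both functions also accept an `nThread` argument if the thread count needs to be set per call.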

Thank you to everyone that has chimed in.

A resounding, unified response of: stop using the workspace, these shortcomings are the cost of the convenience.
