Is it possible to launch long-running non-interactive jobs on RStudio Cloud?

Andrea · December 30, 2018, 10:52pm

I recently found out about the tuning_run function of package tfruns:

https://tensorflow.rstudio.com/tools/tfruns/articles/tuning.html

and I would like to launch an hyperparameter tuning job on RStudio Cloud.

Is it possible to launch a job on RStudio Cloud, which keeps running even after I closed my browser?
What's the maximum time length of such a job?
Do I need to follow some special procedure, or can I simply source a script containing the call to tuning_run, shutdown my laptop and log in the next day, to have a look at the results?

seans · December 31, 2018, 5:59pm

Longer running jobs are not formally supported right now. Currently though a project will go to sleep when the user stops interacting with it at 15 minutes (if the rsession is idle) or 24 hours (of the rsession is busy).

In theory that means that if you run code in a cloud project and come back within 24 hours the code should still be running. Also you should note that there's 1GB memory limit right now in cloud so running this on a larger data set might present some additional challenges.

Sean

Andrea · January 1, 2019, 1:55pm

Thanks! 24 h should be plenty. For what it concerns the 1 GB memory limit, I don't think that'll be a big issue - the dataset, together with a few auxiliary variables, is exceedingly small (barely above 500 kB).

Andrea · January 3, 2019, 1:47pm

Hmmm. Are you sure these limits are being honored? For the second time in a row, I tried to let a long keras job run (rsession busy, but user stopped interacting with the project). In both cases, the job aborted after about 1 hour or less. You can have a look at my project:

https://rstudio.cloud/project/160813

the runs directory contains one folder for each model I wanted to fit. 972 models should have been fit: 52 were actually fit. As each folder is timestamp-named (good job, keras for R guys!), you can tell that execution stopped after less than 1 hour. Any idea why it happened? I'll try to reproduce the issue in another project, so that this one doesn't get modified.

seans · January 3, 2019, 3:11pm

The project is coming up in the logs as exceeding the memory footprint which would terminate the rsession without providing much feedback. The 1GB limit is at the container level - so the IDE and other programs running in the container could impact the amount of memory available to your code. There is an R package that people have had some luck with to track the memory footprint of their executing code,

Sean

Andrea · January 3, 2019, 4:00pm

Thanks Sean! This is weird - the dataset is small (500 kB) and each model by itself is exceedingly small: the biggest ones have about 5000 parameters...even the smallest MobileNet architectures are hundreds of times bigger than this. But apparently, fitting them sequentially increases the memory occupation considerably. I don't understand how I could use ulimit here - its goal seems to set a limit on the memory occupation. However, I think it would be more useful for me to track how memory occupation grows in time, as I fit more models, so that I could choose to fit no more than, say, 300 models.

I wonder if I should ask a question in the Machine Learning and Modeling category - I'm curious to understand if/why Tensorflow is such a memory hog. The computational graph is not being grown progressively, with each fitted model - this was my first suspect, but the console output for each model clearly shows that the computational graph is being re-initialized for each new model, as it should:

> summary(model)
____________________________________________________________________________________________________
Layer (type)                                 Output Shape                            Param #        
====================================================================================================
dense_1 (Dense)                              (None, 128)                             1536           
____________________________________________________________________________________________________
dropout_1 (Dropout)                          (None, 128)                             0              
____________________________________________________________________________________________________
dense_2 (Dense)                              (None, 4)                               516            
====================================================================================================
Total params: 2,052
Trainable params: 2,052
Non-trainable params: 0
____________________________________________________________________________________________________

Andrea · January 3, 2019, 9:23pm

Hi, Sean,

I tried to set the memory limit you were talking about, in this copy of the original project:

https://rstudio.cloud/project/161413

it's the first line of the script

ulimit::memory_limit(900)

according to the documentation of ulimit, this should set a RAM limit of 900 MB. However, now the script crashes as soon as I source it. Any idea on what's happening?

seans · January 4, 2019, 3:47pm

Offhand other people have also run into issues running complex statistical analysis in the cloud environment because of the 1GB memory limit. We are currently working on allowing users to increase the memory footprint of their projects which should open the platform for more complex use cases.

Optimizing/debugging R code is pretty far outside of my wheel house. The machine learning and modeling category is probably a better fit for those types of questions.

Sean

Andrea · January 4, 2019, 8:34pm

Ok, thanks a lot! I'll simplify my project a bit, and then post there, with a link to this very interesting discussion.

Andrea · January 11, 2019, 8:34pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.