I just noticed that the same code runs much faster in R launched from the server's terminal than in RStudio Server, and the difference is quite significant. For example, I wrote a heavy R script that handles a large amount of data (>10 GB). When I ran it from RStudio Server, it had not completed after 2 hours of waiting. So I interrupted it and ran the exact same script directly in R launched from the server's command line, and it completed in less than 10 minutes. I just wonder whether this happens to others too and whether I'm doing something wrong. I did not attach my script because I don't think the issue is specific to any particular piece of code, and the script is pretty long.
OS: Ubuntu 18.04
RStudio Server version: 1.2.5033
R version: 3.6.3
You can check whether this is a factor by turning off auto-refresh in the Environment pane. To do this, pull down the menu at the top right-hand side of the pane and choose Manual Refresh Only.
By "heavy" I mean the script is (1) reading a large amount of data into memory, (2) doing a large amount of computation (i.e., statistical tests), and (3) writing a large amount of results to disk. In the analysis part it also generates a large number (thousands) of warning messages, which I checked and confirmed to be okay.
I'm not reading all the data into memory at once. Instead, I'm reading, analyzing, and writing output for each data file one by one. The size of one data file usually ranges from ~200 MB to ~1.5 GB. I use functions from the tidyverse package for most of the reading/transforming/writing of the data files.
In RStudio Server, I write the code in the source pane and use the "Run All" command. In R, I just copy-paste the whole script from a text editor into the console.
I think the first link you shared is irrelevant to my issue, because this is not about RStudio's startup time. The second link seems to describe an issue similar to mine.
Apparently, lines of code always (?) require more execution time in RStudio than in RGui. Even though you may not usually notice the time difference when executing a couple of lines in the console, it seems to make a huge difference when, say, 3,000 lines of code are being executed in the RStudio console at once.
My understanding is that upon using source("Helperfunctions.R") in the RStudio source pane, the code is not actually echoed line by line through the RStudio console (which would be slow), but is executed directly by R.
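If I understand it correctly, something like this avoids the console echo entirely (the file name is just a placeholder for my own script):

```r
# source() runs the file directly; the individual lines are not
# echoed through the RStudio console the way "Run All" sends them.
source("my_analysis.R", echo = FALSE)
```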
My guess is that the main cause of the difference in execution time between RStudio Server and R in my case is the large number of warning messages my script generates. While R does not print all the warning messages until I explicitly ask it to, RStudio Server prints all of them by default. What do you think?
Nothing really sticks out based on what you are doing.
I think many of us are doing similar things, even with bigger file sizes.
Some more suggestions off the top of my head, from easiest to hardest:
You can try suppressing the warnings and see if it makes any difference: options(warn = -1) (documentation here); see the sketch after this list.
I would try the suggestion from the second link and source the code into the console instead of using "Run All." I don't know what "Run All" actually does behind the scenes ... I never use it.
See if there are differences in the read/write performance between the two environments. Try a couple of the bigger files and see whether they somehow take more time in RStudio.
Track the memory usage and see the difference in memory utilization. Maybe RStudio is capping memory more aggressively.
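To make the first suggestion concrete, here is a minimal sketch of silencing warnings for a run and restoring the previous setting afterwards (the script name is just a placeholder):

```r
# Silence warnings for the whole run, then restore the old setting.
old_warn <- getOption("warn")
options(warn = -1)           # negative values suppress warnings entirely

source("my_analysis.R")      # placeholder for the heavy script

options(warn = old_warn)     # restore the previous warning behaviour
```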
The best way to find the answer is to profile the code and see what the delta is between running in R vs RStudio.
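As a rough sketch of that comparison using base R's sampling profiler (file and script names are placeholders), run the same thing once in a terminal R session and once in RStudio Server and compare the summaries:

```r
# Start the sampling profiler, run the script, then stop and summarise.
Rprof("profile_run.out")
source("my_analysis.R")      # placeholder for the script being investigated
Rprof(NULL)

# Time spent per function; compare this between terminal R and RStudio Server.
head(summaryRprof("profile_run.out")$by.self, 20)
```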
If you plan to continue working with large data sets and performance is an issue, I would recommend looking into the data.table package. tidyverse is expressive but, like base R, not great with memory use or execution time. I've gotten pretty good performance jumps with data.table.
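For example, a minimal sketch of the kind of swap I mean (file paths and column names are made up):

```r
library(data.table)

# fread()/fwrite() are usually much faster than readr's read_csv()/write_csv()
# on files in the hundreds-of-MB range.
dt <- fread("one_big_file.csv")

# Grouped summary done in place, without the intermediate copies
# a dplyr pipeline would typically create.
result <- dt[, .(mean_value = mean(value, na.rm = TRUE)), by = group]

fwrite(result, "summary_output.csv")
```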
Lastly, is there a need to run the code in RStudio once it's been developed?
I wouldn't run production code from any IDE.
Thank you for the informative comment. Actually, my code is still actively being developed. I just tend to occasionally refresh the session and "Run All" the code from top to bottom to make sure it is good so far. I did a quick check, and it turns out that the reading/writing time is not the main source of the slowdown. I will try your other suggestions and take a look at the data.table package. I will get back with an answer (the primary source of the slowdown) if I find one.
I think the primary source of the slowdown was the huge number of collateral messages (neither warnings nor errors) produced by the statistical test functions used in my script. For example, my script contains code that executes hundreds of thousands of cor.test() calls (inside either apply or a for loop) with the option method = "spearman". By default, this function produces the collateral message
Cannot compute exact p-value with ties
when ties exist in the data. In R launched from the terminal, these collateral messages are not printed out by default.
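In case it helps anyone else, this is roughly how those calls could be silenced per call; x and y are made-up data, and I wrap the call in both suppressWarnings() and suppressMessages() since I'm not certain which mechanism emits the text:

```r
set.seed(1)
x <- sample(1:10, 500, replace = TRUE)   # sampling from few values guarantees ties
y <- sample(1:10, 500, replace = TRUE)

# Each call would normally print "Cannot compute exact p-value with ties";
# wrapping the call silences that output for this call only.
res <- suppressWarnings(suppressMessages(
  cor.test(x, y, method = "spearman")
))
res$p.value
```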