As an interpreted programming language, R may not have the most satisfactory speed, but its statistics and data science ecosystem is amazing and has helped me a lot throughout my career. However, I do wish that R could ultimately tackle this speed issue. At the same time, Python has put a lot of effort into fixing its performance issues, with projects such as Numba and Cython.
I have been following the Renjin project for about three years now, and it seems quite promising. I was wondering whether R could adopt the JVM-based interpreter approach used by Renjin and improve its performance.
Well, you also have data.table, Rcpp and friends to help you with speed, so I would say there is already quite a bit of work being done in the direction of making R faster.
Last time I checked, Renjin was not playing very nicely with C and C++ packages (e.g., dplyr). Has this situation improved? In other words, what are the downsides of using Renjin instead of GNU R right now, from your point of view?
You are right, sometimes dplyr does not work properly in Renjin, and many packages in Renjin are not very up to date. As for Rcpp, I really admire the effort that has been devoted to it. At the same time, I do want to see fast native R code instead of resorting to a compiled language like C/C++, Fortran, etc. As for data.table, I used to use it a lot, but I gradually moved to the Hadleyverse for better readability and consistency, though I still use the fread function from time to time.
I have a piece of code as below:
require(tidyverse)
require(lubridate)
#create quarter range
my_quarters <- rep(ymd("2000-03-31"), 72)
for (i in 1:73) {
  my_quarters[i] <- ymd("2000-03-31") %m+% months(3 * (i - 1))
}
#find the institutional holding horizon over the last 3 years (12 quarters or 36 months) for each quarter starting at 2000-03-31
horizon_data <- vector("list", length(my_quarters))
for (i in seq_along(my_quarters)) {
  horizon_data[[i]] <- hold_horizon(my_quarters[i], data = institution_investor)
}
#combine the data frames in the list
horizon_data_new <- do.call("rbind", horizon_data) %>%
select(cusip, date, long_percent, short_percent)
The data frame institution_investor has about 100 million rows and 4 columns. This block of code takes more than 30 minutes to run. I suspect that replacing the for loop with apply-family functions would make things better, but I am not sure.
The point of Rcpp is that most of the time users don't need to write in C++, because package authors already have. When you're using data.table or dplyr, you're running a lot of C++, even though you never see it. Many base R functions go straight to C (or sometimes Fortran).
Thus, speed in R is about avoiding the idioms that require lots of work in R itself or that play into its weaknesses (e.g., for loops that grow objects instead of filling preallocated ones). By writing better code, R can be quite fast. Since Renjin isn't playing nicely with the packages that can make R faster, it's not terribly practical.
The R Inferno covers (among other things) a number of the idioms that can make base R slow.
Packages built with Rcpp usually avoid these traps for you.
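For instance, here is a minimal illustration of one such trap, growing a vector inside a loop versus preallocating or fully vectorizing (a sketch; the size 1e5 is arbitrary and the timings will vary by machine):
# Trap: growing a vector one element at a time forces repeated copying
grow <- function(n) {
  x <- numeric(0)
  for (i in 1:n) x <- c(x, i^2)
  x
}
# Better: preallocate the result, then fill it in
prealloc <- function(n) {
  x <- numeric(n)
  for (i in 1:n) x[i] <- i^2
  x
}
# Best: let a vectorized primitive do the loop in C
vectorized <- function(n) (1:n)^2
system.time(grow(1e5))
system.time(prealloc(1e5))
system.time(vectorized(1e5))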
100 million rows can be a lot, especially if these are character columns. Can you execute this command and report the result?
format(object.size(institution_investor), units = "auto")
Also, if you have other sizable objects in memory, can you execute this and report?
library(pryr)
mem_used()
My point is that if you're using so much memory that you've started swapping because of this, then the sluggishness you're seeing is not (just) R-related - it's the operating system paging to virtual memory, which results in a HUGE slowdown.
I forgot to answer about the for loops: you can indeed substitute the for + do.call("rbind", ...) combo with map_df from purrr, but I don't think runtimes will go down dramatically. Also, the first for loop is not needed, though eliminating it won't bring you much of a benefit either (it will just teach you a bit about vectorization in R), i.e., instead of
#create quarter range
my_quarters <- rep(ymd("2000-03-31"), 72)
for (i in 1:73) {
  my_quarters[i] <- ymd("2000-03-31") %m+% months(3 * (i - 1))
}
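you could compute the whole sequence with one vectorized expression. A sketch (assuming you want the same 73 quarters the loop produces; %m+% and months() from lubridate are vectorized):
# add 0, 3, 6, ... months to the start date in a single vectorized call
my_quarters <- ymd("2000-03-31") %m+% months(3 * (0:72))
Similarly, the second loop plus do.call("rbind", ...) could be collapsed with map_df, e.g. (hypothetical, reusing your hold_horizon() and institution_investor within your tidyverse session):
library(purrr)
library(dplyr)
horizon_data_new <- map_df(my_quarters, hold_horizon, data = institution_investor) %>%
  select(cusip, date, long_percent, short_percent)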
Anyway, if you create a fake data set, I may try a couple things. It looks like institution_investor has four columns qtrdate, sharesheld, cusip, ownercode, which, I guess, are respectively a "Date", a "numeric" and two "character". In R you can easily create vectors of random numbers, random dates or random strings. Thus you could easily provide code to generate a "realistic" institution_investor data frame of, say, 1 million rows.
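For example, a minimal sketch of such a generator (the column names and types are my guess from above; the pool sizes and distributions are entirely arbitrary):
library(tibble)
library(lubridate)
set.seed(42)
n <- 1e6
cusip_pool <- sprintf("%08d", sample.int(1e7, 5000))        # fake 8-character identifiers
quarter_pool <- ymd("2000-03-31") %m+% months(3 * (0:72))   # the same quarter ends as above
institution_investor <- tibble(
  qtrdate    = sample(quarter_pool, n, replace = TRUE),
  sharesheld = rlnorm(n, meanlog = 8),
  cusip      = sample(cusip_pool, n, replace = TRUE),
  ownercode  = sample(LETTERS[1:5], n, replace = TRUE)
)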
Finally, apart from the R Inferno suggested by @alistaire (which is a great resource), I have a couple generic suggestions which could help you:
update R to 3.5.0 if you haven't done it already, since it contains some significant speedups.
rather than using an interpreter which is not fully compatible with the packages on CRAN, such as Renjin, you could try an interpreter which is fully CRAN-compatible but has been compiled to take advantage of multi-threading, such as Microsoft R Open. I don't think you will see a big difference in your case, since your code is not performing numerical linear algebra operations, but in the worst case you simply won't see an improvement, and you can easily go back to CRAN R.
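A quick way to check whether a multi-threaded BLAS would matter for your workload is to time a dense matrix multiplication (a sketch; the matrix size is arbitrary):
# dense linear algebra is where a multi-threaded BLAS (as in MRO) shines
n <- 2000
A <- matrix(rnorm(n * n), nrow = n)
system.time(A %*% A)   # compare this timing between CRAN R and MRO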
I have just updated my R to 3.5 and reinstalled all packages; now the code takes 15 minutes to run, which is acceptable. I did try MRO last year, but I did not see any substantial performance improvement, and I do not like its update policy.
For pure linear algebra computations, I can use Fortran or Julia, but for data science and statistics stuff, I will stick to GNU R.
One more thing: once you've vectorized as much as possible, parallelizing any non-vectorizable iteration can provide a nice speedup (depending on the hardware, of course). future and furrr (which provides versions of purrr functions run with future) make parallelizing pretty easy these days, though there are alternatives if you prefer.
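For instance, a rough sketch of the earlier loop rewritten with furrr (hypothetical, reusing your hold_horizon() and institution_investor; plan() chooses the parallel backend, and shipping a 100-million-row data frame to the workers has its own overhead, so measure before committing to this):
library(future)
library(furrr)
library(dplyr)
plan(multisession)   # run iterations in parallel background R sessions
horizon_data_new <- future_map_dfr(my_quarters, hold_horizon,
                                   data = institution_investor) %>%
  select(cusip, date, long_percent, short_percent)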
Do you mean the update policy of the packages, or the update policy of the binaries? I.e., you don't like the fact that by default when you install packages, you get the version corresponding to a fixed date (e.g., Jun 1, 2018 for MRO 3.5.0)? Or you don't like the fact that MRO 3.5.0 has only been released this week, while CRAN R 3.5.0 has been around since April?
If you don't like the fact that the binaries lag so far behind CRAN R, well, I don't like that either, and there's no fix, so if you prefer to have the latest binaries, scrap MRO. I stopped using it for this reason, but now that they've released MRO 3.5.0 I was thinking about giving it one more chance (I already use R and Python in my job; no way I'm also introducing a dependency on Julia just for the sake of linear algebra computations).
Instead, if your issue is with the outdated packages, you can either change the default repository permanently or you can use chooseCRANmirror() to modify it on a per-session basis.
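For the permanent route, a minimal sketch (put this in your ~/.Rprofile; cloud.r-project.org is just one possible mirror):
# point install.packages() at an up-to-date CRAN mirror instead of the default snapshot
options(repos = c(CRAN = "https://cloud.r-project.org"))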
Note: I'm not trying to sell MRO - I'm just sharing some info that was useful to me in the past. But if you prefer CRAN R, I can definitely understand - I'm using that right now!
One of the driving ideas behind Renjin is that you should be able to write code in R that is just as fast as (or faster than) Fortran/C++. That being said, there's still a lot of work to do on our compiler and on the analysis required for efficient compilation, but the following example demonstrates the potential:
library(renjin)
bigsum <- function(n) {
  sum <- 0
  for (i in seq(from = 1, to = n)) {
    sum <- sum + i
  }
  sum
}
bigsumc <- compiler::cmpfun(bigsum) # GNU R's byte code compiler
system.time(bigsum(1e8))
system.time(bigsumc(1e8))
system.time(renjin(bigsum(1e8)))
How do you envision writing code with Renjin? When I tested it, the GUI wasn't a suitable replacement for RStudio, so I was forced either to use the REPL, or to write code in RStudio without running it and then run the scripts in a JVM using engine.eval(..). Neither workflow allowed efficient code development/debugging. Opening RStudio, writing scripts that load renjin as a library in GNU R, and testing them from inside RStudio seems a much better idea. Is this how you see writing Renjin code?
Last time I tested Renjin, I found it to be incompatible with a few packages I use often (including the current version of dplyr), so I dropped it, but things may well have changed for the better in the meantime. It would be nice to get up to date on the current status of your project. Are you going to present at useR! 2018? Will slides/videos be available afterwards?
I heard a talk on a package called matter, which is not on CRAN but on Bioconductor. It is designed to handle larger-than-memory data and should be faster than ff or bigmemory (at least that's what the author claimed). It doesn't store the data in RAM; it keeps a pointer to the file on disk and works on it efficiently.
I haven't used it myself, but you could give it a try.
Let us know if it reduces the runtime of your program.
If most R functions are written in C or Fortran, why do we need a Java-based compiler instead of gcc or something similar? Aren't the packages we run already byte-compiled?
Yes and no. Interpreted languages like R and MATLAB have many functions written in compiled languages like C++ and Fortran; however, you still rely on the interpreter when you write your own functions.
Not entirely. R has gradually added JIT compiling, so with R 3.5.0, most closures and top-level loops will be JIT-compiled by default. See ?compiler::compile for details.
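For instance, a small sketch of what that looks like in practice (enableJIT() and cmpfun() come from the bundled compiler package; the toy function and sizes are just illustrative):
library(compiler)
slow_sum <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total
}
enableJIT(3)                      # JIT level 3 is the default in recent R; returns the previous level
system.time(slow_sum(1:1e7))      # the first call gets JIT-compiled automatically
slow_sum_c <- cmpfun(slow_sum)    # or byte-compile explicitly, as in the bigsum example above
system.time(slow_sum_c(1:1e7))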
I am aware of that; actually, JIT was added to R well before 3.5.0, but it does not seem as effective as I had hoped. The 3.5.0 release byte-compiles more packages, which makes things faster. My hope is that there could be a modern counterpart to GNU R, just like what Nim is to Python: similar syntax but better performance.