It is often the case that, in developing a data science product, the preliminary analysis and prototyping are done in R (thanks to its superior tools for visualization, exploration, fast modelling...) but when it's time to deploy the models in a production environment one switches to Python. One of the main reasons for this is that Python often integrates better with the rest of the developer stack, as it's a general-purpose language widely used for web development and so forth. That's fair enough, and R cannot compete with Python on this ground, even though this specific advantage should only hold for companies where Python is already used (e.g., in my company everything is done in .NET and C#, and no developer actually uses Python in production).
My question is: are there any other limitations to R (apart from the above) that present a problem for its use at large scale in a production environment? And if so, could these limitations be addressed by developing new R packages with utilities that would ease the use of R in a production environment? As an example, I think @hadley's strict package (https://github.com/hadley/strict) would be quite helpful in a production environment. Opinions?
To be honest, I hear this sort of thing a lot, but I've never actually seen it. At least not in the corporate world. Actually, I did speak to someone where this had happened, but it was due mainly to a lack of R resource rather than any perceived deficiency in the language.
In general, at the companies I've worked with, if their production stack is Python, they'll do their development in Python. I have worked with a few companies who've converted R prototypes to Java, but even this practice is rapidly shrinking.
Increasingly I see people go straight to production with R, though. The tooling and ecosystem are so good now that it's much easier than it used to be, even 5 years ago. R is also becoming increasingly general in its potential uses. I'm not a data scientist or a statistician, and I use it all the time!
I think the next big thing for R in production, and hopefully the final nail in the coffin for this feeling that R isn't for production, will be Joe Cheng's async package. Good async tooling will finally quiet those occasional but-R-is-single-threaded voices that we sometimes hear.
Don't get me wrong, I also agree that R can be definitely used in production. Indeed, I plan to use R in production for data products myself where I work, since the development stack is C# and not python anyway, and we're expanding our R skills by hiring more people. So R in production will be a fact in my company soon.
What I was wondering is just what would make R even better for a production environment. I mentioned @hadley's strict package as an example, which I have found useful in that context. Another package that I find amazing for production environments is data.table. While I love the tidyverse for exploration, analyses and reporting, I always switch to data.table in production as it's way faster, and in my experience every bit of performance one can get in production matters.
Also excited about @jcheng's async package! Having that will be super amazing indeed.
Morning, can you share with me what the actual product is that you see delivered using Python? Is it a live API, or is it a batch process that records the new scores back to a back-end DB?
I can also see R getting a bad reputation in production from bad experiences. My company doesn't use R systemically, but through word-of-mouth, I'm getting pulled into a group deploying a script that implements a model that an external consulting group developed for us.
While it's not the worst script ever, it has some distinct issues that would be likely to cause periodic problems if deployed as-is. These include:
Needlessly reinstalling packages (and thus pointlessly requiring an internet connection)
No ties to specific versions of packages (so it could break due to future updates)
Rebuilding the model each time the script is run (wasting time and potentially making results less reproducible)
It definitely falls in the "you paid someone for this?" camp. The thing is, with very limited knowledge of R, all of the potential problems listed could become associated with the language in the minds of users and our IT group.
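The first issue in the list above (reinstalling packages on every run) can be avoided by checking that dependencies are present and failing fast when they are not. The following is only a minimal sketch using base R; `missing_packages` and the package names passed to it are hypothetical and are not taken from the script under discussion.

```r
# Hypothetical helper: report which required packages are not installed,
# instead of calling install.packages() on every run.
missing_packages <- function(packages) {
  is_installed <- vapply(
    packages,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  packages[!is_installed]
}

# Illustrative dependency list; a real script would list its own packages.
required <- c("stats", "utils")
missing <- missing_packages(required)
if (length(missing) > 0) {
  stop("Missing packages: ", paste(missing, collapse = ", "), call. = FALSE)
}
```

With this pattern the script needs no internet connection at run time; installation becomes a separate, deliberate deployment step.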
Hi @nick, I know I'm missing something here, but in cases where an actual fitted model is "productionized", wouldn't all you have to run be the predictions on top of the new data, so that there's no need to re-fit the model every time? In that instance, you really only need a lightweight .rds file that contains the model, which should also reduce your package dependencies significantly, right? Just trying to understand that point better.
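To make the idea concrete, here is a minimal sketch of fitting once, saving the model as an .rds file, and then scoring new data without refitting. The built-in `mtcars` dataset, the `lm` formula, the file location and the `score_new_data` helper are all illustrative assumptions, not details of the script being discussed.

```r
# Training step, run once. mtcars is a built-in dataset used for illustration.
model <- lm(mpg ~ wt + hp, data = mtcars)
model_path <- file.path(tempdir(), "model.rds")  # hypothetical location
saveRDS(model, model_path)

# Scoring step, run for every new batch: no training data, no refit.
score_new_data <- function(new_data, path = model_path) {
  fitted_model <- readRDS(path)
  predict(fitted_model, newdata = new_data)
}

new_batch <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))
predictions <- score_new_data(new_batch)
```

One caveat: many model objects, `lm` included, keep a copy of the data they were fitted on, so the saved file is not always as small as one might hope.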
Yeah, that's exactly my point. I still need to get a copy of the training data so I can actually run the script, which shouldn't be necessary for end-use.
Just to make it 100% clear, I won't be letting the script be deployed as-is, but I need more information on the project before determining what is going to be the most straightforward method to fix it and make it end-user friendly. So far, all I have is the somewhat scary script collection.
Thanks @nick, that makes sense. And to add to your points, I guess that's where viewing the effort as a project (maybe executed inside an RStudio Project) is better than just focusing on a single script. This way you can separate the script or R Notebook you used for EDA and modeling from the R Markdown you used to present your results, and from the Shiny app or Plumber API you use to implement the model.
Agreed. The scripts I received do some work to separate out the different stages, so I'm hoping that I have the opportunity to convert it into a reasonably self-documenting project.
@edgararuiz It can be either a script writing predictions to a backend database or an API; I've met people who said they switched to Python in production in both these contexts, but then again the rest of their development stack was also Python, so that makes sense.
I have often found, however, that people who write R code are more accustomed to writing quick-and-dirty analysis scripts, R Markdown analyses or maybe Shiny apps than writing code for a robust system to be deployed in production (@nick's anecdote fits this). I think this probably comes from the nature of the community, as R is more prevalent among analysts and statisticians and, as Hadley mentions somewhere in his Advanced R, the best practices of software engineering are "often patchy" (quoting from memory, but it's something along these lines). So it might be that the alleged unfitness of R in production (vs Python, let's say) has little to do with the language itself but rather with the expertise of the people using R. I think this is actually what underlies @sellorm's discussion in the slides he linked to above, although this might be changing, I don't know. I did wonder, though, if there are additional features that could be added to the language to boost its attractiveness for production environments.
For my part, I haven't used R in production yet and certainly don't claim to be a fabulous software engineer, but I plan to set up a production system using R and these are the steps I'll follow:
1. Exploration and preliminary modelling using a static dataset in R notebooks. At this phase different possible models are explored and a selection of them is determined for final consideration.
2. Setting up an automated system (on a development server) for batch retraining of the final selection of models, tracking their error over time (i.e., run a cron job every period to retrain and re-evaluate the models on new data and write their error measures to file, so I can inspect their time series). I am particularly concerned about this intermediate step because I want to make sure that my model is always accurate under changing conditions.
3. The best model selected at point (2) will indeed be deployed as a lightweight .rds to score data and write to a backend DB. In the future I want to experiment with exposing the models as APIs too.
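The retrain-and-track step in point (2) could look roughly like the sketch below, which a cron job would run via Rscript. Everything specific here is an assumption made for illustration: the `evaluate_and_log` helper, the `lm` model on the built-in `mtcars` data, the simple ordered train/test split, and the CSV log format.

```r
# Hypothetical retraining job: refit on the earlier rows, evaluate on the
# later rows, and append the error measure to a CSV log.
evaluate_and_log <- function(data, log_file, train_fraction = 0.8) {
  n_train <- floor(nrow(data) * train_fraction)
  train <- data[seq_len(n_train), , drop = FALSE]
  test <- data[-seq_len(n_train), , drop = FALSE]

  model <- lm(mpg ~ wt + hp, data = train)
  rmse <- sqrt(mean((test$mpg - predict(model, newdata = test))^2))

  entry <- data.frame(
    run_time = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    n_train = n_train,
    rmse = rmse
  )
  # Write the header only when the log file is first created.
  is_new <- !file.exists(log_file)
  write.table(entry, log_file, sep = ",", row.names = FALSE,
              col.names = is_new, append = !is_new)
  invisible(rmse)
}

log_file <- file.path(tempdir(), "model_errors.csv")  # hypothetical path
evaluate_and_log(mtcars, log_file)
```

Reading the log back with `read.csv(log_file)` gives one row per run, which can then be plotted as a time series of model error.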
For both steps (2) and (3) I'm making heavy use of R packages. It seems to me that being able to write R packages should be a crucial skill for an R programmer, at least as important as (or, indeed, even more important than) using R Markdown and such, as it makes code much more robust for future use.
I am still trying to figure out how to proceed with regard to the CRAN packages that my production system will depend on, i.e. what the best practice is to minimize the chance that an updated package will break the system. I was thinking of Docker containers, but perhaps there's a simpler way to go? Any ideas?
As @edgararuiz mentioned, packrat is meant for controlling package dependencies on a per-project basis.
You're right that packaging your own code is the right way to go. Packages are the intended way for sharing code and data in a machine-agnostic way. Creating packages was often confusing and frustrating (especially on Windows), but RStudio now makes it so simple. I suggest using the miniCRAN package for creating your own package repository to store your internal packages (https://cran.r-project.org/package=miniCRAN).
packrat is nice, but if you want a particular version of a package (or not to update some packages) you might encounter difficulties ...
In general, packrat and versioning are hard and not maintainable in the long term.
For the 2nd and 3rd parts, I think you will find Rscript and littler useful. littler is especially great. I found it only recently, but if you use the scripts from its examples directory you can save a lot of time and coding, and your R scripts will be on the same level as bash or any other language.
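As a rough illustration of treating an R script like any other command-line program, here is a minimal Rscript entry point. The file name `score.R`, the two-argument interface and the placeholder logic are hypothetical; the same structure should carry over to littler, though its argument handling differs.

```r
#!/usr/bin/env Rscript
# Hypothetical command-line script, e.g. saved as score.R and run as:
#   Rscript score.R input.csv output.csv

parse_args <- function(args) {
  if (length(args) != 2) {
    stop("usage: Rscript score.R <input.csv> <output.csv>", call. = FALSE)
  }
  list(input = args[[1]], output = args[[2]])
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  opts <- parse_args(args)
  data <- read.csv(opts$input)
  # Placeholder for real scoring logic: this sketch only counts the rows.
  result <- data.frame(rows_read = nrow(data))
  write.csv(result, opts$output, row.names = FALSE)
  invisible(0)
}

# Run main() only in a non-interactive session where arguments were supplied.
if (!interactive() && length(commandArgs(trailingOnly = TRUE)) > 0) {
  main()
}
```

Keeping the logic in `main()` rather than at the top level also makes the script easy to source and test from an interactive session.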
There is a bit of an issue with packages that interface with external applications. E.g., I was using the RSelenium package and then there was an update from the Selenium folks that broke my code. Such things can happen with any other language if the underlying technology gets upgraded. On the other hand, I use the RWordPress package to update my blog, and it has worked absolutely fine and seamlessly for the last year or so.
I believe it's the way one picks and chooses packages that makes R more production ready. As such, there are many ways to do stuff in R the wrong way and it's very easy to fall for them.
I second MRAN snapshots. One thing I've noticed with them is that if you include installing from the snapshot in your Docker container, it won't cache it and will re-install everything every single time the image is built. Not the end of the world, but with lots and lots of C++ in dplyr, for example, compilation takes time.
In fact, I wanted to add that in my company dependency management has been the biggest sticking point. Every time, we need to make sure that dependencies won't break, and every time we must guarantee that package versions are fixed (this is done trivially in Python with a requirements.txt file, so there is an expectation that it should be as easy with R too).
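To illustrate what a requirements.txt-style check might look like in R, here is a minimal sketch using only base R. It is not how packrat or MRAN work, and the `write_requirements` and `check_requirements` helpers and the `pkg==version` file format are hypothetical; the sketch only records the installed versions and later reports any drift from them.

```r
# Hypothetical helpers: pin the currently installed versions in a text file,
# then verify later that the installed versions still match.
write_requirements <- function(packages, path) {
  versions <- vapply(
    packages,
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  writeLines(paste0(packages, "==", versions), path)
  invisible(path)
}

check_requirements <- function(path) {
  parts <- strsplit(readLines(path), "==", fixed = TRUE)
  mismatches <- character(0)
  for (p in parts) {
    pkg <- p[1]
    pinned <- p[2]
    # A package that is not installed is reported as version NA.
    installed <- tryCatch(
      as.character(utils::packageVersion(pkg)),
      error = function(e) NA_character_
    )
    if (is.na(installed) || installed != pinned) {
      mismatches <- c(
        mismatches,
        sprintf("%s: pinned %s, installed %s", pkg, pinned, installed)
      )
    }
  }
  mismatches
}

req_file <- file.path(tempdir(), "requirements.txt")  # hypothetical path
write_requirements(c("stats", "utils"), req_file)
drift <- check_requirements(req_file)
```

An empty `drift` vector means the environment matches the pinned versions; a deployment script could stop with an error otherwise. Actually installing the pinned versions is the harder part, and is what tools like packrat or snapshot repositories address.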
Other than that, most of the complaints from DevOps tend to be: "Well, it's not Python and I know Python, so use it instead". Those are not valid complaints, of course, but there is this prevailing attitude for sure.
Another point that has often been a sticking point is the ability to easily switch environments when deploying to staging/production. We are doing it right now with different yaml files for staging/production, but there is a package that lets you do that in one yaml file: all you need to do is set one environment variable and the correct parameters will be loaded (UPDATE: I've found it. It is called config, who would've thought.)
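The mechanism can be sketched in base R as follows. This is not the config package itself, which reads its settings from a YAML file and selects the active configuration from the `R_CONFIG_ACTIVE` environment variable; the `settings` list, its values and the `load_settings` helper below are hypothetical stand-ins that mimic the same idea.

```r
# Hypothetical settings; names and values are illustrative only.
settings <- list(
  default = list(db_host = "localhost", db_name = "scores_dev",
                 log_level = "debug"),
  staging = list(db_host = "staging-db.internal", db_name = "scores_staging"),
  production = list(db_host = "prod-db.internal", db_name = "scores",
                    log_level = "warn")
)

# Pick the environment from one variable; fall back to "default" if unset.
load_settings <- function(env = Sys.getenv("R_CONFIG_ACTIVE", "default")) {
  if (!env %in% names(settings)) {
    stop("unknown environment: ", env, call. = FALSE)
  }
  # Environment-specific values override the defaults.
  utils::modifyList(settings$default, settings[[env]])
}

active <- load_settings("staging")
```

Here `active$db_host` comes from the staging block while `active$log_level` is inherited from the defaults, so each environment only has to state what differs.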
In general, from my experience, the biggest complaints about R in production tend to stem from a lack of education, since most of the underlying problems can be solved. So it is up to us, R developers, to explain in detail why certain fears are at least exaggerated and can be addressed without too much pain.
@spiritus87 - really good reasoning and structuring of productionizing the project. What actually would be great is to have guidelines, best practices and tangible code examples for productionizing R scripts (models and others). Does anyone over here have good, practical references?
So I definitely agree that packrat has its difficulties, but I wanted to be clear (cf. @xhudik) that it can handle a particular version of a package, not updating packages, etc. so long as your computer can install packages from source. "difficulties" is the operative word in his comment - IMO they are not insurmountable difficulties, though. As mentioned, dependencies on third party tools outside of the R universe (cf. Selenium above, @s_maroo) would need to be handled separately.
Packrat also has an advantage over MRAN in that it can include versions of locally developed packages (not on CRAN) and git repos. In the past, I have used the drat package to build local CRAN-like repositories of locally developed packages. Also, after getting used to some of the nuances of packrat, I really like that it declares explicit version numbers (like requirements.txt) and does make my code stable / reproducible using CRAN's archived package sources.
@konradino A budding discussion on guidelines for using R in production is here.