I am a longtime R and tidyverse user, and I recently joined a small data team of three at a company that uses GCP extensively for its data pipelines and analysis. Our data engineer pipes data from our sources into BigQuery, and I create many views and tables in BigQuery from that data.
I must say BigQuery is great and handles GB-sized tables with ease, but I'd also like to use R on some of the smaller tables stored in BigQuery. Each tool has fairly clear strengths: BigQuery has the power to handle big data, while R offers flexibility when working with smaller data. I'd like to introduce R into our stack to handle what it's good at. When I proposed this, I received the following feedback:
"deploying r onto our pipelines environment will not be straightforward. we'd have to make a special kubernetes pod just for r and maintain that. also r isn't as performant and library management isn't as easy. it isn't a language built for building production pipelines."
With that said, I am interested in hearing whether anybody here has successfully integrated R and BigQuery into a production GCP data pipeline. I don't know enough about data engineering or our pipeline to know better than our engineer on this, and he is convincing about R's weaknesses in this regard (library management, performance, lots of setup work). However, I am highly proficient in R and am confident that if my team let me introduce R into our stack and pipeline, it would be helpful.
Any thoughts or related experiences on this would be greatly appreciated, thanks!
It's difficult to say exactly what he means by that, but it's definitely not as clear-cut as he makes it sound. It's possible (and not at all difficult) to use R with bigrquery to provide at least a frontend for some of the work you are (I assume) doing with SQL. Moreover, I'm not entirely sure why a dedicated pod just for running R would be needed. Managing dependencies is also not that difficult with renv, for example, or even with Docker...
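To give a rough idea, here is a minimal sketch of the DBI/bigrquery route. The project, dataset, and table names are placeholders, and it assumes you have already authenticated:

```r
library(DBI)
library(bigrquery)

bq_auth()  # interactive Google auth; a service account key is more typical in pipelines

con <- dbConnect(
  bigquery(),
  project = "my-project",   # placeholder GCP project
  dataset = "analytics",    # placeholder dataset
  billing = "my-project"    # project billed for the queries
)

# Send the SQL you already have to BigQuery and pull back only the result
daily <- dbGetQuery(con, "
  SELECT event_date, COUNT(*) AS n_events
  FROM events
  GROUP BY event_date
")

head(daily)
dbDisconnect(con)
```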
I guess there is a lot more I could say about this, but without actually understanding what your goal is for using R in this setup, it's difficult to say whether it's a good fit or not.
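And on the library-management objection, this is roughly all renv involves (the package names are just examples):

```r
# In the analysis project (run once):
install.packages("renv")
renv::init()                   # project-local library plus an renv.lock file
install.packages("bigrquery")  # packages are installed into the project library
renv::snapshot()               # pin the exact package versions in renv.lock

# On the pipeline machine or inside the Docker image:
renv::restore()                # reinstall exactly the versions recorded in renv.lock
```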
Hi, here are some links that may help you see how to interact with a BigQuery database from R. The biggest thing is to try to run predictions, visualizations, and even modeling inside the database, and only retrieve results when needed. In other words, think of R as a way to orchestrate the pipeline inside BigQuery.
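For example, a sketch of the dbplyr/bigrquery pattern (the connection details, table, and column names are made up): the dplyr verbs are translated to SQL and executed by BigQuery, and nothing is downloaded until collect().

```r
library(DBI)
library(dplyr)
library(dbplyr)
library(bigrquery)

con <- dbConnect(
  bigquery(),
  project = "my-project",   # placeholder
  dataset = "analytics",    # placeholder
  billing = "my-project"
)

events <- tbl(con, "events")       # lazy table: no data is downloaded yet

daily_counts <- events %>%
  group_by(event_date) %>%
  summarise(n_events = n()) %>%
  arrange(event_date)

show_query(daily_counts)           # the SQL that BigQuery will actually run
daily_df <- collect(daily_counts)  # only the small aggregated result comes back to R
```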
Our data pipeline currently pipes in a handful of fairly large (1 GB - 100 GB) tables containing the raw data we need. All of the tables have dates, and we're using BigQuery to group the data by day and then compute metrics from the raw data at the day level (e.g. how many times certain things happened each day).
Once we have the data at the day level, there are many different analyses we plan to run on it. The day-level data is also small: none of these tables is larger than 10 - 100 MB.
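To make that concrete, the workflow I'd like to end up with looks roughly like this (the project and table names are made up): the heavy aggregation stays in BigQuery, and only the small day-level result is pulled into R for analysis.

```r
library(bigrquery)
library(dplyr)

billing <- "my-project"  # placeholder billing project

# Run the day-level aggregation as a BigQuery job; the result stays in BigQuery
# (a temporary table) until we explicitly download it.
result_tbl <- bq_project_query(
  billing,
  "SELECT DATE(event_ts) AS day, COUNT(*) AS n_events
     FROM `my-project.raw.events`
    GROUP BY day"
)

# Only the small day-level table (well under 100 MB) comes into R.
daily <- bq_table_download(result_tbl)

# From here it's ordinary tidyverse work on a small data frame.
daily %>%
  arrange(desc(n_events)) %>%
  head(10)
```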
The article by Mark Edmondson will only point you towards containers again, which is exactly what your colleagues seem to have a problem with.
If you are using Java to build your data pipelines on GCP, then you could have a look at using Renjin as a Java library. The GCP blog on Medium had an article on Using Renjin with Cloud Dataflow, posted in 2016. The benefit is that you can write parts of your analysis in R and still make optimal use of GCP's ability to scale automatically (without using containers). Then again, it is not clear from your question whether this is what you need.
Disclaimer: I work for the company behind Renjin, so I am naturally biased.