Case studies of using R as ETL tool for data warehousing?

@Hlynur I'm really thankful for this discussion! My response here is going to be obviously biased, but I hope it will also be rather informed. I'm about 92.5% confident in my response, but I'd be happy for others to chime in as well.

At this stage in development we can think of RStudio Connect (RSC) as a really nice and fancy CRON scheduler (I'll definitely be adding @mic0331's vote for Airflow integration into RSC as a feature request!).

Your thoughts about using tools like sparklyr, DBI, and httr are 100% in line with what I would recommend. As @mic0331 points out, languages like Python and R are not going to scale effectively for data analysis once you get to truly massive data sizes. As such, by using dbplyr / DBI and their family, you will not actually be doing your work in R, so to speak, but rather in the database that houses the data. This adheres to @mic0331's recommendations, particularly if you're using sparklyr.
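To make the "work happens in the database" point concrete, here's a minimal sketch using an in-memory SQLite database as a stand-in for a real warehouse (the table name and data are made up for illustration; you'd point `dbConnect()` at your own backend):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Stand-in for your warehouse connection
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "sales",
                  data.frame(region = c("a", "a", "b"),
                             amount = c(10, 20, 5)))

sales <- tbl(con, "sales")

# These dplyr verbs are translated to SQL and executed in the
# database, not in R:
summary_query <- sales %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE))

show_query(summary_query)  # inspect the SQL dbplyr generated

DBI::dbDisconnect(con)
```

`show_query()` is a handy sanity check here: it proves that what you wrote as dplyr is really just a SQL statement that the database will run.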

I have seen a lot of success with this approach, particularly at institutions similar to yours. By staying inside of R and using other tools such as Spark (which runs on a Scala/JVM runtime), you will not have to learn new languages or frameworks, you will be able to drive other systems via abstraction (R has a wonderful history as an interface language), and you will reduce the overall footprint of your RSC server by pushing work to your databases and APIs (if using httr).
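As a hedged sketch of what "Spark from R" looks like, assuming a local Spark installation (the `master` URL and the copied demo table are illustrative; in practice you would read from your warehouse with `spark_read_jdbc()` or `spark_read_parquet()` rather than copying data in):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance; in production this would be
# your cluster's master URL plus any config.
sc <- spark_connect(master = "local")

# Copying a demo dataset in for illustration only
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# The same dplyr verbs as before, but now executed as Spark jobs
# on the JVM, not in the R process:
delays <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

spark_disconnect(sc)
```

The key point: R is the interface, Spark does the computation.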

To @joseluizferreira's statement: you will not be able to load balance the ETL via Connect, because the ETL will not actually be occurring on RSC. That does not mean you cannot configure a load-balanced RSC cluster, however (see https://docs.rstudio.com/connect/admin/load-balancing/ for this; you'll need to change your database from SQLite to PostgreSQL). RSC sends the code over to the databases to do the heavy lifting, and the data is only returned as an R object when collect()ed. If you want to be extra conscientious about memory, you can do it all with dbplyr or the spark_apply() functions.
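The collect() behaviour is worth seeing in miniature. In this sketch (again using an in-memory SQLite table as a stand-in), the lazy table moves no rows into R until collect() is called:

```r
library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "events",
                  data.frame(id = 1:5, value = c(3, 8, 1, 9, 4)))

# A lazy query: no data has moved into R yet
lazy_tbl <- tbl(con, "events") %>%
  filter(value > 3)

# Only this step materialises rows in R memory
result <- collect(lazy_tbl)

DBI::dbDisconnect(con)
```

Everything before collect() is just a query description, which is why RSC's memory footprint stays small even when the underlying tables are huge.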

TL;DR

  • Big ETL should be done by big data tools (e.g. databases & Spark)
  • You can use big data tools from R via sparklyr, DBI, dbplyr, httr, etc. (see db.rstudio.com)
  • RSC can be load balanced.
  • RSC does not have DAG scheduling (though you can get really far with standard scheduling)