I’m working on Databricks, which has deprecated the SparkR package and recommends using sparklyr. While transitioning, I’ve encountered a major limitation:
- `sparklyr::sdf_sql()` and `dplyr::sql()` are designed for `SELECT` queries only, not for DDL commands like `CREATE OR REPLACE TABLE`.
- `DBI::dbExecute(sc, ...)` can handle DDL, but it is not safe in parallel workflows. When I launch three parallel tasks (e.g., `cv11`, `cv12`, and `cv13`), each executing a `CREATE OR REPLACE TABLE` command, Databricks throws path overlap errors: classic race conditions (see the sketch after this list).
- `SparkR::sql()` works perfectly here: it is thread-safe and handles DDL well in parallel. But since SparkR is deprecated, I want to find a better long-term alternative.
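For context, here is a minimal sketch of the failing pattern. I'm assuming parallelism via the `future` package with one connection per worker; the actual pipeline launches three Databricks tasks, and all table and source names below are hypothetical placeholders.

```r
# Minimal sketch of the race, assuming the `future` package for
# parallelism; table/source names are hypothetical placeholders.
library(sparklyr)
library(DBI)
library(future)

plan(multisession, workers = 3)

run_ddl <- function(suffix) {
  # Each worker opens its own connection to the Databricks runtime.
  sc <- spark_connect(method = "databricks")
  on.exit(spark_disconnect(sc))
  dbExecute(sc, sprintf(
    "CREATE OR REPLACE TABLE my_db.cv%s AS SELECT * FROM my_db.source_%s",
    suffix, suffix
  ))
}

# Launch the three DDL statements concurrently; run this way, they
# intermittently fail with path overlap errors.
futures <- lapply(c("11", "12", "13"), function(s) future(run_ddl(s)))
invisible(lapply(futures, value))
```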
I've tried:
- `sparklyr::invoke()` to access Spark's JVM backend. It works for DDL, but comes with side effects: for example, built-in functions like `REGEXP_SUBSTR()` fail due to an incorrect session context, breaking compatibility with other Spark SQL features. A sketch of this approach follows below.
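For reference, this is roughly what that route looks like: `spark_session(sc)` returns the underlying JVM `SparkSession`, and invoking its `sql()` method accepts DDL. The table name below is a hypothetical placeholder.

```r
# Sketch of the invoke() route: call SparkSession.sql() on the JVM
# object directly, which accepts DDL.
library(sparklyr)

sc <- spark_connect(method = "databricks")
session <- spark_session(sc)  # underlying JVM SparkSession

# The DDL itself succeeds (hypothetical table name)...
invoke(session, "sql",
       "CREATE OR REPLACE TABLE my_db.cv11 AS SELECT 1 AS x")

# ...but afterwards some built-ins such as REGEXP_SUBSTR() fail,
# apparently because the session context no longer matches the one
# sparklyr uses for its own queries.
```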
Other options:
- Writing `%sql` cells in Databricks notebooks is safe and handles DDL, but this doesn't integrate well into R scripts or R-based pipelines.
My question:
Is there any reliable method or function (in sparklyr, dplyr, or elsewhere in the R ecosystem) that supports running DDL like `CREATE OR REPLACE TABLE` safely in parallel, within an R context on Databricks?
This gap is frustrating because parallelism is core to big data work, and the lack of parallel-safe DDL support in sparklyr limits its usefulness in real production pipelines. At present, I still rely on `SparkR::sql()` because I have over 150k lines of R code and parallel execution is essential; the pattern I depend on is sketched below.
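For completeness, this is the SparkR pattern I still rely on, which each parallel task can run safely (table names hypothetical):

```r
# The deprecated-but-working pattern: SparkR::sql() submits DDL
# through the active session and has been parallel-safe for me.
library(SparkR)
sparkR.session()  # on Databricks this attaches to the existing session

sql("CREATE OR REPLACE TABLE my_db.cv11 AS SELECT * FROM my_db.source_11")
```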
It would be great if sparklyr could introduce a function like `spark_sql()` or support parallel-safe DDL operations directly.
Thanks in advance for any insight, workarounds, or best practices!