I’m working on Databricks, which has deprecated the SparkR package and recommends using sparklyr. While transitioning, I’ve encountered a major limitation: `sparklyr::sdf_sql()` and `dplyr::sql()` are designed for `SELECT` queries only, not for DDL commands like `CREATE OR REPLACE TABLE`. `DBI::dbExecute(sc, ...)`
can handle DDL, but it is not safe in parallel workflows. When I launch three parallel tasks (e.g., `cv11`, `cv12`, and `cv13`), each executing a `CREATE OR REPLACE TABLE` command, Databricks throws path overlap errors: classic race conditions. `SparkR::sql()` works perfectly here; it is thread-safe and handles DDL well in parallel. But since SparkR is deprecated, I want to find a better long-term alternative.
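For concreteness, here is a minimal sketch of the statement each task runs; the schema, table, and source names are hypothetical placeholders, and `sc` is the Spark connection. Issued sequentially, either route works; issued from three concurrent tasks, the `DBI` route hits the path overlap errors while `SparkR::sql()` does not.

```r
library(DBI)

# Hypothetical DDL template; only the target table differs per task.
ddl_for <- function(tbl) {
  sprintf(
    "CREATE OR REPLACE TABLE my_schema.%s AS SELECT * FROM my_schema.source_data",
    tbl
  )
}

# Intermittently fails with path overlap errors when cv11/cv12/cv13 overlap:
DBI::dbExecute(sc, ddl_for("cv11"))

# The same statement through SparkR succeeds even under concurrency:
SparkR::sql(ddl_for("cv11"))
```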
I've tried:
`sparklyr::invoke()` to access Spark’s JVM backend. It works for DDL, but comes with side effects: for example, built-in functions like `REGEXP_SUBSTR()` fail due to an incorrect session context, breaking compatibility with other Spark SQL features.
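For reference, the invoke-based route looks roughly like this (a sketch; the table name is a placeholder). It calls the `sql()` method of the JVM `SparkSession` directly, which accepts DDL but appears to execute in a different session context than `sdf_sql()`:

```r
library(sparklyr)
sc <- spark_connect(method = "databricks")

# Reach the JVM SparkSession behind the sparklyr connection and call its
# sql() method directly; unlike sdf_sql(), this accepts DDL statements.
session <- spark_session(sc)
invoke(session, "sql", "CREATE OR REPLACE TABLE my_schema.cv11 AS SELECT 1 AS id")
```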
Other options:
- Writing `%sql` cells in Databricks Notebooks is safe and handles DDL, but this doesn’t integrate well into R scripts or R-based pipelines.
My question:
Is there any reliable method or function (in sparklyr, dplyr, or elsewhere in the R ecosystem) that supports running DDL like `CREATE OR REPLACE TABLE` safely in parallel, within an R context on Databricks?
This gap is frustrating because parallelism is core to big data work, and the lack of parallel-safe DDL support in sparklyr limits its usefulness in real production pipelines. At present, I still rely on `SparkR::sql()` because I have over 150k lines of R code and parallel execution is essential.

It would be great if sparklyr could introduce a function like `spark_sql()` or support parallel-safe DDL operations directly.
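As a stopgap, something like this user-level shim (the name `spark_sql()` and its behavior are hypothetical, and it inherits the session-context caveats of the `invoke()` route above) is roughly what I have in mind:

```r
# Hypothetical wrapper: route arbitrary SQL, including DDL, through the
# connection's underlying SparkSession, mimicking SparkR::sql().
spark_sql <- function(sc, query) {
  sparklyr::invoke(sparklyr::spark_session(sc), "sql", query)
}

spark_sql(sc, "CREATE OR REPLACE TABLE my_schema.cv11 AS SELECT 1 AS id")
```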
Thanks in advance for any insight, workarounds, or best practices!