I’m working on Databricks, which has deprecated the SparkR package and recommends using sparklyr. While transitioning, I’ve encountered a major limitation: `sparklyr::sdf_sql()` and `dplyr::sql()` are designed for `SELECT` queries only, not for DDL commands like `CREATE OR REPLACE TABLE`. `DBI::dbExecute(sc, ...)`
can handle DDL, but it is not safe in parallel workflows. When I launch three parallel tasks (e.g., `cv11`, `cv12`, and `cv13`), each executing a `CREATE OR REPLACE TABLE` command, Databricks throws path overlap errors: classic race conditions. `SparkR::sql()` works perfectly here; it is thread-safe and handles DDL well in parallel. But since SparkR is deprecated, I want to find a better long-term alternative.
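For concreteness, here is a minimal sketch of the statement each task runs; the schema, table, and source names are hypothetical placeholders, and `sc` is the Spark connection. Issued sequentially, either route works; issued from three concurrent tasks, the `DBI` route hits the path overlap errors while `SparkR::sql()` does not.

```r
library(DBI)

# Hypothetical DDL template; only the target table differs per task.
ddl_for <- function(tbl) {
  sprintf(
    "CREATE OR REPLACE TABLE my_schema.%s AS SELECT * FROM my_schema.source_data",
    tbl
  )
}

# Intermittently fails with path overlap errors when cv11/cv12/cv13 overlap:
DBI::dbExecute(sc, ddl_for("cv11"))

# The same statement through SparkR succeeds even under concurrency:
SparkR::sql(ddl_for("cv11"))
```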
I've tried:
`sparklyr::invoke()` to access Spark’s JVM backend. It works for DDL, but comes with side effects: for example, built-in functions like `REGEXP_SUBSTR()` fail due to an incorrect session context, breaking compatibility with other Spark SQL features.
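For reference, the invoke-based route looks roughly like this (a sketch; the table name is a placeholder). It calls the `sql()` method of the JVM `SparkSession` directly, which accepts DDL but appears to execute in a different session context than `sdf_sql()`:

```r
library(sparklyr)
sc <- spark_connect(method = "databricks")

# Reach the JVM SparkSession behind the sparklyr connection and call its
# sql() method directly; unlike sdf_sql(), this accepts DDL statements.
session <- spark_session(sc)
invoke(session, "sql", "CREATE OR REPLACE TABLE my_schema.cv11 AS SELECT 1 AS id")
```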
Other options:
- Writing `%sql` cells in Databricks Notebooks is safe and handles DDL, but this doesn’t integrate well into R scripts or R-based pipelines.
My question:
Is there any reliable method or function (in sparklyr, dplyr, or elsewhere in the R ecosystem) that supports running DDL like `CREATE OR REPLACE TABLE` safely in parallel, within an R context on Databricks?
This gap is frustrating because parallelism is core to big data work, and the lack of parallel-safe DDL support in sparklyr limits its usefulness in real production pipelines. At present, I still rely on `SparkR::sql()` because I have over 150k lines of R code and parallel execution is essential.

It would be great if sparklyr could introduce a function like `spark_sql()` or support parallel-safe DDL operations directly.
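As a stopgap, something like this user-level shim (the name `spark_sql()` and its behavior are hypothetical, and it inherits the session-context caveats of the `invoke()` route above) is roughly what I have in mind:

```r
# Hypothetical wrapper: route arbitrary SQL, including DDL, through the
# connection's underlying SparkSession, mimicking SparkR::sql().
spark_sql <- function(sc, query) {
  sparklyr::invoke(sparklyr::spark_session(sc), "sql", query)
}

spark_sql(sc, "CREATE OR REPLACE TABLE my_schema.cv11 AS SELECT 1 AS id")
```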
Thanks in advance for any insight, workarounds, or best practices!