Seeking solution for parallel CREATE/REPLACE TABLE with sparklyr in Databricks

I’m working on Databricks, which has deprecated the SparkR package and recommends using sparklyr. While transitioning, I’ve encountered a major limitation:

  • sparklyr::sdf_sql() and dplyr::sql() are designed for SELECT queries only — not for DDL commands like CREATE OR REPLACE TABLE.
  • DBI::dbExecute(sc, ...) can handle DDL, but it is not safe in parallel workflows. When I launch three parallel tasks (e.g., cv11, cv12, and cv13), each executing a CREATE OR REPLACE TABLE command, Databricks throws path overlap errors — classic race conditions.
  • sparkR::sql() works perfectly here — it’s thread-safe, and handles DDL well in parallel. But since SparkR is deprecated, I want to find a better long-term alternative.

I've tried:

  • sparklyr::invoke() to access Spark’s JVM backend. It works for DDL, but comes with side effects — for example, built-in functions like REGEXP_SUBSTR() fail due to an incorrect session context, breaking compatibility with other Spark SQL features.

Other options:

  • Writing %sql cells in Databricks Notebooks is safe and handles DDL — but this doesn’t integrate well into R scripts or R-based pipelines.

My question:
:backhand_index_pointing_right: Is there any reliable method or function (in sparklyr, dplyr, or elsewhere in the R ecosystem) that supports running DDL like CREATE OR REPLACE TABLE safely in parallel, within an R context on Databricks?

This gap is frustrating because parallelism is core to big data, and the lack of parallel-safe DDL support in sparklyr limits its usefulness in real production pipelines. At present, I still rely on SparkR::sql() because I have over 150k lines of R code and parallel execution is essential.

It would be great if sparklyr could introduce a function like spark_sql() or support parallel-safe DDL operations directly.

Thanks in advance for any insight, workarounds, or best practices!

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.