tidymodels::tune_grid() fails silently on Azure Databricks after registerDoSpark(sc)

I want to tune a random forest model (with ranger) using the tidymodels framework. Because it takes too long locally, I'm trying to make it work on Azure Databricks.

  1. I first called the tune_grid() function in the R notebook with a tiny grid and it worked (I verified the output) but without parallelization.

  2. Next, I loaded the sparklyr package and ran the following code before trying the same tune_grid() call again:

sc <- spark_connect(method = 'databricks')
registerDoSpark(sc)

According to this blog post, this method should work.

However, although Spark jobs appear in the notebook, they finish quickly with many 'skipped stages', and no error is reported.

When I print the resulting object, I get:

# Tuning results
# Validation Set Split (0.75/0.25)
# A tibble: 1 x 4
  splits                id         .metrics .notes
  <list>                <chr>      <list>   <list>
1 <split [66222/22074]> validation <NULL>   <tibble [0 × 1]>

Does anyone have any idea why this happens or how to diagnose the problem? Is there a better way to do tuning with tidymodels on Azure Databricks? (I'm a complete novice when it comes to cluster computing.) I have seen several other options for machine learning on Azure Databricks, but if possible I'd prefer to stick with tidymodels, as I like the framework and would like to keep using essentially the same code as on my laptop.
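
One check that might help narrow this down is a small foreach smoke test that leaves tune_grid() out of the picture. This is a minimal sketch and has not been verified on Databricks. The helper names describe_backend() and probe_workers() are made up for illustration, and registerDoSEQ() is only a stand-in so the code runs anywhere. On the cluster, the idea would be to skip that line and probe whatever backend registerDoSpark(sc) registered.

```r
library(foreach)

# Hypothetical helper: summarise the backend that %dopar% will dispatch to.
describe_backend <- function() {
  list(
    registered = getDoParRegistered(),
    name       = getDoParName(),
    workers    = getDoParWorkers()
  )
}

# Hypothetical helper: run a few trivial tasks through %dopar% and report
# whether the 'ranger' namespace can be loaded where each task executes.
probe_workers <- function(n = 3) {
  foreach(i = seq_len(n), .combine = rbind) %dopar% {
    data.frame(
      task       = i,
      square     = i^2,
      has_ranger = requireNamespace("ranger", quietly = TRUE)
    )
  }
}

# Stand-in (assumption) so the sketch runs anywhere; on the cluster, skip
# this line and rely on the backend registered by registerDoSpark(sc).
registerDoSEQ()

backend <- describe_backend()
probe   <- probe_workers(3)

print(backend)
print(probe)
```

If the probe returns one row per task under the Spark backend, the backend itself is at least executing R code, and the next thing to look at would be the tuning call, for example by rerunning it with control = control_grid(verbose = TRUE). If the probe fails or comes back empty, the problem is more likely in the backend setup than in tidymodels.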

