Poor performance with sparklyr::spark_apply() in Azure databricks

sparklyr::spark_apply(x,nrow,group_by) is taking 33 mins in Azure DBX on DS14v2 with 2(min)-8(max) workers to finish on an input data frame with around 100k rows and 50 columns (string & double).

I don't have a reproducible example as my data is proprietary, but my code looks like:

dfResult <- sdf_input %>%
sparklyr::spark_apply(nrow,group_by='ID')

(group by ID results in 34 distinct groups of data)

I understand there is an overhead for serialization/deserialization between driver and worker nodes, but 33 mins to do an 'nrow' on 100k rows x 50 cols seems excessive...

@jeade

IMHO there are at least 3 perf-related things to be considered when running spark_apply() on non-trivial Spark dataframes:

  • Is sdf_input cached? If it is not, then some recomputations could slow down the entire process.
  • Have you considered using arrow to speed up (de)serializations between R and Spark? (Using Apache Arrow)
  • spark_apply() becomes necessary when you must apply some R function to a Spark dataframe. It is not designed to be performant, because when the R function is being applied, the following things will happen:
    • Each Spark worker node processing your input will need to launch a R process
    • Input data to the R function is fetched from Spark to R
    • The R function processes the input (which could also be slow -- see Performance ยท Advanced R.)
    • And finally the results are serialized from R back to Spark

and all of them add some non-negligible overhead to your computation.

Have you considered non-spark-apply alternatives such as using dplyr or tidyr verbs supported by sparklyr to perform the same transform on your Spark dataframe instead of applying an R function?

Also, there are some spark_apply() perf improvements in recent versions of sparklyr. Have you tried running your computation with sparklyr 1.7?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.