sparklyr::spark_apply(x, nrow, group_by) is taking 33 minutes to finish in Azure Databricks on DS14v2 instances with 2 (min) to 8 (max) workers, for an input data frame with around 100k rows and 50 columns (string & double).
I don't have a reproducible example because my data is proprietary, but my code looks roughly like the sketch below (grouping by ID results in 34 distinct groups of data).
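Roughly (variable and column names here are simplified stand-ins for the real, proprietary ones):

```r
library(sparklyr)

# x is the ~100k row x 50 column Spark dataframe (strings & doubles);
# "ID" is the grouping column (34 distinct groups)
result <- spark_apply(x, nrow, group_by = "ID")
```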
I understand there is overhead for serialization/deserialization between the driver and worker nodes, but 33 minutes to do an nrow() on 100k rows x 50 columns seems excessive...
IMHO there are at least three performance-related things to consider when running spark_apply() on non-trivial Spark dataframes:
Is sdf_input (your input Spark dataframe) cached? If it is not, recomputation of its lineage could slow down the entire process.
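For example, a minimal sketch (sdf_input standing in for your input Spark dataframe, sc for your Spark connection) of caching it before the spark_apply() call:

```r
library(sparklyr)

# persist the input in memory (spilling to disk if needed) so that
# spark_apply() does not re-run the lineage that produced it
sdf_input <- sdf_persist(sdf_input, storage.level = "MEMORY_AND_DISK")

# alternatively, if the data was registered as a Spark table:
# tbl_cache(sc, "sdf_input_table")
```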
Have you considered using Arrow to speed up (de)serialization between R and Spark? (See "Using Apache Arrow" in the sparklyr documentation.)
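Enabling it is typically just a matter of attaching the arrow package before calling spark_apply(); sparklyr then uses Arrow for the R/Spark data transfer. A sketch (assumes the arrow R package is installed, and reuses the hypothetical x / "ID" names from the question):

```r
# install.packages("arrow")  # once
library(arrow)      # when attached, sparklyr serializes via Arrow
library(sparklyr)

result <- spark_apply(x, nrow, group_by = "ID")
```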
spark_apply() becomes necessary when you must apply an arbitrary R function to a Spark dataframe. It is not designed to be performant, because each time the R function is applied the following things happen:
Each Spark worker node processing your input will need to launch an R process
Input data to the R function is fetched from Spark to R
And finally the results are serialized from R back to Spark
and each of these steps adds non-negligible overhead to your computation.
Have you considered non-spark_apply alternatives, such as dplyr or tidyr verbs supported by sparklyr, to perform the same transformation on your Spark dataframe instead of applying an R function?
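For instance, a per-group row count (which is what spark_apply(x, nrow, group_by) computes here) can be expressed with dplyr verbs that sparklyr translates to Spark SQL, avoiding the R round trip entirely. A sketch, assuming the grouping column is named ID:

```r
library(dplyr)

# translated to Spark SQL and executed entirely inside Spark;
# no R worker processes or R <-> Spark serialization involved
row_counts <- x %>%
  group_by(ID) %>%
  summarise(n = n()) %>%
  collect()
```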
Also, there are some spark_apply() performance improvements in recent versions of sparklyr. Have you tried running your computation with sparklyr 1.7?