sparklyr::spark_apply(x, nrow, group_by) is taking 33 minutes to finish in Azure Databricks on DS14v2 instances with 2 (min) to 8 (max) workers, for an input data frame with around 100k rows and 50 columns (string & double).
I don't have a reproducible example because my data is proprietary, but my code looks roughly like the sketch below (grouping by ID results in 34 distinct groups of data).
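Roughly (variable and column names here are simplified stand-ins for the real, proprietary ones):

```r
library(sparklyr)

# x is the ~100k row x 50 column Spark dataframe (strings & doubles);
# "ID" is the grouping column (34 distinct groups)
result <- spark_apply(x, nrow, group_by = "ID")
```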
I understand there is overhead for serialization/deserialization between the driver and worker nodes, but 33 minutes to do an nrow() on 100k rows x 50 columns seems excessive...
IMHO there are at least three performance-related things to consider when running spark_apply() on non-trivial Spark dataframes:
Is sdf_input (your input Spark dataframe) cached? If it is not, recomputation of its lineage could slow down the entire process.
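For example, a minimal sketch (sdf_input standing in for your input Spark dataframe, sc for your Spark connection) of caching it before the spark_apply() call:

```r
library(sparklyr)

# persist the input in memory (spilling to disk if needed) so that
# spark_apply() does not re-run the lineage that produced it
sdf_input <- sdf_persist(sdf_input, storage.level = "MEMORY_AND_DISK")

# alternatively, if the data was registered as a Spark table:
# tbl_cache(sc, "sdf_input_table")
```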
Have you considered using Arrow to speed up (de)serialization between R and Spark? (See "Using Apache Arrow" in the sparklyr documentation.)
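Enabling it is typically just a matter of attaching the arrow package before calling spark_apply(); sparklyr then uses Arrow for the R/Spark data transfer. A sketch (assumes the arrow R package is installed, and reuses the hypothetical x / "ID" names from the question):

```r
# install.packages("arrow")  # once
library(arrow)      # when attached, sparklyr serializes via Arrow
library(sparklyr)

result <- spark_apply(x, nrow, group_by = "ID")
```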
spark_apply() becomes necessary when you must apply an arbitrary R function to a Spark dataframe. It is not designed to be performant, because each time the R function is applied the following things happen:
Each Spark worker node processing your input will need to launch an R process
Input data to the R function is fetched from Spark to R
And finally the results are serialized from R back to Spark
and each of these steps adds non-negligible overhead to your computation.
Have you considered non-spark_apply alternatives, such as dplyr or tidyr verbs supported by sparklyr, to perform the same transformation on your Spark dataframe instead of applying an R function?
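For instance, a per-group row count (which is what spark_apply(x, nrow, group_by) computes here) can be expressed with dplyr verbs that sparklyr translates to Spark SQL, avoiding the R round trip entirely. A sketch, assuming the grouping column is named ID:

```r
library(dplyr)

# translated to Spark SQL and executed entirely inside Spark;
# no R worker processes or R <-> Spark serialization involved
row_counts <- x %>%
  group_by(ID) %>%
  summarise(n = n()) %>%
  collect()
```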
Also, there are some spark_apply() performance improvements in recent versions of sparklyr. Have you tried running your computation with sparklyr 1.7?