I would like to know if there is any news about the implementation of some SparkR features (such as parallel execution across Spark nodes) in sparklyr. I have used neither of them, but I am willing to test sparklyr because I like the philosophy behind it more. However, I've been told that SparkR is more mature in terms of features. Maybe this has changed (I see sparklyr being actively developed). Could you please enlighten me? Even knowing which package is favoured among users here could be enough to motivate me to pick one or the other.
There are many features available in both SparkR and sparklyr, including:
- Support for MLlib models.
- Support for parallel execution.
- Support for structured streaming.
- Support for creating, transforming and collecting Spark data frames (see the sketch after this list).
- Support for executing arbitrary Scala code.
- Support for YARN client mode, Mesos, Spark Standalone, local mode, Kubernetes, etc.
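For concreteness, here is a minimal sketch (my own, not from the original post) of the create/transform/collect workflow on the sparklyr side. It assumes a local Spark installation and uses the mtcars dataset that ships with R:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is installed,
# e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Create a Spark data frame from a local R data frame.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Transform lazily with dplyr verbs, then collect the result back to R.
avg_by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```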
Other features are, as of this writing, only available in sparklyr:
- Support for major R packages: dplyr, DBI, broom, etc.
- Support for ML Pipelines (see the pipeline sketch after this list).
- Support for Graph processing.
- Support for Apache Livy and YARN cluster mode.
- Support for exporting models to Java using MLeap.
- Support for connections and jobs in RStudio.
- Support for custom Scala extensions.
- Extensions that enable support for H2O, nested data, SAS data, etc.
- Installable from CRAN and certified with Cloudera.
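As an illustration of the ML Pipelines item above, here is a hedged sketch (my own, not from the post) that builds and fits a pipeline. It assumes the open connection `sc` and the `mtcars_tbl` Spark data frame from the previous sketch:

```r
library(sparklyr)

# Build a formal ML Pipeline: feature assembly via an R formula stage,
# followed by a linear regression stage.
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(mpg ~ wt + cyl) %>%
  ml_linear_regression()

# Fit the pipeline on the Spark data frame; the fitted PipelineModel
# can be persisted with ml_save() and reloaded later.
model <- ml_fit(pipeline, mtcars_tbl)
```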
I'm not aware of any features available in SparkR but missing in sparklyr. However, if there are improvements you would like considered, feel free to open feature requests under https://github.com/rstudio/sparklyr/issues. In addition, you can find detailed documentation about sparklyr features and extensions under http://spark.rstudio.com.
Hi Javier,
thanks for your answer. I cannot personally testify to the existence of features available in SparkR but not in sparklyr, as I have used neither of them. I was only told that SparkR was "more mature in terms of features", and since Kevin Ushey mentioned in this post that SparkR had better parallel execution of R code across Spark nodes than sparklyr, I assumed that was true, at least at some point in time and for some features. From what you say, this is clearly no longer the case, so I'll start using sparklyr. Thanks!
Thanks Andrea, that's an old comment in the post. I added a follow-up comment to that post; in short, in early 2017 sparklyr was lacking support for custom parallel execution, which was added in late 2017. Since then, we've kept improving diagnostics and performance for parallel execution, most recently by adding support for arrow, which makes custom parallel execution in sparklyr orders of magnitude faster; see details under https://github.com/rstudio/sparklyr/pull/1611.
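To make the arrow point concrete, here is a minimal sketch of custom parallel execution with spark_apply(). It is my own illustration, not from the thread, and it assumes a local connection and that the arrow package is installed (when loaded, sparklyr can use Arrow to serialize data to and from the R workers, which is where the speedup comes from):

```r
library(sparklyr)
library(arrow)   # optional; enables the faster Arrow serialization path
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# spark_apply() runs the R function on each partition of the
# Spark data frame, in parallel across the cluster.
with_kpl <- spark_apply(mtcars_tbl, function(df) {
  df$kpl <- df$mpg * 0.425  # miles per gallon -> km per litre
  df
})

collect(with_kpl)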
sparklyr is more popular than SparkR because you only need to learn dplyr grammar instead of Scala-style grammar.
However, when it comes to the spark_apply function, the better option is neither sparklyr nor SparkR: spark-shell or pyspark is more robust for your program and offers better error tracing.
I was trying to apply an xgboost model through spark_apply, and the user experience of the pyspark version was smoother than the sparklyr or SparkR one.
Reference:
Interesting! One more reason to use sparklyr. Do you know of any R Markdown report/GitHub repository containing a data science analysis performed with sparklyr? Preferably on open data (otherwise I won't be able to reproduce it).
> However, when it comes to the spark_apply function, the better option is neither sparklyr nor SparkR: spark-shell or pyspark is more robust for your program and offers better error tracing.
I'd like to show my Python-purist coworkers that R can be successfully deployed in production for actual Big Data (TB-scale) projects, so it's probably better not to use a Python module for that.
Thanks! But I can't read Chinese personally; I'm already convinced that sparklyr > SparkR. However, if you translate your blog post into English, I'll be happy to tweet about it. It sounds like an interesting resource.
Does sparklyr enable me to run native R packages like modelr across all the nodes of a Spark cluster? Or do I invoke MLlib instead?
I recommend following Shirin Glander. Here is one of her blog posts as an example of using sparklyr and H2O in Spark.
@Ed_Purcell Definitely right. sparklyr::spark_apply can help you run any R code across all the nodes of a Spark cluster. However, when it comes to big data, I recommend using the sparklyr ML pipeline to manipulate data; it has optimized a lot of feature-engineering work and covers 95% of common jobs.
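As a hedged sketch of what that looks like (my own example, not from the thread), spark_apply() can run plain R modelling code on the workers, one group per task. The same pattern applies to modelr or any other package, provided it is installed on (or distributed to) the worker nodes:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Fit one lm() per cylinder group on the workers; sparklyr prepends
# the group_by column(s) to whatever the function returns.
fits <- spark_apply(
  mtcars_tbl,
  function(df) {
    fit <- lm(mpg ~ wt, data = df)
    data.frame(intercept = coef(fit)[[1]], slope = coef(fit)[[2]])
  },
  group_by = "cyl"
)

collect(fits)
```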