Sparklyr with Zeppelin?

ijlyttle · February 16, 2018, 2:59pm

Hi All,

I am working with a colleague who has access to data on a Spark cluster, but access to that cluster is restricted to Zeppelin notebooks (https://zeppelin.apache.org/).

In the past, R was one of the Zeppelin back-ends that was made available, but it was removed because of performance issues. This may have been before the advent of sparklyr.

I was wondering if anyone has had success using R + sparklyr in a Zeppelin environment, and if so, could you point me towards any how-tos that you may have come across?

Thanks!

edgararuiz · February 16, 2018, 3:20pm

Hi Ian! I played with Zeppelin briefly some time back, but I didn't test sparklyr with it. I do think that a simple install.packages("sparklyr") should work to get started. Are you looking for guidance beyond that?

ijlyttle · February 16, 2018, 3:47pm

Hi Edgar!

I am happy that you have been down this path (at least a little bit). I have spoken to the end-users (people who are using Zeppelin), but not to the administrators of the Spark cluster.

From what I had heard, part of the objection was that R was installed and running on all of the nodes of the Spark cluster.

Being a third party to all of this, I have to ask forgiveness from all concerned (including you) for asking very basic questions.

Would sparklyr work if R/sparklyr is made available only as a part of the Zeppelin container, rather than installing R on all the nodes? I suspect that we would be restricted to doing things that sparklyr can translate to native Spark.

If this might possibly work, I think my next step would be to work with my end-user colleagues.

Thanks again!

edgararuiz · February 16, 2018, 4:26pm

Yes, you're correct. Unless you're using spark_apply(), there is no need to have R installed on all of the nodes.
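To make the distinction concrete, here is a minimal sketch rather than a tested Zeppelin recipe. It assumes a local Spark installation (for example, one set up with spark_install()) purely for illustration; the table name mtcars_spark is made up for the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark install, used here only for illustration.
# On a real cluster the master argument would be different.
sc <- spark_connect(master = "local")

# Copy a small built-in data set to Spark (illustrative data only).
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark itself,
# so R is needed only where this script runs (the driver side).
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# spark_apply() ships an R function to the workers, so this call
# requires an R runtime on every worker node.
cars_tbl %>%
  spark_apply(function(df) df[df$mpg > 25, ])

spark_disconnect(sc)
```

If R lives only in the Zeppelin container, the first pipeline is the kind of work that should still be possible, while the second is the kind that would not.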

harryzhu · February 17, 2018, 12:57am

I use sparklyr heavily. I actually prefer the RStudio IDE to Zeppelin, because I am hooked on code autocompletion and its better terminal integration.

sparklyr only runs as a client, much like a MySQL client. Once you have the Spark configuration in place (hdfs-site.xml, hive-site.xml, yarn-site.xml; ask your IT staff for these), you can use yarn-client mode to explore Spark very easily.

Using sparklyr in Zeppelin is just like using DBI in Zeppelin. If you are looking for a more lightweight approach, I recommend persuading your IT staff to launch a Livy service for you. Once you are using sparklyr, you can forget the tedious spark-submit command and have fun with dplyr.

However, most IT staff know only SparkR, not sparklyr, and do not appreciate the convenience and importance of Livy and sparklyr.
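The two connection routes mentioned above might look roughly like the sketch below. Every path, the host name, and the use_livy switch are placeholders I made up for illustration; only spark_connect() and its master and method arguments come from sparklyr. Newer Spark and sparklyr versions may expect master = "yarn" instead of "yarn-client", so check against what your cluster runs.

```r
library(sparklyr)

# Hypothetical switch, only to keep the two alternatives apart.
use_livy <- FALSE

if (!use_livy) {
  # Route 1: yarn-client mode.
  # Assumption: SPARK_HOME points at a Spark install, and HADOOP_CONF_DIR
  # at the directory holding the *-site.xml files obtained from IT.
  # Both paths are placeholders.
  Sys.setenv(SPARK_HOME = "/path/to/spark")
  Sys.setenv(HADOOP_CONF_DIR = "/path/to/hadoop/conf")
  sc <- spark_connect(master = "yarn-client")
} else {
  # Route 2: a Livy service.
  # The host is a placeholder; 8998 is Livy's default port.
  sc <- spark_connect(master = "http://livy.example.com:8998",
                      method = "livy")
}

# Either way, the connection is then used with dplyr as usual,
# and closed when finished.
spark_disconnect(sc)
```

With the Livy route, nothing Spark-specific needs to be installed next to R beyond the sparklyr package itself, which is what makes it the lighter option.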

ijlyttle · February 17, 2018, 4:28pm

Thanks @harryzhu!

To persuade IT staff to change what they make available is - as you know - a task that requires a large and unknown amount of effort.

It is useful to have a direction in mind, so I am grateful to you for suggesting a direction.