I am just exploring the Spark functionality within the tidyverse.
One question I have: there is a new technology called "Spark Connect", which lets you run your connection to Spark in client/server mode.
Looking at the documentation, I see that Python is supported, but not R.
Hi, yes, a new sparklyr extension called pysparklyr does exactly that. It wraps the Python components in order to connect to and interact with Spark Connect. You will need a working installation of Python on your machine. But if you are connecting to an external cluster, you won't need Java (JVM)!
If you are testing Spark Connect on your laptop, pysparklyr has a function to start a new Spark session locally. You will need a JVM for that, though. Here's sample code for that:
install.packages("pysparklyr")
pysparklyr::install_pyspark("3.5") # Creates a Python environment with the necessary components
sparklyr::spark_install("3.5") # Installs Spark 3.5 locally so you can run Spark Connect
library(sparklyr)
pysparklyr::spark_connect_service_start("3.5")
sc <- spark_connect("sc://localhost/", method = "spark_connect")
# Interact with Spark Connect
spark_disconnect(sc)
pysparklyr::spark_connect_service_stop()
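As a rough idea of what could go in place of the `# Interact with Spark Connect` comment, here is a minimal sketch using the usual dplyr verbs. It assumes `sc` is the open connection from the code above and that it runs before `spark_disconnect(sc)`. The table name `mtcars_tbl` and the column name `avg_mpg` are just made up for illustration:

```r
library(sparklyr)
library(dplyr)

# Assumes `sc` is the open Spark Connect connection created earlier
# Copy a small local data frame into the Spark session
tbl_mtcars <- copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# The aggregation is translated and executed by Spark;
# collect() brings the summarised result back into R
tbl_mtcars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()
```

Not every sparklyr function is necessarily available over Spark Connect, so it is worth trying the specific operations you rely on.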