Sparklyr: spark_apply error: f is not a function

Charlesvandamme · August 22, 2018, 11:45am

Hello,

I've been trying to use spark_apply but I am not sure how to give the parameters.

'analysis' is a spark data frame that is grouped by 'customer', why am I getting the 'f is not a function' error in the code below?

analysis1 <- analysis %>%
  filter(count() > 1 & k < count()) %>%
  spark_apply(analysis1, function(select_data) {
    
    if(select_data$type == 'offer'){
      mutate(analysis1, Won_offer = 1)
    }
    
    }, group_by = c("updated_actual_related_customer", "sku")) %>%
  compute("analysis1")

javierluraschi · August 22, 2018, 3:09pm

spark_apply() signature is as follows:

spark_apply <- function(x,
                        f,
                        columns = NULL,
                        memory = TRUE,
                        group_by = NULL,
                        packages = NULL,
                        context = NULL)

Meaning, you can only pass one data frame, if you need to pass two data frames, you have two options.

If the data is small enough, you can use the context parameter as follows:

analysis1 <- analysis %>%
  filter(count() > 1 & k < count()) %>%
  spark_apply(function(select_data, context) {
    
    if(select_data$type == 'offer'){
      mutate(context$analysis1, Won_offer = 1)
    }
    
    }, group_by = c("updated_actual_related_customer", "sku"), context = list(analysis1 = analysis1)) %>%
  compute("analysis1")

If the analysis1 data is too large, you would have to join first the two data frames by row with dplyr and then operate over the joined data frame inside spark_apply().

Charlesvandamme · August 22, 2018, 3:16pm

There is only one dataframe, the original spark dataframe is called 'country_new1' analysis and analysis1 are just deriviates from country_new1

javierluraschi · August 22, 2018, 3:18pm

However, when you group_by, you would want to make sure all the data in the country_new1 partition contains analysis1. The suggestion in the other post of grouping by country is probably easier to implement since you already know this code works for a single country and would not require as much refactoring, etc.

Charlesvandamme · August 22, 2018, 3:47pm

This code works for 1 country but takes up to An hour to complete running, I want to get it to a few minutes through spark.

I'll try your suggestion though and group it by country. But for me not the easiest spark_apply function has worked.

Could you write an example? Knowing that country_new1 is my spark dataframe and you want to for example multiply the qty_disc column by 10.

javierluraschi · August 22, 2018, 4:26pm

Please refer to http://spark.rstudio.com/guides/distributed-r/ for examples on how to use spark_apply(). We are not staffed to help transform specific R code to work in Spark, but if you have additional generic spark_apply() questions after going through the examples, we should be able to help with those.