I'm trying to implement a simple query in Spark using "gapply", but I'm running into trouble.
This code works well:
library(dplyr)
df <- createDataFrame(iris)
createOrReplaceTempView(df, "iris")
display(SparkR::sql("SELECT *, COUNT(*) OVER(PARTITION BY Species) AS RowCount FROM iris"))
But I can't reproduce it via gapply:
display(df %>% SparkR::group_by(df$Species)
%>% gapply(function(key, x) { y <- data.frame(x, SparkR::count()) },
"Sepal_Length double, Sepal_Width double, Petal_Length double, Petal_Width double, Species string, RowCount integer"))
It returns the error:
SparkException: R unexpectedly exited.
Caused by: EOFException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 235.0 failed 4 times, most recent failure: Lost task 0.3 in stage 235.0 (TID 374) (10.150.202.5 executor 1): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function ‘count’ for signature ‘"missing"’
Calls: compute ... computeFunc -> data.frame -> -> Execution halted
Is it possible to implement the window function "count" with gapply, using pipes from dplyr?
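From the error, my guess is that SparkR::count() is being dispatched without an argument on the R worker. Since the function passed to gapply receives each group as a plain R data.frame, I would expect something like the following sketch, which swaps SparkR::count() for base R's nrow(x), to produce the RowCount column (untested assumption on my part; display is the Databricks helper from the working example above):

```r
library(SparkR)
library(dplyr)  # provides %>%

df <- createDataFrame(iris)

schema <- "Sepal_Length double, Sepal_Width double, Petal_Length double,
           Petal_Width double, Species string, RowCount integer"

# Inside gapply, x is an ordinary R data.frame holding one group,
# so nrow(x) is the per-group row count.
result <- df %>%
  SparkR::group_by(df$Species) %>%
  gapply(function(key, x) {
    data.frame(x, RowCount = nrow(x))
  }, schema)

display(result)
```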