Sparklyr Pipelines & Join

Hi,

I'm starting to work with ML pipelines in Spark, and I'm not sure how complicated they can be or what is possible to do with them.

Let's say I have a dataset I modified with ft_dplyr_transformer, and I want to 'reuse' that dataset to compute summary statistics and then join those back to the first dataset.

As an example, outside of a pipeline, I would do something like:

data_1 <- bob %>%
  mutate(
    ...
  )

data_2 <- data_1 %>%
  group_by(...) %>%
  summarise(...)

final_data <- data_1 %>%
  inner_join(data_2)

Is it possible to do this inside a Spark ML pipeline?

Thank you for your help

Cheers,
Richard


Yeah, you can do something like this:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris)

# ft_dplyr_transformer() captures the dplyr logic as a SQL
# transformer stage, so it is re-applied to whatever table
# the fitted pipeline later receives.
my_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(
    iris_tbl %>%
      group_by(Species) %>%
      summarize(avg_petal_length = mean(Petal_Length))
  ) %>%
  ml_fit(iris_tbl)

new_data <- data.frame(
  Species = c("setosa", "setosa", "versicolor", "versicolor"),
  Petal_Length = c(1, 2, 3, 4)
)
new_data_tbl <- sdf_copy_to(sc, new_data)

# Applies the captured aggregation to the new data:
# one row per species with its average Petal_Length.
my_pipeline %>%
  ml_transform(new_data_tbl)

This will work as long as the operations inside ft_dplyr_transformer() are single-table verbs.
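If you do need a join inside a pipeline stage, one possible workaround (a sketch, not an officially documented pattern) is ft_sql_transformer() with a handwritten SQL statement: the __THIS__ placeholder stands for whatever table the stage receives, and the statement can also reference other tables registered in the Spark session. The species_stats view name below is made up for illustration, and the approach assumes that view exists under that name whenever the pipeline is fit or applied.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris)

# Compute the summary once and register it as a temp view
# ("species_stats" is a hypothetical name).
iris_tbl %>%
  group_by(Species) %>%
  summarize(avg_petal_length = mean(Petal_Length)) %>%
  sdf_register("species_stats")

# __THIS__ is replaced by the stage's input table at run time,
# so the join is evaluated against whatever data flows in.
join_pipeline <- ml_pipeline(sc) %>%
  ft_sql_transformer(
    "SELECT t.*, s.avg_petal_length
       FROM __THIS__ t
       INNER JOIN species_stats s
         ON t.Species = s.Species"
  )

join_pipeline %>%
  ml_fit_and_transform(iris_tbl)

The trade-off is that the pipeline now depends on session state (the species_stats view) rather than being fully self-contained.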


Hi,

Reading it again, I'm sorry: my question wasn't formulated properly.

What I want to do is an inner join with a previously computed intermediate result, to avoid computing it twice.

For example, here's my pipeline, a bit more complicated:

dplyr_pipeline <- data_1 %>%
  inner_join(
    data_1 %>%
      group_by(bob) %>%
      summarise(mean_whatever = mean(whatever))
  ) %>%
  inner_join(data_2)  # new dataset

ml_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(dplyr_pipeline) %>%
  ml_random_forest_regressor()

Does this make sense? (I hope so! 😅)

Thank you
Richard
