the_dub
December 11, 2018, 3:53pm
1
Hi,
I'm starting to work with ML pipelines in Spark, and I'm not sure how complex they can get / what is possible to do with them.
Let's say I have a dataset I modified with ft_dplyr_transformer, and I want to 'reuse' that dataset to compute summary statistics and join them back onto the first dataset.
As an example, outside of a pipeline, I would do something like:
data_1 <- bob %>%
  mutate(
    ...
  )

data_2 <- data_1 %>%
  group_by(...) %>%
  summarise(...)

final_data <- data_1 %>%
  inner_join(data_2)
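To make that pattern concrete, here's a toy version on a local data frame (mtcars and these column names are just placeholders, nothing to do with my real data):

library(dplyr)

# derive a new column, summarise by group, then join the summary back on
data_1 <- mtcars %>%
  mutate(wt_kg = wt * 453.6)  # wt is in 1000 lbs, so this is a rough kg conversion

data_2 <- data_1 %>%
  group_by(cyl) %>%
  summarise(avg_wt_kg = mean(wt_kg))

final_data <- data_1 %>%
  inner_join(data_2, by = "cyl")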
Is it possible to do this inside a Spark ML pipeline?
Thank you for your help
Cheers,
Richard
Yeah you can do something like this:
library(sparklyr)
library(dplyr)

# assumes an existing Spark connection, e.g. sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris)

# the dplyr code becomes a SQLTransformer stage in the pipeline
my_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(
    iris_tbl %>%
      group_by(Species) %>%
      summarize(avg_petal_length = mean(Petal_Length))
  ) %>%
  ml_fit(iris_tbl)

new_data <- data.frame(
  Species = c("setosa", "setosa", "versicolor", "versicolor"),
  Petal_Length = c(1, 2, 3, 4)
)
new_data_tbl <- sdf_copy_to(sc, new_data)

# apply the fitted pipeline to the new data
my_pipeline %>%
  ml_transform(new_data_tbl)
This will work as long as the operations inside ft_dplyr_transformer() are single-table verbs.
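If you want to see what the stage actually does, you can inspect it: ft_dplyr_transformer() translates the dplyr code into a Spark SQLTransformer. Something like this (a quick sketch, reusing the my_pipeline object fitted above) should print the generated SQL, with __THIS__ standing in for the input table:

# inspect the first (and only) stage of the fitted pipeline model
ml_stage(my_pipeline, 1)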
the_dub
December 20, 2018, 11:02am
3
Hi,
Reading it again, I'm sorry: my question wasn't formulated properly.
What I want to do is an inner join with a dataset computed earlier in the pipeline, to avoid computing it twice.
For example, here's my pipeline, which is a bit more complicated:
dplyr_pipeline <- data_1 %>%
  inner_join(
    # summary of data_1, reused here so it isn't computed twice
    data_1 %>%
      group_by(bob) %>%
      summarise(avg_whatever = mean(whatever))
  ) %>%
  inner_join(data_2)  # new data set

rf_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(dplyr_pipeline) %>%
  ml_random_forest_regressor()
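In other words, outside of a pipeline I would materialise the summary once and reuse it, roughly like this (just a sketch; compute() is only there so the aggregation isn't recomputed at every use):

# compute the aggregation once, store it as a temporary Spark table,
# then join it back to the original data and to the new dataset
summary_tbl <- data_1 %>%
  group_by(bob) %>%
  summarise(avg_whatever = mean(whatever)) %>%
  compute()

final_data <- data_1 %>%
  inner_join(summary_tbl) %>%
  inner_join(data_2)

But I'd like all of that to live inside the ML pipeline itself.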
Does this make sense? (I hope!)
Thank you
Richard
system
Closed
January 10, 2019, 11:05am
4
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.