Hi, folks.
I have seen many solutions to the problem of exploding a comma-separated
string column (e.g., this nice SO answer).
However, how can I go from this data frame
df <- data.frame(x = 1, y = "A,B,C,D")
to this one:
tb <- data.frame(x = c(1,1,1), y = c("A,B", "B,C", "C,D"))
?
The goal is to split the string column by pairs of consecutive letters. It is easy to solve this problem with non-vectorized R code. The main problem for me is that the table is too big to fit in memory, so a solution with sparklyr would be great.
Thanks, in advance!
What do you mean in this context with unvectorized R code?
I would usually approach this via an index shift:
library(tidyverse)
solo <- stringr::str_split("A,B,C,D", pattern = ",") %>% pluck(1)
paste0(solo[-length(solo)], ",", solo[-1])
#> [1] "A,B" "B,C" "C,D"
Created on 2018-05-08 by the reprex package (v0.2.0).
1 Like
This is nice, thanks. However, I am having trouble when trying such a solution with sparklyr, e.g. using the spark_apply
function. I am thinking that it would be enough to map each "A,B,C,D"
entry into list("A,B", "B,C", "C,D")
, which can be exploded with sparklyr.nested::sdf_explode
.
For instance, consider this function (using @Tazinho's idea):
get_pairs <- function(sequence_of_letters) {
x <- strsplit(sequence_of_letters, ",")[[1]]
paste0(x[-length(x)], ",", x[-1])
}
Then, tb <- mutate(df, z = sapply(y, get_pairs))
works just fine if df
is a local data frame.
The column z
is a list, which can be "exploded" with tidyr::unnest(tb, z)
.
However, if df
is a tbl_spark
, this code (analogous to this one)
tb <- df %>%
spark_apply(function(d) {
library(dplyr)
get_pairs <- function(sequence_of_letters) {
x <- strsplit(sequence_of_letters, ",")[[1]]
paste0(x[-length(x)], ",", x[-1])
}
mutate(d, z = sapply(y, get_pairs))
})
raises warnings and errors. The natural step after that would be sparklyr.nested::sdf_explode(tb, z)
.
The error messages start like:
Warning messages:
1: In if (is.na(object)) { :
the condition has length > 1 and only the first element will be used
2: In if (is.na(object)) { :
the condition has length > 1 and only the first element will be used
ERROR sparklyr: Worker (X) failed to complete R process
The following does the trick. df_tbl
is the tbl_spark
version of df
.
library(sparklyr)
library(sparklyr.nested)
library(dplyr)
sc <- spark_connect(master = "local")
df <- data.frame(x = 1, y = "A,B,C,D")
df_tbl <- copy_to(sc, df)
df_split <- df_tbl %>%
mutate(y = split(y, ",")) %>%
sdf_explode(y)
tb1 <- df_split %>%
arrange(x) %>%
mutate(rn = row_number())
tb2 <- df_split %>%
arrange(x) %>%
mutate(rn = row_number() - 1L)
tb <- sdf_bind_rows(tb1, tb2) %>%
arrange(x, rn) %>%
group_by(x, rn) %>%
summarise(seq = collect_list(y)) %>%
filter(size(seq) == 2) %>%
sdf_explode(seq) %>%
group_by(x, rn) %>%
summarize(pairs = paste(collect_list(seq), sep = ",")) %>%
ungroup()
The functions not available in the sparklyr package (split()
, collect_list()
and size()
) are spark functions.
Big thanks to @samuelmacedo83 who showed me this solution!
3 Likes