Hi, folks.

I have seen many solutions to the problem of exploding a comma-separated

string column (e.g., this nice SO answer).

However, how can I go from this data frame

`df <- data.frame(x = 1, y = "A,B,C,D")`

to this one:

`tb <- data.frame(x = c(1,1,1), y = c("A,B", "B,C", "C,D"))`

?

The goal is to split the string column by pairs of consecutive letters. It is easy to solve this problem with non-vectorized R code. The main problem for me is that the table is too big to fit in memory, so a solution with sparklyr would be great.

Thanks, in advance!

What do you mean in this context with unvectorized R code?

I would usually approach this via an index shift:

```
library(tidyverse)
solo <- stringr::str_split("A,B,C,D", pattern = ",") %>% pluck(1)
paste0(solo[-length(solo)], ",", solo[-1])
#> [1] "A,B" "B,C" "C,D"
```

Created on 2018-05-08 by the reprex package (v0.2.0).

1 Like

This is nice, thanks. However, I am having trouble when trying such a solution with sparklyr, e.g. using the `spark_apply`

function. I am thinking that it would be enough to map each `"A,B,C,D"`

entry into `list("A,B", "B,C", "C,D")`

, which can be exploded with `sparklyr.nested::sdf_explode`

.

For instance, consider this function (using @Tazinho's idea):

```
get_pairs <- function(sequence_of_letters) {
x <- strsplit(sequence_of_letters, ",")[[1]]
paste0(x[-length(x)], ",", x[-1])
}
```

Then, `tb <- mutate(df, z = sapply(y, get_pairs))`

works just fine if `df`

is a local data frame.

The column `z`

is a list, which can be "exploded" with `tidyr::unnest(tb, z)`

.

However, if `df`

is a `tbl_spark`

, this code (analogous to this one)

```
tb <- df %>%
spark_apply(function(d) {
library(dplyr)
get_pairs <- function(sequence_of_letters) {
x <- strsplit(sequence_of_letters, ",")[[1]]
paste0(x[-length(x)], ",", x[-1])
}
mutate(d, z = sapply(y, get_pairs))
})
```

raises warnings and errors. The natural step after that would be `sparklyr.nested::sdf_explode(tb, z)`

.

The error messages start like:

```
Warning messages:
1: In if (is.na(object)) { :
the condition has length > 1 and only the first element will be used
2: In if (is.na(object)) { :
the condition has length > 1 and only the first element will be used
ERROR sparklyr: Worker (X) failed to complete R process
```

The following does the trick. `df_tbl`

is the `tbl_spark`

version of `df`

.

```
library(sparklyr)
library(sparklyr.nested)
library(dplyr)
sc <- spark_connect(master = "local")
df <- data.frame(x = 1, y = "A,B,C,D")
df_tbl <- copy_to(sc, df)
df_split <- df_tbl %>%
mutate(y = split(y, ",")) %>%
sdf_explode(y)
tb1 <- df_split %>%
arrange(x) %>%
mutate(rn = row_number())
tb2 <- df_split %>%
arrange(x) %>%
mutate(rn = row_number() - 1L)
tb <- sdf_bind_rows(tb1, tb2) %>%
arrange(x, rn) %>%
group_by(x, rn) %>%
summarise(seq = collect_list(y)) %>%
filter(size(seq) == 2) %>%
sdf_explode(seq) %>%
group_by(x, rn) %>%
summarize(pairs = paste(collect_list(seq), sep = ",")) %>%
ungroup()
```

The functions not available in the sparklyr package (`split()`

, `collect_list()`

and `size()`

) are spark functions.

Big thanks to @samuelmacedo83 who showed me this solution!

3 Likes