Filter an arrow table based on a list column

How can I filter an arrow table based on whether a list column contains certain values?

For example, I'd like to filter the dd arrow table to keep all rows where y contains either 2 or 4.

library(tidyverse)
library(arrow)

# data
dd <- tibble(
  x = 1:3,
  y = c(list(1:3), list(2:5), list(c(1L, 5L)))
) %>%
  arrow_table()

I'd like this filtering to occur before collect() is run. The desired output from the above example is the first two rows (since y contains 2 or 4 in those two rows).

Hi @dchilders. I believe the following gets to your desired output.

library(tidyverse)
library(arrow)

# data
dd <- tibble(
  x = 1:3,
  y = c(list(1:3), list(2:5), list(c(1L, 5L)))
) %>%
  # Note: the filter below runs on the R tibble; the result is converted to Arrow afterwards
  rowwise() %>%
  filter(any(c(2L, 4L) %in% unlist(y))) %>%
  ungroup() %>%
  arrow_table()

collect(dd)
#> # A tibble: 2 × 2
#>       x               y
#>   <int> <list<integer>>
#> 1     1             [3]
#> 2     2             [4]

Created on 2023-08-23 with reprex v2.0.2.9000

Your example filters on an R tibble. My goal is to filter on the arrow table.

My apologies. I was thinking you could convert it to a tibble, filter, and then convert back to an arrow table.
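
For what it's worth, here is a rough sketch of one way the filtering might be done on the arrow table itself, without converting y back to an R list first. It is not from the thread and not a confirmed recommended approach. It assumes the Arrow compute kernels "list_flatten" and "list_parent_indices" are available through call_function() in your arrow version (list_compute_functions() should show them), and it runs eagerly on the in-memory Table rather than as part of a lazy dplyr query. The names flat, parents, hit, keep and dd_filtered are just illustrative.

```r
library(dplyr)
library(arrow)

# Same data as in the question
dd <- tibble(
  x = 1:3,
  y = c(list(1:3), list(2:5), list(c(1L, 5L)))
) %>%
  arrow_table()

# Flatten the list column into one long Arrow array of values
flat <- call_function("list_flatten", dd$y)

# For each flattened value, the 0-based index of the row it came from
parents <- call_function("list_parent_indices", dd$y)

# TRUE where a flattened value is 2 or 4
hit <- is_in(flat, c(2L, 4L))

# Row indices with at least one hit, each kept once
keep <- unique(parents$Filter(hit))

# Select those rows; the result is still an Arrow Table
dd_filtered <- dd$Take(keep)

collect(dd_filtered)
```

On the example data this should keep the rows where x is 1 and 2. Rows whose list is empty or NULL contribute no flattened values, so they are dropped.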
