Dataframe filter gives incorrect results with any + %in%

kenglish · November 7, 2023, 3:23am

Using the construct any(X %in% Y) as a filter for a data frame containing vector elements does not give the correct results.

Reproducible example:

> df <- data.frame(K=c(1,2,3), L=rep(NA, 3))
> df$L<-list(c("a","b","c"), c("b","c","d"), c("c","d","e"))
> df
  K       L
1 1 a, b, c
2 2 b, c, d
3 3 c, d, e
> 
> M <- c("a", "e")
> 
> M %in% c("a","b","c")
[1]  TRUE FALSE
> any(M %in% c("a","b","c"))
[1] TRUE
> M %in% c("b","c","d")
[1] FALSE FALSE
> any(M %in% c("b","c","d"))
[1] FALSE
> M %in% c("c","d","e")
[1] FALSE  TRUE
> any(M %in% c("c","d","e"))
[1] TRUE
> 
> filtered2 <- df[any(M %in% df$L),]
> filtered2
[1] K L
<0 rows> (or 0-length row.names)
> nrow(filtered2)
[1] 0
> 
> # Dplyr gives the same results, as does mutate
> 
> library(dplyr)
> 
> filtered <- df %>% filter(any(M %in% L))
> nrow(filtered)
[1] 0
> 
> df %>% mutate(X=any(M %in% L))
  K       L     X
1 1 a, b, c FALSE
2 2 b, c, d FALSE
3 3 c, d, e FALSE

As demonstrated in the direct value testing, the first and third rows should have been returned from the filter and the values of X for mutate should have been TRUE, FALSE, TRUE. Not FALSE, FALSE, FALSE.

kenglish · November 7, 2023, 3:57am

I found a workaround:

I found a work around:

```R
> FF_closure <- function(V) {
+   if (length(V[[1]]) > 1) {
+     RR = vector()
+     for (vv in V) {
+       RR[[length(RR)+1]] = FF_closure(vv)
+     }
+     return(RR)
+   }
+   for (m in M) {
+     for (v in V) {
+       if (v == m) {
+         return(TRUE)
+       }
+     }
+   }
+   return(FALSE)
+ }
> 
> filtered3 <- df[FF_closure(df$L),]
> filtered3
  K       L
1 1 a, b, c
3 3 c, d, e
> nrow(filtered3)
[1] 2

jrkrideau · November 7, 2023, 5:21am

I am not sure what you are doing here but if we look at it like this

df <- data.frame(K=c(1,2,3), L=rep(NA, 3))
df$L<-list(c("a","b","c"), c("b","c","d"), c("c","d","e"))

M <- c("a", "e")

M %in% df$L[[1]]

we get what we expect for the first element.

technocrat · November 7, 2023, 8:23am

More simply

d <- data.frame(K=c(1,2,3), L=rep(NA, 3))
d$L <- list(c("a","b","c"), c("b","c","d"), c("c","d","e"))
M <- c("a", "e")

(result <- sapply(d$L, function(x) M %in% x))
#>       [,1]  [,2]  [,3]
#> [1,]  TRUE FALSE FALSE
#> [2,] FALSE FALSE  TRUE

^{Created on 2023-11-07 with reprex v2.0.2}

kenglish · November 21, 2023, 2:59pm

Right, but the point is that this DOES NOT WORK when trying to use any ... %in% in a filtering context.

kenglish · November 21, 2023, 3:01pm

You didn't use "any" in your filter function.

technocrat · November 21, 2023, 7:09pm

Neither does the workaround. Because of the order of evaluation any() takes as its argument not the vector but the return of the %in% operator, which is typeof logical.

kenglish · November 25, 2023, 1:20am

see below ____________________

kenglish · November 25, 2023, 1:31am

I think this is a fundamental issue with R unable to tell the difference between a collection and a scalar. The interpretation of filter should be "for each record do " . For each record I want to know if ANY of a given fixed set are found in a multi-valued array in the row.

My final workaround looks like:

Cannot test vectors for NA directly since cannot test vectors of length > 1

vector_is_na <- function(vector_element) {
if (test(length(vector_element) == 1)) {
if (is.na(vector_element)) {
return(TRUE)
}
}
return(FALSE)
}

TRUE if a vector_element from a dataframe (e.g. when a dataframe element itself is a vector)

contains at least one of the items in test_items

simple_any_in <- function(test_items, vector_element) {
if (!vector_is_na(vector_element)) {
for (ve in vector_element) {
for (it in test_items) {
if (test(ve == ti)) {
return(TRUE)
}
}
}
}
return(FALSE)
}

filter_dataframe_vector_by_any_in <- function(df_to_filter, df_col, test_items) {
match_closure <- function(column_vect) {
RR = rep(NA, nrow(df_to_filter))
i = 1
for (vector_str_element in column_vect) {
if (is.na(vector_str_element)) {
RR[[i]] = FALSE
} else {
vector_str_element = str_split_1(string=vector_str_element, pattern=", ")
RR[[i]] = simple_any_in(test_items, vector_element)
}
i = i + 1
}
# return a truthy-vector the length of nrow(df_to_filter) to filter the dataframe
return(RR)
}
truthy_vec = match_closure(df_col)
df_to_filter[truthy_vec, ]
}

AlexisW · November 28, 2023, 9:13pm

R does not have "collection" or "scalar" types, it has vectors (equivalent of arrays in other languages), which can be of length 1 (thus be equivalent to scalars), and it has lists.

The operator x %in% table is defined such that it takes as input a vector x (optionally of length 1), and a vector table, and looks if each element of x is in table. In your code, you want to look in a list, which is not an appropriate data type for table.

What you want is a different operator, which can be defined as:

`%inlist%` <- function(x, lst){
  x %in% unlist(lst)
}

then you can use it as expected:

lst <- list(c("a", "b"),
            c("d", "e"))

c("a","f") %inlist% lst
#> [1]  TRUE FALSE

You could argue that where R is wrong is that %in% should give an error if you use a list as second argument. I guess the reason it's not the case is that %in% is agnostic to the type, so you can use it on list elements:

list(c("a","b"), c("a","f")) %in% lst
#> TRUE FALSE

and that would stay true if you define a custom class derived from list(), you wouldn't have to define a custom %in% for that class.

AlexisW · November 28, 2023, 9:35pm

As for this filtering, the problem is that you want to index the rows. So the vector on the left of %in% has to have one entry per row of df. You want a function that does "for each element of df$L, is any of them in M"?

So you have to have a separate loop:

sapply(df$L, \(el) any(el %in% M))

applied to the data frame:

df <- data.frame(K=c(1,2,3), L=rep(NA, 3))
df$L<-list(c("a","b","c"), c("b","c","d"), c("c","d","e"))

M <- c("a", "e")

filtered2 <- df[sapply(df$L, \(el) any(el %in% M)),]

filtered2
#>   K       L
#> 1 1 a, b, c
#> 3 3 c, d, e

^{Created on 2023-11-28 with reprex v2.0.2}

AlexisW · November 28, 2023, 10:55pm

Oh and I was forgetting: this one has a problem for a different reason. Well, the basic reason is the same as above, when doing M %inlist% L you are asking backward, you are asking which element of M is in the data frame, whereas to filter you would need to ask with L on the left side (something like L %in% M).

So the result of M %in% L is of length 2, the same length as M. But then you add the any(), which collapses the vector of length 2 into a single value. And mutate() does something "clever": it recycles values. So the reason your column X has FALSE FALSE FALSE is not because you got three FALSE, it's because you ask if any element of M is in L, and get FALSE once (because L is a list, its elements are vectors), but then you recycle the FALSE to fill df$X.

This recycling "trick" is also something that is really useful in some situations, but can be dangerous as it looks like you correctly did three tests, even though you actually only did one.

kenglish · November 29, 2023, 5:09pm

Thanks for the helpful and informative responses.

system · January 10, 2024, 5:10pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.