negated str_detect doesn't match NAs

zoowalk · December 31, 2021, 9:48am

I just came across a behavior of str_detect which was new to me and I was wondering how others are dealing with such cases:

library(tidyverse)

df_test <- data.frame(
  stringsAsFactors = FALSE,
  names = c("tom", "max", "ella", "franz"),
  family = c("huber", "huber", "bauer", NA),
  age = c(10L, 4L, 7L, NA)
)

df_test %>% 
  filter(str_detect(family, "huber"))
#>   names family age
#> 1   tom  huber  10
#> 2   max  huber   4

The result below surprises me. Why doesn't the negated version of str_detect also return franz, whose family name is NA? My (apparently wrong) understanding was that NA != "huber" is TRUE, and hence the row with NA should be returned.

df_test %>% 
  filter(!str_detect(family, "huber"))
#>   names family age
#> 1  ella  bauer   7

df_test %>% 
  filter(str_detect(family, "huber", negate = TRUE))
#>   names family age
#> 1  ella  bauer   7

Since such cases can be quite common, does this mean that every negated str_detect should explicitly account for NAs (as below)?

df_test %>% 
  filter(str_detect(family, "huber", negate = TRUE) | is.na(family))

#>   names family age
#> 1  ella  bauer   7
#> 2 franz   <NA>  NA

I find this behavior surprising. Personally, I think I would prefer to have an option in str_detect to also match NAs, but I strongly assume that there's an explanation for it. Many thanks.

Created on 2021-12-31 by the reprex package (v2.0.1)

xvalda · December 31, 2021, 1:14pm

I can only venture a guess here. Since the core function of str_detect() is to match a pattern to a string, matching a non-existing string seems a moot point, so it seems normal (to me) that str_detect() returns NA for missing strings and that filter() then silently drops those rows.

I get your point about the negate argument, but I’m not sure to what extent it would make sense to have arguments that apply only to a particular case of the function, as affected by another argument.

So I guess that in any case you’ll have to use your last line of code.

Or to make your code more explicit, use an anti-join:

df_test %>% 
  filter(str_detect(family, "huber")) %>% 
  anti_join(df_test, ., by = "family")

#>   names family age
#> 1  ella  bauer   7
#> 2 franz   <NA>  NA

Another way could be to replace missing values with an empty string (or a single space " ") that we could consider implicit NAs:

df_test %>% 
  mutate(family = ifelse(is.na(family), "", family)) %>% 
  filter(str_detect(family, "huber", negate = TRUE))

#>   names family age
#> 1  ella  bauer   7
#> 2 franz         NA

Just a note that in your example, str_detect() is not necessary since you’re matching the entire family name:

df_test %>% 
  mutate(family = ifelse(is.na(family), "", family)) %>% 
  filter(family != "huber")

#>   names family age
#> 1  ella  bauer   7
#> 2 franz         NA

startz · December 31, 2021, 4:07pm

Note that

> NA == NA
[1] NA

So NA != "huber" is NA, neither TRUE nor FALSE.
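To make this concrete, here is a small sketch (assuming the stringr package is installed) that looks at the logical vector str_detect() produces for the family column before filter() ever sees it. The vector name `family` simply mirrors the column from the original post.

```r
library(stringr)

family <- c("huber", "huber", "bauer", NA)

# str_detect() propagates NA: a missing string yields a missing result
detected <- str_detect(family, "huber")
detected
#> [1]  TRUE  TRUE FALSE    NA

# Negating NA still gives NA, so the last element is never TRUE
!detected
#> [1] FALSE FALSE  TRUE    NA

# The same three-valued logic applies to plain comparisons
NA != "huber"
#> [1] NA
```

Since filter() keeps only rows where the condition is TRUE, a row whose condition evaluates to NA is dropped in both the plain and the negated version.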

nirgrahamuk · January 2, 2022, 12:32pm

There is a good conceptual basis for such behaviour. NA represents a lack of knowledge. You don't know the value, i.e. you don't know whether it is huber or not. You can't say it's huber and you can't say it's not huber. Therefore it's up to the programmer to use additional primitives like is.na() to detect NA cases and decide on an appropriate treatment, i.e. in our use case, should we assume that NA doesn't match the thing we are looking for?
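One way to make that decision explicit is a small wrapper. The sketch below is only an illustration: `str_detect_na()` and its `na_result` argument are hypothetical names, not part of stringr, and the code assumes a stringr version that supports the `negate` argument used earlier in this thread.

```r
library(stringr)

# Hypothetical helper (not part of stringr): behaves like str_detect(),
# but replaces NA results with a value chosen by the caller.
str_detect_na <- function(string, pattern, negate = FALSE, na_result = FALSE) {
  out <- str_detect(string, pattern, negate = negate)
  out[is.na(out)] <- na_result  # the caller decides how NA is treated
  out
}

family <- c("huber", "huber", "bauer", NA)

# Treat "unknown" as "not huber"
not_huber <- str_detect_na(family, "huber", negate = TRUE, na_result = TRUE)
not_huber
#> [1] FALSE FALSE  TRUE  TRUE
```

Used inside filter(), e.g. filter(str_detect_na(family, "huber", negate = TRUE, na_result = TRUE)), this should keep the franz row, equivalent to the `| is.na(family)` condition from the original post.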
