agrep doesn't work within mutate

I'm trying to use agrep to perform 'fuzzy' matching. It works on a single string, but when I apply it to a data frame column using mutate it no longer works. Instead it throws a warning I do not understand:

library(tidyverse)

df <- tibble::tribble(
        ~bad_name,  ~expected,
        "newyork", "New York",
        "alabama",  "Alabama"
        )

# Test
agrep("newyork", state.name, max.distance = 3, value = TRUE)
#> [1] "New York"

# Take the poorly formatted column and format according to state.name using agrep
df %>%
  mutate(good_name = agrep(bad_name, state.name, max.distance = 3, value = TRUE))
#> Warning in agrep(bad_name, state.name, max.distance = 3, value = TRUE): argument
#> 'pattern' has length > 1 and only the first element will be used
#> # A tibble: 2 x 3
#>   bad_name expected good_name
#>   <chr>    <chr>    <chr>    
#> 1 newyork  New York New York 
#> 2 alabama  Alabama  New York

As you can see, it throws a warning and returns New York for both rows.

I'm not sure what the warning means here... In both rows, the pattern (bad_name) is only 1 element. Perhaps it refers to the output? Some cases return two or more matches, e.g.:

agrep("alabama", state.name, max.distance = 3, value = TRUE)
#> [1] "Alabama"  "Oklahoma"

I can explicitly limit the results by indexing:

agrep("alabama", state.name, max.distance = 3, value = TRUE)[1]
#> [1] "Alabama"

and I can do the same within the mutate call, but I get the same warning and output as above:

df %>%
  mutate(good_name = agrep(bad_name, state.name, max.distance = 3, value = TRUE)[1])
#> Warning in agrep(bad_name, state.name, max.distance = 3, value = TRUE): argument
#> 'pattern' has length > 1 and only the first element will be used
#> # A tibble: 2 x 3
#>   bad_name expected good_name
#>   <chr>    <chr>    <chr>    
#> 1 newyork  New York New York 
#> 2 alabama  Alabama  New York

Currently, what you do in mutate to obtain good_name is equivalent to

df <- tibble::tribble(
  ~bad_name,  ~expected,
  "newyork", "New York",
  "alabama",  "Alabama"
)
agrep(df$bad_name, state.name, max.distance = 3, value = TRUE)
#> Warning in agrep(df$bad_name, state.name, max.distance = 3, value = TRUE):
#> argument 'pattern' has length > 1 and only the first element will be used
#> [1] "New York"

You see I get the same warning. This is because mutate passes the whole bad_name column, a character vector, as the pattern argument of agrep. agrep does not accept a vector there, so it uses only the first element ("newyork"), and the single result is then recycled to every row. See ?agrep. The agrep function is not vectorized over pattern, so you need to vectorize it or apply it element by element.

Example using a vectorized version:

df <- tibble::tribble(
  ~bad_name,  ~expected,
  "newyork", "New York",
  "alabama",  "Alabama"
)

vagrep <- Vectorize(agrep, "pattern")
vagrep(df$bad_name, state.name, max.distance = 3, value = TRUE)
#> $newyork
#> [1] "New York"
#> 
#> $alabama
#> [1] "Alabama"  "Oklahoma"

dplyr::mutate(
  df,
  good_name = vagrep(bad_name, state.name, max.distance = 3, value = TRUE)
)
#> # A tibble: 2 x 3
#>   bad_name expected good_name   
#>   <chr>    <chr>    <named list>
#> 1 newyork  New York <chr [1]>   
#> 2 alabama  Alabama  <chr [2]>

You see you get the correct result now, sometimes with several matches from the fuzzy matching. You can use it directly in mutate, but you get a list column that you need to process further (by selecting the first match, as you did, for example).

Without Vectorize, you need to apply agrep to each row. With dplyr, here is a way using purrr iteration:

df <- tibble::tribble(
  ~bad_name,  ~expected,
  "newyork", "New York",
  "alabama",  "Alabama"
)

library(dplyr)
df %>% 
  mutate(
    good_name = purrr::map(bad_name, 
                    ~ agrep(.x, state.name, max.distance = 3, value = TRUE)
    )
  )
#> # A tibble: 2 x 3
#>   bad_name expected good_name
#>   <chr>    <chr>    <list>   
#> 1 newyork  New York <chr [1]>
#> 2 alabama  Alabama  <chr [2]>
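
If only one match per row is wanted, the list column can be avoided by returning a single string per element. This is a minimal sketch, assuming the first match is the one to keep (as with the `[1]` indexing in the question); the `result` name is only for illustration, and `purrr::map_chr()` is swapped in for `purrr::map()`:

```r
library(dplyr)

df <- tibble::tribble(
  ~bad_name,  ~expected,
  "newyork", "New York",
  "alabama",  "Alabama"
)

result <- df %>%
  mutate(
    good_name = purrr::map_chr(
      bad_name,
      # [1] keeps only the first match; it gives NA when agrep() finds nothing
      ~ agrep(.x, state.name, max.distance = 3, value = TRUE)[1]
    )
  )

result$good_name
```

Keeping only the first match is a simplification: when agrep returns several candidates, the first one is not guaranteed to be the closest.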

With the new dplyr 1.0.0, I think it will be easier thanks to the improved rowwise() operation.

You need to install the dev version for now to try that.
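
As a rough sketch of what the row-wise version could look like, assuming dplyr 1.0.0 or later is installed (the `by_row` name is only for illustration): inside `rowwise()`, `bad_name` is a single string for each row, so agrep receives a length-one pattern, and wrapping the call in `list()` keeps all matches in a list column.

```r
library(dplyr)  # assumes dplyr >= 1.0.0

df <- tibble::tribble(
  ~bad_name,  ~expected,
  "newyork", "New York",
  "alabama",  "Alabama"
)

by_row <- df %>%
  rowwise() %>%
  # bad_name is one string per row here; list() stores every match for that row
  mutate(good_name = list(agrep(bad_name, state.name, max.distance = 3, value = TRUE))) %>%
  ungroup()

by_row$good_name
```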

Hope it helps

Thank you @cderv. This works perfectly now :slight_smile:
