I'm trying to use agrep to perform 'fuzzy' matching. It works on a single string, but when I apply it to a data frame column using mutate
it no longer works. Instead it gives a warning I do not understand:
library(tidyverse)
df <- tibble::tribble(
  ~bad_name, ~expected,
  "newyork", "New York",
  "alabama", "Alabama"
)
# Test
agrep("newyork", state.name, max.distance = 3, value = TRUE)
#> [1] "New York"
# Take the poorly formatted column and format according to state.name using agrep
df %>%
  mutate(good_name = agrep(bad_name, state.name, max.distance = 3, value = TRUE))
#> Warning in agrep(bad_name, state.name, max.distance = 3, value = TRUE): argument
#> 'pattern' has length > 1 and only the first element will be used
#> # A tibble: 2 x 3
#> bad_name expected good_name
#> <chr> <chr> <chr>
#> 1 newyork New York New York
#> 2 alabama Alabama New York
As you can see, it gives a warning and returns New York for both rows.
I'm not sure what the warning means here. In both rows, the pattern (bad_name) is only one element. Perhaps it refers to the output? Some cases return two or more matches, e.g.:
agrep("alabama", state.name, max.distance = 3, value = TRUE)
#> [1] "Alabama" "Oklahoma"
I can explicitly limit the results by indexing:
agrep("alabama", state.name, max.distance = 3, value = TRUE)[1]
#> [1] "Alabama"
and I can do the same within the mutate call, but I get the same warning and output as above:
df %>%
  mutate(good_name = agrep(bad_name, state.name, max.distance = 3, value = TRUE)[1])
#> Warning in agrep(bad_name, state.name, max.distance = 3, value = TRUE): argument
#> 'pattern' has length > 1 and only the first element will be used
#> # A tibble: 2 x 3
#> bad_name expected good_name
#> <chr> <chr> <chr>
#> 1 newyork New York New York
#> 2 alabama Alabama New York
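One reading of the warning is that, inside mutate, bad_name refers to the whole column (a character vector of length 2 here) rather than one value per row, and agrep only uses the first element of pattern. Assuming that is the cause, a minimal sketch of a row-wise workaround calls agrep once per element with purrr::map_chr. The helper name first_match is hypothetical, chosen only for illustration, and taking the first hit is an arbitrary tie-break when several states match.

```r
library(dplyr)
library(purrr)

df <- tibble::tribble(
  ~bad_name, ~expected,
  "newyork", "New York",
  "alabama", "Alabama"
)

# Hypothetical helper: first fuzzy match for ONE string, or NA if nothing matches
first_match <- function(x) {
  hits <- agrep(x, state.name, max.distance = 3, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}

# map_chr() calls first_match() separately for each element of bad_name,
# so agrep() always receives a length-1 pattern
result <- df %>%
  mutate(good_name = map_chr(bad_name, first_match))

result
```

Because the helper always returns exactly one string (or NA), map_chr should yield one value per row, which is what mutate needs for a new column.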