replacement argument in str_replace has different length than string

andresrcs · March 31, 2022, 12:54am

Replacing words based on their first two and last two characters doesn't seem like a reliable method. I think you should consider using string distance metrics instead, as in this example:

library(tidyverse)
library(fuzzyjoin)

clean_db <- tibble(provincia = c("AZUY", "BOLI$BAR", "CAN_AR", "GUY$AS", "PICHI.CHA",
                                 "COTPAXI", "MORON/A SANTIAGO"),
                   ciudad = c("QUITO", "CUENCA", "GUAYAQUIL", "MANTA", "PORTOVIEJO",
                              "AZOGUES", "SALINAS"))

Provincia <- tibble(codigo = 1:17,
                    descripcion = c("AZUAY",
                                    "BOLIVAR",
                                    "CAÑAR",
                                    "CARCHI",
                                    "CHIMBORAZO",
                                    "COTOPAXI",
                                    "EL ORO",
                                    "ESMERALDAS",
                                    "GALAPAGOS",
                                    "GUAYAS",
                                    "IMBABURA",
                                    "LOJA",
                                    "LOS RIOS",
                                    "MANABI",
                                    "MORONA SANTIAGO",
                                    "NAPO",
                                    "SANTO DOMINGO DE LOS TSACHILAS"))

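# Fuzzy-match each provincia against the reference spellings using the OSA
# string distance (stringdist_left_join() defaults to max_dist = 2), then
# overwrite provincia with the matched spelling wherever one was found
clean_db %>% 
    stringdist_left_join(Provincia %>% select(descripcion),
                         by = c(provincia = "descripcion"),
                         method = "osa") %>% 
    mutate(provincia = coalesce(descripcion, provincia)) %>% 
    select(-descripcion)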
#> # A tibble: 7 × 2
#>   provincia       ciudad    
#>   <chr>           <chr>     
#> 1 AZUAY           QUITO     
#> 2 BOLIVAR         CUENCA    
#> 3 CAÑAR           GUAYAQUIL 
#> 4 GUAYAS          MANTA     
#> 5 PICHI.CHA       PORTOVIEJO
#> 6 COTOPAXI        AZOGUES   
#> 7 MORONA SANTIAGO SALINAS

Created on 2022-03-30 by the reprex package (v2.0.1)
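
Note that PICHI.CHA is left unchanged in the output because the example Provincia table has no PICHINCHA entry to match against. Also, a distance-based join returns every reference row within the maximum distance, so an input could end up duplicated if two reference spellings are both close enough. If that becomes a problem, one option is to keep only the single closest match. This is a minimal sketch using amatch() from the stringdist package; the shortened `reference` and `dirty` vectors are just for illustration, and maxDist = 2 is an assumption you would need to tune for your data.

```r
library(stringdist)

# Reference spellings (a subset of the Provincia table above)
reference <- c("AZUAY", "BOLIVAR", "COTOPAXI", "GUAYAS", "MORONA SANTIAGO")

# Misspelled inputs taken from clean_db
dirty <- c("AZUY", "BOLI$BAR", "COTPAXI", "GUY$AS", "PICHI.CHA", "MORON/A SANTIAGO")

# amatch() returns the index of the closest reference entry within maxDist,
# or NA when no entry is close enough
idx <- amatch(dirty, reference, method = "osa", maxDist = 2)

# Keep the original value wherever no match was found
cleaned <- ifelse(is.na(idx), dirty, reference[idx])

# Show each input, its replacement, and the distance between them
data.frame(dirty, cleaned, distance = stringdist(dirty, cleaned, method = "osa"))
```

Inspecting the distance column is a quick way to judge whether the threshold is too strict or too permissive before applying the replacements to the real data.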

Or, if possible, manually define a named vector of equivalences, e.g. c('misspelling' = 'correct'), which would give the most accurate results.
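
A minimal sketch of the manual approach, assuming you already know which misspellings occur. The `fixes` vector and the three-row `clean_db` here are hypothetical examples, not your real data. The lookup is done by exact name, so characters such as `.` or `$` in the misspellings are not interpreted as regular expressions (which they would be if the same vector were passed to str_replace_all()).

```r
library(dplyr)

# Small illustrative data set (hypothetical values)
clean_db <- tibble(provincia = c("AZUY", "PICHI.CHA", "GUAYAS"),
                   ciudad = c("CUENCA", "QUITO", "GUAYAQUIL"))

# Hypothetical lookup: names are the misspellings, values the correct spellings
fixes <- c("AZUY" = "AZUAY", "PICHI.CHA" = "PICHINCHA")

# Indexing by name gives NA for values that are not in the lookup,
# so coalesce() falls back to the original spelling for those rows
fixed_db <- clean_db %>% 
    mutate(provincia = coalesce(unname(fixes[provincia]), provincia))

fixed_db
```

Values that are already correct, like GUAYAS here, pass through untouched, and you can extend `fixes` as new misspellings turn up.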

Note: Next time please provide a proper REPRoducible EXample (reprex) illustrating your issue.