I have a vector of sentences which contains many words that need to be replaced (bolded):
TEXT |
---|
**Lorem** **ipsum** dolor sit amet, consectetur adipiscing elit. |
**Fusce** nec quam ut tortor interdum pulvinar id vitae **magna**. |
Curabitur commodo consequat arcu et lacinia. |
Proin at diam vitae **lectus** dignissim auctor nec dictum **lectus**. |
**Fusce** venenatis eros congue velit feugiat, ac aliquam **ipsum** gravida. |
I also have a tibble which contains a column for the target words (`ORIG`) and a column for their replacements (`NEW`):
ORIG | NEW |
---|---|
lorem | APPLE |
ipsum | BANANA |
magna | CHERRY |
fusce | DAIKON |
lectus | EGGPLANT |
In this example there are only five words to be replaced, but my actual use case involves about 100 target words, so I'd like to find an efficient, programmatic way of returning the following result (bolding for clarity only):
TEXT | NEW TEXT |
---|---|
**Lorem** **ipsum** dolor sit amet, consectetur adipiscing elit. | **APPLE** **BANANA** dolor sit amet, consectetur adipiscing elit. |
**Fusce** nec quam ut tortor interdum pulvinar id vitae **magna**. | **DAIKON** nec quam ut tortor interdum pulvinar id vitae **CHERRY**. |
Curabitur commodo consequat arcu et lacinia. | Curabitur commodo consequat arcu et lacinia. |
Proin at diam vitae **lectus** dignissim auctor nec dictum **lectus**. | Proin at diam vitae **EGGPLANT** dignissim auctor nec dictum **EGGPLANT**. |
**Fusce** venenatis eros congue velit feugiat, ac aliquam **ipsum** gravida. | **DAIKON** venenatis eros congue velit feugiat, ac aliquam **BANANA** gravida. |
What is an efficient way of doing this string replacement?
So far I've played around with passing a named vector to `str_replace_all()`, but I've been unable to overcome case sensitivity (see the reprex below). My gut tells me there's probably a way to do this using `fuzzyjoin::regex_inner_join()`, but I haven't been able to crack it.
Any suggestions would be appreciated!
Reprex
library(tidyverse)
dat_orig <- tibble(TEXT = c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Fusce nec quam ut tortor interdum pulvinar id vitae magna.",
  "Curabitur commodo consequat arcu et lacinia.",
  "Proin at diam vitae lectus dignissim auctor nec dictum lectus.",
  "Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida."
))
recode_table <- tibble(
  ORIG = c("lorem", "ipsum", "magna", "fusce", "lectus"),
  NEW = c("APPLE", "BANANA", "CHERRY", "DAIKON", "EGGPLANT")
)
name_tbl_vector <- function(x, name, value) {
  x %>%
    transpose() %>%
    {
      set_names(map_chr(., value), map_chr(., name))
    }
}
key <- name_tbl_vector(recode_table, name = "ORIG", value = "NEW")
# This almost works, but it fails to replace "Lorem" and other targets that have capitalized letters
dat_orig %>%
  mutate(NEW_TEXT = str_replace_all(TEXT, key))
## # A tibble: 5 x 2
## TEXT NEW_TEXT
## <chr> <chr>
## 1 Lorem ipsum dolor sit amet, consec~ Lorem BANANA dolor sit amet, consec~
## 2 Fusce nec quam ut tortor interdum ~ Fusce nec quam ut tortor interdum p~
## 3 Curabitur commodo consequat arcu e~ Curabitur commodo consequat arcu et~
## 4 Proin at diam vitae lectus digniss~ Proin at diam vitae EGGPLANT dignis~
## 5 Fusce venenatis eros congue velit ~ Fusce venenatis eros congue velit f~
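
One direction that might get around the case sensitivity, offered only as a sketch: since the names of the vector passed to `str_replace_all()` are treated as regular expressions, each target can be prefixed with the inline flag `(?i)` to make that pattern case-insensitive. The `\\b` word boundaries and the name `key_ci` are assumptions added here for illustration (they are not part of the reprex above), and the sketch assumes the target words contain no regex metacharacters.

```r
library(stringr)

# Same sentences and recode pairs as in the reprex, as plain vectors
text <- c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Fusce nec quam ut tortor interdum pulvinar id vitae magna.",
  "Curabitur commodo consequat arcu et lacinia.",
  "Proin at diam vitae lectus dignissim auctor nec dictum lectus.",
  "Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida."
)
orig <- c("lorem", "ipsum", "magna", "fusce", "lectus")
new  <- c("APPLE", "BANANA", "CHERRY", "DAIKON", "EGGPLANT")

# Names are the patterns, values are the replacements.
# "(?i)" switches on case-insensitive matching for that pattern;
# "\\b" (an assumption) limits matches to whole words.
key_ci <- setNames(new, paste0("(?i)\\b", orig, "\\b"))

# Replacements are applied one pattern after another to each string
new_text <- str_replace_all(text, key_ci)
new_text
```

Because the patterns are applied in sequence, this should scale to around 100 targets without extra code, though a replacement that itself matches a later target would be replaced again.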