I am trying to find whether a list of keywords appears in any of tens of thousands of texts.
My code is as follows:
library(dplyr)
library(stringr)

texts <- data.frame(content = c("I have an apple and a book and cat.", "The cat is sleeping."))
# Target words
target_words <- c("apple", "book", "cat")
texts %>%
  rowwise() %>%
  mutate(matched_word = paste(intersect(str_split(content, pattern = " ", simplify = TRUE),
                                        target_words),
                              collapse = ","))
However, it is extremely slow when I use the following function to find the locations in the text of the matched target words.
Is there a more efficient way to find the locations in the text of the matched words (a column called "index") in a large dataset? Any suggestions will be appreciated. Thanks!
The janeaustenr package is just used to have something to match against.
The function gregexpr accepts only a single pattern to match against, so I used the Vectorize function to accept more inputs. matches will be a list containing as many elements as you want to match (in my case 3). For every "entry" in the book (16,235 in this case) it will give either -1 (meaning nothing was found) or a vector of starting indexes where it found a match. For example, the word have appears for the first time in the 23rd entry of the vector janeaustenr::emma.
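The Vectorize-over-gregexpr idea can be sketched like this on the small example data (vgregexpr is an illustrative name, not from the original post):

```r
# gregexpr() accepts a single pattern, so Vectorize() wraps it to take a
# vector of patterns; SIMPLIFY = FALSE keeps one list element per pattern.
vgregexpr <- Vectorize(gregexpr, vectorize.args = "pattern", SIMPLIFY = FALSE)

texts <- c("I have an apple and a book and cat.", "The cat is sleeping.")
matches <- vgregexpr(c("apple", "book", "cat"), texts)

# matches is a named list with one element per pattern; each element holds,
# per input string, either -1 (no match) or the starting character positions.
```

With janeaustenr::emma as the text vector this produces the 16,235-element results described above. Note these are character positions, not word positions.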
A tidyverse approach would be to use the str_locate_all function.
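A minimal sketch of that approach on the example data (collapsing the target words into one alternation regex is my assumption about how they would be combined):

```r
library(stringr)

texts <- c("I have an apple and a book and cat.", "The cat is sleeping.")
target_words <- c("apple", "book", "cat")

# str_locate_all() returns, per input string, a matrix of start/end
# character positions for every match of the combined pattern.
locs <- str_locate_all(texts, str_c(target_words, collapse = "|"))
```

Like gregexpr, this reports character positions rather than word positions.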
Hi Vedoa, thanks for the solution, and it worked much faster! However, it returned the matched letter position (e.g., 'cat' starts at the 5th character of the sentence "The cat is sleeping"). Instead, the matched word position is the wanted result (e.g., 'cat' is the 2nd word of the sentence "The cat is sleeping"). Are there other alternatives to gregexpr() that get the desired result?
Hi @veda, you are right, I thought you needed the starting character positions.
Here is an alternative that should also be fast enough.
#install.packages("quanteda", repos = "https://packagemanager.posit.co/cran/2024-04-01")
library(quanteda)
corp <- corpus(janeaustenr::emma)
toks <- tokens(corp)
result <- index(toks, c("have", "emma", "she"))
# > head(result)
#   docname from to pattern
# 1   text1    1  1    emma
# 2  text15    1  1    emma
# 3  text20    1  1     she
# 4  text23    7  7    have
# 5  text29    2  2    emma
# 6  text33   12 12    emma
Explanation of install.packages("quanteda", repos = "https://packagemanager.posit.co/cran/2024-04-01"):
quanteda depends on Matrix. Matrix was released on 26.04.2024 and requires R 4.4.0 (released 24.04.2024). Since I doubt that you are using the newest version of R, that line of code ensures that you get a previous version of the package. R has no real freezing of package versions, so we have to freeze the status of CRAN instead. Luckily Posit makes this possible.
gregexpr and similar functions all do character-based matching; I don't know of an out-of-the-box base R alternative for word-based matching.
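That said, word-position matching can be pieced together in base R with strsplit (a sketch, not from the thread; stripping trailing punctuation is my assumption about how "cat." should be handled):

```r
texts <- c("I have an apple and a book and cat.", "The cat is sleeping.")
target_words <- c("apple", "book", "cat")

# Split each text into words, then record the word indices that hit a target.
word_positions <- lapply(strsplit(texts, "\\s+"), function(words) {
  # strip trailing punctuation so "cat." still counts as a hit for "cat"
  which(gsub("[[:punct:]]+$", "", words) %in% target_words)
})
```

This returns, per text, the 1-based word positions of the matches rather than character offsets.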
Here's a tidyverse alternative, but I don't know how fast it is:
library(tidyverse)
texts <-
  data.frame(
    content = c("I have an apple and a book and cat.", "The cat is sleeping.")
  )
# Target words
target_words <- c("apple", "book", "cat")
target_regex <- target_words |> str_c(collapse = '|')
texts |>
  pull(content) |>
  map(
    \(s)
    tibble(
      word = s |> str_split_1('\\s+'),
      detect = word |> str_detect(target_regex)
    ) |>
      reframe(
        matches = word[detect] |> str_c(collapse = ', '),
        indices = which(detect) |> str_c(collapse = ', ')
      ) |>
      mutate(sentence = s, .before = matches)
  ) |>
  list_rbind()
#> # A tibble: 2 × 3
#>   sentence                            matches           indices
#>   <chr>                               <chr>             <chr>
#> 1 I have an apple and a book and cat. apple, book, cat. 4, 7, 9
#> 2 The cat is sleeping.                cat               2