I am trying to find whether a list of keywords appears in any of tens of thousands of texts.
My code is as follows:
library(dplyr)
library(stringr)

texts <- data.frame(content = c("I have an apple and a book and cat.", "The cat is sleeping."))
# Target words
target_words <- c("apple", "book", "cat")
texts %>%
  rowwise() %>%
  mutate(matched_word = paste(intersect(str_split(content, pattern = " ", simplify = TRUE),
                                        target_words),
                              collapse = ","))
However, it is extremely slow when I use the following function to find the locations in the text of the matched target words.
Is there a more efficient way to find the locations in the text of the matched words (a column called "index") in a large dataset? Any suggestions will be appreciated. Thanks!
The janeaustenr package is just used to have something to match against.
The function gregexpr accepts only a single pattern to match against, so I used the Vectorize function to accept more inputs. matches will be a list containing as many elements as you want to match (in my case 3). For every "entry" in the book (16,235 in this case) it will give either -1 (meaning nothing was found) or a vector of starting indexes where it found a match. For example, the word have appears for the first time in the 23rd entry of the vector janeaustenr::emma.
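The Vectorize-over-gregexpr idea can be sketched like this on the small example data (vgregexpr is an illustrative name, not from the original post):

```r
# gregexpr() accepts a single pattern, so Vectorize() wraps it to take a
# vector of patterns; SIMPLIFY = FALSE keeps one list element per pattern.
vgregexpr <- Vectorize(gregexpr, vectorize.args = "pattern", SIMPLIFY = FALSE)

texts <- c("I have an apple and a book and cat.", "The cat is sleeping.")
matches <- vgregexpr(c("apple", "book", "cat"), texts)

# matches is a named list with one element per pattern; each element holds,
# per input string, either -1 (no match) or the starting character positions.
```

With janeaustenr::emma as the text vector this produces the 16,235-element results described above. Note these are character positions, not word positions.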
A tidyverse approach would be to use the str_locate_all function.
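A minimal sketch of that approach on the example data (collapsing the target words into one alternation regex is my assumption about how they would be combined):

```r
library(stringr)

texts <- c("I have an apple and a book and cat.", "The cat is sleeping.")
target_words <- c("apple", "book", "cat")

# str_locate_all() returns, per input string, a matrix of start/end
# character positions for every match of the combined pattern.
locs <- str_locate_all(texts, str_c(target_words, collapse = "|"))
```

Like gregexpr, this reports character positions rather than word positions.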
Hi Vedoa, thanks for the solution, and it worked much faster! However, it returned the matched letter position (e.g., 'cat' starts at the 5th character of the sentence "The cat is sleeping"). Instead, the matched word position is the wanted result (e.g., 'cat' is the 2nd word of the sentence "The cat is sleeping"). Are there other alternatives to gregexpr() that get the desired result?
Hi @veda, you are right, I thought you needed the starting character positions.
Here is an alternative that should also be fast enough.
#install.packages("quanteda", repos = "https://packagemanager.posit.co/cran/2024-04-01")
library(quanteda)
corp <- corpus(janeaustenr::emma)
toks <- tokens(corp)
result <- index(toks, c("have", "emma", "she"))
# > head(result)
#   docname from to pattern
# 1   text1    1  1    emma
# 2  text15    1  1    emma
# 3  text20    1  1     she
# 4  text23    7  7    have
# 5  text29    2  2    emma
# 6  text33   12 12    emma
Explanation of install.packages("quanteda", repos = "https://packagemanager.posit.co/cran/2024-04-01"):
quanteda depends on Matrix. Matrix was released on 26.04.2024 and requires R 4.4.0 (released 24.04.2024). Since I doubt that you are using the newest version of R, that line of code ensures that you get a previous version of the package. R has no real freezing of package versions, so we have to freeze the status of CRAN instead. Luckily Posit makes this possible.
gregexpr and similar functions all do character-based matching; I don't know of an out-of-the-box base R alternative for word-based matching.
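That said, word-position matching can be pieced together in base R with strsplit (a sketch, not from the thread; stripping trailing punctuation is my assumption about how "cat." should be handled):

```r
texts <- c("I have an apple and a book and cat.", "The cat is sleeping.")
target_words <- c("apple", "book", "cat")

# Split each text into words, then record the word indices that hit a target.
word_positions <- lapply(strsplit(texts, "\\s+"), function(words) {
  # strip trailing punctuation so "cat." still counts as a hit for "cat"
  which(gsub("[[:punct:]]+$", "", words) %in% target_words)
})
```

This returns, per text, the 1-based word positions of the matches rather than character offsets.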
Here's a tidyverse alternative, but I don't know how fast it is:
library(tidyverse)
texts <-
  data.frame(
    content = c("I have an apple and a book and cat.", "The cat is sleeping.")
  )
# Target words
target_words <- c("apple", "book", "cat")
target_regex <- target_words |> str_c(collapse = '|')
texts |>
  pull(content) |>
  map(
    \(s)
    tibble(
      word = s |> str_split_1('\\s+'),
      detect = word |> str_detect(target_regex)
    ) |>
      reframe(
        matches = word[detect] |> str_c(collapse = ', '),
        indices = which(detect) |> str_c(collapse = ', ')
      ) |>
      mutate(sentence = s, .before = matches)
  ) |>
  list_rbind()
#> # A tibble: 2 × 3
#>   sentence                            matches           indices
#>   <chr>                               <chr>             <chr>
#> 1 I have an apple and a book and cat. apple, book, cat. 4, 7, 9
#> 2 The cat is sleeping.                cat               2