I have a somewhat general question: I have a few hundred article IDs and a few thousand documents, each of which may contain one, several, or none of those article IDs. What I am looking for is a dataframe containing each article ID found in a document together with its context (e.g. 50 characters before and after the article ID).
My approach below works, but it is very slow. I was wondering whether there are any suggestions for speeding up the process.
# create a pattern of search terms combined with OR, plus surrounding context
article_ids <- c("a1", "b2", "c3", "yx")
article_ids_or <- paste0(article_ids, collapse = "|")
article_ids_or_context <- paste0(".{0,50}(", article_ids_or, ").{0,50}")
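For reference, here is a self-contained sketch of how such a pattern can be applied with base R's `gregexpr()`/`regmatches()`; the `docs` vector is a hypothetical stand-in for the few thousand documents:

```r
article_ids <- c("a1", "b2", "c3", "yx")
pattern <- paste0(".{0,50}(", paste0(article_ids, collapse = "|"), ").{0,50}")

# hypothetical documents; in practice a character vector of a few thousand texts
docs <- c(doc1 = "Some text mentioning a1 in passing.",
          doc2 = "Nothing relevant here.")

# one character vector of matches (ID plus context) per document
hits <- regmatches(docs, gregexpr(pattern, docs, perl = TRUE))
```

Documents with no matching ID simply yield an empty character vector, so the result can be flattened into a dataframe afterwards.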
I think there is nothing wrong with the approach as such. However, I am looking for ways to speed up the entire process; at the moment the search takes a few hours to run. Grateful for any suggestion. Many thanks.
True, but keyword-in-context (KWIC) is a well-known natural language processing task, and there are existing solutions that handle large text collections efficiently.
library(quanteda)
#> Package version: 3.1.0
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(doc1 = "This a sentence containing the a1 keyword.",
doc2 = "This sentence has both a1 and b2.",
doc3 = "Nothing to see here, folks.")
txt
#> doc1
#> "This a sentence containing the a1 keyword."
#> doc2
#> "This sentence has both a1 and b2."
#> doc3
#> "Nothing to see here, folks."
toks <- tokens(txt)
article_ids <- c("a1", "b2", "c3", "yx")
kwic(toks, pattern = article_ids, valuetype = "glob", window = 10)
#> Keyword-in-context with 3 matches.
#> [doc1, 6] This a sentence containing the | a1 | keyword.
#> [doc2, 5] This sentence has both | a1 | and b2.
#> [doc2, 7] This sentence has both a1 and | b2 | .
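Since the goal is a dataframe of each matched ID with its context, the `kwic` object can be coerced directly with `as.data.frame()`. Note that `window` counts tokens, not characters, so `window = 10` means roughly ten words on each side rather than 50 characters. A minimal self-contained example:

```r
library(quanteda)

toks <- tokens(c(doc1 = "This sentence has both a1 and b2."))
res  <- kwic(toks, pattern = c("a1", "b2"), window = 10)

# coerce to a plain data frame; columns include docname, pre, keyword, post
df <- as.data.frame(res)
```

The `pre` and `post` columns hold the context on either side of each matched keyword, one row per match, which is essentially the dataframe described in the question.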