I have a somewhat general question: I have a few hundred article IDs and a few thousand documents, each of which may contain one, several, or none of those article IDs. What I am looking for is a dataframe containing each article ID found in a document together with its context (e.g. 50 characters before and after the article ID).
My approach below works, but it is very slow. I was wondering whether there are any suggestions for speeding up the process.
# create a pattern of search terms combined with OR, plus surrounding context
article_ids <- c("a1", "b2", "c3", "yx")
article_ids_or <- paste0(article_ids, collapse = "|")
article_ids_or_context <- paste0(".{0,50}(", article_ids_or, ").{0,50}")
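For reference, here is a self-contained sketch of how such a pattern can be applied with base R's `gregexpr()`/`regmatches()`; the `docs` vector is a hypothetical stand-in for the few thousand documents:

```r
article_ids <- c("a1", "b2", "c3", "yx")
pattern <- paste0(".{0,50}(", paste0(article_ids, collapse = "|"), ").{0,50}")

# hypothetical documents; in practice a character vector of a few thousand texts
docs <- c(doc1 = "Some text mentioning a1 in passing.",
          doc2 = "Nothing relevant here.")

# one character vector of matches (ID plus context) per document
hits <- regmatches(docs, gregexpr(pattern, docs, perl = TRUE))
```

Documents with no matching ID simply yield an empty character vector, so the result can be flattened into a dataframe afterwards.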
I think there is nothing wrong with the approach as such. However, I am looking for ways to speed up the entire process; at the moment the search takes a few hours to run. Grateful for any suggestion. Many thanks.
True, but keyword-in-context (KWIC) is a well-known natural language processing task, and there are existing solutions that handle large text collections efficiently.
library(quanteda)
#> Package version: 3.1.0
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(doc1 = "This a sentence containing the a1 keyword.",
doc2 = "This sentence has both a1 and b2.",
doc3 = "Nothing to see here, folks.")
txt
#> doc1
#> "This a sentence containing the a1 keyword."
#> doc2
#> "This sentence has both a1 and b2."
#> doc3
#> "Nothing to see here, folks."
toks <- tokens(txt)
article_ids <- c("a1", "b2", "c3", "yx")
kwic(toks, pattern = article_ids, valuetype = "glob", window = 10)
#> Keyword-in-context with 3 matches.
#> [doc1, 6] This a sentence containing the | a1 | keyword.
#> [doc2, 5] This sentence has both | a1 | and b2.
#> [doc2, 7] This sentence has both a1 and | b2 | .
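Since the goal is a dataframe of each matched ID with its context, the `kwic` object can be coerced directly with `as.data.frame()`. Note that `window` counts tokens, not characters, so `window = 10` means roughly ten words on each side rather than 50 characters. A minimal self-contained example:

```r
library(quanteda)

toks <- tokens(c(doc1 = "This sentence has both a1 and b2."))
res  <- kwic(toks, pattern = c("a1", "b2"), window = 10)

# coerce to a plain data frame; columns include docname, pre, keyword, post
df <- as.data.frame(res)
```

The `pre` and `post` columns hold the context on either side of each matched keyword, one row per match, which is essentially the dataframe described in the question.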