Annotated Text Extraction: extracting paragraphs from a large annotated plain text document

atderner · April 6, 2020, 4:46pm

Hi all,

I'm working with magrittr, RWeka, and openNLP to annotate text files. I have a working script for NLP extraction, and am using the function below to extract named entities of three kinds: "Location", "Organization", and "Person". Everything is working well enough, but I would like to extract larger chunks of the text that include annotated elements. As it stands, the function extracts only the single entities that have been tagged (e.g. "United States of America" or "London" for "location"). I would like to pull out entire sections, roughly paragraphs, along with the tagged entities. The texts I am using are standardized interviews, so ideally I would like to pull out the text between the "Q:" markers, which denote an interviewer question. Is there a way to do this by changing an element within my current function?

I believe the `[[` operator is what selects the single entities. Is there a way I could change that so that it would output a larger chunk of text?

Thank you all kindly

entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)
  if (hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
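
For readers following along, here is a minimal base R sketch of the kind of extraction described above. It is not a change to `entities()` itself. It assumes the document text is available as one character string (as `doc$content` would be after `as.character()`), that every interviewer question starts with the literal marker "Q:", and that the tagged entities are available as a plain character vector. The helper name `sections_with_entities` and the sample transcript are hypothetical, and matching entities by exact string is a simplification.

```r
# Hypothetical helper (not part of openNLP or NLP): split `text` at each
# `marker` and return only the sections that mention one of `ents`.
sections_with_entities <- function(text, ents, marker = "Q:") {
  # Start offset of every marker; gregexpr() returns -1 when there is none.
  starts <- as.integer(gregexpr(marker, text, fixed = TRUE)[[1]])
  if (starts[1] == -1L) return(character(0))

  # Each section runs up to the character before the next marker.
  ends <- c(starts[-1] - 1L, nchar(text))
  sections <- trimws(substring(text, starts, ends))

  # Keep a section if any entity string occurs in it verbatim.
  keep <- vapply(
    sections,
    function(sec) any(vapply(ents, function(e) grepl(e, sec, fixed = TRUE), logical(1))),
    logical(1),
    USE.NAMES = FALSE
  )
  sections[keep]
}

# Made-up transcript standing in for the real interview text.
txt <- paste(
  "Q: Where did you grow up?",
  "A: I grew up in London and moved later.",
  "Q: What do you do for fun?",
  "A: Mostly reading.",
  "Q: Who did you work for?",
  "A: I worked for Acme Corp in the United States of America.",
  sep = "\n"
)

# Stand-in for the character vector an entity extractor would return.
ents <- c("London", "United States of America")

hits <- sections_with_entities(txt, ents)
cat(hits, sep = "\n\n")
```

Because the marker is matched as a fixed string, any "Q:" inside an answer (for example in "FAQ:") would also start a new section, so real transcripts may need a stricter pattern.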

technocrat · April 6, 2020, 11:41pm

Hi, and welcome!

Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? Using a reprex, complete with representative data, will attract more and quicker answers.

Many people are able to help even without being deeply knowledgeable about NLP. They outnumber NLP experts but are unlikely to address a question without a reprex.

This almost makes it, but it is missing an essential ingredient: the data represented by the doc argument.

Without it, the most I can help with is the bracketing operator.

Basically, it selects parts of a list (a data frame is a list of columns). For example:

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
head(mtcars[1])
#>                    mpg
#> Mazda RX4         21.0
#> Mazda RX4 Wag     21.0
#> Datsun 710        22.8
#> Hornet 4 Drive    21.4
#> Hornet Sportabout 18.7
#> Valiant           18.1
mtcars[1,1]
#> [1] 21

Created on 2020-04-06 by the reprex package (v0.3.0)

Lists are objects that can contain other objects nested within them, and those objects can in turn contain others.

So for a list of lists, double brackets return the inner list itself, whereas single brackets return a list of length one that contains it.
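
To make the difference concrete, here is a small sketch on a made-up list. The `features` object below is hypothetical: it only imitates the shape that `a$features` appears to have in the question (one small list per annotation, each with a `kind` element), and is not real openNLP output.

```r
# Hypothetical feature list: one inner list per annotation.
features <- list(
  list(kind = "person"),
  list(kind = "location"),
  list(kind = "location")
)

# Single brackets keep the outer container: the result is still a list,
# here of length one, holding the first inner list.
class(features[1])    # "list"
length(features[1])   # 1

# Double brackets return the element itself, so its parts can be reached.
features[[1]][["kind"]]   # "person"

# `[[` is an ordinary function, so sapply() can apply it to every element:
# this computes features[[i]][["kind"]] for each i.
kinds <- sapply(features, `[[`, "kind")
kinds                 # "person" "location" "location"

# The comparison gives a logical vector that can be used for subsetting.
kinds == "location"   # FALSE TRUE TRUE
```

Read this way, the `[[` in the question only pulls the `kind` label out of each annotation's features. Which text comes back is decided by the final subsetting step, `s[a[k == kind]]`.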
