I'm working with magrittr, RWeka, and openNLP to annotate text files. I have a working script for NLP extraction, and I am using the function below to extract named entities of three kinds: "Location", "Organization", and "Person". Everything works well enough, but I would like to extract larger chunks of the text that include the annotated elements. As it stands, the function extracts only the single entities that have been tagged (e.g. "United States of America" or "London" for "location"). I would like to pull out entire sections, roughly paragraphs, along with the tagged entities. The texts I am using are standardized interviews, so ideally I would like to pull out the text between occurrences of "Q:", which denotes an interviewer question. Is there a way to do this by changing an element within my current function?
I believe the '[[' operator is what selects the single entities. Is there a way I could change it so that the function outputs a larger chunk of text?
Thank you all kindly
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
Many people are able to help even without being deeply knowledgeable about NLP. They outnumber NLP experts but are unlikely to address a question without a reprex.
This almost makes it, but it is missing an essential ingredient: the data represented by the doc argument.
Without it, the most I can help with is the bracketing operator.
Basically, it selects parts of a list. For example
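Here is a minimal sketch in base R. The `features` list is a made-up stand-in for `a$features` (I don't have your data, so its contents are assumed), but it shows what `[[` does on its own and when it is passed to `sapply`:

```r
# Hypothetical stand-in for a$features: a list of lists,
# each carrying a "kind" label (values invented for illustration)
features <- list(
  list(kind = "person"),
  list(kind = "location"),
  list(kind = "person")
)

# `[[` extracts one element from a list, by position or by name
second <- features[[2]]        # the second inner list
second_kind <- second[["kind"]]
second_kind
# [1] "location"

# Passing `[[` to sapply applies that extraction to every inner list,
# which is what sapply(a$features, `[[`, "kind") does in your function
k <- sapply(features, `[[`, "kind")
k
# [1] "person"   "location" "person"

# The result is then used as a filter, as in s[a[k == kind]]
is_person <- k == "person"
is_person
# [1]  TRUE FALSE  TRUE
```

So, at least in the function as posted, `[[` only reads the `kind` label of each annotation. It does not decide how much text comes back; that is determined by the spans of the annotations used to subset `s`.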