How do I parse XML-TEI?

psh · March 10, 2021, 10:14pm

I've got several texts in XML-TEI-P5 format that I eventually need as a corpus (e.g. a tm, quanteda or stylo corpus). I've never worked with XML and am having trouble parsing it. I can extract the text, but it still contains all the annotations, which I haven't managed to remove. Also, I only need the text, not the metadata.

Here are two approaches I've tried so far:

With XML and xml2. The problem here is that root1 is an "External pointer of class 'XMLInternalElementNode'" and I can't manage to convert it into anything else.

library(xml2)
library(XML)
A1 <- read_xml("http://www.deutschestextarchiv.de/book/download_xml/schlegel_athenaeum_1798")
doc1 <- xmlParse(A1)
root1 <- xmlRoot(doc1)

print(root1)

With stylo (same document, but saved locally):

Corpus_alle <- load.corpus.and.parse(files = "all", corpus.dir = "TexteXML", markup.type= "XML",
                      corpus.lang = "German", splitting.rule = NULL,
                      sample.size = 10000, sampling = "no.sampling",
                      sample.overlap = 0, number.of.samples = 1,
                      sampling.with.replacement = FALSE, features = "w", 
                      ngram.size = 1, preserve.case = FALSE,
                      encoding = "UTF-8")
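
For reference, the goal described above (body text only, with no metadata and no annotations) might be approached with xml2 alone along the following lines. This is only a rough sketch, not a tested solution for the Deutsches Textarchiv files: the inline document is a made-up stand-in for a real TEI file, and treating `<note>` as the only annotation element to drop is an assumption that would need adjusting to the markup actually present.

```r
library(xml2)

# Made-up miniature TEI document standing in for a real file (illustration only).
tei <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Demo title</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>Erster <hi rend="italic">Absatz</hi>.</p>
    <p>Zweiter Absatz<note>Anmerkung</note>.</p>
  </body></text>
</TEI>'

doc <- read_xml(tei)

# TEI files declare a default namespace; stripping it lets plain element
# names be used in the XPath expressions below.
xml_ns_strip(doc)

# Keep only the body, which leaves the metadata in <teiHeader> behind.
body <- xml_find_first(doc, ".//text/body")

# Assumption: <note> elements are the annotations that should not appear
# in the output. Other unwanted elements could be removed the same way.
xml_remove(xml_find_all(body, ".//note"))

# xml_text() returns the text content of each paragraph without any tags.
paragraphs <- xml_text(xml_find_all(body, ".//p"))
plain_text <- paste(trimws(paragraphs), collapse = "\n")
```

The resulting `plain_text` is an ordinary character string, so it should be usable as input for the corpus constructors of tm or quanteda. For a real file, `read_xml()` would be given the file path or URL instead of the inline string.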
