I've got several texts in XML-TEI-P5 format that I eventually need as a corpus (e.g. a tm, quanteda or stylo corpus). I've never worked with XML and have trouble parsing it. I can get at the text, but it still contains all the annotations, which I can't manage to remove. Also, I only need the text, not the metadata.
Here are two approaches I've tried so far:
- With XML and xml2. The problem here is that root1 is an "External pointer of class 'XMLInternalElementNode'" and I can't manage to convert it into anything else.
library(xml2)
library(XML)
A1 <- read_xml("http://www.deutschestextarchiv.de/book/download_xml/schlegel_athenaeum_1798")
doc1 <- xmlParse(A1)
root1 <- xmlRoot(doc1)
print(root1)
- With stylo (same document, but saved locally):
Corpus_alle <- load.corpus.and.parse(files = "all", corpus.dir = "TexteXML", markup.type = "XML",
                                     corpus.lang = "German", splitting.rule = NULL,
                                     sample.size = 10000, sampling = "no.sampling",
                                     sample.overlap = 0, number.of.samples = 1,
                                     sampling.with.replacement = FALSE, features = "w",
                                     ngram.size = 1, preserve.case = FALSE,
                                     encoding = "UTF-8")
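
To make the goal concrete, here is a minimal sketch of the kind of extraction being asked for, using xml2 alone rather than mixing it with XML. It rests on several assumptions: the inline `tei` string is a made-up stand-in for the downloaded file, the document uses the standard TEI P5 namespace, and only `<p>` elements inside `<text>` are wanted (headings and other elements would need to be added to the XPath). The names `tei`, `ns`, `paras` and `plain` are hypothetical.

```r
library(xml2)

# Hypothetical miniature TEI document standing in for the real file
tei <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Metadata title</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>Erster <hi rend="italic">Absatz</hi>.</p>
    <p>Zweiter Absatz.</p>
  </body></text>
</TEI>'

# read_xml() accepts a path, a URL or a literal XML string
doc <- read_xml(tei)

# TEI P5 declares a default namespace, so bind it to a prefix for XPath
ns <- c(tei = "http://www.tei-c.org/ns/1.0")

# Select paragraphs under <text> only, which leaves out <teiHeader> (the metadata)
nodes <- xml_find_all(doc, "//tei:text//tei:p", ns)

# xml_text() keeps the character data and drops the tags (e.g. <hi>)
paras <- xml_text(nodes)

# Collapse whitespace runs left over from the markup layout
paras <- trimws(gsub("\\s+", " ", paras))

# One plain-text string per document
plain <- paste(paras, collapse = "\n")
```

A character vector like `plain` can then presumably be handed to a corpus constructor such as `quanteda::corpus()` or `tm::VCorpus(tm::VectorSource(plain))`; whether paragraph-level or document-level units are appropriate depends on the analysis.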