Hi there. I am parsing newspaper articles (HTML) into R in order to perform Sentiment Analysis (SA) on them. However, I'm having trouble getting the format right: I need the texts as corpora to perform SA.
I read my articles like this. (This is an example with 3 articles; I will be parsing up to 45 later.)
library(textreadr)   # I load textreadr for read_html() with the skip/remove.empty/trim arguments
f <- file.path("/Desktop/SentimentAnalysisTests/haefliger/cassis", c("04042020.html", "17092020.html", "13102020.html"))
d <- lapply(f, read_html, skip = 7, remove.empty = TRUE, trim = TRUE)
typeof(d)
# This gives me a list and works fine.
# Here I parse the full path names in order to name the list elements.
names(d) <- gsub(".*/(.*)\\..*", "\\1", f)
# Here I turn the list into a character vector.
d_unlisted <- unlist(d)
From here I would go on to convert the list into a corpus. However, when I unlist d, the per-document structure is lost: the corpus is no longer divided by document (1, 2, 3) but by the combined rows/lines of all the documents. SA therefore doesn't work properly, since it analyses each line/row instead of each document separately. How can I a) "remove the lines" and merge everything into one row per document, and b) make sure the documents are preserved when converted to a corpus? c) The only reason I unlist in the first place is that only character vectors can be converted into corpora. Is there a way to avoid unlisting but still convert to a corpus? (What I imagine I need is sketched below, after my current attempt.)
d_corpus <- corpus(d_unlisted)
summary(d_corpus)
show(d_corpus)
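
What I imagine I need is something like the following sketch. This assumes quanteda's corpus() (which I believe accepts a named character vector) and that simply pasting each document's paragraphs together with collapse is an acceptable way to get one row per document; I'm not sure this is the right or idiomatic approach.

library(quanteda)

# Collapse each document's paragraphs into a single string,
# so one list element becomes one row = one document.
d_collapsed <- vapply(d, paste, character(1), collapse = " ")

# The names set earlier are kept, so the corpus should stay divided by document.
d_corpus <- corpus(d_collapsed)
summary(d_corpus)   # should report 3 documents, not one per paragraph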
Thanks for any help for a desperate newbie!
Edit: Basically, the problem arises the moment I read the HTML file(s). For example, when I read a single HTML file with read_html, its length is already 36 (the number of paragraphs in the original article). Unfortunately, there is no way for me to download the HTML files as txt.
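
For illustration, this is what I see for a single file, plus one alternative I have been wondering about (it is just an assumption on my part that xml2/rvest would be acceptable here; html_text() on the parsed page seems to return the whole article as a single string):

one <- read_html(f[1], skip = 7, remove.empty = TRUE, trim = TRUE)
length(one)   # 36 in my case -- one element per paragraph of the article

# Possible alternative (assumption): parse the raw HTML with xml2/rvest instead,
# which returns the full text of the page as a single string per file.
one_string <- rvest::html_text(xml2::read_html(f[1]))
length(one_string)   # 1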