Hi there. I am parsing newspaper articles (HTML) into R in order to perform Sentiment Analysis (SA) on them. However, I'm having trouble getting the format right: I need the texts as corpora to perform SA.
I read my articles like this. (This is an example with 3 articles; I will be parsing up to 45 later.)
library(textreadr)   # I load textreadr for read_html() with the skip/remove.empty/trim arguments
f <- file.path("/Desktop/SentimentAnalysisTests/haefliger/cassis", c("04042020.html", "17092020.html", "13102020.html"))
d <- lapply(f, read_html, skip = 7, remove.empty = TRUE, trim = TRUE)
typeof(d)
# This gives me a list and works fine.
# Here I parse the full path names in order to name the list elements.
names(d) <- gsub(".*/(.*)\\..*", "\\1", f)
# Here I turn the list into a character vector.
d_unlisted <- unlist(d)
From here I would go on to convert the list into a corpus. However, when I unlist d, the per-document structure is lost: the corpus is no longer divided by document (1, 2, 3) but by the combined rows/lines of all the documents. SA therefore doesn't work properly, since it analyses each line/row instead of each document separately. How can I a) "remove the lines" and merge everything into one row per document, and b) make sure the documents are preserved when converted to a corpus? c) The only reason I unlist in the first place is that only character vectors can be converted into corpora. Is there a way to avoid unlisting but still convert to a corpus? (What I imagine I need is sketched below, after my current attempt.)
d_corpus <- corpus(d_unlisted)
summary(d_corpus)
show(d_corpus)
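
What I imagine I need is something like the following sketch. This assumes quanteda's corpus() (which I believe accepts a named character vector) and that simply pasting each document's paragraphs together with collapse is an acceptable way to get one row per document; I'm not sure this is the right or idiomatic approach.

library(quanteda)

# Collapse each document's paragraphs into a single string,
# so one list element becomes one row = one document.
d_collapsed <- vapply(d, paste, character(1), collapse = " ")

# The names set earlier are kept, so the corpus should stay divided by document.
d_corpus <- corpus(d_collapsed)
summary(d_corpus)   # should report 3 documents, not one per paragraph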
Thanks for any help for a desperate newbie!
Edit: Basically, the problem arises the moment I read the HTML file(s). For example, when I read a single HTML file with read_html, its length is already 36 (the number of paragraphs in the original article). Unfortunately, there is no way for me to download the HTML files as txt.
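
For illustration, this is what I see for a single file, plus one alternative I have been wondering about (it is just an assumption on my part that xml2/rvest would be acceptable here; html_text() on the parsed page seems to return the whole article as a single string):

one <- read_html(f[1], skip = 7, remove.empty = TRUE, trim = TRUE)
length(one)   # 36 in my case -- one element per paragraph of the article

# Possible alternative (assumption): parse the raw HTML with xml2/rvest instead,
# which returns the full text of the page as a single string per file.
one_string <- rvest::html_text(xml2::read_html(f[1]))
length(one_string)   # 1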