When scraping the desired webpages, the content of all the .txt files is just the nodes I selected.

I'm doing a basic web-scraping exercise for myself, extracting State of the Union addresses from this website (the URL is in the code below).

My code to get what I need looks like this:


library(rvest)
library(dplyr)
library(tidyr)
library(qdap)



# load the webpage
pres.library <- read_html(x = "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union")

# get the URL of each link
links <- pres.library %>%
  html_nodes("span a , td~ td+ td a") %>%
  html_attr("href")

# get the link text
text <- pres.library %>%
  html_nodes("span a , td~ td+ td a") %>%
  html_text()

# combine into a data frame
sotu <- data.frame(text = text, links = links, stringsAsFactors = FALSE)

After cleaning, the data frame looks like this (one row per State of the Union address):

Year                                                                                                      links             President                 Party
1   2020 https://www.presidency.ucsb.edu/documents/address-before-joint-session-the-congress-the-state-the-union-27       Donald J. Trump            Republican
2   2019 https://www.presidency.ucsb.edu/documents/address-before-joint-session-the-congress-the-state-the-union-26       Donald J. Trump            Republican
3   2018 https://www.presidency.ucsb.edu/documents/address-before-joint-session-the-congress-the-state-the-union-25       Donald J. Trump            Republican
4   2017                                                    https://www.presidency.ucsb.edu/ws/index.php?pid=123408       Donald J. Trump            Republican
5   2016                                                    https://www.presidency.ucsb.edu/ws/index.php?pid=111174          Barack Obama            Democratic
6   2015                                                    https://www.presidency.ucsb.edu/ws/index.php?pid=108031          Barack Obama            Democratic
...
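
For completeness: the cleaning step isn't shown above. As a rough, hypothetical sketch, assuming the scraped link text contains a four-digit year, the Year column could be derived like this (the President and Party columns would need a separate lookup, which I'm not showing):

library(stringr)

# hypothetical cleaning step: pull a four-digit year out of the link text;
# assumes each link's text actually contains the year of the address
sotu <- sotu %>%
  mutate(Year = str_extract(text, "\\d{4}")) %>%
  filter(!is.na(Year))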

I'm looping through my data frame to extract the text of each speech using this:

for (i in seq(nrow(sotu))) {
  sotu.text <- read_html(paste0(x = "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union"),sotu$links[i]) %>%
    html_nodes("span a , td~ td+ td a") %>%
    html_text()
  filename <- paste0("State of the Union", " ", sotu$Year[i], " ", sotu$President[i], " ", sotu$Party[i], ".txt")
  sink(file = filename) %>%
    cat(text) %>%
    sink()
}

The .txt files are created in my directory, but for some reason the content of every file is just the link text from the nodes I selected with the SelectorGadget tool in Chrome.

I believe it has something to do with the "href" attribute I'm passing to html_attr?

Any help is greatly appreciated.

Thank you!

I'm not sure exactly what you are trying to record.
Maybe add an explanation of what the content of the .txt files should be? A few observations in the meantime:

  1. In the loop, cat(text) is probably supposed to be cat(sotu.text). By referring to just text, you are referring to the vector defined above the loop (see the sketch after this list).

  2. html_text() on your selector would still return a vector with the text of ALL elements that match it, so each file would be the same and contain that whole vector.

  3. If you are looping through the data.frame, why are you re-scraping the index page?
    It seems like everything from that page is already scraped and inside your data.frame; the loop should be reading each speech link instead.
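
For illustration, here's a sketch of the loop with point 1 fixed and the sink()/cat() juggling replaced by a single writeLines() call. The selector below is only a placeholder: per point 2, you still need one that matches the speech body on each linked page.

for (i in seq(nrow(sotu))) {
  # follow each speech link instead of re-reading the index page
  sotu.text <- read_html(sotu$links[i]) %>%
    html_nodes(".your-speech-selector") %>%  # placeholder -- replace with the real speech-body selector
    html_text()
  filename <- paste0("State of the Union ", sotu$Year[i], " ",
                     sotu$President[i], " ", sotu$Party[i], ".txt")
  # write the scraped text straight to the file, no sink() needed
  writeLines(sotu.text, con = filename)
}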

Thank you for the input!

I am trying to extract the State of the Union speeches themselves.

The data frame only contains the links to the speeches, which is why I'm scraping again in the last part of the code. Does that make sense, or am I doing something wrong?

Found it

for (i in seq(nrow(sotu))) {
  # read each speech page and pull out the speech body
  sotu.text <- read_html(sotu$links[i]) %>%
    html_nodes(".col-sm-8") %>%
    html_text()
  filename <- paste0("State of the Union ", sotu$Year[i], " ",
                     sotu$President[i], " ", sotu$Party[i], ".txt")
  cat(sotu.text, file = filename, sep = "\n")
}

".col-sm-8" was the extraction node that was needed
