I'm doing a basic web-scraping exercise for myself, extracting State of the Union addresses from this website.
My code to get what I need looks like this:
library(rvest)
library(dplyr)
library(tidyr)
library(qdap)
# load webpage
pres.library <- read_html(x = "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union")
# get URL of links
links <- pres.library %>%
  html_nodes("span a , td~ td+ td a") %>%
  html_attr("href")
# get link text
text <- pres.library %>%
  html_nodes("span a , td~ td+ td a") %>%
  html_text()
# combine into df
sotu <- data.frame(text = text, links = links, stringsAsFactors = FALSE)
After cleaning, the data frame has one row per State of the Union address and looks like this:
Year links President Party
1 2020 https://www.presidency.ucsb.edu/documents/address-before-joint-session-the-congress-the-state-the-union-27 Donald J. Trump Republican
2 2019 https://www.presidency.ucsb.edu/documents/address-before-joint-session-the-congress-the-state-the-union-26 Donald J. Trump Republican
3 2018 https://www.presidency.ucsb.edu/documents/address-before-joint-session-the-congress-the-state-the-union-25 Donald J. Trump Republican
4 2017 https://www.presidency.ucsb.edu/ws/index.php?pid=123408 Donald J. Trump Republican
5 2016 https://www.presidency.ucsb.edu/ws/index.php?pid=111174 Barack Obama Democratic
6 2015 https://www.presidency.ucsb.edu/ws/index.php?pid=108031 Barack Obama Democratic
...
I then loop through the data frame to extract the text of each address, using this:
for (i in seq(nrow(sotu))) {
  sotu.text <- read_html(paste0(x = "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union"),sotu$links[i]) %>%
    html_nodes("span a , td~ td+ td a") %>%
    html_text()
  filename <- paste0("State of the Union", " ", sotu$Year[i], " ", sotu$President[i], " ", sotu$Party[i], ".txt")
  sink(file = filename) %>%
    cat(text) %>%
    sink()
}
The .txt files show up in my directory, but for some reason the contents of every file are just the nodes I selected with the SelectorGadget tool in Chrome (the link text from the index page) rather than the text of the address itself.
I suspect it has something to do with the "href" argument I pass to html_attr, but I'm not sure.
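For reference, here is a minimal sketch of what I think the loop is meant to do. It rests on assumptions I have not verified: that sotu$links already holds complete URLs (as in the table above, so nothing needs to be pasted in front of them), that the older ws/index.php?pid=... links still resolve when read_html requests them, and that the body of each address sits in paragraphs matched by ".field-docs-content p". That selector is a guess and should be checked against a real address page. The names speech_selector, extract_speech, sotu_filename and save_all_speeches are made up for this example. The sketch also writes each speech with writeLines instead of piping through sink and cat, since cat(text) in my loop refers to the link-text vector created earlier and not to sotu.text.

```r
library(rvest)

# Hypothetical selector for the speech body; check it against a real address
# page (e.g. with SelectorGadget) before relying on it.
speech_selector <- ".field-docs-content p"

# Pull the text of the nodes matched by `selector` out of an already-parsed page.
extract_speech <- function(page, selector = speech_selector) {
  page %>%
    html_nodes(selector) %>%
    html_text(trim = TRUE)
}

# Build the output file name from one row of the sotu data frame.
sotu_filename <- function(year, president, party) {
  paste0("State of the Union ", year, " ", president, " ", party, ".txt")
}

# Download each address and write it to its own file.
# sotu$links is assumed to contain absolute URLs, so each one is read directly.
save_all_speeches <- function(sotu, dir = ".") {
  for (i in seq_len(nrow(sotu))) {
    speech <- extract_speech(read_html(sotu$links[i]))
    path <- file.path(dir, sotu_filename(sotu$Year[i], sotu$President[i], sotu$Party[i]))
    writeLines(speech, con = path)  # writes this speech, not the earlier `text` vector
  }
  invisible(NULL)
}
```

With the cleaned data frame in place, this would be called as save_all_speeches(sotu). Keeping the extraction step in its own function makes it possible to try the selector on a single page before running the whole loop.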
Any help is greatly appreciated.
Thank you!