Web Scraping Wikipedia Help

liam.monahan.tx · April 15, 2020, 6:54pm

Hi,
I'm new to web scraping and am running into problems scraping Wikipedia pages related to the name Liam. I'm checking those pages for the words Irish, Ireland, and Catholic. I think the code works up to the line Liam_urls <- paste0("https://en.wikipedia.org",Liam_urls), but I could be wrong. After that I get one of these error messages:
Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
or
Error in function (type, msg, asError = TRUE) : <url> malformed
How should I adjust my code?
Thanks for your help.

library(RCurl)
library(rvest)
library(stringr)
html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- Liam_urls[which(!str_detect(Liam_urls, "https"))]
Liam_urls
Liam_urls <- paste0("https://en.wikipedia.org",Liam_urls)
scraped_Liam <- sapply(Liam_urls, function(x) getURL(x))
results_Liam <- sapply(scraped_Liam, function(x) str_detect(x,"Irish|Ireland|Catholic"))
results_Liam.df <- data.frame("Hit"=results_Liam, stringsAsFactors = FALSE)
length(results_Liam.df$Hit[which(results_Liam.df$Hit==TRUE)])/length(results_Liam.df$Hit)

nirgrahamuk · April 16, 2020, 8:45am

I tried your code and got the same SSL-related error. I believe RCurl's getURL cannot negotiate a secure connection with the Wikipedia domain, though I'm happy to be corrected by anyone on that point.
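One way to look into this, as a rough sketch rather than a confirmed diagnosis: the "tlsv1 alert protocol version" alert generally means the server rejected the TLS version the client offered, so it can help to compare the SSL backend each package's libcurl was built against. The snippet assumes both RCurl and curl (the package httr uses underneath) are installed.

```r
# Report the SSL/TLS backend behind each package's libcurl.
# Assumption: RCurl and curl are both installed on this machine.
rcurl_ssl <- RCurl::curlVersion()$ssl_version
curl_ssl  <- curl::curl_version()$ssl_version

cat("RCurl libcurl SSL backend:", rcurl_ssl, "\n")
cat("curl  libcurl SSL backend:", curl_ssl, "\n")
```

If the two lines differ, with RCurl reporting a noticeably older backend, that would be consistent with the error, but I have not verified it on your setup.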
I switched to the httr library and adjusted the rest of the code accordingly:

# Replace RCurl with httr
library(httr)
library(rvest)
library(stringr)
# html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- Liam_urls[which(!str_detect(Liam_urls, "https"))]
Liam_urls
Liam_urls <- paste0("https://en.wikipedia.org",Liam_urls)
# GET() returns response objects, so keep them in a list with lapply()
scraped_Liam <- lapply(Liam_urls, function(x) GET(x))
# Each response object needs its content extracted as text before searching it
results_Liam <- sapply(scraped_Liam, function(x) str_detect(content(x, as = "text", encoding = "UTF-8"), "Irish|Ireland|Catholic"))
results_Liam.df <- data.frame("Hit"=results_Liam, stringsAsFactors = FALSE)
# Proportion of pages that mention at least one of the terms
mean(results_Liam.df$Hit)
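
As written, a single failed connection inside lapply stops the whole run. Below is a minimal sketch of a more defensive variant. The helper names fetch_text and hit_rate are my own, not part of httr or stringr, and treating failed pages as missing rather than as non-hits is an assumption you may want to change.

```r
library(httr)
library(stringr)

# Hypothetical helper: return the page text, or NA if the request fails
# or the server answers with an HTTP error status.
fetch_text <- function(url) {
  tryCatch({
    resp <- GET(url)
    if (http_error(resp)) {
      NA_character_
    } else {
      content(resp, as = "text", encoding = "UTF-8")
    }
  }, error = function(e) NA_character_)
}

# Hypothetical helper: share of pages whose text matches the pattern.
# Pages that could not be fetched (NA) are dropped from the calculation.
hit_rate <- function(pages, pattern = "Irish|Ireland|Catholic") {
  mean(str_detect(pages, pattern), na.rm = TRUE)
}

# Usage with the Liam_urls vector built above (needs network access):
# pages <- vapply(Liam_urls, fetch_text, character(1))
# hit_rate(pages)
```

Because hit_rate works on plain character vectors, you can try it on a few hand-written strings before sending any requests.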
