I'm conducting a web-scraping project, but I've run into issues with the code. I get this error when running the code below, specifically at the sapply call:

Timeout was reached: [historico.presidencia.gov.co] Connection timed out after 10000 milliseconds
I assume it's because the scraping loop hits the time limit before it can finish. Usually I can brute-force the program into working, but the issue has been so consistent that it has become quite frustrating. I have tried using the timeout() function from httr, but I have had no success.
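What I tried looked roughly like this (reconstructed from memory; the exact arguments may have differed):

# My httr attempt (roughly): raise the timeout, then parse the response body
resp <- httr::GET(url, httr::timeout(60))   # url = one of the speech URLs built below
page <- read_html(httr::content(resp, as = "text", encoding = "UTF-8"))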
Any help is appreciated.
library(rvest)
library(tidyverse)
library(httr)
months <- tolower(c("Enero","Febrero","Marzo","Abril","Mayo","Junio","Julio","Agosto","Septiembre","Octubre","Noviembre","Diciembre"))
Uribe_index_2003_urls <- paste0("http://historico.presidencia.gov.co/discursos/discursos2003/",
                                months, "/", months, "2003.htm")
search_month <- 7   # July
url_one <- read_html(Uribe_index_2003_urls[search_month])   # month index page
url_two <- html_nodes(url_one, "a.tituloscentro")           # links to individual speeches
url_three <- html_attr(url_two, "href")
url_four <- url_three[which(!str_detect(url_three, "^\\."))]   # drop relative "./" links
speech_url <- paste0("http://historico.presidencia.gov.co/discursos/discursos2003/", months[search_month], "/", url_four)
speech_url
sample_url <- speech_url[2]   # one speech URL for testing
get_speech <- function(sample_url) {
  one <- read_html(sample_url)                        # fetch the speech page
  two <- html_nodes(one, "p.parrafos")                # speech paragraphs
  three <- html_text(two)
  four <- str_replace_all(three, "(\r|\n\\s*)", "")   # strip carriage returns and line breaks
  five <- paste(four, collapse = " ")                 # collapse into one string
  Sys.sleep(2)                                        # pause between requests
  return(five)
}
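A single call like this usually completes fine; it's only over the full loop that the timeouts pile up:

get_speech(sample_url)   # returns the full text of one speech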
##### Here is where things become problematic. Once it does complete, however, everything is fine; it's the sapply that is the problem.
speeches_July_2003 <- sapply(speech_url, get_speech)
#####
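One workaround I've been sketching (untested, and get_speech_safely is just a name I made up) wraps each request in tryCatch with a few retries and longer curl timeouts, so a single slow response doesn't abort the whole sapply:

# Untested sketch: retry each URL a few times with longer timeouts,
# returning NA_character_ if every attempt fails.
# get_speech_safely is a hypothetical helper, not from any package.
get_speech_safely <- function(url, retries = 3) {
  for (i in seq_len(retries)) {
    result <- tryCatch({
      resp <- httr::GET(url,
                        httr::timeout(60),                   # total request timeout
                        httr::config(connecttimeout = 60))   # connection-phase timeout (the error mentions the connection timing out)
      page <- read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
      text <- html_text(html_nodes(page, "p.parrafos"))
      paste(str_replace_all(text, "(\r|\n\\s*)", ""), collapse = " ")
    }, error = function(e) NULL)
    if (!is.null(result)) {
      Sys.sleep(2)          # keep the politeness delay between requests
      return(result)
    }
  }
  NA_character_             # give up on this URL after all retries
}
speeches_July_2003 <- sapply(speech_url, get_speech_safely)   # would replace the sapply above

I'm not sure whether retrying like this is the right approach, though.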
speeches_2003_July.df <- data.frame("year" = 2003, "month" = search_month, "text" = speeches_July_2003)