Hi R community,
I scraped a site with rvest, but the output PDFs ended up with unhelpful file names. I am trying to improve them and would appreciate some tips, if possible.
In the end, I would like the PDF names to look like "year_names.pdf", for example
"2018_see-mg-fumarc.pdf".
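To make the target concrete, here is a minimal sketch of the naming step in base R. The link below and the helper `make_pdf_name` are hypothetical, made up only for illustration; the real links may place the year elsewhere or contain other numbers.

```r
# Hypothetical helper: build "<year>_<name>.pdf" from a single link once
# the known URL prefix has been stripped. Base R only, no packages needed.
make_pdf_name <- function(link, prefix) {
  slug <- sub(prefix, "", link, fixed = TRUE)          # drop the URL prefix
  year <- regmatches(slug, regexpr("[0-9]{4}", slug))  # first 4-digit run
  name <- gsub("-[0-9]+", "", slug)                    # drop "-<digits>" parts
  paste0(year, "_", name, ".pdf")
}

# Made-up link, for illustration only
prefix <- "https://www.pciconcursos.com.br/provas/download/professor-de-sociologia-"
link <- paste0(prefix, "see-mg-fumarc-2018")
make_pdf_name(link, prefix)
```

The helper handles one link at a time and assumes the slug contains exactly one four-digit year.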
I've tried the code below. The problem is that I could not carry the columns "year" and "names", created in the first loop, over into the second loop. Is that possible?
Main pages
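library(rvest)    # read_html(), html_nodes(), html_attr()
library(stringr)  # str_remove_all(), str_extract_all()
url <- "https://www.pciconcursos.com.br/provas/professor-de-sociologia/"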
urls_main_pages <- c(url, paste0(url, 2:3))
main_dt <- data.frame()
# URL prefixes to strip from the links when forming the names
pattern <- "https://www.pciconcursos.com.br/provas/download/professor-de-sociologia-"
pattern_2 <- "https://www.pciconcursos.com.br/provas/download/professor-de-educacao-basica-sociologia-"
pattern_3 <- "https://www.pciconcursos.com.br/provas/download/professor-de-"
pattern_4 <- "https://www.pciconcursos.com.br/provas/download/professor-auxiliar-"
pattern_5 <- "https://www.pciconcursos.com.br/provas/download/prova-professor-de-sociologia-"
pattern_6 <- "https://www.pciconcursos.com.br/provas/download/professor-educacao-basica-ii-de-"
# get links of the main pages
for (i in seq_along(urls_main_pages)) {
  print(i)
  pages_html <- read_html(urls_main_pages[i])
  nodes <- html_nodes(pages_html, '.prova_download')
  links <- html_attr(nodes, "href")
  main_dt <- rbind(main_dt, cbind(links))
  # extract names and years (recomputed over all links collected so far)
  names <- str_remove_all(main_dt$links, pattern)
  names <- str_remove_all(names, pattern_2)
  names <- str_remove_all(names, pattern_3)
  names <- str_remove_all(names, pattern_4)
  names <- str_remove_all(names, pattern_5)
  names <- str_remove_all(names, pattern_6)
  year <- str_extract_all(names, "[\\d]{4}+", simplify = TRUE)
  names <- str_remove_all(names, "-[\\d]+")
  # here I have created a data frame with year and names, which
  # I would like to keep and use in the names of the PDFs in the end
  main_dt_2 <- cbind(main_dt, year, names)
}
main_dt_2$links <- as.character(main_dt_2$links)
Child pages
links_pdf <- data.frame()
for (i in seq_along(main_dt_2$links)) {
  print(i)
  link_page <- read_html(main_dt_2$links[i])
  link_page <- html_nodes(link_page, xpath = '//*[@id="download"]/ul[3]')
  link_page <- html_nodes(link_page, 'a')
  link_page <- html_attr(link_page, "href")
  links_pdf <- rbind(links_pdf, cbind(link_page))
  # Here, or somewhere inside this loop, I would like to
  # join the columns year and names from main_dt_2 to links_pdf,
  # so that I can use them in the names of the PDFs in the end
}
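What I am after could be sketched like this, without any scraping: each pass through the loop builds a small data frame that repeats that row's year and names for every PDF link found on the page. The data and `fake_pdf_links` are hypothetical stand-ins for `main_dt_2` and the rvest calls, so this only shows the intended shape of the result and is not a tested solution for the real site.

```r
# Hypothetical stand-in for main_dt_2 (made-up values)
main_dt_2 <- data.frame(
  links = c("page-a", "page-b"),
  year  = c("2018", "2016"),
  names = c("see-mg-fumarc", "example-board"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for read_html() + html_nodes() + html_attr():
# returns the PDF links found on one child page
fake_pdf_links <- function(page) paste0(page, c("/prova.pdf", "/gabarito.pdf"))

rows <- vector("list", nrow(main_dt_2))
for (i in seq_len(nrow(main_dt_2))) {
  link_page <- fake_pdf_links(main_dt_2$links[i])
  # year and names have length 1 here, so data.frame() recycles them
  # across all PDF links that came from this page
  rows[[i]] <- data.frame(
    link_page = link_page,
    year      = main_dt_2$year[i],
    names     = main_dt_2$names[i],
    stringsAsFactors = FALSE
  )
}
links_pdf <- do.call(rbind, rows)

# File name built from the carried-over columns
links_pdf$file <- paste0(links_pdf$year, "_", links_pdf$names, ".pdf")
```

In this sketch, two PDFs from the same page get the same file name, so an extra suffix would probably be needed to tell them apart.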
Thanks in advance and happy coding,