Improving the names of the output PDFs when scraping with rvest

Rony · December 15, 2019, 3:02pm

Hi R community,
I scraped a site with rvest, but the names of the resulting PDFs (the outputs) were not good. I am trying to improve them and would appreciate some tips, if possible.

In the end, I would like the PDFs to be named something like "year_names.pdf", for example
"2018_see-mg-fumarc.pdf".
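To make the target format concrete, here is a minimal sketch of how such a file name could be assembled once a year and a name are available. The helper `make_pdf_name` is hypothetical (it is not part of rvest or stringr) and uses base R only.

```r
# Hypothetical helper: build "<year>_<name>.pdf" from two character vectors.
make_pdf_name <- function(year, name) {
  # Drop any trailing hyphens or underscores so the name does not end in "-.pdf"
  name <- sub("[-_]+$", "", name)
  paste0(year, "_", name, ".pdf")
}

make_pdf_name("2018", "see-mg-fumarc")
# "2018_see-mg-fumarc.pdf"
```

Because `paste0` is vectorized, the same call works on whole columns of a data frame.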

Here is what I have tried. The problem is that I could not carry the columns "year" and "names", created in the first loop, over into the second loop. Is that possible?

Main pages

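library(rvest)    # read_html(), html_nodes(), html_attr()
library(stringr)  # str_remove_all(), str_extract()

url <- "https://www.pciconcursos.com.br/provas/professor-de-sociologia/"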

urls_main_pages <- c(url, paste0(url, 2:3))

main_dt <- data.frame()

# URL prefixes to strip from the links in order to form the names
pattern <- "https://www.pciconcursos.com.br/provas/download/professor-de-sociologia-"
pattern_2 <- "https://www.pciconcursos.com.br/provas/download/professor-de-educacao-basica-sociologia-"
pattern_3 <- "https://www.pciconcursos.com.br/provas/download/professor-de-"
pattern_4 <- "https://www.pciconcursos.com.br/provas/download/professor-auxiliar-"
pattern_5 <- "https://www.pciconcursos.com.br/provas/download/prova-professor-de-sociologia-"
pattern_6 <- "https://www.pciconcursos.com.br/provas/download/professor-educacao-basica-ii-de-"

# get links of the main pages 
for(i in seq_along(urls_main_pages)){
  print(i)
  pages_html <- read_html(urls_main_pages[i])
  nodes <- html_nodes(pages_html, '.prova_download')
  links <- html_attr(nodes, "href") 
  main_dt <- rbind(main_dt, cbind(links))
  # extract names and years
  names <- str_remove_all(main_dt$links, pattern)
  names <- str_remove_all(names, pattern_2)
  names <- str_remove_all(names, pattern_3)
  names <- str_remove_all(names, pattern_4)
  names <- str_remove_all(names, pattern_5)
  names <- str_remove_all(names, pattern_6)
  year <- str_extract(names, "\\d{4}")  # first four-digit run, taken as the year
  names <- str_remove_all(names, "-\\d+")
  # Here I have created a data frame with year and names, which
  # I would like to keep and use in the names of the PDFs in the end.
  main_dt_2 <- cbind(main_dt, year, names)
}
  
main_dt_2$links <- as.character(main_dt_2$links)
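The name and year extraction above does not depend on the page being read, so it could also run once, after all links are collected, instead of inside the loop. The following is a minimal sketch of that idea in base R only, so that it runs without extra packages. The function `extract_year_name` is hypothetical, and the sample link is an assumption assembled from one of the prefixes above and the example name "see-mg-fumarc"; it is not taken from the site.

```r
# Hypothetical sample link (assumed shape: prefix + name + "-" + year)
links <- "https://www.pciconcursos.com.br/provas/download/professor-de-sociologia-see-mg-fumarc-2018"

# Prefixes are stripped in this order, so the more specific one comes first
prefixes <- c(
  "https://www.pciconcursos.com.br/provas/download/professor-de-sociologia-",
  "https://www.pciconcursos.com.br/provas/download/professor-de-"
)

extract_year_name <- function(links, prefixes) {
  stem <- links
  for (p in prefixes) {
    stem <- sub(p, "", stem, fixed = TRUE)  # literal match, no regex
  }
  # First run of four digits is taken as the year; NA when there is none
  m <- regexpr("[0-9]{4}", stem)
  year <- rep(NA_character_, length(stem))
  year[m > 0] <- regmatches(stem, m)
  # Remove every "-<digits>" group to leave only the name part
  name <- gsub("-[0-9]+", "", stem)
  data.frame(links = links, year = year, names = name,
             stringsAsFactors = FALSE)
}

main_dt_2 <- extract_year_name(links, prefixes)
```

Setting `stringsAsFactors = FALSE` keeps `links` as character, so no later `as.character()` conversion is needed.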

Children pages

links_pdf <- data.frame()

for(i in seq_along(main_dt_2$links)){
  print(i)
  link_page <- read_html(main_dt_2$links[i])
  link_page <- html_nodes(link_page, xpath = '//*[@id="download"]/ul[3]')
  link_page <- html_nodes(link_page, 'a')
  link_page <- html_attr(link_page, "href")
  links_pdf <- rbind(links_pdf, cbind(link_page))
  # Here, or somewhere inside this loop, I would like to
  # join the columns year and names from main_dt_2 to links_pdf,
  # in order to use them in the names of the PDFs in the end
}
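One possible way to keep "year" and "names" next to each PDF link is to build a small data frame per parent page, repeating that page's year and name for every link found on it, and to bind the pieces at the end. The sketch below is only an illustration under stated assumptions: `get_pdf_links` is a hypothetical stand-in for the `read_html()` / `html_nodes()` / `html_attr()` steps in the loop above, and the rows of `main_dt_2` are made-up sample data so that the code runs without network access.

```r
# Made-up sample of main_dt_2 (assumption, not real scraped data)
main_dt_2 <- data.frame(
  links = c("page-a", "page-b"),
  year  = c("2018", "2019"),
  names = c("see-mg-fumarc", "exemplo"),
  stringsAsFactors = FALSE
)

# Hypothetical stub: in real use this would read the page and return the hrefs
get_pdf_links <- function(page_url) {
  paste0(page_url, c("/prova.pdf", "/gabarito.pdf"))
}

rows <- lapply(seq_len(nrow(main_dt_2)), function(i) {
  pdfs <- get_pdf_links(main_dt_2$links[i])
  if (length(pdfs) == 0) return(NULL)  # page without PDF links
  # year and names are recycled to match the number of links on this page
  data.frame(link_page = pdfs,
             year  = main_dt_2$year[i],
             names = main_dt_2$names[i],
             stringsAsFactors = FALSE)
})
links_pdf <- do.call(rbind, rows)

# A page can hold several PDFs, so add a running index per year/name pair
idx <- ave(seq_len(nrow(links_pdf)), links_pdf$year, links_pdf$names,
           FUN = seq_along)
links_pdf$file <- paste0(links_pdf$year, "_", links_pdf$names, "_", idx, ".pdf")
```

The `file` column could then be passed as the destination in something like `download.file(links_pdf$link_page[i], links_pdf$file[i], mode = "wb")`. The running index is a design choice to avoid two PDFs from the same page overwriting each other; it can be dropped if each page has exactly one PDF.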

Thanks in advance and happy coding,
