Web scraping multiple pages

Hello guys, I'm trying to scrape comments from multiple pages, but it appears only one of the pages is being scraped. Any help?

library(rvest)
library(tidyverse)

#Get the URL for all the pages
page1 <- read_html("https://www.nairaland.com/search?q=Gtbank&board=0") %>% 
  html_nodes("table+ p") 
  html_nodes("table+ p") 
page1[[1]]

#Get the page number
page_number <- html_text(page1)
page_number


#Get the URL of each page
page_url <- read_html("https://www.nairaland.com/search?q=Gtbank&board=0") %>% 
  html_nodes("table+ p") %>% 
  html_nodes("a") %>% 
  html_attr("href")

page_url

#Comment tibble
comment <- tibble()
for(i in 1:length(page_url[1:3])){
  comments <- page_url[i] %>%
    read_html() %>% 
    html_nodes(".pd") %>%
    html_text()
  # pause so we don't get banned!
  Sys.sleep(1)
}

Each iteration of the for loop completely overwrites comments, so by the end comments only contains the data from the third URL in page_url.
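If you'd rather keep the loop, the fix is to append each page's results instead of reassigning, for example with bind_rows() (a quick sketch along the lines of your code, not re-run against the site):

comments <- tibble()
for (url in page_url[1:3]) {
  page_comments <- url %>%
    read_html() %>% 
    html_nodes(".pd") %>%
    html_text()
  # append this page's comments instead of overwriting the previous ones
  comments <- bind_rows(comments, tibble(comment = page_comments))
  # pause so we don't get banned!
  Sys.sleep(1)
}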

If you'd rather avoid the loop altogether, you can use the purrr function map_dfr() to collect all the comments into one tibble:

library(rvest)
#> Loading required package: xml2
library(tidyverse)

#Get the URL of each page
page_url <- read_html("https://www.nairaland.com/search?q=Gtbank&board=0") %>% 
  html_nodes("table+ p") %>% 
  html_nodes("a") %>% 
  html_attr("href")

#Read one page and return its comments as a one-column tibble
read_comments <- function(url) {
  tibble(
    comment = 
      url %>%
      read_html() %>% 
      html_nodes(".pd") %>%
      html_text() 
  )
}

#Scrape the first three pages and label each comment with its page number
comments <-  
  page_url[1:3] %>% 
  set_names(1:3) %>% 
  map_dfr(read_comments, .id = "page_number")

Created on 2019-12-06 by the reprex package (v0.2.1)
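To double-check that all three pages were scraped, you can count the comments per page (a quick sanity check on the comments tibble above, not part of the reprex):

#Number of comments scraped from each page
comments %>% 
  count(page_number)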
(You also create a comment tibble, but then store the comments in comments. Not sure if that was a typo or if you had other plans for comment :slightly_smiling_face:)


Wow! Thanks, I think I understand where I got it wrong, and you even made each comment sync with its page. Haha, you're amazing. :sunglasses:
