Using rvest to scrape web pages

I am new to R and learning to scrape web pages. I am trying to scrape all the user reviews, which span three pages, for a deprecated WordPress plugin. Here is my code so far:

#load the rvest package
library(rvest)

#specify the first page URL
fpURL <- 'https://wordpress.org/support/plugin/easyrecipe/reviews/'

#read the HTML contents in the first page URL
contentfpURL <- read_html(fpURL)

#identify the anchor tags in the first page URL
fpAnchors <- html_nodes(contentfpURL, css='a.bbp-topic-permalink')

#extract the HREF attribute value of each anchor tag
fpHREF <- html_attr(fpAnchors, 'href')

#create empty vectors to store the titles & contents scraped from each review URL
titles = c()
contents = c()

#loop the following actions for each HREF found on the first page
for (u in fpHREF) {
  
  #read the HTML content of the review page
  fpURL = read_html(u)
  
  #identify the title anchor and read the title text  
  fpreviewT = html_text(html_nodes(fpURL, css='h1.page-title'))

  #identify the content anchor and read the content text
  fpreviewC = html_text(html_nodes(fpURL, css='div.bbp-topic-content'))

  #store the review titles and contents in the vectors created above
  titles = c(titles, fpreviewT)
  contents = c(contents, fpreviewC)
}
#identify the anchor tag pointing to the next summary page
npAnchor <- html_text(html_nodes(contentfpURL, css='a.next page-numbers'))

#extract the HREF attribute value of the anchor tag pointing to the next summary page
npHREF <- html_attr(npAnchor, 'href')

#loop the following actions for every next summary page HREF attribute
for (u in npHREF) {
  #read the HTML contents of the summary page
  spURL <- read_html('npHREF')

  #identify all the anchor tags on that summary page
  spAnchors <- html_nodes(spURL, css='a.bbp-topic-permalink')

  #extract the HREF attribute value of each anchor tag
  spHREF <- html_attr(spAnchors, 'href')

    #loop the following actions for each HREF found on that summary page
    for (u in fpHREF) {
  
      #read the HTML contents of the review page
      spURL = read_html(u)

      #identify the title anchor and read the title text  
      spreviewT = html_text(html_nodes(spURL, css='h1.page-title'))

      #identify the content anchor and read the content text
      spreviewC = html_text(html_nodes(spURL, css='div.bbp-topic-content'))
      
      #store the review titles and contents in the vectors created earlier
      titles = c(titles, spreviewT)
      contents = c(contents, spreviewC)
      }
}

However, my code does not work, and I am not sure what I am doing wrong. Maybe it's the nested loops?

I would appreciate some help. Thank you.

Hi @Maybellyne,

Welcome to RStudio Community :partying_face: :partying_face: :partying_face:

You use for loops very proficiently, which gives away the fact that even though you are a newbie in R, you already have a solid background in at least one other programming language. That said, it is important to know that even though loops can generally be used in R as they are in other languages, R has its own paradigm for repetitive tasks. The general rule is that whenever you want to use a loop in R, there is probably a different way of going about things. Of course, you cannot avoid loops entirely, and nothing stops you from using them. However, I would encourage you to learn about the concept of vectorization as well as the apply() family of functions (e.g. lapply(), sapply(), mapply(), ...). In the code below, I use a different family of functions (the map() family from the purrr package) because of the convenience it brings; the end result is the same.
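As a quick illustration (a toy example unrelated to the scraping task), here is how both families let you replace an explicit loop with a single function call:

# square the numbers 1 to 5 without writing a for loop
squares_apply <- sapply(1:5, function(x) x^2)   # base R apply family
squares_map <- purrr::map_dbl(1:5, ~ .x^2)      # purrr map family

identical(squares_apply, squares_map)
[1] TRUE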

If you wish to understand some parts of the code, do not hesitate to let me know. At the end of the code, I also show you the end result of the scraping process.

# Load needed libraries ----

library(rvest)
library(dplyr)
library(stringr)
library(purrr)


# Build functions for scraping reviews ----

# > Function for scraping topic names and their urls 


scrape_topic_url <- function(page_url){
	
	page_html <- read_html(page_url)
	
	topic_names <- page_html %>%
		html_nodes(css = ".bbp-topic-permalink") %>%
		html_text() %>%
		str_squish()
	
	topic_urls <- page_html %>%
		html_nodes(css = ".bbp-topic-permalink") %>%
		html_attr(name = "href")
	
	tibble(topic = topic_names, topic_url = topic_urls)
	
}


# > Function for scraping the comments/reviews

topic_url <- master$topic_url[1]

scrape_topic_thread <- function(topic_url){
	
	topic_html <- read_html(topic_url)
	
	topic_html %>%
		html_nodes(css = ".bbp-topic-content") %>%
		html_text() %>%
		str_squish()
	
}


# Perform the scraping task ----

page_urls <- c("https://wordpress.org/support/plugin/easyrecipe/reviews/", paste0("https://wordpress.org/support/plugin/easyrecipe/reviews/", 2:3))

master <- map_dfr(page_urls, scrape_topic_url) %>%
	mutate(content = map_chr(topic_url, scrape_topic_thread))

master

# A tibble: 90 x 3
   topic                        topic_url                                       content                                                                  
   <chr>                        <chr>                                           <chr>                                                                    
 1 Terrible                     https://wordpress.org/support/topic/terrible-1~ "If I could give zero stars, I would. They have failed to answer any sup~
 2 Not working with WP 5.03     https://wordpress.org/support/topic/not-workin~ "I’m very disappointed that Easy Recipe hasn’t updated to be compatible ~
 3 easyrecipe plugin            https://wordpress.org/support/topic/easyrecipe~ "Have been using for years. Love this plugin"                            
 4 Don’t do it! ZERO support f~ https://wordpress.org/support/topic/dont-do-it~ "Bought slEazyRecipe Plus for the extra features, updates, and support. ~
 5 Worst Ever Don’t Buy Premium https://wordpress.org/support/topic/worst-ever~ "I’ve sent several repeated emails to get the updated version after buyi~
 6 Does not work                https://wordpress.org/support/topic/does-not-w~ "I am already using Recipe taxonomy on my website and plugin does not wo~
 7 Stay away! Don’t buy the PR~ https://wordpress.org/support/topic/stay-away-~ "There is no support and the author is using this free plugin to scam pe~
 8 After Using this for 3 Yrs ~ https://wordpress.org/support/topic/after-usin~ "I have been using the paid version of this plugin for three years – my ~
 9 The worst support ever       https://wordpress.org/support/topic/the-worst-~ "Support tickets are not being red. Bought the plugin twice! Licence key~
10 No Support and No Longer Up~ https://wordpress.org/support/topic/no-support~ "I’m not sure what’s happened because this plugin used to be great. When~
# ... with 80 more rows

If the code I sent you works for you, please do not forget to mark it as the solution in order to help someone else in the future.

@gueyenono Many thanks for taking the time to help. I also took some time to study pipes, apply() and map(). My questions:

  • Why do we need paste0()? I understand it concatenates vectors after converting to character. Is concatenation necessary here?
  • What does 2:3 do?
  • I ran topic_url <- master$topic_url[1] and got an error: object 'master' not found. I am not sure what that line does.

Furthermore, I was expecting 63 rows but got 90, so I exported the data frame to a CSV. Apparently, some URLs from other plugins' reviews were scraped as well. I have no idea why this happened, considering only one URL was specified. Only the first 30 rows (page 1) are correct; nothing was scraped from pages 2 and 3.

Thank you.

@Maybellyne, you are perfectly right. There should be 63 rows and I corrected the code - it works now.

Questions 1 and 2

The goal is to scrape 3 web pages, so we need a way to generate the 3 URLs, and that is exactly what this code does (this is the code that I corrected: my original paste0() prefix was missing the page/ segment, so the URLs generated for pages 2 and 3 did not point to the review pages at all):

page_urls <- c("https://wordpress.org/support/plugin/easyrecipe/reviews/", paste0("https://wordpress.org/support/plugin/easyrecipe/reviews/page/", 2:3))

It's primarily a call to the c() function. The paste0() function is called inside the c() function in order to generate the URLs for page 2 and page 3 (hence the use of 2:3). If you want to do it manually, this is how you would do it:

c("https://wordpress.org/support/plugin/easyrecipe/reviews/", "https://wordpress.org/support/plugin/easyrecipe/reviews/page/2", "https://wordpress.org/support/plugin/easyrecipe/reviews/page/3")

which you would agree could be very tedious if you had more than 3 web pages to scrape.
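You can also see the vectorization at work by running the paste0() call on its own; the single prefix is recycled across the vector 2:3 (which is just shorthand for c(2, 3)):

paste0("https://wordpress.org/support/plugin/easyrecipe/reviews/page/", 2:3)
[1] "https://wordpress.org/support/plugin/easyrecipe/reviews/page/2"
[2] "https://wordpress.org/support/plugin/easyrecipe/reviews/page/3"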

Question 3

You are right. That is code I used for personal testing, and I forgot to remove it from the final script before pasting it here. You can completely disregard it.

So this is the updated code as well as the final result. Once again, do not hesitate if you have more questions:

# Load needed libraries ----

library(rvest)
library(dplyr)
library(stringr)
library(purrr)


# Build functions for scraping reviews ----

# > Function for scraping topic names and their urls 


scrape_topic_url <- function(page_url){
	
	page_html <- read_html(page_url)
	
	topic_names <- page_html %>%
		html_nodes(css = ".bbp-topic-permalink") %>%
		html_text() %>%
		str_squish()
	
	topic_urls <- page_html %>%
		html_nodes(css = ".bbp-topic-permalink") %>%
		html_attr(name = "href")
	
	tibble(topic = topic_names, topic_url = topic_urls)
	
}


# > Function for scraping the comments/reviews

scrape_topic_thread <- function(topic_url){
	
	topic_html <- read_html(topic_url)
	
	topic_html %>%
		html_nodes(css = ".bbp-topic-content") %>%
		html_text() %>%
		str_squish()
	
}


# Perform the scraping task ----

page_urls <- c("https://wordpress.org/support/plugin/easyrecipe/reviews/", paste0("https://wordpress.org/support/plugin/easyrecipe/reviews/page/", 2:3))

# map_dfr() applies scrape_topic_url() to each page URL and row-binds the
# resulting tibbles; map_chr() then applies scrape_topic_thread() to each
# topic URL and adds the review text as a new column
master <- map_dfr(page_urls, scrape_topic_url) %>%
	mutate(content = map_chr(topic_url, scrape_topic_thread))

master

# A tibble: 63 x 3
   topic                          topic_url                                          content                                                                                    
   <chr>                          <chr>                                              <chr>                                                                                      
 1 Terrible                       https://wordpress.org/support/topic/terrible-161/  "If I could give zero stars, I would. They have failed to answer any support questions I h~
 2 Not working with WP 5.03       https://wordpress.org/support/topic/not-working-w~ "I’m very disappointed that Easy Recipe hasn’t updated to be compatible with WP 5.03. Limi~
 3 easyrecipe plugin              https://wordpress.org/support/topic/easyrecipe-pl~ "Have been using for years. Love this plugin"                                              
 4 Don’t do it! ZERO support for~ https://wordpress.org/support/topic/dont-do-it-ze~ "Bought slEazyRecipe Plus for the extra features, updates, and support. The plugin update ~
 5 Worst Ever Don’t Buy Premium   https://wordpress.org/support/topic/worst-ever-do~ "I’ve sent several repeated emails to get the updated version after buying the premium. Th~
 6 Does not work                  https://wordpress.org/support/topic/does-not-work~ "I am already using Recipe taxonomy on my website and plugin does not work =("             
 7 Stay away! Don’t buy the PRO ~ https://wordpress.org/support/topic/stay-away-don~ "There is no support and the author is using this free plugin to scam people into buying t~
 8 After Using this for 3 Yrs Be~ https://wordpress.org/support/topic/after-using-t~ "I have been using the paid version of this plugin for three years – my original developer~
 9 The worst support ever         https://wordpress.org/support/topic/the-worst-sup~ "Support tickets are not being red. Bought the plugin twice! Licence key is not shown in e~
10 No Support and No Longer Upda~ https://wordpress.org/support/topic/no-support-an~ "I’m not sure what’s happened because this plugin used to be great. When you submitted a s~
# ... with 53 more rows

Thanks @gueyenono!!! It works. :dancer:t3: :dancer:t3: :dancer:t3:

@Maybellyne Glad to hear. Once again, do not forget to mark the correct code as the solution in order to help potential future readers.


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.