Rvest - first attempt at web scraping - how to deal with multiple pages and missing values

Piranha · August 23, 2018, 10:18pm

Hello,

I am newish to R and am trying to teach myself rvest for scraping web pages. For my first attempt, I thought I would try to scrape some product review information from Amazon.

Question: What are some good ways to deal with missing values so that the various components of a review are correctly aligned?

Hopefully, this example below is reproducible and explains my challenges.

Target url: https://www.amazon.com/product-reviews/B01LXJA5JD/ref=cm_cr_arp_d_viewopt_srt?_encoding=UTF8&showViewpoints=1&sortBy=recent&pageNumber=1

# here are the packages I am using
library(stringr)
library(rvest)
library(lubridate)
library(tidyverse)
library(tidytext)

First, I created a vector to deal with multiple pages (hopefully I did this right!). This product has 5,000+ reviews spread across 400+ pages.

# Their URL format becomes consistent from page 3 onwards, so I am focusing on those pages for now
pages <- c("https://www.amazon.com/Roku-Express-HD-Streaming-Player/product-reviews/B01LXJA5JD/ref=cm_cr_arp_d_paging_btm_") %>%
  paste0(3:476) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(3:476) %>%
  paste0(c("&pageSize=10&sortBy=recent"))
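
A more compact way to build the same vector, as a sketch: paste0() is vectorised, so a single call over the page numbers should give the same URLs as the chain above. The names base_url, page_nums and pages_alt are only illustrative.

```r
# Base part of the review URL, copied from the code above
base_url <- "https://www.amazon.com/Roku-Express-HD-Streaming-Player/product-reviews/B01LXJA5JD/ref=cm_cr_arp_d_paging_btm_"

# Page numbers to request
page_nums <- 3:476

# paste0() recycles the fixed strings across every page number
pages_alt <- paste0(base_url, page_nums,
                    "?ie=UTF8&pageNumber=", page_nums,
                    "&pageSize=10&sortBy=recent")

length(pages_alt)
head(pages_alt, 2)
```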

Next, two functions that will scrape the review "headline" and date respectively. I am trying to do some time series analysis on the text, so the dates are very important to me.

read_headline <- function(url){
  az <- read_html(url)
  headline <- az %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base a-text-bold']") %>%
    html_text() %>%
    as_tibble()
} 

read_date <- function(url){
  az <- read_html(url)
  date_f <- az %>%
    html_nodes('.review-date') %>%
    html_text() %>%
    str_replace_all("on ", "") %>%
    mdy() %>%
    as_tibble()
}

Finally, I used lapply to go through all of the pages and scrape the two items mentioned above.

headlines <- bind_rows(lapply(pages, read_headline))
dates <- bind_rows(lapply(pages, read_date))

Here is the problem: the lengths of these two items are vastly different. With so many pages and reviews, it is not practical to manually inspect all of the elements. I am assuming that some of the reviews have missing elements.

> length(headlines$value)
[1] 560
> length(dates$value)
[1] 720
>

Since I am trying to do some time series analysis, it is really important to me that the review headline is associated with the correct date.

Any thoughts/ideas/suggestions on how I go about this?

Also, a secondary question: I know that some websites are designed in such a way that makes scraping very difficult. Would you consider Amazon to be a difficult website to scrape? Clearly lots of people are interested in mining their reviews, so maybe the company takes measures to make this difficult.

mishabalyasin · August 24, 2018, 8:15am

Why don't you put your results into a data.frame/tibble? It would make it much easier to understand what is going wrong.
Specifically, you can create a data frame with one column (pages) and then use purrr::map twice with read_headline and read_date to create two columns. Something like this:

pages_df <- pages_df %>%
     dplyr::mutate(headlines = purrr::map(pages, read_headline),
                   dates = purrr::map(pages, read_date))

However, keep in mind that if you take this approach then both functions must return exactly the same number of results. In order to guarantee that, take a look at purrr adverbs (possibly, safely ...). They take a function (e.g., read_headline) as an argument and return a modified function that on error will return a different result, depending on which adverb you are going to use.
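
As a minimal sketch of how an adverb behaves: flaky_read() below is a made-up stand-in for a scraping function such as read_headline, and it fails on one input on purpose. Wrapping it with purrr::possibly() gives a function that returns a fallback value instead of stopping the whole loop.

```r
library(purrr)

# Hypothetical stand-in for a scraper: errors on one particular input
flaky_read <- function(x) {
  if (x == "bad") stop("could not read page")
  toupper(x)
}

# possibly() returns a modified function that yields `otherwise` on error
safe_read <- possibly(flaky_read, otherwise = NA_character_)

# The failing input becomes NA, so the output keeps one entry per input
results <- map_chr(c("a", "bad", "c"), safe_read)
results
```

purrr::safely() works along the same lines, except that the wrapped function returns a list with result and error components, which is handy when you also want to see the error message.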

Finally, once you've done that, you'll see which pages are problematic and you can then see whether you need to modify your function to handle some special cases that you didn't anticipate.

Piranha · August 24, 2018, 3:13pm

Hi @mishabalyasin

I tried creating a data frame/tibble. That's where I ran into the problem! I noticed that the two smaller data frames ("headlines" and "dates") are of different lengths:

> length(headlines$value)
[1] 560
> length(dates$value)
[1] 720

I am trying to figure out why they have different lengths. I suspect that it is because of missing values in some of the later pages of reviews. However, with so many reviews and pages, it is very difficult to manually inspect the results and figure out the problem. Wondering if other people have come across something similar.

Any help would be much appreciated!

mishabalyasin · August 24, 2018, 3:26pm

Do the pages stay the same in both of your function calls? Then you can use them to create a data frame like so:

pages_df <- tibble::tibble(pages = pages)

Then you can use the approach I've mentioned (don't forget about purrr::safely). After that you'll have missing values on rows with problematic pages.
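
To find the problematic pages without inspecting them by hand, one option is to count how many nodes each selector matches on every page and compare. The sketch below assumes two tiny made-up HTML snippets in place of real downloads (in practice each would come from read_html(url)), and count_nodes() is a hypothetical helper.

```r
library(rvest)
library(purrr)
library(tibble)

# Made-up stand-ins for two downloaded pages; the second has an extra date
page_html <- list(
  "<div><a class='review-title'>Great</a><span class='review-date'>on August 1, 2018</span></div>",
  "<span class='review-date'>on July 1, 2018</span><div><a class='review-title'>Bad</a><span class='review-date'>on July 30, 2018</span></div>"
)

# Hypothetical helper: number of nodes matched by a CSS selector
count_nodes <- function(html, css) {
  length(html_nodes(read_html(html), css))
}

counts <- tibble(
  page = seq_along(page_html),
  n_headlines = map_int(page_html, count_nodes, css = ".review-title"),
  n_dates = map_int(page_html, count_nodes, css = ".review-date")
)

# Pages where the two selectors disagree are the ones worth inspecting
mismatched <- counts[counts$n_headlines != counts$n_dates, ]
mismatched
```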

Piranha · August 24, 2018, 3:41pm

I haven't looked into purrr::safely in much detail yet. I will explore this further. Thank you.

Piranha · August 24, 2018, 9:15pm

I think I finally got it to work. The nodes that I selected via SelectorGadget were giving me a few duplicate items. Also, I used a subset of pages (pages 3:400). I also switched out bind_rows and lapply for map_dfr. All those changes combined seemed to do the trick. All of my variables are now of equal length!

If anyone is interested, updated code below.

library(stringr)
library(rvest)
library(lubridate)
library(tidyverse)
library(tidytext)

pages_3to400 <- c("https://www.amazon.com/Roku-Express-HD-Streaming-Player/product-reviews/B01LXJA5JD/ref=cm_cr_arp_d_paging_btm_") %>%
  paste0(3:400) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(3:400) %>%
  paste0(c("&pageSize=10&sortBy=recent"))
pages_3to400

read_headline <- function(url){
  az <- read_html(url)
  headline <- az %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base a-text-bold']") %>%
    html_text() %>%
    as_tibble()
} 

read_date <- function(url){
  az <- read_html(url)
  date_f <- az %>%
    html_nodes('.review-date') %>%
    html_text() %>%
    str_replace_all("on ", "") %>%
    mdy() %>%
    as_tibble() %>%
    # drop the first two matches on each page (the duplicate items mentioned above)
    slice(3:n())
}

read_stars <- function(url){
  az <- read_html(url)
  stars_f <- az %>%
    html_nodes(".review-rating") %>%
    html_text() %>%
    substr(1,3) %>%
    as.numeric() %>%
    as_tibble() %>%
    # drop the first two matches on each page (the duplicate items mentioned above)
    slice(3:n())
}

read_fullrev <- function(url){
  az <- read_html(url)
  full_review <- az %>%
    html_nodes(".review-text") %>%
    html_text() %>%
    as_tibble()
}

dates <- map_dfr(pages_3to400, read_date)
stars <- map_dfr(pages_3to400, read_stars)
headlines <- map_dfr(pages_3to400, read_headline)
fullreview <- map_dfr(pages_3to400, read_fullrev)

df <- tibble(date = dates$value,
             star = stars$value,
             headline = headlines$value,
             fullreview = fullreview$value)
df

# Uncomment to save the results once df has been created
#write.csv(df, "amzn_reviews.csv")
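
For the original question about missing values, another approach is to select each review's container first and then pull one field per container. html_node() (singular) returns exactly one result per input node, so a review with no headline yields NA instead of shifting the column. This is only a sketch on a made-up page: the .review container selector is an assumption and would need to be checked against the real page source.

```r
library(rvest)
library(tibble)

# Made-up page: the second review has no title
html <- read_html("
  <div class='review'>
    <a class='review-title'>Works well</a>
    <span class='review-date'>on August 1, 2018</span>
  </div>
  <div class='review'>
    <span class='review-date'>on July 30, 2018</span>
  </div>")

# One node per review (the .review selector is hypothetical)
reviews <- html_nodes(html, ".review")

# html_node() keeps one entry per review; html_text() gives NA where it is missing
df_aligned <- tibble(
  headline = html_text(html_node(reviews, ".review-title")),
  date = html_text(html_node(reviews, ".review-date"))
)
df_aligned
```

Because every column is built from the same set of review nodes, the columns always have the same length, and no slice() is needed to line them up.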

cderv · August 25, 2018, 10:55am

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems.