Hello,
I am newish to R and am trying to teach myself rvest for scraping web pages. For my first attempt, I thought I would try to scrape some product review information from Amazon.
Question: What are some good ways to deal with missing values so that the various components of a review are correctly aligned?
Hopefully, this example below is reproducible and explains my challenges.
# here are the packages I am using
library(stringr)
library(rvest)
library(lubridate)
library(tidyverse)
library(tidytext)
First, I created a vector to deal with multiple pages (hopefully I did this right!). This product has 5,000+ reviews spread across 400+ pages.
# Their URL format becomes consistent from page 3 onwards, so I am focusing on those pages for now
pages <- "https://www.amazon.com/Roku-Express-HD-Streaming-Player/product-reviews/B01LXJA5JD/ref=cm_cr_arp_d_paging_btm_" %>%
  paste0(3:476) %>%
  paste0("?ie=UTF8&pageNumber=") %>%
  paste0(3:476) %>%
  paste0("&pageSize=10&sortBy=recent")
Next, two functions that will scrape the review "headline" and date respectively. I am trying to do some time series analysis on the text, so the dates are very important to me.
# Scrape the review headlines from one page and return them as a one-column tibble
read_headline <- function(url){
  az <- read_html(url)
  az %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base a-text-bold']") %>%
    html_text() %>%
    as_tibble()
}
# Scrape the review dates from one page, strip the leading "on ", and parse them as dates
read_date <- function(url){
  az <- read_html(url)
  az %>%
    html_nodes('.review-date') %>%
    html_text() %>%
    str_replace_all("on ", "") %>%
    mdy() %>%
    as_tibble()
}
Finally, I used lapply to go through all of the pages and scrape the two items mentioned above.
headlines <- bind_rows(lapply(pages, read_headline))
dates <- bind_rows(lapply(pages, read_date))
Here is the problem: the lengths of these two items are vastly different. With so many pages and reviews, it is not practical to manually inspect all of the elements. I am assuming that some of the reviews have missing elements.
> length(headlines$value)
[1] 560
> length(dates$value)
[1] 720
>
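One way to narrow this down might be to count the matches for each selector page by page and look only at the pages where the counts disagree. Here is a minimal sketch on toy HTML rather than the real pages; the markup, toy_pages, and count_nodes are all made up for illustration:

```r
library(rvest)

# Two toy pages standing in for scraped review pages (hypothetical markup)
toy_pages <- list(
  read_html('<div><a class="review-title">A</a><span class="review-date">on May 1, 2017</span></div>'),
  read_html('<div><span class="review-date">on May 2, 2017</span><span class="review-date">on May 3, 2017</span><a class="review-title">B</a></div>')
)

# Hypothetical helper: number of nodes on a page that match a CSS selector
count_nodes <- function(page, css) length(html_nodes(page, css))

counts <- data.frame(
  page      = seq_along(toy_pages),
  headlines = sapply(toy_pages, count_nodes, css = ".review-title"),
  dates     = sapply(toy_pages, count_nodes, css = ".review-date")
)

# Keep only the pages where the two selectors disagree
mismatched <- counts[counts$headlines != counts$dates, ]
print(mismatched)
```

With the real pages, the same comparison should point to the handful of pages worth inspecting by hand.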
Since I am trying to do some time series analysis, it is really important to me that the review headline is associated with the correct date.
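One idea for keeping headlines and dates aligned is to select each review's container first and then pull at most one headline and one date from inside it, so that a missing piece shows up as NA instead of shifting everything after it. A minimal sketch on toy HTML; the container class "review" is an assumption, and I have not checked it against Amazon's real markup:

```r
library(rvest)

# Toy HTML standing in for one review page; the class names are assumptions.
# The second review deliberately has no headline.
page <- read_html('
<div class="review">
  <a class="review-title">Works great</a>
  <span class="review-date">on January 5, 2017</span>
</div>
<div class="review">
  <span class="review-date">on January 6, 2017</span>
</div>')

# One node per review
reviews <- html_nodes(page, ".review")

# html_node() (singular) returns exactly one result per review,
# and html_text() turns a missing match into NA
aligned <- data.frame(
  headline = html_text(html_node(reviews, ".review-title")),
  date     = html_text(html_node(reviews, ".review-date")),
  stringsAsFactors = FALSE
)
print(aligned)
```

Because both columns are built from the same set of review nodes, they always have the same length, and each row belongs to a single review.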
Any thoughts/ideas/suggestions on how I go about this?
Also, a secondary question: I know that some websites are designed in such a way that makes scraping very difficult. Would you consider Amazon to be a difficult website for scraping? Clearly lots of people are interested in mining their reviews, so maybe the company takes measures to make this difficult.