Hello everyone,
I am trying to scrape data from a website that renders the data of interest with JavaScript. From a few tutorials, I learned how to do this with PhantomJS.
To make life easier for potential helpers, I provide the HTML content obtained with PhantomJS here so that you do not have to go through the entire process.
I am trying to scrape the locations of all Verizon stores in Texas. The main webpage that contains the data is https://www.verizonwireless.com/stores/texas/#/state. Notice that when you first access the page, only the "A" cities are displayed. When you click on "All", every city is displayed; however, the URL does not change (because the list is re-rendered by JavaScript).
My main issue is that when I download the page content with PhantomJS and scrape the links, I get only the "A" cities and nothing else, even though the other city links use the same CSS selector. How can I get the links for all cities?
JavaScript file for downloading the content of the webpage
// scrape-tx-verizon.js
// Save the rendered HTML of the page to a local file.
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'verizon-stores.html';
page.open('https://www.verizonwireless.com/stores/texas/#/state/', function (status) {
    // Capture whatever the DOM looks like once the page has opened.
    var content = page.content;
    fs.write(path, content, 'w');
    phantom.exit();
});
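One idea I had, since the full list only appears after clicking "All", is to trigger that click inside the page and wait briefly for the re-render before saving. Below is an untested sketch; the selector for the "All" tab is a guess and would need to be checked in the browser's inspector, and the 3-second wait is arbitrary:

```javascript
// scrape-tx-verizon-all.js (untested sketch; run with the phantomjs binary)
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'verizon-stores-all.html';

page.open('https://www.verizonwireless.com/stores/texas/#/state/', function (status) {
    // Runs inside the page context. The selector below is a guess --
    // verify the real one for the "All" tab in the inspector.
    page.evaluate(function () {
        var allTab = document.querySelector('a[data-filter="all"]');
        if (allTab) { allTab.click(); }
    });
    // Give the page's JavaScript time to re-render the city list
    // before writing out the DOM.
    window.setTimeout(function () {
        fs.write(path, page.content, 'w');
        phantom.exit();
    }, 3000);
});
```

I have not gotten this to work yet, so I may have the wrong selector or the wrong waiting strategy.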
R code for scraping links
library(rvest)
library(here)
# Is it legal to scrape data from the Verizon store website: YES
robotstxt::paths_allowed("https://www.verizonwireless.com/stores/texas/#/state")
# Render the JavaScript-generated content of the webpage with PhantomJS
system("./phantomjs scrape-tx-verizon.js")
# Read the saved HTML
html <- read_html(here("verizon-stores.html"))
html %>%
  html_nodes(css = '#cityList .link') %>%
  html_attr(name = "href")
[1] "/stores/texas/abilene/" "/stores/texas/allen/" "/stores/texas/amarillo/" "/stores/texas/arlington/" "/stores/texas/athens/" "/stores/texas/austin/"
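For completeness, a quick way to see how little is captured is to count the scraped links; `links` is just a hypothetical name for the vector returned by the pipeline above:

```r
# Sanity check: how many city links were actually captured?
# (Assumes `html` from read_html() above is still in the session.)
links <- html %>%
  html_nodes(css = '#cityList .link') %>%
  html_attr(name = "href")
length(links)
```

The count is far smaller than the number of Texas cities with a Verizon store, which matches what I see in the saved HTML.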
Thank you for your help.