Hello everyone,
I am trying to scrape data from a website that renders the data of interest with JavaScript. From a few tutorials, I learned how to do this with PhantomJS.
To make life easier for potential helpers, I provide the HTML content obtained with PhantomJS here so that you do not have to go through the entire process.
I am trying to scrape the locations of all Verizon stores in Texas. The main webpage that contains the data is https://www.verizonwireless.com/stores/texas/#/state. Notice that when you first access the page, only the "A" cities are displayed. When you click on "All", every city is displayed; however, the URL does not change (because the list is re-rendered by JavaScript).
My main issue is that when I download the page content with PhantomJS and scrape the links, I get only the "A" cities and nothing else, even though the other city links use the same CSS selector. How can I get the links for all cities?
JavaScript file for downloading the content of the webpage
// scrape-tx-verizon.js
// Save the rendered HTML of the page to a local file.
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'verizon-stores.html';
page.open('https://www.verizonwireless.com/stores/texas/#/state/', function (status) {
    // Capture whatever the DOM looks like once the page has opened.
    var content = page.content;
    fs.write(path, content, 'w');
    phantom.exit();
});
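One idea I had, since the full list only appears after clicking "All", is to trigger that click inside the page and wait briefly for the re-render before saving. Below is an untested sketch; the selector for the "All" tab is a guess and would need to be checked in the browser's inspector, and the 3-second wait is arbitrary:

```javascript
// scrape-tx-verizon-all.js (untested sketch; run with the phantomjs binary)
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'verizon-stores-all.html';

page.open('https://www.verizonwireless.com/stores/texas/#/state/', function (status) {
    // Runs inside the page context. The selector below is a guess --
    // verify the real one for the "All" tab in the inspector.
    page.evaluate(function () {
        var allTab = document.querySelector('a[data-filter="all"]');
        if (allTab) { allTab.click(); }
    });
    // Give the page's JavaScript time to re-render the city list
    // before writing out the DOM.
    window.setTimeout(function () {
        fs.write(path, page.content, 'w');
        phantom.exit();
    }, 3000);
});
```

I have not gotten this to work yet, so I may have the wrong selector or the wrong waiting strategy.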
R code for scraping links
library(rvest)
library(here)
# Is it legal to scrape data from the Verizon store website: YES
robotstxt::paths_allowed("https://www.verizonwireless.com/stores/texas/#/state")
# Render the JavaScript-generated content of the webpage with PhantomJS
system("./phantomjs scrape-tx-verizon.js")
# Read the saved HTML
html <- read_html(here("verizon-stores.html"))
html %>%
  html_nodes(css = '#cityList .link') %>%
  html_attr(name = "href")
[1] "/stores/texas/abilene/" "/stores/texas/allen/" "/stores/texas/amarillo/" "/stores/texas/arlington/" "/stores/texas/athens/" "/stores/texas/austin/"
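For completeness, a quick way to see how little is captured is to count the scraped links; `links` is just a hypothetical name for the vector returned by the pipeline above:

```r
# Sanity check: how many city links were actually captured?
# (Assumes `html` from read_html() above is still in the session.)
links <- html %>%
  html_nodes(css = '#cityList .link') %>%
  html_attr(name = "href")
length(links)
```

The count is far smaller than the number of Texas cities with a Verizon store, which matches what I see in the saved HTML.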
Thank you for your help.