Not able to scrape a website

I am trying to get the number of products of brand jealous-21 on 2 websites. It is just for demo purposes.

I run this code to get the number


    library(rvest)

    read_html('') %>%
      html_nodes('span.product-count') %>%
      html_text()

This gives me:

    #> [1] "278 products"

But when I try to do the same on another site, it doesn't give me anything:

    read_html('') %>%
      html_nodes('.horizontal-filters-sub') %>%
      html_text()

This is what I want, but it gives me nothing. What should I do here?

I'm a pretty novice web scraper, but I found a StackOverflow answer that I was able to use to get the results you're looking for. It seems the second site you are trying to scrape doesn't play well with rvest because its content is created dynamically with JavaScript rather than served as static HTML.
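A quick way to confirm this diagnosis yourself (a hedged sketch; the empty URL is a placeholder for the site in question, and `"products"` stands in for any text you can see in the browser) is to check whether the content actually appears in the raw HTML the server returns:

```r
library(rvest)

url <- ''  # placeholder for the page you are scraping

page <- read_html(url)

# if this is 0, the node is not in the server-side HTML ...
length(html_nodes(page, '.horizontal-filters-sub'))

# ... and if text you can see in the browser is also missing from the
# raw source, the content is almost certainly rendered by JavaScript
grepl("products", as.character(page))
```

When both checks come up empty, no amount of selector tweaking in rvest will help; you need a tool that executes the page's JavaScript.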

The basic approach is to start a local Selenium server, navigate to the desired page, and then extract the elements needed. This is done using the RSelenium package from ropensci.


    # start up local Selenium server
    library(RSelenium)
    rD <- rsDriver()
    remDr <- rD$client

    # navigate to desired page
    remDr$navigate('')

    # access needed element
    item_element <- remDr$findElement(using = "xpath", '//*[@id="desktopSearchResults"]/div[1]/section/div[1]/div[1]/span')

    # extract element text
    items <- item_element$getElementText()[[1]]

    # remove non-numeric characters
    gsub("[^0-9]", "", items)
    #> [1] "430"

    # close browser
    remDr$close()

    # stop server
    rD$server$stop()
    #> [1] TRUE

Created on 2018-11-02 by the reprex package (v0.2.1)


Thank you very much for replying, but the code that you wrote doesn't work.
It gives me an error like this:

    > rD <- rsDriver()
    checking Selenium Server versions:
    Error in .Call(C_serialize_to_yaml, x, line.sep, indent, omap, column.major,  : 
      Incorrect number of arguments (9), expecting 8 for 'serialize_to_yaml'

I read the documentation, but it seems I don't know how to set up Selenium on a Windows machine. Could you tell me the simplest way to run Selenium, please?

Selenium can be tricky; it was not meant as a scraping tool, but as a browser automation tool (the main use case is automated testing, sort of like testthat for web frontends). It is not a simple tool. It requires a browser to run (I used Firefox, which is the default one), and so it is one of the rare use cases where you are better off with a local installation of RStudio than with a server one. Setting it up on a server / in the cloud is possible, but yet more difficult.

It has two modes: Docker and standalone. Docker is supposed to be the preferred way to set it up, but I had more success with the standalone installation.
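For reference, the Docker route (which I had less luck with) boils down to running one of the official Selenium browser images and exposing the default port; this is a sketch, and the image tag and port may need adjusting for your setup:

```shell
# run a standalone Selenium server with Firefox, listening on the default port 4444
docker run -d -p 4444:4444 --name selenium selenium/standalone-firefox
```

Once the container is up, R connects to it exactly as it would to a locally installed standalone server.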

The steps I took when setting it up were:

  • downloaded the jar file from Downloads | Selenium
  • placed it manually in the /bin/ directory of my R package (this was a pretty unusual step!)
  • created a bat file to start the driver - it has a long name that I found hard to remember, and I had difficulties executing it from inside R; this can probably be overcome, but I was lazy...
    This needs to be started in a CMD console (i.e. outside of RStudio):

    java -jar selenium-server-standalone-3.141.0.jar

  • connected to the driver from my R code by running:

    remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
    remDr$open()

    ... # the actual code comes here :)

    remDr$close() # this is important
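Putting the steps above together, a minimal end-to-end sketch (assuming the standalone server started from the bat file is already listening on port 4444; the empty URL is a placeholder, and the CSS selector is the one from your original rvest attempt):

```r
library(RSelenium)

# connect to the already-running standalone Selenium server
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L,
                      browserName = "firefox")
remDr$open()

# navigate to the page and give its JavaScript a moment to render
remDr$navigate('')   # placeholder URL
Sys.sleep(2)

# grab the element and keep only the digits from its text
el <- remDr$findElement(using = "css selector", ".horizontal-filters-sub")
gsub("[^0-9]", "", el$getElementText()[[1]])

remDr$close()        # this is important
```

The `Sys.sleep()` is a crude wait; for flaky pages you would poll for the element instead, but it keeps the sketch simple.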

RSelenium is not easy to set up, but once you get it running it is a powerful web automation & scraping tool. It actually renders the HTML, meaning that pages that do not want to be scraped can run but cannot hide. Use it with caution and with respect for the Terms & Conditions of your targets.


I am not a web scraper by any means, but I have seen people who got stuck with the Selenium setup have success with splashr, so that might also be worth a look:

The same caveats apply. Food for thought along these lines:


    > devtools::install_github("hrbrmstr/splashr")
    Downloading GitHub repo hrbrmstr/splashr@master
    Skipping 13 packages not available: curl, docker, dplyr, formatR, HARtools, httr, jsonlite, lubridate, magick, openssl, purrr, scales, stringi
    Installing 2 packages: docker, HARtools
    Error: (converted from warning) unable to access index for repository
      cannot open URL ''
    > pacman::p_install_gh("hrbrmstr/splashr")
    Downloading GitHub repo hrbrmstr/splashr@master

    Skipping 20 packages not available: covr, curl, docker, dplyr, formatR, HARtools, httr, jpeg, jsonlite, knitr, lubridate, magick, openssl, png, purrr, rmarkdown, scales, stringi, testthat, tibble
    Installing 2 packages: docker, HARtools
    Installation failed: NULL : (converted from warning) unable to access index for repository
      cannot open URL ''

    The following packages were not able to be installed:
    > library(splashr)
    Error in library(splashr) : there is no package called ‘splashr’

I am not able to install it either. What should I do?

Are you sure you are connected to the internet without a proxy being required? This error means the CRAN URL cannot be reached; it has nothing to do with the specific package you tried to install.
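A few quick checks you can run from R to narrow this down (a hedged sketch; the proxy address is a placeholder for whatever your network actually uses):

```r
# do we have basic internet access at all?
curl::has_internet()

# which repositories is install.packages() trying to reach?
getOption("repos")

# are proxy environment variables set (or missing when they shouldn't be)?
Sys.getenv(c("http_proxy", "https_proxy"))

# if you are behind a proxy, set it for the session, e.g.:
# Sys.setenv(http_proxy  = "http://proxy.example.com:8080",   # placeholder
#            https_proxy = "http://proxy.example.com:8080")   # placeholder
```

If `curl::has_internet()` is `FALSE` or the repos URLs are unreachable from your network, fixing that comes before any package installation.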