I am attempting to scrape the webpage Kinotoppen 2.0 - Filmweb for the title and other information about each movie. On other webpages I have managed fine with a couple of lines using html_nodes() and html_text(), picking the CSS selectors with SelectorGadget, like so:
library(rvest)

html <- read_html("https://www.filmweb.no/kinotoppen/")
title <- html %>%
  html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>%
  html_text()
However, when running those lines on this webpage I only get an empty character vector. Inspecting the page further, I see that the content is rendered by JavaScript.
I tried extracting the scripts with html_nodes("script") and running them with the V8 package, but to no avail.
What is the best way to deal with javascript-rendered webpages so that I can scrape data with rvest?
The development of PhantomJS has been suspended, according to its website.
The solution (with help from Stack Overflow) was to use RSelenium to automate a real web browser: let the browser execute the JavaScript, then hand the fully rendered HTML to rvest. A general approach:
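A minimal sketch of that approach, assuming a Selenium-compatible browser driver is available on your machine (rsDriver() will try to download and start one); the fixed Sys.sleep() wait and the .Kinotoppen_MovieTitle__2MFbT class from the question are assumptions and may need adjusting:

```r
library(RSelenium)
library(rvest)

# Start a Selenium server and open a browser session
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

# Navigate to the page and give the JavaScript time to render
remDr$navigate("https://www.filmweb.no/kinotoppen/")
Sys.sleep(5)  # crude wait; a polling loop on the selector is more robust

# Grab the rendered HTML from the browser and scrape it with rvest as usual
page_source <- remDr$getPageSource()[[1]]
html <- read_html(page_source)

title <- html %>%
  html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>%
  html_text()

# Clean up the browser and the Selenium server
remDr$close()
rD$server$stop()
```

The key idea is that remDr$getPageSource() returns the DOM *after* the scripts have run, so the same selectors that failed on the raw HTTP response now match.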