How to scrape a page that does not load fully until you scroll down?

Kusal95 · June 20, 2019, 3:14am

Some web pages do not load completely until you scroll down (e.g. http://www.espncricinfo.com/series/13062/commentary/428753/australia-vs-england-5th-test-england-tour-of-australia-2010-11?innings=1). I need to scrape the commentary lines from this page using RStudio. When I scrape it, I get only the first few lines that load initially, not the whole page.
I tried this:
library(rvest)
url = "http://www.espncricinfo.com/series/13062/commentary/428753/australia-vs-england-5th-test-england-tour-of-australia-2010-11?innings=1"
page = read_html(url)  # fetches only the HTML served on the initial request
pagehtml = html_nodes(page, '.description')
html_text(pagehtml)

josiah · June 20, 2019, 5:59pm

Hi @Kusal95,

To solve your problem, I would look at RSelenium. It lets you interact with the web page (for example, scroll it), which is not currently possible with rvest alone. This StackOverflow question goes over using RSelenium with infinite scroll (your current situation) and should be able to assist you further.
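
As a rough illustration of the infinite-scroll idea, here is a minimal sketch, not a tested solution for this particular site. It assumes `remDr` is an already open RSelenium remote driver that has navigated to the page. The helper name `scrape_after_scrolling` and the `max_rounds` and `pause` defaults are made up for this example; the `.description` selector comes from the question.

```r
library(rvest)

# Scroll to the bottom repeatedly until the page height stops growing,
# then parse the fully loaded HTML with rvest.
# `remDr` is assumed to be an open RSelenium remoteDriver.
scrape_after_scrolling <- function(remDr, css = ".description",
                                   max_rounds = 50, pause = 2) {
  get_height <- function() {
    remDr$executeScript("return document.body.scrollHeight;",
                        args = list())[[1]]
  }
  last_height <- get_height()
  for (i in seq_len(max_rounds)) {
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);",
                        args = list())
    Sys.sleep(pause)                       # give the next batch time to load
    new_height <- get_height()
    if (new_height == last_height) break   # nothing new was loaded
    last_height <- new_height
  }
  page <- read_html(remDr$getPageSource()[[1]])
  html_text(html_nodes(page, css))
}
```

Calling `scrape_after_scrolling(remDr)` would then return the text of every `.description` node present after scrolling. The fixed pause is a simplification; a slow connection may need a longer one.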

Kusal95 · June 21, 2019, 3:08am

Thank you for your advice....

Kusal95 · June 21, 2019, 4:19am

This gives an error

#start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

---error--
[1] "Connecting to remote server"
Error in checkError(res) :
Undefined error in httr call. httr output: Failed to connect to localhost port 4445: Connection refused
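
The "Connection refused" message suggests that no Selenium server was listening on that port. In recent RSelenium releases `checkForServer()` and `startServer()` are defunct, and one commonly used alternative is `rsDriver()`, which starts a server and a browser session together. The sketch below assumes a recent RSelenium, Java, and Firefox are installed. The wrapper name `start_session` and the browser and port values are illustrative choices, not requirements.

```r
# Start a local Selenium server plus a browser session in one call.
# Assumption: RSelenium, Java, and the chosen browser are installed.
start_session <- function(browser = "firefox", port = 4567L) {
  rD <- RSelenium::rsDriver(browser = browser, port = port, verbose = FALSE)
  rD  # rD$client is the remote driver, rD$server controls the server process
}

# Typical use (not run here):
# rD <- start_session()
# remDr <- rD$client
# remDr$navigate(url)
# ... scrape ...
# remDr$close()
# rD$server$stop()
```

If the port is already in use, passing a different `port` value is usually enough.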

Kill3rbee · June 21, 2019, 5:30am

I checked it out, and all you need to do after you read the page is get the div with class "content".

Once you do that, everything else will be easy. I would recommend BeautifulSoup.
Another challenge is connecting to the web server; that is where the requests library excels.
I love R, but when it comes to networking and web scraping I use Python. You can even evade detection.
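
For readers staying in R, the same "get the content div" step can be sketched with rvest. The HTML string below is made up for illustration; only the class name "content" comes from this post and `.description` from the question, and the real page structure may differ.

```r
library(rvest)

# Made-up HTML standing in for the real page.
sample_html <- '<html><body>
  <div class="content">
    <p class="description">First ball</p>
    <p class="description">Second ball</p>
  </div>
  <div class="sidebar"><p class="description">Ad text</p></div>
</body></html>'

page <- read_html(sample_html)

# Restrict the search to the content div, then pull the commentary lines.
content_div <- html_nodes(page, "div.content")
lines <- html_text(html_nodes(content_div, ".description"), trim = TRUE)
print(lines)
```

Selecting inside `div.content` first keeps matches from other parts of the page, such as the sidebar in this sample, out of the result.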

Good luck

Kusal95 · June 21, 2019, 7:10am

Thank you. I will try it...

system · July 12, 2019, 7:10am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.