Problem with RSelenium (rsDriver) when scraping Goodreads reviews: "Undefined error in httr call"

ledgreve · February 22, 2023, 1:31pm

Hello everyone!
First of all, a disclaimer: I do not have much experience with R or any other programming language, so very simple and concrete answers would be greatly appreciated (thank you in advance)!

I was given a script to scrape Goodreads reviews in Chrome. It used to run perfectly, but now I only get error messages and I can't find a solution. The problem occurs when I run the piece of code that sets up the browser and navigates to the Goodreads URL:

#Set your browser settings 
rD <- rsDriver(browser = "chrome", chromever = "latest")
remDr <- rD[["client"]]
remDr$setTimeout(type = "implicit", 2000)
remDr$navigate(url)

When I run it, I get this error message:

Could not open chrome browser.
Client error message:
Undefined error in httr call. httr output: Failed to connect to localhost port 4567: Connection refused
Check server log for further details.
Warning message:
In rsDriver(browser = "chrome", chromever = "latest") :
  Could not determine server status.

I have tried manually setting the port to something else (e.g. rD <- rsDriver(port = 4686L, browser = "chrome", chromever = "latest")) and explicitly specifying my Chrome version (e.g. rD <- rsDriver(port = 4686L, browser = "chrome", chromever = "110.0.5481.104")), but neither works.
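
The script's own comment suggests picking the closest driver version when "latest" does not work. A minimal sketch of what that selection could look like in base R is shown below. The helper name pick_driver_version and the contents of the available vector are assumptions for illustration only; neither is part of RSelenium.

```r
# Hypothetical helper (not part of RSelenium): pick the newest driver build
# whose major version equals the browser's major version.
pick_driver_version <- function(browser_version, available) {
  # Major version = everything before the first dot, e.g. "110.0.5481.104" -> 110
  major_of <- function(v) as.integer(sub("\\..*$", "", v))
  candidates <- available[major_of(available) == major_of(browser_version)]
  if (length(candidates) == 0) {
    return(NA_character_)  # no driver build for this Chrome major version
  }
  # numeric_version compares component by component, so .77 sorts above .30
  as.character(max(numeric_version(candidates)))
}

# Illustrative list only; the real one depends on what is installed locally.
available <- c("100.0.4896.60", "110.0.5481.30", "110.0.5481.77", "111.0.5563.19")
chosen <- pick_driver_version("110.0.5481.104", available)
print(chosen)
```

If the result is not NA, it could be passed on as rsDriver(browser = "chrome", chromever = chosen). The rsDriver help page points to binman::list_versions("chromedriver") for the driver versions actually present on a machine, so that would probably be the natural source for the available vector.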

I would be very grateful if someone could help me solve this problem! I will provide the full script (including the URL of a Goodreads page) below. You can run it in RStudio; you just need to set the directory for the output at the end of the script.

library(rJava)        # Required to use RSelenium
library(data.table)   # Required for rbindlist
library(dplyr)        # Required to use the pipes %>% and some table manipulation commands
library(magrittr)     # Required to use the pipes %>%
library(rvest)        # Required for read_html
library(RSelenium)    # Required for webscraping with javascript
library(lubridate)    # Required to scrape the correct dates
library(stringr)      # Required to cut off any leading or trailing whitespace from text
library(purrr)


options(stringsAsFactors = F) #needed to prevent errors when merging data frames

#Paste the GoodReads Url
url <- "https://www.goodreads.com/book/show/96290.Die_unendliche_Geschichte"

englishOnly = F #If FALSE, all languages are chosen

#Set your browser settings (if chrome not working, pick closest version)
rD <- rsDriver(browser = "chrome", chromever = "latest")
remDr <- rD[["client"]]
remDr$setTimeout(type = "implicit", 2000)
remDr$navigate(url)

bookTitle = unlist(remDr$getTitle())
finalData = data.frame()

# Main loop going through the website pages
morePages = T
pageNumber =  1
while(morePages){
  
  #Select reviews in correct language
  #Go to the goodreads page of the book in Chrome and right-click.
  #Click on "View Page Source".
  #Look for the language code, it will look like this:
  #<select name="language_code" id="language_code"><option value="">All Languages</option><option value="de">Deutsch &lrm;(9)</option>
  #<option value="en">English &lrm;(9)</option><option value="es">Español &lrm;(1)</option>
  #The numeral language code is the sequence, so here "All Languages" is 1, "Deutsch" is 2, "English" is 3...
  #This sequence is not the same for every book, so check it each time!
  #It is sufficient if you only fill in the numeral language code.
  selectLanguage = if(englishOnly){
    remDr$findElement("xpath", "//select[@id='language_code']/option[@value='de']")
  } else {
    remDr$findElement("xpath", "//select[@id='language_code']/option[4]")
  }
  
  selectLanguage$clickElement()
  Sys.sleep(1)
  
  #Expand all reviews
  expandMore <- remDr$findElements("link text", "...more")
  expandMore = sapply(expandMore, function(x) x$clickElement())
  
  #Extracting the reviews from the page
  reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
  reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
  
  #Remove double text when expanded
  reviews.html <- lapply(reviews.html, function(x){
    if(str_count(x, "span id=\"freeText") > 1) {
      str_remove(x, "<span id=\"freeTextContainer.*")
    } else {
      x
    }
  })
  
  reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
  reviews.text <- unlist(reviews.list)
  
  #Some reviews have only rating and no text, so we process them separately
  onlyRating = unlist(map(seq_along(reviews.text), function(i) str_detect(reviews.text[i], "^\\\n\\\n")))
  
  #Full reviews
  if(sum(!onlyRating) > 0){
    
    filterData = reviews.text[!onlyRating]
    fullReviews = purrr::map_df(seq(1, length(filterData), by=2), function(i){
      review = unlist(strsplit(filterData[i], "\n"))
      
      data.frame(
        date = mdy(review[2]), #date
        username = str_trim(review[5]), #user
        rating = str_trim(review[9]), #overall
        comment = str_trim(review[12]) #comment
      )
    })
    
    #Add review text to full reviews
    fullReviews$review = unlist(purrr::map(seq(2, length(filterData), by=2), function(i){
      str_trim(str_remove(filterData[i], "\\s*\\n\\s*\\(less\\)"))
    }))
    
  } else {
    fullReviews = data.frame()
  }
  
  #partial reviews (only rating)
  if(sum(onlyRating) > 0){
    
    filterData = reviews.text[onlyRating]
    partialReviews = purrr::map_df(1:length(filterData), function(i){
      review = unlist(strsplit(filterData[i], "\n"))
      
      data.frame(
        date = mdy(review[9]), #date
        username = str_trim(review[4]), #user
        rating = str_trim(review[8]), #overall
        comment = "",
        review = ""
      )
    })
    
  } else {
    partialReviews = data.frame()
  }
  
  #Get the review IDs from all the links
  reviewId = reviews.html %>% str_extract("/review/show/\\d+")
  partialId = reviewId[(length(reviewId) - nrow(partialReviews) + 1):length(reviewId)] %>% 
    str_extract("\\d+")
  if(nrow(fullReviews) > 0){
    reviewId = reviewId[1:(length(reviewId) - nrow(partialReviews))]
    reviewId = reviewId[seq(1, length(reviewId), 2)] %>% str_extract("\\d+")
  } else {
    reviewId = NULL
  }
  
  if(nrow(partialReviews) > 0){
    reviewId = c(reviewId, partialId)
  }
  
  finalData = rbind(finalData, cbind(reviewId, rbind(fullReviews, partialReviews)))
  
  #Go to next page if possible
  nextPage = remDr$findElements("xpath", "//a[@class='next_page']")
  if(length(nextPage) > 0){
    message(paste("PAGE", pageNumber, "Processed - Going to next"))
    nextPage[[1]]$clickElement()
    pageNumber = pageNumber + 1
    Sys.sleep(2)
  } else {
    message(paste("PAGE", pageNumber, "Processed - Last page"))
    morePages = FALSE
  }
  
}   
#end of the main loop

#Replace missing ratings by 'not rated'
finalData$rating = ifelse(finalData$rating == "", "not rated", finalData$rating)

#Stop server
remDr$close()
rD$server$stop()
rm(rD, remDr)
gc()
#Windows only: force-kill any Java processes left over from the Selenium server
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)

#set directory to where you wish the file to go
#copy your working directory and exchange all backward slashes for forward slashes
getwd()
setwd("C:/Users/...")

#Write results
write.csv(finalData, paste0(bookTitle, ".csv"), row.names = F)
message("FINISHED!")

UPDATE: IT CHANGED AND I THINK IT GOT WORSE

In the meantime, I just kept re-running the script in the hope that something would change. While it was running, I got a message that a Java update was available. I figured the script might be failing because Java was out of date, so I updated Java. Now when I run the script, I get this error message:

Selenium message:session not created: This version of ChromeDriver only supports Chrome version 100
Current browser version is 110.0.5481.178 with binary path C:\Program Files\Google\Chrome\Application\chrome.exe
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'LW07C379', ip: '157.193.150.239', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_361'
Driver info: driver.version: unknown
remote stacktrace: Backtrace:
	Ordinal0 [0x01008BD3+2395091]
	Ordinal0 [0x00F9ACA1+1944737]
	Ordinal0 [0x00E8D008+839688]
	Ordinal0 [0x00EAD1A3+971171]
	Ordinal0 [0x00EA8DAA+953770]
	Ordinal0 [0x00EA6661+943713]
	Ordinal0 [0x00ED96F0+1152752]
	Ordinal0 [0x00ED934A+1151818]
	Ordinal0 [0x00ED49D6+1133014]
	Ordinal0 [0x00EAEF76+978806]
	Ordinal0 [0x00EAFE86+982662]
	GetHandleVerifier [0x011BC912+1719138]
	GetHandleVerifier [0x0126B2CD+2434333]
	GetHandleVerifier [0x010A4001+569937]
	GetHandleVerifier [0x010A3066+565942]
	Ordinal0 [0x00FA265B+1975899]
	Ordinal0 [0x00FA72A8+1995432]
	Ordinal0 [0x00FA7395+1995669]
	Ordinal0 [0x00FB02F1+2032369]
	BaseThreadInitThunk [0x75D500F9+25]
	RtlGetAppContainerNamedObjectPath [0x778B7BBE+286]
	RtlGetAppContainerNamedObjectPath [0x778B7B8E+238]


Could not open chrome browser.
Client error message:
	 Summary: SessionNotCreatedException
 	 Detail: A new session could not be created.
	 Further Details: run errorDetails method
Check server log for further details.
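
To make the mismatch in this message explicit, here is a minimal sketch in base R that pulls the two version numbers out of the first two lines of the error. The variable names are my own, and the patterns assume the wording of the message stays exactly as shown above.

```r
# The two lines are copied from the error output above.
err <- c(
  "Selenium message:session not created: This version of ChromeDriver only supports Chrome version 100",
  "Current browser version is 110.0.5481.178 with binary path C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
)

# Major Chrome version the running ChromeDriver was built for
driver_major <- as.integer(
  sub(".*only supports Chrome version ([0-9]+).*", "\\1", err[1])
)

# Full version of the installed browser, and its major component
browser_version <- sub(".*Current browser version is ([0-9.]+).*", "\\1", err[2])
browser_major <- as.integer(sub("\\..*$", "", browser_version))

if (driver_major != browser_major) {
  message("ChromeDriver targets Chrome ", driver_major,
          " but the installed browser is ", browser_version)
}
```

Read this way, the error seems to say that the driver started by rsDriver was built for Chrome 100, while the installed browser is Chrome 110, so the two cannot open a session together.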

I would be incredibly grateful for your help; I'm getting desperate.
