Web-scraping: "Selenium message:invalid selector"

ledgreve · March 12, 2020, 2:26pm

Hello,

A while ago I received some wonderful help in building a script for scraping Goodreads reviews (full script included below). It worked perfectly, but when I tried to run it again today, I received the following error message:

Selenium message:invalid selector: Unable to locate an element with the xpath expression //select[@id='language_code']/option[] because of the following error:
SyntaxError: Failed to execute 'evaluate' on 'Document': The string '//select[@id='language_code']/option[]' is not a valid XPath expression.
  (Session info: chrome=80.0.3987.132)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/invalid_selector_exception.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'LW07C379', ip: '157.193.150.239', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_231'
Driver info: driver.version: unknown

Error: 	 Summary: InvalidSelector
 	 Detail: Argument was an invalid selector (e.g. XPath/CSS).
 	 class: org.openqa.selenium.InvalidSelectorException
	 Further Details: run errorDetails method

Could someone help me to solve this problem?

There are also some other things I would like to adapt, so any practical advice regarding that would be most welcome as well!

- No longer removing special characters (e.g. ü, à, ç, é, ") and using UTF-8 instead.
- Scraping the URLs or IDs of the reviews as well (this would allow me to properly anonymise my data). I think this should be possible by making the script click on "see review" and scraping the URL/ID from that location.
- Removing the beginning of each review, which is repeated in the HTML.

Thank you in advance!

Full script:

library(data.table)   # Required for rbindlist
library(dplyr)        # Required to use the pipes %>% and some table manipulation commands
library(magrittr)     # Required to use the pipes %>%
library(rvest)        # Required for read_html
library(RSelenium)    # Required for webscraping with javascript
library(lubridate)
library(stringr)
library(purrr)


options(stringsAsFactors = F) #needed to prevent errors when merging data frames

#Paste the GoodReads Url
url <- "https://www.goodreads.com/book/show/1885.Pride_and_Prejudice?ac=1&from_search=true&qid=94o1v7Jy7T&rank=1"

languageOnly = F #If FALSE, "all languages" is chosen

#Set your browser settings
rD <- rsDriver(browser = "chrome", chromever = "79.0.3945.36")
remDr <- rD[["client"]]
remDr$setTimeout(type = "implicit", 2000)
remDr$navigate(url)

bookTitle = unlist(remDr$getTitle())
finalData = data.frame()

# Main loop going through the website pages
morePages = T
pageNumber =  1
while(morePages){
  
  #Select reviews in correct language
  selectLanguage = if(languageOnly){
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[@value='']")
  } else {
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[]")
  }
  
  selectLanguage$clickElement()
  Sys.sleep(3)
  
  #Expand all reviews
  expandMore <- remDr$findElements("link text", "...more")
  sapply(expandMore, function(x) x$clickElement())
  
  #Extracting the reviews from the page
  reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
  reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
  reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
  reviews.text <- unlist(reviews.list)
  
  #Some reviews have only rating and no text, so we process them separately
  onlyRating = unlist(map(1:length(reviews.text), function(i) str_detect(reviews.text[i], "^\\\n\\\n")))
  
  #Full reviews
  if(sum(!onlyRating) > 0){
    
    filterData = reviews.text[!onlyRating]
    fullReviews = purrr::map_df(seq(1, length(filterData), by=2), function(i){
      review = unlist(strsplit(filterData[i], "\n"))
      
      data.frame(
        date = mdy(review[2]), #date
        username = str_trim(review[5]), #user
        rating = str_trim(review[9]), #overall
        comment = str_trim(review[12]) #comment
      )
    })
    
    #Add review text to full reviews
    fullReviews$review = unlist(purrr::map(seq(2, length(filterData), by=2), function(i){
      str_trim(str_remove(filterData[i], "\\s*\\n\\s*\\(less\\)"))
    }))
    
  } else {
    fullReviews = data.frame()
  }
  
  
  #partial reviews (only rating)
  if(sum(onlyRating) > 0){
    
    filterData = reviews.text[onlyRating]
    partialReviews = purrr::map_df(1:length(filterData), function(i){
      review = unlist(strsplit(filterData[i], "\n"))
      
      data.frame(
        date = mdy(review[9]), #date
        username = str_trim(review[4]), #user
        rating = str_trim(review[8]), #overall
        comment = "",
        review = ""
      )
    })
    
  } else {
    partialReviews = data.frame()
  }
  
  finalData = rbind(finalData, fullReviews, partialReviews)
  
  #Go to next page if possible
  nextPage = remDr$findElements("xpath", "//a[@class='next_page']")
  if(length(nextPage) > 0){
    message(paste("PAGE", pageNumber, "Processed - Going to next"))
    nextPage[[1]]$clickElement()
    pageNumber = pageNumber + 1
    Sys.sleep(2)
  } else {
    message(paste("PAGE", pageNumber, "Processed - Last page"))
    morePages = FALSE
  }
  
}   
#end of the main loop

#Replace missing ratings by 'not rated'
finalData$rating = ifelse(finalData$rating == "", "not rated", finalData$rating)

#Stop server
rD[["server"]]$stop()

#set directory to where you wish the file to go
getwd()
setwd("")

#Write results
write.csv(finalData, paste0(bookTitle, ".csv"), row.names = F)
message("FINISHED!")

dcruvolo · March 12, 2020, 5:19pm

It looks like the trailing brackets ([]) in //select[@id='language_code']/option[] are causing the error. In XPath, [] holds a predicate that filters which option is returned, and an empty predicate is not valid syntax. Removing the [] matches every option, and findElement() returns the first match, which I think is the default option.

#Select reviews in correct language
selectLanguage = if(languageOnly){
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[@value='']")
} else {
-    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[]")
+   selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option")
}
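
To see how the predicate changes what is matched, here is a minimal sketch that evaluates the same kind of XPath with xml2 against a small, made-up <select> element. The markup and the option values are assumptions for illustration only; the real Goodreads page may look different.

```r
library(xml2)

# Hypothetical markup standing in for the language dropdown
page <- read_html('<select id="language_code">
  <option value="">All languages</option>
  <option value="en">English</option>
</select>')

# No predicate: every <option> matches, and the first one is returned
first_opt <- xml_find_first(page, "//select[@id='language_code']/option")
xml_text(first_opt)

# Positional predicate: explicitly ask for the first <option>
same_opt <- xml_find_first(page, "//select[@id='language_code']/option[1]")

# Attribute predicate: pick one specific <option> by its value
en_opt <- xml_find_first(page, "//select[@id='language_code']/option[@value='en']")
xml_text(en_opt)
```

If you want to be explicit about which option is clicked, the positional or attribute form is probably safer than relying on the first match.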

For the special characters, maybe the base Encoding() function will help? See the R help page "Read or Set the Declared Encodings for a Character Vector".
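
As a rough sketch of the encoding idea, the snippet below converts some text to UTF-8 with enc2utf8() and writes it out with an explicit file encoding. The data frame is made-up sample data (not scraped output), and whether this is enough for your script depends on where the special characters are currently being lost.

```r
# Hypothetical sample data standing in for scraped reviews;
# \u escapes keep this file ASCII-only
reviews <- data.frame(
  username = c("Zo\u00eb", "Fran\u00e7ois"),
  review   = c("Tr\u00e8s bien \u00e9crit", "\u00dcber-romantic, na\u00efve at times"),
  stringsAsFactors = FALSE
)

# Convert to UTF-8 and inspect the declared encoding
reviews$review <- enc2utf8(reviews$review)
Encoding(reviews$review)

# Write and re-read with an explicit UTF-8 file encoding
out <- tempfile(fileext = ".csv")
write.csv(reviews, out, row.names = FALSE, fileEncoding = "UTF-8")
back <- read.csv(out, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
```

In the original script the equivalent change would be adding fileEncoding = "UTF-8" to the write.csv() call.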

I think the script will need to be adjusted a bit to get urls and IDs. In the line:

reviews <- remDr$findElements("css selector", "#bookReviews .stacked")

the selector #bookReviews .stacked places you pretty deep in the markup for each review. You might need to add a step or two before this line: select the parent element with the classes section firstReview (the second child div in #bookReviews), then use the getElementAttribute() function to extract the id and href attributes from the div with the classes review nosyndicate.
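
Here is a minimal sketch of that idea using rvest on a static HTML string rather than a live Selenium session. The markup, the ids and the URLs are all invented for illustration, and I am assuming the link sits in an <a> inside each review div; check the real page source before relying on these selectors.

```r
library(rvest)

# Hypothetical markup: the real Goodreads structure may differ
html <- '<div id="bookReviews">
  <div class="section firstReview">
    <div id="review_111" class="review nosyndicate">
      <a href="/review/show/111">see review</a>
    </div>
    <div id="review_222" class="review nosyndicate">
      <a href="/review/show/222">see review</a>
    </div>
  </div>
</div>'

# Select each review container instead of the deeper .stacked nodes
review_nodes <- read_html(html) %>% html_nodes("#bookReviews div.review")

# The id comes from the container, the URL from the first link inside it
review_ids  <- review_nodes %>% html_attr("id")
review_urls <- review_nodes %>% html_node("a") %>% html_attr("href")

review_index <- data.frame(id = review_ids, url = review_urls,
                           stringsAsFactors = FALSE)
```

With RSelenium the same two attributes could be read with getElementAttribute("id") and getElementAttribute("href") on the corresponding elements.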

Are you referring to the duplicate <span> elements inside the div.reviewText? If so, you might need an additional selector in reviews.list. I think what is happening is that the first <span> holds the truncated text for reviews over X characters and is followed by a ...more link. When ...more is clicked, the second <span> is displayed and the truncated one is hidden. I'm not sure whether both elements contain the same text. To be safe, select the last <span> element.

reviews.list <- lapply(reviews.html, function(x){
    read_html(x) %>% 
+    html_nodes("span:last-child") %>%
    html_text()
})

I think you can get away without additional logic to determine if the review has a ...more link. Using the pseudo-selector last-child will select the last element or the only element for shorter reviews. I vaguely recall that the rvest package has limited support for pseudo-selectors, but I think that last-child is fine.
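
To illustrate, here is a small sketch with two made-up review blocks, one long (truncated plus full <span>) and one short (single <span>). The markup is an assumption; in particular, if the real page places another element after the spans (such as the ...more link), :last-child will not match and something like span:last-of-type may be needed instead.

```r
library(rvest)

# Hypothetical markup: a long review with two spans, a short one with one
html <- '<div>
  <div class="reviewText">
    <span>It is a truth universally...</span>
    <span style="display:none">It is a truth universally acknowledged.</span>
  </div>
  <div class="reviewText">
    <span>Loved it.</span>
  </div>
</div>'

# Keep only the last <span> of each review block
review_text <- read_html(html) %>%
  html_nodes("div.reviewText > span:last-child") %>%
  html_text()

review_text
```

Scoping the selector to div.reviewText keeps it from picking up unrelated spans elsewhere in the review.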

Hope that helps! Let me know if anything is unclear.
