I was reading the exchange about scraping PAA from Google here -
and trying the R script. Unfortunately, neither Pidroz's (original script) nor S. Woodward's version works for me.
When trying the original code, I get an "htmlParse" error: Error in htmlParse(., encoding = "UTF-8") :
could not find function "htmlParse"
and when trying the suggestion, I get a curl error: Error in curl::curl_fetch_memory(url, handle = handle) :
URL using bad/illegal format or missing URL
I think this has to do with "&ie=utf-8&oe=utf-8&client=firefox-b" [original code]
versus "&ie=utf-8&oe=utf- 8&client=firefox-b" [suggestion].
Would anyone be willing to help? Many thanks in advance.
F
Why doesn't it give me the same error when I use "your version"? Instead it gives me a different error. The only difference I see is the
&ie=utf-8&oe=utf-8&client=firefox-b, which is different in your suggested fix.
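On the curl error: the message is consistent with the stray space in "utf- 8", since a literal space is not valid inside a URL. The first error is a separate problem: "could not find function" is what R reports when the package that provides htmlParse() (XML) has not been attached. Here is a minimal offline sketch of how the space could be caught before a request is sent. The helper has_space() and the two-word keyword "mac book" are made up for illustration and are not part of either script in the thread.

```r
# Query-string suffixes as quoted in the thread; `bad` contains a stray space.
good <- "&ie=utf-8&oe=utf-8&client=firefox-b"
bad  <- "&ie=utf-8&oe=utf- 8&client=firefox-b"

# Hypothetical helper: TRUE if the string contains a literal space.
has_space <- function(x) grepl(" ", x, fixed = TRUE)

print(has_space(good))
print(has_space(bad))

# Keywords can contain spaces too, so percent-encode them before pasting
# them into the URL. "mac book" is an illustrative keyword only.
keyword <- "mac book"
encoded_keyword <- utils::URLencode(keyword, reserved = TRUE)
url_to_check <- paste0("https://www.google.com/search?q=", encoded_keyword, good)
print(url_to_check)
```

Nothing here touches the network, so it only checks the string that would be handed to GET().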
library(XML)       # provides htmlParse() and xpathSApply()
library(dplyr)
library(httr)
library(magrittr)  # provides set_colnames()

mykeyword <- c("canape", "ps4", "macbook")
my_user_agent <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0"
PAA <- vector("list", length(mykeyword))

for (i in seq_along(mykeyword)) {
  # Build the search URL for this keyword (no spaces anywhere in the string)
  url_to_check <- paste0("https://www.google.com/search?q=", mykeyword[i], "&ie=utf-8&oe=utf-8&client=firefox-b")
  # Fetch the page, parse the HTML, and pull out the PAA question text
  PAA[[i]] <- GET(url_to_check, user_agent(my_user_agent)) %>%
    htmlParse(encoding = "UTF-8") %>%
    xpathSApply("//div[/*]/g-accordion-expander/div/div", xmlValue) %>%
    as.data.frame() %>%
    set_colnames("text") %>%
    mutate(
      keyword = mykeyword[i],
      text = as.character(text)
    ) %>%
    filter(text > "")  # drop empty strings
}
# Stack the per-keyword results into one data frame
PAA <- dplyr::bind_rows(PAA)
print(PAA)
#> text keyword
#> 1 What are the three types of CANape? canape
#> 2 What does CANape mean? canape
#> 3 Why are canapes called canapes? canape
#> 4 What is difference between canapes and hors d oeuvres? canape
#> 5 Is it still worth getting a ps4? ps4
#> 6 Is it still worth buying a ps4 in 2019? ps4
#> 7 How much does a ps4 cost in NZ? ps4
#> 8 Is it OK to leave a ps4 on overnight? ps4
#> 9 Which is the cheapest Apple laptop? macbook
#> 10 Are Macbooks worth it? macbook
#> 11 Which Apple laptop should I buy? macbook
#> 12 Why are macbooks so expensive? macbook
Oh! Awesome! Thanks so much! Genius. I figured after your first response that the XML library was missing, hence the error. I am assuming the output could be printed to a CSV file using PAA = dplyr::bind_cols(PAA) instead.
However, I am curious as to why it only returns 4 questions per keyword and not a longer list. I'm not sure I see the command for that.
You're a star! Thanks much
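On the CSV question: bind_cols() would place the per-keyword data frames side by side rather than stacking them, and it does not write a file. The result of bind_rows() is already a single data frame, so it can probably go straight to base R's write.csv(). A minimal sketch, assuming a small stand-in for PAA and a hypothetical output path in the temp directory:

```r
# Stand-in for the data frame produced by dplyr::bind_rows(PAA) above;
# the two rows are copied from the printed output for illustration.
PAA <- data.frame(
  text = c("What does CANape mean?", "Are Macbooks worth it?"),
  keyword = c("canape", "macbook"),
  stringsAsFactors = FALSE
)

# Hypothetical output location; swap in any path you like.
out_file <- file.path(tempdir(), "paa.csv")

# row.names = FALSE keeps the 1, 2, 3... row index out of the file.
write.csv(PAA, out_file, row.names = FALSE)

# Read it back to confirm the round trip.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```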
I think it's because Google returns 4 PAA by default. These seem to expand when you click on any of the questions (on the Google website), so it's probably a limit they impose. All good!
Thanks again. Your knowledge was useful, and is helping me decipher R a little more.