Scraping PAA (People Also Ask)

I was reading the exchange about scraping PAA from Google here -

and trying the R script. Unfortunately, neither Pidroz's version (the original script) nor S. Woodward's version works for me.

When trying the original code, I get an "htmlParse" error:

Error in htmlParse(., encoding = "UTF-8") :
  could not find function "htmlParse"

and when trying the suggested version, I get a curl error:

Error in curl::curl_fetch_memory(url, handle = handle) :
  URL using bad/illegal format or missing URL

I think this has to do with "&ie=utf-8&oe=utf-8&client=firefox-b" [original code]

versus "&ie=utf-8&oe=utf- 8&client=firefox-b" [suggested fix].

Would anyone be willing to help? Many thanks in advance
F

Cc'ing woodward as the author of the suggested change. Best

It looks like you don't have the package for htmlParse installed.

Type ??htmlParse to find out which package it's from.

Then install and load that package.
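
In this case ??htmlParse points to the XML package (the one loaded at the top of the working code further down), so the fix looks roughly like this:

install.packages("XML")   # one-off: install the package that provides htmlParse()
library(XML)              # load it before calling htmlParse()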

Many thanks, Simon.

Why doesn't it give me the same error when I use "your version"? Instead it gives me a different error. The only difference I see is the &ie=utf-8&oe=utf-8&client=firefox-b part, which differs in your suggested fix:

Your version:
url_to_check <- paste0("https://www.google.com/search?q=",mykeyword[i],"&ie=utf-8&oe=utf- 8&client=firefox-b")

Original version:
url_to_check <- paste0("https://www.google.com/search?q=", mykeyword[1], "&ie=utf-8&oe=utf-8&client=firefox-b")

Many thanks for the help,
Best wishes


library(XML)        # provides htmlParse() and xpathSApply()
library(dplyr)      # mutate(), filter(), bind_rows()
library(httr)       # GET() and user_agent()
library(magrittr)   # %>% and set_colnames()

mykeyword <- c("canape", "ps4", "macbook")

my_user_agent <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0"

PAA <- vector("list", length(mykeyword))
for (i in 1:length(mykeyword)) {
  # Build the search URL for this keyword and fetch the results page
  url_to_check <- paste0("https://www.google.com/search?q=", mykeyword[i], "&ie=utf-8&oe=utf-8&client=firefox-b")
  PAA[[i]] <- GET(url_to_check, user_agent(my_user_agent)) %>%
    htmlParse(encoding = "UTF-8") %>%
    # Extract the text of the "People Also Ask" accordion entries
    xpathSApply("//div[/*]/g-accordion-expander/div/div", xmlValue) %>%
    as.data.frame() %>%
    set_colnames("text") %>%
    mutate(
      keyword = mykeyword[i],
      text = as.character(text)
    ) %>%
    filter(text > "")   # drop empty rows
}
PAA <- dplyr::bind_rows(PAA)   # combine the per-keyword results into one data frame
print(PAA)
#>                                                      text keyword
#> 1                     What are the three types of CANape?  canape
#> 2                                  What does CANape mean?  canape
#> 3                         Why are canapes called canapes?  canape
#> 4  What is difference between canapes and hors d oeuvres?  canape
#> 5                        Is it still worth getting a ps4?     ps4
#> 6                 Is it still worth buying a ps4 in 2019?     ps4
#> 7                         How much does a ps4 cost in NZ?     ps4
#> 8                   Is it OK to leave a ps4 on overnight?     ps4
#> 9                     Which is the cheapest Apple laptop? macbook
#> 10                                 Are Macbooks worth it? macbook
#> 11                       Which Apple laptop should I buy? macbook
#> 12                         Why are macbooks so expensive? macbook

Created on 2020-09-04 by the reprex package (v0.3.0)

Oh, awesome! Thanks so much, genius. I figured after your first response that the XML library was missing, hence the error. I'm assuming the output could be written to a CSV file using PAA = dplyr::bind_cols(PAA) instead?

However, I am curious why it only returns 4 questions per keyword and not a longer list; I don't see the command controlling that.
You're a star! Thanks very much.

I think it's because Google returns 4 PAA by default. These seem to expand when you click on any of the questions (on the Google website), so it's probably a limit they impose. All good!
Thanks again. Your knowledge was useful and is helping me decipher R a little more.

To write it to a csv file you would use write.csv(PAA, "yourfilename.csv").
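
For example (the filename is just a placeholder, and row.names = FALSE keeps R's row numbers out of the file):

write.csv(PAA, "paa_questions.csv", row.names = FALSE)   # writes the text and keyword columns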
