Scraping multiple pages from one site into R (absolute beginner level)

Is there any way to scrape the name and Constituency Office of every MP into R?

How do I scrape each page from this landing page:

https://members.parliament.uk/members/Commons

The Constituency Office address is then found on each MP's profile page; see below for an example:

https://members.parliament.uk/member/4212/contact

Can R scrape all the MPs' pages on the site for the Constituency Office address? And, if so, how?

Many thanks.


Hi @MFPete, for this type of page you need to use RSelenium.

Getting the connection set up is a little tricky the first time, but not impossible.
Here is more info about it:

Post about this

For your request, here is a first pass.

library(RSelenium)
library(XML)
library(dplyr)
library(rvest)

remDr <- remoteDriver(browserName = "chrome",port = 4444, 
                      remoteServerAddr = "localhost")  

remDr$open()
Sys.sleep(1)

remDr$navigate("https://members.parliament.uk/members/Commons")

html <- remDr$getPageSource()[[1]]

url_data1 <- html %>%
  read_html() %>% 
  html_nodes(xpath='//*[@id="main-content"]/div/article/div/div/div[3]/a[1]') %>% 
  html_attr("href");url_data1
#"/member/172/contact"

url_data2 <- html %>%
  read_html() %>% 
  html_nodes(xpath='//*[@id="main-content"]/div/article/div/div/div[3]/a[2]') %>% 
  html_attr("href");url_data2
# "/member/4212/contact"

url_data3 <- html %>%
  read_html() %>% 
  html_nodes(xpath='//*[@id="main-content"]/div/article/div/div/div[3]/a[3]') %>% 
  html_attr("href");url_data3
# "/member/4639/contact"

# but when I try to make a loop for all the posts on page 1, it shows me this error:
for (i in 1:20) {
  url_data <- html %>%
    html_nodes(xpath = paste('//*[@id="main-content"]/div/article/div/div/div[3]/a[', i, ']')) %>% 
    html_attr("href")
  
  Sys.sleep(2)
  
  # data frame
  df <- df %>% bind_rows(data.frame(url_data))
}
# Error in UseMethod("xml_find_all") : 
#   no applicable method for 'xml_find_all' applied to an object of class "character"

# The idea is to repeat this loop for all pages.
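
The error comes from the loop skipping the read_html() step: remDr$getPageSource()[[1]] is a plain character string, and html_nodes() only works on a parsed document, so the string has to go through read_html() first (the three single extractions above do this; the loop does not). The loop also binds onto df before df exists. Here is a minimal sketch of a corrected version; it assumes the listing paginates with a ?page= query parameter and roughly 33 pages for 650 MPs, both of which you should confirm in the browser.

library(RSelenium)
library(rvest)
library(dplyr)

# continues the remDr session opened above
df <- data.frame()   # initialise before binding rows onto it

for (p in 1:33) {    # assumption: ~650 MPs at 20 per page; adjust to the real page count
  # assumption: the listing accepts a ?page= query parameter
  remDr$navigate(paste0("https://members.parliament.uk/members/Commons?page=", p))
  Sys.sleep(2)       # give the page time to render

  page <- remDr$getPageSource()[[1]] %>%
    read_html()      # getPageSource() returns a character string, so parse it first

  url_data <- page %>%
    html_nodes(xpath = '//*[@id="main-content"]/div/article/div/div/div[3]/a') %>%
    html_attr("href")

  df <- df %>% bind_rows(data.frame(url_data))
}

df$url_data then holds the relative /member/<id>/contact paths, which you can paste onto https://members.parliament.uk and visit one at a time for the Constituency Office address.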

Not all members have a Constituency Office address entered, but those that do aren't too hard to root out manually, although there are 650 of them.

It may be possible to script this using the {curl} package with some tweaking of custom handle options. Otherwise, you need to know either the constituency numbers, which are in no regular order, or the member id numbers, which aren't either, and then you can use their API. Maybe outsource it to Mechanical Turk?
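
If you do go the API route, something along these lines may work with {httr} and {jsonlite}. The endpoint paths and parameter names below are my assumption based on the public Members API at https://members-api.parliament.uk; check its documentation for the exact spelling, since only the httr/jsonlite calls themselves are certain here.

library(httr)
library(jsonlite)

# Assumed endpoint: member search, paged with skip/take.
# "House" may need a numeric code instead of "Commons"; verify against the API docs.
res <- GET("https://members-api.parliament.uk/api/Members/Search",
           query = list(House = "Commons", skip = 0, take = 20))
stop_for_status(res)
members <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(members, max.level = 2)  # inspect the structure before relying on field names

# Assumed endpoint for one member's contact details (id 4212 is the example above)
contact <- GET("https://members-api.parliament.uk/api/Members/4212/Contact")
stop_for_status(contact)
str(fromJSON(content(contact, as = "text", encoding = "UTF-8")))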
