Strange results in table() after scraping an HTML doc.

I am scraping an HTML table from a Government of Canada website with {rvest} and I seem to be getting some strange results when I try to create a table.
My first problem was that a straightforward read_html(webpage) call was giving an error.

library(data.table)
library(tidyverse)
library(rvest)

webpage <- "https://ciec-ccie.parl.gc.ca/en/publications/Pages/Travel2023-Deplacements2023.aspx"

read_html(webpage)

Error in open.connection(x, "rb") : 
  SSL peer certificate or SSH remote key was not OK: [ciec-ccie.parl.gc.ca] SSL certificate problem: unable to get local issuer certificate

After poking around, I discovered this code that seems to work, although I am not sure why.

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
library(rvest)

webpage <- "https://ciec-ccie.parl.gc.ca/en/publications/Pages/Travel2023-Deplacements2023.aspx"

content <- webpage %>% 
  httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>% 
  read_html()  

tables <- content %>% html_table(fill = TRUE)

first_table <- tables[[1]]

names(first_table) <-  c("mp", "with", "dest", "purpose", "sponsor", "dates", "benefits", "value", "docs")
 
DT <- as.data.table(first_table)

# Edit to reduce amount of text.
DT[8, sponsor := "State Committee on work with Diaspora of The Republic of Azerbaijan"]

The problem is that if I do

DT1 <- DT[, as.data.table(table(mp))]  # keeps an 'mp' column alongside the counts

I am not getting a complete tabulation. I am getting some collapsing, since the number of rows goes from 93 to 73, but names that look identical are still being counted separately.

If I do this

DT1[2, mp] == DT1[3, mp]

They are not identical! It may be a glitch in the original creation of the HTML doc, but can anyone suggest anything?


As to your first issue, I used read_html() with your URL and had no problems:

webpage <- "https://ciec-ccie.parl.gc.ca/en/publications/Pages/Travel2023-Deplacements2023.aspx"

library(rvest)
read_html(webpage) |> 
  html_elements('table') |> 
  html_table()
#> [[1]]
#> # A tibble: 93 × 9
#>    `​​​​​​​​​​​​​​​​​Name​​ of Member​` Nam​e of person accompanying the Member of…¹ `​​Destination(s)​`
#>    <chr>            <chr>                                       <chr>           
#>  1 ​Aboultaif, Ziad  ​N/A                                         "​London, Englan…
#>  2 ​Aboultaif, Ziad​​  ​N/A                                         "​Tashkent, \r\n…
#>  3 ​Aitchison, Scott ​N/A                                         "​Kenya"         
#>  4 ​Aitchison, Scott ​N/A                                         "​Israel"        
#>  5 ​Arya, Chandra    ​N/A                                         "​Seoul, South K…
#>  6 ​Arya, Chandra    ​N/A                                         "​Taiwan"        
#>  7 ​Arya, Chandra    ​N/A                                         "​Kurdistan Regi…
#>  8 ​Arya, Chandra    ​N/A                                         "​Baku, Azerbaij…
#>  9 ​Arya, Chandra    ​N/A                                         "​Bangkok, Thail…
#> 10 ​Ashton, Niki     ​N/A                                         "​Kalamata and \…
#> # ℹ 83 more rows
#> # ℹ abbreviated name: ¹​`​​Nam​e of person accompanying the Member of Parliament`
#> # ℹ 6 more variables: `​Purpose of the trip` <chr>, `​​Sponsor of the ​trip` <chr>,
#> #   `​​Date(s)` <chr>, `​Nature of Benefits` <chr>, `​Value of Benefits​` <chr>,
#> #   `​Supporting Document​` <chr>
library(purrr)
read_html(webpage) |> 
  html_elements('table') |> 
  html_table() |> 
  pluck(1)
#> # A tibble: 93 × 9
#>    `​​​​​​​​​​​​​​​​​Name​​ of Member​` Nam​e of person accompanying the Member of…¹ `​​Destination(s)​`
#>    <chr>            <chr>                                       <chr>           
#>  1 ​Aboultaif, Ziad  ​N/A                                         "​London, Englan…
#>  2 ​Aboultaif, Ziad​​  ​N/A                                         "​Tashkent, \r\n…
#>  3 ​Aitchison, Scott ​N/A                                         "​Kenya"         
#>  4 ​Aitchison, Scott ​N/A                                         "​Israel"        
#>  5 ​Arya, Chandra    ​N/A                                         "​Seoul, South K…
#>  6 ​Arya, Chandra    ​N/A                                         "​Taiwan"        
#>  7 ​Arya, Chandra    ​N/A                                         "​Kurdistan Regi…
#>  8 ​Arya, Chandra    ​N/A                                         "​Baku, Azerbaij…
#>  9 ​Arya, Chandra    ​N/A                                         "​Bangkok, Thail…
#> 10 ​Ashton, Niki     ​N/A                                         "​Kalamata and \…
#> # ℹ 83 more rows
#> # ℹ abbreviated name: ¹​`​​Nam​e of person accompanying the Member of Parliament`
#> # ℹ 6 more variables: `​Purpose of the trip` <chr>, `​​Sponsor of the ​trip` <chr>,
#> #   `​​Date(s)` <chr>, `​Nature of Benefits` <chr>, `​Value of Benefits​` <chr>,
#> #   `​Supporting Document​` <chr>

Created on 2024-06-19 with reprex v2.0.2

Blast it, I wonder what I have set up that makes my machine give an error. I am getting the same error on another file from the same site (the 2022 report).

I just downloaded a wiki page with no problem.

Thanks for checking.


Update

Some of my problems are coming from simple typing errors and misspellings. I managed to correct a number of problems by simply removing an extra blank space or correcting obvious misspellings (Israel is not spelled "Isarel"), but I am still having problems, especially with the "Name of Member" column.

Thanks

Would you like additional help with that?

I'd love some. It's not a very important project (I'm just doing it out of personal interest), but I'd love to see where I am going wrong, whether the doc is that messed up, or both. From my experience yesterday, it looks like a lot of my current trouble is just inconsistent data entry, but I'm not seeing the problems. It is also obvious that the staff at the Office of the Conflict of Interest and Ethics Commissioner don't know anything about data analysis.

Also, would you have any idea why a straight read_html() is giving me an error while @dromano reports no problem?

Some of the rest of the data is so badly arranged (see the Nature of Benefits and Value of Benefits columns) that it's probably not worth trying to do anything with it. If it were a serious project, I'd re-key everything into a decent, tidy data set.
Anyway, if you run my earlier code you should end up with the initial data set.

I am primarily interested in which MPs went where and what organizations were funding the trips.

I pulled out these columns and my tables were not making sense. What looked like the same MP's name or sponsor name was appearing repeatedly.

I finally opened the file in a spreadsheet and started checking spelling and spacing in the Sponsor column. I managed to reduce, but not eliminate, the duplication. So far I am not having any luck with MPs' names.

In any case, my somewhat cleaned-up data set is below. The code below shows the duplications I am getting in my tables.

Thanks

suppressMessages(library(data.table))
suppressMessages(library(flextable))

DT1 <- DT[, .N, by = sponsor]

TB1 <- flextable(DT1)

TB1 <-  set_header_labels(
  x = TB1 ,
  values = c(
    sponsor = "Sponsor",
    N = "Count")
 )

set_table_properties(TB1, layout = "autofit")


DT2 <- DT[, .N, by = mp]

TB2 <- flextable(DT2)

TB2 <- set_header_labels(
  x = TB2,
  values = c(
    mp = "MP",
    N = "Count")
)

set_table_properties(TB2, layout = "autofit")

Data

DT <- structure(list(mp = c("​Mathyssen, Lindsay", "​Lewis, Leslyn", 
"​Muys, Dan", "​Fast, Ed", "​Chen, Shaun", "​Falk, Rosemarie", 
"​Genuis, Garnett", "​Gazan, Leah", "​Gazan, Leah", "​Falk, Rosemarie", 
"​Ellis, Stephen", "​Lawrence, Philip", "​Stubbs, Shannon", 
"Aboultaif, Ziad​​", "​Patzer, Jeremy", "​Williamson, John", 
"​Lake, Mike", "​Bergeron, Stéphane", "​McLeod, Michael", 
"​Gallant, Cheryl", "​Coteau, Michael", "​ Genuis, Garnett", 
"​Genuis, Garnett", "​Sgro, Judy", "​Boulerice, Alexandre", 
"​McPherson, Heather", "​Cooper, Michael", "​Lattanzio, Patricia", 
"​Sgro, Judy", "​Sgro, Judy", "​Rota, Anthony", "​Arya, Chandra", 
"​Bergeron, Stéphane​", "​Kmiec, Tom", "​Sinclair-Desgagné, Nathalie", 
"​Lewis, Chris", "​Kayabaga, Arielle", "​Kayabaga, Arielle", 
"​Aitchison, Scott", "​Bradford, Valerie", "​Gaheer, Iqwinder", 
"​Melillo, Eric", "​Gallant, Cheryl", "​Lake, Mike", "​Arya, Chandra", 
"​Arya, Chandra", "​Barrett, Michael", "​Bergeron, Stéphane", 
"​Bezan, James", "​Chong, Michael", "​Cooper, Michael", 
"​Dancho, Raquel", "​Gaudreau, Marie-Hélène", "​Genuis, Garnett", 
"​Gill, Marilène", "​Hardie, Ken", "​Lantsman, Melissa", 
"​Mathyssen, Lindsay", "​McKay, John", "​McPherson, Heather", 
"​Sarai, Randeep", "​Seeback, Kyle", "​Martel, Richard", 
"​Schiefke, Peter", "​Aitchison, Scott", "​Berthold, Luc", 
"​Blanchette-Joncas, Maxime", "​Bradford, Valerie", "​Chambers, Adam", 
"​Champoux, Martin", "​Chahal, Harnirjodh (George)", "​Chen, Shaun", 
"​Findlay, Kerry-Lynne", "​Fortin, Rhéal", "​Goodridge, Laila", 
"​Hallan, Jasraj Singh", "​Hepfner, Lisa", "​Kramp-Neuman, Shelby", 
"​Lapointe, Viviane", "​Paul-Hus, Pierre", "​Scheer, Andrew", 
"​Shanahan, Brenda", "​Blois, Kody", "​Blanchet, Yves-François", 
"​Housefather, Anthony", "​Rempel Garner, Michelle", "​Arya, Chandra", 
"​Ehsassi, Ali", "​Hoback, Randy", "​Ashton, Niki", "​Arya, Chandra", 
"​Brunelle-Duceppe, Alexis"), sponsor = c("​Ahmadiyya Muslim Jama'at", 
"​Alliance for Responsible Citizenship (ARC)", "​Belent Mathew", 
"​Canada-DPRK Knowledge Partnership Program", "​Canadian Foodgrains Bank", 
"​Canadian Foodgrains Bank", "​Canadian Foodgrains Bank", 
"​Canadian Union of Postal Workers", "​Canadian Union of Public Employees", 
"​Canadians for Affordable Energy", "​Canadians for Affordable Energy (Dan McTeague)", 
"​Canadians for Affordable Energy (Dan McTeague)", "​Canadians for Affordable Energy (Dan McTeague)", 
"Central Election Commission of the Republic of Uzbekistan", 
"​Church of God Ministries", "​Danube Institute", "​Education Cannot Wait", 
"​Federal Government of Germany", "​Government of Northwest Territories", 
"​Government of Taiwan", "​Indigenous Sport and Wellness Ontario", 
"​Inter-Parliamentary Alliance on China (IPAC)", "​Inter-Parliamentary Alliance on China (IPAC)", 
"​Inter-Parliamentary Alliance on China (IPAC)", "​International Association of Machinists & Aerospace Workers", 
"​International Campaign to Abolish Nuclear Weapons", "​Iran Democratic Association", 
"​Iran Democratic Association", "​Iran Democratic Association", 
"​Iran Democratic Association", "​Italian Ministry of Foreign Affairs", 
"​Kurdistan Regional Government", "​Kurdistan Regional Government", 
"​Kurdistan Regional Government", "​Kurdistan Regional Government", 
"​One Free World International", "​One Young World", "​Open Society Foundations, Unitas Communications", 
"​Results Canada", "​Results Canada", "​Results Canada", 
"​Results Canada", "​Saab Canada Inc.", "​Special Olympics International", 
"State Committee on work with Diaspora of The Republic of Azerbaijan", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​Taipei Economic and Cultural Office in Canada", 
"​Taipei Economic and Cultural Office in Canada", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Centre for Israel and Jewish Affairs (CIJA)", "​The Centre for Israel and Jewish Affairs (CIJA)", 
"​The Greens/EFA in the European Parliament", "​UJA Federation of Greater Toronto", 
"​UJA Federation of Greater Toronto", "​University of British Columbia Knowledge Partnership Program", 
"​University of British Columbia Knowledge Partnership Program​", 
"​University of British Columbia Knowledge Partnership Program", 
"​World Hellenic Inter-Parliamentary Association", "​World Hindu Foundation", 
"​World Uyghur Congress"), counts = c(1L, 1L, NA, NA, NA, NA, 
NA, NA, NA, 1L, 2L, 3L, 4L, 1L, NA, NA, NA, NA, NA, NA, NA, 1L, 
2L, 3L, NA, NA, 1L, 2L, 3L, 4L, NA, 1L, 2L, 3L, 4L, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 
20L, 21L, 1L, NA, NA, NA, NA, NA, NA, NA, NA), dest = c("​United Kingdom", 
"​London, England", "​Kerala, India", "​Seoul, South Korea", 
"​Kenya", "​Kenya", "​Kenya", "​Toronto, Ontario, Canada", 
"​Vancouver, British Columbia, Canada", "​London, England", 
"​London, England", "​London, England", "​London, England", 
"​Tashkent,Uzbekistan", "​Tampa, Florida", "​London, England", 
"​Geneva, Switzerland", "​Munich, Germany", "​Yellowknife, Fort Smith and Hay River, Northwest Territories, Canada", 
"​Taipei, Taiwan", "​Halifax, Nova Scotia, Canada", "​Tokyo, Japan", 
"​Prague, Czech Republic", "​Prague, Czech Republic", "​Hollywood, Maryland", 
"​Tokyo and Hiroshima, Japan", "​Paris, France", "​Paris, France", 
"​Brussels, Belguim", "​Paris, France", "​Pizzo Calabro, Tropea, Catanzaro, Cosenza, Sila, Morano Calabro, Pedace and Pietrafitta, Italy", 
"​Kurdistan Region, Iraq", "​Erbil, Kurdistan Region of Iraq", 
"​Region of Kurdistan, Iraq (Erbil, Slemani, Duhok)", "​Erbil, Kurdistan Region of Iraq", 
"​Iraq", "​Belfast, Ireland", "​San Francisco, California", 
"​Kenya", "​Kenya", "​Kenya", "​Kenya", "​Karlskrona, Sweden", 
"​Berlin, Germany", "​Baku, Azerbaijan", "​Taiwan", "​Taiwan", 
"​Taiwan", "​Taipei, Taiwan", "​Taipei, Taiwan", "​Taipei, Taichung and Nantou, Taiwan", 
"​Taiwan", "​Taiwan", "​Taiwan", "​Taipei, Taiwan", "​Taiwan", 
"​Taipei, Taiwan", "​Taiwan", "​Taipei, Taiwan", "​Taipei, Taiwan", 
"​Taipei, Taiwan", "​Taipei, Taiwan", "​Israel", "​Israel", 
"​Israel", "​Tel Aviv, Israel", "​Israel", "​Israel", 
"​Jerusalem, Tel Aviv and Golan Heights, Israel; Ramallah, Palestine", 
"​Israel", "​Israel", "​Israel", "​Israel", "​Israel", 
"​Jerusalem, Tel Aviv and Golan Heights, Israel; Ramallah, Palestine", 
"​Israel", "​Israel", "​Israel", "​Tel Aviv, Israel", 
"​Israel", "​Jerusalem, Tel Aviv, Golan Heights, Israel; Ramallah, Palestine", 
"​Israel", "​Israel and Palestinian territories", "​Barcelona, Spain", 
"​Israel", "​Israel", "​Seoul, South Korea", "​Republic of Korea", 
"​South Korea", "​Kalamata and  Athens, Greece", "​Bangkok, Thailand", 
"​Tokyo, Japan​")), class = "data.frame", row.names = c(NA, 
-92L))

The issue you had was that the code querying the website on your behalf has guardrails to protect you from potentially malicious or untrustworthy sites. There is some issue with the SSL certificate of that domain; that's potentially something the site owners/managers would want to address so that visitors can have confidence in it.
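
If you do need that escape hatch, the same thing is available through the newer {httr2} interface; this is just a sketch assuming the webpage URL defined earlier, and note that disabling peer verification trades away exactly the protection the error was warning you about.

library(httr2)
library(rvest)

tables <- request(webpage) |>
  req_options(ssl_verifypeer = 0) |>  # curl option: skip certificate checks
  req_perform() |>
  resp_body_html() |>
  html_table()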

I can only speculate that perhaps dromano has some configuration that lessens that protection and interacts with websites more freely than a default setup would.

Some of the difficulty you are having matching up the names is that there is some 'poisoning' with invisible, i.e. non-printable, characters. In particular, I spotted a lot of ZWSP (Zero Width Space) characters.
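
You can confirm the poisoning from R itself; a quick check, using the DT built from your code above:

# Inspect the raw code points of a suspect cell. 8203 is U+200B, the
# Zero Width Space: it renders as nothing but breaks equality tests,
# which is why two visually identical names compare as different.
utf8ToInt(DT[1, mp])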

Try an initial pass where you keep only alphanumerics, punctuation, and ordinary whitespace, something like:

library(dplyr)   # mutate(), across()
library(stringr) # str_remove_all()

DT <- mutate(DT,
             across(where(is.character),
                    \(x) str_remove_all(x, "[^[:alnum:][:punct:]\\s]")))
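
If the cleanup worked, re-running your tabulation should now collapse the look-alike names; a quick check:

DT <- as.data.table(DT)  # in case mutate() dropped the data.table class
DT[, .N, by = mp][order(-N)]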

Notepad++ is a useful utility here, as it has an option to display symbols for non-printable characters like these.

Further ideas: if there are issues from manual typing, where typos cause names not to match strictly across datasets, string-distance measurement can be a good way to gather match candidates and accept or reject them in a systematic way. R has a good {stringdist} package.
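
Here is a minimal sketch of that idea; stringdistmatrix() is the package's real workhorse, but the threshold of 2 edits below is just a starting guess you would tune:

library(stringdist)

# pairwise edit distances between the unique sponsor names
sponsors <- unique(DT$sponsor)
d <- stringdistmatrix(sponsors, sponsors, method = "osa")

# list the pairs within 2 edits of each other, excluding self-matches
close <- which(d > 0 & d <= 2, arr.ind = TRUE)
close <- close[close[, 1] < close[, 2], , drop = FALSE]
data.frame(a = sponsors[close[, 1]], b = sponsors[close[, 2]])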

Great solution. I was not allowing for invisible characters. I am still getting a couple of duplicates in the Sponsor column, but I am assuming those are typos. The mp column, so far, looks clean.

I don't think I have ever come across the {stringdist} package before.

It looks like this 2023 report may be a freak. I ran the same code on the 2022 report with no problems at all.

What's this "Notepad++"? I'm on Linux. Maybe Emacs will serve.

Thanks very much.

Someone suggested notepadqq might be a drop-in replacement for Notepad++. But any text editor that can make invisible symbols visible in text would do for this job.
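
You can also hunt for them from R itself on any platform; "travel2023.csv" here is just a stand-in for wherever you saved the raw data:

# flag the lines that contain a Zero Width Space (U+200B)
lines <- readLines("travel2023.csv")
which(grepl("\u200b", lines, fixed = TRUE))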

Ah, gedit may do it.

Thanks again.

Could it be a version issue that was patched? My version of rvest is

> packageVersion('rvest')
[1] ‘1.0.4’

Same version here:

>  packageVersion('rvest')
[1] ‘1.0.4’

Thanks.

I have no clue what might be different, then!

I'll have to live with it.

Thanks for checking.

