Strange results in table() after scraping an HTML doc.

I am scraping an HTML table from a Government of Canada website with {rvest} and I seem to be getting some strange results when I try to create a table.
My first problem was that a straightforward read_html(webpage) call was giving an error.

library(data.table)
library(tidyverse)
library(rvest)

webpage <- "https://ciec-ccie.parl.gc.ca/en/publications/Pages/Travel2023-Deplacements2023.aspx"

read_html(webpage)

Error in open.connection(x, "rb") : 
  SSL peer certificate or SSH remote key was not OK: [ciec-ccie.parl.gc.ca] SSL certificate problem: unable to get local issuer certificate

After poking around, I discovered the following code, which seems to work, although I am not sure why.

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
library(rvest)

webpage <- "https://ciec-ccie.parl.gc.ca/en/publications/Pages/Travel2023-Deplacements2023.aspx"

content <- webpage %>% 
  httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>% 
  read_html()  

tables <- content %>% html_table(fill = TRUE)

first_table <- tables[[1]]

names(first_table) <-  c("mp", "with", "dest", "purpose", "sponsor", "dates", "benefits", "value", "docs")
 
DT <- as.data.table(first_table)

# Edit to reduce amount of text.
DT[8, sponsor := "State Committee on work with Diaspora of The Republic of Azerbaijan"]

The problem is that if I do

DT1 <- DT[, .(table(mp))]

I am not getting a complete tabulation. Some aggregation is happening, since the number of rows goes from 93 to 73, but names that should be counted together are still showing up as separate rows.

If I do this

DT1[2, mp] == DT1[3, mp]

They are not identical, even though they print the same! It may be a glitch in how the original HTML document was created, but can anyone suggest anything?
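
For reference, this is the sort of comparison I mean, a sketch run against the DT1 above:

# The two entries print identically, but counting and inspecting the
# underlying characters shows whether they really are the same string.
nchar(DT1[2, mp])
nchar(DT1[3, mp])
utf8ToInt(DT1[2, mp])
utf8ToInt(DT1[3, mp])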


As to your first issue, I used read_html() with your URL and had no problems:

webpage <- "https://ciec-ccie.parl.gc.ca/en/publications/Pages/Travel2023-Deplacements2023.aspx"

library(rvest)
read_html(webpage) |> 
  html_elements('table') |> 
  html_table()
#> [[1]]
#> # A tibble: 93 × 9
#>    `Name of Member` Name of person accompanying the Member of…¹ `Destination(s)`
#>    <chr>            <chr>                                        <chr>           
#>  1 Aboultaif, Ziad  N/A                                          "London, Englan…
#>  2 Aboultaif, Ziad  N/A                                          "Tashkent, \r\n…
#>  3 Aitchison, Scott N/A                                          "Kenya"         
#>  4 Aitchison, Scott N/A                                          "Israel"        
#>  5 Arya, Chandra    N/A                                          "Seoul, South K…
#>  6 Arya, Chandra    N/A                                          "Taiwan"        
#>  7 Arya, Chandra    N/A                                          "Kurdistan Regi…
#>  8 Arya, Chandra    N/A                                          "Baku, Azerbaij…
#>  9 Arya, Chandra    N/A                                          "Bangkok, Thail…
#> 10 Ashton, Niki     N/A                                          "Kalamata and \…
#> # ℹ 83 more rows
#> # ℹ abbreviated name: ¹ `Name of person accompanying the Member of Parliament`
#> # ℹ 6 more variables: `Purpose of the trip` <chr>, `Sponsor of the trip` <chr>,
#> #   `Date(s)` <chr>, `Nature of Benefits` <chr>, `Value of Benefits` <chr>,
#> #   `Supporting Document` <chr>
library(purrr)
read_html(webpage) |> 
  html_elements('table') |> 
  html_table() |> 
  pluck(1)
#> # A tibble: 93 × 9
#>    `Name of Member` Name of person accompanying the Member of…¹ `Destination(s)`
#>    <chr>            <chr>                                        <chr>           
#>  1 Aboultaif, Ziad  N/A                                          "London, Englan…
#>  2 Aboultaif, Ziad  N/A                                          "Tashkent, \r\n…
#>  3 Aitchison, Scott N/A                                          "Kenya"         
#>  4 Aitchison, Scott N/A                                          "Israel"        
#>  5 Arya, Chandra    N/A                                          "Seoul, South K…
#>  6 Arya, Chandra    N/A                                          "Taiwan"        
#>  7 Arya, Chandra    N/A                                          "Kurdistan Regi…
#>  8 Arya, Chandra    N/A                                          "Baku, Azerbaij…
#>  9 Arya, Chandra    N/A                                          "Bangkok, Thail…
#> 10 Ashton, Niki     N/A                                          "Kalamata and \…
#> # ℹ 83 more rows
#> # ℹ abbreviated name: ¹ `Name of person accompanying the Member of Parliament`
#> # ℹ 6 more variables: `Purpose of the trip` <chr>, `Sponsor of the trip` <chr>,
#> #   `Date(s)` <chr>, `Nature of Benefits` <chr>, `Value of Benefits` <chr>,
#> #   `Supporting Document` <chr>

Created on 2024-06-19 with reprex v2.0.2

Blast it, I wonder what I have set up that makes my machine give an error. I am getting the same error on another file from the same site (the 2022 report).

I just downloaded a wiki page with no problem.

Thanks for checking.


Update

Some of my problems are coming from simple typing errors and misspellings. I managed to correct a number of problems by simply removing an extra blank space or correcting obviously misspelt words (Israel is not spelt Isarel, for example), but I am still having problems, especially with the "Name of Member" column.
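
For reference, the sort of clean-up I have been doing looks roughly like this (a sketch; the Isarel fix is just the example above, and I am assuming it sits in the dest column):

library(data.table)
library(stringr)

# Squeeze stray/duplicated whitespace out of every character column,
# then patch the obvious misspellings spotted by eye.
chr_cols <- names(DT)[sapply(DT, is.character)]
DT[, (chr_cols) := lapply(.SD, str_squish), .SDcols = chr_cols]
DT[dest == "Isarel", dest := "Israel"]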

Thanks

Would you like additional help with that?

I'd love some. It's not a very important project (I'm just doing it out of personal interest) but I'd love to see where I am going wrong, whether the doc is that messed up, or both. From my experience yesterday, it looks like a lot of my current trouble is just inconsistent data input, but I'm not seeing the problems. It is also obvious that the staff at the Office of the Conflict of Interest and Ethics Commissioner don't know anything about data analysis.

Also, would you have any idea why a straight read_html() is giving me an error while @dromano reports no problem?

Some of the rest of the data is so badly arranged (see the Nature of Benefits and Amount columns) that it's probably not worth trying to do anything with it. If it were a serious project, I'd re-key everything into a decent, tidy data set.
Anyway, if you run my earlier code you should end up with the initial data set.

I am primarily interested in which MPs went where and what organizations were funding the trips.
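
As a rough sketch, the sort of summary I am ultimately after is just:

# Trips per MP/sponsor pair, most frequent first, and a simple
# MP-by-sponsor count.
DT[, .N, by = .(mp, sponsor)][order(-N)]
dcast(DT, mp ~ sponsor, value.var = "dest", fun.aggregate = length)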

I pulled out these columns and my tables were not making sense. What looked like the same MP's name or sponsor name was appearing repeatedly.

I finally opened the file in a spreadsheet and started checking spelling and spacing in the Sponsor column. I managed to reduce, but not totally eliminate, the duplication. So far I am not having any luck with the MPs' names.

In any case, my somewhat cleaned-up data set is below. The code below shows the duplications I am getting in my tables.

Thanks

suppressMessages(library(data.table))
suppressMessages(library(flextable))

DT1 <- DT[, .N, by = sponsor]

TB1 <- flextable(DT1)

TB1 <-  set_header_labels(
  x = TB1 ,
  values = c(
    sponsor = "Sponsor",
    N = "Count")
 )

set_table_properties(TB1, layout = "autofit")


DT2 <- DT[, .N, by = mp]

TB2 <- flextable(DT2)

TB2 <-  set_header_labels(
  x = TB2,
  values = c(
    mp = "MP",
    N = "Count")
 )

set_table_properties(TB2, layout = "autofit")
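
This is the check I have been using to surface the near-duplicate names (a sketch; it only catches case and spacing differences):

library(stringr)

# Group on a case- and whitespace-normalised key and list the raw spellings
# that collapse onto it; any key with more than one variant is suspect.
DT[, .(variants = uniqueN(mp), spellings = list(unique(mp))),
   by = .(key = str_squish(str_to_lower(mp)))][variants > 1]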

Data

DT <- structure(list(mp = c("\u200bMathyssen, Lindsay", "\u200bLewis, Leslyn", 
"\u200bMuys, Dan", "\u200bFast, Ed", "\u200bChen, Shaun", "\u200bFalk, Rosemarie", 
"\u200bGenuis, Garnett", "\u200bGazan, Leah", "\u200bGazan, Leah", "\u200bFalk, Rosemarie", 
"\u200bEllis, Stephen", "\u200bLawrence, Philip", "\u200bStubbs, Shannon", 
"Aboultaif, Ziad\u200b\u200b", "\u200bPatzer, Jeremy", "\u200bWilliamson, John", 
"\u200bLake, Mike", "\u200bBergeron, Stéphane", "\u200bMcLeod, Michael", 
"\u200bGallant, Cheryl", "\u200bCoteau, Michael", "\u200b Genuis, Garnett", 
"\u200bGenuis, Garnett", "\u200bSgro, Judy", "\u200bBoulerice, Alexandre", 
"\u200bMcPherson, Heather", "\u200bCooper, Michael", "\u200bLattanzio, Patricia", 
"\u200bSgro, Judy", "\u200bSgro, Judy", "\u200bRota, Anthony", "\u200bArya, Chandra", 
"\u200bBergeron, Stéphane\u200b", "\u200bKmiec, Tom", "\u200bSinclair-Desgagné, Nathalie", 
"\u200bLewis, Chris", "\u200bKayabaga, Arielle", "\u200bKayabaga, Arielle", 
"\u200bAitchison, Scott", "\u200bBradford, Valerie", "\u200bGaheer, Iqwinder", 
"\u200bMelillo, Eric", "\u200bGallant, Cheryl", "\u200bLake, Mike", "\u200bArya, Chandra", 
"\u200bArya, Chandra", "\u200bBarrett, Michael", "\u200bBergeron, Stéphane", 
"\u200bBezan, James", "\u200bChong, Michael", "\u200bCooper, Michael", 
"\u200bDancho, Raquel", "\u200bGaudreau, Marie-Hélène", "\u200bGenuis, Garnett", 
"\u200bGill, Marilène", "\u200bHardie, Ken", "\u200bLantsman, Melissa", 
"\u200bMathyssen, Lindsay", "\u200bMcKay, John", "\u200bMcPherson, Heather", 
"\u200bSarai, Randeep", "\u200bSeeback, Kyle", "\u200bMartel, Richard", 
"\u200bSchiefke, Peter", "\u200bAitchison, Scott", "\u200bBerthold, Luc", 
"\u200bBlanchette-Joncas, Maxime", "\u200bBradford, Valerie", "\u200bChambers, Adam", 
"\u200bChampoux, Martin", "\u200bChahal, Harnirjodh (George)", "\u200bChen, Shaun", 
"\u200bFindlay, Kerry-Lynne", "\u200bFortin, Rhéal", "\u200bGoodridge, Laila", 
"\u200bHallan, Jasraj Singh", "\u200bHepfner, Lisa", "\u200bKramp-Neuman, Shelby", 
"\u200bLapointe, Viviane", "\u200bPaul-Hus, Pierre", "\u200bScheer, Andrew", 
"\u200bShanahan, Brenda", "\u200bBlois, Kody", "\u200bBlanchet, Yves-François", 
"\u200bHousefather, Anthony", "\u200bRempel Garner, Michelle", "\u200bArya, Chandra", 
"\u200bEhsassi, Ali", "\u200bHoback, Randy", "\u200bAshton, Niki", "\u200bArya, Chandra", 
"\u200bBrunelle-Duceppe, Alexis"), sponsor = c("\u200bAhmadiyya Muslim Jama'at", 
"\u200bAlliance for Responsible Citizenship (ARC)", "\u200bBelent Mathew", 
"\u200bCanada-DPRK Knowledge Partnership Program", "\u200bCanadian Foodgrains Bank", 
"\u200bCanadian Foodgrains Bank", "\u200bCanadian Foodgrains Bank", 
"\u200bCanadian Union of Postal Workers", "\u200bCanadian Union of Public Employees", 
"\u200bCanadians for Affordable Energy", "\u200bCanadians for Affordable Energy (Dan McTeague)", 
"\u200bCanadians for Affordable Energy (Dan McTeague)", "\u200bCanadians for Affordable Energy (Dan McTeague)", 
"Central Election Commission of the Republic of Uzbekistan", 
"\u200bChurch of God Ministries", "\u200bDanube Institute", "\u200bEducation Cannot Wait", 
"\u200bFederal Government of Germany", "\u200bGovernment of Northwest Territories", 
"\u200bGovernment of Taiwan", "\u200bIndigenous Sport and Wellness Ontario", 
"\u200bInter-Parliamentary Alliance on China (IPAC)", "\u200bInter-Parliamentary Alliance on China (IPAC)", 
"\u200bInter-Parliamentary Alliance on China (IPAC)", "\u200bInternational Association of Machinists & Aerospace Workers", 
"\u200bInternational Campaign to Abolish Nuclear Weapons", "\u200bIran Democratic Association", 
"\u200bIran Democratic Association", "\u200bIran Democratic Association", 
"\u200bIran Democratic Association", "\u200bItalian Ministry of Foreign Affairs", 
"\u200bKurdistan Regional Government", "\u200bKurdistan Regional Government", 
"\u200bKurdistan Regional Government", "\u200bKurdistan Regional Government", 
"\u200bOne Free World International", "\u200bOne Young World", "\u200bOpen Society Foundations, Unitas Communications", 
"\u200bResults Canada", "\u200bResults Canada", "\u200bResults Canada", 
"\u200bResults Canada", "\u200bSaab Canada Inc.", "\u200bSpecial Olympics International", 
"State Committee on work with Diaspora of The Republic of Azerbaijan", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bTaipei Economic and Cultural Office in Canada", 
"\u200bTaipei Economic and Cultural Office in Canada", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Centre for Israel and Jewish Affairs (CIJA)", "\u200bThe Centre for Israel and Jewish Affairs (CIJA)", 
"\u200bThe Greens/EFA in the European Parliament", "\u200bUJA Federation of Greater Toronto", 
"\u200bUJA Federation of Greater Toronto", "\u200bUniversity of British Columbia Knowledge Partnership Program", 
"\u200bUniversity of British Columbia Knowledge Partnership Program\u200b", 
"\u200bUniversity of British Columbia Knowledge Partnership Program", 
"\u200bWorld Hellenic Inter-Parliamentary Association", "\u200bWorld Hindu Foundation", 
"\u200bWorld Uyghur Congress"), counts = c(1L, 1L, NA, NA, NA, NA, 
NA, NA, NA, 1L, 2L, 3L, 4L, 1L, NA, NA, NA, NA, NA, NA, NA, 1L, 
2L, 3L, NA, NA, 1L, 2L, 3L, 4L, NA, 1L, 2L, 3L, 4L, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 
20L, 21L, 1L, NA, NA, NA, NA, NA, NA, NA, NA), dest = c("\u200bUnited Kingdom", 
"\u200bLondon, England", "\u200bKerala, India", "\u200bSeoul, South Korea", 
"\u200bKenya", "\u200bKenya", "\u200bKenya", "\u200bToronto, Ontario, Canada", 
"\u200bVancouver, British Columbia, Canada", "\u200bLondon, England", 
"\u200bLondon, England", "\u200bLondon, England", "\u200bLondon, England", 
"\u200bTashkent,Uzbekistan", "\u200bTampa, Florida", "\u200bLondon, England", 
"\u200bGeneva, Switzerland", "\u200bMunich, Germany", "\u200bYellowknife, Fort Smith and Hay River, Northwest Territories, Canada", 
"\u200bTaipei, Taiwan", "\u200bHalifax, Nova Scotia, Canada", "\u200bTokyo, Japan", 
"\u200bPrague, Czech Republic", "\u200bPrague, Czech Republic", "\u200bHollywood, Maryland", 
"\u200bTokyo and Hiroshima, Japan", "\u200bParis, France", "\u200bParis, France", 
"\u200bBrussels, Belguim", "\u200bParis, France", "\u200bPizzo Calabro, Tropea, Catanzaro, Cosenza, Sila, Morano Calabro, Pedace and Pietrafitta, Italy", 
"\u200bKurdistan Region, Iraq", "\u200bErbil, Kurdistan Region of Iraq", 
"\u200bRegion of Kurdistan, Iraq (Erbil, Slemani, Duhok)", "\u200bErbil, Kurdistan Region of Iraq", 
"\u200bIraq", "\u200bBelfast, Ireland", "\u200bSan Francisco, California", 
"\u200bKenya", "\u200bKenya", "\u200bKenya", "\u200bKenya", "\u200bKarlskrona, Sweden", 
"\u200bBerlin, Germany", "\u200bBaku, Azerbaijan", "\u200bTaiwan", "\u200bTaiwan", 
"\u200bTaiwan", "\u200bTaipei, Taiwan", "\u200bTaipei, Taiwan", "\u200bTaipei, Taichung and Nantou, Taiwan", 
"\u200bTaiwan", "\u200bTaiwan", "\u200bTaiwan", "\u200bTaipei, Taiwan", "\u200bTaiwan", 
"\u200bTaipei, Taiwan", "\u200bTaiwan", "\u200bTaipei, Taiwan", "\u200bTaipei, Taiwan", 
"\u200bTaipei, Taiwan", "\u200bTaipei, Taiwan", "\u200bIsrael", "\u200bIsrael", 
"\u200bIsrael", "\u200bTel Aviv, Israel", "\u200bIsrael", "\u200bIsrael", 
"\u200bJerusalem, Tel Aviv and Golan Heights, Israel; Ramallah, Palestine", 
"\u200bIsrael", "\u200bIsrael", "\u200bIsrael", "\u200bIsrael", "\u200bIsrael", 
"\u200bJerusalem, Tel Aviv and Golan Heights, Israel; Ramallah, Palestine", 
"\u200bIsrael", "\u200bIsrael", "\u200bIsrael", "\u200bTel Aviv, Israel", 
"\u200bIsrael", "\u200bJerusalem, Tel Aviv, Golan Heights, Israel; Ramallah, Palestine", 
"\u200bIsrael", "\u200bIsrael and Palestinian territories", "\u200bBarcelona, Spain", 
"\u200bIsrael", "\u200bIsrael", "\u200bSeoul, South Korea", "\u200bRepublic of Korea", 
"\u200bSouth Korea", "\u200bKalamata and  Athens, Greece", "\u200bBangkok, Thailand", 
"\u200bTokyo, Japan\u200b")), class = "data.frame", row.names = c(NA, 
-92L))

The issue you had was that the code you ran to query the website on your behalf has guardrails that protect you from potentially malicious or untrustworthy sites. There is some issue with the SSL certificate of that domain; that is something the site owners/managers would presumably want to address so that visitors can have confidence in the site.

I can only speculate that perhaps dromano has some configuration that relaxes that protection and interacts with websites more freely than a default configuration would.
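
If you want to confirm it is the certificate check, one option (not something to leave on permanently, and essentially what your httr::GET() workaround already does) is to relax verification for the whole session; a sketch, reusing your webpage URL from above:

library(httr)
library(rvest)

# Skip peer-certificate verification for subsequent httr requests in this
# session only, then fetch and parse the page as before.
httr::set_config(httr::config(ssl_verifypeer = FALSE))
page <- read_html(GET(webpage))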

Some of the difficulty you have matching up the names is that there is some 'poisoning' with invisible, i.e. non-printable, characters; in particular, I spotted a lot of ZWSP (zero-width space) characters.

Try an initial pass where you keep only alphanumerics, punctuation and ordinary whitespace, something like:

library(dplyr); library(stringr)

# Keep only letters, digits, punctuation and whitespace; anything else
# (including zero-width spaces) is stripped out.
DT <- mutate(DT,
             across(where(is.character),
                    \(x) str_remove_all(x, "[^[:alnum:][:punct:]\\s]")))

Notepad++ is a useful utility here, as it has an option to show symbols for non-printable characters like these.
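
If you would rather stay in R, you can also count how widespread the problem is directly, something along these lines:

library(dplyr)
library(stringr)

# For each character column, how many values contain anything outside
# letters, digits, punctuation and ordinary whitespace?
DT %>%
  summarise(across(where(is.character),
                   \(x) sum(str_detect(x, "[^[:alnum:][:punct:]\\s]"))))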

A further idea: if there are issues from manual typing, where typos cause names not to match strictly across datasets, string distance measurement can be a good way to gather match candidates and accept or reject them in a systematic way. R has a good stringdist package.
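
A sketch of how that could look with stringdist on your mp column; the distance threshold of 2 is arbitrary and would need tuning:

library(stringdist)

# Pairwise distances between distinct MP names; small, non-zero distances
# are candidate typo/variant pairs worth reviewing by hand.
nms  <- unique(DT$mp)
d    <- stringdistmatrix(nms, nms, method = "osa")
hits <- which(d > 0 & d <= 2, arr.ind = TRUE)
hits <- hits[hits[, 1] < hits[, 2], , drop = FALSE]
data.frame(name_a = nms[hits[, 1]], name_b = nms[hits[, 2]])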

Great solution. I was not allowing for invisible characters. I am still getting a couple of duplicates in the Sponsor column, but I am assuming those are typos. The mp column, so far, looks clean.

I don't think I have ever seen the stringdist function.

It looks like this 2023 report may be a freak. I ran the same code on the 2022 report with no problems at all.

What's this "Notepad++"? I'm on Linux. Maybe Emacs will serve.

Thanks very much.

Someone suggested notepadqq might be a drop-in replacement for Notepad++. But any text editor that can make invisible symbols in text visible would do for this job.

Ah, gedit may do it.

Thanks again.

Could it be a version issue that was patched? My version of rvest is

> packageVersion('rvest')
[1] ‘1.0.4’

Same package

>  packageVersion('rvest')
[1] ‘1.0.4’

Thanks.

I have no clue what might be different, then!

I'll have to live with it.

Thanks for checking.

