Newbie here. I have run into a problem scraping an HTML table with nested column headers.
The table is from the immigration department of Hong Kong.
A screenshot is shown here:
I tried to do it with rvest, but the result is messy.
library(rvest)
library(tidyverse)  # also attaches dplyr and stringr
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
url_data %>%
  read_html()
css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
immiTable
My goal is to extract the first row (i.e. Airport) and plot it as a pie chart, and also to produce a data frame of the whole table and save it to Excel.
Teaching material on unnesting and scraping nested tables seems rather scarce, so I would appreciate your guidance. Thank you very much for your help.
Hi @ronzenith, good work.
I use an XPath expression to select the table. I prefixed the Arrival variables with A_ and the Departure variables with D_.
library(rvest)
library(tidyverse)  # also attaches dplyr and stringr
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
url_data2 <- url_data %>%
  read_html() %>%
  html_nodes(xpath = "/html/body/section[2]/div/div/div/div/table/tbody") %>%
  html_table()
url_data2 <- url_data2[[1]]
# Drop the columns that are not needed (the raw table contains many extra ones)
url_data2 <- url_data2[ , -c(1:3, 5, 10)]
# Rename the remaining columns: A_ = Arrival, D_ = Departure
names(url_data2) <- c(
  'Variables',
  'A_Hong_Kong_Residents', 'A_Mainland_Visitors', 'A_Other_Visitors', 'A_Total',
  'D_Hong_Kong_Residents', 'D_Mainland_Visitors', 'D_Other_Visitors', 'D_Total'
)
View(url_data2)
# A tibble: 16 × 9
#    Variables                       A_Hong_Kong_Re…¹  A_Mai…²  A_Oth…³  A_Total  D_Hon…⁴  D_Mai…⁵  D_Oth…⁶  D_Total
#    <chr>                           <chr>             <chr>    <int>    <chr>    <chr>    <chr>    <int>    <chr>
#  1 Airport                         4,258             1,488    422      6,168    3,775    1,154    315      5,244
#  2 Express Rail Link West Kowloon  0                 0        0        0        0        0        0        0
#  3 Hung Hom                        0                 0        0        0        0        0        0        0
#  4 Lo Wu                           0                 0        0        0        0        0        0        0
#  5 Lok Ma Chau Spur Line           0                 0        0        0        0        0        0        0
#  6 Heung Yuen Wai                  0                 0        0        0        0        0        0        0
#  7 Hong Kong-Zhuhai-Macao Bridge   333               28       39       400      243      194      15       452
#  8 Lok Ma Chau                     0                 0        0        0        0        0        0        0
#  9 Man Kam To                      0                 0        0        0        0        0        0        0
# 10 Sha Tau Kok                     0                 0        0        0        0        0        0        0
# 11 Shenzhen Bay                    3,404             348      37       3,789    1,301    524      28       1,853
# 12 China Ferry Terminal            0                 0        0        0        0        0        0        0
# 13 Harbour Control                 0                 0        0        0        0        0        0        0
# 14 Kai Tak Cruise Terminal         0                 0        0        0        0        0        0        0
# 15 Macau Ferry Terminal            0                 0        0        0        0        0        0        0
# 16 Total                           7,995             1,864    498      10,357   5,319    1,872    358      7,549
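To get from this table to the original goal (a pie chart of the Airport row and a file you can open in Excel), something like the following might work. It is only a sketch under a few assumptions: the small `immi` data frame is typed in by hand from two rows of the output above so the example runs without the website (in practice you would use the scraped `url_data2`), `to_number` is a hypothetical helper name, and the Excel step assumes the `writexl` package is installed, so it is skipped otherwise.

```r
# Illustrative rows copied from the printed output above; in practice this
# would be the scraped table (url_data2) with all 16 rows and 9 columns.
immi <- data.frame(
  Variables             = c("Airport", "Shenzhen Bay"),
  A_Hong_Kong_Residents = c("4,258", "3,404"),
  A_Mainland_Visitors   = c("1,488", "348"),
  A_Other_Visitors      = c(422L, 37L),
  A_Total               = c("6,168", "3,789"),
  stringsAsFactors      = FALSE
)

# Hypothetical helper: strip thousands separators and convert to numeric.
to_number <- function(x) as.numeric(gsub(",", "", as.character(x)))

# Convert every column except the label column.
immi[-1] <- lapply(immi[-1], to_number)

# Arrival breakdown for the Airport row, leaving out the total.
arrival_cols <- c("A_Hong_Kong_Residents", "A_Mainland_Visitors", "A_Other_Visitors")
airport <- unlist(immi[immi$Variables == "Airport", arrival_cols])

# Base-graphics pie chart; the slice labels come from the vector names.
pie(airport, main = "Airport arrivals")

# Save the cleaned table. write.csv() is base R and the file opens in Excel.
csv_path <- file.path(tempdir(), "immigration.csv")
write.csv(immi, csv_path, row.names = FALSE)

# Optional: a real .xlsx file, only if the writexl package is available.
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(immi, file.path(tempdir(), "immigration.xlsx"))
}
```

The comma removal matters because columns such as `A_Total` come back as character strings like "6,168", which `pie()` cannot plot directly. For the departure side, the same steps should apply with the `D_` columns.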