What is the correct approach to scraping expandable lists and storing them in a table?

kunal2989 · January 6, 2024, 11:10pm

Hi, I am trying to scrape a list of institutions, each of which can be expanded to show the courses that institution offers. I am trying to accomplish this through rvest. Here is the web page

So far, I can extract the list of institutions, list of course names, and list of course links separately. It is easy to join course names and links into a table as there is a 1:1 relationship. But it isn't easy to do so with Institution names as an institute can contain multiple courses.

I am looking for a way to extract all of this information in such a way that it maintains the relationship between institutions and courses, and then store all of this in a table with correct institute-course pairings.

I also want to know if this is even the best approach for this. Should I not look into rvest but packages like dplyr or stringr to create a table once I have all the data?

Eventually, I am trying to create an interactive visualization to navigate all the course offerings by these institutions under this grant.

Any help is appreciated. Here is the code I have been using so far:

library(rvest)
library(dplyr)
library(stringr)  # needed for str_squish(), str_split() and boundary()

link <- "https://www.educationplannerbc.ca/future-skills-grant"
page <- read_html(link) 

Institution_names <- page |> 
  html_nodes("h4") |> 
  html_text(trim = T) |>
  str_squish() |> 
  str_split(boundary("sentence"))

Course_names <- page |> 
  html_nodes(".undefined li") |> 
  html_text(trim = T) |>
  str_squish() 

Course_links  <- page |>
  html_nodes(".undefined a") |> 
  html_attr("href") |> 
  str_squish()

Course_info <- bind_cols(Course_names, Course_links)
Course_info

technocrat · January 7, 2024, 1:18am

The problem arises in part because two of the variables are vectors, but Institution_names is a list. unlist() will turn it into a vector, but then there is the problem that two of the vectors have length 104 while Institution_names has length 10.

104 %% 10
#[1] 4

The courses don't divide evenly among the institutions, so at least one institution has a different number of courses and course links than the others. To overcome this, we need to know how many courses each institution should have.

Do you have that information?
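For illustration, here is a minimal sketch of how known per-institution counts could be used to line the vectors up. The institution names, course names, and counts below are made-up stand-ins, not values from the page; with the real data, `counts` would have to be supplied by hand.

```r
# Toy stand-ins for the scraped objects (hypothetical values)
institution_names <- list("Institute A", "Institute B", "Institute C")
course_names <- c("Course 1", "Course 2", "Course 3", "Course 4", "Course 5")

# unlist() flattens the list into a plain character vector
institutions <- unlist(institution_names)

# Hypothetical number of courses per institution, in page order;
# the counts must add up to the number of courses
counts <- c(2, 1, 2)
stopifnot(sum(counts) == length(course_names))

# Repeat each institution once per course so both columns have equal length
paired <- data.frame(
  institution = rep(institutions, times = counts),
  course      = course_names
)
paired
```

This only works if the courses come back in the same order as the institutions appear on the page, which is an assumption worth checking against the site.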

kunal2989 · January 7, 2024, 2:04am

Yes, I do. But I was hoping for a way to get that information (what course belongs to what institute) through rvest while doing scraping. All the institutes will have different numbers of courses.

If it isn't possible or too convoluted, I can always put in the information manually and create a data frame accordingly after scraping all the data.

And thanks for pointing out the issue with Institution_names being a list.

technocrat · January 7, 2024, 7:56am

{rvest} can extract tagged entries, but it probably can't map those to other tagged entries (I'd rate the odds of failure high enough to look for other options before trying).

It looks like you will be able to exploit the coding of the Course_links entries, which appear to abbreviate college names:

"/institutions/5B7C9834-D996-4A2F-A947-8F713AD8019D/programs/BCIT-agile-development"

I assume this is "British Columbia Institute of Technology"

So, what's needed is a mapping from the token "BCIT" to "British Columbia Institute of Technology". Throw away the existing Institution_names list and create an entry with the abbreviation in its place. You can go back later and expand the abbreviations easily enough.

Begin by creating a data frame

d <- data.frame(
        uni    = rep(NA_character_, length(Course_names)),  # filled in below
        course = Course_names,
        link   = Course_links)

Then populate `uni` by plucking the abbreviation like this

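a <- Course_links[1]
fore <- ".*programs."  # everything up to and including "/programs/"
aft <- "-.*$"          # from the first hyphen to the end of the string
kill <- ""
# strip the prefix, then the suffix; for the link shown above this leaves "BCIT"
gsub(fore, kill, a) |> gsub(aft, kill, x = _)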

Make that into a function and apply it to every element of Course_links.
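A rough sketch of that last step might look like the following. The function name `pluck_abbrev` and the `lookup` table are hypothetical, and the code assumes every link has the form `.../programs/<ABBREVIATION>-<slug>`, which has only been checked against the two BCIT links quoted in this thread.

```r
# Two links in the format seen in this thread
links <- c(
  "/institutions/5B7C9834-D996-4A2F-A947-8F713AD8019D/programs/BCIT-agile-development",
  "/institutions/5B7C9834-D996-4A2F-A947-8F713AD8019D/programs/BCIT-building-construction-technology"
)

# Hypothetical helper: keep only the token between "/programs/" and the next hyphen.
# sub() is vectorised, so it handles the whole vector at once.
pluck_abbrev <- function(link) {
  slug <- sub(".*/programs/", "", link)  # e.g. "BCIT-agile-development"
  sub("-.*$", "", slug)                  # e.g. "BCIT"
}

abbrev <- pluck_abbrev(links)

# Hypothetical lookup table for expanding abbreviations later;
# only the one mapping assumed in this thread is filled in
lookup <- c(BCIT = "British Columbia Institute of Technology")
uni <- unname(lookup[abbrev])

data.frame(abbrev = abbrev, uni = uni)
```

Any abbreviation missing from `lookup` comes back as NA, which makes gaps in the mapping easy to spot.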

margusl · January 14, 2024, 9:31pm

When dealing with hierarchical data, a more robust approach is to identify a "row" container, collect all of those, and iterate over that list. This way, all elements in a list item are processed as one entity (i.e. an institution and its courses are never separated). Here, that container is an accordion item, so we first create a helper function that extracts data from a single accordion item. Then we select all accordion items from the page and use purrr::map() to apply the helper to each of them; the result is a nested list. From there, bind_rows() turns it into a nested tibble (this works because the list items are named in parse_accordion_item()), and unnest_wider() produces a flat dataset.

library(rvest)
library(dplyr)
library(tidyr)
library(purrr)

# each institution is in its own accordion element, 
# handle that as a single entity and return a nested list that holds 
# institution name and all courses, in JSON it would look like:
# {
#   "institution": "British Columbia Institute of Technology",
#   "courses": [
#     {
#       "course": "Agile Development",
#       "href": "/institutions/5B7C9834-D996-4A2F-A947-8F713AD8019D/programs/BCIT-agile-development"
#     },
#     {
#       "course": "Building Construction Technology",
#       "href": "/institutions/5B7C9834-D996-4A2F-A947-8F713AD8019D/programs/BCIT-building-construction-technology"
#     },
#     ...
#   ]
# } 
parse_accordion_item <- function(div){
  institution <- html_element(div, "h4") |> html_text2()
  courses <- html_elements(div, "li a") |>
    map(\(a) list(course = html_text(a),
                  href = html_attr(a, "href")))
  
  list(institution = institution, courses = courses)
}

link <- "https://www.educationplannerbc.ca/future-skills-grant"
page <- read_html(link) 

page |>
  # accordion list items
  html_elements("div.Generic_accordion-list--green__bLrNC") |>
  # parse, return list of institutions
  map(parse_accordion_item) |> 
  # build a nested 2 column tibble (institution, courses)
  bind_rows() |> 
  # unnest courses; column names defined in parse_accordion_item()
  unnest_wider(courses)
#> # A tibble: 104 × 3
#>    institution                              course                         href 
#>    <chr>                                    <chr>                          <chr>
#>  1 British Columbia Institute of Technology Agile Development              /ins…
#>  2 British Columbia Institute of Technology Building Construction Technol… /ins…
#>  3 British Columbia Institute of Technology Building Design and Architect… /ins…
#>  4 British Columbia Institute of Technology Computer Aided Design (CAD) T… /ins…
#>  5 British Columbia Institute of Technology Human Resource Management: As… /ins…
#>  6 British Columbia Institute of Technology Mechanical Systems             /ins…
#>  7 British Columbia Institute of Technology Project Management             /ins…
#>  8 Camosun College                          Project Management Certificat… /ins…
#>  9 Capilano University                      Bookkeeping Certificate        /ins…
#> 10 College of New Caledonia                 Bookkeeping Certificate        /ins…
#> # ℹ 94 more rows

Created on 2024-01-14 with reprex v2.0.2
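Since class names such as the accordion one above can change when the site is rebuilt, it may help to try the container idea on a small offline page first. The sketch below is a variation, not the code above: the HTML, the `div.item` selector, and the `parse_item()` helper are all invented for illustration, and the helper returns one tibble per container so that bind_rows() alone produces the flat table.

```r
library(rvest)
library(purrr)
library(dplyr)

# Made-up page with two "row" containers, each holding a heading and its links
toy <- minimal_html('
  <div class="item"><h4>Institute A</h4>
    <ul><li><a href="/a/1">Course 1</a></li>
        <li><a href="/a/2">Course 2</a></li></ul></div>
  <div class="item"><h4>Institute B</h4>
    <ul><li><a href="/b/1">Course 3</a></li></ul></div>
')

# Hypothetical helper: one container in, one tibble out.
# The length-1 institution is recycled to match the number of links.
parse_item <- function(div) {
  links <- html_elements(div, "li a")
  tibble(
    institution = html_text2(html_element(div, "h4")),
    course      = html_text2(links),
    href        = html_attr(links, "href")
  )
}

result <- toy |>
  html_elements("div.item") |>
  map(parse_item) |>
  bind_rows()
result
```

Because each container is parsed on its own, the institution and its courses stay paired no matter how many courses each institution has.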
