I'd like to scrape some data from LinkedIn Learning but have come up against a stumbling block in rvest: how can I filter by two HTML classes at once?
Let's experiment with this page: https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562. The number of viewers is displayed in the following span tag (if you're not logged into LinkedIn Learning):
<span class="content__info__item__value viewers">82,552</span>
Extracting all html_nodes with the class content__info__item__value yields an xml_nodeset with 4 different spans:
library("tidyverse")
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4
#> ✔ tibble  1.3.4          ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2          ✔ stringr 1.2.0
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library("rvest")
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
in_learning_url <- "https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562"
in_learning_page <- read_html(in_learning_url)
in_learning_page %>%
  html_nodes(".content__info__item__value")
#> {xml_nodeset (4)}
#> [1] <span class="content__info__item__value duration">5h 59m 42s</span>
#> [2] <span class="content__info__item__value skill">Beginner + Intermedia ...
#> [3] <span class="content__info__item__value released">September 26, 2013 ...
#> [4] <span class="content__info__item__value viewers">82,552</span>
How can I now filter this xml_nodeset? Or specify multiple classes in html_nodes?
You can do this:
in_learning_page %>%
  html_nodes(".content__info__item__value") %>%
  str_subset(., "viewers")
This assumes that the span you want will always carry the viewers class as well. You could also do this:
in_learning_page %>%
  html_nodes(".content__info__item__value") %>%
  .[[4]]
This assumes that it will always be the 4th element in the returned nodeset.
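If you want to filter a nodeset you already have and still end up with nodes rather than strings, one option is to compare class names exactly. This is only a minimal sketch: the inline HTML is a stand-in for the real page, and has_class is a hypothetical helper, not an rvest function.

```r
library(rvest)

# Hypothetical stand-in for the nodeset returned above (not the live page)
nodes <- read_html('
  <span class="content__info__item__value duration">5h 59m 42s</span>
  <span class="content__info__item__value viewers">82,552</span>') %>%
  html_nodes(".content__info__item__value")

# Hypothetical helper: split each class attribute on whitespace and
# test for an exact class name, so "viewers" never matches the text content
has_class <- function(nodes, class_name) {
  classes <- strsplit(html_attr(nodes, "class"), "\\s+")
  vapply(classes, function(x) class_name %in% x, logical(1))
}

# Logical subsetting keeps the result an xml_nodeset
viewers <- nodes[has_class(nodes, "viewers")]
html_text(viewers)
#> [1] "82,552"
```

Because the result is still an xml_nodeset, it can be passed on to html_text or html_attr as usual.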
Thanks @tbradley. Ideally I'd like something that I can then pass to html_text, so as to extract the contents of the span tag nicely without relying on regex, a la https://stackoverflow.com/a/1732454/1659890
danr
The argument is a standard CSS selector, so you can match either class or require both:
# has either class
in_learning_page %>%
  html_nodes(".content__info__item__value, .skill")
#> {xml_nodeset (4)}
#> [1] <span class="content__info__item__value duration">5h 59m 42s</span>
#> [2] <span class="content__info__item__value skill">Beginner + Intermediate</span>
#> [3] <span class="content__info__item__value released">September 26, 2013</span>
#> [4] <span class="content__info__item__value viewers">82,552</span>
# has both classes
in_learning_page %>%
  html_nodes(".content__info__item__value.skill")
#> {xml_nodeset (1)}
#> [1] <span class="content__info__item__value skill">Beginner + Intermediate</span>
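To try the two selector forms without hitting the live page, here is a small sketch on an inline snippet. The HTML is an assumption rebuilt from the output above, with one extra span that has only the skill class so the difference between "either" and "both" is visible. The names snippet, either and both are just for illustration.

```r
library(rvest)

# Hypothetical stand-in for the course page, rebuilt from the spans shown above
snippet <- read_html('
  <div>
    <span class="content__info__item__value duration">5h 59m 42s</span>
    <span class="content__info__item__value skill">Beginner + Intermediate</span>
    <span class="content__info__item__value released">September 26, 2013</span>
    <span class="content__info__item__value viewers">82,552</span>
    <span class="skill">extra span with only the skill class</span>
  </div>')

# Comma groups selectors: nodes matching either one
either <- html_nodes(snippet, ".content__info__item__value, .skill")

# Classes chained with no space: nodes carrying both classes
both <- html_nodes(snippet, ".content__info__item__value.skill")

length(either)
#> [1] 5
length(both)
#> [1] 1
html_text(both)
#> [1] "Beginner + Intermediate"
```

Note that a bare word such as skill (no leading dot) is a tag-name selector, so it would look for skill elements rather than the skill class.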
Ah @danr thanks! I tried every combination except for .class1.class2. Now that I think about it, that should've been obvious, and I have what I wanted:
library("tidyverse")
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4
#> ✔ tibble  1.3.4          ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2          ✔ stringr 1.2.0
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library("rvest")
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
in_learning_url <- "https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562"
in_learning_page <- read_html(in_learning_url)
in_learning_page %>%
  html_nodes(".content__info__item__value.viewers") %>%
  html_text() %>%
  parse_number()
#> [1] 82552