Help with rvest and scraping

jroyyy · February 13, 2020, 8:38am

Hi everyone!

I'm currently writing some R code to extract information about the funding that various projects on a website have received.
I am using the rvest package.

Here is a sample of how the HTML-code on the website looks:

<title>Project 2030 is launched</title>
<div data-name="category">Domestic news</div> <!--/category--> 
<div data-name="funding">25000000</div><!--/funding-->

In R, I've successfully acquired the title with:

> library(rvest)
> a_webpage <- read_html("https://www.example.com")
> a_webpage %>%
+   html_node("title") %>%
+   html_text()
[1] "Project 2030 is launched"

My question is: how can I do the same for the "funding" part? More specifically, how can I extract the number 25000000? Using html_node("div#funding") or other variations does not seem to work.

Thanks!

jroyyy · February 13, 2020, 10:32am

By the way, here is a link to the website:
https://www.tuborgfondet.dk/projekt/mind-your-own-business-groenland
The title is on line 62 and the funding amount is on line 199 of the page source: view-source:https://www.tuborgfondet.dk/projekt/mind-your-own-business-groenland

nirgrahamuk · February 13, 2020, 12:27pm

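library(rvest)

# Sample markup from the question, plus a second funding div to show
# that html_nodes() returns every match.
# Named html_snippet so it does not share a name with rvest's html_text().
html_snippet <- '<title>Project 2030 is launched</title>
   <div data-name="category">Domestic news</div> <!--/category-->
   <div data-name="funding">25000000</div><!--/funding-->
   <div data-name="funding">999</div><!--/funding-->'

b_webpage <- read_html(html_snippet)

# Title
b_webpage %>%
  html_node("title") %>%
  html_text()

# Funding: select div elements by their data-name attribute
b_webpage %>%
  html_nodes("div[data-name='funding']") %>%
  html_text()

# The same selectors applied to the live page
a_webpage <- read_html("https://www.tuborgfondet.dk/projekt/mind-your-own-business-groenland")
a_webpage %>%
  html_node("title") %>%
  html_text()

a_webpage %>%
  html_nodes("div[data-name='funding']") %>%
  html_text()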
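
A note on why the original div#funding attempt likely returned nothing: in CSS selectors, # matches the id attribute, while the sample markup stores the label in data-name. The sketch below is only an illustration under that assumption. The snippet and the variable names (snippet, page, by_id, funding_text, funding) are made up for the example and are not taken from the real site.

```r
library(rvest)

# Hypothetical snippet mirroring the markup shown in the question
snippet <- '<html><head><title>Project 2030 is launched</title></head><body>
<div data-name="category">Domestic news</div>
<div data-name="funding">25000000</div>
</body></html>'
page <- read_html(snippet)

# "div#funding" looks for <div id="funding">, which this markup does not contain
by_id <- page %>% html_nodes("div#funding")
length(by_id)

# An attribute selector matches on data-name instead
funding_text <- page %>%
  html_node("div[data-name='funding']") %>%
  html_text()

# html_text() returns a character string; convert it before doing arithmetic
funding <- as.numeric(funding_text)
funding
```

html_node() returns only the first match, so it fits when one funding value per page is expected. html_nodes() returns all matches.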

mattwloftis · February 13, 2020, 12:27pm

I'd recommend using XPath to identify the specific nodes you want. Discovering node identifiers using SelectorGadget or writing your own CSS selectors often works great, but it can fail you when the elements aren't identified super carefully. Here's an example to grab the funding field:

a_webpage %>%
  html_nodes(xpath = "//div[@data-name='funding']") %>%
  html_text()
[1] "11209000"
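
For a simple attribute match like this one, the XPath expression and the CSS attribute selector should pick out the same nodes. A minimal sketch of that comparison follows, using a made-up fragment rather than the live page. The names page, css_hits, xpath_hits and amounts are hypothetical, and the extra spaces around the first value are added only to show trim = TRUE.

```r
library(rvest)

# Hypothetical fragment with two funding divs; the first has stray whitespace
page <- read_html('<div data-name="category">Domestic news</div>
<div data-name="funding"> 25000000 </div>
<div data-name="funding">999</div>')

# CSS attribute selector
css_hits <- page %>%
  html_nodes("div[data-name='funding']") %>%
  html_text(trim = TRUE)

# Equivalent XPath expression
xpath_hits <- page %>%
  html_nodes(xpath = "//div[@data-name='funding']") %>%
  html_text(trim = TRUE)

# Both approaches should return the same character vector
identical(css_hits, xpath_hits)

# Convert the scraped strings to numbers, one per matched node
amounts <- as.numeric(xpath_hits)
amounts
```

XPath becomes the more useful of the two when the selection depends on things CSS cannot express, such as matching on an element's text content.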

jroyyy · February 13, 2020, 12:34pm

Thanks a lot! This truly helped.
