Hi @Krim, and welcome to RStudio Community!

The table on that webpage is loaded via JavaScript, which is why {rvest} alone is not well suited to scraping it. Notice that the table takes a few seconds to appear when you visit https://fundf10.eastmoney.com/jjjz_510300.html; rvest::read_html() cannot capture it because it only retrieves what is available the moment the page is served (i.e. the static HTML).
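You can check this with {rvest} alone; a quick sketch, assuming the table container has the id #jztable (the same selector we use later in the tutorial):
library(rvest)
# The static HTML ships without the table rows; they are injected later by JavaScript
read_html("https://fundf10.eastmoney.com/jjjz_510300.html") %>%
  html_element("#jztable") %>%
  html_text() # likely empty or a placeholder, or NA if the container itself is absent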
Instead, we are going to use the {RSelenium} package, which lets you drive a real browser right from your R code via a Selenium server. The only downside is that it takes a few steps to set up: not everything is ready out of the box after install.packages("RSelenium"). I'll do my best to walk you through all the steps in as much detail as possible. One caveat: I am a Windows user, so the details below reflect that.
Setup
- Install the latest version of Java: https://java.com/en/download/. Restart your computer once the installation finishes (see the quick check after this list).
- Install Firefox: https://www.mozilla.org/en-US/firefox/new/. In my personal experience, Firefox is the easiest browser to drive from Selenium.
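Once Java is installed, you can sanity-check that R can see it; this just shells out to the java binary, so if it errors, Java is probably not on your PATH:
# Print the installed Java version from R
system("java -version")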
Connect to the Selenium server from R
# Load packages ----
pacman::p_load(RSelenium, purrr, rvest, glue)
# Start a Selenium server and open a client ----
driver <- rsDriver(port = 4444L, browser = "firefox")
remote_driver <- driver$client # The client object we will use to control the browser
Hopefully, everything has worked for you so far.
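If rsDriver() complains that the port is already in use (e.g., from an earlier session), just pick another free port; the number itself doesn't matter. A hypothetical alternative:
# Any free port works; 4445 is just an example
driver <- rsDriver(port = 4445L, browser = "firefox")
remote_driver <- driver$client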
Mini tutorial
Now, let me walk you through a quick tutorial in which we scrape the second page of the table.
# Open browser ----
remote_driver$open() # This actually opens the Firefox browser
# Navigate to URL ----
url <- "https://fundf10.eastmoney.com/jjjz_510300.html"
remote_driver$navigate(url) # This opens the website in the browser we just launched
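# As mentioned above, the table takes a few seconds to render after the page
# loads, so give the JavaScript time to fill it in (3 seconds is a rough guess;
# adjust if the table loads more slowly for you):
Sys.sleep(3)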
# Navigate to page 2 of the table ----
# ** Find page 2 button
page2_btn <- remote_driver$findElement(using = "css", value = ".pagebtns > label[value='2']")
# ** Move pointer to button
remote_driver$mouseMoveToLocation(webElement = page2_btn)
# ** Click on page 2 button
page2_btn$click() # Notice how the browser goes to page 2 of the table
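# Clicking triggers a JavaScript reload of the rows; if you grab the table too
# quickly and get stale data, pause briefly (the full function below does this):
Sys.sleep(1)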
# Find table element in HTML page ----
table_el <- remote_driver$findElement(using = "css", value = "#jztable")
# Scrape table ----
table_page2 <- table_el$getElementAttribute("innerHTML") %>%
  .[[1]] %>%      # getElementAttribute() returns a list; pull out the HTML string
  read_html() %>% # Parse the table's inner HTML
  html_table() %>%
  .[[1]]          # html_table() returns a list of tibbles; keep the first one
table_page2
# A tibble: 20 x 7
`净值日期` `单位净值` `累计净值` `日增长率` `申购状态` `赎回状态` `分红送配`
<chr> <dbl> <dbl> <chr> <chr> <chr> <lgl>
1 2021-07-07 5.19 2.09 1.17% 场内买入 场内卖出 NA
2 2021-07-06 5.13 2.07 0.02% 场内买入 场内卖出 NA
3 2021-07-05 5.13 2.07 0.09% 场内买入 场内卖出 NA
4 2021-07-02 5.12 2.07 -2.82% 场内买入 场内卖出 NA
5 2021-07-01 5.27 2.13 0.09% 场内买入 场内卖出 NA
6 2021-06-30 5.26 2.12 0.69% 场内买入 场内卖出 NA
7 2021-06-29 5.23 2.11 -1.10% 场内买入 场内卖出 NA
8 2021-06-28 5.29 2.13 0.22% 场内买入 场内卖出 NA
9 2021-06-25 5.28 2.13 1.70% 场内买入 场内卖出 NA
10 2021-06-24 5.19 2.10 0.19% 场内买入 场内卖出 NA
11 2021-06-23 5.18 2.09 0.52% 场内买入 场内卖出 NA
12 2021-06-22 5.15 2.08 0.63% 场内买入 场内卖出 NA
13 2021-06-21 5.12 2.07 -0.25% 场内买入 场内卖出 NA
14 2021-06-18 5.13 2.07 0.05% 场内买入 场内卖出 NA
15 2021-06-17 5.13 2.07 0.47% 场内买入 场内卖出 NA
16 2021-06-16 5.10 2.06 -1.63% 场内买入 场内卖出 NA
17 2021-06-15 5.19 2.10 -1.12% 场内买入 场内卖出 NA
18 2021-06-11 5.25 2.12 -0.81% 场内买入 场内卖出 NA
19 2021-06-10 5.29 2.13 0.69% 场内买入 场内卖出 NA
20 2021-06-09 5.25 2.12 0.11% 场内买入 场内卖出 NA
(The column names are in Chinese: NAV date, unit NAV, cumulative NAV, daily growth rate, subscription status, redemption status, and dividends/distribution. The repeated cell values 场内买入 / 场内卖出 mean "buy on exchange" / "sell on exchange".)
Scrape the full table
The mini tutorial shows all the steps needed to scrape the data from a specific page of the table. Now we will wrap these steps in a function and automate the scraping of all the pages.
# Find total number of pages ----
div_page_btns <- remote_driver$findElements(using = "css", value = "div.pagebtns")
n_pages <- div_page_btns[[1]]$findChildElements(using = "css", value = "label[value]") %>%
  map_chr(~ unlist(.x$getElementText())) %>% # Extract the text of each page button
  as.numeric() %>%                           # Non-numeric labels (e.g. "next") become NA
  max(na.rm = TRUE)                          # The largest page number is the page count
# Create function (it uses all the steps in the mini tutorial) ----
scrape_table_page <- function(page){
  message(glue::glue("Scraping data on page {page}."))
  # Find and click the button for the requested page
  page_btn <- remote_driver$findElement(using = "css", value = glue::glue("div.pagebtns > label[value = '{page}']"))
  remote_driver$mouseMoveToLocation(webElement = page_btn)
  page_btn$click()
  Sys.sleep(1) # Give the browser a second to load the data on the new page
  # Grab the freshly loaded table and parse it into a tibble
  table_el <- remote_driver$findElement(using = "css", value = "#jztable")
  table_el$getElementAttribute("innerHTML") %>%
    .[[1]] %>%
    read_html() %>%
    html_table() %>%
    .[[1]]
}
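You can test the function on a single page first; for instance, this should reproduce the page-2 tibble from the mini tutorial:
scrape_table_page(2)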
Now we can apply the function to all pages:
mydata <- map_dfr(seq_len(n_pages), scrape_table_page) # Row-bind the results from every page into one tibble
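If a single page occasionally fails (say, a click fires before the button is ready), the whole loop errors out. A more forgiving sketch uses purrr::possibly(), which drops any page that errors instead of stopping:
mydata <- map_dfr(seq_len(n_pages), possibly(scrape_table_page, otherwise = NULL))
Finally, when you are done, it's good practice to close the browser and stop the Selenium server:
# Clean up: close the browser session and shut down the server
remote_driver$close()
driver$server$stop()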
Let me know if you have questions.