I want to scrape this page and the other pages linked from the search results box.
I am having problems because some nodes share the same class name but hold different items.
library(rvest)
library(xml2)
library(dplyr)
library(tibble)
library(lubridate)
library(tm)
library(httr) # provides GET() and add_headers()
url <- "https://cgspace.cgiar.org/discover?rpp=10&etal=0&query=cassava&scope=10568/35697&group_by=none&page=1"
response <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))
text_html <- response %>% read_html()
text_html
Title <- text_html %>%
  html_nodes(".description-info") %>%
  html_text(trim = TRUE)
Title
# Titles and authors share the same class name. That is why Title has 20 entries: the other 10 come from the author field
Autor <- text_html %>%
  html_nodes(".description-info") %>%
  html_text(trim = TRUE)
date <- text_html %>%
  html_nodes(".date") %>%
  html_text(trim = TRUE)
# Returns many values because this node appears in several different areas of the page.
Type <- text_html %>%
  html_nodes(".artifact-type") %>%
  html_text(trim = TRUE)
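One common way around shared class names is to select each search result first and then look up the fields inside that result only. The sketch below uses made-up markup (the `.artifact`, `.title` and `.author` classes are assumptions, not the real page's structure), so the selectors would need adjusting after inspecting the actual HTML.

```r
library(rvest)

# Hypothetical markup: the real page's structure and class names may differ.
html <- read_html('
  <div class="artifact">
    <div class="title">Cassava breeding</div>
    <div class="author">Smith, J.</div>
    <span class="date">2019</span>
  </div>
  <div class="artifact">
    <div class="title">Cassava pests</div>
    <div class="author">Lopez, M.</div>
  </div>')

# One node per search result
items <- html_nodes(html, ".artifact")

# html_node() (singular) returns one match per item, or a missing node
# when the item has no such child, so every column has the same length
results <- data.frame(
  Title = html_text(html_node(items, ".title"), trim = TRUE),
  Autor = html_text(html_node(items, ".author"), trim = TRUE),
  date  = html_text(html_node(items, ".date"), trim = TRUE)
)
```

Because the lookup is scoped to each item, a result without a date gets `NA` instead of shifting the remaining values up.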
To select the last page (324):
p_ultima <- '//*[@id="aspect_discovery_SimpleSearch_div_search"]/div[4]/div/ul/li[7]/a'
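A positional XPath such as `li[7]` tends to break when the pagination layout changes. A possible alternative is to read the `page=` value from every pagination link and keep the largest one. This is only a sketch: the `.pagination` class and the link format below are assumptions about the markup.

```r
library(rvest)

# Hypothetical pagination markup; the real page's classes may differ.
html <- read_html('
  <ul class="pagination">
    <li><a href="discover?query=cassava&amp;page=1">1</a></li>
    <li><a href="discover?query=cassava&amp;page=2">2</a></li>
    <li><a href="discover?query=cassava&amp;page=324">324</a></li>
  </ul>')

hrefs <- html %>%
  html_nodes(".pagination a") %>%
  html_attr("href")

# Pull the number after "page=" from each link and keep the largest
page_numbers <- as.integer(sub(".*[?&]page=([0-9]+).*", "\\1", hrefs))
last_page <- max(page_numbers, na.rm = TRUE)
```

`last_page` could then replace the hard-coded 324 when looping over pages.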
Any help or suggestions on how to do this would be appreciated.
The idea is to end up with a data frame containing these variables.
This is an amazing answer. I would like to reach this level of web scraping.
When I change the number of pages:
final_output = lapply(1:324, scrape_page) %>% # scrape all pages
  bind_rows()
it shows this error:
Error in `stop_vctrs()`:
! Can't combine `..1$Title` <character> and `..24$Title` <list>.
Run `rlang::last_error()` to see where the error occurred.
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
I tried with 20 pages and got no errors, but with 50 pages the same error appears.
I checked these pages manually but could not find any differences in the items.
# some problem pages
# 24 - 99 - 185 - 214 - 280 - 297 # very strange
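To locate problem pages without checking them by hand, one option is to keep the per-page results in a list before calling `bind_rows()` and flag any page that contains a list column. The sketch below uses toy tibbles in place of the real `scrape_page()` output; the `pages` object and its contents are made up for illustration.

```r
library(tibble)

# Toy stand-ins for the per-page results; page 2 mimics a problem page
# where Title came back as a list column.
pages <- list(
  tibble(search_page = 1, Title = c("A", "B")),
  tibble(search_page = 2, Title = list("C", c("D", "E"))),
  tibble(search_page = 3, Title = "F")
)

# TRUE for every page that has at least one list column
has_list_col <- vapply(
  pages,
  function(p) any(vapply(p, is.list, logical(1))),
  logical(1)
)

problem_pages <- which(has_list_col)
```

With real data, `pages` would be the result of `lapply(1:324, scrape_page)`, and `problem_pages` would give the indices to inspect.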
The assumption that every entry contains 5 rows of data held for the first two pages, but not for all of the others. I updated the out section of the function with the code below and encountered no errors when running through the first 50 pages. I also ran through each of the problem pages you provided (thank you!) and encountered no errors. Those pages errored because one entry on each page was missing a Date, Type, or Status.
out = tibble(label = df$content[df$row == 1],
             value = df$content[df$row == 0]) %>%
  # strip the colon from each label (str_replace() comes from stringr)
  mutate(label = str_replace(label, ':', '')) %>%
  # group labels together for an "entry": each 'Title' starts a new one
  mutate(entry = ifelse(label == 'Title', 1, 0),
         entry = cumsum(entry)) %>%
  # one row per entry (pivot_wider() comes from tidyr)
  pivot_wider(names_from = label, values_from = value) %>%
  mutate(search_page = i) %>%
  select(search_page, everything())
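The grouping step can be seen in isolation with a small made-up example. Here the second entry has no Type row; because a new entry starts at each `Title` label, `pivot_wider()` fills the gap with `NA` instead of producing list columns. The `long` and `wide` names and the values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy label/value pairs; the second entry has no Type row.
long <- tibble(
  label = c("Title", "Date", "Type", "Title", "Date"),
  value = c("Paper A", "2019", "Report", "Paper B", "2020")
)

wide <- long %>%
  # the running count of 'Title' labels gives each entry its own id
  mutate(entry = cumsum(label == "Title")) %>%
  # one row per entry, one column per label
  pivot_wider(names_from = label, values_from = value)
```

`wide` has one row per entry, with `NA` in the `Type` column for the entry that lacked it.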