Loop or function to obtain links with web scraping

Hi community,

I want to obtain the links for each of these 3420 items. I don't have much experience with loops or functions for this. I can write a script that downloads the 10 items on the first page, but doing this one page at a time is very time consuming.

The idea is to obtain each link. All links have this form: handle/10568/43833, with only the final number changing for each item. In link2, using paste0, I build the full link for each item.


library(rvest)

website <- 'https://cgspace.cgiar.org/discover?scope=10568%2F35697&query=cassava&submit='
link <- vector()
# loop through the 10 result nodes on the first page
for (i in 1:10){
  link[i] <- website %>% 
    read_html() %>%
    html_nodes(xpath = paste0('//*[@id="aspect_discovery_SimpleSearch_div_search-results"]/div[', i, ']/div[2]/div/div[1]/a')) %>%
    html_attr('href')
}
pag <- data.frame(link)
pag$link2 <- paste0('https://cgspace.cgiar.org', pag$link)

# link                                        link2
# 1  /handle/10568/71370 https://cgspace.cgiar.org/handle/10568/71370
# 2  /handle/10568/43831 https://cgspace.cgiar.org/handle/10568/43831
# 3  /handle/10568/56285 https://cgspace.cgiar.org/handle/10568/56285
# 4  /handle/10568/56204 https://cgspace.cgiar.org/handle/10568/56204
# 5  /handle/10568/43833 https://cgspace.cgiar.org/handle/10568/43833
# 6  /handle/10568/54391 https://cgspace.cgiar.org/handle/10568/54391
# 7  /handle/10568/98291 https://cgspace.cgiar.org/handle/10568/98291
# 8  /handle/10568/69696 https://cgspace.cgiar.org/handle/10568/69696
# 9  /handle/10568/89962 https://cgspace.cgiar.org/handle/10568/89962
# 10 /handle/10568/71814 https://cgspace.cgiar.org/handle/10568/71814

Thanks

It looks like you are trying to extract a list of links like those in pag$link2, but all of them, rather than the first 10. Is the problem that the website only displays 10 results at a time and expects a user to click next or similar to go on to the next 10?

Yes, I want to obtain these links from the other pages too, but there are 342 pages, so doing this one by one is very time consuming.

I'm trying to find a way to get all these links.

For example, click through to page 2 and obtain its links, and the same for the other 342 pages.

My experience with functions and loops is limited.

It may be possible to do this depending on how the source website paginates. Do the pages have separate URLs numbered sequentially? If so, it's just an outer loop over the pages and an inner loop over the links on each page.
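
Something like this skeleton is what I mean (untested; I'm guessing the page number shows up as a page= query parameter in the URL, and I'm reusing your xpath):

library(rvest)

# untested skeleton: outer loop over page numbers, inner extraction of the links on each page
all_links <- character()
for (page in 1:342) {
  # guessed URL pattern: page number passed as a 'page=' query parameter
  page_url <- paste0('https://cgspace.cgiar.org/discover?query=cassava&scope=10568/35697&page=', page)
  hrefs <- page_url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="aspect_discovery_SimpleSearch_div_search-results"]/div/div[2]/div/div[1]/a') %>%
    html_attr('href')
  all_links <- c(all_links, hrefs)
}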


If the "website" is the same as from your previous question , there is no need to scrap and iterate, dataverse has an API, for example:

library(httr)

url <- "https://dataverse.harvard.edu/api/search?q=cassava&fq=authorAffiliation_ss%3A%22International+Center+for+Tropical+Agriculture+-+CIAT%22&type=dataset&type=file&sort=score&order=desc&per_page=100"

result <- jsonlite::fromJSON(content(GET(url = url), "text", encoding = "utf8"))$data$items

Here is the API guide for making searches

https://guides.dataverse.org/en/5.12/api/search.html
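
If your query returns more than one page of results, I think the same search API also accepts a start offset along with per_page, so you could page through the results with something like this (a rough sketch, not tested against your exact query; check the guide above for the exact parameter names and limits):

library(httr)

base_url <- "https://dataverse.harvard.edu/api/search?q=cassava&fq=authorAffiliation_ss%3A%22International+Center+for+Tropical+Agriculture+-+CIAT%22&type=dataset&type=file&sort=score&order=desc"

per_page <- 100
start <- 0
pages <- list()
repeat {
  # page through the results using the 'start' offset (assumed parameter, see the guide above)
  page_url <- paste0(base_url, "&per_page=", per_page, "&start=", start)
  res <- jsonlite::fromJSON(content(GET(url = page_url), "text", encoding = "utf8"))$data
  pages[[length(pages) + 1]] <- res$items
  start <- start + per_page
  if (start >= res$total_count) break
}
all_items <- dplyr::bind_rows(pages)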


I checked the pages and only the final number is different. I was thinking about something like what you said, but I don't have much experience with this type of loop.

I've added the website now. I forgot to put it in the initial question :man_facepalming:t3:

The idea was something like this:

all_pags <- data.frame()
for (i in 1:342){
  # build the URL for page i
  website <- paste0('https://cgspace.cgiar.org/discover?rpp=10&etal=0&query=cassava&scope=10568/35697&group_by=none&page=', i)
  link <- vector()
  # loop through the 10 result nodes on this page (use j so the outer index i is not shadowed)
  for (j in 1:10){
    link[j] <- website %>% 
      read_html() %>%
      html_nodes(xpath = paste0('//*[@id="aspect_discovery_SimpleSearch_div_search-results"]/div[', j, ']/div[2]/div/div[1]/a')) %>%
      html_attr('href')
  }
  pag <- data.frame(link)
  pag$link2 <- paste0('https://cgspace.cgiar.org', pag$link)
  all_pags <- rbind(all_pags, pag)
}
all_pags

You could also do this in the following way (more efficient, I think, because the website is accessed less often).
The function handles one page, and the loop over the 10 nodes is avoided by using xml_find_all.
I use the xml2 package because I am not familiar with rvest, but I think they are more or less the same (??)

library(xml2)
library(magrittr)
library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:magrittr':
#> 
#>     set_names

get_page_links <- function(page) {
  page_website <-
    paste0(
      'https://cgspace.cgiar.org/discover?rpp=10&etal=0&query=cassava&scope=10568/35697&group_by=none&page=',
      page
    )
  links <- page_website  %>%
    xml2::read_html() %>%
    xml2::xml_find_all(xpath = '//*[@id="aspect_discovery_SimpleSearch_div_search-results"]/div/div[2]/div/div[1]/a')  %>% 
    xml2::xml_attr('href')
  data.frame(page = page,
             i = seq(1, length(links)),
             link = links)
}

# all_links <- purrr::map_dfr(1:342, get_page_links)
all <- purrr::map_dfr(1:2, get_page_links)
all$link2 <- paste0('https://cgspace.cgiar.org', all$link)
head(all)
#>   page i                link                                        link2
#> 1    1 1 /handle/10568/71370 https://cgspace.cgiar.org/handle/10568/71370
#> 2    1 2 /handle/10568/43831 https://cgspace.cgiar.org/handle/10568/43831
#> 3    1 3 /handle/10568/56285 https://cgspace.cgiar.org/handle/10568/56285
#> 4    1 4 /handle/10568/56204 https://cgspace.cgiar.org/handle/10568/56204
#> 5    1 5 /handle/10568/43833 https://cgspace.cgiar.org/handle/10568/43833
#> 6    1 6 /handle/10568/54391 https://cgspace.cgiar.org/handle/10568/54391
tail(all)
#>    page  i                link                                        link2
#> 15    2  5 /handle/10568/55409 https://cgspace.cgiar.org/handle/10568/55409
#> 16    2  6 /handle/10568/55229 https://cgspace.cgiar.org/handle/10568/55229
#> 17    2  7 /handle/10568/89934 https://cgspace.cgiar.org/handle/10568/89934
#> 18    2  8 /handle/10568/89960 https://cgspace.cgiar.org/handle/10568/89960
#> 19    2  9 /handle/10568/90629 https://cgspace.cgiar.org/handle/10568/90629
#> 20    2 10 /handle/10568/57981 https://cgspace.cgiar.org/handle/10568/57981
Created on 2022-10-26 with reprex v2.0.2

This script is more efficient, like you said. Now I'm trying to obtain the name (Title) of each item, to check that each link corresponds.

So, in my code I added this, but I get an error:

Title[i] <- website %>% 
  read_html() %>% 
  html_nodes(xpath = paste0('//*[@id="resultsTable"]/tbody/tr[', i, ']/td/div/div[1]/a/span')) %>% 
  html_text(trim = T)

# Error in Title[i] <- website %>% read_html() %>% html_nodes(xpath = paste0("//*[@id=\"resultsTable\"]/tbody/tr[",  : 
#   replacement has length zero

# For example, to get something like this:

#   Title                                         link2
# 1 Industrializacion de la yuca                  https://cgspace.cgiar.org/handle/10568/71370
# 2 Development and use of biotechnology .......  https://cgspace.cgiar.org/handle/10568/55409

How can I add this to your code?
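
I was wondering whether something like this could work instead (untested), taking the title text from the same anchor nodes that already give the href in your function:

get_page_links <- function(page) {
  page_website <- paste0(
    'https://cgspace.cgiar.org/discover?rpp=10&etal=0&query=cassava&scope=10568/35697&group_by=none&page=',
    page
  )
  # find the result anchors once, then take both the title text and the href
  # (assuming the text of each <a> is the item title)
  nodes <- page_website %>%
    xml2::read_html() %>%
    xml2::xml_find_all(xpath = '//*[@id="aspect_discovery_SimpleSearch_div_search-results"]/div/div[2]/div/div[1]/a')
  data.frame(
    page  = page,
    Title = xml2::xml_text(nodes, trim = TRUE),
    link  = xml2::xml_attr(nodes, 'href')
  )
}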

Please be aware of our cross-posting policy


Hi @andresrcs, to find help more easily I use these two forums, because I see that, for example, many people don't know about https://forum.posit.co. When the answer appears on the other site, I copy and share it with both communities. The idea is to share the knowledge and mark the correct answer, so that when someone has a similar problem they can find a quick solution instead of racking their brains trying to solve it.
I am a user who has learned almost everything in a self-taught way, and I have found a great deal of help in the forums. I am very impressed with the knowledge that many people have about R.

I have found other help this way:
