Download multiple urls with increment ending using rvest

ILearnR · February 19, 2020, 11:29am

Dear all,

Thank you for looking into my post. The United Nations Security Council produces resolution after resolution. I'd like to perform some basic text-mining tasks on the resolutions adopted between 2000 and 2018, which means I'd like to download the documents with signatures S/RES/1285 to S/RES/2446.

My challenge:
I would like to

Download a series of plain text files with increment numbers
Store them in a list or a df where three hrefs are in columns and each resolution is one row

My Data
The series of text files is stored here: [https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/]

The same content is also stored, less conveniently, as HTML web content here

Each document has a sequential number associated which is at the end of the URL.

E.g., Document S/RES/1285

URL <- "https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/1285"
URL # plain text of the resolution

I'd like to download the files 1285:2446 from this list (basically the equivalent of this curl command):

curl "https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/[1285-2446]" -o "#1.txt"
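
For comparison, the same sequence of URLs can be built in R with a single vectorised call. This is only a sketch of the URL construction (nothing is downloaded); the names base_url, res_no and urls are made up for illustration.

```r
# Base URL from the post; the resolution number is appended at the end.
base_url <- "https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/"
res_no   <- 1285:2446

# paste0() is vectorised, so this yields one URL per resolution number.
urls <- paste0(base_url, res_no)

length(urls)    # 1162, one URL per number in 1285:2446
head(urls, 2)   # first two URLs, ending in 1285 and 1286
```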

My failed attempts to solve challenge 1: Download and store sequential urls

In the forum, I found this approach and adapted it to my case

Packages used
library(tidyverse)
library(rvest)

Attempt a


mandate <- list()
inshows = c(1285:2446) # sequence of resolutions
for(u in inshows) {
  url <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', u)
  mandate[[u]] <- read_html(url)
}

I also tried a modification of this post


Attempt b

mandates <- lapply(paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', 1285:2446),
               function(url){
                 url %>% read_html() %>%
                   })

Rather unsurprisingly, both create empty lists of 1651. Where would the text come from in this code?

If anyone knows how to iterate over the sequence and download the list, I would greatly appreciate help!

The resolutions are not stored as txt files, nor are they still in html (and hence with href). Therefore I am wondering if I am altogether on the wrong track here.

Challenge 2: Store the downloaded documents tidy

In order to work with the text, I need them either in a list or in a dataframe.

Ideally, the corpus would be structured along the hrefs (included as patterns in the plain text):

"doc_sym", # contains the document signature, e.g. S/RES/1285(2000)
"raw_txt", # contains the text body of the document
"title", # contains the title, e.g. "Security Council resolution 1285 (2000) [on monitoring of the demilitarization of the Prevlaka peninsula by the UN military observers]"
Year, in parentheses in doc_sym (no href)

I have absolutely no idea how to perform step two and am open to any and all suggestions.

Thank you very much!
Best,

Martin

nirgrahamuk · February 19, 2020, 11:44am

The problem here is that there are two ways of accessing lists, and the code is inconsistent: it starts with one approach and then treats the list as the other.

If you wish to grow the list by position, so the first entry is in mandate[[1]] the second in mandate[[2]] then this construction will work:

mandate <- list()
inshows = c(1285:2446) # sequence of resolutions
for(u in seq_along(inshows)) {
  # u is the position (1, 2, ...); inshows[u] is the resolution number
  url <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', inshows[u])
  mandate[[u]] <- read_html(url)
}

Trick: I tested that mandate fills without NULLs by omitting the read_html part and assigning url directly to the mandate list.
If however you prefer to access the list by 'name', then we turn the actual resolution numbers into characters (i.e. names) and set it up this way:

mandate <- list()
inshows = c(1285:2446) # sequence of resolutions
for(u in inshows) {
  url <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', u)
  mandate[[as.character(u)]] <- read_html(url)
}

this way you can access the first element in one of two ways:
mandate$1285 and mandate[[1]]
because you now have the name option
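
The difference between the two access styles can be tried offline. The following is a minimal sketch that stores placeholder strings instead of downloaded pages (the strings and the sparse list are made up for illustration); it also shows why indexing an empty list directly by resolution number leaves NULL gaps.

```r
# Placeholder strings instead of downloaded pages, so this runs offline.
inshows <- 1285:1290
mandate <- list()
for (u in inshows) {
  mandate[[as.character(u)]] <- paste("text of resolution", u)
}

names(mandate)       # "1285" "1286" "1287" "1288" "1289" "1290"
mandate[[1]]         # access by position
mandate[["1285"]]    # access by name
mandate$`1285`       # access by name with $; the name is not a syntactic
                     # R name, so it has to be quoted with backticks

# Indexing an empty list by the number itself pads it with NULLs.
sparse <- list()
sparse[[1285]] <- "text of resolution 1285"
length(sparse)       # 1285: positions 1 to 1284 are NULL
```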

ILearnR · February 19, 2020, 2:34pm

Dear nirgrahamuk,

Thank you very much for your help! Indeed, not ideal. As my name indicates, I learn R, haha.

Thank you for the trick, that's an excellent suggestion.

If I understand correctly, you suggest the combination which would look like this:

library(rvest)
library(tidyverse) 


mandate <- list()
inshows = c(1285:1290) # resolutions 
for(i in seq_along(inshows)) { # added your seq_along
  u <- inshows[i] # look up the resolution number at position i
  url <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', u)
  mandate[[as.character(u)]] <- read_html(url) # added your as.character
}

For example purposes I reduced the number to 6 mandates (1285:1290) in the example above.

What works great is downloading a series of mandates. Which answers my first question. Yay!
Also your trick works well (though I need ticks).

mandate$'1285'

What's less great: I don't have access to my text in any meaningful way.
The list has node and doc, which both contain (useless?) metadata.

mandate[[1]]$node

# results in
#<pointer: 0x000001e8b78df6a0>


mandate[[1]]$doc
# results in 
# <pointer: 0x000001e8b6518670>

mandate[[1]]  #

Shows the plain text without the ability to work with it (let alone search or structure it as suggested by my second question).

unlist(mandate)
# throws this (which is not very useful)
# $`1285.node`
# <pointer: 0x000001e8b78df6a0>

Other options I tried


unlist(mandate)  # doesn't unpack the text

library(qdap)
list2df(mandate, col1 = "Node", col2 = "Text") # doesn't work

library(tm)
new_corpus <- as.VCorpus(mandate) # doesn't work 


list_df <- purrr::map(.x = mandate, .f = tibble::as_data_frame) # doesn't work  (credit to [Indrajeet Patil](https://stackoverflow.com/questions/48368983/r-turning-list-to-data-frame))

I suppose my second question is asking for a lot. Accessing the text in a useful way would be necessary, though.

With only the signature (e.g., 1285) it's difficult to work.

Thanks for your help !

Martin

nirgrahamuk · February 19, 2020, 3:19pm

This approach takes all your urls and makes a nice dataframe out of them.
I removed library(rvest) as it's not needed.
The actual text at the url locations is in a JSON-style format, so I scrapped the read_html.

# library(rvest)
library(tidyverse) 
library(jsonlite)
library(purrr)

url_to_df <- function(u){
  urlvalue <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', u)
  j <- fromJSON(url(urlvalue)) # parse the JSON response into an R list
  tibble(resolution_no = u, title=j$title,
                                       doc_sym=j$doc_sym,
                                       raw_txt=j$raw_txt)
}

inshows = c(1285:1290) # resolutions 
## take the list of inshow u numbers,
# pass each into the function to get results as df/tibble
# and return the result gathered up in a rowwise master dataframe called mandate 
mandate <- map_dfr(inshows,
        ~url_to_df(.))

# print text of 1288
print(mandate %>% 
        filter(resolution_no==1288) %>% 
        pull(raw_txt))
# or, given the cell position in our final df, could do
print(mandate[[4,4]])  # 4th row down (resolution 1288), 4th column across (raw_txt)
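
The year asked about in the question sits in parentheses inside doc_sym and is not a separate field, so one option is to parse it out afterwards. This is a minimal base-R sketch, assuming doc_sym looks like S/RES/1285(2000) as described in the question; extract_year and the sample data frame are hypothetical stand-ins, not part of the code above.

```r
# Hypothetical helper: pull a four-digit year out of "(....)" in doc_sym.
# Entries without such a pattern become NA instead of raising an error.
extract_year <- function(doc_sym) {
  hit <- grepl("\\([0-9]{4}\\)", doc_sym)
  out <- rep(NA_integer_, length(doc_sym))
  out[hit] <- as.integer(sub(".*\\(([0-9]{4})\\).*", "\\1", doc_sym[hit]))
  out
}

# Small stand-in for the downloaded data frame (values are illustrative).
mandate <- data.frame(
  resolution_no = c(1285L, 1288L),
  doc_sym       = c("S/RES/1285(2000)", "S/RES/1288 (2000)"),
  stringsAsFactors = FALSE
)

mandate$year <- extract_year(mandate$doc_sym)
mandate
```

The same helper can be applied to the real tibble with mandate$year <- extract_year(mandate$doc_sym), provided the actual doc_sym values follow that pattern.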

ILearnR · February 20, 2020, 5:37am

Dear nirgrahamuk,

Brilliant!
Thank you for turning my half-cooked question into neat code!

Have a great day

Martin

ILearnR · February 20, 2020, 6:28am

Dear nirgrahamuk,
Sorry, I have one more question

The code produces exactly the df I am after for some sequences and throws errors on others. I don't understand where the error comes from, let alone how to fix it.

Working Sequence


library(tidyverse) 
library(jsonlite)
library(purrr)

url_to_df <- function(u){
  urlvalue <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', u)
  j <- fromJSON(url(urlvalue)) 
  tibble(resolution_no = u,  title=j$title,
         doc_sym=j$doc_sym,
         raw_txt=j$raw_txt)
}

inshows = c(1285:1300)  # works perfectly, so does 1400:1600
mandate <- map_dfr(inshows, ~url_to_df(.))

On different sequences, it throws the error "$ operator is invalid for atomic vectors" for title, doc_sym and raw_txt. Why does the data structure change?
Any idea where this sudden change of 'mind' comes from?

Not Working


library(tidyverse) 
library(jsonlite)
library(purrr)

url_to_df <- function(u){
  urlvalue <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', u)
  j <- fromJSON(url(urlvalue)) 
  tibble(resolution_no = u,  title=j$title,
         doc_sym=j$doc_sym,
         raw_txt=j$raw_txt)
}

inshows = c(1285:1350)  # also (1600:2000) doesn't work
mandate <- map_dfr(inshows, ~url_to_df(.))

 Error in j$title : $ operator is invalid for atomic vectors

I can't make sense of the traceback and the debug message (below), nor can I find help on Google. Do you know how to interpret the error message?

eval_tidy:
function (expr, data = NULL, env = caller_env()) 
{
  .Call(rlang_eval_tidy, expr, data, env)
}

Console: 
Called from: eval_tidy(xs[[i]], unique_output)
Browse[1]> 1
[1] 1
Error during wrapup: unexpected '[' in "["

Thank you very much!

Martin

nirgrahamuk · February 20, 2020, 2:49pm

hi Martin,
It's not quite as complex as all that. Put simply, occasionally you pass a number for which the web service you are relying on doesn't give you a JSON with 3 elements, but a JSON with a single element saying that what you searched for was not found.
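
The error message itself can be reproduced without the web service. This is a sketch under the assumption that a missing resolution parses to a single character string rather than a list; the variables found and not_found are made-up stand-ins for what fromJSON returns.

```r
# Stand-in for a successful reply: a list with the three expected fields.
found <- list(title = "Some title",
              doc_sym = "S/RES/1285(2000)",
              raw_txt = "Body text")

# Assumed stand-in for the "not found" reply: one plain string.
not_found <- "Not found"

found$title            # works, because a list can be indexed with $
is.atomic(not_found)   # TRUE: a character vector is an atomic vector

# Using $ on the string raises the error from the question.
msg <- tryCatch(not_found$title, error = function(e) conditionMessage(e))
msg
```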

Here is a version that uses the JSON length, as a clue and tries to handle it

library(tidyverse) 
library(jsonlite)
library(purrr)

url_to_df <- function(u){
  urlvalue <- paste0('https://m06qqef3qg.execute-api.us-east-1.amazonaws.com/dev/S/RES/', u)
  j <- fromJSON(url(urlvalue))
  cat("pulled ", u, " length ", length(j),"\n")
  if (length(j)==3) { # expected length
    tibble(resolution_no = u,  title=j$title,
           doc_sym=j$doc_sym,
           raw_txt=j$raw_txt)
  } else {
    tibble(resolution_no = u,  title="Not Found",
           doc_sym="Not Found",
           raw_txt="Not Found")
  }
}

inshows = c(1280:1350)  # range containing numbers that previously caused the error
mandate <- map_dfr(inshows, ~url_to_df(.))

you can tidy this by commenting out the cat() which gives visual feedback on progress and what was encountered
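
A possible variation on the length check is to test for the expected field names directly and to record NA for missing resolutions, which makes them easy to filter out later. This is only a sketch in base R with made-up inputs; has_resolution_fields and to_row are hypothetical helper names, and the "not found" reply is assumed to be a single string.

```r
# Hypothetical check: is j a list carrying all three expected fields?
has_resolution_fields <- function(j) {
  is.list(j) && all(c("title", "doc_sym", "raw_txt") %in% names(j))
}

# Hypothetical row builder: real values if present, NA otherwise.
to_row <- function(u, j) {
  if (has_resolution_fields(j)) {
    data.frame(resolution_no = u, title = j$title, doc_sym = j$doc_sym,
               raw_txt = j$raw_txt, stringsAsFactors = FALSE)
  } else {
    data.frame(resolution_no = u, title = NA_character_,
               doc_sym = NA_character_, raw_txt = NA_character_,
               stringsAsFactors = FALSE)
  }
}

# Made-up parsed replies standing in for fromJSON() results.
found     <- list(title = "Some title", doc_sym = "S/RES/1285(2000)",
                  raw_txt = "Body text")
not_found <- "Not found"

mandate <- do.call(rbind, list(to_row(1285L, found), to_row(1286L, not_found)))
mandate[!is.na(mandate$raw_txt), ]   # keep only resolutions that were found
```

Using NA rather than the string "Not Found" keeps placeholder text out of the corpus, so it cannot be counted as a word during text mining.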

ILearnR · February 21, 2020, 11:20am

Dear nirgrahamuk,

Many thanks for the intuitive explanation and your encouraging words.
In a couple of years my code will be better too.

Thanks for the additional solution detecting empty files!

Happy Friday!

nirgrahamuk · February 21, 2020, 11:23am

you're very welcome, enjoy your weekend
