Hi R users
I would like to download some pdf files from the web (scientific articles).
I have tried using the {downloader} package, but I got the following error:
HTTP status was '403 Forbidden'
Error in download.file(url, method = method, ...): cannot open URL
Here is my code:
library(downloader)
download("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true" ,
'Ecology and space in the COVID-19 epidemic diffusion: a multifactorial analysis of Italy’s provinces.pdf', mode = "wb",
headers=c("user-agent" = "Mozilla/5.0"))
I am not sure about this, but I do not think {downloader} is really intended for this if you want to work with the text in R. If you just want to read the pdf, you can go to the website and download the file.
Hi @jrkrideau, thank you,
I found several examples using the {downloader} package, but in those the URL always ends with the .pdf extension.
I would really appreciate it if you could provide me with an example using the {pdftools} package.
Thank you
Angela
Hi @jrkrideau, I still get the 403 error, which is strange given that the article is open access.
The only difference from the example you provided is the file extension, as the URL of the file does not end with .pdf.
library(reprex)
#> Warning: package 'reprex' was built under R version 4.3.2
library(pdftools)
#> Warning: package 'pdftools' was built under R version 4.3.2
#> Using poppler version 23.08.0
download.file("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true",
"article.pdf", mode = "wb")
#> Warning in
#> download.file("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true",
#> : cannot open URL
#> 'https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true':
#> HTTP status was '403 Forbidden'
#> Error in download.file("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true", : cannot open URL 'https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true'
txt <- pdf_text("article.pdf")
#> Error in normalizePath(path.expand(path), winslash, mustWork): path[1]="article.pdf": The system cannot find the file specified
# first page text
cat(txt[1])
#> Error in eval(expr, envir, enclos): object 'txt' not found
# second page text
cat(txt[2])
#> Error in eval(expr, envir, enclos): object 'txt' not found
I think it has to do with the fact that the link is not to a pdf, but to T&F's own interface, which intentionally restricts automated downloads. Taking a look at the Zotero translator, it seems you first have to visit the website with a browser and get a cookie set, and only then can you scrape the pdf. So not something easy to do with {downloader}.
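Just to illustrate that "visit first, then fetch" idea, here is a rough, untested sketch using {httr2} (not mentioned above, just one way to carry cookies between requests): hit the landing page first so the server can set its cookies, then reuse them for the pdf request. The direct-PDF URL pattern is my guess, not something I have verified.

library(httr2)

landing <- "https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true"
# assumed direct-PDF URL pattern; the real one may differ
pdf_url <- "https://www.tandfonline.com/doi/pdf/10.1080/21681376.2023.2234433"

jar <- tempfile(fileext = ".cookies")

# 1. visit the landing page so the server can set its cookies
request(landing) |>
  req_user_agent("Mozilla/5.0") |>
  req_cookie_preserve(jar) |>
  req_perform()

# 2. reuse the same cookie jar when requesting the pdf itself
resp <- request(pdf_url) |>
  req_user_agent("Mozilla/5.0") |>
  req_cookie_preserve(jar) |>
  req_perform()

writeBin(resp_body_raw(resp), "article.pdf")

This may well still come back 403 if the site checks for a real browser, in which case driving an actual browser is probably the only way.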
I'm not fully sure about the right approach; I suspect you could do something with RSelenium, but that may not be easy. You'd also have to check whether it's legal or whether T&F has restrictions on bulk/programmatic downloads (though that seems doubtful for a Creative Commons license).
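If you do try RSelenium, the rough idea (again untested here, it needs a working Selenium/driver setup, and the direct-PDF URL below is an assumption) would be to let a real browser open the page and then reuse its cookies for the download:

library(RSelenium)

driver <- rsDriver(browser = "firefox")
remDr <- driver$client

# open the article page in a real browser so all cookies get set
remDr$navigate("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true")
Sys.sleep(5)  # give the page time to load

# collect the browser's cookies into a single Cookie header
cookies <- remDr$getAllCookies()
cookie_header <- paste(
  vapply(cookies, function(ck) paste0(ck$name, "=", ck$value), character(1)),
  collapse = "; "
)

# try the download with the browser's cookies and user agent
download.file(
  "https://www.tandfonline.com/doi/pdf/10.1080/21681376.2023.2234433",  # assumed direct-PDF URL
  "article.pdf",
  mode = "wb",
  headers = c("User-Agent" = "Mozilla/5.0", "Cookie" = cookie_header)
)

remDr$close()
driver$server$stop()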
I'd also suggest checking the software mentioned here.
In the end, the easiest option would be if the same paper is also available on a different platform (ScienceDirect, JSTOR, PubMed Central, ...) that has an interface for downloading pdfs.