Hi R users
I would like to download some pdf files from the web (scientific articles).
I have tried using the {downloader} package, but I got the following error:
HTTP status was '403 Forbidden'
Error in download.file(url, method = method, ...): cannot open URL
Here is my code:
library(downloader)
download("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true" ,
'Ecology and space in the COVID-19 epidemic diffusion: a multifactorial analysis of Italy’s provinces.pdf', mode = "wb",
headers=c("user-agent" = "Mozilla/5.0"))
I am not sure about this, but I do not think {downloader} is really intended for this if you want to work with the text in R. If you just want to read the pdf, you can go to the website and download the file.
Hi @jrkrideau, thank you,
I found several examples using the {downloader} package, but in those the URL always ends with the .pdf extension.
I would really appreciate it if you could provide me with an example using the {pdftools} package.
Thank you
Angela
Hi @jrkrideau, I still get the 403 error, which is strange given that the article is open access.
The only difference from the example you provided is the file extension, as the URL of the file does not end with .pdf.
library(reprex)
#> Warning: package 'reprex' was built under R version 4.3.2
library(pdftools)
#> Warning: package 'pdftools' was built under R version 4.3.2
#> Using poppler version 23.08.0
download.file("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true",
"article.pdf", mode = "wb")
#> Warning in
#> download.file("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true",
#> : cannot open URL
#> 'https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true':
#> HTTP status was '403 Forbidden'
#> Error in download.file("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true", : cannot open URL 'https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true'
txt <- pdf_text("article.pdf")
#> Error in normalizePath(path.expand(path), winslash, mustWork): path[1]="article.pdf": The system cannot find the file specified
# first page text
cat(txt[1])
#> Error in eval(expr, envir, enclos): object 'txt' not found
# second page text
cat(txt[2])
#> Error in eval(expr, envir, enclos): object 'txt' not found
I think it has to do with the fact that the link is not to a pdf, but to T&F's own interface, which intentionally restricts automated downloads. Taking a look at the Zotero translator, it seems you first have to visit the website with a browser and get a cookie set, and only then can you scrape the pdf. So not something easy to do with {downloader}.
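Just to illustrate that "visit first, then fetch" idea, here is a rough, untested sketch using {httr2} (not mentioned above, just one way to carry cookies between requests): hit the landing page first so the server can set its cookies, then reuse them for the pdf request. The direct-PDF URL pattern is my guess, not something I have verified.

library(httr2)

landing <- "https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true"
# assumed direct-PDF URL pattern; the real one may differ
pdf_url <- "https://www.tandfonline.com/doi/pdf/10.1080/21681376.2023.2234433"

jar <- tempfile(fileext = ".cookies")

# 1. visit the landing page so the server can set its cookies
request(landing) |>
  req_user_agent("Mozilla/5.0") |>
  req_cookie_preserve(jar) |>
  req_perform()

# 2. reuse the same cookie jar when requesting the pdf itself
resp <- request(pdf_url) |>
  req_user_agent("Mozilla/5.0") |>
  req_cookie_preserve(jar) |>
  req_perform()

writeBin(resp_body_raw(resp), "article.pdf")

This may well still come back 403 if the site checks for a real browser, in which case driving an actual browser is probably the only way.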
I'm not fully sure about the right approach; I suspect you could do something with RSelenium, but that may not be easy. You'd also have to check whether it's legal or whether T&F has restrictions on bulk/programmatic downloads (though that seems doubtful for a Creative Commons license).
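If you do try RSelenium, the rough idea (again untested here, it needs a working Selenium/driver setup, and the direct-PDF URL below is an assumption) would be to let a real browser open the page and then reuse its cookies for the download:

library(RSelenium)

driver <- rsDriver(browser = "firefox")
remDr <- driver$client

# open the article page in a real browser so all cookies get set
remDr$navigate("https://www.tandfonline.com/doi/epdf/10.1080/21681376.2023.2234433?needAccess=true")
Sys.sleep(5)  # give the page time to load

# collect the browser's cookies into a single Cookie header
cookies <- remDr$getAllCookies()
cookie_header <- paste(
  vapply(cookies, function(ck) paste0(ck$name, "=", ck$value), character(1)),
  collapse = "; "
)

# try the download with the browser's cookies and user agent
download.file(
  "https://www.tandfonline.com/doi/pdf/10.1080/21681376.2023.2234433",  # assumed direct-PDF URL
  "article.pdf",
  mode = "wb",
  headers = c("User-Agent" = "Mozilla/5.0", "Cookie" = cookie_header)
)

remDr$close()
driver$server$stop()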
I'd also suggest checking the software mentioned here.
In the end, the easiest option would be if the same paper is also available on a different platform (ScienceDirect, JSTOR, PubMed Central, ...) that has an interface for downloading pdfs.