From this question (Scrapping 400 pages using rvest and purr - #3 by hassannasir), I believe you want to scrape a URL of this form: https://www.dawn.com/archive/2019-05-22 (News Archives for 2019-05-22 - DAWN.COM).
Here you are passing parameters to the URL. Per the documentation and the way URLs work, query parameters are for URLs of the form https://example.com/page?key=value. That is not the same thing: in the archive URL, the date is part of the path.
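A small illustration of the difference (example.com and the q / page parameter names here are made up, and httr is only used to show the query form): query parameters are appended after a "?", whereas the archive date is a segment of the path itself.
library(httr)
# Query-parameter form: gives "https://example.com/search?q=news&page=2"
modify_url("https://example.com/search", query = list(q = "news", page = 2))
# Path form used by the Dawn archive: the date is part of the path, not a query parameter
paste0("https://www.dawn.com/archive/", "2019-05-22")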
Using polite, the function you are interested in is nod(). Look at the examples in the ?scrape help page.
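Assuming the polite and purrr packages, and a dates vector you build yourself (the range below is only an illustration, replace it with the dates you actually need), the setup could look like this:
library(polite)
library(purrr)
# Create the polite session once for the host
dawnsession <- bow("https://www.dawn.com")
# Archive dates to visit, as "YYYY-MM-DD" strings (illustrative range only)
dates <- as.character(seq(as.Date("2019-05-01"), as.Date("2019-05-22"), by = "day"))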
You would then need something like this:
fulllinks <- map(dates, ~ {
  # nod() changes the path within the session, then scrape() fetches it politely
  nod(dawnsession, paste0("archive/", .x), verbose = TRUE) %>%
    scrape()
})
However, it seems you are not allowed to scrape this part of the website:
dawnsession <- polite::bow("https://www.dawn.com")
#> No encoding supplied: defaulting to UTF-8.
polite::nod(dawnsession, "archive/2019-04-01")
#> <polite session> https://www.dawn.com/archive/2019-04-01
#> User-agent: polite R package - https://github.com/dmi3kno/polite
#> robots.txt: 12 rules are defined for 1 bots
#> Crawl delay: 5 sec
#> The path is not scrapable for this user-agent
The path is not scrapable for this user-agent, and this can be verified directly in the robots.txt file:
robotstxt::paths_allowed("https://www.dawn.com/archive/2019-04-01")
#>  www.dawn.com
#> No encoding supplied: defaulting to UTF-8.
#> [1] FALSE
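If you want to check all the archive dates at once, paths_allowed() accepts a vector of paths, so something like this should work (dates being the vector sketched above):
# Expected to return one logical per date, all FALSE here because /archive/* is disallowed
robotstxt::paths_allowed(paste0("https://www.dawn.com/archive/", dates))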
rt <- robotstxt::robotstxt("https://www.dawn.com")
#> No encoding supplied: defaulting to UTF-8.
rt$permissions
#>       field useragent            value
#> 1  Disallow         *          */print
#> 2  Disallow         *   */authors/*/1*
#> 3  Disallow         *   */authors/*/2*
#> 4  Disallow         *   */authors/*/3*
#> 5  Disallow         *   */authors/*/4*
#> 6  Disallow         *   */authors/*/5*
#> 7  Disallow         *   */authors/*/6*
#> 8  Disallow         *   */authors/*/7*
#> 9  Disallow         *   */authors/*/8*
#> 10 Disallow         *   */authors/*/9*
#> 11 Disallow         * /newspaper/*/20*
#> 12 Disallow         *       /archive/*
You can see that /archive/* is disallowed.
So you are not allowed to scrape this part of the website programmatically (with a robot), sorry. See the resources about scraping responsibly. You should contact the website to ask for permission, or to request the information from them directly.