I am trying to scrape some pages with the help of the polite package and map(). But I am getting the following error:
[[1]]
{xml_document}
Error in nchar(desc) : invalid multibyte string, element 2
And instead of scraping all pages in the given range, it only scrapes the first page over and over for the entire loop.
library(polite)
library(rvest)
library(purrr)
dawnsession <- bow("https://www.dawn.com")
dawnsession
dates <- seq(as.Date("2019-04-01"), as.Date("2019-04-30"), by="days")
fulllinks <- map(dates, ~scrape(dawnsession, params = paste0("archive/",.x)) )
links <- map(fulllinks, ~ html_nodes(.x, ".mb-4") %>%
  html_nodes(".story__link") %>%
  html_attr("href"))
cderv
May 25, 2019, 9:16am
From this question (Scrapping 400 pages using rvest and purr - #3 by hassannasir), I believe you want to scrape URLs of the form News Archives for 2019-05-22 - DAWN.COM, i.e. https://www.dawn.com/archive/2019-05-22.
Here you are passing params to scrape(). Per the documentation and how URLs work, params is for query-string parameters, i.e. URLs of the form https://www.dawn.com/?key=value. That is not the same thing as the /archive/2019-04-01 path you want to reach.
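To make the distinction concrete, here is a rough sketch (page=2 is a made-up parameter purely for illustration, and the exact URL polite builds from params may vary by version):
# params builds the query-string part of the URL, e.g. something like
#   scrape(dawnsession, params = "page=2")  ->  https://www.dawn.com/?page=2
# whereas the archive date is a path segment, which needs nod():
nod(dawnsession, "archive/2019-04-01") %>%
  scrape()
#   -> requests https://www.dawn.com/archive/2019-04-01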
Using polite, the function you are interested in is nod(). Look at the examples in the ?scrape help page.
You would need something like this:
fulllinks <- map(
  dates, ~ {
    nod(dawnsession, paste0("archive/", .x), verbose = TRUE) %>%
      scrape()
  }
)
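If the path were scrapable, you could then feed the results into the same link-extraction step you already wrote (a sketch reusing the selectors from your post, assuming each element of fulllinks is an xml_document):
links <- map(fulllinks, ~ html_nodes(.x, ".mb-4") %>%
  html_nodes(".story__link") %>%
  html_attr("href"))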
However, it seems you are not allowed to scrape this part of the website:
dawnsession <- polite::bow("https://www.dawn.com")
#> No encoding supplied: defaulting to UTF-8.
polite::nod(dawnsession, "archive/2019-04-01")
#> <polite session> https://www.dawn.com/archive/2019-04-01
#> User-agent: polite R package - https://github.com/dmi3kno/polite
#> robots.txt: 12 rules are defined for 1 bots
#> Crawl delay: 5 sec
#> The path is not scrapable for this user-agent
Note the message: "The path is not scrapable for this user-agent". This can be verified directly in the robots.txt file:
robotstxt::paths_allowed("https://www.dawn.com/archive/2019-04-01")
#>  www.dawn.com                      No encoding supplied: defaulting to UTF-8.
#> [1] FALSE
rt <- robotstxt::robotstxt("https://www.dawn.com")
#> No encoding supplied: defaulting to UTF-8.
rt$permissions
#> field useragent value
#> 1 Disallow * */print
#> 2 Disallow * */authors/*/1*
#> 3 Disallow * */authors/*/2*
#> 4 Disallow * */authors/*/3*
#> 5 Disallow * */authors/*/4*
#> 6 Disallow * */authors/*/5*
#> 7 Disallow * */authors/*/6*
#> 8 Disallow * */authors/*/7*
#> 9 Disallow * */authors/*/8*
#> 10 Disallow * */authors/*/9*
#> 11 Disallow * /newspaper/*/20*
#> 12 Disallow * /archive/*
You can see that /archive/* is disallowed.
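If you want to confirm that for your whole date range in one go, robotstxt::paths_allowed() should accept a vector of paths (a sketch; every date ought to come back FALSE because of the Disallow: /archive/* rule above):
# check every archive date in the range at once
dates <- seq(as.Date("2019-04-01"), as.Date("2019-04-30"), by = "days")
robotstxt::paths_allowed(paste0("https://www.dawn.com/archive/", dates))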
So you are not allowed to scrape this part of the website with code (a robot), sorry. See the resources on scraping responsibly. You should contact the website to ask for permission, or to retrieve the information from them another way.
system
Closed
June 15, 2019, 9:17am
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.