I am trying to scrape some pages with the help of the polite package and map(). But I am getting the following error:
[[1]]
{xml_document}
Error in nchar(desc) : invalid multibyte string, element 2
And instead of scraping all pages in the given range, it only scrapes the first page over and over for the entire loop.
library(polite)
library(rvest)
library(purrr)
dawnsession <- bow("https://www.dawn.com")
dawnsession
dates <- seq(as.Date("2019-04-01"), as.Date("2019-04-30"), by="days")
fulllinks <- map(dates, ~scrape(dawnsession, params = paste0("archive/",.x)) )
links <- map(fulllinks, ~ html_nodes(.x, ".mb-4") %>%
  html_nodes(".story__link") %>%
  html_attr("href"))
cderv
May 25, 2019, 9:16am
From this question (Scrapping 400 pages using rvest and purr - #3 by hassannasir), I believe you want to scrape URLs of the form News Archives for 2019-05-22 - DAWN.COM, i.e. https://www.dawn.com/archive/2019-05-22.
Here you are passing params to scrape(). Per the documentation and how URLs work, params is for query-string parameters, i.e. URLs of the form https://www.dawn.com/?key=value. That is not the same thing as the /archive/2019-04-01 path you want to reach.
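To make the distinction concrete, here is a rough sketch (page=2 is a made-up parameter purely for illustration, and the exact URL polite builds from params may vary by version):
# params builds the query-string part of the URL, e.g. something like
#   scrape(dawnsession, params = "page=2")  ->  https://www.dawn.com/?page=2
# whereas the archive date is a path segment, which needs nod():
nod(dawnsession, "archive/2019-04-01") %>%
  scrape()
#   -> requests https://www.dawn.com/archive/2019-04-01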
Using polite, the function you are interested in is nod(). Look at the examples in the ?scrape help page.
You would need something like this:
fulllinks <- map(
  dates, ~ {
    nod(dawnsession, paste0("archive/", .x), verbose = TRUE) %>%
      scrape()
  }
)
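If the path were scrapable, you could then feed the results into the same link-extraction step you already wrote (a sketch reusing the selectors from your post, assuming each element of fulllinks is an xml_document):
links <- map(fulllinks, ~ html_nodes(.x, ".mb-4") %>%
  html_nodes(".story__link") %>%
  html_attr("href"))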
However, it seems you are not allowed to scrape this part of the website:
dawnsession <- polite::bow("https://www.dawn.com")
#> No encoding supplied: defaulting to UTF-8.
polite::nod(dawnsession, "archive/2019-04-01")
#> <polite session> https://www.dawn.com/archive/2019-04-01
#> User-agent: polite R package - https://github.com/dmi3kno/polite
#> robots.txt: 12 rules are defined for 1 bots
#> Crawl delay: 5 sec
#> The path is not scrapable for this user-agent
Note the message: "The path is not scrapable for this user-agent". This can be verified directly in the robots.txt file:
robotstxt::paths_allowed("https://www.dawn.com/archive/2019-04-01")
#>  www.dawn.com                      No encoding supplied: defaulting to UTF-8.
#> [1] FALSE
rt <- robotstxt::robotstxt("https://www.dawn.com")
#> No encoding supplied: defaulting to UTF-8.
rt$permissions
#> field useragent value
#> 1 Disallow * */print
#> 2 Disallow * */authors/*/1*
#> 3 Disallow * */authors/*/2*
#> 4 Disallow * */authors/*/3*
#> 5 Disallow * */authors/*/4*
#> 6 Disallow * */authors/*/5*
#> 7 Disallow * */authors/*/6*
#> 8 Disallow * */authors/*/7*
#> 9 Disallow * */authors/*/8*
#> 10 Disallow * */authors/*/9*
#> 11 Disallow * /newspaper/*/20*
#> 12 Disallow * /archive/*
You can see that /archive/* is disallowed.
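If you want to confirm that for your whole date range in one go, robotstxt::paths_allowed() should accept a vector of paths (a sketch; every date ought to come back FALSE because of the Disallow: /archive/* rule above):
# check every archive date in the range at once
dates <- seq(as.Date("2019-04-01"), as.Date("2019-04-30"), by = "days")
robotstxt::paths_allowed(paste0("https://www.dawn.com/archive/", dates))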
So you are not allowed to scrape this part of the website with code (a robot), sorry. See the resources on scraping responsibly. You should contact the website to ask for permission, or to retrieve the information from them another way.
system
Closed
June 15, 2019, 9:17am
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.