I am trying to scrape some links from the body of each page in this website.
Though I am able to retrieve the links with this full_url for the first page, when I try to parse the next few pages by adding the query argument page=, it does not work. I am not sure what is going wrong. Any help would be appreciated.
I am using the code below:
first_page <- read_html(full_url)
Welcome! It would likely work in your favor if you could provide some more details, e.g. your current iteration logic and the code that generates or extracts URLs for consecutive pages, plus exact error messages instead of "does not work". A fully reproducible example would be ideal.
One probable suspect would be your request rate, especially if you have not limited it in any way. Your crafted request might also not be quite valid, or you may need to use the same session while you crawl (rvest::session() or other means to send the provided cookies with your requests).
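For illustration, a session-based crawl with rvest could look roughly like this; the example.com URLs are just placeholders, not your actual site:
library(rvest)

# establish a session so later requests reuse the cookies it received
s <- session("https://example.com/results.php")

# navigate within the same session instead of calling read_html() fresh each time
s <- session_jump_to(s, "https://example.com/results.php?page=2")
html_elements(s, "td a") |> html_attr("href")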
Hi,
Thanks for the reply. So, first of all, I read the HTML of the first page using the command
first_page <- read_html(url)
But if you look closely at the link, it does not have any query parameter for the page number. So I presumed that by appending the extra query parameter page=, I would be able to move to any page, and it worked.
Now the problem arises when I try to extract the href links within the td elements of the table body. For the first page, it works fine. I am using:
td_links <- html_nodes(first_page, "td a")
and then I extract the href attributes by
hrefs <- html_attr(td_links, "href")
But when I try to do this for the other pages, I am not able to extract the href links. The output I am getting is character(0), as if there were no td elements on that page.
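Putting it together, the part that fails looks roughly like this (full_url holds the search URL of the first page):
library(rvest)

# works for the first page
first_page <- read_html(full_url)
hrefs <- html_attr(html_nodes(first_page, "td a"), "href")

# the same steps for the second page return nothing
second_page <- read_html(paste0(full_url, "&page=2"))
html_attr(html_nodes(second_page, "td a"), "href")
#> character(0)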
You may want to verify if and how it actually works.
Open your full_url with &page=2 appended in an incognito(!) window of your browser, and instead of a results table you should see a search form with a "No results found" notice. It only opens the 2nd results page if your browser has already established a session and the requests are made with session cookies.
As each rvest::read_html() call is like a new incognito session and cookies are not preserved, it receives the same content as your incognito browser.
You can check this with rvest too; the easiest way is to save the received content as an HTML file, which you can then open in your browser:
library(rvest)
full_url <- "https://iowacity.iowaassessors.com/results.php?sort_options=1&sort=0&mode=ressale&history=-1&sale_date1=01%2F01%2F2022&sale_date2=01%2F01%2F2025&sale_amt1=&sale_amt2=&recording1=&ilegal=&nutc1=0&occupancy1=&style1=&bedroom1=Equals&bedroom2=&ac1=&fireplace1=&bsmt1=&atticsf1=&atticsf2=&bsmtfin1=&bsmtfin2=&garatt1=&garatt2=&gardet1=&gardet2=&tla1=&tla2=&year1=&year2=&lot1=&lot2=&location1=&class1=2&dc=&maparea1=&dist1=&appraised1=&appraised2="
# without an established session, the 2nd page returns the "No results" form
page <- read_html(paste0(full_url, "&page=2"))
html_element(page, "#error") |> html_text(trim = TRUE)
#> [1] "No results found. To return to previous search, press the back button on browser."

# save the received content so you can open it in a browser
xml2::write_html(page, "tmp.html")
browseURL("tmp.html")
One option is to use rvest::session() instead of rvest::read_html(). As the site enforces a rate limit of 24 requests per minute, we could use purrr::slowly() so we don't exceed it, or just a Sys.sleep() (a Sys.sleep() variant is sketched after the example below).
library(rvest)
library(stringr)
library(purrr)
full_url <- "https://iowacity.iowaassessors.com/results.php?sort_options=1&sort=0&mode=ressale&history=-1&sale_date1=01%2F01%2F2022&sale_date2=01%2F01%2F2025&sale_amt1=&sale_amt2=&recording1=&ilegal=&nutc1=0&occupancy1=&style1=&bedroom1=Equals&bedroom2=&ac1=&fireplace1=&bsmt1=&atticsf1=&atticsf2=&bsmtfin1=&bsmtfin2=&garatt1=&garatt2=&gardet1=&gardet2=&tla1=&tla2=&year1=&year2=&lot1=&lot2=&location1=&class1=2&dc=&maparea1=&dist1=&appraised1=&appraised2="
# use session() instead of read_html() to simulate session in a browser
s <- session(full_url)
# extract total page count
page_count <-
  html_element(s, xpath = "//div[@id = 'resultsInfo']//td[1]/text()[3]") |>
  html_text(trim = TRUE) |>
  str_extract("\\d+$") |>
  as.numeric()
page_count
#> [1] 156
# rate-limited session_follow_link()
session_follow_link_slowly <- slowly(session_follow_link, rate = rate_delay(3))
# href storage
page_hrefs <- vector(mode = "list", length = page_count)
for (i in seq_along(page_hrefs)){
  # extract links from first column of resultsWrapper table
  page_hrefs[[i]] <-
    html_elements(s, "table#resultsWrapper:first-of-type tr > td:first-of-type > a") |>
    html_attr("href")
  # follow Next Page link if we haven't reached the last page
  if (i < page_count){
    s <- session_follow_link_slowly(s, xpath = "//img[@alt = 'Next Page ']/..")
  }
  # early stop for test & demo
  if (i >= 10) break
}
#> Navigating to <results.php?page=2&history=-2&ts=1747865603>.
#> Navigating to <results.php?page=3&history=-3&ts=1747865603>.
#> Navigating to <results.php?page=4&history=-4&ts=1747865603>.
#> Navigating to <results.php?page=5&history=-5&ts=1747865603>.
#> Navigating to <results.php?page=6&history=-6&ts=1747865603>.
#> Navigating to <results.php?page=7&history=-7&ts=1747865603>.
#> Navigating to <results.php?page=8&history=-8&ts=1747865603>.
#> Navigating to <results.php?page=9&history=-9&ts=1747865603>.
#> Navigating to <results.php?page=10&history=-10&ts=1747865603>.
#> Navigating to <results.php?page=11&history=-11&ts=1747865603>.
# flatten list and build absolute urls (200 total urls due to stopping at page 10/156)
urls <-
  unlist(page_hrefs) |>
  url_absolute(full_url)
str(urls)
#> chr [1:200] "https://iowacity.iowaassessors.com/sale.php?gid=16026&sid=158&mode=ressale" ...
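If you prefer the plain Sys.sleep() route mentioned above, here is a rough sketch of the same loop without purrr::slowly(); it reuses s, page_count and the selector from the example, and the 3-second pause is just a guess at a safe delay:
library(rvest)

# href storage
page_hrefs <- vector(mode = "list", length = page_count)

for (i in seq_along(page_hrefs)){
  # extract links from first column of resultsWrapper table
  page_hrefs[[i]] <-
    html_elements(s, "table#resultsWrapper:first-of-type tr > td:first-of-type > a") |>
    html_attr("href")
  # pause before each navigation to stay under the 24 requests / minute limit
  if (i < page_count){
    Sys.sleep(3)
    s <- session_follow_link(s, xpath = "//img[@alt = 'Next Page ']/..")
  }
}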
Thank you! This is working. So, if I now want to go up to a certain page and extract all the href links within its td elements, I can just run the loop and keep storing them in the urls vector. Great.
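For instance, to stop once I reach a particular page, I could just swap the demo break for a target of my own (target_page is just a placeholder name), reusing s, page_count and session_follow_link_slowly() from the example above:
target_page <- 25  # placeholder for the page I actually want to reach

for (i in seq_along(page_hrefs)){
  page_hrefs[[i]] <-
    html_elements(s, "table#resultsWrapper:first-of-type tr > td:first-of-type > a") |>
    html_attr("href")
  if (i < page_count){
    s <- session_follow_link_slowly(s, xpath = "//img[@alt = 'Next Page ']/..")
  }
  # stop once the target page has been scraped
  if (i >= target_page) break
}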