Scraping images from the web

Hi, I am wondering how it would be possible to scrape images from a website in bulk and save them on my computer. I already tried the following code, and it successfully saves the URL links, but when I use the download.file() function it only downloads the first element/image from the vector or data frame:

library(rvest)
bnb <- read_html("https://www.airbnb.co.uk/rooms/52486389?adults=1&category_tag=Tag%3A8225&children=0&infants=0&search_mode=flex_destinations_search&check_in=2022-05-20&check_out=2022-05-27&federated_search_id=1f863c44-cdf7-4e16-ab2d-3b1b7d5aed91&source_impression_id=p3_1648663537_0jW9B28n%2B98bVjKI")
imageurl <- bnb %>% html_nodes("img._6tbg2q") %>% html_attr("src")
head(imageurl)
imageurl

imagedf <- data.frame(links=imageurl)

download.file(imagedf$links, "imagedf$links.jpg", mode = "wb")

for (i in 1:5) { # loop over the first 5 rows; change 5 to your number of rows
  myurl <- paste(imagedf[i, 1], sep = "") # column 1 holds the links; adjust if your data frame differs
  download.file(myurl, "imagedf$links.jpg", mode = "wb")
}

The issue here is that your output file is always the same, so even in your for loop you are downloading the files but saving them all into imagedf$links.jpg, each download overwriting the last.

I would suggest using something like

download.file(imagedf$links, c(paste0("imagedf$links-",seq(1,length(imagedf$links)),".jpg")))

which will download ALL files and store them in imagedf$links-1.jpg ... imagedf$links-5.jpg
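For clarity (a small illustrative example, assuming just three links), the paste0() call only builds one destination file name per URL:

paste0("imagedf$links-", seq(1, 3), ".jpg")
#> [1] "imagedf$links-1.jpg" "imagedf$links-2.jpg" "imagedf$links-3.jpg"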


Hi, thank you so much for the reply. That makes sense, but when I follow your suggested code and try to save the images to different files, I get an error that only the first element of the destfile argument is used in the download.file() function. Moreover, the saved files now come out corrupted for some reason.

I am attaching the error message here; any help would be really appreciated.

Thank you.

It seems I didn't express myself clearly. The code I shared previously was meant to replace the for loop altogether with a single download.file() call, which IMHO is the better option compared to a for loop. The error you got occurs because download.file() only accepts vectors for url and destfile when method = "libcurl" is used; otherwise only the first element of each is taken. The corrupted files most likely come from downloading binary image data without mode = "wb". The for loop is still possible, though. Find below both options, properly commented, in example code based on yours.

library(rvest)
bnb <- read_html("https://www.airbnb.co.uk/rooms/52486389?adults=1&category_tag=Tag%3A8225&children=0&infants=0&search_mode=flex_destinations_search&check_in=2022-05-20&check_out=2022-05-27&federated_search_id=1f863c44-cdf7-4e16-ab2d-3b1b7d5aed91&source_impression_id=p3_1648663537_0jW9B28n%2B98bVjKI")
imageurl <- bnb %>% html_nodes("img._6tbg2q") %>% html_attr("src")
head(imageurl)
imageurl

imagedf <- data.frame(links=imageurl)

# Option 1 - Use a single download.file() call to download all images.
# method = "libcurl" lets url and destfile be vectors of the same length;
# mode = "wb" prevents corrupted images on Windows.

download.file(imagedf$links,
              paste0("/tmp/imagedf$links-", seq_along(imagedf$links), ".jpg"),
              method = "libcurl", mode = "wb")

# Option 2 - Use a for loop and download one image after the other. 

for (i in seq_len(nrow(imagedf))) { # loop over every row of the data frame
  myurl <- imagedf$links[i] # the i-th image URL
  download.file(myurl, paste0("/tmp/imagedf$links-", i, ".jpg"), mode = "wb")
}
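If some of the links might be unreachable, you could wrap the loop body in tryCatch() (a minimal sketch, assuming you want to skip failed downloads rather than abort the whole run):

for (i in seq_len(nrow(imagedf))) {
  tryCatch(
    download.file(imagedf$links[i], paste0("/tmp/imagedf$links-", i, ".jpg"), mode = "wb"),
    error = function(e) message("Skipping image ", i, ": ", conditionMessage(e))
  )
}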
