Hi everyone,
I am trying to scrape data from a website. To do so, I have to write a loop that scrapes data from several subpages. However, my code fails to gather all the information that is actually on the website: the link texts that I extract with html_nodes() and html_text(), and then follow, are partly identical. As a result, I do not get all the information.
My code looks as follows:
library(rvest)
library(xml2)
library(dplyr)
url_vw_up <- "https://www.adac.de/infotestrat/autodatenbank/autokatalog/modelle.aspx?baureihe=up!&limit=1000#Ergebnis"
# VW up! overview page: collect the link text of each listed vehicle; follow_link(i) later uses these texts to open each vehicle's subpage
vw_up <- read_html(url_vw_up) %>% html_nodes(".img-wrap+ td .block") %>% html_text()
# create the desired format of dataframe
Adac_raw <- data.frame(matrix(nrow = 9, ncol = 0))
# loop for scraping information
s_vw_up <- html_session(url_vw_up)
for (i in vw_up[1:194]){
page_up <- s_vw_up %>% follow_link(i) %>% read_html()
# Problem: entries with duplicate names are overwritten here, so I end up with only 73 of the 194 observations. How can I change this?
Adac_raw[[i]] <- page_up %>% html_nodes("strong+ .box-section tr:nth-child(7) td+ td , strong+ .box-section tr:nth-child(6) td+ td , strong+ .box-section tr:nth-child(4) td+ td , strong+ .box-section tr:nth-child(3) td+ td , strong+ .box-section tr:nth-child(2) td+ td , strong+ .box-section tr:nth-child(1) td+ td , strong+ .box-section tr:nth-child(10) td+ td , strong+ .box-section tr:nth-child(11) td+ td , strong+ .box-section tr:nth-child(15) td+ td") %>% html_text()
Sys.sleep(2)
}
My code should actually return information about all 194 vehicles, but it only returns 73 because some of the names are identical. Within my loop, columns with the same name are overwritten when I assign the information to "Adac_raw". How can I change this so that the duplicates (entries with the same name) are kept?
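Here is a minimal sketch of what I think is happening, without any scraping. The model names and values are made up for illustration and are not taken from the ADAC site:

```r
# Hypothetical link texts; two of them are identical, as on the overview page.
link_names <- c("up! 1.0", "up! 1.0", "up! 1.0 BMT")

# Empty data frame with 9 rows, analogous to Adac_raw.
demo <- data.frame(matrix(nrow = 9, ncol = 0))

for (k in seq_along(link_names)) {
  # Placeholder for the 9 values scraped for one vehicle.
  values <- rep(as.character(k), 9)
  # Assignment by name: the second "up! 1.0" replaces the first column
  # instead of adding a new one.
  demo[[link_names[k]]] <- values
}

ncol(demo)          # 2, although the loop ran 3 times
length(link_names)  # 3
```

The column "up! 1.0" ends up holding only the values from the second iteration, which seems to match what I see with the real data (73 columns instead of 194).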