Hi there,
I want to scrape a web link with rvest and show the results with shiny.
I have this in my ui part:
numericInput("end", "End Page", value = 100, min=100, max=1000, step = 100),
textInput("Initial_page", "Link")
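For context, here is roughly how I picture these inputs sitting in the full UI. The actionButton("go", ...) and tableOutput("results") IDs are just placeholder names I made up for this sketch:

library(shiny)

ui <- fluidPage(
  textInput("Initial_page", "Link"),                   # base link to scrape
  numericInput("end", "End Page", value = 100, min = 100, max = 1000, step = 100),
  actionButton("go", "Scrape"),                        # placeholder button to trigger the scrape
  tableOutput("results")                               # placeholder output to show the scraped table
)

And this is what I have so far for the scraping part: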
start <- 10 # where the page starts
end <-  # last page, should come from the numeric input "end"
links <- seq(start, end, by = 10) # it will return 10, 20, ..., end
# Make an empty data frame to store the data
data <- data.frame()
# Let's loop!
# we will process the links one by one, which is why I use the seq_along() function
for(i in seq_along(links)) {
Initial_page <- "https://linkdotblabla-" # should be the text input plus " symbols
url <- paste0(Initial_page, "&start=", links[i]) # construct the url by pasting
page <- xml2::read_html(url) # read the html
My problem is that I do not know:
- how to feed the link from the text input into the for loop so that my seq_along() loop works (I have sketched what I am aiming for below), and
- how to make the rvest part work inside Shiny.
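This is the kind of server wiring I am aiming for. It is only a sketch: the eventReactive() trigger and the "go" button / "results" output are placeholder names I made up, and I have not got this running:

server <- function(input, output, session) {
  scraped <- eventReactive(input$go, {          # run only when the placeholder button is clicked
    start <- 10
    end   <- input$end                          # last page, taken from the numeric input
    links <- seq(start, end, by = 10)
    Initial_page <- input$Initial_page          # base link, taken from the text input
    data <- data.frame()
    for (i in seq_along(links)) {
      url  <- paste0(Initial_page, "&start=", links[i])
      page <- xml2::read_html(url)
      Sys.sleep(2)
      # ... the rvest extraction from the full code below goes here ...
      # data <- rbind(data, df)
    }
    data
  })
  output$results <- renderTable(scraped())      # show the scraped data frame in the UI
}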
Here is the full code:
# Specifying the url
start <- 10 # where the page starts
end <- 1000 # last page, depends on how many data that you want
links <- seq(start, end, by = 10) # it will return 10, 20, ..., 1000
Alright, we loop over the links and store the results in a data frame.
# Make an empty dataframe to store the data
data <- data.frame()
# Let's loop!
# we will process the links, one by one, that's why I used seq_along function
for(i in seq_along(links)) {
Initial_page <- "https://ie.indeed.com/jobs?q=analyst&l=Ireland" # the very first page
url <- paste0(Initial_page, "&start=", links[i]) # construct the url by pasting
page <- xml2::read_html(url) # read the html
# Sys.sleep pauses R for two seconds between requests, to avoid hitting the server too fast and getting an error
Sys.sleep(2)
# right-click on the page and choose Inspect, or use a CSS selector add-in in Chrome
# get the job title
job_title <- page %>%
rvest::html_nodes("div") %>%
rvest::html_nodes(xpath = '//a[@data-tn-element = "jobTitle"]') %>%
rvest::html_attr("title")
# get the job location (CSS selector)
job_location <- page %>%
rvest::html_nodes('.accessible-contrast-color-location') %>%
rvest::html_text() %>%
stringi::stri_trim_both()
# get the company name
company_name <- page %>%
rvest::html_nodes("span") %>%
rvest::html_nodes(xpath = '//*[@class="company"]') %>%
rvest::html_text() %>%
stringi::stri_trim_both()
# get the job description (CSS selector)
job_description <- page %>%
rvest::html_nodes('.summary') %>%
rvest::html_text() %>%
stringi::stri_trim_both()
df <- data.frame(job_title, job_location, company_name, job_description)
data <- rbind(data, df)
}
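A side note on the loop itself: instead of growing data with rbind() on every pass, the same result can be built by scraping each page in a helper function and binding everything once at the end. The scrape_page() helper below is just a name I made up for this sketch:

# same work as the loop above, but each page returns its own data frame
scrape_page <- function(link, Initial_page) {
  url  <- paste0(Initial_page, "&start=", link)
  page <- xml2::read_html(url)
  Sys.sleep(2)
  # ... same rvest extraction as above, ending in data.frame(job_title, job_location, company_name, job_description) ...
  data.frame()  # placeholder return so the sketch runs on its own
}
data <- do.call(rbind, lapply(links, scrape_page, Initial_page = Initial_page))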
Next, I am only interested in finding a job in Dublin, so I keep the unique rows and add a city column set to Dublin.
# New Dublin Data set
df_IE <- data %>%
dplyr::distinct() %>%
dplyr::mutate(city = "Dublin") # add column city = Dublin
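If I actually wanted to keep only the Dublin rows rather than just label them, I assume something like dplyr::filter() on the location column would be needed, e.g.:

# keep only rows whose location mentions Dublin (assuming job_location contains the text "Dublin")
df_IE <- df_IE %>%
  dplyr::filter(grepl("Dublin", job_location))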
# Cleaning
df_IE$job_description <- gsub("[\r\n]", "", df_IE$job_description)
# in case you want to save the dataset into a csv
write.csv(df_IE,"df_IE.csv")
Please let me know if you have any ideas. Thanks!