Hi y'all! I've been teaching myself how to scrape with rvest for a work project, and (after finally getting the script working) I've been hit with a 403 error. From what I've gathered online, this means I've been flagged and denied access for bot activity. I've been trying to find workarounds, but I don't know enough about web scraping or the backend of web pages to implement them myself yet, so I'm hoping to get some advice here from more experienced folks.
For some background, I am attempting to scrape job-posting info from Indeed. My script has run and collected data for the past three days, but today it seems to have been flagged and blocked. I repeat the same process five different times (once each for high school, associate's, bachelor's, master's, and doctoral education requirements) before appending the observations to a growing data frame; see the sketch right below for how I'd eventually like to collapse those five runs into one.
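(Side note: since those five runs differ only by the attr code in the search URL, here's a rough sketch of the single-function version I'm aiming for. scrape_indeed() is a hypothetical stand-in for the scraping pipeline shown later in this post, and every attr code except the high-school one is a placeholder I'd still need to look up.)

library(tidyverse)  # purrr/dplyr

# Build the search URL for a given education-level attr code
build_start_url <- function(attr_code) {
  paste0("https://www.indeed.com/jobs?q=&l=Washington+DC&sc=0kf%3Aattr%28",
         attr_code, "%252COR%29%3B&radius=0&fromage=1&sort=date&start=0")
}

education_codes <- c(
  high_school = "FCGTU%7CQJZM9",  # real code, taken from my URL below
  associates  = "TODO",           # placeholder
  bachelors   = "TODO",           # placeholder
  masters     = "TODO",           # placeholder
  doctoral    = "TODO")           # placeholder

# One data frame with an education column instead of five copies of the script
# (scrape_indeed() = hypothetical wrapper around the pipeline shown below)
all_jobs <- imap_dfr(education_codes,
                     ~ scrape_indeed(build_start_url(.x)) %>%
                         mutate(education = .y))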
Loaded Libraries
## Data Wrangling
library(tidyverse)
## Webscraping
library(rvest)
## Scheduling Automated Script Run
library(cronR)
## Dates
library(lubridate)
First iteration
I first load the initial page to gather baseline information on each job posting (title, date posted, company) and to build a link to each full job-description page (which I use later to parse out the rest of the information I'm interested in).
# High School
# Load in Washington, DC only jobs sorted by Date
url <- "https://www.indeed.com/jobs?q=&l=Washington+DC&sc=0kf%3Aattr%28FCGTU%7CQJZM9%252COR%29%3B&radius=0&fromage=1&sort=date&start=0"
indeed <- read_html(url)
# Index Total Number of Current Job Postings
total_jobs <- indeed %>%
  html_elements(
    css = ".jobsearch-JobCountAndSortPane-jobCount span:nth-child(1)") %>%
  html_text() %>%
  # Remove all non-numeric characters and spaces
  str_remove_all(., pattern = "[:alpha:]|[:punct:]|[:space:]") %>%
  as.numeric()
# Note that there are 15 job postings per page
## Round up to the nearest integer
total_pages <- ceiling(total_jobs / 15)
# Create a Pattern to Sequence Along by
page_sequence <- seq(from = 0, to = (total_pages * 10) - 10, by = 10)
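## e.g., 95 total jobs -> ceiling(95 / 15) = 7 pages -> start = 0, 10, ..., 60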
# Create a vector of links to iterate through that filters for the "high school degree" jobs only
page_url <- paste("https://www.indeed.com/jobs?q=&l=Washington+DC&sc=0kf%3Aattr%28FCGTU%7CQJZM9%252COR%29%3B&radius=0&fromage=1&sort=date&start=",
page_sequence, sep = "")
# Iterate through every page to parse out all information
high_school <- map_dfr(page_url, ~ {
  ## Replicate human input with a 2-3 second pause between requests
  Sys.sleep(runif(1, 2, 3))
  ## Read in the page (named page to avoid shadowing the url object above)
  page <- .x %>% read_html()
  ## Job titles
  title <- page %>%
    html_elements(xpath = "//td[@class='resultContent']//h2/a") %>%
    html_text() %>%
    tolower()
  ## Company names
  company <- page %>%
    html_elements(xpath = "//span[@class = 'companyName']") %>%
    html_text() %>%
    tolower()
  ## Date posted
  date_posted <- page %>%
    html_elements(xpath = "//span[@class = 'date']") %>%
    html_text() %>%
    tolower()
  ### Update date posted to reflect actual calendar dates
  date_posted[which(date_posted == "postedjust posted")] <- as.character(today())
  date_posted[which(str_detect(date_posted, "today"))] <- as.character(today())
  ### Match "1 day ago" specifically; a bare "1" would also match digits in
  ### the date strings assigned above
  date_posted[which(str_detect(date_posted, "1 day ago"))] <- as.character(today() - 1)
  ## Job description links
  ### Extract the @data-jk attribute and convert it to a vector of strings
  link <- page %>%
    html_elements(xpath = "//td[@class='resultContent']//h2/a[@data-jk]/@data-jk") %>%
    html_text()
  ### Append each @data-jk value to https://www.indeed.com/viewjob?jk=
  link <- paste("https://www.indeed.com/viewjob?jk=", link, sep = "")
  ## Bind into a new data frame
  distinct(bind_cols(title, company, date_posted, link))
})
# Rename Column Names
colnames(high_school) <- c("title", "company", "date_posted", "link")
# Remove Duplicates
high_school <- high_school %>% distinct()
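(One tweak I've been testing since the block happened mid-run: wrapping read_html() in purrr::possibly() so a single failed request returns NULL and that page gets skipped, instead of the whole map aborting. A minimal sketch using the objects above, with only the title column shown:)

# Fail soft: a blocked page (e.g., a 403) yields an empty tibble, not an error
safe_read <- possibly(read_html, otherwise = NULL)

high_school <- map_dfr(page_url, ~ {
  Sys.sleep(runif(1, 2, 3))
  page <- safe_read(.x)
  if (is.null(page)) return(tibble())  # skip pages that failed to load
  tibble(
    title = page %>%
      html_elements(xpath = "//td[@class='resultContent']//h2/a") %>%
      html_text() %>% tolower()
    # ... company, date_posted, and link built exactly as in the block above ...
  )
})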
2nd, 3rd, and 4th Iterations
I then take every link gathered above (each goes to a full job-description page) and iterate through them to parse out the full job-description text, salary information, and hiring insights. I tried to complete this within a single map function but was unsuccessful, hence three separate iterations that gather the information individually. I know this isn't optimal, but I could only get it to work one field at a time. The code for each is more or less the same, so I've only included one iteration. If you're also able to show me how to combine these iterations into a single process (I've sketched my best guess after the code below), that would be amazing and would really cut down the computation time.
# Extracting Information from Job Description Links
## Full Job Description Text
high_school$description <- map(high_school$link, ~ {
  # Replicate human input by forcing random pauses
  Sys.sleep(runif(1, 2, 3))
  # Read in the job page
  .x %>% read_html() %>%
    # Navigate to the full job description element of the page by its id
    html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
    # Convert the element to text
    html_text() %>%
    tolower()
})
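(And here's my best guess at the single-pass version I asked about: read each link once, pull all three fields from the same parsed page, and return one row per link. The description XPath is from my working code; the salary and hiring-insights XPaths are placeholders I haven't verified, and grab() is a helper I made up.)

# Sketch: one read per link, all three fields at once, joined back by link
details <- map_dfr(high_school$link, ~ {
  Sys.sleep(runif(1, 2, 3))
  page <- read_html(.x)
  # Collapse all matches to a single string; NA if nothing was found
  grab <- function(xp) {
    out <- page %>% html_elements(xpath = xp) %>% html_text() %>% tolower()
    if (length(out) == 0) NA_character_ else paste(out, collapse = " ")
  }
  tibble(
    link        = .x,
    description = grab("//div[@id = 'jobDescriptionText']"),
    salary      = grab("//div[@id = 'salaryInfoAndJobType']"),      # placeholder
    insights    = grab("//div[@id = 'hiringInsightsSectionRoot']")  # placeholder
  )
})
# Join the three new fields back onto the baseline data frame
high_school <- left_join(high_school, details, by = "link")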
Currently, I receive the 403 error no matter which indeed.com page I load into rvest, and I have no idea how to get past it. Thank you in advance for your help!
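For what it's worth, the workaround I've seen suggested most often is sending a browser-like User-Agent header via httr, since rvest's default apparently makes requests easy to flag. I haven't confirmed this actually gets past Indeed's block, so is something like this on the right track? (The UA string below is just an example.)

library(httr)

# Request the page while identifying as a regular browser
res <- GET(url,
           user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"))
status_code(res)          # hoping for 200 rather than 403
indeed <- read_html(res)  # read_html() accepts an httr response object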