Hi y'all! I've been teaching myself how to scrape with rvest for a work project, and (after finally getting the script working) I've been hit with a 403 error. From what I've gathered online, this means I've been flagged and denied access for bot activity. I've been trying to find workarounds, but I don't know enough about web scraping or the backend of web pages to implement them myself yet, so I'm hoping to get some advice here from more experienced folks.
For some background, I am attempting to scrape job-posting info from Indeed. My script has run and collected data for the past three days, but today it seems to have been flagged and blocked. I repeat the same process five different times (once each for high school, associate's, bachelor's, master's, and doctoral education requirements) before appending the observations to a growing data frame; see the sketch right below for how I'd eventually like to collapse those five runs into one.
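(Side note: since those five runs differ only by the attr code in the search URL, here's a rough sketch of the single-function version I'm aiming for. scrape_indeed() is a hypothetical stand-in for the scraping pipeline shown later in this post, and every attr code except the high-school one is a placeholder I'd still need to look up.)

library(tidyverse)  # purrr/dplyr

# Build the search URL for a given education-level attr code
build_start_url <- function(attr_code) {
  paste0("https://www.indeed.com/jobs?q=&l=Washington+DC&sc=0kf%3Aattr%28",
         attr_code, "%252COR%29%3B&radius=0&fromage=1&sort=date&start=0")
}

education_codes <- c(
  high_school = "FCGTU%7CQJZM9",  # real code, taken from my URL below
  associates  = "TODO",           # placeholder
  bachelors   = "TODO",           # placeholder
  masters     = "TODO",           # placeholder
  doctoral    = "TODO")           # placeholder

# One data frame with an education column instead of five copies of the script
# (scrape_indeed() = hypothetical wrapper around the pipeline shown below)
all_jobs <- imap_dfr(education_codes,
                     ~ scrape_indeed(build_start_url(.x)) %>%
                         mutate(education = .y))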
Loaded Libraries
## Data Wrangling
library(tidyverse)
## Webscraping
library(rvest)
## Scheduling Automated Script Run
library(cronR)
## Dates
library(lubridate)
First iteration
I first load the initial page to gather baseline information on each job posting (title, date posted, company) and to build a link to each full job-description page (which I use later to parse out the rest of the information I'm interested in).
# High School
# Load in Washington, DC only jobs sorted by Date
url <- "https://www.indeed.com/jobs?q=&l=Washington+DC&sc=0kf%3Aattr%28FCGTU%7CQJZM9%252COR%29%3B&radius=0&fromage=1&sort=date&start=0"
indeed <- read_html(url)
# Index Total Number of Current Job Postings
total_jobs <- indeed %>%
  html_elements(
    css = ".jobsearch-JobCountAndSortPane-jobCount span:nth-child(1)") %>%
  html_text() %>%
  # Remove all non-numeric characters and spaces
  str_remove_all(., pattern = "[:alpha:]|[:punct:]|[:space:]") %>%
  as.numeric()
# Note that there are 15 job postings per page
## Round up to the nearest integer
total_pages <- ceiling(total_jobs / 15)
# Create a Pattern to Sequence Along by
page_sequence <- seq(from = 0, to = (total_pages * 10) - 10, by = 10)
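## e.g., 95 total jobs -> ceiling(95 / 15) = 7 pages -> start = 0, 10, ..., 60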
# Create a vector of links to iterate through that filters for the "high school degree" jobs only
page_url <- paste("https://www.indeed.com/jobs?q=&l=Washington+DC&sc=0kf%3Aattr%28FCGTU%7CQJZM9%252COR%29%3B&radius=0&fromage=1&sort=date&start=",
page_sequence, sep = "")
# Iterate through every page to parse out all information
high_school <- map_dfr(page_url, ~ {
  ## Replicate human input with a 2-3 second pause between requests
  Sys.sleep(runif(1, 2, 3))
  ## Read in the page (named page to avoid shadowing the url object above)
  page <- .x %>% read_html()
  ## Job titles
  title <- page %>%
    html_elements(xpath = "//td[@class='resultContent']//h2/a") %>%
    html_text() %>%
    tolower()
  ## Company names
  company <- page %>%
    html_elements(xpath = "//span[@class = 'companyName']") %>%
    html_text() %>%
    tolower()
  ## Date posted
  date_posted <- page %>%
    html_elements(xpath = "//span[@class = 'date']") %>%
    html_text() %>%
    tolower()
  ### Update date posted to reflect actual calendar dates
  date_posted[which(date_posted == "postedjust posted")] <- as.character(today())
  date_posted[which(str_detect(date_posted, "today"))] <- as.character(today())
  ### Match "1 day ago" specifically; a bare "1" would also match digits in
  ### the date strings assigned above
  date_posted[which(str_detect(date_posted, "1 day ago"))] <- as.character(today() - 1)
  ## Job description links
  ### Extract the @data-jk attribute and convert it to a vector of strings
  link <- page %>%
    html_elements(xpath = "//td[@class='resultContent']//h2/a[@data-jk]/@data-jk") %>%
    html_text()
  ### Append each @data-jk value to https://www.indeed.com/viewjob?jk=
  link <- paste("https://www.indeed.com/viewjob?jk=", link, sep = "")
  ## Bind into a new data frame
  distinct(bind_cols(title, company, date_posted, link))
})
# Rename Column Names
colnames(high_school) <- c("title", "company", "date_posted", "link")
# Remove Duplicates
high_school <- high_school %>% distinct()
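(One tweak I've been testing since the block happened mid-run: wrapping read_html() in purrr::possibly() so a single failed request returns NULL and that page gets skipped, instead of the whole map aborting. A minimal sketch using the objects above, with only the title column shown:)

# Fail soft: a blocked page (e.g., a 403) yields an empty tibble, not an error
safe_read <- possibly(read_html, otherwise = NULL)

high_school <- map_dfr(page_url, ~ {
  Sys.sleep(runif(1, 2, 3))
  page <- safe_read(.x)
  if (is.null(page)) return(tibble())  # skip pages that failed to load
  tibble(
    title = page %>%
      html_elements(xpath = "//td[@class='resultContent']//h2/a") %>%
      html_text() %>% tolower()
    # ... company, date_posted, and link built exactly as in the block above ...
  )
})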
2nd, 3rd, and 4th Iterations
I then take every link gathered above (each goes to a full job-description page) and iterate through them to parse out the full job-description text, salary information, and hiring insights. I tried to complete this within a single map function but was unsuccessful, hence three separate iterations that gather the information individually. I know this isn't optimal, but I could only get it to work one field at a time. The code for each is more or less the same, so I've only included one iteration. If you're also able to show me how to combine these iterations into a single process (I've sketched my best guess after the code below), that would be amazing and would really cut down the computation time.
# Extracting Information from Job Description Links
## Full Job Description Text
high_school$description <- map(high_school$link, ~ {
  # Replicate human input by forcing random pauses
  Sys.sleep(runif(1, 2, 3))
  # Read in the job page
  .x %>% read_html() %>%
    # Navigate to the full job description element of the page by its id
    html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
    # Convert the element to text
    html_text() %>%
    tolower()
})
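(And here's my best guess at the single-pass version I asked about: read each link once, pull all three fields from the same parsed page, and return one row per link. The description XPath is from my working code; the salary and hiring-insights XPaths are placeholders I haven't verified, and grab() is a helper I made up.)

# Sketch: one read per link, all three fields at once, joined back by link
details <- map_dfr(high_school$link, ~ {
  Sys.sleep(runif(1, 2, 3))
  page <- read_html(.x)
  # Collapse all matches to a single string; NA if nothing was found
  grab <- function(xp) {
    out <- page %>% html_elements(xpath = xp) %>% html_text() %>% tolower()
    if (length(out) == 0) NA_character_ else paste(out, collapse = " ")
  }
  tibble(
    link        = .x,
    description = grab("//div[@id = 'jobDescriptionText']"),
    salary      = grab("//div[@id = 'salaryInfoAndJobType']"),      # placeholder
    insights    = grab("//div[@id = 'hiringInsightsSectionRoot']")  # placeholder
  )
})
# Join the three new fields back onto the baseline data frame
high_school <- left_join(high_school, details, by = "link")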
Currently, I receive the 403 error no matter which indeed.com page I load into rvest, and I have no idea how to get past it. Thank you in advance for your help!
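For what it's worth, the workaround I've seen suggested most often is sending a browser-like User-Agent header via httr, since rvest's default apparently makes requests easy to flag. I haven't confirmed this actually gets past Indeed's block, so is something like this on the right track? (The UA string below is just an example.)

library(httr)

# Request the page while identifying as a regular browser
res <- GET(url,
           user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"))
status_code(res)          # hoping for 200 rather than 403
indeed <- read_html(res)  # read_html() accepts an httr response object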