HTTP 403 Error - worldfootballR scraping

Hi all,

I’m trying to scrape Premier League player stats using the worldfootballR package in R, but I keep getting the following error:

Error in purrr::map():
ℹ In index: 1.
Caused by error in read_html.response():
! Forbidden (HTTP 403).
Warning message:
In session_set_response(x, resp) : Forbidden (HTTP 403)

Any tips on how to fix this?

Thanks!

For which URL specifically? Does it work from the browser? Did you look at the HTTP response body to see what the error is? Does the website require login or cookies? Maybe it requires a different user-agent header?
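One way to check the response body and headers from R is to make the request directly with httr and look at what comes back. This is only a minimal sketch: `inspect_response` is a hypothetical helper name, not part of any package, and the URL in the usage example is simply the one used in the reprex in this thread.

```r
# Hypothetical helper: fetch a page and summarise what the server sent back.
# Unlike rvest::session() + read_html(), httr::GET() does not raise an error
# on a 403, so the status, headers and body can be inspected.
inspect_response <- function(url, ...) {
  resp <- httr::GET(url, ...)
  list(
    status       = httr::status_code(resp),
    content_type = httr::http_type(resp),
    server       = httr::headers(resp)[["server"]],
    # The first few hundred characters of the body often say why the request
    # was refused (e.g. a bot-protection page).
    body_start   = substr(
      httr::content(resp, as = "text", encoding = "UTF-8"), 1, 500
    )
  )
}

# Example call, only run in an interactive session because it needs network access.
if (interactive()) {
  str(inspect_response("https://fbref.com/en/comps/9/Premier-League-Stats"))
}
```

Extra arguments are passed on to `httr::GET()`, so something like `httr::user_agent("...")` or `httr::add_headers(...)` can be supplied to see whether different headers change the status.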

Target sites are blocking scraping attempts somewhat more actively, and worldfootballR would need an update that sets request headers to cope with those changes. However, the package hasn't been updated on CRAN since 2022-11-26, and a couple of months ago the devs also decided to archive the JaseZiv/worldfootballR repository. Some new forks have popped up, so you could try those: Forks · JaseZiv/worldfootballR · GitHub.

Or you could clone or fork it yourself and update .load_page() at internals.R#L436.

As a quick and hacky proof of concept, you could install it from JaseZiv/worldfootballR as-is (e.g. pak::pak("JaseZiv/worldfootballR")) and patch .load_page() for the current session.

Installed version:

# pak::pak("JaseZiv/worldfootballR")
pak::pkg_status("worldfootballR")[,c(2, 3, 11:13)]
#> # A data frame: 1 × 5
#>   package        version    remotetype remotepkgref           remoteref
#> * <chr>          <chr>      <chr>      <chr>                  <chr>    
#> 1 worldfootballR 0.6.8.0001 github     JaseZiv/worldfootballR HEAD

Verify it fails:

library(worldfootballR)
httr::with_verbose(
  fb_teams_urls("https://fbref.com/en/comps/9/Premier-League-Stats")
)
#> Warning in session_set_response(x, resp): Forbidden (HTTP 403).
#> Error in read_html.response(x$response, ..., base_url = x$url): Forbidden (HTTP 403).
-> GET /en/comps/9/Premier-League-Stats HTTP/2
-> Host: fbref.com
-> User-Agent: RStudio Desktop (2022.7.1.554); R (4.5.1 x86_64-w64-mingw32 x86_64 x86_64)
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> 
<- HTTP/2 403 
...

Check the current .load_page():

worldfootballR:::.load_page
#> function (page_url) 
#> {
#>     agent <- getOption("worldfootballR.agent", default = "RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)")
#>     ua <- httr::user_agent(agent)
#>     session <- rvest::session(url = page_url, ua)
#>     xml2::read_html(session)
#> }
#> <bytecode: 0x0000027186434660>
#> <environment: namespace:worldfootballR>

Can we work around this by just setting the worldfootballR.agent option? Apparently not, the request still gets a 403:

withr::with_options(
  list(worldfootballR.agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"),
  httr::with_verbose(
    fb_teams_urls("https://fbref.com/en/comps/9/Premier-League-Stats")
  )
)
#> Warning in session_set_response(x, resp): Forbidden (HTTP 403).
#> Error in read_html.response(x$response, ..., base_url = x$url): Forbidden (HTTP 403).
-> GET /en/comps/9/Premier-League-Stats HTTP/2
-> Host: fbref.com
-> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> 
<- HTTP/2 403 
...

Patch it to send a Chrome user agent together with sec-ch-ua and cache-control headers, then test again:

assignInNamespace(
  x = ".load_page", 
  ns = "worldfootballR",
  value = function (page_url) 
  {
    headers <- httr::add_headers(
      `cache-control` = "no-cache",
      `sec-ch-ua` =  '"Chromium";v="142", "Google Chrome";v="142", "Not_A Brand";v="99"',
      `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
    )
    session <- rvest::session(url = page_url, headers)
    xml2::read_html(session)
  }
)

httr::with_verbose(
  fb_teams_urls("https://fbref.com/en/comps/9/Premier-League-Stats")
)
#>  [1] "https://fbref.com/en/squads/18bb7c10/Arsenal-Stats"                 
#>  [2] "https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats"         
#>  [3] "https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats"                 
# ...
-> GET /en/comps/9/Premier-League-Stats HTTP/2
-> Host: fbref.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> cache-control: no-cache
-> sec-ch-ua: "Chromium";v="142", "Google Chrome";v="142", "Not_A Brand";v="99"
-> user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
-> 
<- HTTP/2 200 
...
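If you keep this patch around for a while, it may help to build the headers in one place, so that the Chrome version in sec-ch-ua and in the user agent stay in sync, and to keep a copy of the original function so the patch can be undone. A rough sketch of that idea follows; `chrome_headers` and `original_load_page` are hypothetical names introduced here, and there is no guarantee that these particular headers will keep working against the site.

```r
# Hypothetical helper (not part of worldfootballR): build browser-like request
# headers for a given Chrome major version, so both strings stay consistent.
chrome_headers <- function(major = 142L) {
  c(
    `cache-control` = "no-cache",
    `sec-ch-ua` = sprintf(
      '"Chromium";v="%d", "Google Chrome";v="%d", "Not_A Brand";v="99"',
      major, major
    ),
    `user-agent` = sprintf(
      paste0(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ",
        "(KHTML, like Gecko) Chrome/%d.0.0.0 Safari/537.36"
      ),
      major
    )
  )
}

# Apply the patch only if all packages involved are installed.
pkgs <- c("worldfootballR", "httr", "rvest", "xml2")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  # Keep the original so the patch can be reverted within the session.
  original_load_page <- utils::getFromNamespace(".load_page", "worldfootballR")

  utils::assignInNamespace(
    x = ".load_page",
    ns = "worldfootballR",
    value = function(page_url) {
      # Same request as in the patch above, with headers built by the helper.
      headers <- httr::add_headers(.headers = chrome_headers())
      session <- rvest::session(url = page_url, headers)
      xml2::read_html(session)
    }
  )
}
```

To revert within the same session, `utils::assignInNamespace(".load_page", original_load_page, ns = "worldfootballR")` should put the original function back. Either way the patch only lasts until R is restarted.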