Target sites have become (somewhat) more aggressive about blocking scraping attempts, and worldfootballR would need an update to set request headers that cope with those changes. However, the package hasn't been updated on CRAN since 2022-11-26, and a couple of months ago the developers also decided to archive the JaseZiv/worldfootballR repository. Some new forks have popped up, so you could try those: Forks · JaseZiv/worldfootballR · GitHub.
Or you could clone / fork it yourself and update .load_page() at internals.R#L436.
As a quick & hacky proof of concept, you could install it from JaseZiv/worldfootballR as-is (e.g. pak::pak("JaseZiv/worldfootballR")) and patch .load_page() for the current session only.
Installed version:
# pak::pak("JaseZiv/worldfootballR")
pak::pkg_status("worldfootballR")[,c(2, 3, 11:13)]
#> # A data frame: 1 × 5
#> package version remotetype remotepkgref remoteref
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 worldfootballR 0.6.8.0001 github JaseZiv/worldfootballR HEAD
Verify it fails:
library(worldfootballR)
httr::with_verbose(
  fb_teams_urls("https://fbref.com/en/comps/9/Premier-League-Stats")
)
#> Warning in session_set_response(x, resp): Forbidden (HTTP 403).
#> Error in read_html.response(x$response, ..., base_url = x$url): Forbidden (HTTP 403).
-> GET /en/comps/9/Premier-League-Stats HTTP/2
-> Host: fbref.com
-> User-Agent: RStudio Desktop (2022.7.1.554); R (4.5.1 x86_64-w64-mingw32 x86_64 x86_64)
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
->
<- HTTP/2 403
...
Check the current .load_page():
worldfootballR:::.load_page
#> function (page_url)
#> {
#> agent <- getOption("worldfootballR.agent", default = "RStudio Desktop (2022.7.1.554); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)")
#> ua <- httr::user_agent(agent)
#> session <- rvest::session(url = page_url, ua)
#> xml2::read_html(session)
#> }
#> <bytecode: 0x0000027186434660>
#> <environment: namespace:worldfootballR>
Can we work around this by just setting the worldfootballR.agent option? No, a browser-like user agent alone still gets a 403:
withr::with_options(
  list(worldfootballR.agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"),
  httr::with_verbose(
    fb_teams_urls("https://fbref.com/en/comps/9/Premier-League-Stats")
  )
)
#> Warning in session_set_response(x, resp): Forbidden (HTTP 403).
#> Error in read_html.response(x$response, ..., base_url = x$url): Forbidden (HTTP 403).
-> GET /en/comps/9/Premier-League-Stats HTTP/2
-> Host: fbref.com
-> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
->
<- HTTP/2 403
...
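To narrow down which headers matter without going through the package at all, the same request can be probed directly with httr. This is only a rough sketch: probe_status() and the candidates list are hypothetical names introduced here, and the status codes you get back depend entirely on the site's blocking rules at the time of the request.

```r
library(httr)

# Hypothetical helper: fetch `url` with a named character vector of request
# headers and return only the HTTP status code.
probe_status <- function(url, headers = character()) {
  resp <- httr::GET(url, httr::add_headers(.headers = headers))
  httr::status_code(resp)
}

url <- "https://fbref.com/en/comps/9/Premier-League-Stats"
chrome_ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"

# Candidate header sets, from least to most browser-like
candidates <- list(
  ua_only = c(`user-agent` = chrome_ua),
  ua_and_hints = c(
    `cache-control` = "no-cache",
    `sec-ch-ua` = '"Chromium";v="142", "Google Chrome";v="142", "Not_A Brand";v="99"',
    `user-agent` = chrome_ua
  )
)

# Only hit the network when run interactively; one status code per header set
if (interactive()) {
  print(sapply(candidates, function(h) probe_status(url, h)))
}
```

Keep the number of probe requests small; repeated rapid requests are likely to make blocking worse rather than better.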
Patch it to send a Chrome user agent plus sec-ch-ua and cache-control headers, then test again:
assignInNamespace(
  x = ".load_page",
  ns = "worldfootballR",
  value = function(page_url) {
    headers <- httr::add_headers(
      `cache-control` = "no-cache",
      `sec-ch-ua` = '"Chromium";v="142", "Google Chrome";v="142", "Not_A Brand";v="99"',
      `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
    )
    session <- rvest::session(url = page_url, headers)
    xml2::read_html(session)
  }
)
httr::with_verbose(
  fb_teams_urls("https://fbref.com/en/comps/9/Premier-League-Stats")
)
#> [1] "https://fbref.com/en/squads/18bb7c10/Arsenal-Stats"
#> [2] "https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats"
#> [3] "https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats"
# ...
-> GET /en/comps/9/Premier-League-Stats HTTP/2
-> Host: fbref.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> cache-control: no-cache
-> sec-ch-ua: "Chromium";v="142", "Google Chrome";v="142", "Not_A Brand";v="99"
-> user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
->
<- HTTP/2 200
...
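Because assignInNamespace() replaces the function for the rest of the session, it can be handy to scope the patch to a single call and put the original back afterwards. Below is a minimal sketch of that idea, assuming the same three headers still get a 200 from the site; patched_load_page() and with_patched_load_page() are hypothetical names, not part of worldfootballR.

```r
# Replacement for worldfootballR:::.load_page, using the same headers as the patch above
patched_load_page <- function(page_url) {
  headers <- httr::add_headers(
    `cache-control` = "no-cache",
    `sec-ch-ua` = '"Chromium";v="142", "Google Chrome";v="142", "Not_A Brand";v="99"',
    `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
  )
  session <- rvest::session(url = page_url, headers)
  xml2::read_html(session)
}

# Hypothetical wrapper: evaluate `code` with the patch applied, then restore
# whatever .load_page was before, even if `code` throws an error.
with_patched_load_page <- function(code) {
  original <- get(".load_page", envir = asNamespace("worldfootballR"))
  on.exit(
    utils::assignInNamespace(".load_page", original, ns = "worldfootballR"),
    add = TRUE
  )
  utils::assignInNamespace(".load_page", patched_load_page, ns = "worldfootballR")
  force(code)  # `code` is lazy, so it is evaluated only now, after patching
}

# Usage (needs worldfootballR installed and network access)
if (interactive() && requireNamespace("worldfootballR", quietly = TRUE)) {
  teams <- with_patched_load_page(
    worldfootballR::fb_teams_urls("https://fbref.com/en/comps/9/Premier-League-Stats")
  )
  print(head(teams, 3))
}
```

This is still a session-level hack; for anything long-lived, a fork with the updated .load_page() is the cleaner route.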