Which nodes do I need to select with html_nodes() to get only the data I want?

Hello everyone, I hope you are having a nice day.

I need some help. I decided to start learning to scrape data with rvest instead of collecting it manually, and I ran into a problem. I want to scrape the Club/Team name and Pts from https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1?saison_id=2021

The problems are:

  1. team_elements <- html_nodes(webpage, ".hauptlink > a")

     This code scrapes not only the team names but also the information pictures shown for Man City, Chelsea, Leicester, Brentford, Watford, and Norwich. How can I keep only the team names after scraping?

  2. pts_elements <- html_nodes(webpage, ".zentriert")

     This code scrapes not only Pts but also W, D, L, and Goals. How can I keep only the Pts values after scraping?

Thank you for the help! The full code is attached below.

# Load rvest, which provides read_html(), html_nodes() and html_text()
library(rvest)

# Define the URL of the website
url <- "https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1?saison_id=2021"

# Read the HTML content of the webpage
webpage <- read_html(url)

# Extract Club names
team_elements <- html_nodes(webpage, ".hauptlink > a")
team <- html_text(team_elements)

# Extract Points (Pts)
pts_elements <- html_nodes(webpage, ".zentriert")
pts <- as.numeric(html_text(pts_elements))

# Combine into a data frame
premier_league_data <- data.frame(Team = team, Pts = pts)

# Print it
print(premier_league_data)

I would take a more direct approach: grab the table of interest, convert it to a data.frame, and work from there.

library(tidyverse)
library(rvest)
url <- "https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1?saison_id=2021"

# Read the HTML content of the webpage
webpage <- read_html(url)

# Extract the whole league table (the element with class "items")
thetable <- html_node(webpage, ".items")
table_as_df <- rvest::html_table(thetable)
# Fix name problems: make the column names syntactic and unique (a repeated "Club" becomes "Club.1")
names(table_as_df) <- make.names(names(table_as_df), unique = TRUE)

table_as_df |> select(
  Club = Club.1,
  Pts
)
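
As a side note on where Club.1 comes from, here is a minimal offline sketch of the same idea. The HTML, team names, and points below are made-up placeholders, not the real Transfermarkt markup. It assumes the real table has two header cells both named Club, which is what the Club.1 name above suggests. It uses base subsetting instead of select() so that only rvest is needed.

```r
library(rvest)

# Hypothetical stand-in for the league table: the "Club" header appears twice
# (once for the crest column, once for the name column).
page <- read_html('
  <table class="items">
    <thead>
      <tr><th>#</th><th>Club</th><th>Club</th><th>Pts</th></tr>
    </thead>
    <tbody>
      <tr><td>1</td><td></td><td>Team A</td><td>93</td></tr>
      <tr><td>2</td><td></td><td>Team B</td><td>92</td></tr>
    </tbody>
  </table>
')

# Select the table by its class and convert it to a data frame
tbl <- html_table(html_node(page, ".items"))

# Duplicate and non-syntactic names are repaired; the second "Club" becomes "Club.1"
names(tbl) <- make.names(names(tbl), unique = TRUE)

# Keep only the club name and points columns, renaming Club.1 to Club
result <- tbl[, c("Club.1", "Pts")]
names(result) <- c("Club", "Pts")
print(result)
```

Because the whole table is parsed at once, there is no need to filter the ".hauptlink > a" or ".zentriert" matches afterwards: the unwanted cells simply end up in other columns that are dropped.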

Thank you for the help, I really appreciate it :pray:
