Which nodes should I select with html_nodes() to get only the data I want?

fhat · July 21, 2023, 5:17pm

Hello everyone, I hope you are having a nice day.

I need some help. I decided to start learning to scrape data with rvest instead of collecting it manually, and I ran into a problem. I want to scrape the club/team name and Pts from https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1?saison_id=2021

The problem is:

team_elements <- html_nodes(webpage, ".hauptlink > a")

This code scrapes not only the team names but also the information pictures next to Man City, Chelsea, Leicester, Brentford, Watford, and Norwich. How do I keep only the names after scraping?

pts_elements <- html_nodes(webpage, ".zentriert")
That code scrapes not only Pts but also W, D, L, and Goals. How do I keep only Pts after scraping?

Thank you for the help! I have attached the full code below.

# Load rvest for read_html(), html_nodes(), and html_text()
library(rvest)

# Define the URL of the website
url <- "https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1?saison_id=2021"

# Read the HTML content of the webpage
webpage <- read_html(url)

# Extract Club names
team_elements <- html_nodes(webpage, ".hauptlink > a")
team <- html_text(team_elements)

# Extract Points (Pts)
pts_elements <- html_nodes(webpage, ".zentriert")
pts <- as.numeric(html_text(pts_elements))

# Combine into a data frame
premier_league_data <- data.frame(Team = team, Pts = pts)

# Print it
print(premier_league_data)

nirgrahamuk · July 21, 2023, 5:53pm

I would go more direct.
Grab the table of interest and get it as a data.frame and work from there.

library(tidyverse)
library(rvest)
url <- "https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1?saison_id=2021"

# Read the HTML content of the webpage
webpage <- read_html(url)

# Extract the whole league table (the element with class "items")
thetable <- html_node(webpage, ".items")
table_as_df <- rvest::html_table(thetable)
# Fix name problems: make the column names syntactically valid and unique
names(table_as_df) <- make.names(names(table_as_df), unique = TRUE)

table_as_df |> select(
  Club = Club.1,
  Pts
)
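
If you would rather stay with CSS selectors, one option is to select the table rows first and then pick a single element per row. The sketch below runs on a small inline table, not on the real page: the markup, the class names on each cell, and the team names and numbers are simplified assumptions for illustration, so the selectors may need adjusting against the actual HTML. The idea is that html_element() (singular) returns one match per row, which keeps teams and points aligned, whereas ".zentriert" on the whole page matches every centered cell.

```r
library(rvest)

# Hypothetical, simplified stand-in for the league table (NOT the real page markup)
page <- minimal_html('
<table class="items">
  <thead>
    <tr><th>#</th><th colspan="2">Club</th><th>W</th><th>Pts</th></tr>
  </thead>
  <tbody>
    <tr>
      <td class="rechts hauptlink">1</td>
      <td class="zentriert"><a href="/a"><img alt="Team A"></a></td>
      <td class="hauptlink"><a href="/a">Team A</a> <a href="/info"><img alt="info"></a></td>
      <td class="zentriert">29</td>
      <td class="zentriert">93</td>
    </tr>
    <tr>
      <td class="rechts hauptlink">2</td>
      <td class="zentriert"><a href="/b"><img alt="Team B"></a></td>
      <td class="hauptlink"><a href="/b">Team B</a></td>
      <td class="zentriert">28</td>
      <td class="zentriert">92</td>
    </tr>
  </tbody>
</table>')

# One node per table row
rows <- html_elements(page, "table.items tbody tr")

# First link inside a .hauptlink cell of each row: the club name link,
# which comes before any extra picture link in the same cell
team <- rows |>
  html_element("td.hauptlink a") |>
  html_text(trim = TRUE)

# Last cell of each row: assumed here to be the Pts column
pts <- rows |>
  html_element("td:last-child") |>
  html_text(trim = TRUE) |>
  as.integer()

premier_league_data <- data.frame(Team = team, Pts = pts)
print(premier_league_data)
```

Because both vectors come from the same set of rows, they always have the same length, so data.frame() does not fail on mismatched column lengths.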

fhat · July 23, 2023, 4:34pm

Thank you for the help, I really appreciate it.
