Scraping data from player profiles of various lengths

Hey,

I'm very new to scraping data with R, and this problem seems to be very tricky. Here it is:

I'd like to scrape player data from this German football manager website (example profile: Kahn - Comunio Statistiken).

The specific data I'd like to collect is marked yellow in this screenshot:

The problem is that the number of "Saison XX" and "Pkt. xx" entries differs between profiles, depending on which and how many seasons a player played in the Bundesliga. One profile may have only a single season-and-points entry, while others have many, like this one: Gnabry - Comunio Statistiken.

Ideally, I would like to get a data frame that looks like this (first example):

Name Position Season Points
Kahn GK 2007/08 76
Kahn GK 2006/07 106
and then the next profile.

So in the loop, the name and position have to stay constant as long as there are more seasons and points to collect. Then the next player profile should be collected.

I've tried multiple things. First, I tried to use the html_text function to extract text from the specific nodes, but since the number of nodes differs between profiles, I was only able to get the first entry (in this example, season 2007/08) of every player profile.

library(dplyr)
library(rvest)

playerinf = data.frame()

for (page_result in seq(from = 1, to = 1000, by = 1)) {
  link = paste0("https://stats.comunio.de/profile?id=", page_result)
  code = read_html(link)
  Name = code %>% html_nodes("#content .bold") %>% html_text()
  Season = code %>% html_nodes(".nopadding:nth-child(1) tr:nth-child(2) td:nth-child(1)") %>% html_text()
  Position = code %>% html_nodes("td:nth-child(1) tr:nth-child(3) .left+ td") %>% html_text()
  Points = code %>% html_nodes(".nopadding:nth-child(1) tr:nth-child(2) td+ td") %>% html_text()

  playerinf = rbind(playerinf, data.frame(
    Name = ifelse(length(Name) == 0, NA, Name),
    Season = ifelse(length(Season) == 0, NA, Season),
    Position = ifelse(length(Position) == 0, NA, Position),
    Points = ifelse(length(Points) == 0, NA, Points)))

  write.csv(playerinf, "PlayerInfomartionComStat.csv")
}
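
One thing that may contribute to getting only the first entry per profile: ifelse() returns a result with the same length as its test, and length(Season) == 0 is a single value, so only the first element of Season is kept even when several nodes are matched. (The selectors also target tr:nth-child(2), which probably matches just one row.) The sketch below uses made-up values in place of the html_text() output to show the difference between ifelse() and a plain if/else:

```r
# Hypothetical values standing in for what html_text() could return
# for a profile with two seasons (no network access needed).
Season <- c("2007/08", "2006/07")

# ifelse() is vectorised over its test. The test has length 1 here,
# so the result has length 1 and only the first season survives.
first_only <- ifelse(length(Season) == 0, NA, Season)

# A plain if/else returns the whole vector unchanged.
all_seasons <- if (length(Season) == 0) NA else Season

print(first_only)
print(all_seasons)
```

So if the aim is to keep every season, an if/else (or no guard at all) is the safer choice than ifelse() in this spot.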

My second idea was to scrape the table containing seasons and points (its node is described the same way in every player profile). I got this information, but I failed to combine it with the name and position to get the desired form (name and position in every row alongside the corresponding season and points).

How can I scrape the data in the desired form? If you have any ideas, please let me know.

Thanks in advance!

Hello,

I just wrote the following code for you and it works well. I wrote a function for scraping one player profile, then I applied this function to the first 10 player profiles.

# Load packages ----

pacman::p_load(
  dplyr,
  purrr,
  rvest
)

# Function for scraping a single player's data ----
# url: URL of the player
# e.g: url <- "https://stats.comunio.de/profile?id=1"


scrape_data <- function(url){
  
  html <- read_html(url)
  
  all_data <- html %>%
    html_elements(css = "table") %>%
    html_table()
  
  name <- all_data[[5]][[2]][2]
  position <- all_data[[5]][[2]][4]
  season <- all_data[[9]][[1]]
  points <- all_data[[9]][[2]]
  
  data.frame(
    name = name,
    position = position,
    season = season,
    points = points
  )
}

scrape_data2 <- possibly(.f = scrape_data, otherwise = NA)

# ACTUAL SCRAPING (first 10 profiles) ----

urls <- paste0("https://stats.comunio.de/profile?id=", 1:10)

final_data <- map_dfr(urls, scrape_data2)

           name position  season points
1          Kahn  Torwart 2007/08     76
2          Kahn  Torwart 2006/07    106
3          Butt  Torwart 2011/12      4
4          Butt  Torwart 2010/11     62
5          Butt  Torwart 2009/10    106
6          Butt  Torwart 2008/09     16
7          Butt  Torwart 2006/07     54
8       Lehmann  Torwart 2009/10     80
9       Lehmann  Torwart 2008/09     90
10      F. Rost  Torwart 2010/11     86
...
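
For anyone wondering why name and position repeat on every row: html_table() returns one data frame per table, and data.frame() recycles length-one values across the longer columns. Here is a small offline sketch of that idea. The two-table HTML is made up for illustration and is much simpler than the real profile page, so the indices differ from the [[5]] and [[9]] used above.

```r
library(rvest)

# Hypothetical miniature page: one table with player details,
# one table with a row per season. Not the real site layout.
page <- minimal_html('
  <table>
    <tr><td>Name</td><td>Kahn</td></tr>
    <tr><td>Position</td><td>Torwart</td></tr>
  </table>
  <table>
    <tr><td>2007/08</td><td>76</td></tr>
    <tr><td>2006/07</td><td>106</td></tr>
  </table>
')

# One data frame per <table>, in document order.
tables <- page %>%
  html_elements("table") %>%
  html_table()

# name and position are single values; season and points are vectors.
# data.frame() recycles the single values to match the longer columns.
player <- data.frame(
  name     = tables[[1]][[2]][1],
  position = tables[[1]][[2]][2],
  season   = tables[[2]][[1]],
  points   = tables[[2]][[2]]
)

print(player)
```

Note that picking tables by position is fragile: if the site changes its layout, the indices would need to be adjusted.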

Thank you for the response! Much appreciated! One problem, which I forgot to mention in my first post: there are profiles that are completely empty and show the error message "Player not found". Sorry for not mentioning it. Unfortunately, there are large, irregular gaps of empty or missing profiles as the URL id numbers go up.

If you go up to profile id 20, you will see an error message when running your code. I think the reason is that these player profiles are empty. How can I fix this?

Besides that, a big thank you. That helped a lot!

I solved that last problem myself: I just replaced the NA with data.frame() in the otherwise argument of the possibly function.

# Load packages ----
library(purrr)
library(rvest)
library(dplyr)

# Function for scraping a single player's data ----
# url: URL of the player
# e.g: url <- "https://stats.comunio.de/profile?id=1"


scrape_data <- function(url){
  
  html <- read_html(url)
  
  all_data <- html %>%
    html_elements(css = "table") %>%
    html_table()
  
  name <- all_data[[5]][[2]][2]
  position <- all_data[[5]][[2]][4]
  season <- all_data[[9]][[1]]
  points <- all_data[[9]][[2]]
  
  data.frame(
    name = name,
    position = position,
    season = season,
    points = points
  )
}

scrape_data2 <- possibly(.f = scrape_data, otherwise = data.frame())

# ACTUAL SCRAPING (profile ids 30498 to 40000) ----

urls <- paste0("https://stats.comunio.de/profile?id=", 30498:40000)

final_data <- map_dfr(urls, scrape_data2)

write.csv(final_data, "PlayerInfomartionComStat.csv")  
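
As far as I understand it, the reason this works is that an empty data frame can be row-bound with the real results and simply adds no rows, whereas a bare NA is not something the row-binding step knows how to combine. Below is a small offline sketch of the same pattern. fake_scrape is a hypothetical stand-in for scrape_data that fails for some ids, and I use bind_rows(map(...)) here in place of map_dfr only to keep the example independent of the purrr version.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for scrape_data(): ids 2 and 4 play the role
# of the empty "Player not found" profiles and raise an error.
fake_scrape <- function(id) {
  if (id %in% c(2, 4)) stop("Player not found")
  data.frame(name = paste0("player", id), points = id * 10)
}

# On error, return an empty data frame instead of failing.
safe_scrape <- possibly(fake_scrape, otherwise = data.frame())

# Empty data frames contribute no rows, so the gaps just disappear.
result <- bind_rows(map(1:5, safe_scrape))

print(result)
```

Only ids 1, 3 and 5 end up in result; the failing ids are skipped silently, so it can be worth logging which ids returned nothing if that matters later.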


Thanks again to Mr. gueyenono!
