Using dplyr and ggplot2 for Sports Data Analysis

Hi everyone 👋

I’ve been working on some small sports analytics projects using the tidyverse, especially dplyr for wrangling and ggplot2 for visualization. For example, here’s a toy dataset where I look at player efficiency in basketball:

library(dplyr)
library(ggplot2)

set.seed(123)
players <- data.frame(
  player = paste("Player", 1:10),
  points = round(rnorm(10, mean = 15, sd = 5)),
  assists = round(rnorm(10, mean = 5, sd = 2)),
  rebounds = round(rnorm(10, mean = 7, sd = 3)),
  minutes = round(runif(10, 20, 40))
)

players <- players %>%
  mutate(points_per_min = points / minutes)

ggplot(players, aes(x = minutes, y = points_per_min, label = player)) +
  geom_point(color = "blue", size = 3) +
  geom_text(vjust = -0.8, size = 3) +
  theme_minimal()

This kind of workflow works nicely for quick insights — but I’m curious: what other tidyverse functions/packages would you recommend for sports data analysis?

If anyone is interested in exploring further, I’ve also been writing about R and analytics here: rprogrammingbooks.com

Thanks in advance for your suggestions!

It really depends on what you are doing. I'd suggest searching the internet for a couple of decent summary documents that give an overview of the Tidyverse packages.

One package I find invaluable is {lubridate}. It has made dealing with dates so much easier than dates in base R.
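As a minimal sketch of what that looks like for a game schedule (the `games` data frame and its column names are made up for illustration):

```r
library(lubridate)

# Hypothetical schedule; dates arrive as text
games <- data.frame(
  opponent = c("Hawks", "Bulls", "Heat"),
  date_raw = c("2024-01-05", "2024-01-09", "2024-02-02")
)

games$date  <- ymd(games$date_raw)      # parse year-month-day strings into Dates
games$month <- month(games$date)        # numeric month of each game

# Days of rest since the previous game (NA for the first game)
games$rest_days <- c(NA, as.numeric(diff(games$date)))
games
```

From there, filtering to back-to-back games or summarising by month is ordinary dplyr or base R work.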

That said, I am a bit of a heretic here and am not at all fond of {dplyr}. It does things very well but it is far too verbose. I recommend having a look at {data.table}. The code is rather terse and at times not quite as easy to read as dplyr code, but it is much more convenient to write. It is also faster if you are working with very large data sets.

For example:

dplyr code

players <- players %>% mutate(points_per_min = points / minutes)

data.table code

players[, points_per_min := points / minutes]
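Note that the `:=` line only works once `players` is a data.table rather than a plain data.frame. A minimal runnable sketch, using a cut-down, made-up version of the toy data from the first post:

```r
library(data.table)

# Small stand-in for the toy data (values are invented)
players <- data.frame(
  player  = paste("Player", 1:3),
  points  = c(18, 12, 25),
  minutes = c(30, 24, 36)
)

setDT(players)                                  # convert to data.table in place
players[, points_per_min := points / minutes]   # add the column by reference

# Grouped summaries use the same bracket syntax
players[, .(avg_ppm = mean(points_per_min)), by = .(starter = minutes >= 30)]
```

Because `:=` modifies the table by reference, there is no need to assign the result back to `players`.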

Here are some comparisons between {dplyr} and {data.table}: A data.table and dplyr tour.

A word of warning if you try {data.table} and {tidyverse} together: always load {data.table} before {tidyverse}. Otherwise, name conflicts totally mess up {lubridate}, because both packages export date helpers with the same names.
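If the load order is outside your control, one workaround is to call the clashing functions with an explicit namespace. This is only a sketch, assuming both packages are installed; both export functions such as `month()`, `wday()` and `year()`, and whichever package is attached last wins:

```r
library(lubridate)
library(data.table)   # attached last here, so its month() masks lubridate's

d <- as.Date("2024-03-15")

# Explicit namespaces make the choice unambiguous regardless of load order
m_dt <- data.table::month(d)                # plain month number
m_lb <- lubridate::month(d, label = TRUE)   # ordered factor of month names

m_dt
m_lb
```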

You really do not want to limit yourself to {tidyverse} packages. Depending on what you are doing, there are all sorts of packages available that can help. For example, {janitor}, which calls itself "Simple Tools for Examining and Cleaning Dirty Data", can be invaluable.
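A quick sketch of the kind of cleanup {janitor} handles, using an invented export with awkward column names (the `raw` data frame is hypothetical):

```r
library(janitor)

# Hypothetical export with spaces, capitals and punctuation in the headers
raw <- data.frame(
  `Player Name`   = c("A. Smith", "B. Jones", "C. Lee"),
  `Points Scored` = c(18, 12, 25),
  `MIN.`          = c(30, 24, 36),
  check.names = FALSE
)

clean <- clean_names(raw)   # snake_case names that are easy to type
names(clean)

# Quick frequency table of a derived category
clean$role <- ifelse(clean$min >= 30, "starter", "bench")
tabyl(clean, role)
```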

If you need tables, there are any number available. Off the top of my head, {tinytable}, {flextable}, {gt} and {kableExtra} are handy.
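As one illustration, here is a minimal {gt} sketch; the data and the title are made up, and the other packages listed would do the job equally well:

```r
library(gt)

# Small invented data set in the shape of the toy example
players <- data.frame(
  player  = paste("Player", 1:3),
  points  = c(18, 12, 25),
  minutes = c(30, 24, 36)
)
players$points_per_min <- players$points / players$minutes

tbl <- gt(players) |>
  tab_header(title = "Scoring efficiency") |>
  fmt_number(columns = points_per_min, decimals = 2) |>
  cols_label(points_per_min = "Points per minute")

tbl   # renders in the viewer or in a knitted document
```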

It really is a case of picking the right tool for the task, though with well over 20,000 packages at the moment, finding the right one can be a bit of a problem. A couple of more-or-less utility packages that I am becoming fond of are {here} and {numform}.

You might want to think about using linear optimisation, e.g. choosing a squad subject to constraints such as cost. There is an article out there on doing that for Fantasy Football (soccer): Mathematically Optimising Your Fantasy Football Team | penaltyblog
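To give a flavour of the idea in R, here is a small sketch using {lpSolve}. It is not taken from the linked article; the player pool, costs, budget and squad size are all invented for illustration:

```r
library(lpSolve)

# Made-up player pool: projected points and cost
pool <- data.frame(
  player = c("A", "B", "C", "D", "E", "F"),
  points = c(10, 8, 7, 6, 5, 4),
  cost   = c(9, 7, 6, 5, 4, 3)
)
budget     <- 18   # hypothetical spending cap
squad_size <- 3    # hypothetical number of players to pick

# One binary decision variable per player: 1 = selected, 0 = not.
# Row 1 of the constraint matrix is total cost, row 2 is the head count.
constraints <- rbind(pool$cost, rep(1, nrow(pool)))

fit <- lp(
  direction    = "max",
  objective.in = pool$points,
  const.mat    = constraints,
  const.dir    = c("<=", "=="),
  const.rhs    = c(budget, squad_size),
  all.bin      = TRUE
)

fit$status                    # 0 means an optimal solution was found
pool[fit$solution > 0.5, ]    # the selected squad
fit$objval                    # total projected points of that squad
```

Real fantasy problems add more rows to the constraint matrix (position quotas, a maximum number of players per club), but the structure stays the same.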
