Seeking a nudge in the right direction here.
I'm a novice R programmer who's leaped into it and the tidyverse with gusto. I've created a project for myself, the goal of which is to create a system that allows for a comparison of results for archers across a variety of archery disciplines. To that end, I've assembled a sample of approximately 11,000 individual scores from the last three years of some major archery competitions in the U.S. The dataset can be downloaded from my web site here: 2016-2018_all_scores.csv
Once downloaded, the following code loads the dataset and sets the appropriate column names:
```r
library(readr)

results <- read_csv("2016-2018_all_scores.csv",
                    col_names = c("year", "event", "class", "division",
                                  "gender", "org", "round", "score"),
                    col_types = "fffffffd")
```
The columns of most interest are `class`, `round`, and `score`. I need to analyze the scores for each different archery round and equipment class in the dataset. There are 17 unique archery rounds and 2 equipment classes, for a total of 34 separate analyses. For each of the 34 different combinations I need to:
- Remove the scores below the 10th percentile and above the 90th percentile.
- Calculate a CDF for the remaining values
- Fit a linear or exponential model to the CDF (depending on how the curve looks)
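In isolation, those three steps look something like the sketch below for a single round/class combination. The scores here are made up (the real ones come from the CSV), and the linear fit is just one of the two options; an exponential fit via `nls()` could be swapped in where the curve calls for it.

```r
# Toy stand-in for one round/class combination's scores
set.seed(42)
scores <- rnorm(500, mean = 270, sd = 15)

# Step 1: drop scores below the 10th and above the 90th percentile
bounds  <- quantile(scores, probs = c(0.10, 0.90))
trimmed <- scores[scores >= bounds[1] & scores <= bounds[2]]

# Step 2: empirical CDF of the remaining values
s <- sort(trimmed)
p <- ecdf(trimmed)(s)

# Step 3: fit a linear model to the CDF points
cdf_points <- data.frame(score = s, p = p)
fit <- lm(p ~ score, data = cdf_points)
summary(fit)$r.squared
```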
I've described this process in more detail in a short article, The Performance Method: An Improved Archer Ranking System for Determining a “Shooter of the Year”. (All created in an R Notebook. What an amazing tool!)
The big question
My first instinct was to use dplyr's `filter` and `mutate` to slice and dice myself 34 data frames out of the original dataset, which would be further processed to arrive at each individual model.
Realization #1: I hate repeating myself
The thought of typing out all of the piped commands to create those 34 data frames makes even my novice R programmer skin crawl. There has to be a better way.
Realization #2: apply functions, including `purrr`, are cool
I can see in principle how useful it would be to apply the functional programming approach to this problem. Is it possible to programmatically create those 34 data frames by creating a function to `filter` the dataset? Here's a function that does the job.
```r
# Filter the dataset to get results for a specific round and equipment class.
# The divisor parameter converts a two-day score to a one-day score (e.g., divisor = 2).
filter_results <- function(results, archery_round, equipment_class, divisor = 1) {
  results %>%
    filter(round == archery_round,
           class == equipment_class) %>%
    mutate(score = score / divisor)
}
```
But how do I apply that function to create 34 new data frames that don't already exist? I'm stuck.
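From skimming the purrr docs, I suspect something like `expand.grid()` plus `purrr::pmap()` could call that function once per round/class combination and collect the results in a named list instead of 34 loose variables, but I'm not sure this is idiomatic. A sketch with toy data (the round and class names here are hypothetical):

```r
library(dplyr)
library(purrr)

# Toy stand-in for the real dataset
results <- tibble(
  round = rep(c("Vegas", "Field"), each = 4),
  class = rep(c("Compound", "Recurve"), times = 4),
  score = c(600, 580, 540, 555, 610, 590, 530, 545)
)

filter_results <- function(results, archery_round, equipment_class, divisor = 1) {
  results %>%
    filter(round == archery_round,
           class == equipment_class) %>%
    mutate(score = score / divisor)
}

# One row per round/class combination; column names match the
# function's argument names so pmap() can match them up
combos <- expand.grid(archery_round   = unique(results$round),
                      equipment_class = unique(results$class),
                      stringsAsFactors = FALSE)

frames <- pmap(combos, filter_results, results = results)
names(frames) <- paste(combos$archery_round, combos$equipment_class, sep = "_")
length(frames)  # 4 here; 34 with the full dataset
```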
Realization #3: Do I need 34 data frames after all?
Can I filter, trim, create the CDFs and build those models without creating separate data frames? WWHD? (What Would Hadley Do?)
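My best guess at the Hadley-style answer is a nested data frame: `group_by()` + `nest()` gives one row per round/class combination, and `purrr::map()` over the list-column does the trimming and model fitting without any standalone data frames. A sketch with toy data (values hypothetical), using the linear-fit branch only:

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
results <- tibble(
  round = sample(c("Vegas", "Field", "Indoor"), 300, replace = TRUE),
  class = sample(c("Compound", "Recurve"), 300, replace = TRUE),
  score = rnorm(300, mean = 270, sd = 15)
)

models <- results %>%
  group_by(round, class) %>%
  nest() %>%                      # one row per round/class combination
  mutate(
    trimmed = map(data, ~ filter(.x,
                                 score >= quantile(score, 0.10),
                                 score <= quantile(score, 0.90))),
    model = map(trimmed, ~ {
      s <- sort(.x$score)
      p <- ecdf(s)(s)             # empirical CDF of the trimmed scores
      lm(p ~ s)                   # linear fit to the CDF
    })
  )

models %>% select(round, class, model)
```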
That's where I find myself. I could have done everything the manual way by now, but I wouldn't learn anything new doing it that way.
So I'm asking the experienced R programmers here: how would you do it? What path would you send me down?