Often, real-life data requires some level of cleaning. R packages help us transform data into a tidy, analyzable structure. However, if we jump into cleaning data without pausing to think about it first, we may run into issues down the road. We may realize we need a complicated workflow to get the data into the correct format, that we repeat ourselves too much, or that the package we tried is not quite right and we have to find an alternative.
We propose a few questions to guide you to the right tool for the job:
- What does the data look like?
Part of the challenge is that the structure varies from dataset to dataset.
Tidy datasets are all alike but every messy dataset is messy in its own way. — Tidy Data, Hadley Wickham
Having a principled approach to data cleaning helps. If we know we want our dataset to be in a tidy format — where each variable forms a column, each observation forms a row, and each type of observational unit forms a table — then we can determine what transformations will lead to our desired result.
For some datasets, tidying can be as simple as pivoting into a long format. For others, we may see that multiple variables are stored together in a single column. Or, we notice that our data is split across separate Excel sheets. And so on.
Taking a close look at the dataset can help us decide whether we need conditional statements or another approach. In the example below, inspecting the column names is what tells us which transformations we need.
- How often will I be doing this analysis?
If we are running an analysis only once, we may decide that a long, “not pretty” script is enough for what we need to do. However, if we know that we will be doing the same analysis next year, it may be worth taking the time to set ourselves up for the future.
A Real-Life Dataset
This blog post walks through a real-life example from the Washington State Department of Education. The DOE stores its state test scores in a CSV file. It contains 110 columns covering each school’s Reading, Writing, Science, and Math tests for grades 1-14:
library(readr)
wa_head <-
  read_csv(here::here("2022-02-17-title-here", "data", "wa_head.csv"))
wa_head
## # A tibble: 6 × 110
## SchoolYear ESD CountyNumber County CountyDistrictN… District GradeTested
## 1 2010-11 Educatio… 1 Adams 1109 Washtuc… 3
## 2 2010-11 Educatio… 1 Adams 1109 Washtuc… 4
## 3 2010-11 Educatio… 1 Adams 1109 Washtuc… 5
## 4 2010-11 Educatio… 1 Adams 1109 Washtuc… 6
## 5 2010-11 Educatio… 1 Adams 1109 Washtuc… 7
## 6 2010-11 Educatio… 1 Adams 1109 Washtuc… 8
## # … with 103 more variables: ReadingTotalTestedNotTested,
## #   ReadingTotalMSP_HSPETested, ReadingMetStandardIncludingPrevPass,
## #   ReadingPercentMetStandardIncludingPrevPass,
## #   ReadingMetStandardWithoutPrevPass,
## #   ReadingPercentMetStandardWithoutPrevPass,
## #   ReadingPercentMetStandardExcludingNoScore, ReadingLevel1, …
Let’s take a look at the column names:
wa_colnames <- read_rds(here::here("2022-02-17-title-here", "data", "wa_colnames.Rds"))
wa_colnames[c(24, 25, 49, 50)] # Taking four as an example
## [1] "ReadingLevel4" "ReadingPercentLevel4" "MathLevel4"
## [4] "MathPercentLevel4"
We can see that each column name contains a subject, a measure, and a grade level.
For the purposes of this post, let’s look at a simplified table:
## # A tibble: 6 × 12
## SchoolYear ESD GradeTested SubGroup ReadingLevel4 ReadingPercentLevel4
## 1 2010-11 ESD 123 3 Male 23 14.5
## 2 2010-11 ESD 123 4 Male 16 11.5
## 3 2010-11 ESD 123 5 Male 37 24.6
## 4 2010-11 ESD 123 3 Female 36 26
## 5 2010-11 ESD 123 4 Female 30 20.5
## 6 2010-11 ESD 123 5 Female 39 33
## # … with 6 more variables: MathLevel4, MathPercentLevel4, WritingLevel4,
## #   WritingPercentLevel4, ScienceLevel4, SciencePercentLevel4
Let’s start by thinking about what the data looks like. This data is not tidy. First, the data is in a “wide” format, so we need to make it “long” if we want to use it in packages like ggplot2. We can do this using the tidyr function pivot_longer().
dat_melt <- tidyr::pivot_longer(
  dat,
  cols = ReadingLevel4:SciencePercentLevel4,
  names_to = "variable",
  values_to = "value"
)
Second, multiple variables are stored together in columns.
Nested if-else statements
If-else statements are an option once our data is in a long format (that is, the column names are stored in a single column). An if statement runs a piece of code when a certain condition is met. Otherwise, it does something else.
A nested if statement is an if statement placed inside another if statement.
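For example, here is a minimal sketch of a nested ifelse() on a toy vector (not part of the dataset, just to show the pattern):
# The "else" value of the outer ifelse() is itself another ifelse()
x <- c(3, 12, 7)
ifelse(x < 5, "small",
  ifelse(x < 10, "medium", "large"))
## [1] "small"  "large"  "medium"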
dat_melt
## # A tibble: 48 × 6
## SchoolYear ESD GradeTested SubGroup variable value
## 1 2010-11 ESD 123 3 Male ReadingLevel4 23
## 2 2010-11 ESD 123 3 Male ReadingPercentLevel4 14.5
## 3 2010-11 ESD 123 3 Male MathLevel4 18
## 4 2010-11 ESD 123 3 Male MathPercentLevel4 11.3
## 5 2010-11 ESD 123 3 Male WritingLevel4 NA
## 6 2010-11 ESD 123 3 Male WritingPercentLevel4 NA
## 7 2010-11 ESD 123 3 Male ScienceLevel4 NA
## 8 2010-11 ESD 123 3 Male SciencePercentLevel4 NA
## 9 2010-11 ESD 123 4 Male ReadingLevel4 16
## 10 2010-11 ESD 123 4 Male ReadingPercentLevel4 11.5
## # … with 38 more rows
# Or, using reshape2::melt
# dat_melt <-
# reshape2::melt(dat, id = c("SchoolYear", "ESD", "GradeTested", "SubGroup"))
Since our column names are now contained under “variable”, we can separate all the pieces into individual columns. We can run our ifelse() statements:
dat_melt$Subject <- ifelse(grepl("Reading", dat_melt$variable), "Reading",
  ifelse(grepl("Math", dat_melt$variable), "Math",
    ifelse(grepl("Writing", dat_melt$variable), "Writing", "Science")))
However, there are a few drawbacks. Multiple if-else statements need to be “nested”, or defined within the previous function. While this works well on our simplified dataset, it becomes very hard to manage once we have many columns: the text starts to wrap and we lose track of parentheses, so it’s easy to get lost.
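To see why, here is a sketch (with hypothetical categories, not code from the original analysis) of what the nesting starts to look like once we also classify the measure as a count or a percentage:
# Hypothetical sketch: classifying both subject and measure quickly
# buries us in parentheses
dat_melt$Measure <- ifelse(grepl("Percent", dat_melt$variable),
  ifelse(grepl("Reading", dat_melt$variable), "Reading percent",
    ifelse(grepl("Math", dat_melt$variable), "Math percent",
      ifelse(grepl("Writing", dat_melt$variable), "Writing percent",
        "Science percent"))),
  ifelse(grepl("Reading", dat_melt$variable), "Reading count",
    ifelse(grepl("Math", dat_melt$variable), "Math count",
      ifelse(grepl("Writing", dat_melt$variable), "Writing count",
        "Science count"))))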
Are there other options?
dplyr::case_when() and data.table::fcase()
The case_when() and fcase() functions solve the nesting issue. Instead of wrapping one call inside another, we supply a flat list of condition/value pairs; the conditions are evaluated in order, and the first condition that is TRUE determines the output for that row.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
dat_melt %>%
  mutate(variable = as.character(variable)) %>%
  mutate(
    Subject = case_when(
      str_detect(variable, "Math") ~ "Math",
      str_detect(variable, "Reading") ~ "Reading",
      str_detect(variable, "Writing") ~ "Writing",
      str_detect(variable, "Science") ~ "Science"
    )
  )
## # A tibble: 48 × 7
## SchoolYear ESD GradeTested SubGroup variable value Subject
## 1 2010-11 ESD 123 3 Male ReadingLevel4 23 Reading
## 2 2010-11 ESD 123 3 Male ReadingPercentLevel4 14.5 Reading
## 3 2010-11 ESD 123 3 Male MathLevel4 18 Math
## 4 2010-11 ESD 123 3 Male MathPercentLevel4 11.3 Math
## 5 2010-11 ESD 123 3 Male WritingLevel4 NA Writing
## 6 2010-11 ESD 123 3 Male WritingPercentLevel4 NA Writing
## 7 2010-11 ESD 123 3 Male ScienceLevel4 NA Science
## 8 2010-11 ESD 123 3 Male SciencePercentLevel4 NA Science
## 9 2010-11 ESD 123 4 Male ReadingLevel4 16 Reading
## 10 2010-11 ESD 123 4 Male ReadingPercentLevel4 11.5 Reading
## # … with 38 more rows
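For reference, a data.table sketch of the same step with fcase() (assuming the dat_melt object from above) might look like this:
library(data.table)

dt_melt <- as.data.table(dat_melt)

# fcase() takes condition, value pairs; the first TRUE condition wins
dt_melt[, Subject := fcase(
  grepl("Math", variable), "Math",
  grepl("Reading", variable), "Reading",
  grepl("Writing", variable), "Writing",
  grepl("Science", variable), "Science"
)]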
The advantage of this approach is that it is much “cleaner”: it’s easy to see what is being changed.
However, we still have to write out a condition for every value we want to recode, so the call grows with the number of categories, and all of the outputs have to share a single type.