A bug I often run into when writing functions with dplyr is unintended data masking in a summarise step. I'll illustrate what I mean with a simple example.
Let's say you want to do a grouped summary, but accidentally give one of the summaries the same name as an existing column:
library(dplyr)
# Unintended result due to data masking in summary step
mtcars %>%
  group_by(gear) %>%
  summarise(
    wt = sum(mpg),
    # wt here refers to the value computed in the summary step
    # instead of the wt column grouped by gear
    mean = mean(wt)
  )
#> # A tibble: 3 x 3
#> gear wt mean
#> <dbl> <dbl> <dbl>
#> 1 3 242. 242.
#> 2 4 294. 294.
#> 3 5 107. 107.
# Same result when trying to use .data
mtcars %>%
  group_by(gear) %>%
  summarise(
    wt = sum(mpg),
    mean = mean(.data$wt)
  )
#> # A tibble: 3 x 3
#> gear wt mean
#> <dbl> <dbl> <dbl>
#> 1 3 242. 242.
#> 2 4 294. 294.
#> 3 5 107. 107.
# Same result when using across()
mtcars %>%
  group_by(gear) %>%
  summarise(
    wt = sum(mpg),
    mean = across("wt", mean)
  )
#> # A tibble: 3 x 3
#> gear wt mean$wt
#> <dbl> <dbl> <dbl>
#> 1 3 242. 242.
#> 2 4 294. 294.
#> 3 5 107. 107.
# Using . the entire ungrouped wt column is used instead of the per-gear values
mtcars %>%
  group_by(gear) %>%
  summarise(
    wt = sum(mpg),
    mean = mean(.$wt)
  )
#> # A tibble: 3 x 3
#> gear wt mean
#> <dbl> <dbl> <dbl>
#> 1 3 242. 3.22
#> 2 4 294. 3.22
#> 3 5 107. 3.22
# Desired output
mtcars %>%
  group_by(gear) %>%
  summarise(
    # mean is computed first, while wt still refers to the original column
    mean = mean(wt),
    wt = sum(mpg)
  ) %>%
  select(gear, wt, mean)
#> # A tibble: 3 x 3
#> gear wt mean
#> <dbl> <dbl> <dbl>
#> 1 3 242. 3.89
#> 2 4 294. 2.62
#> 3 5 107. 2.63
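As far as I understand it, the cause is that summarise() evaluates its expressions in order, and each result becomes visible to the expressions that follow. Here is a minimal sketch of that behaviour on a made-up data frame (df, g and x are hypothetical names, not part of the example above), where the later expression uses the earlier result on purpose:

```r
library(dplyr)

# Hypothetical toy data, only for illustration
df <- tibble(g = c("a", "a", "b"), x = c(1, 2, 4))

res <- df %>%
  group_by(g) %>%
  summarise(
    total = sum(x),
    # total is already available here because expressions
    # are evaluated one after another
    double = total * 2
  )
```

This is handy when intended, but it is the same mechanism that replaces the original wt column once a summary called wt has been computed.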
This tends to happen unexpectedly when this kind of code is wrapped in a function: you don't know the column names in advance, and they may coincide with the names used in the summarise step. For example:
group_and_summarise <- function(data, var, by) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(
      wt = sum(.data[[var]]),
      mean = mean(.data[[var]])
    )
}
group_and_summarise(mtcars, "wt", "gear")
#> # A tibble: 3 x 3
#> gear wt mean
#> <dbl> <dbl> <dbl>
#> 1 3 58.4 58.4
#> 2 4 31.4 31.4
#> 3 5 13.2 13.2
Is there a way to use tidyeval to avoid this data masking and ensure the data is summarised correctly?
(Unsatisfactory) workarounds I'm aware of would be:
- Ordering the summary computations in a way to avoid this risk
- Using names in your summary steps that are less likely to coincide with column names (e.g. .wt instead of wt)
But in both approaches, this bug can still sneak up on you.
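One direction that might sidestep the problem is to compute all summaries inside a single expression, so that none of them can see another's result. The sketch below assumes dplyr 1.0.0 or later, where summarise() unpacks an unnamed data frame into columns. The function name group_and_summarise2 is made up for illustration. I use data.frame() rather than tibble(), because tibble() also lets later columns refer to earlier ones.

```r
library(dplyr)

# Hypothetical variant of group_and_summarise(); the name is made up
group_and_summarise2 <- function(data, var, by) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(
      # Both arguments are evaluated before any new column exists,
      # so .data[[var]] refers to the original column in both places
      data.frame(
        wt = sum(.data[[var]]),
        mean = mean(.data[[var]])
      )
    )
}

res <- group_and_summarise2(mtcars, "wt", "gear")
```

With var = "wt", the mean column should now hold the per-gear means of the original wt column (the 3.89, 2.62 and 2.63 from the desired output above) rather than repeating the sums. I have not checked how well this scales to many summaries, so treat it as a starting point rather than a definitive answer.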