OK, so first of all, `.` isn't a dplyr construct; it comes from the pipe operator supplied by magrittr. It's perfectly possible to write dplyr code without using `%>%`, although it would be much less readable. `.data`, on the other hand, is native to dplyr (and to other tidyverse packages, such as tidyr, that also provide data masking functions).
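To make the point about readability concrete, here is a small sketch that writes the same computation twice, once with nested calls and once with the pipe. The filter threshold and the names `nested` and `piped` are only illustrative choices, not anything from dplyr itself.

```r
library(dplyr, warn.conflicts = FALSE)

# Without the pipe: calls are nested and must be read from the inside out.
nested <- summarise(
  group_by(filter(iris, Sepal.Length > 5), Species),
  avg = mean(Sepal.Width)
)

# With the pipe: the same steps, read top to bottom.
piped <- iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Width))

# Both approaches should produce the same summary table.
all.equal(as.data.frame(nested), as.data.frame(piped))
```

Neither version uses `.` or `.data`; the pipe only changes how the calls are written.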
From a technical standpoint, only `.` fits this description. `.` represents the object on the LHS of the pipe, which could be anything (not necessarily a data frame). The `.data` pronoun is specific to data masking functions, which are designed for working with data frames.
library(magrittr)
# Piping a vector.
c(1, 2, 3) %>% mean(.)
#> [1] 2
# Piping a list.
list(c(1, 2, 3), c(4, 5, 6)) %>% sapply(., max)
#> [1] 3 6
# And of course, piping a data.frame.
iris %>% head(.)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
Created on 2020-08-14 by the reprex package (v0.3.0)
Even when you're piping data frames, `.` isn't really equivalent to `.data`, since the former is a data.frame while the latter is not. When you want to refer to the object on the LHS of `%>%`, always use `.`.
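As a rough check of that difference, the sketch below inspects both objects from inside a single `summarise()` call. The column names `dot_is_df`, `pronoun_is_df` and `n_rows` are made up for illustration, and the exact class of `.data` is an implementation detail that may vary between versions.

```r
library(dplyr, warn.conflicts = FALSE)

res <- iris %>%
  summarise(
    # `.` is the piped object itself, so it is an ordinary data frame.
    dot_is_df = is.data.frame(.),
    # `.data` is a pronoun object provided by the data mask, not a data frame.
    pronoun_is_df = is.data.frame(.data),
    # Because `.` is the whole data frame, whole-table functions work on it.
    n_rows = nrow(.)
  )

res
```

This is why `.` is the right tool when you need the whole object, while `.data` is only meant for looking up individual columns with `$` or `[[`.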
So then what is `.data` used for? Mainly to disambiguate between data variables and environment variables. Consider the following example:
library(dplyr, warn.conflicts = FALSE)
# We have defined an environment variable n.
n <- 100
# We want to use n in a computation.
data.frame(x = 1) %>%
  mutate(y = x / n) %>%
  pull()
#> [1] 0.01
# But what happens if the data frame already contains a variable called n?
data.frame(x = 1, n = 2) %>%
  mutate(y = x / n) %>%
  pull()
#> [1] 0.5
# We get the wrong answer because the data frame variable n has precedence in
# the computation i.e. the data variable n "masks" the environment variable n.
# To disambiguate, we need to be explicit about where the variables come from
# by using the .data and .env pronouns.
data.frame(x = 1, n = 2) %>%
  mutate(y = .data$x / .env$n) %>%
  pull()
#> [1] 0.01
This distinction between data and environment variables is important when you're writing functions for packages, since you have no idea what columns will be present in the user's data or what variables will be present in their workspace.
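Here is a minimal sketch of what that looks like inside a function. `divide_x()` is a hypothetical helper invented for this illustration, and it assumes the input has a column called `x`.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: divides the column x by a number n supplied by the caller.
# .data$x always refers to the column; .env$n always refers to the argument.
divide_x <- function(data, n) {
  data %>%
    mutate(y = .data$x / .env$n)
}

# The data has no column called n.
without_n <- divide_x(data.frame(x = 1), n = 100)

# The data happens to have a column called n; the argument is still the one used.
with_n <- divide_x(data.frame(x = 1, n = 2), n = 100)

without_n$y
with_n$y
```

Both calls should give the same `y`, because the pronouns leave no room for the column `n` to mask the argument `n`.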
You may have noticed that the last example works fine with a bare `x` as long as you use `.env$n`, since data variables always have precedence in data masking functions. Then why use `.data` at all? Well, when you want to write functions that wrap data masking functions, you need to "tunnel" data variables through the environment variables by using the `{{` (embrace) operator. When you do this, however, your function also becomes a data masking function.
library(dplyr, warn.conflicts = FALSE)
my_function <- function(data, by, var) {
  data %>%
    group_by({{ by }}) %>%
    summarise(avg = mean({{ var }}))
}
# my_function is a data masking function. This allows the user to supply bare
# (unquoted) expressions to function arguments just like dplyr functions.
my_function(iris, Species, Sepal.Width)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   Species      avg
#>   <fct>      <dbl>
#> 1 setosa      3.43
#> 2 versicolor  2.77
#> 3 virginica   2.97
This means that users who want to wrap your function (say within your team or organization) will have to know about data masking and the theory behind it. What if you wanted to avoid this and create a "regular" function? Here's where `.data` comes in with the `[[` operator.
library(dplyr, warn.conflicts = FALSE)
my_function <- function(data, by, var) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(avg = mean(.data[[var]]))
}
# my_function is a regular function with no data masking properties. Arguments
# must be supplied as strings just like most base R functions.
my_function(iris, "Species", "Sepal.Width")
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   Species      avg
#>   <fct>      <dbl>
#> 1 setosa      3.43
#> 2 versicolor  2.77
#> 3 virginica   2.97
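One practical payoff of the string-based version is that it is easy to program with using ordinary R tools. As a sketch, the block below repeats the `.data[[` definition of `my_function` so that it runs on its own, then loops over a character vector of column names; `vars` and `results` are names chosen just for this example.

```r
library(dplyr, warn.conflicts = FALSE)

# Same string-based definition as above, repeated so this block is self-contained.
my_function <- function(data, by, var) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(avg = mean(.data[[var]]))
}

# Column names are plain strings, so they can be stored in a vector and looped over.
vars <- c("Sepal.Length", "Sepal.Width")
results <- lapply(vars, function(v) my_function(iris, "Species", v))
names(results) <- vars

# One summary table per variable.
results[["Sepal.Width"]]
```

The anonymous function passed to `lapply()` needs no embracing or other tidy evaluation machinery, which is the point of keeping `my_function` a regular function.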
Hope that explanation makes things clear.
Note: Credit goes to Lionel Henry for some of the examples used above as I learned about these concepts mostly from his lectures.