What is the difference between . and .data?

siddharthprabhu · August 14, 2020, 8:36am

OK, so first of all, . isn't a dplyr construct; it comes from the pipe operator supplied by magrittr. It's perfectly possible to write dplyr code without using %>%, although it would be much less readable. .data, on the other hand, is native to dplyr (and other tidyverse packages such as tidyr that also provide data masking functions).

From a technical standpoint, only . fits this description. . represents the object on the LHS of the pipe, which could be anything (not necessarily a data frame). The .data pronoun is specific to data masking functions, which are designed for working with data frames.

library(magrittr)

# Piping a vector.
c(1, 2, 3) %>% mean(.)
#> [1] 2

# Piping a list.
list(c(1, 2, 3), c(4, 5, 6)) %>% sapply(., max)
#> [1] 3 6

# And of course, piping a data.frame.
iris %>% head(.)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

Created on 2020-08-14 by the reprex package (v0.3.0)

Even when you're piping data frames, . isn't really equivalent to .data, since the former is a data.frame while the latter is not (it is a pronoun that only exists inside data masking functions). When you want to refer to the object on the LHS of %>%, always use . rather than .data.
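
To make that difference concrete, here is a minimal sketch, assuming dplyr is installed (the mtcars columns are only there for illustration). It passes . to an argument other than the first, then checks what . and .data actually are.

```r
library(dplyr, warn.conflicts = FALSE)  # also makes %>% available

# . is the piped object itself, so it can go into any argument position.
fit <- mtcars %>% lm(mpg ~ wt, data = .)
names(coef(fit))
#> [1] "(Intercept)" "wt"

# Here . is the data frame on the LHS of the pipe.
dot_is_df <- mtcars %>% is.data.frame(.)
dot_is_df
#> [1] TRUE

# .data only exists inside the data masking verb; this asks whether it is
# a data frame there.
pronoun_is_df <- mtcars %>%
  summarise(is_df = is.data.frame(.data)) %>%
  pull(is_df)
```

Printing pronoun_is_df lets you confirm for yourself that the pronoun is not a data frame, even though it gives access to the same columns.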

So then what is .data used for? Mainly to disambiguate between data variables and environment variables. Consider the following example:

library(dplyr, warn.conflicts = FALSE)

# We have defined an environment variable n.
n <- 100

# We want to use n in a computation.
data.frame(x = 1) %>% 
  mutate(y = x / n) %>% 
  pull()
#> [1] 0.01

# But what happens if the data frame already contains a variable called n?
data.frame(x = 1, n = 2) %>% 
  mutate(y = x / n) %>% 
  pull()
#> [1] 0.5

# We get an unintended answer because the data frame variable n takes precedence
# in the computation, i.e. the data variable n "masks" the environment variable n.

# To disambiguate, we need to be explicit about where the variables come from 
# by using the .data and .env pronouns.
data.frame(x = 1, n = 2) %>% 
  mutate(y = .data$x / .env$n) %>% 
  pull()
#> [1] 0.01


This distinction between data and environment variables is important when you're writing functions for packages, since you have no idea what columns the user's data frame will contain or what variables will be present in their workspace.
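
As a rough sketch of that package-function situation, consider a hypothetical helper scale_x() (the name and its multiplier argument are made up for illustration). Inside it, .data$x always refers to the column and .env$multiplier always refers to the function argument, even if the user's data frame happens to contain a column with the same name.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: multiply column x by a user-supplied number.
scale_x <- function(data, multiplier) {
  data %>%
    mutate(scaled = .data$x * .env$multiplier)
}

# The data frame has its own "multiplier" column, but the argument still wins.
df <- data.frame(x = c(1, 2, 3), multiplier = c(10, 10, 10))
scaled <- scale_x(df, 2)$scaled
scaled
#> [1] 2 4 6
```

Without the pronouns, x * multiplier would silently pick up the column and return 10, 20, 30 instead.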

You may have noticed that the mutate() call with .data$x / .env$n works just as well with a bare x, as long as you keep .env$n, since data variables always have precedence in data masking functions. Then why use .data at all? Well, when you want to write functions that wrap data masking functions, you need to "tunnel" data variables through the environment variables (here, your function's arguments) by using the {{ (embrace) operator. When you do this, however, your function also becomes a data masking function.

library(dplyr, warn.conflicts = FALSE)

my_function <- function(data, by, var) {
  data %>% 
    group_by({{ by }}) %>% 
    summarise(avg = mean({{ var }}))
}

# my_function is a data masking function. This allows the user to supply bare
# (unquoted) column names as function arguments, just like dplyr functions.
my_function(iris, Species, Sepal.Width)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   Species      avg
#>   <fct>      <dbl>
#> 1 setosa      3.43
#> 2 versicolor  2.77
#> 3 virginica   2.97


This means that users who want to wrap your function (say within your team or organization) will have to know about data masking and the theory behind it. What if you wanted to avoid this and create a "regular" function? Here's where .data comes in with the [[ operator.

library(dplyr, warn.conflicts = FALSE)

my_function <- function(data, by, var) {
  data %>% 
    group_by(.data[[by]]) %>% 
    summarise(avg = mean(.data[[var]]))
}

# my_function is a regular function with no data masking properties. Arguments
# must be supplied as strings just like most base R functions.
my_function(iris, "Species", "Sepal.Width")
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   Species      avg
#>   <fct>      <dbl>
#> 1 setosa      3.43
#> 2 versicolor  2.77
#> 3 virginica   2.97

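
One practical payoff of the string-based version, sketched here under the assumption that dplyr 1.0.0 or later is installed: column names can live in ordinary character vectors and be looped over. The helper name mean_by() is hypothetical and simply repeats the .data[[ pattern; the .groups = "drop" argument is only there to silence the ungrouping message.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: same pattern as above, taking column names as strings.
mean_by <- function(data, by, var) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(avg = mean(.data[[var]]), .groups = "drop")
}

# Column names are plain strings, so they can be stored and iterated over.
vars <- c("Sepal.Length", "Sepal.Width")
results <- lapply(vars, function(v) mean_by(iris, "Species", v))
names(results) <- vars

# One summary table per variable, each with one row per species.
results[["Sepal.Width"]]
```

Doing the same with the {{ version would require the caller to understand data masking as well, which is exactly what the string interface avoids.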

Hope that explanation makes things clear.

Note: Credit goes to Lionel Henry for some of the examples used above as I learned about these concepts mostly from his lectures.