What is the difference between . and .data?

brad.cannell · August 13, 2020, 3:49pm

I'm trying to develop a deeper understanding of using the dot (".") with dplyr and using the .data pronoun with dplyr. The code I was writing that motivated this post looked something like this:

cat_table <- tibble(
  variable = vector("character"), 
  category = vector("numeric"), 
  n        = vector("numeric")
) 

for(i in c("cyl", "vs", "am")) {
  cat_stats <- mtcars %>% 
    count(.data[[i]]) %>% 
    mutate(variable = names(.)[1]) %>%
    rename(category = 1)
  
  cat_table <- bind_rows(cat_table, cat_stats)
}

# A tibble: 7 x 3
  variable category     n
  <chr>       <dbl> <dbl>
1 cyl             4    11
2 cyl             6     7
3 cyl             8    14
4 vs              0    18
5 vs              1    14
6 am              0    19
7 am              1    13

The code does what I wanted it to do and isn’t really the focus of this question. I was just providing it for context.

I'm trying to develop a deeper understanding of why it does what I want it to do. And more specifically, why I can't use . and .data interchangeably. I've read the Programming with dplyr article, but I guess in my mind, both . and .data just mean "our result up to this point in the pipeline." But, it appears as though I'm oversimplifying my mental model of how they work because I get an error when I use .data inside of names() below:

mtcars %>% 
  count(.data[["cyl"]]) %>% 
  mutate(variable = names(.data)[1])

Error: Problem with `mutate()` input `variable`.
x Can't take the `names()` of the `.data` pronoun
ℹ Input `variable` is `names(.data)[1]`.
Run `rlang::last_error()` to see where the error occurred.

And I get an unexpected (to me) result when I use . inside of count():

mtcars %>% 
  count(.[["cyl"]]) %>% 
  mutate(variable = names(.)[1])

  .[["cyl"]]  n   variable
1          4 11 .[["cyl"]]
2          6  7 .[["cyl"]]
3          8 14 .[["cyl"]]

I suspect it has something to do with, "Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it," from the Programming with dplyr article. This tells me what .data isn't (a data frame), but I'm still not sure what .data is and how it differs from the dot (.).

I tried figuring it out like this:

mtcars %>% 
  count(.data[["cyl"]]) %>% 
  mutate(variable = list(.data))

But the result, <S3: rlang_data_pronoun>, doesn't tell me anything that helps me understand. If anybody out there has a better grasp on this, I would appreciate a brief lesson. Thanks!

siddharthprabhu · August 14, 2020, 8:36am

OK, so first of all, . isn't a dplyr construct; it comes from the pipe operator supplied by magrittr. It's perfectly possible to write dplyr code without using %>%, although it would be much less readable. .data, on the other hand, is a pronoun defined in rlang and re-exported by dplyr for use inside its data masking functions (other tidyverse packages such as tidyr also provide data masking functions).

From a technical standpoint, only . fits your description of "our result up to this point in the pipeline." . represents the object on the LHS of the pipe, which could be anything (not necessarily a data frame). The .data pronoun is specific to data masking functions, which are designed for working with data frames.

library(magrittr)

# Piping a vector.
c(1, 2, 3) %>% mean(.)
#> [1] 2

# Piping a list.
list(c(1, 2, 3), c(4, 5, 6)) %>% sapply(., max)
#> [1] 3 6

# And of course, piping a data.frame.
iris %>% head(.)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

Created on 2020-08-14 by the reprex package (v0.3.0)

Even when you're piping data frames, . isn't really equivalent to .data, since the former is a data.frame while the latter is not. When you want to refer to the object on the LHS of %>%, always use the dot (.), not .data.
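
To make that concrete, here is a small sketch (the variable names are only for illustration) that checks both objects and reproduces the column names from the original question. As far as I understand it, count() names a new column after the expression it was given: .data[["cyl"]] is recognised and shortened to cyl, whereas .[["cyl"]] is just an ordinary expression, so its text becomes the column name.

```r
library(dplyr, warn.conflicts = FALSE)

# `.` is the object on the LHS of %>%, here an ordinary data frame.
dot_is_df <- mtcars %>% is.data.frame(.)

# `.data` is only meaningful inside a data masking verb, and it is not a data frame.
pronoun_is_df <- mtcars %>%
  summarise(is_df = is.data.frame(.data)) %>%
  pull(is_df)

# Column names produced by the two spellings used in the question.
names_with_pronoun <- mtcars %>% count(.data[["cyl"]]) %>% names()
names_with_dot     <- mtcars %>% count(.[["cyl"]]) %>% names()

dot_is_df
pronoun_is_df
names_with_pronoun
names_with_dot
```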

So then what is .data used for? Mainly to disambiguate between data variables and environment variables. Consider the following example:

library(dplyr, warn.conflicts = FALSE)

# We have defined an environment variable n.
n <- 100

# We want to use n in a computation.
data.frame(x = 1) %>% 
  mutate(y = x / n) %>% 
  pull()
#> [1] 0.01

# But what happens if the data frame already contains a variable called n?
data.frame(x = 1, n = 2) %>% 
  mutate(y = x / n) %>% 
  pull()
#> [1] 0.5

# We get the wrong answer because the data frame variable n has precedence in
# the computation i.e. the data variable n "masks" the environment variable n.

# To disambiguate, we need to be explicit about where the variables come from 
# by using the .data and .env pronouns.
data.frame(x = 1, n = 2) %>% 
  mutate(y = .data$x / .env$n) %>% 
  pull()
#> [1] 0.01

Created on 2020-08-14 by the reprex package (v0.3.0)

This distinction between data and environment variables is important when you're writing functions for packages since you have no idea what variables will be present in the user's workspace.
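
As a rough sketch of what that looks like inside a function: divide_x() below is a made-up helper, not something from dplyr. It assumes the data frame has a column called x, and it uses both pronouns so that a column named n in the user's data cannot shadow the n argument.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: divide column `x` by a number supplied by the caller.
divide_x <- function(data, n) {
  data %>%
    # .data$x is always the column x; .env$n is always the function argument n,
    # even when `data` happens to contain its own column called n.
    mutate(y = .data$x / .env$n)
}

result <- divide_x(data.frame(x = 1, n = 2), n = 100)
result$y
#> [1] 0.01
```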

You may have noticed that the last example works fine with a plain x instead of .data$x as long as you use .env$n, since data variables always have precedence in data masking functions. Then why use .data at all? Well, when you want to write functions that wrap data masking functions, you need to "tunnel" data variables through the environment variables by using the {{ (embrace) operator. When you do this, however, your function also becomes a data masking function.

library(dplyr, warn.conflicts = FALSE)

my_function <- function(data, by, var) {
  data %>% 
    group_by({{ by }}) %>% 
    summarise(avg = mean({{ var }}))
}

# my_function is a data masking function. This allows the user to supply bare
# (unquoted) column names to function arguments, just like dplyr functions.
my_function(iris, Species, Sepal.Width)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   Species      avg
#>   <fct>      <dbl>
#> 1 setosa      3.43
#> 2 versicolor  2.77
#> 3 virginica   2.97

Created on 2020-08-14 by the reprex package (v0.3.0)

This means that users who want to wrap your function (say within your team or organization) will have to know about data masking and the theory behind it. What if you wanted to avoid this and create a "regular" function? Here's where .data comes in with the [[ operator.

library(dplyr, warn.conflicts = FALSE)

my_function <- function(data, by, var) {
  data %>% 
    group_by(.data[[by]]) %>% 
    summarise(avg = mean(.data[[var]]))
}

# my_function is a regular function with no data masking properties. Arguments
# must be supplied as strings just like most base R functions.
my_function(iris, "Species", "Sepal.Width")
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   Species      avg
#>   <fct>      <dbl>
#> 1 setosa      3.43
#> 2 versicolor  2.77
#> 3 virginica   2.97

Created on 2020-08-14 by the reprex package (v0.3.0)
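
Because the arguments are plain strings, a function like this is also easy to call once per column name, which is essentially the situation in the original question. The following is only a sketch: count_category() is a hypothetical helper, and lapply() plus bind_rows() is just one of several ways to combine the pieces.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: count the values of one column, given its name as a string.
count_category <- function(data, var) {
  data %>%
    count(category = .data[[var]]) %>%  # look the column up by name
    mutate(variable = .env$var) %>%     # record which column was counted
    select(variable, category, n)
}

vars <- c("cyl", "vs", "am")
cat_table <- bind_rows(lapply(vars, function(v) count_category(mtcars, v)))
cat_table
```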

Hope that explanation makes things clear.

Note: Credit goes to Lionel Henry for some of the examples used above as I learned about these concepts mostly from his lectures.

brad.cannell · August 14, 2020, 1:44pm

Thank you for taking the time to write that really comprehensive answer, @siddharthprabhu! I really appreciate it. I have a couple of follow-up questions, which I would love your thoughts on if you have time.

I get that . is from magrittr and literally means the thing on the LHS. I get that .data is different than . and is not itself a data frame (or vector, or list, or whatever).

At this point, I think I have some clarity about what it isn't. But, I'm still not 100% sure what it is. I guess the reason it continues to bother me is because I still don't have a good intuition at this point as to why mutate(variable = names(.data)[1]) in

mtcars %>% 
  count(.data[["cyl"]]) %>% 
  mutate(variable = names(.data)[1])

produces an error.

I get that .data isn't a data frame, but it seems to me like .data must still have some awareness of the column names because .data[["cyl"]] has meaning to .data. Further, when I replace names(.data)[1] in the code above with .data$, RStudio shows me a list of column names the same way it would if .data were a data frame (see below). How does .data store those names, and is there a way to extract them? I know that extracting column names from .data is a minor thing that would probably rarely ever be useful, but my curiosity has gotten the best of me at this point.

As I was typing this response, my second question was going to be: "I understand that .data isn't simply the thing on the LHS, which is the result of count() in this situation. But what does .data reference when we get to the mutate() part of the code? Does it still reference mtcars? If so, does that mean that .data always references the object at the beginning of the pipeline?" Based on the screenshot above, it appears as though the answer is yes.

Thanks again!

nirgrahamuk · August 14, 2020, 1:57pm

If you type .data alone in the console (with the tidyverse loaded), you'll see <pronoun> printed.
.data retrieves data variables from the data frame that is current in the context of the dplyr pipeline.
Think of it as a special device. It's a fake object: an object of class rlang_fake_data_pronoun.

You can read the help documentation by typing

 ?rlang::.data

In your point 2, you ask if it always references the data frame at the beginning of the pipeline. I think this example shows that no, it references the current data frame in the pipeline.

mtcars %>% 
  count(.data[["cyl"]]) %>%
  mutate(variable = .data[["n"]])

In this case there is no n in the original mtcars; n only exists in the data frame that the count function creates.
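
The same point can be checked from the other direction. In this sketch (the object names are just for illustration), a column that exists in mtcars but not in the output of count() can no longer be reached through .data, so the lookup fails; the exact wording of the error may differ between versions, so only the failure itself is recorded.

```r
library(dplyr, warn.conflicts = FALSE)

counted <- mtcars %>% count(.data[["cyl"]])

# `n` exists only in the output of count(), yet .data finds it inside mutate().
with_n <- counted %>% mutate(variable = .data[["n"]])

# `mpg` exists in mtcars but not in `counted`, so this lookup raises an error.
mpg_lookup_failed <- tryCatch(
  {
    counted %>% mutate(variable = .data[["mpg"]])
    FALSE
  },
  error = function(e) TRUE
)

with_n
mpg_lookup_failed
```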

nirgrahamuk · August 14, 2020, 2:06pm

Oh, I should add that in this case it's not really needed. The following works and gives the same result:

mtcars %>% 
  count(cyl) %>%
  mutate(variable = n)

siddharthprabhu · August 14, 2020, 2:37pm

@brad.cannell It is meaningless to call names() on .data because it is not a real data frame and thus doesn't have any column names. As nirgrahamuk mentioned, it is a special construct that fetches the data variables in the context of the data frame being currently evaluated.

From the documentation:

Note that .data is only a pronoun, it is not a real data frame. This means that you can't take its names or map a function over the contents of .data. Similarly, .env is not an actual R environment. For instance, it doesn't have a parent and the subsetting operators behave differently.
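
A small sketch of what that means in practice, reusing the pipeline from the question (the object names are only for illustration): asking the pronoun for its names is trapped as an error, while asking the piped data frame via the magrittr dot works.

```r
library(dplyr, warn.conflicts = FALSE)

counted <- mtcars %>% count(.data[["cyl"]])

# names(.data) is an error; capture the message instead of stopping the script.
pronoun_msg <- tryCatch(
  {
    counted %>% mutate(variable = names(.data)[1])
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

# names(.) works because `.` is the data frame on the LHS of the pipe.
with_dot <- counted %>% mutate(variable = names(.)[1])

pronoun_msg
with_dot
```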

This question seems to have more to do with how .data is implemented. For this I think you may have to look at the rlang source code as I don't think anyone here outside the rlang development team would be able to explain the internals of these objects.

brad.cannell · August 14, 2020, 3:13pm

Thanks again, @siddharthprabhu!

Andrzej · August 14, 2020, 6:03pm

I don't know if this is relevant, but if you want to check pipe steps (and the meaning of the dots as well), here is a package:

devtools::install_github("daranzolin/ViewPipeSteps")

regards,

smouksassi · August 21, 2020, 9:59am

brad.cannell:

cat_table <- tibble(
  variable = vector("character"), 
  category = vector("numeric"), 
  n        = vector("numeric")
) 

for(i in c("cyl", "vs", "am")) {
  cat_stats <- mtcars %>% 
    count(.data[[i]]) %>% 
    mutate(variable = names(.)[1]) %>%
    rename(category = 1)
  
  cat_table <- bind_rows(cat_table, cat_stats)
}

To illustrate how to get the same table in a more straightforward manner without loops:

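library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Stack the three columns into variable/category pairs, then count each pair.
mtcars %>% 
  gather(variable, category, cyl, vs, am, factor_key = TRUE) %>%
  group_by(variable, category) %>%
  summarize(n = n())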
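
For what it's worth, the same idea can also be sketched with pivot_longer(), which the tidyr documentation recommends over gather() for new code. This is only an alternative spelling, assuming tidyr 1.0.0 or later; note that count() sorts the character column variable alphabetically, so the row order differs from the loop version.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

cat_table <- mtcars %>%
  # Stack cyl, vs and am into name/value pairs.
  pivot_longer(c(cyl, vs, am), names_to = "variable", values_to = "category") %>%
  # count() groups, tallies and ungroups in one step.
  count(variable, category)

cat_table
```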
