I think the concern is mostly about scoping, though `transform` is both laxer and stricter than `mutate`, and not always in the ways you'd expect, e.g.
```r
library(magrittr)
set.seed(47)
some_data <- data.frame(i = 1:4)

# Recycling works...
some_data %>% transform(x = rnorm(2))
#>   i         x
#> 1 1 1.9946963
#> 2 2 0.7111425
#> 3 3 1.9946963
#> 4 4 0.7111425

# ...but not partial recycling
data.frame(i = 1:5) %>% transform(x = rnorm(2))
#> Error in data.frame(structure(list(i = 1:5), .Names = "i", row.names = c(NA, : arguments imply differing number of rows: 5, 2

# Referring to previously created variables doesn't work...
some_data %>%
  transform(x = rnorm(2),
            y = x)
#> Error in eval(substitute(list(...)), `_data`, parent.frame()): object 'x' not found

# ...unless they're in different calls
some_data %>%
  transform(x = rnorm(2)) %>%
  transform(y = x)
#>   i           x           y
#> 1 1 -0.98548216 -0.98548216
#> 2 2  0.01513086  0.01513086
#> 3 3 -0.98548216 -0.98548216
#> 4 4  0.01513086  0.01513086

x <- 2

# If a global variable by the same name exists, it will grab it even if there's an earlier one in the call...
some_data %>%
  transform(x = rnorm(2),
            y = x)
#>   i          x y
#> 1 1 -0.2520459 2
#> 2 2 -1.4657503 2
#> 3 3 -0.2520459 2
#> 4 4 -1.4657503 2

# ...unless the calls are separated, in which case the data frame `x` takes priority
some_data %>%
  transform(x = rnorm(2)) %>%
  transform(y = x)
#>   i           x           y
#> 1 1 -0.92245624 -0.92245624
#> 2 2  0.03960243  0.03960243
#> 3 3 -0.92245624 -0.92245624
#> 4 4  0.03960243  0.03960243

# If you want to get a global variable with the same name as a data frame variable, you have to tell it where to look...
some_data %>%
  transform(x = rnorm(2),
            y = substitute(x, env = globalenv()))
#>   i          x y
#> 1 1  0.4938202 2
#> 2 2 -1.8282292 2
#> 3 3  0.4938202 2
#> 4 4 -1.8282292 2

# ...in which case it doesn't matter how you separate the calls
some_data %>%
  transform(x = rnorm(2)) %>%
  transform(y = substitute(x, env = globalenv()))
#>   i          x y
#> 1 1 0.09147291 2
#> 2 2 0.67077922 2
#> 3 3 0.09147291 2
#> 4 4 0.67077922 2
```
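The `eval(substitute(list(...)), ...)` call quoted in the error message above hints at why this happens: all the arguments are evaluated together against the original data frame, with the calling environment as the fallback. Here is a minimal sketch of that lookup order using `eval()` directly. The names `here` and `expr` are only for illustration, and this is a simplification of what `transform` does, not its actual source.

```r
some_data <- data.frame(i = 1:4)
x <- 2

# Environment to fall back on when a name is not a column
here <- environment()
expr <- quote(list(y = x))

# No column called `x`: the lookup falls through to the environment
no_column <- eval(expr, some_data, here)

# A column called `x` exists: the column shadows the environment's `x`
with_column <- eval(expr, transform(some_data, x = i * 10), here)
```

In the first case `y` comes from the environment's `x`; in the second it comes from the column, which matches the "separate calls" behaviour shown earlier.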
dplyr is stricter about recycling (only length-1 vectors are recycled), but handles scoping very similarly. When you're working with a data frame and environment you control, that behavior makes coding very quick. When the code will be operating on an arbitrary, unknown data frame or in an arbitrary environment, that behavior becomes risky for both `transform` and `mutate` (could there be a similarly named vector in a parent environment?), and quickly leads either to abandoning both in favor of the safer `[[` syntax or to some gymnastic defensive coding.
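As a rough sketch of what the `[[` approach looks like, here is a small base R helper that only touches columns it is given by name, so nothing is ever looked up in a parent environment. The function `add_scaled()` and its arguments are hypothetical, made up for this illustration.

```r
# Hypothetical helper: adds a scaled copy of one column under a new name
add_scaled <- function(df, source_col, new_col, multiplier = 2) {
  stopifnot(is.data.frame(df), source_col %in% names(df))
  # `[[` takes the column name as a string; no non-standard evaluation
  df[[new_col]] <- df[[source_col]] * multiplier
  df
}

i <- 100  # a same-named global that NSE code could pick up by accident
out <- add_scaled(data.frame(i = 1:4), "i", "i_scaled")
```

It is more verbose than `transform` or `mutate`, but a stray `i` in the calling environment cannot change the result.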
So it's not that `transform`'s NSE is riskier than `mutate`'s (though it is less powerful; try setting a variable name to a stored string in `transform`). It's that dplyr prioritizes uses where the author knows what the data looks like (the vast majority of code) and accepts that the programming cases will require a more thorough knowledge of its NSE system, whereas the documentation for `transform` just advises users to avoid it for programmatic use, since controlling its NSE system effectively is significantly more of a pain than using `[[` syntax.
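For comparison, here is a sketch of what the programmatic side of `mutate` can look like, assuming a reasonably recent dplyr where the `.data` and `.env` pronouns and `:=` are available inside `mutate`. The column and variable names are made up for the example.

```r
library(dplyr)

some_data <- data.frame(i = 1:4, x = c(10, 20, 30, 40))
x <- 2                 # same name as a column, on purpose
new_name <- "doubled"  # column name stored as a string

result <- some_data %>%
  mutate(
    from_data = .data$x,        # explicitly the column
    from_env = .env$x,          # explicitly the environment's `x`
    !!new_name := .data$i * 2   # name the new column from the string
  )
```

Being explicit with `.data` and `.env` removes the ambiguity shown in the `transform` examples, at the cost of learning that part of the NSE system.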
Since most of these programmatic cases will come up when writing packages, and most of those will end up written in base R anyway to avoid a large dependency graph, the number of cases where programming with dplyr is actually relevant is limited to packages that build on the tidyverse framework, e.g. `tidytext`, and occasionally to people trying to operate on a terribly arranged data structure, though such an approach is rarely simpler than tidying first.