# Problem: I am trying to use dplyr within a function
# call where the function parameters are a dataframe
# name, and variable names within the dataframe. The
# function needs to accommodate different dataframes and
# variable names so that it is generalized for use with
# any dataframe.
#
# Here is my simple example:
# Load tidyverse library
require(tidyverse)
#> Loading required package: tidyverse
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag(): dplyr, stats
# Example: summing columns of a dataframe. This example uses
# mutate and passes unquoted variable names
# Data
df.out <- data.frame(x=1:50, y=100:51)
# Function
df.function <- function(df, var1, var2){
  var1 <- enquo(var1)
  var2 <- enquo(var2)
  # create a third variable that is a sum of the first two
  df.new <- df %>% mutate(z = UQ(var1) + UQ(var2))
  return(df.new)
}
# Function Call
df.augmented <- df.function(df.out, x, y)
df.augmented
#> x y z
#> 1 1 100 101
#> 2 2 99 101
#> 3 3 98 101
#> 4 4 97 101
#> 5 5 96 101
#> 6 6 95 101
#> 7 7 94 101
#> 8 8 93 101
#> 9 9 92 101
#> 10 10 91 101
#> 11 11 90 101
#> 12 12 89 101
#> 13 13 88 101
#> 14 14 87 101
#> 15 15 86 101
#> 16 16 85 101
#> 17 17 84 101
#> 18 18 83 101
#> 19 19 82 101
#> 20 20 81 101
#> 21 21 80 101
#> 22 22 79 101
#> 23 23 78 101
#> 24 24 77 101
#> 25 25 76 101
#> 26 26 75 101
#> 27 27 74 101
#> 28 28 73 101
#> 29 29 72 101
#> 30 30 71 101
#> 31 31 70 101
#> 32 32 69 101
#> 33 33 68 101
#> 34 34 67 101
#> 35 35 66 101
#> 36 36 65 101
#> 37 37 64 101
#> 38 38 63 101
#> 39 39 62 101
#> 40 40 61 101
#> 41 41 60 101
#> 42 42 59 101
#> 43 43 58 101
#> 44 44 57 101
#> 45 45 56 101
#> 46 46 55 101
#> 47 47 54 101
#> 48 48 53 101
#> 49 49 52 101
#> 50 50 51 101
# Question: My code seems overly complicated in
# terms of converting the unquoted input parameters to
# quoted values using enquo, and then unquoting again
# using UQ in the call to mutate. It is the only way
# I could get this to work for arbitrary dataframe and
# variable names. Is there a way to do this without
# using enquo and UQ?
#
# Thanks
Know that the !! operator is equivalent to UQ() (see here) for unquoting. You could then write mutate(z = !!var1 + !!var2).
Otherwise your code seems OK. It is consistent with the Programming with dplyr vignette: quote inside the function with enquo() to get a quosure, then unquote with !! when needed.
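To make that concrete, the same function rewritten with !! might look like this (a sketch; df.function2 is just an illustrative name and the result should match df.function above):
# Same function using !! instead of UQ(); behavior should be identical
df.function2 <- function(df, var1, var2){
  var1 <- enquo(var1)
  var2 <- enquo(var2)
  df %>% mutate(z = !!var1 + !!var2)
}
df.function2(df.out, x, y)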
Another way to extend this a bit, with a different approach, is to use the ellipsis and quos(). This quick example uses filter() and rowSums(), but I would be interested in knowing whether this could be collapsed down to one function call and avoid using filter().
library(tidyverse)
df <- data.frame(x=1:50, y=100:51)
df2 <- data.frame(x=1:50, y=100:51, z=101:150)
# Function
df.function <- function(df, ...) {
  vars <- quos(...)
  # create a third variable that is a sum of the input
  df.new <- df %>%
    filter(!!!vars) %>%
    mutate(z1 = rowSums(.))
  return(df.new)
}
df.result <- df.function(df, x, y)
df.result2 <- df.function(df2, x, y, z)
While attempting to generalize to more than 2 columns, you can avoid using filter, I think. Useful tidyeval tools are quos(...) and !!! to unquote-splice it; you can use those inside dplyr functions like select(). Here is how I would do it:
library(dplyr, warn.conflicts = F)
df1 <- data_frame(x=1:50, y=100:51)
df2 <- data_frame(x=1:50, y=100:51, z=101:150)
rowsum_df <- function(df, ...) {
  var <- quos(...)
  df %>%
    mutate(z1 = select(., !!!var) %>% rowSums())
}
rowsum_df(df1, x, y)
#> # A tibble: 50 x 3
#> x y z1
#> <int> <int> <dbl>
#> 1 1 100 101
#> 2 2 99 101
#> 3 3 98 101
#> 4 4 97 101
#> 5 5 96 101
#> 6 6 95 101
#> 7 7 94 101
#> 8 8 93 101
#> 9 9 92 101
#> 10 10 91 101
#> # ... with 40 more rows
rowsum_df(df2, x, y)
#> # A tibble: 50 x 4
#> x y z z1
#> <int> <int> <int> <dbl>
#> 1 1 100 101 101
#> 2 2 99 102 101
#> 3 3 98 103 101
#> 4 4 97 104 101
#> 5 5 96 105 101
#> 6 6 95 106 101
#> 7 7 94 107 101
#> 8 8 93 108 101
#> 9 9 92 109 101
#> 10 10 91 110 101
#> # ... with 40 more rows
rowsum_df(df2, x, y, z)
#> # A tibble: 50 x 4
#> x y z z1
#> <int> <int> <int> <dbl>
#> 1 1 100 101 202
#> 2 2 99 102 203
#> 3 3 98 103 204
#> 4 4 97 104 205
#> 5 5 96 105 206
#> 6 6 95 106 207
#> 7 7 94 107 208
#> 8 8 93 108 209
#> 9 9 92 109 210
#> 10 10 91 110 211
#> # ... with 40 more rows
Thanks for the reply.
Yes, I did realize that you can use !! in place of UQ(). I only did it the longer way to avoid confusing the negation operator with UQ(). I can see myself forgetting the distinction later on down the road when I am trying to explain my function to others.
Thanks for your reply.
Using the ellipsis seems like a good option. But then how would I access the individual variable names that are passed into ...?
In actual practice I would need to be able to test the values of the individual variables passed into the dots.
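For reference, a minimal sketch of one way to get at the names and values captured by quos() (the helper name inspect_dots is made up here; quo_name() and eval_tidy() come from rlang):
# Sketch: inspecting arguments captured with quos()
library(rlang)
inspect_dots <- function(df, ...) {
  vars <- quos(...)
  # quo_name() converts each captured quosure back to the variable name as a string
  var_names <- vapply(vars, quo_name, character(1))
  # eval_tidy() evaluates a quosure against the data, so you can test its values
  first_var <- eval_tidy(vars[[1]], data = df)
  list(names = var_names, first_var_all_positive = all(first_var > 0))
}
inspect_dots(df.out, x, y)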
Bear in mind that functions should only be as complicated as necessary. Instead of taking a data.frame and column names, then looking for those columns in the data, just ask for the vectors.
vector_function <- function(var1, var2){
  var1 + var2
}
df.augmented <- df.out %>%
  mutate(z = vector_function(x, y))
Non-standard evaluation is only necessary for extremely general functions. And even then, a vector-input function is often the better choice in keeping things general. With the example above, df.function would overwrite any existing z column, while vector_function allows the user to specify the column name.
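For example, nothing stops the caller from picking a different output column (the name total here is arbitrary):
# the caller chooses the output column name
df.out %>%
  mutate(total = vector_function(x, y))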
Sometimes, narrowly focused functions can benefit from NSE if they exploit the fact that arguments are not evaluated immediately. For example, the dbplyr package allows statements to be executed in a database instead of in R because it catches the arguments before evaluation.
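A minimal sketch of that idea, assuming the DBI, RSQLite, and dbplyr packages are installed (the in-memory SQLite database and the mtcars copy are just for illustration):
library(DBI)
library(dplyr)
# copy a data frame into a throwaway in-memory database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
cars_db <- copy_to(con, mtcars, "mtcars")
# filter() is not evaluated in R; the expression is captured and translated to SQL
cars_db %>%
  filter(mpg > 25) %>%
  show_query()
dbDisconnect(con)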
Unless there's a benefit to using NSE, see if you can make the function vector-based. They're simpler to write, debug, understand, and use inside other functions.