Using map with a vector of variables and dplyr programming

Lief · January 14, 2021, 7:49am

I've had a frustrating pattern come up in my work a couple of times now when writing functions around dplyr code. I'll be doing an exploratory analysis and will write a short block that transforms and summarizes a variable, usually with some grouping. Then I'll need to apply the same analysis to a second variable, so I write a function using embracing ({{ }}) and it's good. Finally, it turns out I need to apply the same analysis to 10 more variables, and I'd like to use map, but map needs the variable names to be strings. So I have to go back and rewrite my function using .data[[var]]. I'd really like to have a solution where I can use the same function with either promises (is that the right term for referring to data-variables?) or character strings as variables, but I haven't been able to come up with one. Are there any suggestions?

Here's a simple example

library(dplyr)
library(tidyr)

iris <- iris %>% 
  mutate(group1 = ceiling(runif(nrow(iris), 0, 3)))

# this is a useful analysis block
iris %>% 
  group_by(Species) %>% 
  summarize(mean(Sepal.Length), 
            .groups = "drop")

# what about checking Sepal.Width?
sum_fn <- function(dat, groupvar, var) {
  dat %>% 
    group_by({{groupvar}}) %>% 
    summarize(mean({{var}}), 
              .groups = "drop")
}
sum_fn(iris, Species, Sepal.Width)

# now let's map it to a bunch of variables, and other groups too
vars <- names(iris[1:2])
groupvars <- c("Species", "group1")
analysis_list <- crossing(groupvars, vars)

# Oops, the function doesn't work: we are passing character strings where it expects bare variable names
purrr::map2(analysis_list$groupvars, analysis_list$vars, sum_fn, dat = iris)

# rewrite the function using .data
sum_fn2 <- function(dat, groupvar, var) {
  dat %>% 
    group_by(.data[[groupvar]]) %>% 
    summarize(mean(.data[[var]]), 
              .groups = "drop")
}

purrr::map2(analysis_list$groupvars, analysis_list$vars, sum_fn2, dat = iris)

I'd love to be able to put something like if(is.character(var)) {var = sym(var)} at the top of my function and be done with it.

I am quite familiar with the programming with dplyr vignette.

JacekB · January 14, 2021, 9:15am

Try pivot_longer() to move all your variables to longer form. Then you can use map() with only one variable.
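
One reading of this suggestion, as a minimal sketch: it assumes the analysis is a grouped mean like the one in the original post, uses `datasets::iris` so it runs on its own, and introduces the hypothetical names `long_iris`, `variable`, `value`, and `means_by_var`.

```r
library(dplyr)
library(tidyr)

# Reshape every measurement column into name/value pairs.
# `variable` and `value` are illustrative column names.
long_iris <- datasets::iris %>%
  pivot_longer(
    cols = -Species,
    names_to = "variable",
    values_to = "value"
  )

# The former column name is now ordinary data, so one grouped
# pipeline covers all variables at once.
means_by_var <- long_iris %>%
  group_by(Species, variable) %>%
  summarize(mean = mean(value), .groups = "drop")
```

The same long table could also be split by `variable` and passed to map() if the per-variable analysis is more involved.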

nirgrahamuk · January 14, 2021, 10:31am

My advice to you comes from my own workflow.
For 99% of my work (i.e. when I'm writing my own functions), my users will not interact with the functions directly: they are functions that other functions of mine call, or that I use to process data. Therefore I always pass variables as character strings (i.e. the names of the variables), and so I can always iterate with purrr functions with little difficulty. What are the costs and downsides of this approach? Possibly that, when writing out the params of a function call, I'm using the quote key a few times more than I would have had I used embracing to make the params bare variable symbols. To me this cost is so low it doesn't register. In fact, it becomes cleaner to understand the param call when I do code review, because I can differentiate a named object that is being passed from a simple character string that represents something (like a variable name).
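
A minimal sketch of this strings-only style, assuming a grouped mean as the analysis. `sum_by_name` and its argument names are hypothetical, and `datasets::iris` is used so the example runs on its own.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: both arguments are character vectors of column names.
sum_by_name <- function(dat, group_names, var_names) {
  dat %>%
    group_by(across(all_of(group_names))) %>%
    summarize(across(all_of(var_names), mean), .groups = "drop")
}

# Iterating is then plain purrr over strings.
vars <- c("Sepal.Length", "Sepal.Width")
results <- map(set_names(vars), ~ sum_by_name(datasets::iris, "Species", .x))
```

Here `results` is a list named after the variables, one summary table per entry.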

Lief · January 14, 2021, 4:54pm

Thank you, Jacek. The function contents I provided above are just a simplified example. Rewriting some generic, arbitrarily complex, analysis to work with data in a long format would be a similarly time consuming step that I'd like to avoid when extending an analysis to more variables.

Lief · January 14, 2021, 5:00pm

Thank you nirgrahamuk, I may end up doing that for my own code.

Unfortunately, the other use case is that I work with a number of people who are relative novices with R. dplyr and the tidyverse generally are great at being super approachable for beginners, but I am finding it really hard to help people get past the novice level. The particular pain point right now is getting colleagues to use functions instead of just copy-pasting code repeatedly and manually changing variable names. The embracing ({{ }}) method is a huge improvement in clarity over enquo and !!, but I'm still struggling with explaining (both to myself and to others) how Sepal.Width is different from "Sepal.Width" for code parsing.

This is particularly challenging for people coming from SAS and Stata.
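
One way to show the difference, as a base R sketch. It only approximates what data-masking verbs do internally, and the variable names are made up for illustration.

```r
# A bare name is a symbol: a reference that R looks up when it is evaluated.
sym_expr <- quote(Sepal.Width)
# A quoted name is a character string: a value that evaluates to itself.
str_expr <- "Sepal.Width"

is.symbol(sym_expr)     # TRUE
is.character(str_expr)  # TRUE

# Evaluate each one with the data frame as the lookup context,
# roughly the way a data-masking verb treats its arguments.
col_values <- eval(sym_expr, datasets::iris)  # the numeric column
just_text  <- eval(str_expr, datasets::iris)  # still the string itself

is.numeric(col_values)                   # TRUE
identical(just_text, "Sepal.Width")      # TRUE
```

Seen this way, .data[["Sepal.Width"]] is the step that turns the string back into a column lookup.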

siddharthprabhu · January 14, 2021, 5:06pm

You could just create symbols from the character vectors and then pass them to purrr::map2.

library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)

iris <- iris %>%
  mutate(group1 = ceiling(runif(nrow(iris), 0, 3)))

sum_fn <- function(dat, groupvar, var) {
  dat %>%
    group_by({{ groupvar }}) %>%
    summarize(mean({{ var }}),
      .groups = "drop"
    )
}

# now lets map it to a bunch of variables, and other groups too
vars <- names(iris[1:2])
groupvars <- c("Species", "group1")
analysis_list <- crossing(groupvars, vars)

# works if you pass lists of symbols
purrr::map2(syms(analysis_list$groupvars), syms(analysis_list$vars), sum_fn, dat = iris)
#> [[1]]
#> # A tibble: 3 x 2
#>   group1 `mean(Sepal.Length)`
#>    <dbl>                <dbl>
#> 1      1                 6.02
#> 2      2                 5.70
#> 3      3                 5.84
#> 
#> [[2]]
#> # A tibble: 3 x 2
#>   group1 `mean(Sepal.Width)`
#>    <dbl>               <dbl>
#> 1      1                3.12
#> 2      2                3.12
#> 3      3                2.96
#> 
#> [[3]]
#> # A tibble: 3 x 2
#>   Species    `mean(Sepal.Length)`
#>   <fct>                     <dbl>
#> 1 setosa                     5.01
#> 2 versicolor                 5.94
#> 3 virginica                  6.59
#> 
#> [[4]]
#> # A tibble: 3 x 2
#>   Species    `mean(Sepal.Width)`
#>   <fct>                    <dbl>
#> 1 setosa                    3.43
#> 2 versicolor                2.77
#> 3 virginica                 2.97

Created on 2021-01-14 by the reprex package (v0.3.0)

joels · January 14, 2021, 5:27pm

I often struggle with these sorts of tidyeval issues. The code below is flexible regarding the number of grouping columns and value columns, and will work with bare names or strings. I'm not sure if this is the "best" or "intended" way to use tidyeval, but it seems to work. (Maybe @lionel can provide additional guidance.)

library(tidyverse)

# Add a second grouping variable to iris
d = iris %>% mutate(group2 = rep(LETTERS[1:3], 50))

fnc = function(data, value.vars, group.vars=NULL) {
  data %>% 
    group_by(across({{group.vars}})) %>% 
    summarise(n=n(), across({{value.vars}}, mean, .names="mean_{.col}"))
}

First, show that the function works when we invoke it directly:

d %>% fnc(c(Petal.Width, Sepal.Width))
#> # A tibble: 1 x 3
#>       n mean_Petal.Width mean_Sepal.Width
#>   <int>            <dbl>            <dbl>
#> 1   150             1.20             3.06

d %>% fnc(c("Petal.Width", "Sepal.Width"))
#> # A tibble: 1 x 3
#>       n mean_Petal.Width mean_Sepal.Width
#>   <int>            <dbl>            <dbl>
#> 1   150             1.20             3.06

d %>% fnc(c(Petal.Width, Sepal.Width), Species)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 4
#>   Species        n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <int>            <dbl>            <dbl>
#> 1 setosa        50            0.246             3.43
#> 2 versicolor    50            1.33              2.77
#> 3 virginica     50            2.03              2.97

d %>% fnc(c("Petal.Width", "Sepal.Width"), "Species")
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 4
#>   Species        n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <int>            <dbl>            <dbl>
#> 1 setosa        50            0.246             3.43
#> 2 versicolor    50            1.33              2.77
#> 3 virginica     50            2.03              2.97

d %>% fnc(c(Petal.Width, Sepal.Width), c(Species, group2))
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> # A tibble: 9 x 5
#> # Groups:   Species [3]
#>   Species    group2     n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <chr>  <int>            <dbl>            <dbl>
#> 1 setosa     A         17            0.229             3.47
#> 2 setosa     B         17            0.247             3.45
#> 3 setosa     C         16            0.262             3.36
#> 4 versicolor A         17            1.29              2.68
#> 5 versicolor B         16            1.34              2.91
#> 6 versicolor C         17            1.35              2.74
#> 7 virginica  A         16            2.07              2.98
#> 8 virginica  B         17            2.09              3.02
#> 9 virginica  C         17            1.92              2.92

d %>% fnc(c("Petal.Width", "Sepal.Width"), c("Species", "group2"))
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> # A tibble: 9 x 5
#> # Groups:   Species [3]
#>   Species    group2     n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <chr>  <int>            <dbl>            <dbl>
#> 1 setosa     A         17            0.229             3.47
#> 2 setosa     B         17            0.247             3.45
#> 3 setosa     C         16            0.262             3.36
#> 4 versicolor A         17            1.29              2.68
#> 5 versicolor B         16            1.34              2.91
#> 6 versicolor C         17            1.35              2.74
#> 7 virginica  A         16            2.07              2.98
#> 8 virginica  B         17            2.09              3.02
#> 9 virginica  C         17            1.92              2.92

Now try mapping over combinations of grouping columns:

quos(NULL, Species, group2, c(Species, group2)) %>% 
  map(~fnc(d, c(Petal.Width, Sepal.Width), !!.x))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> [[1]]
#> # A tibble: 1 x 3
#>       n mean_Petal.Width mean_Sepal.Width
#>   <int>            <dbl>            <dbl>
#> 1   150             1.20             3.06
#> 
#> [[2]]
#> # A tibble: 3 x 4
#>   Species        n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <int>            <dbl>            <dbl>
#> 1 setosa        50            0.246             3.43
#> 2 versicolor    50            1.33              2.77
#> 3 virginica     50            2.03              2.97
#> 
#> [[3]]
#> # A tibble: 3 x 4
#>   group2     n mean_Petal.Width mean_Sepal.Width
#>   <chr>  <int>            <dbl>            <dbl>
#> 1 A         50             1.18             3.04
#> 2 B         50             1.22             3.13
#> 3 C         50             1.20             3.00
#> 
#> [[4]]
#> # A tibble: 9 x 5
#> # Groups:   Species [3]
#>   Species    group2     n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <chr>  <int>            <dbl>            <dbl>
#> 1 setosa     A         17            0.229             3.47
#> 2 setosa     B         17            0.247             3.45
#> 3 setosa     C         16            0.262             3.36
#> 4 versicolor A         17            1.29              2.68
#> 5 versicolor B         16            1.34              2.91
#> 6 versicolor C         17            1.35              2.74
#> 7 virginica  A         16            2.07              2.98
#> 8 virginica  B         17            2.09              3.02
#> 9 virginica  C         17            1.92              2.92

# Can also use "list" here instead of "quos"
quos(NULL, "Species", "group2", c("Species", "group2")) %>% 
  map(~fnc(d, c(Petal.Width, Sepal.Width), !!.x))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> [[1]]
#> # A tibble: 1 x 3
#>       n mean_Petal.Width mean_Sepal.Width
#>   <int>            <dbl>            <dbl>
#> 1   150             1.20             3.06
#> 
#> [[2]]
#> # A tibble: 3 x 4
#>   Species        n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <int>            <dbl>            <dbl>
#> 1 setosa        50            0.246             3.43
#> 2 versicolor    50            1.33              2.77
#> 3 virginica     50            2.03              2.97
#> 
#> [[3]]
#> # A tibble: 3 x 4
#>   group2     n mean_Petal.Width mean_Sepal.Width
#>   <chr>  <int>            <dbl>            <dbl>
#> 1 A         50             1.18             3.04
#> 2 B         50             1.22             3.13
#> 3 C         50             1.20             3.00
#> 
#> [[4]]
#> # A tibble: 9 x 5
#> # Groups:   Species [3]
#>   Species    group2     n mean_Petal.Width mean_Sepal.Width
#>   <fct>      <chr>  <int>            <dbl>            <dbl>
#> 1 setosa     A         17            0.229             3.47
#> 2 setosa     B         17            0.247             3.45
#> 3 setosa     C         16            0.262             3.36
#> 4 versicolor A         17            1.29              2.68
#> 5 versicolor B         16            1.34              2.91
#> 6 versicolor C         17            1.35              2.74
#> 7 virginica  A         16            2.07              2.98
#> 8 virginica  B         17            2.09              3.02
#> 9 virginica  C         17            1.92              2.92

You can also map over group-value pairs as in your example:

vars <- names(d[1:2])
groupvars <- c("Species", "group2")
analysis_list <- crossing(groupvars, vars)

analysis_list %>% pmap(~fnc(d, .y, .x))

lionel · January 14, 2021, 6:11pm

Thanks for the ping @joels.

Lief, I would not write a new function for this. I would adjust how your function is called from the map2:

purrr::map2(
  analysis_list$groupvars,
  analysis_list$vars,
  ~ sum_fn(iris, .data[[.x]], .data[[.y]])
)

joels · January 14, 2021, 6:58pm

Thanks Lionel. If you have time to comment on the function I wrote, I'm curious whether I'm using tidyeval appropriately for a function that can flexibly take any number of grouping and output-value columns, and that can handle both strings and bare column names.

lionel · January 14, 2021, 7:29pm

I agree that going through across() is a nice solution for multiple inputs.

Then when the variables are stored as strings in a character vector, the caller can use all_of() to pick up the corresponding columns. This would silence the messages that your last example is causing.
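
A sketch of what that call could look like, under some assumptions: `fnc2` is a hypothetical copy of the `fnc()` posted above with `.groups = "drop"` added, and `datasets::iris` is used so the example is self-contained.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Same shape as the fnc() posted above; `.groups = "drop"` is an addition
# made here so the output is always ungrouped.
fnc2 <- function(data, value.vars, group.vars = NULL) {
  data %>%
    group_by(across({{ group.vars }})) %>%
    summarise(
      n = n(),
      across({{ value.vars }}, mean, .names = "mean_{.col}"),
      .groups = "drop"
    )
}

d <- datasets::iris %>% mutate(group2 = rep(LETTERS[1:3], 50))

vars <- names(d)[1:2]
groupvars <- c("Species", "group2")
analysis_list <- crossing(groupvars, vars)

# all_of() marks each string as a column name held in an external vector.
results <- pmap(analysis_list, ~ fnc2(d, all_of(.y), all_of(.x)))
```

The function itself stays written with {{ }}; only the caller decides whether to pass bare names or all_of() around strings.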

Lief · January 15, 2021, 5:16pm

Thank you all for this conversation. It was super helpful to see the different approaches.

Lief · January 15, 2021, 5:20pm

I would also really appreciate it if you could point me to any resources that I can dig into to understand exactly what is going on with these different approaches. Would Advanced R's Quasiquotation section be the right place to start? I'm particularly interested in understanding what these two are doing that leads to the same result. Edit: I'm not asking for an explanation per se, more just some direction to where I can learn about it myself.

purrr::map2(syms(analysis_list$groupvars), syms(analysis_list$vars), sum_fn, dat = iris)

purrr::map2(
  analysis_list$groupvars,
  analysis_list$vars,
  ~ sum_fn(iris, .data[[.x]], .data[[.y]])
)

jgduenasl · January 17, 2021, 7:20pm

Hi, I have spent a lot of time dealing with tidyeval issues while defining summarising functions to analyze a dataset. Although there are many ways here to solve these issues, I would like to deeply understand the details and uses of the different functions. For example, it's common to pass variables as arguments to a summary function like this:

vars_summary <- function(data, ...){
 .group_vars <- enquos(...)
 data %>%
    group_by(!!!.group_vars) %>%
    summarise(N = n()) 
}

iris %>% vars_summary(Species, group1)

For a single group_var variable, this seems to be the same as using {{ group_var }}, as mentioned at the beginning. However, when I tried to define a function using this approach to pivot a data frame into a longer format, it didn't work. So I defined it using the ensym() function.

pivot_frame <- function(data, cols_preffix, target_var){
  cols_preffix <- rlang::ensym(cols_preffix)
  target_var <- rlang::ensym(target_var)
  df_pivot_frame <- data %>%
    pivot_longer(cols = starts_with(paste0(cols_preffix)),
                 names_to = paste0(target_var),
                 names_prefix = paste0(cols_preffix),
                 values_to = "has_record",
                 values_drop_na = TRUE) 
  return(df_pivot_frame)
}

Further, in a more complex function, I solved this kind of issue using rlang::as_name(enquo(var)). My point is that each time I need to define a function I have to struggle with all these tidyeval functions. I have read many blogs, the vignette, and the quasiquotation section of Advanced R mentioned before, and I still don't have a clear understanding of when I should use each of these tidyeval functions. I think I'm missing conceptual subtleties about strings, symbols, quotation, expressions, and their variations, as shown in defuse R expressions.
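
For the grouped count in `vars_summary()`, one commonly shown alternative is to forward `...` straight to `group_by()`. A minimal sketch, where `vars_summary_dots` is a hypothetical name and `datasets::iris` stands in for the modified `iris` used earlier in the thread:

```r
library(dplyr)

# Forward `...` unchanged; group_by() handles the bare names itself,
# so no enquos() or !!! is needed for this simple case.
vars_summary_dots <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(N = n(), .groups = "drop")
}

counts <- vars_summary_dots(datasets::iris, Species)
```

This only covers the case where the arguments are passed along untouched; inspecting or modifying them is where enquos() and friends come in.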

lionel · February 25, 2021, 8:10am

@jgduenasl

I think one problem in your pivot_frame() function is that you are trying to take prefixes in unquoted form. A prefix does not represent anything in the data frame, so this sort of NSE does not follow the principles of data-masking or tidy-selections. I think it's best to be disciplined and very conservative about NSE, this way your functions follow the same principles as tidyverse ones and their usage is easier to predict and remember.

So I think I would just rewrite pivot_frame() to take prefixes as strings.
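
A minimal sketch of such a rewrite, assuming the same pivot_longer() arguments as the original. The argument name `prefix` and the toy data frame are invented here for illustration.

```r
library(tidyr)

# `prefix` and `target_var` are plain strings, so nothing needs defusing.
# Note that names_prefix is treated as a regular expression by pivot_longer().
pivot_frame <- function(data, prefix, target_var) {
  pivot_longer(
    data,
    cols = starts_with(prefix),
    names_to = target_var,
    names_prefix = prefix,
    values_to = "has_record",
    values_drop_na = TRUE
  )
}

# Toy data, made up for this example only.
toy <- data.frame(id = 1:2, yr_2019 = c(TRUE, NA), yr_2020 = c(NA, TRUE))
long <- pivot_frame(toy, "yr_", "year")
```

Because both arguments are strings, this version can be driven directly from map() over a character vector of prefixes.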