parse vector of functions in mutate

bhive01 · September 17, 2019, 6:20pm

I'm trying to summarize data in a programmatic way across rows that I've created with another function and rlang.
This example has 2 traits and 3 materials, but I don't know how many there are in advance and I need to take these predictions and average them.

library(rlang)
library(dplyr)

set.seed(42)

df1 <- tibble(PLTID = 1:10, vigor_M1_pred = runif(10), vigor_M2_pred = runif(10), vigor_M3_pred = runif(10), senes_M1_pred = runif(10), senes_M2_pred = runif(10), senes_M3_pred = runif(10))


traits <- c("vigor", "senes")
mats <- c("M1", "M2", "M3")

# this gives me what I want, but I want to do it programmatically
df1 %>% 
	mutate(
		vigor_avg = rowMeans(select(., starts_with("vigor"))),
		senes_avg = rowMeans(select(., starts_with("senes")))
		)


funs2 <-  setNames(paste('rowMeans(select(., starts_with("', traits, '"), na.rm = TRUE)', sep = ""), paste0(traits, "_avg"))

#can't parse this
df1 %>% mutate(., !!!rlang::parse_exprs(funs2))
#Error in parse(text = x) : <text>:1:55: unexpected ';'
#1: rowMeans(select(., starts_with("vigor"), na.rm = TRUE);

I'd like to be able to generate the names and the function calls using the traits, but I can't seem to get it to happen.
I was pulling a lot from the discussion here:

Passing named list to mutate (and probably other dplyr verbs) tidyverse

Hi, I want to write a function that is given a named list which is then passed on to mutate() in a way that each element of the list is an argument to mutate(). I cannot get this right, either with the new quotation/quasi-quotation syntax or with the old mutate_() and would appreciate some help. Small example:

foo <- function(x, args) {
  args <- enquote(args)
  mutate(x, UQS(args))
}
foo(mtcars, args=list(cyl2=cyl*2))

foo <- function(x, args) {
  mutate_(x, .dots=args)
}
foo(mtcars, args=list(cyl2=cyl*2))

In both cases I get object 'cyl' not found when cyl exists in mtcars. I suppose the expression is not evaluated in the correct environment but I am not sure why. PS: I know that I cou…

Jiho has non-quoted arguments though. I'm trying to paste together the function and then parse it.
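
For what it is worth, the parse error quoted above points at unbalanced parentheses: in the pasted string, select( is opened but never closed before na.rm = TRUE. Below is a minimal sketch of the string route with the missing parenthesis restored. It assumes a reduced version of df1 with two materials per trait, and the names code and avg_exprs are invented here for illustration. The parsed expressions are named explicitly with setNames() rather than relying on parse_exprs() to carry names over.

```r
library(dplyr)
library(rlang)

set.seed(42)

# Reduced example data (assumption: same naming scheme as df1 in the question)
df1 <- tibble(PLTID = 1:10,
              vigor_M1_pred = runif(10), vigor_M2_pred = runif(10),
              senes_M1_pred = runif(10), senes_M2_pred = runif(10))
traits <- c("vigor", "senes")

# Note the second ")" after starts_with(...): it closes select()
code <- paste0('rowMeans(select(., starts_with("', traits, '")), na.rm = TRUE)')

# Parse each string into an expression, then attach the output column names
avg_exprs <- setNames(parse_exprs(code), paste0(traits, "_avg"))

# Splice the named expressions into mutate()
res <- df1 %>% mutate(., !!!avg_exprs)
```

This only shows that the strings can be made to parse; building the calls as expressions avoids this class of typo entirely.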

mishabalyasin · September 17, 2019, 8:04pm

Pasting strings together is always fraught with difficulties. rlang provides you with a rich toolkit for working with raw expressions, so in your example you can do something like this:

library(rlang)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(42)

df1 <- tibble(PLTID = 1:10, 
              vigor_M1_pred = runif(10), 
              vigor_M2_pred = runif(10), 
              vigor_M3_pred = runif(10), 
              senes_M1_pred = runif(10), 
              senes_M2_pred = runif(10), 
              senes_M3_pred = runif(10))


traits <- c("vigor", "senes")
mats <- c("M1", "M2", "M3")


# funs2 <-  setNames(paste('rowMeans(select(., starts_with("', traits, '"), na.rm = TRUE)', sep = ""), paste0(traits, "_avg"))
funs2 <- purrr::map(traits, function(trait){
  rlang::expr(rowMeans(select(., starts_with(!!trait)), na.rm = TRUE))
}) %>%
  purrr::set_names(paste0(traits, "_avg"))

# splice the named list of expressions into mutate()
df1 %>% mutate(., !!!funs2)
#> # A tibble: 10 x 9
#>    PLTID vigor_M1_pred vigor_M2_pred vigor_M3_pred senes_M1_pred
#>    <int>         <dbl>         <dbl>         <dbl>         <dbl>
#>  1     1         0.915         0.458        0.904        0.738  
#>  2     2         0.937         0.719        0.139        0.811  
#>  3     3         0.286         0.935        0.989        0.388  
#>  4     4         0.830         0.255        0.947        0.685  
#>  5     5         0.642         0.462        0.0824       0.00395
#>  6     6         0.519         0.940        0.514        0.833  
#>  7     7         0.737         0.978        0.390        0.00733
#>  8     8         0.135         0.117        0.906        0.208  
#>  9     9         0.657         0.475        0.447        0.907  
#> 10    10         0.705         0.560        0.836        0.612  
#> # … with 4 more variables: senes_M2_pred <dbl>, senes_M3_pred <dbl>,
#> #   vigor_avg <dbl>, senes_avg <dbl>

Created on 2019-09-17 by the reprex package (v0.3.0)
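
Since the number of traits is not known in advance, the same idea could be wrapped in a small function that derives the trait names from the column names. This is only a sketch: add_trait_means() is a hypothetical helper, and it assumes every prediction column is named <trait>_<material>_pred with no underscore inside the trait name.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: adds one <trait>_avg column per trait.
add_trait_means <- function(data, traits = NULL) {
  if (is.null(traits)) {
    # Assumption: prediction columns look like <trait>_<material>_pred
    pred_cols <- grep("_pred$", names(data), value = TRUE)
    traits <- unique(sub("_.*$", "", pred_cols))
  }

  # Build one rowMeans() call per trait as an expression, not a string
  avg_exprs <- lapply(traits, function(trait) {
    prefix <- paste0(trait, "_")
    expr(rowMeans(select(., starts_with(!!prefix)), na.rm = TRUE))
  })
  names(avg_exprs) <- paste0(traits, "_avg")

  # The pipe binds `.` to `data`, which the spliced expressions refer to
  data %>% mutate(!!!avg_exprs)
}

# Small usage example with made-up numbers
toy <- tibble(PLTID = 1:2,
              vigor_M1_pred = c(1, 2), vigor_M2_pred = c(3, 4),
              senes_M1_pred = c(10, 20), senes_M2_pred = c(NA, 40))
toy_avg <- add_trait_means(toy)
```

Passing traits explicitly, for example add_trait_means(toy, traits = "vigor"), skips the name detection.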

bhive01 · September 17, 2019, 8:35pm

So my issue was going the string route instead of building the function call as an expression.

Very nice solution. Works well. Cheers.

ChenxinLi · September 18, 2019, 4:13am

library(tidyr)
library(dplyr)

set.seed(42)

df1 <- tibble(PLTID = 1:10, 
              vigor_M1_pred = runif(10), 
              vigor_M2_pred = runif(10), 
              vigor_M3_pred = runif(10), 
              senes_M1_pred = runif(10), 
              senes_M2_pred = runif(10), 
              senes_M3_pred = runif(10))


traits <- c("vigor", "senes")
mats <- c("M1", "M2", "M3")

df2 <- df1 %>% 
  gather("tag", "value", 2:ncol(df1)) %>% 
  separate(tag, c("trait", "material", "pred"), sep = "_") %>% 
  group_by(trait, material) %>% 
  summarise(mean = mean(value)) %>% 
  ungroup()

df2

#Easy to read solution!

ChenxinLi · September 18, 2019, 4:26am

The challenge here is that the variables "trait" and "material" are encoded in the column names, so the data are not tidy.

I simply tidied the data:

Gathered all the column names into a tag column, which holds all the info.
Split the tag column into trait and material columns. If the tag column is more complex, you can always use case_when (in dplyr) and str_detect (in package stringr), for example,

...%>%
    mutate(trait = case_when(
        str_detect(tag, "vig") ~ "vigor",
        str_detect(tag, "sene") ~ "senes"
))
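
A self-contained version of that fragment might look like the sketch below. The tags tibble is made up for illustration, and the patterns are anchored with ^ so that they only match at the start of the tag; the final TRUE branch is an addition that makes unmatched tags explicit.

```r
library(dplyr)
library(stringr)

# Made-up tags following the <trait>_<material>_pred naming scheme
tags <- tibble(tag = c("vigor_M1_pred", "senes_M3_pred", "other_M1_pred"))

tagged <- tags %>%
  mutate(trait = case_when(
    str_detect(tag, "^vig")  ~ "vigor",
    str_detect(tag, "^sene") ~ "senes",
    TRUE                     ~ NA_character_  # anything unmatched stays NA
  ))
```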

dplyr has elegant and powerful split-apply-combine operations (group_by in conjunction with summarise), so you will never need to use rowMeans or rowSums again!

bhive01 · September 18, 2019, 2:13pm

Chenxin,
Thanks for your response too. After posting I did try and do this using pivot_longer/wider (gather/spread) and summarize. I originally wanted to use rowMeans because I couldn't do gather/spread due to it being a spatial tibble. https://github.com/r-spatial/sf/issues/1149
However, removing the geometry, isolating important columns and applying this strategy would have been an equally good solution.
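
For reference, a sketch of what the pivot_longer/pivot_wider route could look like on plain (non-spatial) data. Grouping by trait and material, as in the gather() example above, averages over plots; grouping by PLTID and trait instead gives one average per plot and trait, which matches the rowMeans result. The sketch assumes tidyr >= 1.1.0 (for names_glue) and dplyr >= 1.0.0 (for .groups), and the names row_avgs and res are invented here.

```r
library(dplyr)
library(tidyr)

set.seed(42)

# Reduced example data (assumption: same naming scheme as df1 in the question)
df1 <- tibble(PLTID = 1:10,
              vigor_M1_pred = runif(10), vigor_M2_pred = runif(10),
              senes_M1_pred = runif(10), senes_M2_pred = runif(10))

row_avgs <- df1 %>%
  # One row per plot, trait and material
  pivot_longer(ends_with("_pred"),
               names_to = c("trait", "material"),
               names_pattern = "^([^_]+)_([^_]+)_pred$",
               values_to = "value") %>%
  # Average over materials within each plot and trait
  group_by(PLTID, trait) %>%
  summarise(avg = mean(value, na.rm = TRUE), .groups = "drop") %>%
  # Back to one row per plot, with columns named <trait>_avg
  pivot_wider(names_from = trait, values_from = avg,
              names_glue = "{trait}_avg")

# Attach the averages to the original columns
res <- left_join(df1, row_avgs, by = "PLTID")
```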

The column name coding was on purpose: it comes from a previous map2(mutate) step that generates the input data frame. My thinking was that standardizing the names lets one use the select helpers to isolate these columns and run row-wise functions on them.
