Iterating over multiple fields/vectors

I am trying to create a data frame containing a set of calculations for different groups in the original data. A (silly) example is below:

library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.1

my_mean <- function(df, less_than) {
  cyl <- max(df$cyl)
  mean <- mean(df$mpg[df$mpg < less_than])
  results <- list(cyl, mean)
  names(results) <- c("cyl", paste("mean_lt", less_than, sep = "_"))
  results
}

mtcars %>% 
  split(.$cyl) %>% 
  map_dfr(~ my_mean(df = .x, less_than = 30))
#> # A tibble: 3 x 2
#>     cyl mean_lt_30
#>   <dbl>      <dbl>
#> 1     4       23.7
#> 2     6       19.7
#> 3     8       15.1

Created on 2018-10-19 by the reprex package (v0.2.1)

My question is whether it is possible to do the above for a set of levels of less_than. I could, of course, just use for over the levels and merge the resulting data frames on cyl, but where is the fun in that?! An example of what I am looking for using just two levels is at the end of the following:

library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.1

my_mean <- function(df, less_than) {
  cyl <- max(df$cyl)
  mean <- mean(df$mpg[df$mpg < less_than])
  results <- list(cyl, mean)
  names(results) <- c("cyl", paste("mean_lt", less_than, sep = "_"))
  results
}

t1 <- mtcars %>% 
  split(.$cyl) %>% 
  map_dfr(~ my_mean(df = .x, less_than = 30))

t2 <- mtcars %>% 
  split(.$cyl) %>% 
  map_dfr(~ my_mean(df = .x, less_than = 40))

merge(t1, t2)
#>   cyl mean_lt_30 mean_lt_40
#> 1   4   23.74286   26.66364
#> 2   6   19.74286   19.74286
#> 3   8   15.10000   15.10000

Created on 2018-10-19 by the reprex package (v0.2.1)

Finally, I should also mention that the calculations eventually will make their way into a bootstrapping set-up per group and level, but I am trying to do one step at the time.

1 Like

I'm no dplyr guru, but this works:

library(dplyr)

my_means <- function(df, less_thans) {
  means <- lapply(
    X   = less_thans,
    FUN = function(x) mean(df[["mpg"]][df[["mpg"]] < x])
  )
  names(means) <- paste0("mean_lt_", less_thans)
  means
}

data(mtcars)
limits <- c(15, 20, 25, 30, 35)

mtcars %>%
  split(.[["cyl"]]) %>%
  lapply(FUN = my_means, less_thans = limits) %>%
  bind_rows(.id = "cyl")
# # A tibble: 3 x 6
#   cyl   mean_lt_15 mean_lt_20 mean_lt_25 mean_lt_30 mean_lt_35
#   <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
# 1 4          NaN        NaN         22.6       23.7       26.7
# 2 6          NaN         18.7       19.7       19.7       19.7
# 3 8           12.6       15.1       15.1       15.1       15.1

That meets your requirements, and I don't know if you need exactly this shape, but I am worried you'll be bitten by an "untidy" bug. Embedding information in column names is a bad practice and can make the data a pain to use. It might be better to store those limits in an extra column.

library(tidyr)

my_means_vector <- function(df, less_thans) {
  vapply(
    X   = less_thans,
    FUN = function(x) mean(df[["mpg"]][df[["mpg"]] < x]),
    FUN.VALUE = numeric(1)
  )
}

mtcars %>%
  crossing(lim = limits) %>%
  group_by(cyl, lim) %>%
  summarise(mn = mean(mpg[mpg < lim]))
# # A tibble: 15 x 3
# # Groups:   cyl [?]
#      cyl   lim    mn
#    <dbl> <dbl> <dbl>
#  1     4    15 NaN  
#  2     4    20 NaN  
#  3     4    25  22.6
#  4     4    30  23.7
#  5     4    35  26.7
#  6     6    15 NaN  
#  7     6    20  18.7
#  8     6    25  19.7
#  9     6    30  19.7
# 10     6    35  19.7
# 11     8    15  12.6
# 12     8    20  15.1
# 13     8    25  15.1
# 14     8    30  15.1
# 15     8    35  15.1
4 Likes

Here is a way to do this using dplyr and purrr that utilizes group_by, summarize_at, and purrr::partial.

Putting aside the purrr::partial portion for now, I had to make changes to your my_means function to work with the group_by/summarize workflow instead of split/map. To see my thoughts on the differences you can see this thread. The function now takes a vector rather than a dataframe and returns only the mean (which meets the requirements of a function passed to summarize).

So, now the fun part. purrr::partial allows you to pass a function to it while setting different variables to change with each iteration of the function. If you call partial inside of a map call then these preset functions are conveniently saved to a list. Now the tricky part.. How do we run a list of functions on a specific subset of columns of our dataframe. Luckily, with rlang (here using functions reexported with dplyr) we can call our function list inside of the funs argument/function in summarize_at with !!!. This will output the results for each of the functions in the list as its own column and each row will contain a different group.

One other important thing to note is that if you want to call the list of functions from summarize_at as shown, the list has to be named. Hence, the reason for creating a dynamic list of names and using purrr::set_names to apply them.

Here is the reprex:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)


my_mean <- function(x, less_than) {
  mean <- mean(x[x < less_than])
  mean
}

my_cutoffs <- c(25, 30, 35)

my_means_names <- purrr::map(my_cutoffs, ~paste0("mean_lt_", .x))

my_partial_mean <- purrr::map(my_cutoffs, ~purrr::partial(my_mean, less_than = .x)) %>% 
  purrr::set_names(nm = my_means_names)


mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg), funs(!!!my_partial_mean))
#> # A tibble: 3 x 4
#>     cyl mean_lt_25 mean_lt_30 mean_lt_35
#>   <dbl>      <dbl>      <dbl>      <dbl>
#> 1     4       22.6       23.7       26.7
#> 2     6       19.7       19.7       19.7
#> 3     8       15.1       15.1       15.1

Created on 2018-10-19 by the reprex package (v0.2.0).

I recently wrote a blog post using this exact same workflow to calculate multiple quantiles for different groups with dplyr

6 Likes

Thank you so much @nwerth and @tbradley for the suggestions. Both of those would do the trick, but it looks like I can only pick one as the solution and I am slightly partial to the last one (no pun intended!) because it is easier to read. I need to get better at using purrr.

I have not thought about the tidy aspects of the data. That is a really good point, especially since they will be used later. I sometimes create tables like that, but I tend to only use them as the last stop before I have R output them to LaTeX.