Using purrr::map with default/absent fn arguments

cawthm · August 16, 2018, 10:43pm

I'm not understanding something fundamental about purrr::map and/or perhaps how nesting works.

# Suppose we have a simple function with a default-valued input argument k

toy_func <- function(k = 3) { rnorm(k) }

# Simple starting dataframe

library(tidyverse)

toy_df <- tibble(runs = 1:2)

# Now suppose we wish to add a variable of nested values using toy_func()

# 1) This fails
toy_df1 <- toy_df %>% mutate(new_var = map(.f = toy_func))

# 2) This sort of works, but it recycles the toy_func output into new_var

toy_df2 <- toy_df %>% mutate(new_var = map(.x = 3, .f = toy_func))

toy_df2 %>% unnest() # i.e., it only invokes toy_func once

# 3) this does what we want, but we seemingly need a dummy variable to pass in for use as .x

toy_df3 <- toy_df %>% mutate(dummy_x = 3, new_var = map(.x = dummy_x, .f = toy_func))
toy_df3 %>% unnest()

Question 1: Why does method 2 recycle, but not method 3?

Question 2: Is there a way to invoke map but tell it to use the default function value (eg, k=3 in toy_func())?

Question 3: More generally, if one wishes to create a variable in a df using map with a function that has only default arguments (or takes no arguments), is there a preferred construction for this pattern?

markdly · August 16, 2018, 11:47pm

Well, I can make a start at this one. Other more eloquent and knowledgeable
community members can no doubt improve on this response!

To answer Question 1: to create toy_df2 you pass only one value (3) to map(), so toy_func() runs once and mutate() recycles that single result into every row. You can
pass a vector c(3, 3) instead, as in the example below. This works the same way as the toy_df3 code, where the recycling happens during the creation of dummy_x. That is, dummy_x is a vector of length 2, and map() then calls toy_func() once per element.

library(tidyverse)
toy_func <- function(k = 3) { rnorm(k) }
toy_df <- tibble(runs = 1:2)

set.seed(123)
toy_df %>% mutate(new_var = map(c(3, 3), toy_func)) %>% unnest()
#> # A tibble: 6 x 2
#>    runs new_var
#>   <int>   <dbl>
#> 1     1 -0.560 
#> 2     1 -0.230 
#> 3     1  1.56  
#> 4     2  0.0705
#> 5     2  0.129 
#> 6     2  1.72
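
To make the difference between methods 2 and 3 concrete, here is a minimal sketch that looks only at what map() returns, before mutate() gets involved. The names one_call and two_calls are just illustrative and are not from the original question.

```r
library(purrr)

toy_func <- function(k = 3) { rnorm(k) }

set.seed(123)
one_call  <- map(3, toy_func)        # .x has one element, so toy_func runs once
two_calls <- map(c(3, 3), toy_func)  # .x has two elements, so toy_func runs twice

length(one_call)   # 1: mutate() would recycle this single element into every row
length(two_calls)  # 2: one independent draw per row
```

So the recycling in method 2 is done by mutate(), not by map(): map() simply returns one list element per element of .x.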

In terms of Question 2, I'm not sure how to use map() when I only want to use the default function arguments. I'd use rerun() instead.

In summary, and in answer to Question 3, I'd use rerun() rather than map() as shown in the example below if only the default function arguments are required for toy_func().

set.seed(123)
toy_df %>% mutate(new_var = rerun(nrow(.), toy_func())) %>% unnest()
#> # A tibble: 6 x 2
#>    runs new_var
#>   <int>   <dbl>
#> 1     1 -0.560 
#> 2     1 -0.230 
#> 3     1  1.56  
#> 4     2  0.0705
#> 5     2  0.129 
#> 6     2  1.72

Created on 2018-08-17 by the reprex package (v0.2.0).
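
For Question 2, one possible way to stay with map() is to iterate over any column that has one element per row and ignore the value inside a lambda, so that toy_func() falls back to its default k = 3. This is only a sketch and toy_df4 is a made-up name. It may also be worth knowing because rerun() was, as far as I'm aware, deprecated in purrr 1.0.0.

```r
library(dplyr)
library(purrr)
library(tibble)

toy_func <- function(k = 3) { rnorm(k) }
toy_df <- tibble(runs = 1:2)

set.seed(123)
toy_df4 <- toy_df %>%
  mutate(new_var = map(runs, ~ toy_func()))  # the lambda ignores each runs value

lengths(toy_df4$new_var)  # 3 3: the default k was used for every row
```

Because map() calls the lambda once per element of runs, each row gets its own independent draw rather than a recycled one.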

rensa · August 16, 2018, 11:51pm

The map family is super powerful, but I still find myself getting turned around by it—especially when you start using it in pipes.

I think, before we look at your examples, it's worth contrasting them with a simpler one: just making a new column.

toy_df0 <- toy_df %>% mutate(new_var = toy_func())
# Error in mutate_impl(.data, dots) : Column `new_var` must be length 2 (the number of rows) or one, not 3

toy_df0 <- toy_df %>% mutate(new_var = toy_func(2))
toy_df0
# A tibble: 2 x 2
#    runs new_var
#   <int>   <dbl>
# 1     1   -1.21
# 2     2   -1.30

toy_df0 <- toy_df %>% mutate(new_var = toy_func(1))
toy_df0
# A tibble: 2 x 2
#    runs new_var
#   <int>   <dbl>
# 1     1   -1.68
# 2     2   -1.68

In these examples, toy_func runs once each. The first time, it runs with the default argument, k = 3, and that throws an error because it doesn't fit in the existing data frame. With k = 2 it fits perfectly. With k = 1 the vector is "recycled", being concatenated with itself until it fits.

Part of the difficulty of using map in pipes is recognising the context in which the pipe operator works. When you're using map, you're running toy_func several times, once for each element of the first argument, .x, and that's missing in your first example (that's what the error means). The pipe is passing toy_df as the first argument to mutate, but not to map. So what you're really running is:

toy_df1 <- mutate(toy_df, new_var = map(.f = toy_func))

The next two examples are identical, and are maybe what you intended to express:

toy_df1 <- mutate(toy_df, new_var = map(toy_df, .f = toy_func))
toy_df1 <- toy_df %>% mutate(new_var = map(., .f = toy_func))

In the second version, the pipe passes toy_df to mutate, but you can then use it again with ., as I have in map.
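
Here is a small sketch of that pipe behaviour outside of mutate and map. The describe function is a hypothetical helper I made up for illustration: the left-hand side still goes in as the first argument, and . refers to the same object when it appears inside a nested call.

```r
library(magrittr)

# Hypothetical helper: reports the class of its first argument and a number n
describe <- function(data, n) paste(class(data)[1], "with n =", n)

piped  <- 1:4 %>% describe(n = length(.))   # 1:4 is the first argument AND the dot
direct <- describe(1:4, n = length(1:4))    # the same call written without the pipe

identical(piped, direct)  # TRUE
```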

If we look at the output here, we can see that it's definitely different to my examples:

toy_df1
# A tibble: 2 x 2
#    runs new_var  
#   <int> <list>   
# 1     1 <dbl [2]>
# 2     2 <dbl [2]>

toy_df1$new_var
# [[1]]
# [1] -0.02948677 -0.20796988

# [[2]]
# [1] -0.02948677 -0.20796988

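It's nested: toy_func returns a whole vector, and that vector is stored as a single element of the list column rather than being spread across rows. Note that both elements are identical here: toy_df has only one column, so map called toy_func once, and mutate recycled the resulting one-element list across both rows.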

The question is, why is each vector of length 2 and not the default, 3?

When map is called, its first argument, .x, gets broken up element-by-element and toy_func is called on each element. In the previous examples where you passed toy_df all the way into map using the . pronoun, toy_df is the .x argument, so it is what gets broken up. Under the hood, data frames are really lists (with each column being a list element), so when you do one of these...

toy_df1 <- mutate(toy_df, new_var = map(toy_df, .f = toy_func))
toy_df1 <- toy_df %>% mutate(new_var = map(., .f = toy_func))

… you're actually kind of doing this:

toy_df[[1]]
# [1] 1 2

toy_func(toy_df[[1]])
# [1] -0.5275058  0.0256864

# if toy_df had more columns, it'd then be:
# toy_func(toy_df[[2]])
# toy_func(toy_df[[3]])
# etc.
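
If you want to check this for yourself, here's a quick sketch that shows a data frame being treated as a list of columns, and map calling toy_func once per column rather than once per row. The name res is just for illustration.

```r
library(purrr)
library(tibble)

toy_func <- function(k = 3) { rnorm(k) }
toy_df <- tibble(runs = 1:2)

is.list(toy_df)   # TRUE: a data frame is a list of columns
length(toy_df)    # 1: one column, so map() will make one call

res <- map(toy_df, toy_func)  # iterates over columns, not rows
length(res)       # 1: a single call to toy_func
names(res)        # "runs": the result is named after the column
length(res$runs)  # 2: toy_func received the whole runs column as k
```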

Since the data frame has two rows, you're passing a two-element vector to rnorm each time. And, unfortunately, rnorm is perhaps a little too happy to make do with that. From the rnorm documentation:

n: number of observations. If length(n) > 1, the length is taken to be the number required.

So by passing the data frame's columns on to your toy_func, you end up overriding the default argument: not with a constant k, but with a vector whose length rnorm takes to be k.
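
You can see this rnorm behaviour on its own with a couple of one-liners. This is just a sketch using base R, with arbitrary input vectors I picked for illustration:

```r
set.seed(1)

length(rnorm(5))          # 5: a single number is used as the count
length(rnorm(c(10, 20)))  # 2: the values are ignored, only the length matters
length(rnorm(1:2))        # 2: the same thing that happens with the runs column
```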

I'm wondering if you wanted to pass the value in each row of the runs column in as k. So for the first row, runs is 1 and you get rnorm(1) (one random number); for the second, rnorm(2) (two random numbers), etc. And then you unnest that. Is that fair?

If that's the case, what you want to do is have map break the runs column up element-by-element and give each element to a separate toy_func run, rather than passing the whole data frame column-by-column. This would do the trick:

toy_df2 = toy_df %>% mutate(new_var = map(.$runs, .f = toy_func))
toy_df2
# A tibble: 2 x 2
#    runs new_var  
#   <int> <list>   
# 1     1 <dbl [1]>
# 2     2 <dbl [2]>

toy_df2 %>% unnest()
# A tibble: 3 x 2
#    runs new_var
#   <int>   <dbl>
# 1     1  -0.887
# 2     2   1.80 
# 3     2   1.08

The magic here is .$runs. The pipe passes toy_df to mutate, and then you recall it in map with ., but you use it with the dollar sign to narrow it down to one column.
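
As a side note, inside mutate you can usually refer to the column by its bare name as well. Here's a sketch comparing the two spellings (with_dot and with_name are made-up names); for an ungrouped data frame like this one they should give the same result. One difference to be aware of is that .$runs always means the whole piped-in data frame, so the two can diverge on grouped data.

```r
library(dplyr)
library(purrr)
library(tibble)

toy_func <- function(k = 3) { rnorm(k) }
toy_df <- tibble(runs = 1:2)

set.seed(42)
with_dot <- toy_df %>% mutate(new_var = map(.$runs, .f = toy_func))

set.seed(42)
with_name <- toy_df %>% mutate(new_var = map(runs, toy_func))  # bare column name

identical(with_dot$new_var, with_name$new_var)  # TRUE
lengths(with_name$new_var)                      # 1 2: one draw for row 1, two for row 2
```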

I think you have the right idea about default arguments, but the mechanics of map combined with the mechanics of the pipe make things complicated really quickly. I hope that helps!