The map family is super powerful, but I still find myself getting turned around by it, especially when you start using it in pipes.
I think, before we look at your examples, it's worth contrasting them with a simpler one: just making a new column.
toy_df0 <- toy_df %>% mutate(new_var = toy_func())
# Error in mutate_impl(.data, dots) : Column `new_var` must be length 2 (the number of rows) or one, not 3
toy_df0 <- toy_df %>% mutate(new_var = toy_func(2))
toy_df0
# A tibble: 2 x 2
# runs new_var
# <int> <dbl>
# 1 1 -1.21
# 2 2 -1.30
toy_df0 <- toy_df %>% mutate(new_var = toy_func(1))
toy_df0
# A tibble: 2 x 2
# runs new_var
# <int> <dbl>
# 1 1 -1.68
# 2 2 -1.68
In these examples, toy_func runs once each. The first time, it runs with the default argument, k = 3, and that throws an error because a vector of three values doesn't fit in a data frame with two rows. With k = 2 it fits perfectly. With k = 1 the single value is "recycled", being repeated until it fills every row.
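To make the three cases easy to reproduce, here is a minimal sketch. The definitions of toy_df and toy_func are my assumptions, reconstructed from the output above (a two-row tibble with a runs column, and a function that returns k random normals with k defaulting to 3); your real versions may differ.

```r
library(dplyr)

# Assumed definitions, reconstructed from the output above
toy_df <- tibble(runs = 1:2)
toy_func <- function(k = 3) rnorm(k)

# k = 3 (the default): three values cannot fill two rows, so mutate() errors
res_default <- try(mutate(toy_df, new_var = toy_func()), silent = TRUE)
inherits(res_default, "try-error")

# k = 2: exactly one value per row
res_two <- mutate(toy_df, new_var = toy_func(2))

# k = 1: the single value is recycled into every row
res_one <- mutate(toy_df, new_var = toy_func(1))
res_one$new_var[1] == res_one$new_var[2]
```

In all three cases toy_func is called exactly once, which is the key contrast with map.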
Part of what makes map tricky in pipes is recognising the context in which the pipe operator works. When you're using map, you're running toy_func several times according to the first argument, .x, and that's missing in the first example (that's what the error means). The pipe is passing toy_df as the first argument to mutate, but not to map. So what you're really running is:
toy_df1 <- mutate(toy_df, new_var = map(.f = toy_func))
The next two examples are equivalent to each other, and are maybe what you intended to express:
toy_df1 <- mutate(toy_df, new_var = map(toy_df, .f = toy_func))
toy_df1 <- toy_df %>% mutate(new_var = map(., .f = toy_func))
In the second version, the pipe passes toy_df to mutate, but you can then use it again with the dot (.), as I have in map.
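Here's a small sketch of that pipe behaviour, using an assumed stand-in for toy_df with a single runs column. When the dot appears only inside a nested call, the pipe still passes the left-hand side as the first argument of the outer function, and the dot refers to the whole data frame.

```r
library(dplyr)

toy_df <- tibble(runs = 1:2)  # assumed stand-in for your data

# toy_df is passed to mutate() as its first argument, and `.` inside
# ncol() is that same whole data frame, not the current column or row
piped <- toy_df %>% mutate(n_cols = ncol(.))

# The equivalent call written without the pipe
unpiped <- mutate(toy_df, n_cols = ncol(toy_df))

identical(piped$n_cols, unpiped$n_cols)
```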
If we look at the output here, we can see that it's definitely different to my examples:
toy_df1
# A tibble: 2 x 2
# runs new_var
# <int> <list>
# 1 1 <dbl [2]>
# 2 2 <dbl [2]>
toy_df1$new_var
# [[1]]
# [1] -0.02948677 -0.20796988
# [[2]]
# [1] -0.02948677 -0.20796988
It's nested: each call to toy_func produces a vector, and each vector becomes one element in the list column. This is different from vector recycling.
The question is, why is each vector of length 2 and not the default, 3?
When map is called, its first argument, .x, gets broken up element-by-element and toy_func is called on each element. In the previous examples, where you passed toy_df all the way into map (directly or using the . pronoun), toy_df is what gets broken up. Under the hood, data frames are really lists (with each column being a list element), so when you do one of these...
toy_df1 <- mutate(toy_df, new_var = map(toy_df, .f = toy_func))
toy_df1 <- toy_df %>% mutate(new_var = map(., .f = toy_func))
… you're actually kind of doing this:
toy_func(toy_df[[1]])
# [1] 1 2
# [1] -0.5275058 0.0256864
# if toy_df had more columns, it'd then be:
# toy_func(toy_df[[2]])
# toy_func(toy_df[[3]])
# etc.
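You can check the "data frames are lists of columns" idea directly. This is a sketch on an assumed stand-in for toy_df, with a hypothetical second column (label) added so there is more than one column to iterate over.

```r
library(purrr)

# Assumed stand-in; the label column is hypothetical, added for illustration
toy_df <- data.frame(runs = 1:2, label = c("a", "b"))

is.list(toy_df)   # a data frame is a list of columns
length(toy_df)    # so its length is the number of columns, not rows

# map() therefore visits each column in turn, not each row
col_lengths <- map(toy_df, length)
col_lengths
```

Each call receives a whole column, so every element of col_lengths is the number of rows.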
Since the data frame has two rows, you're passing a two-element vector to rnorm each time. And, unfortunately, rnorm is perhaps a little too happy to make do with that. From the rnorm documentation:
n: number of observations. If length(n) > 1, the length is taken to be the number required.
So by passing the data frame's columns on to your toy_func, you end up overriding the default argument: not with a constant k, but with a vector whose length is taken to be k by rnorm.
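That rule is easy to see in isolation. A quick sketch in base R, with arbitrary numbers chosen only for illustration:

```r
# A vector for n: rnorm() uses its length, not its values
n_from_vector <- length(rnorm(c(10, 20)))
n_from_vector   # 2 values, not 10 or 20

# A single number for n: rnorm() uses the value itself
n_from_scalar <- length(rnorm(3))
n_from_scalar   # 3 values
```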
I'm wondering if you wanted to pass the value in each row of the runs column in as k. So for the first row, runs is 1 and you get rnorm(1) (one random number); for the second, rnorm(2) (two random numbers), etc. And then you unnest that. Is that fair?
If that's the case, what you want to do is have map break the runs column up element-by-element and give each element to its own toy_func run, rather than breaking the whole data frame up column-by-column. This would do the trick:
toy_df2 <- toy_df %>% mutate(new_var = map(.$runs, .f = toy_func))
toy_df2
# A tibble: 2 x 2
# runs new_var
# <int> <list>
# 1 1 <dbl [1]>
# 2 2 <dbl [2]>
# naming the column works in older tidyr too; newer versions warn if it's omitted
toy_df2 %>% unnest(new_var)
# A tibble: 3 x 2
# runs new_var
# <int> <dbl>
# 1 1 -0.887
# 2 2 1.80
# 3 2 1.08
The magic here is .$runs. The pipe passes toy_df to mutate, and then you recall it in map with the dot (.), but you use it with the dollar sign to narrow it down to one column.
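As an aside, here's a sketch of an alternative spelling. Inside mutate, column names can be used directly, so map(runs, toy_func) should give the same structure without the dot. As before, toy_df and toy_func are assumed definitions (a two-row tibble and a function returning k random normals), not necessarily yours.

```r
library(dplyr)
library(purrr)

# Assumed definitions, for illustration only
toy_df <- tibble(runs = 1:2)
toy_func <- function(k = 3) rnorm(k)

# The dot version: pull the column out of the piped data frame
with_dot <- toy_df %>% mutate(new_var = map(.$runs, .f = toy_func))

# The bare-name version: mutate() looks up runs in the data frame itself
bare <- toy_df %>% mutate(new_var = map(runs, toy_func))

lengths(with_dot$new_var)
lengths(bare$new_var)
```

Both give one vector per row, with lengths matching runs. One difference I'm aware of: on a grouped data frame the dot is still the whole table, whereas runs is evaluated within each group.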
I think you have the right idea about default arguments, but the mechanics of map combined with the mechanics of the pipe make things complicated really quickly.
I hope that helps!