Pros and cons of split() vs nest() with map() workflows

thomas · March 23, 2018, 7:12pm

In examples using the purrr::map() family of functions, I see both split() and nest() being used for generating the inputs to the map function.

Questions:

What are the pros and cons of the two approaches?
Should either approach be preferred or recommended over the other, in general or for particular problem types?

Context:
I am helping develop best practices for my R-using colleagues, most of whom are new to purrr, so I want to get them started in the "right" direction. I've thought about it but can't come up with a good reason to suggest one over the other, so I wanted to see whether the community could share opinions or reasons for their preference.

My personal preference is nest(), and it seems to me to be slightly more flexible and transparent - albeit a bit more verbose. However, that preference may be due to my underdeveloped "base" R skills (e.g. direct manipulation of lists).

If this question is too "opinion-y" for this forum, I'm happy to withdraw it.

Here is an example of almost the same analysis done with both approaches. (I couldn't quickly figure out how to get a column with cyl in the output of the split()-based method.)

> library(dplyr)
> library(tidyr)
> library(purrr)
> 
> # use nest() 
> mtcars %>% 
+   group_by(cyl) %>% 
+   nest() %>% 
+   mutate(mod_obj   = map(data, ~lm(mpg ~ wt, data = .x)),
+          summaries = map(mod_obj, broom::glance)) %>%
+   select(cyl, summaries) %>% 
+   unnest(summaries)
# A tibble: 3 x 12
    cyl r.squared adj.r.squared    sigma statistic    p.value    df    logLik      AIC      BIC  deviance df.residual
  <dbl>     <dbl>         <dbl>    <dbl>     <dbl>      <dbl> <int>     <dbl>    <dbl>    <dbl>     <dbl>       <int>
1     6 0.4645102     0.3574122 1.165202  4.337245 0.09175766     2  -9.82518 25.65036 25.48809  6.788481           5
2     4 0.5086326     0.4540362 3.332283  9.316233 0.01374278     2 -27.74487 61.48974 62.68342 99.936983           9
3     8 0.4229655     0.3748793 2.024091  8.795985 0.01179281     2 -28.65778 63.31555 65.23272 49.163336          12
> 
> # use split()
> mtcars %>% 
+   split(.$cyl) %>% 
+   map(~lm(mpg ~ wt, data = .)) %>% 
+   map(~broom::glance(.)) %>% 
+   reduce(bind_rows)
  r.squared adj.r.squared    sigma statistic    p.value df    logLik      AIC      BIC  deviance df.residual
1 0.5086326     0.4540362 3.332283  9.316233 0.01374278  2 -27.74487 61.48974 62.68342 99.936983           9
2 0.4645102     0.3574122 1.165202  4.337245 0.09175766  2  -9.82518 25.65036 25.48809  6.788481           5
3 0.4229655     0.3748793 2.024091  8.795985 0.01179281  2 -28.65778 63.31555 65.23272 49.163336          12

tbradley · March 23, 2018, 7:37pm

Personally I like the nest() method as well. I think the big benefit of the nest method is that you can keep everything organized nicely. Looking at your example, the noticeable difference is that the nest method kept track of which cyl each model result was for. To take it one step further, say you wanted both the broom::glance() output and the broom::tidy() results for each model. This is easy to do and keep organized with nest():

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# use nest() assigning it to model_results
model_results <- mtcars %>% 
 group_by(cyl) %>% 
 nest() %>% 
 mutate(mod_obj = map(data, ~lm(mpg ~ wt, data = .x)),
        summaries = map(mod_obj, broom::glance),
        model_coef = map(mod_obj, broom::tidy)) 
  
model_results
#> # A tibble: 3 x 5
#>     cyl data               mod_obj  summaries             model_coef      
#>   <dbl> <list>             <list>   <list>                <list>          
#> 1    6. <tibble [7 x 10]>  <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
#> 2    4. <tibble [11 x 10]> <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
#> 3    8. <tibble [14 x 10]> <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
  
# now we can access both the model summaries AND 
# the model coefficients
model_results %>% 
  unnest(summaries, .drop = TRUE)
#> # A tibble: 3 x 12
#>     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl>
#> 1    6.     0.465         0.357  1.17      4.34  0.0918     2  -9.83  25.7
#> 2    4.     0.509         0.454  3.33      9.32  0.0137     2 -27.7   61.5
#> 3    8.     0.423         0.375  2.02      8.80  0.0118     2 -28.7   63.3
#> # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
  

model_results %>% 
  unnest(model_coef, .drop = TRUE)
#> # A tibble: 6 x 6
#>     cyl term        estimate std.error statistic    p.value
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
#> 1    6. (Intercept)    28.4      4.18       6.79 0.00105   
#> 2    6. wt             -2.78     1.33      -2.08 0.0918    
#> 3    4. (Intercept)    39.6      4.35       9.10 0.00000777
#> 4    4. wt             -5.65     1.85      -3.05 0.0137    
#> 5    8. (Intercept)    23.9      3.01       7.94 0.00000405
#> 6    8. wt             -2.19     0.739     -2.97 0.0118

Created on 2018-03-23 by the reprex package (v0.2.0).

While you can certainly do all of this with the split method, the nest method allows for easier organization of more complex operations and pipelines.
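For comparison, here is a minimal sketch of how the split()-based pipeline could keep the grouping variable. It relies on split() naming each list element after its cyl level and on dplyr::bind_rows() turning those names into a column through its .id argument. The name split_results is just illustrative, and the resulting cyl column is character rather than numeric.

```r
library(dplyr)
library(purrr)

split_results <- mtcars %>%
  split(.$cyl) %>%                    # named list with elements "4", "6", "8"
  map(~ lm(mpg ~ wt, data = .x)) %>%  # one model per cylinder group
  map(broom::glance) %>%              # one-row summary per model
  bind_rows(.id = "cyl")              # list names become a character column `cyl`

split_results
```

This recovers the cyl labels, but each additional per-model output (for example broom::tidy()) needs its own pass over the list of models, whereas the nested tibble keeps them side by side as list-columns.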

As for the appropriateness of the post, I think that this is perfect for this forum. One of the main purposes is for R/tidyverse users to have these exact sorts of discussions!

jennybryan · March 24, 2018, 2:18am

This pull request and the linked blog posts discuss the split vs nest choice:

github.com/tidyverse/tidyr

"chop()" a tidy-style split() function
tidyverse:master ← coolbutuseless:cleave · opened by coolbutuseless, 10:50AM - 05 Mar 18 UTC · +189 -0

The `split()` function in base R has a few issues which makes it less than ideal for interfacing with tidyverse functions, namely

* runtime is quadratic in number of splitting variables (see [1])
* runtime is quadratic in number of groups within each variable (see [1])
* the splitting variable gets recycled if it’s not as long as the data.frame being split (see [2])
* NA levels are dropped from the data (see [2])

A prototype tidyverse function called `cleave_by()` was created [3] and it seems to overcome these issues. This PR introduces a cleaned up version of the prototype (now called `chop()`) with an interface similar to `nest()`.

Some notes on behaviour of chop():

- I was originally investigating [4] a replacement for `group_by() + do()` and found `split() + map_dfr()` to be a possible solution (except for all the issues with `split()`)
- `nest()` doesn't quite answer my needs as the nesting variables are not present in the nested data.frames.
- The return value of `chop()` is an unnamed list, whereas `split()` tries to create a name for each list item based upon the level.
- Any grouping on the input data.frame is lost. This is what `nest()` does.
- All the split data.frames are converted to tibbles.
- Explicit column name arguments take precedence over any groups which may be present in the input data.frame.
- Like `split()`, chop puts the groups in alphabetical order, as calculated by `group_indices()`. E.g. the following returns a list with data.frames for `x=a`, `x=b` and `x=c`, in that order.

  ```r
  data_frame(x=c('b', 'a', 'c'), y=1:3) %>% chop(x)
  ```

- I'm not attached to the name `chop()`.

[1] https://coolbutuseless.bitbucket.io/2018/03/04/base-r-split-has-issues---part-1-runtime/
[2] https://coolbutuseless.bitbucket.io/2018/03/04/base-r-split-has-issues---part-2-idiosyncrasies/
[3] https://coolbutuseless.bitbucket.io/2018/03/04/cleave_by-a-tidyverse-style-split/
[4] https://coolbutuseless.bitbucket.io/2018/03/03/split-apply-combine-my-search-for-a-replacement-for-group_by---do/
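The point about dropped NA levels can be illustrated with a small made-up data frame. The names df, by_split and by_nest are hypothetical, and this is only a sketch of the behaviour described in the PR text, not code taken from it.

```r
library(dplyr)
library(tidyr)

# Toy data: one row has a missing grouping value
df <- data.frame(g = c("a", NA, "b", "a"), y = 1:4)

# Base split(): the row whose group is NA is silently dropped
by_split <- split(df, df$g)
names(by_split)
sum(vapply(by_split, nrow, integer(1)))  # rows that survived the split

# group_by() + nest(): NA is kept as a group of its own
by_nest <- df %>%
  group_by(g) %>%
  nest()
by_nest
```

With split(), only the "a" and "b" pieces come back and one of the four rows is lost. The nested tibble has three rows, including one whose key is NA.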

mikkeltp · May 14, 2018, 8:18am

Hi @tbradley,

Fully agree that nest() and map() is a powerful combination.
Recently I have become somewhat annoyed with it, as it negatively impacts the speed of my operations. My next step is to try parallel processing with the multidplyr package to see if that improves speed significantly. In your experience, is nest() and map() the most powerful combination for the type of analysis above, or do any other packages offer a similar approach with better performance?
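As a rough sketch of what parallelising the per-group model fits might look like, the example below uses base R's parallel package rather than multidplyr, since parallel ships with R. The two-worker cluster and the names pieces, fits and r2 are assumptions made for illustration. On data as small as mtcars, the overhead of starting workers will likely outweigh any gain.

```r
library(parallel)

# One data frame per cylinder group, as in the split() examples above
pieces <- split(mtcars, mtcars$cyl)

# Start a small local cluster (2 workers is an arbitrary choice)
cl <- makeCluster(2)

# Fit one model per group, distributing the groups across the workers
fits <- parLapply(cl, pieces, function(d) lm(mpg ~ wt, data = d))

# Always release the workers when done
stopCluster(cl)

# Collect one statistic per model back in the main session
r2 <- vapply(fits, function(m) summary(m)$r.squared, numeric(1))
r2
```

Whether this helps depends on how expensive each per-group computation is. It is worth timing the serial map() version first to confirm that the model fitting, rather than the nesting itself, is the bottleneck.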