Recommended way to split data by a variable, apply a function, and return a bound dataframe (i.e., a non-experimental do() replacement)

Cross-posted from tidyverse/dplyr Issue #7666 (same title as this post), if allowed.

Hi folks,

My question hinges on this sort of situation. The example is pretty artificial, but it's a case where you have a function that acts on a dataframe and returns a dataframe, doesn't return the grouping variables, but might need access to them.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

iris |>
  group_by(Species) |>
  do(fun(.))
#> # A tibble: 9 × 2
#> # Groups:   Species [3]
#>   Species    Petal.Length
#>   <fct>             <dbl>
#> 1 setosa              1.4
#> 2 setosa              1.5
#> 3 setosa              1.4
#> 4 versicolor          4.7
#> 5 versicolor          4.5
#> 6 versicolor          4.9
#> 7 virginica           6  
#> 8 virginica           5.1
#> 9 virginica           5.9

Created on 2025-03-01 with reprex v2.1.1

do() can deal with this admirably, but I'm unsure what the modern equivalent is.

purrr::map() doesn't behave the same because the group variables are lost from the bound result, so you can't tell which rows belong to which group:

purrr::map(
  split(iris, ~Species),
  fun
) |>
  dplyr::bind_rows()
#>   Petal.Length
#> 1          1.4
#> 2          1.5
#> 3          1.4
#> 4          4.7
#> 5          4.5
#> 6          4.9
#> 7          6.0
#> 8          5.1
#> 9          5.9
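
One possible workaround for the split()/map() route, offered as a sketch rather than a recommended pattern: split() names each list element after its Species level, so bind_rows() can turn those names back into a column through its .id argument. Note that Species comes back as a character column rather than a factor, and the result is ungrouped. The name res_split is only for illustration.

```r
library(dplyr)

# Same `fun` as in the question: returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# split() keeps Species inside each chunk (so `fun` can read it) and
# names the list elements by level; `.id` restores those names as a column.
res_split <- split(iris, ~Species) |>
  purrr::map(fun) |>
  bind_rows(.id = "Species")
```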

nest_by() also doesn't really work because the nested dataframe has no access to the Species column:

iris |>
  nest_by(Species) |>
  mutate(data = list(fun(data)))
#> # A tibble: 3 × 2
#> # Rowwise:  Species
#>   Species    data            
#>   <fct>      <list>          
#> 1 setosa     <tibble [3 × 1]>
#> 2 versicolor <tibble [3 × 1]>
#> 3 virginica  <tibble [3 × 1]>
#> Warning message:
#> There were 3 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `data = list(fun(data))`.
#> ℹ In row 1.
#> Caused by warning:
#> ! Unknown or uninitialised column: `Species`.
#> ℹ Run dplyr::last_dplyr_warnings() to see the 2 remaining warnings.
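
If the only obstacle is that the nested data lacks Species, nest_by() has a .keep argument that keeps the grouping columns inside the list column. A minimal sketch, assuming the same fun as in the question and using reframe() on the rowwise result to unpack the returned dataframes; res_nest is just an illustrative name.

```r
library(dplyr)

# Same `fun` as in the question: needs x$Species, returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# `.keep = TRUE` leaves Species inside each nested tibble, so `fun` can see it.
# In a rowwise frame, `data` refers to the current row's nested tibble.
res_nest <- iris |>
  nest_by(Species, .keep = TRUE) |>
  reframe(fun(data))
```

Whether this is preferable to the other options is a judgment call, since it materialises a nested copy of the data first.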

group_modify() is recommended by the do() documentation, but it is experimental, so I don't want to put it in packages until it's stable. It also doesn't support .by, which may imply it isn't a function that's going to be taken forward? It seems to be the closest thing, however.
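
For reference, a sketch of what the group_modify() version might look like. The catch is that group_modify() passes the group's rows without the grouping column and supplies the group key as a second argument, so fun needs a two-argument variant. fun_keyed and res_gm are hypothetical names used only here.

```r
library(dplyr)

# Hypothetical two-argument variant of `fun`:
#   x   - the rows of the current group (grouping column removed)
#   key - a one-row tibble holding the group's Species value
fun_keyed <- function(x, key) {
  pick_rows <- if (key$Species == "setosa") tail else head
  pick_rows(x, n = 3) |> select(Petal.Length)
}

# group_modify() adds the grouping column back onto whatever fun_keyed returns.
res_gm <- iris |>
  group_by(Species) |>
  group_modify(fun_keyed) |>
  ungroup()
```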

As far as I can tell, reframe() can only act on columns, so I don't think that's quite right either? It is also marked "experimental".

The answer may lie in pick(), but I'm not quite sure how to apply it to this specific use case.

So I'm at a bit of a loss as to what the new do() actually is!

Cheers.

Don't know if that's the "recommended way", but here's a solution:

suppressMessages(
  library(dplyr)
)

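# Must be called inside a dplyr verb: pick() and cur_group() read the current group.
foo <- function() {
  data <- pick(Petal.Length)
  # cur_group() returns a one-row tibble of the grouping values
  grp <- cur_group()$Species
  .f <- if (grp == "setosa") tail else head
  .f(data, n = 3)
}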

iris |> 
  reframe(foo(), .by = Species)
#>      Species Petal.Length
#> 1     setosa          1.4
#> 2     setosa          1.5
#> 3     setosa          1.4
#> 4 versicolor          4.7
#> 5 versicolor          4.5
#> 6 versicolor          4.9
#> 7  virginica          6.0
#> 8  virginica          5.1
#> 9  virginica          5.9

Created on 2025-03-01 with reprex v2.1.1
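
A variation on the same idea, in case the function should stay an ordinary dataframe-in, dataframe-out function that doesn't call pick() or cur_group() itself: do those calls at the reframe() call site and pass the results in as arguments. This is only a sketch; fun_keyed and res_rf are hypothetical names, and pick(everything()) selects every column except the grouping one.

```r
library(dplyr)

# Plain function with no dependence on dplyr's data-masking context:
#   x   - the rows of the current group (without Species)
#   key - a one-row tibble holding the group's Species value
fun_keyed <- function(x, key) {
  pick_rows <- if (key$Species == "setosa") tail else head
  pick_rows(x, n = 3) |> select(Petal.Length)
}

# pick() and cur_group() are evaluated per group by reframe();
# the unnamed dataframe result is unpacked into columns.
res_rf <- iris |>
  reframe(fun_keyed(pick(everything()), cur_group()), .by = Species)
```

Keeping the dplyr context helpers at the call site means the function can also be tested on its own with a hand-built key such as tibble(Species = "setosa").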

If you want to use nest_by(), you'll have to unnest or unpack the data first:

suppressMessages(
  library(dplyr)
)

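# Must be called inside a dplyr verb on the nest_by() result.
foo <- function() {
  # pick(data) is a one-row tibble holding the list column; unnest it to get the rows
  data <- tidyr::unnest(pick(data), cols = data)
  data <- select(data, Petal.Length)
  # cur_group() returns a one-row tibble of the grouping values
  grp <- cur_group()$Species
  .f <- if (grp == "setosa") tail else head
  .f(data, n = 3)
}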

iris |> 
  nest_by(Species) |>
  reframe(foo())
#> # A tibble: 9 × 2
#>   Species    Petal.Length
#>   <fct>             <dbl>
#> 1 setosa              1.4
#> 2 setosa              1.5
#> 3 setosa              1.4
#> 4 versicolor          4.7
#> 5 versicolor          4.5
#> 6 versicolor          4.9
#> 7 virginica           6  
#> 8 virginica           5.1
#> 9 virginica           5.9

Created on 2025-03-01 with reprex v2.1.1