Cross-posted from "Recommended way to split data by a variable, apply a function, and return a bound dataframe (i.e., a non-experimental do() replacement)" · Issue #7666 · tidyverse/dplyr, if allowed.
Hi folks,
My question hinges on the following sort of situation. The example is pretty artificial, but it's a situation in which you have a function that acts on a dataframe, returns a dataframe, doesn't return the grouping variables, but might need access to them.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}
iris |>
  group_by(Species) |>
  do(fun(.))
#> # A tibble: 9 × 2
#> # Groups: Species [3]
#> Species Petal.Length
#> <fct> <dbl>
#> 1 setosa 1.4
#> 2 setosa 1.5
#> 3 setosa 1.4
#> 4 versicolor 4.7
#> 5 versicolor 4.5
#> 6 versicolor 4.9
#> 7 virginica 6
#> 8 virginica 5.1
#> 9 virginica 5.9
Created on 2025-03-01 with reprex v2.1.1
do() can deal with this admirably, but I'm unsure what the modern equivalent is.
purrr::map() doesn't behave the same because its output drops the grouping variable, so you can't tell which rows came from which species:
> purrr::map(
+ split(iris, ~Species),
+ fun
+ ) |>
+ dplyr::bind_rows()
Petal.Length
1 1.4
2 1.5
3 1.4
4 4.7
5 4.5
6 4.9
7 6.0
8 5.1
9 5.9
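For what it's worth, the group labels are not entirely lost here: split() names each list element after its Species level, and purrr::map() keeps those names. A minimal sketch, assuming purrr >= 1.0.0 (for list_rbind()) and R >= 4.1 (for the formula form of split()), that turns the names back into a column. The name `by_species` is just a placeholder for illustration.

```r
library(dplyr)

# Same toy function as above: it reads Species but returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# split() names each piece after its Species level, map() preserves the
# names, and list_rbind(names_to = ) writes them into a new column.
by_species <- split(iris, ~Species) |>
  purrr::map(fun) |>
  purrr::list_rbind(names_to = "Species")

by_species
```

One caveat: Species comes back as a character column rather than a factor, and this only works for a single grouping variable whose values survive as list names, so it is not a full replacement for do().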
nest_by() also doesn't really work because the nested dataframe has no access to the Species column:
> iris |>
+ nest_by(Species) |>
+ mutate(data = list(fun(data)))
# A tibble: 3 × 2
# Rowwise: Species
Species data
<fct> <list>
1 setosa <tibble [3 × 1]>
2 versicolor <tibble [3 × 1]>
3 virginica <tibble [3 × 1]>
Warning message:
There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `data = list(fun(data))`.
ℹ In row 1.
Caused by warning:
! Unknown or uninitialised column: `Species`.
ℹ Run dplyr::last_dplyr_warnings() to see the 2 remaining warnings.
group_modify() is recommended by the do() documentation, but it is experimental, so I don't want to put it in packages until it's stable. It also doesn't use .by, which may imply this isn't a function that's going to be taken forward? It seems to be the closest thing, however.
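To make the comparison concrete, here is a rough sketch of the group_modify() version of the example. By default the data frame passed to the function excludes the grouping columns, so this sketch relies on the `.keep = TRUE` argument to make Species visible inside fun(). The name `res` is a placeholder, and the function's experimental status is unchanged by any of this.

```r
library(dplyr)

# Same toy function as above: it reads Species but returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# .keep = TRUE keeps the grouping column in .x so fun() can inspect it.
# fun() must not return Species itself; group_modify() adds it back.
res <- iris |>
  group_by(Species) |>
  group_modify(~ fun(.x), .keep = TRUE)

res
```

On this toy data the result should have the same shape as the do() output: a grouped tibble with Species and Petal.Length.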
reframe(), I believe, can only act on columns, so I don't think that's quite right either? It is also "experimental".
The answer may lie in pick(), but I'm not quite sure how to apply it to this specific use case.
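One possible shape for this, offered as a sketch rather than a recommendation: pick(everything()) returns the current group's non-grouping columns as a data frame, and cur_group() returns a one-row tibble of the group keys, so the two can be glued together before calling fun(). This assumes dplyr >= 1.1.0 (for reframe(), pick() and .by) and that cur_group() behaves the same under .by as under group_by(). The name `res` is a placeholder.

```r
library(dplyr)

# Same toy function as above: it reads Species but returns only Petal.Length.
fun <- function(x) {
  if (any(x$Species == "setosa")) {
    tail(x, n = 3) |> select(Petal.Length)
  } else {
    head(x, n = 3) |> select(Petal.Length)
  }
}

# pick(everything()) omits the grouping column, so the group key from
# cur_group() is recycled alongside it to rebuild a full data frame.
# The unnamed data frame returned by fun() is unpacked into columns.
res <- iris |>
  reframe(
    fun(tibble::tibble(cur_group(), pick(everything()))),
    .by = Species
  )

res
```

This still leans on reframe(), which is itself marked experimental, so it does not settle the stability concern.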
So I'm at a bit of a loss as to what the new do() actually is!
Cheers.