I would like to know whether there is any built-in function that lets us "pack" a sequence of dplyr
operations into one object, so that the operations are reusable.
Suppose this is what I want to do:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
head(cars)
#> speed dist
#> 1 4 2
#> 2 4 10
#> 3 7 4
#> 4 7 22
#> 5 8 16
#> 6 9 10
# Create a data set with same column names
n <- 100
set.seed(8701)
cars2 <- data.frame(speed = round(runif(n, 4, 50)),
dist = round(runif(n, 15, 100)))
head(cars2)
#> speed dist
#> 1 33 23
#> 2 38 95
#> 3 16 59
#> 4 26 58
#> 5 19 32
#> 6 28 17
new1 <- cars %>% filter(speed > 10) %>%
mutate(dist2 = dist * 2)
new2 <- cars2 %>% filter(speed > 10) %>%
mutate(dist2 = dist * 2)
head(new1)
#> speed dist dist2
#> 1 11 17 34
#> 2 11 28 56
#> 3 12 14 28
#> 4 12 20 40
#> 5 12 24 48
#> 6 12 28 56
head(new2)
#> speed dist dist2
#> 1 33 23 46
#> 2 38 95 190
#> 3 16 59 118
#> 4 26 58 116
#> 5 19 32 64
#> 6 28 17 34
If I want to do filter(speed > 10) %>% mutate(dist2 = dist * 2)
again and again, each time on a different data frame, can I do something like this?
to_do <- some_function(filter(speed > 10) %>%
mutate(dist2 = dist * 2))
new1 <- cars %>% to_do
new2 <- cars2 %>% to_do
You are on the right track.
suppressMessages(library(dplyr))
my_function <- function(x) {
x %>% filter(speed > 10) %>% mutate(dist2 = dist * 2)
}
cars0 <- cars[1:10,]
cars1 <- cars[11:20,]
cars2 <- cars[21:30,]
cars3 <- cars[31:40,]
cars4 <- cars[41:50,]
my_function(cars0)
#> speed dist dist2
#> 1 11 17 34
my_function(cars1)
#> speed dist dist2
#> 1 11 28 56
#> 2 12 14 28
#> 3 12 20 40
#> 4 12 24 48
#> 5 12 28 56
#> 6 13 26 52
#> 7 13 34 68
#> 8 13 34 68
#> 9 13 46 92
#> 10 14 26 52
my_function(cars2)
#> speed dist dist2
#> 1 14 36 72
#> 2 14 60 120
#> 3 14 80 160
#> 4 15 20 40
#> 5 15 26 52
#> 6 15 54 108
#> 7 16 32 64
#> 8 16 40 80
#> 9 17 32 64
#> 10 17 40 80
my_function(cars3)
#> speed dist dist2
#> 1 17 50 100
#> 2 18 42 84
#> 3 18 56 112
#> 4 18 76 152
#> 5 18 84 168
#> 6 19 36 72
#> 7 19 46 92
#> 8 19 68 136
#> 9 20 32 64
#> 10 20 48 96
my_function(cars4)
#> speed dist dist2
#> 1 20 52 104
#> 2 20 56 112
#> 3 20 64 128
#> 4 22 66 132
#> 5 23 54 108
#> 6 24 70 140
#> 7 24 92 184
#> 8 24 93 186
#> 9 24 120 240
#> 10 25 85 170
my_function(cars)
#> speed dist dist2
#> 1 11 17 34
#> 2 11 28 56
#> 3 12 14 28
#> 4 12 20 40
#> 5 12 24 48
#> 6 12 28 56
#> 7 13 26 52
#> 8 13 34 68
#> 9 13 34 68
#> 10 13 46 92
#> 11 14 26 52
#> 12 14 36 72
#> 13 14 60 120
#> 14 14 80 160
#> 15 15 20 40
#> 16 15 26 52
#> 17 15 54 108
#> 18 16 32 64
#> 19 16 40 80
#> 20 17 32 64
#> 21 17 40 80
#> 22 17 50 100
#> 23 18 42 84
#> 24 18 56 112
#> 25 18 76 152
#> 26 18 84 168
#> 27 19 36 72
#> 28 19 46 92
#> 29 19 68 136
#> 30 20 32 64
#> 31 20 48 96
#> 32 20 52 104
#> 33 20 56 112
#> 34 20 64 128
#> 35 22 66 132
#> 36 23 54 108
#> 37 24 70 140
#> 38 24 92 184
#> 39 24 93 186
#> 40 24 120 240
#> 41 25 85 170
Created on 2023-09-17 with reprex v2.0.2
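If the threshold or the multiplier needs to vary between uses, one possible extension of the same idea is a small function factory. This is only a sketch: `make_steps`, `min_speed`, and `multiplier` are hypothetical names introduced here for illustration, not part of dplyr. The `.env` pronoun tells dplyr to look the value up outside the data frame, which avoids a clash if a column happens to share the argument's name.

```r
suppressMessages(library(dplyr))

# Hypothetical factory: returns a function that applies the same two steps
# with a caller-supplied speed threshold and multiplier.
make_steps <- function(min_speed = 10, multiplier = 2) {
  force(min_speed)
  force(multiplier)
  function(x) {
    x %>%
      filter(speed > .env$min_speed) %>%
      mutate(dist2 = dist * .env$multiplier)
  }
}

over_10 <- make_steps()                               # same steps as my_function
over_15 <- make_steps(min_speed = 15, multiplier = 3)

res_10 <- over_10(cars)
res_15 <- cars %>% over_15
```

Each returned function can be called directly or placed on the right-hand side of `%>%`, just like `my_function` above.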
Thanks!
Is there a built-in function that can do this without writing a function ourselves?
This idea came to me when working with ggplot2. The + operator is very convenient when working with themes and similar elements (though not for some others). We can "add" several calls together, and add them to different ggplot2 graphs. We can also combine them in any way we want to modify the themes.
I am wondering whether we can do something similar in dplyr, somehow adding operations together.
I quickly drafted a function to illustrate what I want. It is definitely not flexible enough and may fail in some cases, but I think it is enough to demonstrate what I would love to see in dplyr ... or maybe there is already such a function somewhere?
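pack_dplyr <- function(...) {
  # Capture the unevaluated verb calls, e.g. filter(speed > 4)
  args <- match.call()
  tmpfct <- function(.data) {
    k <- length(args)
    data_new <- .data
    # Element 1 is the name pack_dplyr itself, so the verbs start at 2
    for (x in seq_len(k)[-1]) {
      callx <- args[[x]]
      # Supply the current data as the verb's .data argument, then run it
      callx$.data <- data_new
      data_new <- eval(callx)
    }
    data_new
  }
  tmpfct
}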
cars1 <- cars[1:10, ]
# Create a version with different column orders
cars2 <- data.frame(id = round(runif(20, 10, 20)),
dist = cars[11:20, "dist"],
speed = cars[11:20, "speed"])
head(cars1)
#> speed dist
#> 1 4 2
#> 2 4 10
#> 3 7 4
#> 4 7 22
#> 5 8 16
#> 6 9 10
head(cars2)
#> id dist speed
#> 1 11 28 11
#> 2 17 14 12
#> 3 14 20 12
#> 4 18 24 12
#> 5 16 28 12
#> 6 17 26 13
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tmp1 <- pack_dplyr(filter(speed > 4),
mutate(dist2 = dist * 2),
select(-dist))
new1 <- cars1 %>% tmp1
new2 <- cars2 %>% tmp1
head(new1)
#> speed dist2
#> 1 7 8
#> 2 7 44
#> 3 8 32
#> 4 9 20
#> 5 10 36
#> 6 10 52
head(new2)
#> id speed dist2
#> 1 11 11 56
#> 2 17 12 28
#> 3 14 12 40
#> 4 18 12 48
#> 5 16 12 56
#> 6 17 13 52
# More operations
new11 <- cars1 %>% tmp1 %>% slice(1:5)
new22 <- cars2 %>% rename(new_id = id) %>% tmp1
head(new11)
#> speed dist2
#> 1 7 8
#> 2 7 44
#> 3 8 32
#> 4 9 20
#> 5 10 36
head(new22)
#> new_id speed dist2
#> 1 11 11 56
#> 2 17 12 28
#> 3 14 12 40
#> 4 18 12 48
#> 5 16 12 56
#> 6 17 13 52
# Two packaged operations
tmpa1 <- pack_dplyr(filter(speed > 4),
mutate(dist2 = dist * 2))
tmpa2 <- pack_dplyr(slice(1:5),
select(-dist))
newa1 <- cars1 %>% tmpa1 %>% tmpa2
head(newa1)
#> speed dist2
#> 1 7 8
#> 2 7 44
#> 3 8 32
#> 4 9 20
#> 5 10 36
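For comparison, a similar "packing" effect can be had without capturing calls via `match.call()`, by chaining ordinary one-argument functions with base R's `Reduce()`. This is a rough sketch under that assumption; `chain_steps` and the three step functions are hypothetical names made up for this example.

```r
suppressMessages(library(dplyr))

# Hypothetical helper: chain any number of one-argument functions,
# applying them left to right to the data passed in.
chain_steps <- function(...) {
  steps <- list(...)
  function(.data) Reduce(function(acc, step) step(acc), steps, .data)
}

# Each step is a plain function of a data frame.
keep_fast   <- function(d) filter(d, speed > 4)
double_dist <- function(d) mutate(d, dist2 = dist * 2)
drop_dist   <- function(d) select(d, -dist)

tmp1_chain <- chain_steps(keep_fast, double_dist, drop_dist)
new1_chain <- cars[1:10, ] %>% tmp1_chain
```

The trade-off is that every step has to be wrapped in a function first, whereas `pack_dplyr()` accepts the verb calls as they would be written in a pipe.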
The + operator works in ggplot2 because the package is something of a hidden domain-specific language: it uses + to add new objects (ggproto objects) to a ggplot object. And although it can substitute new data and aes mappings to override the originals, that is not really analogous to applying dplyr verbs to different datasets.
I am unaware of any tidyverse function to do this, aside from yours.
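To make the ggplot2 side of the comparison concrete, here is a minimal sketch of the reuse pattern described in the question: theme components collected in a plain list and added to different plots with +. The name `my_style` and the particular theme settings are arbitrary choices for illustration.

```r
library(ggplot2)

# A reusable bundle of ggplot2 components stored in a plain list.
my_style <- list(
  theme_minimal(),
  theme(legend.position = "bottom")
)

# The same bundle is added to two unrelated plots.
p1 <- ggplot(cars, aes(speed, dist)) + geom_point() + my_style
p2 <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + geom_point() + my_style
```

Here the bundle modifies a plot object, while the data stay attached to each plot; a packed set of dplyr verbs instead has to receive the data as its input.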
Oh ... just found that we can already do this. The last example for %>% on its magrittr help page shows exactly what I want.
cars1 <- cars[1:10, ]
# Create a version with different column orders
cars2 <- data.frame(id = round(runif(20, 10, 20)),
dist = cars[11:20, "dist"],
speed = cars[11:20, "speed"])
head(cars1)
#> speed dist
#> 1 4 2
#> 2 4 10
#> 3 7 4
#> 4 7 22
#> 5 8 16
#> 6 9 10
head(cars2)
#> id dist speed
#> 1 15 28 11
#> 2 14 14 12
#> 3 11 20 12
#> 4 19 24 12
#> 5 10 28 12
#> 6 16 26 13
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tmp1_dot <- . %>% filter(speed > 4) %>%
mutate(dist2 = dist * 2) %>%
select(-dist)
new1 <- cars1 %>% tmp1_dot
new2 <- cars2 %>% tmp1_dot
head(new1)
#> speed dist2
#> 1 7 8
#> 2 7 44
#> 3 8 32
#> 4 9 20
#> 5 10 36
#> 6 10 52
head(new2)
#> id speed dist2
#> 1 15 11 56
#> 2 14 12 28
#> 3 11 12 40
#> 4 19 12 48
#> 5 10 12 56
#> 6 16 13 52
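A pipeline that starts with the `.` placeholder is what magrittr calls a functional sequence. As a small sketch of what that allows (the object names below are made up for this example): the result is an ordinary function, so it can be called directly, and sequences can be chained into longer ones, which covers the "two packaged operations" case from the earlier post.

```r
suppressMessages(library(dplyr))

keep_fast_double <- . %>% filter(speed > 4) %>% mutate(dist2 = dist * 2)
first5_no_dist   <- . %>% slice(1:5) %>% select(-dist)

# A functional sequence is a function, so it can be called directly.
direct <- keep_fast_double(cars[1:10, ])

# Sequences can themselves be chained into a longer sequence.
both  <- . %>% keep_fast_double %>% first5_no_dist
newa1 <- cars[1:10, ] %>% both

# List the individual steps stored in a sequence.
magrittr::functions(keep_fast_double)
```

This relies on the magrittr pipe `%>%`; the base R pipe `|>` does not build functions this way.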
Thanks a lot for your explanation! I understand more about how ggplot2's + operator works now.