You don't need rowwise, since rowMeans will evaluate rowwise anyway:
library(tidyverse)
set.seed(47)
set_dat = function(n){
  sample(c(rnorm(n), rep(NA, n)), 50, replace = TRUE)
}
d = tibble(smpl_id = sample(c('id1','id2','id3'), 50, replace = TRUE),
           origin = sample(1:3, 50, replace = TRUE),
           a_group = set_dat(50),
           b_group = set_dat(50),
           c_group = set_dat(50),
           d_group = set_dat(50))
d %>% mutate(group_mean = rowMeans(select(., contains('group')), na.rm = TRUE))
#> # A tibble: 50 x 7
#> smpl_id origin a_group b_group c_group d_group group_mean
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 id3 1 0.926 NA -0.143 2.44 1.08
#> 2 id2 3 NA 1.10 NA NA 1.10
#> 3 id3 1 -0.413 -0.371 NA NA -0.392
#> 4 id3 1 NA 0.0349 NA NA 0.0349
#> 5 id2 1 -0.833 NA NA NA -0.833
#> 6 id3 2 NA NA NA 0.271 0.271
#> 7 id2 1 NA -0.419 NA -0.329 -0.374
#> 8 id2 1 NA -0.263 -1.51 -0.164 -0.646
#> 9 id2 2 NA NA NA 0.436 0.436
#> 10 id3 2 NA NA NA 1.41 1.41
#> # ... with 40 more rows
Like apply, rowMeans coerces its input to a matrix. That is fine in the sense that you're unlikely to take the mean of mixed types, but the conversion can still be expensive.
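To make the coercion concrete, here is a minimal base R sketch. The toy data frames `num_df` and `mixed_df` are made up for illustration and are not part of the example data above.

```r
# An all-numeric data frame converts to a numeric matrix (a copy of the data).
num_df <- data.frame(a = c(1, 2), b = c(3, NA))
num_mat <- as.matrix(num_df)
typeof(num_mat)

# A single character column forces every cell to character,
# which is why the numeric columns are selected before calling rowMeans().
mixed_df <- data.frame(a = c(1, 2), id = c("x", "y"))
mixed_mat <- as.matrix(mixed_df)
typeof(mixed_mat)

# rowMeans() does this conversion internally when handed a data frame.
rowMeans(num_df, na.rm = TRUE)
```

For a handful of numeric columns the copy is cheap. It only becomes a concern on wide or very long data.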
There are ways to avoid coercion, though each has its flaws:
- Reshape to a tidier long format.
d %>%
  rowid_to_column('i') %>%
  gather(group, value, contains('group')) %>%
  group_by(i) %>%
  mutate(group_mean = mean(value, na.rm = TRUE)) %>%
  spread(group, value)
#> # A tibble: 50 x 8
#> # Groups: i [50]
#> i smpl_id origin group_mean a_group b_group c_group d_group
#> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 id3 1 1.08 0.926 NA -0.143 2.44
#> 2 2 id2 3 1.10 NA 1.10 NA NA
#> 3 3 id3 1 -0.392 -0.413 -0.371 NA NA
#> 4 4 id3 1 0.0349 NA 0.0349 NA NA
#> 5 5 id2 1 -0.833 -0.833 NA NA NA
#> 6 6 id3 2 0.271 NA NA NA 0.271
#> 7 7 id2 1 -0.374 NA -0.419 NA -0.329
#> 8 8 id2 1 -0.646 NA -0.263 -1.51 -0.164
#> 9 9 id2 2 0.436 NA NA NA 0.436
#> 10 10 id3 2 1.41 NA NA NA 1.41
#> # ... with 40 more rows
This is not necessarily any faster, though this data does arguably belong in long form.
- Use purrr::pmap to iterate over rows.
d %>% mutate(group_mean = pmap_dbl(select(., contains('group')),
                                   ~mean(c(...), na.rm = TRUE)))
#> # A tibble: 50 x 7
#> smpl_id origin a_group b_group c_group d_group group_mean
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 id3 1 0.926 NA -0.143 2.44 1.08
#> 2 id2 3 NA 1.10 NA NA 1.10
#> 3 id3 1 -0.413 -0.371 NA NA -0.392
#> 4 id3 1 NA 0.0349 NA NA 0.0349
#> 5 id2 1 -0.833 NA NA NA -0.833
#> 6 id3 2 NA NA NA 0.271 0.271
#> 7 id2 1 NA -0.419 NA -0.329 -0.374
#> 8 id2 1 NA -0.263 -1.51 -0.164 -0.646
#> 9 id2 2 NA NA NA 0.436 0.436
#> 10 id3 2 NA NA NA 1.41 1.41
#> # ... with 40 more rows
This avoids coercion, but at the cost of vectorization: the function is called once per row, so it won't scale well.
- Use purrr::reduce to add the variables as vectors.
d %>% mutate(group_mean = reduce(select(., contains('group')),
                                 ~.x + coalesce(.y, 0),
                                 .init = 0) /
               reduce(select(., contains('group')),
                      ~.x + !is.na(.y),
                      .init = 0))
#> # A tibble: 50 x 7
#> smpl_id origin a_group b_group c_group d_group group_mean
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 id3 1 0.926 NA -0.143 2.44 1.08
#> 2 id2 3 NA 1.10 NA NA 1.10
#> 3 id3 1 -0.413 -0.371 NA NA -0.392
#> 4 id3 1 NA 0.0349 NA NA 0.0349
#> 5 id2 1 -0.833 NA NA NA -0.833
#> 6 id3 2 NA NA NA 0.271 0.271
#> 7 id2 1 NA -0.419 NA -0.329 -0.374
#> 8 id2 1 NA -0.263 -1.51 -0.164 -0.646
#> 9 id2 2 NA NA NA 0.436 0.436
#> 10 id3 2 NA NA NA 1.41 1.41
#> # ... with 40 more rows
This approach should be pretty efficient, as the math is vectorized and it only iterates over columns. It requires more math and programming, though, making it easier to screw up. It could be refactored to only iterate once, though the function would need to be more complicated.
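One possible shape for the single-pass refactoring is sketched below. It carries the running sum and the running count of non-missing values together in one accumulator. The helper name `row_mean_once` and the `toy` data are hypothetical, introduced only for this illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: row-wise mean of a list of columns in one pass.
# The accumulator is a list holding two vectors: the running sum
# (NAs counted as 0) and the running count of non-missing values.
row_mean_once <- function(cols) {
  acc <- reduce(
    cols,
    function(acc, col) {
      list(sum = acc$sum + coalesce(col, 0),
           n   = acc$n + !is.na(col))
    },
    .init = list(sum = 0, n = 0)
  )
  acc$sum / acc$n  # rows with no observed values give 0 / 0 = NaN
}

# Made-up data for illustration; the second row is entirely missing.
toy <- tibble(a = c(1, NA, 3), b = c(2, NA, NA), c = c(NA, NA, 6))
toy %>% mutate(group_mean = row_mean_once(select(., a, b, c)))
```

With the data above it would be called as `row_mean_once(select(., contains('group')))` inside `mutate`. Rows where every value is missing come out as `NaN`, which matches what `rowMeans(..., na.rm = TRUE)` returns in that case.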