I hate to ask such an extremely simple question, but here it is. I've got a time-by-group dataset in "long" format. Within each of the two groups, the N's decline over time due to follow-up attrition. When putting together my descriptive stats, I want to express the N at each time (within each group) as a percentage of that group's N at the first time point.
Doing it is bog simple by sticking a filter() call inside a mutate(). I guess I have a habit of staying in my tidyverse comfort zone even for trivial tasks. Can anyone point out an equally simple (or simpler) way of computing a percentage-of-baseline-N quantity in plain old base R?
library(tidyverse)
# Create a fake dataset (3 times by 2 groups) of descriptive stats
df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150),
                 p    = c(0.1, 0.2, 0.15, 0.26, 0.21, 0.32),
                 age  = round(rnorm(6, 50, 6), 1))  # no set.seed, so age varies run to run
# Add a variable reflecting loss to followup at t1, t2 (percentage of n from t0)
# (note: grp == grp compares the column with itself and is always TRUE, so
#  filter() returns both t0 rows; the division is right only because R's
#  recycling happens to line up with the row order)
newdf <- df %>% mutate(pct = 100 * n / filter(df, time == 0, grp == grp)$n)
newdf
#> time grp n p age pct
#> 1 0 0 200 0.10 45.4 100.00000
#> 2 0 1 190 0.20 49.4 100.00000
#> 3 1 0 180 0.15 39.0 90.00000
#> 4 1 1 175 0.26 63.3 92.10526
#> 5 2 0 150 0.21 46.1 75.00000
#> 6 2 1 150 0.32 47.5 78.94737
Hi Brent, I'd like to share my "oldschool" base R solution with you. It's not quite the simplest way, but I've tried to keep it short... and I don't think the tidyverse comfort zone is such a bad place to be!
So, back to your example:
set.seed(1993)
df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150),
                 p    = c(0.1, 0.2, 0.15, 0.26, 0.21, 0.32),
                 age  = round(rnorm(6, 50, 6), 1))
#okay, some base R (the good old days...) ;D
newdf <- merge(x = df, y = subset(df, time == 0, select = c('grp', 'n')),
               by = 'grp', all.x = TRUE, suffixes = c('', '.base'))
newdf$pct <- 100 * newdf$n / newdf$n.base
#result (note: merge(...) will change your original row order)
newdf
#> grp time n p age n.base pct
#> 1 0 0 200 0.10 47.0 200 100.00000
#> 2 0 1 180 0.15 50.3 200 90.00000
#> 3 0 2 150 0.21 56.9 200 75.00000
#> 4 1 0 190 0.20 47.7 190 100.00000
#> 5 1 1 175 0.26 46.1 190 92.10526
#> 6 1 2 150 0.32 51.9 190 78.94737
# final result: restore the original row order and keep the columns of interest
newdf[order(newdf$time, newdf$grp), c(colnames(df), 'pct')]
#> time grp n p age pct
#> 1 0 0 200 0.10 47.0 100.00000
#> 4 0 1 190 0.20 47.7 100.00000
#> 2 1 0 180 0.15 50.3 90.00000
#> 5 1 1 175 0.26 46.1 92.10526
#> 3 2 0 150 0.21 56.9 75.00000
#> 6 2 1 150 0.32 51.9 78.94737
Thanks to everyone for the many variations on this theme. I'm starting to understand why I always default to the tidyverse ways. With Leon's trick of using max(n) to take advantage of my initial N's always being the largest, I can lose the filter() but keep the mutate() and get a pretty sleek line of code.
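That sleek line isn't shown in the thread, so here's a sketch of what the max(n) version might look like (my reconstruction, assuming each group's N really is largest at t0, and adding group_by(grp) so the maximum is taken within group):

```r
library(dplyr)

df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150))

# within each group, max(n) is the baseline N, so no filter() is needed
newdf <- df %>%
  group_by(grp) %>%
  mutate(pct = 100 * n / max(n)) %>%
  ungroup()
```

The grouping is the whole trick: max(n) is evaluated per group, so each row is divided by its own group's baseline count.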
Note to hughparsonage... I'm almost, but not quite, past that stage as an R learner where any mention of an "apply"-family function makes me feel a little queasy and vertiginous!
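For what it's worth, there is a base R route that sidesteps both merge() and the scarier apply relatives: ave() returns its per-group result already aligned with the original rows. A sketch on the same fake data (ave() is my suggestion here, not something proposed in the thread):

```r
df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150))

# ave() computes FUN within each grp and expands the result back to one
# value per row, so no merge or re-sorting is needed
df$pct <- 100 * df$n / ave(df$n, df$grp, FUN = max)
```

Like the group_by/max(n) trick, this relies on the baseline N being each group's largest.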