I hate to ask such an extremely simple question, but here it is. I've got a time-by-group dataset in "long" format. Within each of the two groups, the N's decline over time due to follow-up attrition. When putting together my descriptive stats, I want to express the N at each time (within each group) as a percentage of that group's N at the first time point.
Doing it is bog simple by sticking a filter() call inside a mutate(). I guess I have a habit of staying in my tidyverse comfort zone even for trivial tasks. Can anyone point out an equally simple (or simpler) way of computing a percentage-of-baseline-N quantity in plain old base R?
library(tidyverse)
# Create a fake dataset (3 times by 2 groups) of descriptive stats
df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150),
                 p    = c(0.1, 0.2, 0.15, 0.26, 0.21, 0.32),
                 age  = round(rnorm(6, 50, 6), 1))  # no set.seed, so age varies run to run
# Add a variable reflecting loss to followup at t1, t2 (percentage of n from t0)
# (note: grp == grp compares the column with itself and is always TRUE, so
#  filter() returns both t0 rows; the division is right only because R's
#  recycling happens to line up with the row order)
newdf <- df %>% mutate(pct = 100 * n / filter(df, time == 0, grp == grp)$n)
newdf
#> time grp n p age pct
#> 1 0 0 200 0.10 45.4 100.00000
#> 2 0 1 190 0.20 49.4 100.00000
#> 3 1 0 180 0.15 39.0 90.00000
#> 4 1 1 175 0.26 63.3 92.10526
#> 5 2 0 150 0.21 46.1 75.00000
#> 6 2 1 150 0.32 47.5 78.94737
Hi Brent, I'd like to share my "oldschool" base R solution with you. It's not quite the simplest way, but I've tried to keep it short... and I don't think the tidyverse comfort zone is such a bad place to be!
So, back to your example:
set.seed(1993)
df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150),
                 p    = c(0.1, 0.2, 0.15, 0.26, 0.21, 0.32),
                 age  = round(rnorm(6, 50, 6), 1))
#okay, some base R (the good old days...) ;D
newdf <- merge(x = df, y = subset(df, time == 0, select = c('grp', 'n')),
               by = 'grp', all.x = TRUE, suffixes = c('', '.base'))
newdf$pct <- 100 * newdf$n / newdf$n.base
#result (note: merge(...) will change your original row order)
newdf
#> grp time n p age n.base pct
#> 1 0 0 200 0.10 47.0 200 100.00000
#> 2 0 1 180 0.15 50.3 200 90.00000
#> 3 0 2 150 0.21 56.9 200 75.00000
#> 4 1 0 190 0.20 47.7 190 100.00000
#> 5 1 1 175 0.26 46.1 190 92.10526
#> 6 1 2 150 0.32 51.9 190 78.94737
# final result: restore the original row order and keep the columns of interest
newdf[order(newdf$time, newdf$grp), c(colnames(df), 'pct')]
#> time grp n p age pct
#> 1 0 0 200 0.10 47.0 100.00000
#> 4 0 1 190 0.20 47.7 100.00000
#> 2 1 0 180 0.15 50.3 90.00000
#> 5 1 1 175 0.26 46.1 92.10526
#> 3 2 0 150 0.21 56.9 75.00000
#> 6 2 1 150 0.32 51.9 78.94737
Thanks to everyone for the many variations on this theme. I'm starting to understand why I always default to the tidyverse ways. With Leon's trick of using max(n) to take advantage of my initial N's always being the largest, I can lose the filter() but keep the mutate() and get a pretty sleek line of code.
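That sleek line isn't shown in the thread, so here's a sketch of what the max(n) version might look like (my reconstruction, assuming each group's N really is largest at t0, and adding group_by(grp) so the maximum is taken within group):

```r
library(dplyr)

df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150))

# within each group, max(n) is the baseline N, so no filter() is needed
newdf <- df %>%
  group_by(grp) %>%
  mutate(pct = 100 * n / max(n)) %>%
  ungroup()
```

The grouping is the whole trick: max(n) is evaluated per group, so each row is divided by its own group's baseline count.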
Note to hughparsonage... I'm almost, but not quite, past that stage as an R learner where any mention of an "apply"-family function makes me feel a little queasy and vertiginous!
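For what it's worth, there is a base R route that sidesteps both merge() and the scarier apply relatives: ave() returns its per-group result already aligned with the original rows. A sketch on the same fake data (ave() is my suggestion here, not something proposed in the thread):

```r
df <- data.frame(time = rep(0:2, each = 2),
                 grp  = rep(0:1, 3),
                 n    = c(200, 190, 180, 175, 150, 150))

# ave() computes FUN within each grp and expands the result back to one
# value per row, so no merge or re-sorting is needed
df$pct <- 100 * df$n / ave(df$n, df$grp, FUN = max)
```

Like the group_by/max(n) trick, this relies on the baseline N being each group's largest.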