Hi Nicolle,
The basic strategy that I used to solve your problem (I think I did!) was to use `group_by` and `mutate` together to create the mean columns. People often think they can only use `group_by` with `summarize`. Really, `group_by` can be used with any function that you want to apply within each group. For the case of looking at `var1`, here's what I came up with...
```r
df %>%
  mutate(mean_var1 = mean(var1)) %>%
  group_by(plate) %>%
  mutate(mean_var1_plate = mean(var1)) %>%
  ungroup() %>%
  mutate(var1_normalized = var1 * mean_var1 / mean_var1_plate) %>%
  select(plate, var1_normalized)
```
```
# A tibble: 23 × 2
   plate var1_normalized
   <dbl>           <dbl>
 1     1         -0.385
 2     1          0.783
 3     1         -0.131
 4     1         -0.494
 5     1         -0.273
 6     2          0.487
 7     2         -0.212
 8     2          0.0486
 9     2         -0.0554
10     2         -0.768
# … with 13 more rows
```
Basically, the first `mutate` creates a column that has the same value in every row, namely your `mean(df$var1)`. Next we group by the plate and create another column that is the mean of `var1` within that plate. This is the same as your `aggregate(var1~plate, df, mean)`. In my code these two values are stored in `mean_var1` and `mean_var1_plate`. From there you can `ungroup` the data and use another `mutate` to calculate the normalized values of `var1`.
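If it helps to see the grouped versus ungrouped `mutate` in isolation, here's a minimal sketch with a made-up two-plate data frame (the `toy` object and its numbers are mine, not your data):

```r
library(dplyr)

# a tiny made-up data frame: two plates with three measurements each
toy <- tibble(plate = c(1, 1, 1, 2, 2, 2),
              var1  = c(2, 4, 6, 10, 20, 30))

toy %>%
  mutate(mean_var1 = mean(var1)) %>%        # overall mean, repeated in every row
  group_by(plate) %>%
  mutate(mean_var1_plate = mean(var1)) %>%  # per-plate mean, repeated within each plate
  ungroup()
```

Here `mean_var1` is 12 in every row, while `mean_var1_plate` is 4 on plate 1 and 20 on plate 2 — the same numbers `aggregate(var1~plate, toy, mean)` would give you, but kept alongside the original rows so a later `mutate` can use them.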
For the bigger question of how to do this for all of your variables, we take the same strategy, except we pivot the variable columns into a pair of name/value columns and add another layer of grouping...
```r
df %>%
  mutate(observation = 1:nrow(.)) %>%
  pivot_longer(-c(plate, observation), names_to="var", values_to="value") %>%
  group_by(var) %>%
  mutate(mean_var = mean(value)) %>%
  group_by(plate, var) %>%
  mutate(mean_var_plate = mean(value)) %>%
  group_by(var) %>%
  mutate(normalized = value * mean_var / mean_var_plate) %>%
  ungroup() %>%
  select(plate, observation, var, normalized) %>%
  pivot_wider(id_cols=c(plate, observation), names_from=var, values_from=normalized) %>%
  select(plate, starts_with("var"))
```
```
# A tibble: 23 × 4
   plate    var1    var2    var3
   <dbl>   <dbl>   <dbl>   <dbl>
 1     1 -0.385   0.286  -5.87
 2     1  0.783   0.607   9.68
 3     1 -0.131   0.450  -0.742
 4     1 -0.494   0.386  -0.424
 5     1 -0.273  -0.0578 -3.83
 6     2  0.487   0.469  -1.16
 7     2 -0.212  -0.376   0.474
 8     2  0.0486  1.27   -2.33
 9     2 -0.0554 -0.578   0.588
10     2 -0.768   0.888   1.25
# … with 13 more rows
```
Aside from the `pivot_longer`, the main difference was that I needed to add an `observation` column to `df` so that each row has a unique identifier when we do the `pivot_wider` at the end. Otherwise, `pivot_wider` complains. If this still doesn't make sense, I'd encourage you to run these two pipelines line by line and look at how the output changes as you go along.
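To see why the `observation` column matters, here's a small sketch of what `pivot_wider` does without a unique row identifier (again with made-up data; the duplicate plate/var combinations are the point):

```r
library(tidyr)
library(tibble)

# made-up long-format data: two values for the same plate/var combination
long <- tibble(plate = c(1, 1, 2, 2),
               var   = c("var1", "var1", "var1", "var1"),
               value = c(0.5, 0.7, 0.2, 0.9))

# without an id column, pivot_wider can't tell the duplicate rows apart,
# so it warns and packs the clashing values into list-columns
pivot_wider(long, names_from = var, values_from = value)
```

Adding an id column (like the `observation` counter above) gives `pivot_wider` one row per plate/observation pair, so every value lands in its own cell instead of a list-column.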
Let me know if anything is still unclear!
Pat