Normalizing with group and overall means using dplyr

nicolle · August 17, 2021, 4:00pm

Hi,
I'm trying to 'normalize' (best word I can come up with) some data using the ratio of the overall mean to the group mean which is then mutliplied by all the values. I'm trying to use dplyr, but having trouble knowing how to use both the group mean (using group_by ) and the overall mean (ungrouped) in the same formula.

For example:
set.seed(5)
df=data.frame(plate=c(rep(1:4,5),rep(5,3)), var1=rnorm(23),var2=rnorm(23),var3=rnorm(23))
df<-df[order(df$plate),]
df

The values in each column (e.g., var1, var2) would be multiplied by the (overall mean / plate mean). So, for var1, the overall mean is

mean(df$var1) #-0.1000353

means by plate

aggregate(var1~plate, df, mean)

plate var1

#1 1 -0.2185788
#2 2 -0.2843883
#3 3 -0.2661941
#4 4 -0.3530699
#5 5 1.1034477

So where plate==1, each var1 value would by multiplied by (-0.1000353 / -0.2185788 ), where plate==2, each value would be multiplied by (-0.1000353 /-0.2843883 ), and so on. This would be true for each variable. In theory I'd like this to work for mutliple variables, mutliple plates, with varying numbers of rows per plate (always more one row).

Thank you for any hints/help you can provide. I've had a difficult time trying to find relevant examples.
Nicolle

Riffomonas · August 17, 2021, 6:29pm

Hi Nicolle,

The basic strategy that I used to solve your problem (I think I did!) was to use group_by and mutate together to create the mean columns. Often, people think they can only use group_by with summarize. Really, group_by can be used with other functions that you want to work within each group.

For the case of looking at var1, here's what I came up with...

df %>%
  mutate(mean_var1 = mean(var1)) %>%
  group_by(plate) %>%
  mutate(mean_var1_plate = mean(var1)) %>%
  ungroup() %>%
  mutate(var1_normalized = var1 * mean_var1/mean_var1_plate) %>%
  select(plate, var1_normalized)

# A tibble: 23 × 2
   plate var1_normalized
   <dbl>           <dbl>
 1     1         -0.385 
 2     1          0.783 
 3     1         -0.131 
 4     1         -0.494 
 5     1         -0.273 
 6     2          0.487 
 7     2         -0.212 
 8     2          0.0486
 9     2         -0.0554
10     2         -0.768 
# … with 13 more rows

Basically, the first mutate creates a column that's has the same value in every row that is the same as your mean(df$var1). Next we group by the plate and create another variable that is the mean of var1 within that plate. This is the same as your aggregate(var1~plate, df, mean). In my code both of these values are stored in mean_var1 and mean_var1_plate. From there you can ungroup the data and use another mutate to calculate the normalized values of var1.

For the bigger question of how to do this for all of your variables, we take the same strategy, except we gather the five variable columns into two columns and add a nesting layer...

df %>%
  mutate(observation = 1:nrow(.)) %>%
  pivot_longer(-c(plate, observation), names_to="var", values_to="value") %>%
  group_by(var) %>%
  mutate(mean_var = mean(value)) %>%
  group_by(plate, var) %>%
  mutate(mean_var_plate = mean(value)) %>%
  group_by(var) %>%
  mutate(normalized = value * mean_var/mean_var_plate) %>%
  ungroup() %>%
  select(plate, observation, var, normalized) %>%
  pivot_wider(id_cols=c(plate, observation), names_from=var, values_from=normalized) %>%
  select(plate, starts_with("var"))

# A tibble: 23 × 4
   plate    var1    var2   var3
   <dbl>   <dbl>   <dbl>  <dbl>
 1     1 -0.385   0.286  -5.87 
 2     1  0.783   0.607   9.68 
 3     1 -0.131   0.450  -0.742
 4     1 -0.494   0.386  -0.424
 5     1 -0.273  -0.0578 -3.83 
 6     2  0.487   0.469  -1.16 
 7     2 -0.212  -0.376   0.474
 8     2  0.0486  1.27   -2.33 
 9     2 -0.0554 -0.578   0.588
10     2 -0.768   0.888   1.25 
# … with 13 more rows

Aside from the pivot_longer the main difference was that I needed to add a column observation to df so that when we do the pivot_wider at the end each row has a unique name. Otherwise, pivot_wider complains. If this still doesn't make sense, I'd encourage you to run these two pipelines line by line and look at how the output changes as you go along.

Let me know if anything is still unclear!
Pat

EconProf · August 17, 2021, 10:25pm

I am still learning mutate(across()) and took this as a challenge. Getting the overall mean and the mean for each plate group was easy, but connecting each of var1 to var3 with the corresponding overall mean and plate mean was tricky.

library(tidyverse)

set.seed(5)
df=data.frame(plate=c(rep(1:4,5),rep(5,3)), var1=rnorm(23),var2=rnorm(23),var3=rnorm(23))
df<-df[order(df$plate),]

df %>% mutate(across(var1:var3, mean, .names = "overall_{.col}")) %>%
  group_by(plate) %>%
  mutate(across(var1:var3, mean, .names = "plate_{.col}")) %>%
  ungroup() %>%
  mutate(across(var1:var3, ~ .x * 
    get(glue::glue("overall_{cur_column()}")) / 
    get(glue::glue("plate_{cur_column()}")),
    .names = "norm_{.col}")) %>%
  select(plate, norm_var1:norm_var3)
#> # A tibble: 23 × 4
#>    plate norm_var1 norm_var2 norm_var3
#>    <dbl>     <dbl>     <dbl>     <dbl>
#>  1     1   -0.385     0.286     -5.87 
#>  2     1    0.783     0.607      9.68 
#>  3     1   -0.131     0.450     -0.742
#>  4     1   -0.494     0.386     -0.424
#>  5     1   -0.273    -0.0578    -3.83 
#>  6     2    0.487     0.469     -1.16 
#>  7     2   -0.212    -0.376      0.474
#>  8     2    0.0486    1.27      -2.33 
#>  9     2   -0.0554   -0.578      0.588
#> 10     2   -0.768     0.888      1.25 
#> # … with 13 more rows

^{Created on 2021-08-17 by the reprex package (v2.0.1)}

system · September 7, 2021, 10:26pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.