How to print for loop results into a data frame

Hi,

I am trying to automate my analysis, however I am lacking the last part. I have made some illustration data, as I can not share my own data.

I have several factors of 3 levels, and several variables of different metabolites. Example data:

data_df <- data.frame(sleepFactor = c("A", "B", "C","A", "B", "C"),
                      sleepFactor2 = c("A", "C", "A","B", "C", "B"),
                      Metabolite1 = c(1,2,3,4,6,4),
                      Metabolite2 = c(1,3,5,4,2,4),
                      Metabolite3 = c(1,4,5,4,6,4),
                      Metabolite4 = c(1,5,6,4,4,4),
                      Metabolite5 = c(1,2,3,4,4,4))

I want to make an anova of each metabolite for each factor. For that I make a function and then a for loop:

my_function <- function(metabolite, sleep_factor, df){
  print("################################################################")
  print(paste("This is metabolite:",colnames(df[i])))
  
  model <-aov(metabolite ~ sleep_factor, df)
  print(paste("Summary of one way ANOVA"))
  print(summary(model))
  
  if (summary(model)[[1]][1,5] <= 0.05){
   print(paste("Pairwise comparisons with Tukey test:"))
   print(TukeyHSD(model, conf.level = 0.95)) 
  }
}

for(i in 3:ncol(data_df)){
  my_function(data_df[ , i],
              data_df$sleepFactor,
              data_df)
}

This does indeed work. However, now i wish to collect all the results in a data frame. Where the columns are the mean of each group and p value, and the rows represent each metabolite. Like this format:

Res.table <- data.frame(Group1_mean = NA,
                        Group2_mean = NA,
                        Group3_mean = NA,
                        PValue = NA)

I am not sure how to incorporate the results into such a table. I am thinking that I should add it to the for loop, where adding an inner loop. However, I am unsure of how to.

Do you have any tip/idea of how to do this?

Thank you!

There are various ways to extract parameters from models, for example:

data_df <- data.frame(sleepFactor = c("A", "B", "C","A", "B", "C"),
                      sleepFactor2 = c("A", "C", "A","B", "C", "B"),
                      Metabolite1 = c(1,2,3,4,6,4),
                      Metabolite2 = c(1,3,5,4,2,4),
                      Metabolite3 = c(1,4,5,4,6,4),
                      Metabolite4 = c(1,5,6,4,4,4),
                      Metabolite5 = c(1,2,3,4,4,4))


model <-aov(Metabolite1 ~ sleepFactor, data_df)
hsd <- TukeyHSD(model, conf.level = 0.95)

coef(model)
#>  (Intercept) sleepFactorB sleepFactorC 
#>          2.5          1.5          1.0
hsd$sleepFactor
#>     diff       lwr       upr     p adj
#> B-A  1.5 -7.198789 10.198789 0.7697962
#> C-A  1.0 -7.698789  9.698789 0.8851187
#> C-B -0.5 -9.198789  8.198789 0.9690213

But the easiest way is probably with the {broom} package that already puts everything in a nice dataframe:

broom::tidy(model)
#> # A tibble: 2 × 6
#>   term           df sumsq meansq statistic p.value
#>   <chr>       <dbl> <dbl>  <dbl>     <dbl>   <dbl>
#> 1 sleepFactor     2  2.33   1.17     0.269   0.781
#> 2 Residuals       3 13      4.33    NA      NA
broom::tidy(hsd)
#> # A tibble: 3 × 7
#>   term        contrast null.value estimate conf.low conf.high adj.p.value
#>   <chr>       <chr>         <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
#> 1 sleepFactor B-A               0     1.50    -7.20     10.2        0.770
#> 2 sleepFactor C-A               0     1.00    -7.70      9.70       0.885
#> 3 sleepFactor C-B               0    -0.5     -9.20      8.20       0.969

Created on 2023-08-11 with reprex v2.0.2

Thank you for the reply.
I did not know about broom.

I want to have all results printed in a shared table. For now, I do print out each model using print(), however, I would like to eg. take the anova p value for each model and paste it into a shared table. I am still thinking I need to do it in the loop however, I am not sure.

Well, the exact computations depend a lot on what you need in the result. Taking your initial post, I assume you want the group means and the ANOVA p-value (not the post-hoc hsd).

In base R, you can get the group means for one comparison with:

aggregate(Metabolite1 ~ sleepFactor, data = data_df, FUN = mean)
#>   sleepFactor Metabolite1
#> 1           A         2.5
#> 2           B         4.0
#> 3           C         3.5

And you can extract the means by taking the second column:

aggregate(Metabolite1 ~ sleepFactor, data = data_df, FUN = mean)[,2]
#> [1] 2.5 4.0 3.5

So you could create such a loop:

# initialize the results data.frame
nb_sleepFactor <- 2
nb_tests <- (ncol(data_df) - nb_sleepFactor)
Res.table <- data.frame(Group1_mean = double(nb_tests),
                        Group2_mean = double(nb_tests),
                        Group3_mean = double(nb_tests),
                        PValue = double(nb_tests))

# run the loop
for(i in 1:nrow(Res.table)){
  formula <- as.formula(paste0(colnames(data_df)[nb_sleepFactor+i], "~ sleepFactor"))
  message("i=",i,", testing column ", colnames(data_df)[nb_sleepFactor+i])
  Res.table[i,1:3] <- aggregate(formula,
                               data = data_df,
                               FUN = mean)[,2]
  model <- aov(formula, data = data_df)
  Res.table[i,4] <- summary(model)[[1]]$`Pr(>F)`[1]
}

# visualize the results
Res.table
#>   Group1_mean Group2_mean Group3_mean    PValue
#> 1         2.5         4.0         3.5 0.7806568
#> 2         2.5         2.5         4.5 0.3617436
#> 3         2.5         5.0         4.5 0.3535534
#> 4         2.5         4.5         5.0 0.3535534
#> 5         2.5         3.0         3.5 0.8184876

Technically this approach is not "the best", but it is quite extensible: if you can figure out how to compute a value, you can just add it to your results data.frame.

Another way to aproach the problem is using the tidyverse and a more functional approach.

library(tidyverse)

Let's start by making the metabolite a single column.

data_df |>
  pivot_longer(cols = -starts_with("sleepFactor"),
               names_to = "metabolite",
               values_to = "val_metabolite")
#> # A tibble: 30 × 4
#>    sleepFactor sleepFactor2 metabolite  val_metabolite
#>    <chr>       <chr>        <chr>                <dbl>
#>  1 A           A            Metabolite1              1
#>  2 A           A            Metabolite2              1
#>  3 A           A            Metabolite3              1
#>  4 A           A            Metabolite4              1
#>  5 A           A            Metabolite5              1
#>  6 B           C            Metabolite1              2
#>  7 B           C            Metabolite2              3
#>  8 B           C            Metabolite3              4
#>  9 B           C            Metabolite4              5
#> 10 B           C            Metabolite5              2
#> # ℹ 20 more rows
#> # ℹ Use `print(n = ...)` to see more rows

We could also do that with the sleepFactors if it makes sense for subsequent analysis:

data_df |>
  pivot_longer(cols = -starts_with("sleepFactor"),
               names_to = "metabolite",
               values_to = "val_metabolite") |>
  pivot_longer(cols = starts_with("sleepFactor"),
               names_to = "sleep_factor_type",
               values_to = "val_sleep_factor")

At this point, we can already group by both sleepFactor and metabolite, to get the mean of each metabolite in each sleepFactor group:

data_df |>
  pivot_longer(cols = -starts_with("sleepFactor"),
               names_to = "metabolite",
               values_to = "val_metabolite") |>
  group_by(sleepFactor, metabolite) |>
  summarize(mean = mean(val_metabolite),
            .groups = "drop")
#> # A tibble: 15 × 3
#>    sleepFactor metabolite   mean
#>    <chr>       <chr>       <dbl>
#>  1 A           Metabolite1   2.5
#>  2 A           Metabolite2   2.5
#>  3 A           Metabolite3   2.5
#>  4 A           Metabolite4   2.5
#>  5 A           Metabolite5   2.5
#>  6 B           Metabolite1   4  
#>  7 B           Metabolite2   2.5
#>  8 B           Metabolite3   5  
#>  9 B           Metabolite4   4.5
#> 10 B           Metabolite5   3  
#> 11 C           Metabolite1   3.5
#> 12 C           Metabolite2   4.5
#> 13 C           Metabolite3   4.5
#> 14 C           Metabolite4   5  
#> 15 C           Metabolite5   3.5

This has advantages, you can use split() to make a list of data.frames, then use map() to call a function on each data.frame. And as a bonus, if your function returns a data.frame (so you have a new list of data.frames), you can assemble them in a single data.frame with dplyr::bind_rows().

data_long <- data_df |>
  pivot_longer(cols = -starts_with("sleepFactor"),
               names_to = "metabolite",
               values_to = "val_metabolite")

single_model <- function(df){
  model <- aov(val_metabolite ~ sleepFactor, data = df)
  broom::tidy(model)
}

split(data_long, f = data_long$metabolite) |>
  map(single_model) |>
  bind_rows()
#> # A tibble: 10 × 6
#>    term           df sumsq meansq statistic p.value
#>    <chr>       <dbl> <dbl>  <dbl>     <dbl>   <dbl>
#>  1 sleepFactor     2  2.33  1.17      0.269   0.781
#>  2 Residuals       3 13     4.33     NA      NA    
#>  3 sleepFactor     2  5.33  2.67      1.45    0.362
#>  4 Residuals       3  5.50  1.83     NA      NA    
#>  5 sleepFactor     2  7.00  3.50      1.50    0.354
#>  6 Residuals       3  7.00  2.33     NA      NA    
#>  7 sleepFactor     2  7     3.5       1.5     0.354
#>  8 Residuals       3  7     2.33     NA      NA    
#>  9 sleepFactor     2  1.00  0.500     0.214   0.818
#> 10 Residuals       3  7.00  2.33     NA      NA

Created on 2023-08-15 with reprex v2.0.2

This approach may seem confusing at first, and it's totally fine if you prefer to stick with the for loop, but it's also very powerful: if you do a lot of statistics like this it will simplify your life.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.