How to create a for loop to perform correlation analysis in R?

Hi,

I would like to run a correlation analysis in R for every cell type in the column Cell.type. So far I have handled only one cell type, Whole Blood (example dataset and R code given below). Is it possible to use a for loop to cover all cell types and produce a correlation matrix as the final output? Note: the row means need to be calculated for each specific cell type, as in the Whole Blood example below; the same is needed for Neutrophils and CD8.

M8.3.sample.anno
#>    Samples   Cell.type        Subjects M8.3_EPSTI1 M8.3_HERC5 M8.3_HES4
#> 1   lib224 Whole Blood Type 1 Diabetes    5.058453   4.887020 1.7671376
#> 2   lib225 Whole Blood Type 1 Diabetes    4.450353   4.718768 1.2454535
#> 3   lib259 Whole Blood          Sepsis    5.135682   3.956515 1.3199113
#> 4   lib265 Whole Blood          Sepsis    1.880949   2.522847 0.1416930
#> 5   lib272 Whole Blood          Sepsis    3.169424   1.957587 1.5035259
#> 6   lib308 Whole Blood             ALS    5.247415   4.754137 2.2679888
#> 7   lib322 Whole Blood             ALS    5.263406   4.661308 2.4815427
#> 8   lib328 Whole Blood Type 1 Diabetes    5.274372   5.289623 2.1011256
#> 9   lib335 Whole Blood Type 1 Diabetes    4.778972   4.913245 2.5630369
#> 10  lib355 Whole Blood             ALS    4.768332   4.582032 1.5406692
#> 11  lib242 Neutrophils Type 1 Diabetes    5.253648   7.039871 1.3155759
#> 12  lib248 Neutrophils Type 1 Diabetes    3.877802   6.501353 2.2587263
#> 13  lib253 Neutrophils          Sepsis    4.645075   4.384019 0.2715198
#> 14  lib260 Neutrophils          Sepsis    2.322131   2.971893 0.0000000
#> 15  lib266 Neutrophils          Sepsis    1.183076   2.266580 0.0000000
#> 16  lib302 Neutrophils             ALS    4.421727   5.900319 2.3356959
#> 17  lib316 Neutrophils             ALS    5.257992   6.495806 3.4147835
#> 18  lib323 Neutrophils Type 1 Diabetes    6.116180   7.677564 2.6296078
#> 19  lib329 Neutrophils Type 1 Diabetes    4.955377   6.370284 2.0181739
#> 20  lib349 Neutrophils             ALS    5.275520   5.591712 0.9455944
#> 21  lib246         CD8 Type 1 Diabetes    3.889786   3.608076 0.9921601
#> 22  lib252         CD8 Type 1 Diabetes    3.534158   3.501803 0.9057364
#> 23  lib257         CD8          Sepsis    4.198376   3.794708 1.0898153
#> 24  lib264         CD8          Sepsis    3.187057   3.552989 0.2858425
#> 25  lib270         CD8          Sepsis    3.849569   3.864689 0.2660382
#> 26  lib306         CD8             ALS    3.854546   3.361594 1.7794049
#> 27  lib320         CD8             ALS    4.451968   3.259045 1.1280926
#> 28  lib327         CD8 Type 1 Diabetes    4.306357   3.709678 0.8845468
#> 29  lib333         CD8 Type 1 Diabetes    3.527393   2.824222 0.7876389
#> 30  lib353         CD8             ALS    3.995013   3.441016 0.9899005


## Filter by one cell type
M8.3.sample.anno.WB <- dplyr::filter(M8.3.sample.anno, Cell.type == "Whole Blood")
# M8.3.sample.anno.Neu <- dplyr::filter(M8.3.sample.anno, Cell.type == "Neutrophils")
# M8.3.sample.anno.CD8 <- dplyr::filter(M8.3.sample.anno, Cell.type == "CD8")

library(dplyr)
# Calculate row means for selected columns, specific to cell type and add as a new column named 'M8.3_Avg'
M8.3.sample.anno.WB <- M8.3.sample.anno.WB %>%
  mutate(M8.3_Avg = rowMeans(select(., starts_with("M8.3_")), na.rm = TRUE))
# M8.3.sample.anno.Neu <- M8.3.sample.anno.Neu %>%
 # mutate(M8.3_Avg = rowMeans(select(., starts_with("M8.3_")), na.rm = TRUE))
# M8.3.sample.anno.CD8 <- M8.3.sample.anno.CD8 %>%
 # mutate(M8.3_Avg = rowMeans(select(., starts_with("M8.3_")), na.rm = TRUE))

## relocate average column
M8.3.sample.anno.WB <- relocate(M8.3.sample.anno.WB, M8.3_Avg)

# List of gene columns, replace with actual gene column names if they are different
gene_columns.M8.3 <- colnames(M8.3.sample.anno.WB)[grepl("^M8.3_", colnames(M8.3.sample.anno.WB))]

# Initialize a vector to store correlation coefficients
correlations <- c()

# Calculate correlation for each gene
for(gene in gene_columns.M8.3) {
  # Skip if it's the "M8.3_Avg" column
  if(gene != "M8.3_Avg") {
    # Calculate Spearman correlation
    correlation <- cor(M8.3.sample.anno.WB$M8.3_Avg, M8.3.sample.anno.WB[[gene]], use = "everything",  method = "spearman")
    # Add to the correlations vector, named by the gene
    correlations[gene] <- correlation
  }
}

# Create a dataframe to store the results
M8.3_cor <- data.frame(`Whole Blood` = correlations)

M8.3_cor
#>             Whole.Blood
#> M8.3_EPSTI1   0.8666667
#> M8.3_HERC5    0.8060606
#> M8.3_HES4     0.8424242

Expected output:

             Whole.Blood Neutrophils      CD8
M8.3_EPSTI1    0.8666667   0.7575758 0.793939
M8.3_HERC5     0.8060606   0.9030303 0.212121
M8.3_HES4      0.8424242   0.8753840 0.745455

Thank you,
Toufiq

Will this work for you, despite having only one for loop?

M8.3.sample.anno <- read.csv("~/R/Play/Dummy.csv")

library(dplyr)

geneNames <- gene_columns.M8.3 <- colnames(M8.3.sample.anno)[grepl("^M8.3_", colnames(M8.3.sample.anno))]

# Calculate row means for selected columns, specific to cell type and add as a new column named 'M8.3_Avg'
M8.3.sample.anno <- M8.3.sample.anno %>%
  mutate(M8.3_Avg = rowMeans(select(., starts_with("M8.3_")), na.rm = TRUE))

TYPES <- unique(M8.3.sample.anno$Cell.type)
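# Initialize a matrix to store correlation coefficients (genes x cell types)
correlations <- matrix(NA, nrow = length(geneNames), ncol = length(TYPES))
for(i in seq_along(TYPES)) {
  # Keep the gene columns and M8.3_Avg for the current cell type
  tmp <- M8.3.sample.anno |> filter(Cell.type == TYPES[i]) |>
    select(starts_with("M8.3"))
  CorVals <- tmp |> cor(method = "spearman")
  # Correlation of each gene with the row average
  correlations[, i] <- CorVals[geneNames, "M8.3_Avg"]
}
dimnames(correlations) <- list(geneNames, TYPES)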
correlations
#>             Whole Blood Neutrophils       CD8
#> M8.3_EPSTI1   0.8666667   0.7575758 0.7939394
#> M8.3_HERC5    0.8060606   0.9030303 0.2121212
#> M8.3_HES4     0.8424242   0.8753840 0.7454545

Created on 2024-04-04 with reprex v2.0.2

I haven't done this in a while, but I'm pretty sure there's a base R function that allows an all-variable comparison, and these days, there's likely also a package out there that is ggplot-based.
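
The post does not name the function, so the following is only a guess at what is meant: in base R, `cor()` called on a whole data frame of numeric columns returns every pairwise correlation at once, and `pairs()` draws the matching scatterplot matrix. A small sketch with invented toy data (the `toy` data frame and its values are not from the thread):

```r
# Toy data with invented values, three numeric columns
toy <- data.frame(
  g1 = c(1, 2, 3, 4, 5),
  g2 = c(2, 1, 4, 3, 5),
  g3 = c(5, 4, 3, 2, 1)
)

# All pairwise Spearman correlations in a single call
cor_mat <- cor(toy, method = "spearman")
cor_mat

# Scatterplot matrix of the same columns; uncomment to draw the plot
# pairs(toy)
```

On the ggplot side, `GGally::ggpairs()` is one package function of that kind, though it may not be the one the poster had in mind.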

Here we see that you have, in principle, a plan for how you might manually iterate over what you want to do, but you are doing it in an ad hoc way. In general, you should write a function that does the thing that needs repeating, so that repeating it is as easy as a function call, and pass parameters for whatever needs to vary. Iterating then becomes a simple matter of passing a series of parameters.
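
A minimal sketch of that advice, assuming the column layout from the original post (a `Cell.type` column plus gene columns sharing a prefix). The helper name `cor_with_avg` and the `toy` data frame with its values are made up for illustration; they are not from the thread.

```r
# Hypothetical helper: Spearman correlation of each gene column with the
# row-wise average of those gene columns, within one cell type.
cor_with_avg <- function(data, cell_type, prefix = "M8.3_") {
  sub <- data[data$Cell.type == cell_type, , drop = FALSE]
  genes <- names(sub)[startsWith(names(sub), prefix)]
  avg <- rowMeans(sub[genes], na.rm = TRUE)
  sapply(sub[genes], function(x) cor(avg, x, method = "spearman"))
}

# Toy data (invented values): two cell types with four samples each
toy <- data.frame(
  Cell.type = rep(c("A", "B"), each = 4),
  M8.3_g1 = c(1, 2, 3, 4, 4, 3, 2, 1),
  M8.3_g2 = c(2, 4, 6, 8, 1, 3, 2, 4)
)

types <- unique(toy$Cell.type)
# Iterating is now one function call per cell type; sapply() binds the
# named vectors into a genes-by-cell-type matrix.
result <- sapply(types, function(ct) cor_with_avg(toy, ct))
result
```

With the real data, the same call pattern would presumably be `sapply(unique(M8.3.sample.anno$Cell.type), function(ct) cor_with_avg(M8.3.sample.anno, ct))`, which avoids hard-coding the number of genes or cell types.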

@FJCC thank you very much for the solution.

@dromano and @nirgrahamuk thank you for the advice and suggestions. I was able to follow @FJCC's solution, and my issue is solved.