How to best visualize, compare or correlate scores obtained from two methods?

abdulkhayum.mahaboob · January 13, 2024, 9:07pm

Hi,

I have a question. I have obtained scores ranging from 0 to 10 from 2 different methods, both run on the same processes. I would like to correlate these scores or assess their concordance, and I am seeking help with how best to represent or correlate them. I tried a correlation analysis using corrplot (code below), but this wasn't very helpful because I could not tell which row (feature) belongs to which column (process). I am interested in correlating each feature with each process (column) and finding concordance. I am not sure, but perhaps a heatmap coupled with a boxplot or violin plot would help in this case?

The scores are in 2 data frames in R; templates are given below:

dput(Method_1)
#>           A_Process B_Process C_Process D_Process E_Process F_Process G_Process
#> Feature_1         9         5         0         5         7         0         3
#> Feature_2         2         6         2         4         7         7         4
#> Feature_3         0         0         0         0         0         0         2
#> Feature_4         2         4         1         1         7         0         2
#> Feature_5         0         6         0         0         6         7         2
#>           H_Process I_Process J_Process
#> Feature_1         4         6         2
#> Feature_2         0         0         0
#> Feature_3         5         7         7
#> Feature_4         3         5         6
#> Feature_5         3         5         6

dput(Method_2)
#>           A_Process B_Process C_Process D_Process E_Process F_Process G_Process
#> Feature_1         1         5         2         3         7         2         1
#> Feature_2         9         6         1         3         6         7         0
#> Feature_3         9         7         2         6         6         2         3
#> Feature_4         0         8         6         6         8         7         4
#> Feature_5         8         5         2         2         6         6         3
#>           H_Process I_Process J_Process
#> Feature_1         1         3         8
#> Feature_2         0         5         5
#> Feature_3         4         4         7
#> Feature_4         4         4         7
#> Feature_5         5         4         6

library(corrplot)
combined_df <- cbind(Method_1, Method_2)
correlation_matrix <- cor(combined_df)  # correlates columns (processes); add use = "pairwise.complete.obs" if there are missing values
corrplot(correlation_matrix, method = "circle")
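
A note on why the plot above is hard to read: cbind() on two data frames keeps the duplicated column names, so the 20 x 20 matrix has two columns labelled A_Process, two labelled B_Process, and so on, and cor() correlates columns (processes), not rows (features). Below is a minimal sketch of one way around this, assuming the goal is to see how well the two methods agree per process. The _M1/_M2 suffixes and the names cross_cor and per_process are illustrative only, and with 5 features each coefficient rests on just 5 points, so the values are rough at best.

```r
# Rebuild the two score tables from the post (features in rows, processes in columns)
procs <- paste0(LETTERS[1:10], "_Process")
feats <- paste0("Feature_", 1:5)

Method_1 <- as.data.frame(matrix(c(
  9, 5, 0, 5, 7, 0, 3, 4, 6, 2,
  2, 6, 2, 4, 7, 7, 4, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 2, 5, 7, 7,
  2, 4, 1, 1, 7, 0, 2, 3, 5, 6,
  0, 6, 0, 0, 6, 7, 2, 3, 5, 6
), nrow = 5, byrow = TRUE, dimnames = list(feats, procs)))

Method_2 <- as.data.frame(matrix(c(
  1, 5, 2, 3, 7, 2, 1, 1, 3, 8,
  9, 6, 1, 3, 6, 7, 0, 0, 5, 5,
  9, 7, 2, 6, 6, 2, 3, 4, 4, 7,
  0, 8, 6, 6, 8, 7, 4, 4, 4, 7,
  8, 5, 2, 2, 6, 6, 3, 5, 4, 6
), nrow = 5, byrow = TRUE, dimnames = list(feats, procs)))

# Cross-method correlation: rows are Method_1 processes, columns are Method_2 processes
cross_cor <- cor(Method_1, Method_2)
rownames(cross_cor) <- paste0(procs, "_M1")
colnames(cross_cor) <- paste0(procs, "_M2")

# The diagonal pairs each process with itself across the two methods
per_process <- diag(cross_cor)
names(per_process) <- procs
round(per_process, 2)

# Optional plot, drawn only if corrplot is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cross_cor, method = "circle")
}
```

In this layout every cell is unambiguous: the row names the Method_1 process and the column names the Method_2 process.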

Created on 2024-01-13 with reprex v2.0.2

Best Regards,
Abdul

FJCC · January 13, 2024, 11:16pm

If I understand you correctly, you want to correlate the values obtained for each combination of Feature and Process for the two methods. For example, (Feature_1, A_Process, Method_1) = 9 vs (Feature_1, A_Process, Method_2) = 1.
Here is how I would do that. The idea is to reshape the data so there are three columns, Feature, Process, and Value, and then line up the values from the two methods. I did not use all of your columns, just A - G, but I think the code conveys the idea.

Meth_1 <- read.csv("~/R/Play/Dummy.csv")
Meth_2 <- read.csv("~/R/Play/Meth_2")
Meth_1
#>     Feature A_Process B_Process C_Process D_Process E_Process F_Process
#> 1 Feature_1         9         5         0         5         7         0
#> 2 Feature_2         2         6         2         4         7         7
#> 3 Feature_3         0         0         0         0         0         0
#> 4 Feature_4         2         4         1         1         7         0
#> 5 Feature_5         0         6         0         0         6         7
#>   G_Process
#> 1         3
#> 2         4
#> 3         2
#> 4         2
#> 5         2
library(tidyr)
library(dplyr)
library(ggplot2)

Meth_1_long <- Meth_1 |> pivot_longer(cols = A_Process:G_Process, 
                                      names_to = "Process", values_to = "Meth1")

Meth_2_long <- Meth_2 |> pivot_longer(cols = A_Process:G_Process, 
                                      names_to = "Process", values_to = "Meth2")

Meth_1_long
#> # A tibble: 35 × 3
#>    Feature   Process   Meth1
#>    <chr>     <chr>     <int>
#>  1 Feature_1 A_Process     9
#>  2 Feature_1 B_Process     5
#>  3 Feature_1 C_Process     0
#>  4 Feature_1 D_Process     5
#>  5 Feature_1 E_Process     7
#>  6 Feature_1 F_Process     0
#>  7 Feature_1 G_Process     3
#>  8 Feature_2 A_Process     2
#>  9 Feature_2 B_Process     6
#> 10 Feature_2 C_Process     2
#> # ℹ 25 more rows

AllData <- inner_join(Meth_1_long, Meth_2_long, by = c("Feature", "Process"))
AllData
#> # A tibble: 35 × 4
#>    Feature   Process   Meth1 Meth2
#>    <chr>     <chr>     <int> <int>
#>  1 Feature_1 A_Process     9     1
#>  2 Feature_1 B_Process     5     5
#>  3 Feature_1 C_Process     0     2
#>  4 Feature_1 D_Process     5     3
#>  5 Feature_1 E_Process     7     7
#>  6 Feature_1 F_Process     0     2
#>  7 Feature_1 G_Process     3     1
#>  8 Feature_2 A_Process     2     9
#>  9 Feature_2 B_Process     6     6
#> 10 Feature_2 C_Process     2     1
#> # ℹ 25 more rows
ggplot(AllData, aes(Meth1, Meth2)) + geom_point() +
  geom_smooth(formula = y~x, method = "lm")

FIT <- lm(Meth2 ~ Meth1, data = AllData)
summary(FIT)
#> 
#> Call:
#> lm(formula = Meth2 ~ Meth1, data = AllData)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.6608 -2.2373  0.2333  1.8921  4.7627 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   4.2373     0.6561   6.459 2.52e-07 ***
#> Meth1         0.1059     0.1630   0.649    0.521    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.706 on 33 degrees of freedom
#> Multiple R-squared:  0.01262,    Adjusted R-squared:  -0.0173 
#> F-statistic: 0.4218 on 1 and 33 DF,  p-value: 0.5205

Created on 2024-01-13 with reprex v2.0.2
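
The pooled fit above treats all 35 Feature/Process pairs as one sample. If agreement within each feature is of interest, the same long table can be summarised per group. This is only a sketch, assuming the A to G values shown above; the column names pearson, spearman and n are illustrative, and with 7 points per feature the coefficients are rough indications rather than firm estimates.

```r
library(dplyr)

feats <- paste0("Feature_", 1:5)
procs <- paste0(LETTERS[1:7], "_Process")

# Long-format table holding the same values as AllData (processes A to G)
AllData <- data.frame(
  Feature = rep(feats, each = 7),
  Process = rep(procs, times = 5),
  Meth1 = c(9, 5, 0, 5, 7, 0, 3,
            2, 6, 2, 4, 7, 7, 4,
            0, 0, 0, 0, 0, 0, 2,
            2, 4, 1, 1, 7, 0, 2,
            0, 6, 0, 0, 6, 7, 2),
  Meth2 = c(1, 5, 2, 3, 7, 2, 1,
            9, 6, 1, 3, 6, 7, 0,
            9, 7, 2, 6, 6, 2, 3,
            0, 8, 6, 6, 8, 7, 4,
            8, 5, 2, 2, 6, 6, 3)
)

# One correlation per feature, computed across the 7 processes
per_feature <- AllData |>
  group_by(Feature) |>
  summarise(
    pearson  = cor(Meth1, Meth2),
    spearman = cor(Meth1, Meth2, method = "spearman"),
    n        = n()
  )
per_feature
```

Swapping group_by(Feature) for group_by(Process) gives one value per process instead, although that leaves only 5 points per group.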

abdulkhayum.mahaboob · January 13, 2024, 11:52pm

@FJCC thank you for the input, but I was interested in seeing comparisons at the individual feature-to-process level instead. I tried the code below, but the issue is that there are "Inf" values in the data frame, and the plot uses them.

log_ratios <- log(Method_1 / Method_2)

# Heatmap visualization
library(ggplot2)
library(reshape2)

# Assuming 'log_ratios' is a dataframe where the rownames are features
log_ratios$Feature <- rownames(log_ratios)
log_ratios_melted <- melt(log_ratios, id.vars = "Feature")

# Now plotting
ggplot(log_ratios_melted, aes(variable, Feature, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  theme_minimal() +
  xlab("Process") +
  ylab("Feature") +
  ggtitle("Heatmap of Concordance")

FJCC · January 14, 2024, 12:30am

Yes, log(Method_1 / Method_2) will return Inf if Method_2 is zero and Method_1 is positive, -Inf if Method_1 is zero and Method_2 is positive, and NaN if both values are zero. I don't know enough about your data to suggest what to do about that.
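
Purely to illustrate the three cases, here is a small sketch on made-up vectors (m1 and m2 are hypothetical, not taken from your data). It also shows two transformations that stay finite when zeros are present: a plain difference, and a log ratio with a pseudocount of 1. Whether either is meaningful depends on what the scores represent, so treat them as options to consider rather than recommendations.

```r
# Hypothetical scores chosen to hit every case
m1 <- c(4, 0, 0, 3)
m2 <- c(0, 5, 0, 3)

log_ratio <- log(m1 / m2)
log_ratio
# Inf -Inf NaN 0

# Option 1: difference on the original 0 to 10 scale, always finite
score_diff <- m1 - m2
score_diff
# 4 -5 0 0

# Option 2: log ratio after adding a pseudocount of 1 to both methods
# (the value 1 is an arbitrary assumption and changes the result)
log_ratio_pc <- log((m1 + 1) / (m2 + 1))
all(is.finite(log_ratio_pc))
# TRUE
```

The same expressions work element-wise on the Method_1 and Method_2 data frames, so Method_1 - Method_2 could replace the log ratio in the heatmap code, with the midpoint of the colour scale still at 0.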

abdulkhayum.mahaboob · January 14, 2024, 5:37pm

@FJCC OK, thank you for looking into this.
