How to weight principal componets by variance?

nikos_geo · June 27, 2024, 4:43pm

I'm following the paper New ECOSTRESS and MODIS Land Surface Temperature Data Reveal Fine-Scale Heat Vulnerability in Cities: A Case Study for Los Angeles County, California, and I quote:

In accordance with Kaisers rule, only those PCs that had an eigenvalue greater than one were retained for analysis. The PCs that had eigenvalues greater than one were then rotated using a varimax rotation to improve their interpretation and maximize the dispersion of loadings across PCs. These rotated PC scores were weighted by variance and then used to reconstruct the original observations.

I run PCA on 4 variables using the prcomp library. All variables were normalized to have a mean of zero and a standard deviation of one (z-score) before the PCA.

prc <- prcomp(adpt.pca)

The results are below:

Standard deviations (1, .., p=4):
[1] 1.4053803 1.0682221 0.8730050 0.3488124

Rotation (n x k) = (4 x 4):
                 PC1       PC2         PC3         PC4
income    -0.6183326 0.3817256 -0.18174425 -0.66250989
ndvi       0.1800165 0.7300110  0.65526401  0.07284918
cs_dist    0.3476116 0.5522182 -0.73234784  0.19464803
education -0.6814873 0.1281581 -0.03556312  0.71964281

Then I selected the PCs with eigenvalues > 1 and I performed a varimax rotation, like so:

varimax2 <- varimax(prc$rotation[, 1:2])

The results of the varimax rotation:

$loadings

Loadings:
          PC1    PC2   
income    -0.716  0.122
ndvi      -0.107  0.744
cs_dist    0.115  0.642
education -0.680 -0.137

                PC1  PC2
SS loadings    1.00 1.00
Proportion Var 0.25 0.25
Cumulative Var 0.25 0.50

$rotmat
           [,1]      [,2]
[1,]  0.9269670 0.3751429
[2,] -0.3751429 0.9269670

How can I weight these rotated PC scores by variance in order to reconstruct the original observations?

The complete code:

library(data.table)
library(dplyr)

wd <- "path/"

mydt <- read.table(paste0(wd, "mydt.csv"), sep = ",", header = TRUE)

# glimpse(mydt)

# Z-score normalize
adpt.pca <- as.data.frame(scale(mydt), center = TRUE, scale = TRUE)

prc <- prcomp(adpt.pca)

varimax2 <- varimax(prc$rotation[, 1:2])

Below a sample of the original dataset (before the Z-score standarization) with 20 rows:

dput(mydt)
structure(list(income = c(0.0001063, 0.000106, 6.72e-05, 7.97e-05, 
0.0001197, 4.09e-05, 5.17e-05, 0.0001092, 8.62e-05, 7.27e-05, 
0.0001034, 0.0001159, 7.24e-05, 9.17e-05, 8.06e-05, 0.0001049, 
8.15e-05, 9.05e-05, 0.0001063, 5.99e-05), ndvi = c(0.434779405593872, 
0.519024193286896, 0.484442293643951, 0.358367592096329, 0.613705396652222, 
0.508738815784454, 0.705485105514526, 0.454894632101059, 0.396738857030869, 
0.408085465431213, 0.425091296434402, 0.360570818185806, 0.455742985010147, 
0.44114676117897, 0.498669385910034, 0.404618799686432, 0.51068776845932, 
0.295410215854645, 0.606453955173492, 0.46584877371788), cs_dist = c(1515.64929199219, 
3037.51879882812, 2663.20043945312, 1761.39184570312, 344.697448730469, 
252.047805786133, 5528.3486328125, 2387.2802734375, 2771.0546875, 
877.851745605469, 1342.23034667969, 3318.9130859375, 1075.06188964844, 
5190.70166015625, 739.960021972656, 4005.1572265625, 684.494079589844, 
426.935241699219, 1222.70263671875, 2597.5166015625), education = c(0.0001015, 
9.71e-05, 6.14e-05, 8.47e-05, 9.97e-05, 5.29e-05, 4.74e-05, 0.0001464, 
0.0001042, 7.53e-05, 0.0001143, 9.4e-05, 6.57e-05, 5.52e-05, 
7.98e-05, 9.5e-05, 6.98e-05, 9.64e-05, 0.0001063, 6.43e-05)), row.names = c(NA, 
-20L), class = c("data.table", "data.frame"), na.action = structure(c(`2402` = 2174L, 
`2404` = 2176L), class = "omit"), .internal.selfref = <pointer: 0x0000023dc7801200>)

Note: The results I showed were the outcome when using my complete dataset and not the 20 rows I shared. If needed I can share the entire dataset.

R 4.4.0, RStudio 2024.04.2 Build 764, Windows 11.

nikos_geo · June 30, 2024, 3:37pm

I'm not sure yet if this approach is the correct one, so if anyone has any insights it would be nice to comment.

# z-score normalization
mydt_normalized <- as.data.frame(scale(mydt))

# PCA
pca_result <- PCA(mydt_normalized, graph = FALSE)

# extract PCs with eigenvalues > 1, this step is to identify the desired # of PCs
eigenvalues <- pca_result$eig[, "eigenvalue"]
n_components <- sum(eigenvalues > 1)

# PCA with the selected number of components 
pca_selected <- PCA(mydt_normalized, ncp = n_components, graph = FALSE)

# varimax rotation
rotated_loadings <- varimax(pca_selected$var$coord)

# calculation of rotated PC scores
rotated_scores <- as.matrix(mydt_normalized) %*% rotated_loadings$loadings

# weight the rotated PC scores by variance
variance_weights <- eigenvalues[1:n_components] / sum(eigenvalues[1:n_components])
weighted_scores <- sweep(rotated_scores, 2, variance_weights, "*")

# reconstruction the original observations
reconstructed_data <- weighted_scores %*% t(rotated_loadings$loadings)

# Calculate the mean and standard deviation of the original data
original_means <- colMeans(mydt)
original_sds <- apply(mydt, 2, sd)

# Convert the reconstructed data back to original scale
reconstructed_original <- sweep(reconstructed_data, 2, original_sds, "*")
reconstructed_original <- sweep(reconstructed_original, 2, original_means, "+")

# option to display full numbers
options(scipen = 999)

# comparison of the original vs reconstructed obs
# the original observations
mydt[1, ]
      income      ndvi  cs_dist education
       <num>     <num>    <num>     <num>
1: 0.0001063 0.4347794 1515.649 0.0001015

# the reconstructed observations
income            ndvi          cs_dist          education
  <num>           <num>         <num>            <num>
1: 0.0001071987   0.4290230290  1864.3736031922  0.0001081663

dromano · June 30, 2024, 7:36pm

I think the term "loadings" may be ambiguous, and could represent either unit-length eigenvectors or these scaled by the variances, which I think are the square roots of the eigenvalues. So the "loadings" (and so scores) may already be "weighted by variance" in a sense; dividing by the sum of the variances would — in usual usage of "weighting" — be the only difference between "scaling" and "weighting". You'd want to know whether "loadings" refers to unit-length eigenvectors or variance-scaled eigenvectors by the particular R functions you're applying.

system · September 28, 2024, 7:37pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.