Dear community,
I am trying to perform a PCA on a dataset that contains survey results. The survey was conducted on companies (companies are in rows), and they were asked multiple questions (questions and answers are in columns). Most of the questions followed the pattern "Please choose an answer from a set X of answers, X = {1,2,3,4...}". There are some boolean values, but a good share of the answers have more variation.
What I would like to do is to reduce the dimensions and look for the similarities among the companies. For this purpose I would like to perform a PCA.
The dataset I will be using can be downloaded from: https://www.kaggle.com/jakubdbrowski/datapca
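library(factoextra)   # fviz_nbclust(), eclust(), fviz_pca_var()
library(clustertend)  # hopkins(); assumed source, since it is called with the n argument
datapca <- read.csv2("datapca.csv")
datapca <- datapca[, -1]  # drop the first column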
I drop the first column because it does not contain any information. The dataset was cleaned and prepared beforehand, so I can now perform the PCA.
xxx.pca <- prcomp(datapca, center = TRUE, scale.= TRUE)
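Since the goal is to reduce the dimensions, it helps to check how much variance each component carries before going further. Below is a minimal sketch in base R on made-up data: the `toy` data frame, the `Q1`..`Q6` names and the 80% threshold are assumptions for illustration only, not the survey data or a recommended cut-off.

```r
# Synthetic stand-in for the survey: 40 companies, 6 questions on a 1-5 scale
set.seed(1)
toy <- as.data.frame(matrix(sample(1:5, 40 * 6, replace = TRUE), nrow = 40))
names(toy) <- paste0("Q", 1:6)

toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component
var_explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components that reaches the (hypothetical) 80% threshold
n_keep <- which(cum_var >= 0.8)[1]

print(round(cum_var, 3))
print(n_keep)
```

The same quantities are reported by `summary(xxx.pca)` on the real data.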
Now I would like to look for the number of clusters I could get from my data.
fviz_nbclust(xxx.pca$x, FUNcluster=kmeans, k.max = 8)
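For anyone without factoextra at hand, the idea behind choosing the number of clusters can be sketched in base R. Note that this uses the elbow criterion (total within-cluster sum of squares), which is a different criterion from the average silhouette that `fviz_nbclust` uses by default. The `toy` matrix is synthetic and only stands in for the PCA scores.

```r
# Synthetic stand-in: 50 companies, 6 questions on a 1-5 scale
set.seed(2)
toy <- matrix(sample(1:5, 50 * 6, replace = TRUE), nrow = 50)
scores <- prcomp(toy, center = TRUE, scale. = TRUE)$x

# Total within-cluster sum of squares for k = 1..8 (same range as k.max = 8)
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(scores, centers = k, nstart = 20)$tot.withinss
})

# A clear "elbow" in this curve would suggest a natural number of clusters;
# a smooth decline, as expected for random data like this, suggests none
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```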
It looks like it could be difficult to find clusters in this particular dataset.
hopkins(datapca, n=nrow(xxx.pca$x)-1)
However, I would like to continue so that I go through the whole analytical process. Once I receive the updated data, the results may be better.
So I will create two clusters as suggested.
km1 <- eclust(xxx.pca$x, "kmeans", hc_metric = "euclidean", k = 2)
This is where my question comes in. How can I look at the clusters, determine which variables (loadings) are responsible for the clustering, and characterize the two clusters?
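To make clear what kind of output I have in mind, here is a rough sketch on made-up data, using base R only. It clusters the PCA scores and then ranks the original questions by how far apart the two cluster means are (in standardized units). The `toy` data, the planted difference in `Q1`..`Q3`, and the choice to cluster on the first two components are all assumptions for illustration; this is one possible approach, not necessarily the right one.

```r
# Synthetic survey: 60 companies, 8 questions on a 1-5 scale.
# Planted structure: the first 30 companies answer high on Q1-Q3, the rest low.
set.seed(42)
m <- matrix(sample(1:5, 60 * 8, replace = TRUE), nrow = 60)
m[1:30, 1:3]  <- sample(4:5, 90, replace = TRUE)
m[31:60, 1:3] <- sample(1:2, 90, replace = TRUE)
colnames(m) <- paste0("Q", 1:8)

toy.pca <- prcomp(m, center = TRUE, scale. = TRUE)

# k-means with two clusters on the first two principal components
km <- kmeans(toy.pca$x[, 1:2], centers = 2, nstart = 25)

# Mean of each standardized question within each cluster (2 x 8 matrix)
z <- scale(m)
cluster_means <- apply(z, 2, function(v) tapply(v, km$cluster, mean))

# Questions ranked by the absolute gap between the two cluster means
gap <- abs(cluster_means[1, ] - cluster_means[2, ])
ranked <- sort(gap, decreasing = TRUE)
print(round(ranked, 2))
```

The questions at the top of `ranked` are the ones on which the two clusters differ most, which gives a first description of each cluster in terms of the original answers.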
I would also like to ask whether it is possible to determine the most important loadings, reduce their number (right now there are 150, which makes the graph too complicated) and plot them in a clearer way. Both graphs below are too complicated.
fviz_pca_var(xxx.pca, col.var = "black")
biplot(xxx.pca, showLoadings = TRUE, lab = NULL)
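As an illustration of what "reducing the number of loadings" could look like, here is a minimal base R sketch on synthetic data. It scores each variable by its squared loadings on PC1 and PC2, weighted by the variance of each component, and draws only the top few. The `toy`-style matrix `m`, the `Q1`..`Q10` names and `top_n = 4` are assumptions for illustration only.

```r
# Synthetic stand-in: 60 companies, 10 questions on a 1-5 scale
set.seed(3)
m <- matrix(sample(1:5, 60 * 10, replace = TRUE), nrow = 60)
colnames(m) <- paste0("Q", 1:10)

toy.pca <- prcomp(m, center = TRUE, scale. = TRUE)

# Contribution (in %) of each variable to the plane spanned by PC1 and PC2:
# squared loadings, weighted by the variance (eigenvalue) of each component
eig <- toy.pca$sdev[1:2]^2
contrib <- (toy.pca$rotation[, 1:2]^2 %*% eig) / sum(eig) * 100
contrib <- setNames(as.numeric(contrib), rownames(toy.pca$rotation))

# Keep only the top_n variables (hypothetical cut-off)
top_n <- 4
top_vars <- names(sort(contrib, decreasing = TRUE))[1:top_n]
top_load <- toy.pca$rotation[top_vars, 1:2]

# Plot only the selected loadings as arrows from the origin
plot(0, 0, type = "n", xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "PC1", ylab = "PC2", asp = 1)
arrows(0, 0, top_load[, 1], top_load[, 2], length = 0.1)
text(top_load[, 1], top_load[, 2], labels = top_vars, pos = 3)
```

With factoextra, the `select.var` argument of `fviz_pca_var` appears to serve a similar purpose, for example `fviz_pca_var(xxx.pca, select.var = list(contrib = 20))`, but I am not sure whether that is the recommended way.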
Thank you very much in advance!