Without a reprex (see the FAQ), it's hard to offer much that will be of immediate use.
The big picture is that this is a problem in three parts:
- How does each variable relate to the three categories, and is there any overlap?
- What are the appropriate statistical tools to classify observations into the categories?
- How to code those tools in R?
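To make the three parts concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the category labels, the four binary variables `v1` to `v4`, and the probabilities used to generate them are made up, and multinomial logistic regression via `nnet::multinom` is only one of several tools that might suit your data.

```r
# Hypothetical data: 300 observations, three categories, four binary variables.
set.seed(42)
n <- 300
category <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

# Made-up probability of a 1 on each variable, one row per category.
p <- rbind(A = c(0.9, 0.2, 0.5, 0.1),
           B = c(0.2, 0.8, 0.5, 0.3),
           C = c(0.3, 0.3, 0.5, 0.9))
x <- matrix(rbinom(n * 4, size = 1, prob = p[as.character(category), ]), nrow = n)
colnames(x) <- paste0("v", 1:4)
dat <- data.frame(category, x)

# Part 1: how does each variable relate to the categories?
# Proportion of 1s for each variable within each category.
prop_by_cat <- aggregate(. ~ category, data = dat, FUN = mean)
print(prop_by_cat)

# Chi-squared test of independence between each variable and the category.
p_values <- sapply(colnames(x), function(v) {
  chisq.test(table(dat[[v]], dat$category))$p.value
})
print(round(p_values, 4))

# Parts 2 and 3: one candidate tool, multinomial logistic regression.
library(nnet)
fit <- multinom(category ~ ., data = dat, trace = FALSE)
pred <- predict(fit, newdata = dat)  # predicted category for each row

# In-sample confusion table and accuracy (optimistic; use held-out data in practice).
confusion <- table(observed = dat$category, predicted = pred)
print(confusion)
accuracy <- mean(pred == dat$category)
print(accuracy)
```

The proportions and chi-squared tests speak to the first question (which variables separate the categories and which overlap), while the confusion table gives a rough sense of how well a simple classifier does with them.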
Here's an example of dimensionality reduction for the case where all variables are binary.
# install.packages("rARPACK")
# devtools::install_github("andland/logisticPCA")
library(logisticPCA)
library(ggplot2)
data("house_votes84")
logsvd_model = logisticSVD(house_votes84, k = 2)
logsvd_model
#> 435 rows and 16 columns
#> Rank 2 solution
#>
#> 63.6% of deviance explained
#> 549 iterations to converge
logpca_cv = cv.lpca(house_votes84, ks = 2, ms = 1:10)
plot(logpca_cv) + theme_minimal()
#> Warning in type.convert.default(colnames(x)): 'as.is' should be specified by
#> the caller; using TRUE
#> Warning in type.convert.default(rownames(x)): 'as.is' should be specified by
#> the caller; using TRUE
logpca_model = logisticPCA(house_votes84, k = 2, m = which.min(logpca_cv))
clogpca_model = convexLogisticPCA(house_votes84, k = 2, m = which.min(logpca_cv))
plot(clogpca_model, type = "trace") + theme_minimal()
plot(logsvd_model, type = "trace") + theme_minimal()
party = rownames(house_votes84)
plot(logsvd_model, type = "scores") +
  geom_point(aes(colour = party)) +
  ggtitle("Exponential Family PCA") +
  scale_colour_manual(values = c("blue", "red")) +
  theme_minimal()
Created on 2023-06-29 with reprex v2.0.2