Multidimensional analysis of genotype in R

Without a reprex (see the FAQ), it's hard to offer much that will be of immediate use.

The big picture is that this is a problem in three parts:

  1. How does each variable relate to the three categories, and is there any overlap?
  2. What are the appropriate statistical tools to classify observations into the categories?
  3. How do you code those tools in R?
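
For the first question, a quick cross-tabulation is usually where I'd start. This is only a sketch on made-up data, since I don't have yours: the data frame `toy`, the category column `group`, and the binary variable `marker_a` are all hypothetical names, and the probabilities are invented to give the three categories different rates.

```r
# Hypothetical toy data: 'group' and 'marker_a' are made-up stand-ins
# for a three-level category and one binary variable.
set.seed(42)
toy <- data.frame(
  group    = factor(rep(c("A", "B", "C"), each = 20)),
  marker_a = rbinom(60, size = 1, prob = rep(c(0.1, 0.5, 0.9), each = 20))
)

# Counts of 0/1 within each category
counts <- table(toy$group, toy$marker_a)
counts

# Row proportions: how often the variable is 0 or 1 within each category
props <- prop.table(counts, margin = 1)
round(props, 2)

# Chi-squared test of independence between category and variable
chisq.test(counts)
```

Repeating this for each variable gives a rough sense of which ones separate the categories and which ones overlap, before reaching for anything multivariate.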

Here's an example of dimensionality reduction for the case where all the variables are binary.

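# Install once if needed:
# install.packages("rARPACK")
# devtools::install_github("andland/logisticPCA")
library(logisticPCA)
library(ggplot2)

# Example data shipped with logisticPCA: binary votes, one row per member
data("house_votes84")

# Rank-2 logistic SVD (exponential family PCA for binary data)
logsvd_model = logisticSVD(house_votes84, k = 2)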
logsvd_model
#> 435 rows and 16 columns
#> Rank 2 solution
#> 
#> 63.6% of deviance explained
#> 549 iterations to converge
# Cross-validate the tuning parameter m (candidates 1 to 10) for k = 2
logpca_cv = cv.lpca(house_votes84, ks = 2, ms = 1:10)
plot(logpca_cv) + theme_minimal()
#> Warning in type.convert.default(colnames(x)): 'as.is' should be specified by
#> the caller; using TRUE
#> Warning in type.convert.default(rownames(x)): 'as.is' should be specified by
#> the caller; using TRUE

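# Fit logistic PCA and convex logistic PCA with the best m from cross-validation.
# which.min() returns a position, which equals m here because ms = 1:10.
logpca_model = logisticPCA(house_votes84, k = 2, m = which.min(logpca_cv))
clogpca_model = convexLogisticPCA(house_votes84, k = 2, m = which.min(logpca_cv))

# Trace plots: check convergence of the fitting algorithm
plot(clogpca_model, type = "trace") + theme_minimal()

plot(logsvd_model, type = "trace") + theme_minimal()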

# The row names hold the grouping label; colour the rank-2 scores by it
party = rownames(house_votes84)
plot(logsvd_model, type = "scores") + 
  geom_point(aes(colour = party)) + 
  ggtitle("Exponential Family PCA") + 
  scale_colour_manual(values = c("blue", "red")) +
  theme_minimal()

Created on 2023-06-29 with reprex v2.0.2