hierarchical clustering

Hi. I followed Statology's code for an example of hierarchical clustering.

They conclude with a Ward's method model and an optimal number of 4 clusters.

How does one plot the resulting dendrogram for this final model?

library(factoextra)
library(cluster)

#load data
df <- USArrests

#remove rows with missing values
df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1
df <- scale(df)

#define linkage methods
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

#function to compute agglomerative coefficient
ac <- function(x) {
agnes(df, method = x)$ac
}

#calculate agglomerative coefficient for each clustering linkage method
sapply(m, ac)

#perform hierarchical clustering using Ward's minimum variance
clust <- agnes(df, method = "ward")

#produce dendrogram
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")

#calculate gap statistic for each number of clusters (up to 10 clusters)
gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)

#produce plot of clusters vs. gap statistic
fviz_gap_stat(gap_stat)

#compute distance matrix
d <- dist(df, method = "euclidean")

#perform hierarchical clustering using Ward's method
final_clust <- hclust(d, method = "ward.D2" )

#cut the dendrogram into 4 clusters
groups <- cutree(final_clust, k=4)

#number of members in each cluster

table(groups)

#append cluster labels to original data
final_data <- cbind(USArrests, cluster = groups)

#display first six rows of final data
head(final_data)

#find mean values for each cluster
aggregate(final_data, by=list(cluster=final_data$cluster), mean)

Probably not the ideal method, but here is a manual approach: run table(cutree(final_clust, h = xxx)), changing xxx until you get the number of clusters you decided on. Here, h = 5 gives you 4 clusters. You can then plot the dendrogram and mark the cut height directly with:

plot(final_clust, cex = 0.6, hang = -1, main = "Dendrogram")
abline(h = 5, lty = "dashed", col="grey")
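
If you would rather not find h by trial and error, a minimal sketch (not from the tutorial) is to derive a cut height from the merge heights stored in the hclust object: any height between the (k-1)th and kth largest merge leaves exactly k clusters. The names k, hs, h_cut, groups_k and groups_h are just illustrative, and the first two lines only rebuild the model from the question so the block runs on its own. rect.hclust() from base stats then boxes the k clusters without needing h at all.

```r
# Rebuild the model from the question so this block is self-contained
df <- scale(na.omit(USArrests))
final_clust <- hclust(dist(df, method = "euclidean"), method = "ward.D2")

k <- 4  # number of clusters chosen in the question

# Merge heights, largest first. A cut placed between the (k-1)th and
# kth largest heights undoes exactly k-1 merges, leaving k clusters.
hs <- sort(final_clust$height, decreasing = TRUE)
h_cut <- mean(hs[(k - 1):k])

groups_k <- cutree(final_clust, k = k)      # cut by cluster count
groups_h <- cutree(final_clust, h = h_cut)  # cut by derived height

plot(final_clust, cex = 0.6, hang = -1, main = "Dendrogram")
abline(h = h_cut, lty = "dashed", col = "grey")
rect.hclust(final_clust, k = k, border = 2:5)  # one box per cluster
```

Both cutree() calls should give the same assignment, so the dashed line and the boxes describe the same 4-cluster solution.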

Or the nicer-looking ggdendro version:

library(ggplot2)
library(ggdendro)

ggdendrogram(final_clust) +
  geom_hline(yintercept = 5,
             linetype = "dashed", color = "grey")

Or, next level, rebuilding everything (but with lots of manual adjustments needed):


library(ggplot2)
library(dplyr)

ddata <- dendro_data(final_clust)
angle <- 90  # axis-text rotation; not defined in the snippet as posted, 90 is assumed

ggplot() +
  geom_segment(data = segment(ddata),
               aes(x = x, y = y, xend = xend, yend = yend)) +
  # colour each leaf label by its cluster from cutree()
  geom_text(data = ddata$labels |>
              mutate(group = as.factor(groups[as.character(label)])),
            aes(x = x, y = y, label = label, color = group),
            angle = 90, hjust = 1, vjust = 0.5, size = 2.5) +
  scale_y_continuous(limits = c(-10, 20)) +
  theme_dendro() +
  theme(axis.text.x = element_text(angle = angle,
                                   hjust = 1, vjust = 0.5),
        axis.text.y = element_text(angle = angle,
                                   hjust = 1)) +
  geom_hline(yintercept = 5,
             linetype = "dashed", color = "grey")
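
For comparison, here is a rough base-R sketch of the same idea (leaf labels coloured by cluster) that avoids the manual ggplot adjustments. It is an alternative, not what the tutorial does: it converts the hclust object with as.dendrogram() and uses dendrapply() to attach a nodePar entry to each leaf. The helper color_leaf and the object names dend and dend_col are made up for illustration, and the colours are simply palette entries 2 to 5.

```r
# Rebuild the model and cluster labels so this block is self-contained
df <- scale(na.omit(USArrests))
final_clust <- hclust(dist(df, method = "euclidean"), method = "ward.D2")
groups <- cutree(final_clust, k = 4)  # named vector: state -> cluster id

dend <- as.dendrogram(final_clust)

# Hypothetical helper: for leaf nodes, set the label colour from the
# cluster id (+1 so cluster 1 is not drawn in black) and hide the point.
color_leaf <- function(node) {
  if (is.leaf(node)) {
    lab <- attr(node, "label")
    attr(node, "nodePar") <- list(lab.col = groups[[lab]] + 1L,
                                  lab.cex = 0.6, pch = NA)
  }
  node
}

dend_col <- dendrapply(dend, color_leaf)
plot(dend_col, main = "Dendrogram")
```

The trade-off is less control over styling than the ggplot version, but no extra packages are needed.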

Thanks to both of you.
