Hi there,
I have a question that I'm hoping to get some help with (I'm using this for teaching purposes, and one of my students brought it to my attention). I have produced an elbow plot for k-means clustering using the total within-cluster sum of squares with the fviz_nbclust function from the factoextra package, as below. As you can see, the total within-cluster sum of squares appears to increase at k = 5.
fviz_nbclust(df, kmeans, method = "wss", diss = NULL) +
  labs(subtitle = "Elbow method")
However, if I run the code to see what the total within cluster sum of squares is for each number of clusters, k, we see it decreases with increasing k and does not show an increase at k = 5.
> k1 <- kmeans(df, centers = 1, nstart = 25)
> k2 <- kmeans(df, centers = 2, nstart = 25)
> k3 <- kmeans(df, centers = 3, nstart = 25)
> k4 <- kmeans(df, centers = 4, nstart = 25)
> k5 <- kmeans(df, centers = 5, nstart = 25)
> k6 <- kmeans(df, centers = 6, nstart = 25)
> k7 <- kmeans(df, centers = 7, nstart = 25)
> k8 <- kmeans(df, centers = 8, nstart = 25)
> k9 <- kmeans(df, centers = 9, nstart = 25)
> k10 <- kmeans(df, centers = 10, nstart = 25)
> k1$tot.withinss
[1] 161588
> k2$tot.withinss
[1] 56691.25
> k3$tot.withinss
[1] 29190.08
> k4$tot.withinss
[1] 11436.23
> k5$tot.withinss
[1] 8061.152
> k6$tot.withinss
[1] 5586.169
> k7$tot.withinss
[1] 3943.596
> k8$tot.withinss
[1] 3372.4
> k9$tot.withinss
[1] 2466.881
> k10$tot.withinss
[1] 2184.007
So I tried plotting the total within-cluster sum of squares in two other ways and got two different results. The first plot (code immediately below) is different again, with a small peak at k = 6. The second plot (code after that) matches the tot.withinss values returned by the kmeans function above.
library(ggplot2)
# Create function to compute total within sum of squares for a given k
# (note: kmeans is called here without nstart, so it uses the default)
kmean_withinss <- function(k) {
  cluster <- kmeans(df, k)
  return(cluster$tot.withinss)
}
# Set maximum number of clusters
max_k <- 10
# Run algorithm over a range of k values (here 1 to 10)
wss <- sapply(1:max_k, kmean_withinss)
elbow <- data.frame(k = 1:max_k, wss = wss)
# Plot number of clusters v total within ss
ggplot(elbow, aes(x = k, y = wss)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(1, max_k, by = 1))
# Initialize total within sum of squares error: wss
wss <- 0
# For 1 to 10 cluster centers
for (i in 1:10) {
  km.out <- kmeans(df, centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}
# Plot number of clusters v total within ss
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Total within cluster sum of squares")
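In case it helps anyone reproduce this, here is a minimal self-contained sketch that puts the two approaches side by side. The only difference I can see between my two code blocks above is the nstart argument, so that is what it varies. It is only a sketch under assumptions: I can't share my data, so the scaled numeric columns of the built-in iris data stand in for df, and the names df_demo and wss_for and the seed value are made up for illustration.

```r
# Stand-in data (assumption): scaled numeric iris columns, NOT the original df
df_demo <- scale(iris[, 1:4])
max_k <- 10

# Hypothetical helper: total within-cluster sum of squares for k = 1..max_k,
# using the given number of random starts for every k
wss_for <- function(data, nstart) {
  sapply(seq_len(max_k), function(k) {
    kmeans(data, centers = k, nstart = nstart)$tot.withinss
  })
}

set.seed(123)  # arbitrary seed so the run can be repeated
wss_single <- wss_for(df_demo, nstart = 1)   # kmeans default: one random start
wss_multi  <- wss_for(df_demo, nstart = 25)  # best of 25 random starts

# Side-by-side table of the two curves
comparison <- data.frame(k = seq_len(max_k), wss_single, wss_multi)
print(comparison)

# A positive difference marks a step where the curve rises from k - 1 to k
print(diff(comparison$wss_single))
print(diff(comparison$wss_multi))
```

Plotting the two columns of comparison against k should show whether a bump like the ones in my plots appears with one setting and not the other on a given run.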
Can someone please help me understand what is going on? I would expect all of these plots to be built from the same values, so why do they differ? Thank you in advance.