fviz_nbclust with kmeans, method = "wss": plot does not appear to match tot.withinss for each k

Hi there,
I have a question that I'm hoping to get some help with (I'm using this for teaching purposes, and one of my students brought it to my attention). I have produced an elbow plot for kmeans clustering using the total within-cluster sum of squares with the fviz_nbclust function from the factoextra package, as below. As you can see, the total within-cluster sum of squares appears to increase at k = 5.

fviz_nbclust(df, kmeans, method = "wss", diss=NULL) +
labs(subtitle = "Elbow method")


However, if I run kmeans directly and look at the total within-cluster sum of squares for each number of clusters, k, it decreases steadily with increasing k and does not show an increase at k = 5.

> k1 <- kmeans(df, centers = 1, nstart = 25)
> k2 <- kmeans(df, centers = 2, nstart = 25)
> k3 <- kmeans(df, centers = 3, nstart = 25)
> k4 <- kmeans(df, centers = 4, nstart = 25)
> k5 <- kmeans(df, centers = 5, nstart = 25)
> k6 <- kmeans(df, centers = 6, nstart = 25)
> k7 <- kmeans(df, centers = 7, nstart = 25)
> k8 <- kmeans(df, centers = 8, nstart = 25)
> k9 <- kmeans(df, centers = 9, nstart = 25)
> k10 <- kmeans(df, centers = 10, nstart = 25)
> k1$tot.withinss
[1] 161588
> k2$tot.withinss
[1] 56691.25
> k3$tot.withinss
[1] 29190.08
> k4$tot.withinss
[1] 11436.23
> k5$tot.withinss
[1] 8061.152
> k6$tot.withinss
[1] 5586.169
> k7$tot.withinss
[1] 3943.596
> k8$tot.withinss
[1] 3372.4
> k9$tot.withinss
[1] 2466.881
> k10$tot.withinss
[1] 2184.007
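
For anyone who wants to reproduce this check without typing out ten separate calls, here is a minimal sketch of the same idea. Since df itself is not shown in this post, the built-in USArrests data (scaled) stands in for it; the names df_demo, ks and wss_by_k and the seed are illustrative assumptions, not part of the original analysis.

```r
# Stand-in data (assumption): the original df is not shown in the post.
df_demo <- scale(USArrests)

set.seed(123)  # arbitrary seed, only so the run is repeatable
ks <- 1:10

# Total within-cluster sum of squares for each k, using 25 random starts as above
wss_by_k <- sapply(ks, function(k) {
  kmeans(df_demo, centers = k, nstart = 25)$tot.withinss
})

# One row per k; 'change' is the difference from the previous k
# (a negative value means the total within SS went down)
wss_table <- data.frame(k = ks,
                        tot.withinss = wss_by_k,
                        change = c(NA, diff(wss_by_k)))
print(wss_table)
```

A positive value in the change column would flag the kind of increase seen in the fviz_nbclust plot.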

So I tried plotting the elbow curve in two other ways and got a different result from each. The first (code just below) shows a small peak at k = 6. The second (code underneath that) matches the total within sum of squares values returned by the kmeans function above.

# Create function to compute total within sum of squares
kmean_withinss <- function(k) {
  cluster <- kmeans(df, k)
  return(cluster$tot.withinss)
}
# Set maximum number of clusters
max_k <- 10
# Run algorithm over a range of k values (here set to 10)
wss <- sapply(1:max_k, kmean_withinss)
elbow <- data.frame(1:max_k, wss)
# Plot number of clusters v total within ss
ggplot(elbow, aes(x = 1:max_k, y = wss)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(1, 10, by = 1))

# Initialize total within sum of squares error: wss
wss <- 0
# For 1 to 10 cluster centers
for (i in 1:10) {
  km.out <- kmeans(df, centers = i, nstart=20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}
#plot number of clusters v total within ss
plot(1:10, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Total within cluster sum of squares")
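
One difference between the snippets that may be worth checking: the versions that match use nstart = 20 or 25, while the kmean_withinss function calls kmeans with the default single random start. The sketch below is only a way to test that idea, not a confirmed explanation. It uses the built-in USArrests data as a stand-in because df is not shown, and wss_for, df_demo and the seeds are hypothetical names and values chosen for illustration.

```r
# Stand-in data (assumption): the original df is not shown in the post.
df_demo <- scale(USArrests)
ks <- 1:10

# Total within-cluster SS for k = 1..10 with a given number of random starts
wss_for <- function(nstart, seed) {
  set.seed(seed)  # arbitrary seeds, only so the runs are repeatable
  sapply(ks, function(k) {
    kmeans(df_demo, centers = k, nstart = nstart)$tot.withinss
  })
}

# Two runs with a single random start, two runs with 25 random starts
comparison <- data.frame(
  k        = ks,
  single_a = wss_for(nstart = 1,  seed = 1),
  single_b = wss_for(nstart = 1,  seed = 2),
  multi_a  = wss_for(nstart = 25, seed = 1),
  multi_b  = wss_for(nstart = 25, seed = 2)
)
print(round(comparison, 2))

# TRUE in a column marks a k where that curve went up compared with k - 1
went_up <- sapply(comparison[-1], function(w) c(FALSE, diff(w) > 0))
print(went_up)
```

If the single-start columns disagree with each other or show increases while the 25-start columns do not, that would point to random initialisation rather than a different formula. The factoextra documentation describes the `...` argument of fviz_nbclust as further arguments for FUNcluster, so adding nstart = 25 to the fviz_nbclust call may be a quick way to try this on the original plot.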

Can someone please help me understand what is going on here? I would expect all of these approaches to be plotting the same quantity, so why do they give different plots? Thank you in advance.
