I have a data frame with columns 'x' and 'y' holding the x/y coordinates of a scatterplot I've made using ggplot2. I'm looking for some way to ask, "how many clusters exist here?". I understand that some user input may be required to define what counts as a 'cluster'.
I have had some success using Seurat, because it contains a function to label clusters. However, it works more like finding the clusters that correspond to a vector of labels provided by the user (e.g., I provide 5 unique labels, so it just goes and finds 5 clusters).
Min Reprex:
Seurat's LabelClusters function is very useful for labeling clusters starting solely from X/Y coordinates:
However, I need to detect the total number of clusters from these data without providing labels. By this I mean first detecting how many clusters exist. Maybe here we would see 5 clusters instead of 3:
I'm not aware of a way to automatically choose the number of clusters. kmeans requires that you specify the number of clusters. You can try a range of cluster counts and compare the within-cluster and between-cluster sums of squares, then choose the count that strikes the desired balance. Or do a hierarchical clustering and choose from those results.
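library(tidyverse)

# PCA on the numeric columns (used later only to get 2-D coordinates for plotting)
pc <- iris %>% select(where(is.numeric)) %>% princomp()

# k-means for k = 1..10, one fit per k, named by the number of clusters
# (kmeans starts from random centers, so the totals printed below can differ between runs)
km <- iris %>% select(where(is.numeric)) %>% list() %>% map2(1:10, kmeans) %>% set_names(seq_along(.))

# sum of squares analysis: total within-cluster sum of squares for each k
clus <- km %>% map_dfr(~.["tot.withinss"], .id = "num_clusters")
print(clus)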
#> # A tibble: 10 x 2
#>    num_clusters tot.withinss
#>    <chr>               <dbl>
#>  1 1                   681. 
#>  2 2                   152. 
#>  3 3                    78.9
#>  4 4                    71.4
#>  5 5                    49.8
#>  6 6                    39.4
#>  7 7                    36.9
#>  8 8                    33.1
#>  9 9                    32.3
#> 10 10                   27.0
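One common way to read a table like this is the "elbow" heuristic: look for the number of clusters after which adding another cluster stops reducing the within-cluster sum of squares by much. Below is a minimal base-R sketch of that idea, not a definitive method. The seed, the `nstart = 25` setting, and the names `num`, `ks`, `wss`, and `rel_drop` are my own choices for illustration, and the final choice of k is still a judgment call.

```r
# Sketch: elbow heuristic on the numeric iris columns (base R only).
set.seed(42)  # arbitrary seed, only so the run is reproducible

num <- iris[, sapply(iris, is.numeric)]
ks <- 1:10

# nstart = 25 reruns k-means from 25 random starts and keeps the best fit,
# which makes the curve less sensitive to unlucky starting centers
wss <- sapply(ks, function(k) kmeans(num, centers = k, nstart = 25)$tot.withinss)

# relative drop in within-cluster SS gained by going from k - 1 to k clusters
rel_drop <- -diff(wss) / wss[-length(wss)]
print(data.frame(k = ks[-1], wss = wss[-1], rel_drop = round(rel_drop, 2)))

# the "elbow" is where the curve flattens out
plot(ks, wss, type = "b",
     xlab = "number of clusters", ylab = "total within-cluster SS")
```

With a single cluster the within-cluster sum of squares equals the total sum of squares, which is why the first row of the table above is the largest value.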
# add principal components and cluster ids (here from the 4-cluster fit) to the data frame
df <- iris %>%
  as_tibble() %>%
  bind_cols(as_tibble(pc$scores), cluster = km[[4]]$cluster) %>%
  mutate(cluster = as.factor(cluster))

# plot the first two principal components, colored by cluster
df %>%
  ggplot() +
  aes(Comp.1, Comp.2, color = cluster) +
  geom_point()
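For the hierarchical clustering route mentioned above, here is a minimal sketch using base R's `hclust()` and `cutree()` on x/y coordinates. The data are simulated (three made-up blobs standing in for the scatterplot), and the "largest jump in merge height" rule used to suggest a cluster count is just one heuristic I'm assuming for illustration. It is not something Seurat or kmeans does, and it can mislead on less well-separated data.

```r
# Sketch only: simulated x/y data standing in for the scatterplot coordinates.
set.seed(1)  # arbitrary seed so the simulated points are reproducible
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))  # made-up blob centers
xy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(50, centers[i, 1], 0.5),
        y = rnorm(50, centers[i, 2], 0.5))
}))

# agglomerative clustering on Euclidean distances with Ward linkage
hc <- hclust(dist(xy), method = "ward.D2")

# Heuristic: cut the tree where consecutive merge heights jump the most.
# After i merges there are n - i clusters left.
i <- which.max(diff(hc$height))
k_gap <- nrow(xy) - i

# assign every point to one of k_gap clusters
cl <- cutree(hc, k = k_gap)
print(table(cl))

# inspect the dendrogram; the boxes mark the suggested clusters
plot(hc, labels = FALSE)
rect.hclust(hc, k = k_gap)
```

To use this on real data, replace `xy` with the two coordinate columns, e.g. `as.matrix(my_df[, c("x", "y")])`, where `my_df` is a hypothetical name for the data frame from the question. Looking at the dendrogram before trusting `k_gap` is worthwhile, since what counts as a cluster still depends on where you decide to cut.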