k-means clustering in high dimensions – how to find out which variables k-means used for clustering

thatstoni · September 27, 2022, 2:33pm

Hey everyone,

I have a biological data set with 38 columns and 158 rows. Each row represents one human cell with 38 measured variables. My goal is to find possible clusters among my data points. To find the optimal number of clusters I used the silhouette method, which gave 9. Running k-means with 9 clusters works fine. But how do I figure out which variables k-means used to assign my cells to each cluster?

Thanks for your help.

Best,

Toni

nirgrahamuk · September 27, 2022, 3:32pm

By definition, k-means considers every variable that makes up an observation. If you asked it to cluster over 38 columns, then the answer is all 38 columns.
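One way to check this on a fitted model is to look at the cluster centers, which have one column per input variable. A minimal sketch, assuming made-up random numbers stand in for the real 158 x 38 table (`toy` and `km` are illustrative names, not from the original data):

```r
# Hypothetical stand-in for the real data: 158 cells x 38 variables
set.seed(1)
toy <- as.data.frame(matrix(rnorm(158 * 38), nrow = 158, ncol = 38))

# Fit k-means with 9 clusters on all columns
km <- kmeans(toy, centers = 9, iter.max = 50)

# The centers matrix has one row per cluster and one column per variable,
# so every column passed to kmeans() takes part in the distance calculation
dim(km$centers)
colnames(km$centers)
```

If some columns should not influence the clustering, they have to be dropped before calling `kmeans()`.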

thatstoni · September 28, 2022, 7:38am

Thanks for the answer, but it doesn't answer my question the way I hoped… I want to find out in which variables my clusters differ. So how can I produce a dissimilarity / similarity matrix in R that compares my clusters with one another?

nirgrahamuk · September 28, 2022, 9:00am

Maybe this sort of thing:


library(tidyverse)

# start example prep

df_ <- data.frame(x=c(1,1.1,5,5.5,3.3),
                  y=c(2.2,2,4.5,4,3.3))

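# kmeans() starts from random centers; fix the seed so the result is reproducible
set.seed(42)
(my_k <- kmeans(df_, centers = 2))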

df_$cluster_id <- factor(my_k$cluster)

df_

# end example prep

# plot the example
ggplot(data=df_,
       mapping=aes(x=x,y=y,
                   color=cluster_id)) + geom_point(size=5)


# dissimilarity

library(cluster)
(overall_dissimilarity <- daisy(df_ |> select(-cluster_id)))

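# pairwise dissimilarities between the observations within each cluster
# (pick() needs dplyr >= 1.1.0; it replaces the deprecated cur_data())
(per_cluster_dissimilarity <-
  df_ |>
  group_by(cluster_id) |>
  summarise(d = list(daisy(pick(everything())))))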

pull(per_cluster_dissimilarity,d)
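
To compare the clusters with one another, rather than the observations inside each cluster, one option is to look at the cluster centers variable by variable. A minimal sketch, assuming the variables are scaled first and using a made-up `toy` data frame (all names here are illustrative); it is one possible approach, not the only one:

```r
# Hypothetical toy data: two groups that differ in x and y but hardly in z
toy <- data.frame(
  x = c(1, 1.1, 0.9, 5, 5.5, 5.2),
  y = c(2.2, 2, 2.1, 4.5, 4, 4.2),
  z = c(3, 3.2, 2.9, 3.1, 3, 3.1)
)

# Put all variables on a common scale so their centers are comparable
scaled <- scale(toy)

set.seed(42)
km <- kmeans(scaled, centers = 2, nstart = 10)

# One row per cluster, one column per variable
centers <- km$centers

# Range of the cluster centers per variable:
# a large value means the clusters sit far apart in that variable
center_range <- apply(centers, 2, function(v) diff(range(v)))
sort(center_range, decreasing = TRUE)

# Dissimilarity matrix between clusters (Euclidean distance between centers)
dist(centers)
```

With 9 clusters and 38 variables the same code gives a 9 x 9 distance matrix and 38 ranges, and the variables with the largest ranges are the first candidates for what separates the clusters.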
