Tidymodels KKNN -- how do I return which categories belong to which clusters?

pathos · April 12, 2022, 11:31am

I've been exploring here:

The video shows mostly continuous predictors. However, I would like to:

1. Use categorical predictors
2. Return which categories (or columns, see explanation below) exhibit a high level of 'belonging' to which clusters

Side note: I think UMAP would also work (though in Tidymodels it is only available as step_umap() at the moment)

So to clarify, let's take the following scenario:

1. One-hot encode a bunch of categorical variables
2. Run KNN (or UMAP)
3. See if more than X% (e.g. 90%, possibly tuned) of the values from one dummy/category belong to one cluster
4. Repeat step 3 for the rest of the clusters as a multinomial outcome (bonus: tune the optimal number of clusters)

Essentially, I would like KKNN to return which columns/categories belong to which clusters. Would that be possible? (Euclidean distance can be problematic in high dimensions, so I'm also considering Pearson's or chi-square distance, but I'm unsure how to implement the decision-making process for the boundaries.)
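
For readers following along, steps 1 to 3 of the scenario above can be sketched in base R. This is only an illustration under stated assumptions, not something from the thread: the data frame and its column names are made up, and because KNN is a supervised method that does not assign cluster IDs on its own, the sketch swaps in stats::kmeans() to get cluster labels. The 90% threshold and the choice of 3 clusters are placeholders.

```r
# Hypothetical toy data: two categorical predictors (names are invented).
set.seed(42)
df <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), 200, replace = TRUE)),
  size   = factor(sample(c("small", "large"), 200, replace = TRUE))
)

# Step 1: one-hot encode. Full (non-reduced) contrasts keep one column per level.
dummies <- model.matrix(
  ~ . - 1, data = df,
  contrasts.arg = lapply(df, contrasts, contrasts = FALSE)
)

# Step 2: cluster the encoded rows. ASSUMPTION: k-means stands in for the
# clustering step, since KNN itself does not produce cluster assignments.
k <- 3
km <- kmeans(dummies, centers = k, nstart = 10)
cluster <- km$cluster

# Step 3: for each dummy column, the share of its rows that fall in each cluster.
share <- t(apply(dummies, 2, function(d) {
  prop.table(table(factor(cluster[d == 1], levels = seq_len(k))))
}))

# Flag categories whose largest share reaches the (placeholder) threshold.
threshold <- 0.9
result <- data.frame(
  category        = rownames(share),
  top_cluster     = apply(share, 1, which.max),
  top_share       = apply(share, 1, max),
  above_threshold = apply(share, 1, max) >= threshold
)
print(result)
```

Each row of share sums to 1, so top_share reads directly as "the fraction of this category's rows that sit in its most common cluster". Step 4 would then amount to repeating the run over different values of k and threshold.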

nirgrahamuk · April 13, 2022, 11:23am

It seems like you are asking how to calculate the correlation of a categorical variable with a cluster ID. That would seem to me to be relatively straightforward.

pathos · April 13, 2022, 1:31pm

Ah yes, that's one way to look at it. How can I extract the correlation of each of the categorical variables with each of the clusters? Would this also be doable with step_knn() from tidymodels?

nirgrahamuk · April 13, 2022, 1:48pm

The only ingredients you would need are your original data and the cluster assignment of the KNN on that data.
Then I think the cor() function with the Spearman method would do it.
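
A minimal sketch of that suggestion, assuming you already have a cluster ID for every row. The data here are simulated and the names (colour, cluster) are hypothetical. Because both the category and the cluster ID are nominal, the sketch turns each into 0/1 indicator columns before calling cor(); that conversion is an assumption of this example, not something stated in the reply.

```r
# Hypothetical data: one categorical variable plus a cluster ID per row.
set.seed(1)
n <- 150
colour <- factor(
  sample(c("red", "green", "blue"), n, replace = TRUE),
  levels = c("red", "green", "blue")
)
# Simulated cluster IDs that follow the category about 80% of the time.
cluster <- ifelse(runif(n) < 0.8,
                  as.integer(colour),
                  sample(1:3, n, replace = TRUE))

# One indicator column per category and per cluster.
cat_ind  <- model.matrix(~ colour - 1)
clus_ind <- model.matrix(~ factor(cluster, levels = 1:3) - 1)
colnames(clus_ind) <- paste0("cluster_", 1:3)

# Rows are categories, columns are clusters.
cor_mat <- cor(cat_ind, clus_ind, method = "spearman")
print(round(cor_mat, 2))
```

For 0/1 indicators the Spearman and Pearson coefficients coincide (both equal the phi coefficient), so the choice of method matters less here than it would for ordered or continuous variables. Large positive entries mark categories that concentrate in a given cluster.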
