I've been exploring here:
The video shows mostly continuous predictors. However, I would like to:
- Use categorical predictors
- Return which categories (or columns, see explanation below) exhibit a high level of 'belonging' to which clusters
Side note: I think UMAP would also work (though in Tidymodels it is only available as step_umap() at the moment)
So to clarify, let's take the following scenario:
- One-hot encode a bunch of categorical variables
- Run KNN (or UMAP)
- Check whether more than X% (e.g. 90% -- tune?) of the values from one dummy/category belong to one cluster
- Repeat the previous step for the rest of the clusters, as a multinomial outcome (bonus: tune the optimal number of clusters)
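To make the threshold check in the scenario above concrete, here is a minimal sketch in plain Python (not Tidymodels) of what I have in mind. It is only an illustration under assumptions: the cluster labels are hard-coded stand-ins for whatever the clustering step would return, and the names one_hot and dominant_clusters, as well as the toy data, are hypothetical and made up for this example.

```python
from collections import Counter


def one_hot(rows):
    """Turn a list of {variable: category} dicts into 0/1 dummy columns."""
    columns = sorted({f"{var}={cat}" for row in rows for var, cat in row.items()})
    encoded = []
    for row in rows:
        active = {f"{var}={cat}" for var, cat in row.items()}
        encoded.append([1 if col in active else 0 for col in columns])
    return columns, encoded


def dominant_clusters(columns, encoded, clusters, threshold=0.9):
    """For each dummy column, report (cluster, share) when more than
    `threshold` of the rows where the dummy equals 1 sit in one cluster."""
    result = {}
    for j, col in enumerate(columns):
        # Count cluster labels only over rows where this dummy is "on".
        counts = Counter(c for row, c in zip(encoded, clusters) if row[j] == 1)
        total = sum(counts.values())
        if total == 0:
            continue
        cluster, n = counts.most_common(1)[0]
        share = n / total
        if share > threshold:
            result[col] = (cluster, share)
    return result


# Toy data, invented purely for illustration.
rows = [
    {"color": "red", "size": "small"},
    {"color": "red", "size": "small"},
    {"color": "red", "size": "large"},
    {"color": "blue", "size": "large"},
    {"color": "blue", "size": "large"},
    {"color": "blue", "size": "small"},
]
# Assumption: these labels would come from the clustering step.
clusters = [0, 0, 0, 1, 1, 1]

columns, encoded = one_hot(rows)
assignments = dominant_clusters(columns, encoded, clusters, threshold=0.9)
for col, (cluster, share) in sorted(assignments.items()):
    print(f"{col} -> cluster {cluster} ({share:.0%})")
```

With these toy labels, only the two color dummies pass the 90% cutoff. Each size dummy is split two-to-one across the clusters, so it is left unassigned. Lowering the threshold (the X% I would like to tune) changes which columns get reported.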
Essentially, I would like KKNN to return which columns/categories belong to which clusters. Euclidean distance can be problematic in high dimensions, so I'm also considering Pearson's or chi-square distance, but I'm unsure how to implement the decision-making process for the boundaries. Would that be possible?