I am working on a clustering analysis where my data is a mix of bool and numeric. From research it looks like one should not do a simple kmeans with mixed data types since the extreme end of scaled data will make the bools of 1/0 dominate.

The general gist of what I found online was to use a gower difference matrix which I did using `cluster::daisy()`

. But, nevertheless, both on my actual data set and on the diamonds example below, I get perfect separation in clusters based on the booleans. This must be 'wrong' since the goal of using a gower matrix was to be able to cluster on mixed data types to avoid this kind of problem. So I'm hoping someone can spot where I might have misunderstood or offer some pointers.

My Rmd script to reproduce:

```
knitr::opts_chunk$set(echo = TRUE)
pacman::p_load(tidyverse, cluster, dbscan)
```

Get a shortened version of diamonds with a mix of bool and numeric data.

```
my_diamonds <- diamonds |>
sample_n(20000) |>
mutate(
is_premium = ifelse(cut == 'Premium', 1, 0),
is_color_def = ifelse(color %in% c('D', 'E', 'F'), 1, 0)
) |>
select(carat, is_premium, is_color_def, depth:x)
```

Create a gower difference matrix using cluster::daisy(). Specify the bool fields in `type = list(symm = c('is_premium', 'is_color_def')`

.

As I understand it, since I'm calculating a difference matrix I don't have to scale or transform the data,

```
# create a gower distance matrix since we have mixed data types (bool and numeric)
gower_dist <- daisy(my_diamonds, metric = "gower", stand = T, type = list(symm = c('is_premium', 'is_color_def')))
gower_dist |> summary()
199990000 dissimilarities, summarized :
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.1143 0.1967 0.2042 0.2895 0.6926
Metric : mixed ; Types = I, S, S, I, I, I, I
Number of objects : 20000
```

See if this plot helps choose values for eps in dbscan

```
# Determine optimal eps value to start with
dbscan::kNNdistplot(gower_dist, k = 100) # no idea what to put for K
# no kink or 'knee'
```

Run dbscan() using the difference matrix.

```
# Perform DBSCAN clustering
dbscan_result <- dbscan(gower_dist, eps = 0.02, minPts = 10) # eps from median of summary(gower_dist)
# Get cluster labels
cluster_labels <- dbscan_result$cluster
# Get noise points (not assigned to any cluster)
noise_points <- which(cluster_labels == 0)
# Get number of clusters
num_clusters <- max(cluster_labels)
cat("Number of Clusters:", num_clusters, "\n")
cluster_labels |> table()
# Print the results
my_diamonds$cluster <- cluster_labels
my_diamonds %>%
group_by(cluster) %>%
summarize(count = n(), across(everything(), list(avg = ~ mean(., na.rm = TRUE)), .names = "{.col}_{.fn}")) %>%
select(count, everything())
```

Results print out as:

```
Number of Clusters: 4
cluster_labels
0 1 2 3 4
197 7273 7530 2736 2264
```

And the data frame with average of each var per cluster. You can see that the bool vars highlighted are 1 or 0 indicating perfect separation:

Have I misunderstood how to use daisy with mixed data types? Must I transform my difference matrix in someway?

How can I get around this perfect separation of clusters that the boolean vars create?

[edit]

I tried adding weights to the bools when I create the dissimilarity matrix like so:

```
weights <- c(1, 0.05, 0.05, 1,1,1,1) # bools weighted tiny now
gower_dist <- daisy(my_diamonds, metric = "gower", stand = T, weights = weights, type = list(symm = c('is_premium', 'is_color_def')))
```

But the issue persists, the bools cause perfect separation.