Constrained clustering

Hi

I've been struggling with constrained k-means clustering for quite some time now, as I am trying to cluster demand data based on coordinates and demand, with a fixed demand constraint of 21,400 in each cluster.

I am new to R and not able to write the code from scratch, so I've been using this code, which according to the text is supposed to achieve exactly what I need: optimization - How do I make clusters based on a fixed capacity of each cluster in R? - Stack Overflow

I am writing the exact same code as in the link, but I get all kinds of error messages. I tried uploading the code below and highlighting the error codes (some of these are translated from my language to english)

df <- read.delim(file.choose())
> set.seed(123)
> id <- seq(1:1198)
> k <- 5
> par(mfrow = c(1, 3))
> km <- kmeans(cbind(df$lat, df$lon), centers = k)
> cols <- hcl.colors(n = k, "Cold")
> plot(df$lon,
+      df$lat,
+      type = "p", pch = 19, cex = 0.5,
+      main = "kmeans")
> for (i in seq_len(k)) {
+     lines(df$lon[km$cluster == i],
+           df$lat[km$cluster == i],
+           type = "p", pch = 19,
+           col = cols[i])
+ }
> library("NMOF")
> x0 <- sample(1:k, length(id), replace = TRUE)
> X <- as.matrix(df[, 2:3])

> sum_diff <- function(x, X, k, ...) {
+     groups <- seq_len(k)
+     d_centre <- numeric(k)
+     for (g in groups) {
+         centre <- colMeans(X[x == g, ], )
+         d <- t(X[x == g, ]) - centre
+         d_centre[g] <- sum(sqrt(colSums(d * d)))
+     }
+     sum(d_centre)
+ }
> sum_diff <- function(x, X, k, ...) {
+     groups <- seq_len(k)
+     d_centre <- numeric(k)
+     for (g in groups) {
+         centre <- colMeans(X[x == g, ], )
+         d <- t(X[x == g, ]) - centre
+         d_centre[g] <- sum(sqrt(colSums(d * d)))
+     }
+     sum(d_centre)
+ }
> sol <- TAopt(sum_diff,
+              list(x0 = x0,
+                   nI = 1998,
+                   neighbour = nb),
+              X = as.matrix(df[, 2:3]),
+              k = k)
**Error in lapply(X = X, FUN = FUN, ...) : object 'nb' not found**
> plot(df$lon,
+      df$lat,
+      type = "p", pch = 19, cex = 0.5,
+      main = "Local search")
> for (i in seq_len(k)) {
+     lines(df$lon[sol$xbest == i],
+           df$lat[sol$xbest == i],
+           type = "p", pch = 19, 
+           col = cols[i])
+ }
Error in lines(df$lon[sol$xbest == i], df$lat[sol$xbest == i], type = "p",  : 
  **object 'sol' was not found**
> max.demand <- 21400
> all(tapply(df$demand, x0, sum) < max.demand)
**Error in tapply(df$demand, x0, sum) : argumenter must have same length**
> nb_constr <- function(x, k, demand, max.demand,...) {
+     groups <- seq_len(k)
+     x_new <- x
+     p <- sample.int(length(x), 1)
+     g_ <- groups[-x_new[p]]    
+     x_new[p] <- g_[sample.int(length(g_), 1)]
+ if (sum(demand[x_new == x_new[p]]) > max.demand)
+     x
+ else
+     x_new
+ }
> sol <- TAopt(sum_diff,
+              list(x0 = x0,
+                   nI = 20000,
+                   neighbour = nb_constr),
+              X = as.matrix(df[, 2:3]),
+              k = k,
+              demand = df$demand,
+              max.demand = max.demand)

Threshold Accepting

  Computing thresholds ... 
 **Error in X[x == g, ] : (subscript) logical subscript too long** 

Sample data:
|id|lon|lat|demand|
|184|55.68966|12.51967|260.6036778|
|105|55.22817757|11.76248632|762.4494806|
|10|55.658126|12.597756|100.5990028|
|497|55.714389|12.5775499|356.5022306|
|551|55.32906261|11.96025237|584.3102194|
|589|56.050275|12.124795|2234.967231|
|115|55.454485|12.16898|1026.997086|
|725|55.462895|11.56769|30298.13954|
|868|55.35309093|11.69224481|271.5239806|
|169|55.4160844|12.1439072|777.7671889|
|171|55.75662893|11.70916898|179.5704444|
|872|55.63474|12.053885|857.4982028|
|186|55.23246115|11.7606181|494.3379194|
|197|55.5039675|11.2906625|257.4017889|
|104|54.79746005|11.6377104|389.0988139|
|95|55.567875|12.09508|105.5160833|
|04|55.73692911|12.14944027|220.8881444|

The data set has 1998 rows, including the headers.

Can anyone help clarify what I am doing wrong - or maybe there's an easier way to achieve what I'm trying to achieve? I've tried clustering them manually, with no good results.

You have to build your skills. I hope you have experience from other language(s) and then start simple and build up. Given a car analogy, would it make sense to show a picture of a car and then state "How do I build this car?".

There are several very nice online and free ressources for getting you started on your R-journey - You won't regret it :+1:

Its not great to share code from your console as it gets polluted with + symbols.
better to share code from your script pane...

the error about nb being absent is because you didnt include the definition of the nb function, whereas that is available in the stackoverflow page.

For learning base R skills, the R package swirl is a great option, its interactive and comprehensive.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.