Hi
I've been struggling with constrained k-means clustering for quite some time now, as I am trying to cluster demand data based on coordinates and demand, with a fixed demand constraint of 21,400 in each cluster.
I am new to R and not able to write the code from scratch, so I've been using this code, which according to the text is supposed to achieve exactly what I need: optimization - How do I make clusters based on a fixed capacity of each cluster in R? - Stack Overflow
I am writing the exact same code as in the link, but I get all kinds of error messages. I tried uploading the code below and highlighting the error codes (some of these are translated from my language to english)
df <- read.delim(file.choose())
> set.seed(123)
> id <- seq(1:1198)
> k <- 5
> par(mfrow = c(1, 3))
> km <- kmeans(cbind(df$lat, df$lon), centers = k)
> cols <- hcl.colors(n = k, "Cold")
> plot(df$lon,
+ df$lat,
+ type = "p", pch = 19, cex = 0.5,
+ main = "kmeans")
> for (i in seq_len(k)) {
+ lines(df$lon[km$cluster == i],
+ df$lat[km$cluster == i],
+ type = "p", pch = 19,
+ col = cols[i])
+ }
> library("NMOF")
> x0 <- sample(1:k, length(id), replace = TRUE)
> X <- as.matrix(df[, 2:3])
> sum_diff <- function(x, X, k, ...) {
+ groups <- seq_len(k)
+ d_centre <- numeric(k)
+ for (g in groups) {
+ centre <- colMeans(X[x == g, ], )
+ d <- t(X[x == g, ]) - centre
+ d_centre[g] <- sum(sqrt(colSums(d * d)))
+ }
+ sum(d_centre)
+ }
> sum_diff <- function(x, X, k, ...) {
+ groups <- seq_len(k)
+ d_centre <- numeric(k)
+ for (g in groups) {
+ centre <- colMeans(X[x == g, ], )
+ d <- t(X[x == g, ]) - centre
+ d_centre[g] <- sum(sqrt(colSums(d * d)))
+ }
+ sum(d_centre)
+ }
> sol <- TAopt(sum_diff,
+ list(x0 = x0,
+ nI = 1998,
+ neighbour = nb),
+ X = as.matrix(df[, 2:3]),
+ k = k)
**Error in lapply(X = X, FUN = FUN, ...) : object 'nb' not found**
> plot(df$lon,
+ df$lat,
+ type = "p", pch = 19, cex = 0.5,
+ main = "Local search")
> for (i in seq_len(k)) {
+ lines(df$lon[sol$xbest == i],
+ df$lat[sol$xbest == i],
+ type = "p", pch = 19,
+ col = cols[i])
+ }
Error in lines(df$lon[sol$xbest == i], df$lat[sol$xbest == i], type = "p", :
**object 'sol' was not found**
> max.demand <- 21400
> all(tapply(df$demand, x0, sum) < max.demand)
**Error in tapply(df$demand, x0, sum) : argumenter must have same length**
> nb_constr <- function(x, k, demand, max.demand,...) {
+ groups <- seq_len(k)
+ x_new <- x
+ p <- sample.int(length(x), 1)
+ g_ <- groups[-x_new[p]]
+ x_new[p] <- g_[sample.int(length(g_), 1)]
+ if (sum(demand[x_new == x_new[p]]) > max.demand)
+ x
+ else
+ x_new
+ }
> sol <- TAopt(sum_diff,
+ list(x0 = x0,
+ nI = 20000,
+ neighbour = nb_constr),
+ X = as.matrix(df[, 2:3]),
+ k = k,
+ demand = df$demand,
+ max.demand = max.demand)
Threshold Accepting
Computing thresholds ...
**Error in X[x == g, ] : (subscript) logical subscript too long**
Sample data:
|id|lon|lat|demand|
|184|55.68966|12.51967|260.6036778|
|105|55.22817757|11.76248632|762.4494806|
|10|55.658126|12.597756|100.5990028|
|497|55.714389|12.5775499|356.5022306|
|551|55.32906261|11.96025237|584.3102194|
|589|56.050275|12.124795|2234.967231|
|115|55.454485|12.16898|1026.997086|
|725|55.462895|11.56769|30298.13954|
|868|55.35309093|11.69224481|271.5239806|
|169|55.4160844|12.1439072|777.7671889|
|171|55.75662893|11.70916898|179.5704444|
|872|55.63474|12.053885|857.4982028|
|186|55.23246115|11.7606181|494.3379194|
|197|55.5039675|11.2906625|257.4017889|
|104|54.79746005|11.6377104|389.0988139|
|95|55.567875|12.09508|105.5160833|
|04|55.73692911|12.14944027|220.8881444|
The data set has 1998 rows, including the headers.
Can anyone help clarify what I am doing wrong - or maybe there's an easier way to achieve what I'm trying to achieve? I've tried clustering them manually, with no good results.