average distance between all combinations of xy coordinates in data set

dependent_square · May 4, 2022, 2:11pm

I have a big data set with over 600 xy coordinates. Now I want to know the mean of all the distances between all combinations of points, so I want to calculate one number, which is the mean. I have a lot of combinations and I can't use dist() and then calculate the average because I reached the maximum number of points in the matrix.

Can somebody help me please?

nirgrahamuk · May 4, 2022, 2:32pm

What are the maximum points of a matrix?

dependent_square · May 5, 2022, 11:57am

I don't know, but it says it exceeded it. Can you help me?

nirgrahamuk · May 5, 2022, 12:14pm

I suppose there's something about how you described your challenge that I'm not picking up on. Perhaps you can say more about it.

set.seed(42)
d1 <- data.frame(x=rnorm(700),
                 y=rnorm(700))

(dist_of_d1 <- dist(d1))

mean(dist_of_d1)

This example involves over 600 points (700 to be precise), which gives 244,650 pairwise distances.
The dist_of_d1 object is about 2 MB.
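
The size estimate can be checked by hand: dist() keeps one distance per unordered pair of points, stored as 8-byte doubles. A rough sketch of that arithmetic (the names n, n_pairs and approx_mb are only illustrative, and attribute overhead is ignored):

```r
# dist() stores only the lower triangle: one value per unordered pair
n <- 700
n_pairs <- n * (n - 1) / 2        # number of pairwise distances
approx_mb <- n_pairs * 8 / 1e6    # doubles take 8 bytes each
approx_mb
#> [1] 1.9572

# Check the count against an actual dist object of the same size
set.seed(42)
d1 <- data.frame(x = rnorm(n), y = rnorm(n))
length(dist(d1)) == n_pairs
#> [1] TRUE
```

By the same arithmetic, 600 points need only 179,700 distances, so memory alone is unlikely to be the limit here.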

dvetsch75 · May 5, 2022, 2:28pm

I am doubtful that the problem is the sheer number of combinations, since 600 points have only 360,000 possible ordered combinations (600 × 600, counting each pair in both directions and each point paired with itself), which would fit in memory on most machines. Rather, I think your issue is the dimensions of the matrix that dist gives, since one row isn't necessarily equal to one column in terms of memory space.
Therefore, I think a better bet might be to generate all of the pairs first; then you are in control of the dimensions of the resulting matrix. Here are a couple of different ideas for how you might approach getting to the mean:

1. Generate all possible combinations of points, calculate the distance between them, and take the average of all of those distances:

library(dplyr)
library(tidyr)

coords <- lapply(
    1:600,
    function(x) {
        data.frame(x = rnorm(1), y = rnorm(1))
    }
)

combos <- expand.grid(
    coords,
    coords
) %>% 
    unnest(everything(), names_repair = 'unique') %>% 
    rename(
        'x1' = 1,
        'y1' = 2,
        'x2' = 3,
        'y2' = 4
    )
#> New names:
#> * x -> x...1
#> * y -> y...2
#> * x -> x...3
#> * y -> y...4

mean_dist <- combos %>% 
    mutate(
        distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)
    ) %>%
    pull(distance) %>% 
    mean

mean_dist
#> [1] 1.833072

Created on 2022-05-05 by the reprex package (v1.0.0)
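
One thing to keep in mind with this approach: expand.grid() pairs every point with every point, so the rows include each pair in both orders plus one zero-length self-pair per point. The resulting mean is therefore slightly smaller than mean(dist(...)), by a factor of (n - 1) / n. A small sketch of that relationship, using a made-up matrix pts rather than the coords list above:

```r
set.seed(1)
n <- 50
pts <- cbind(x = rnorm(n), y = rnorm(n))   # illustrative data only

# Full n x n distance matrix: both orderings plus a zero diagonal
full <- as.matrix(dist(pts))
mean_all_ordered <- mean(full)     # what averaging over all n^2 rows gives

# Unordered pairs of distinct points only
mean_distinct <- mean(dist(pts))

# The two means differ by exactly (n - 1) / n
all.equal(mean_all_ordered, mean_distinct * (n - 1) / n)
#> [1] TRUE
```

With 600 points the factor is 599/600, so the difference is small, but it is worth knowing which of the two means you are reporting.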

2. Create the numerator and denominator elementwise without storing the combinations

One other option would be to break the problem down into two pieces: how to generate the combinations, and how to calculate the distance. Once you solve those two problems, finding the mean is really trivial. Working backwards, we realize that a mean is just $\dfrac{\text{sum}}{n}$, where sum is the sum of the distances calculated and n is the number of distances calculated. So if we are really memory constrained in solving the problem, we just won't store all of the combinations in memory and instead only increment sum and n for each combination. Here is an example of that:

numerator <- 0
denominator <- 0

for(coord1 in coords) {
    for(coord2 in coords) {
        distance <- sqrt((coord2$x - coord1$x)^2 + (coord2$y - coord1$y)^2)
        numerator <- numerator + distance
        denominator <- denominator + 1
    }
}
numerator / denominator
#> [1]  1.833072
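
If the zero self-distances and the duplicated orderings are unwanted, the same running-sum idea can be restricted to pairs with i < j, with the inner loop vectorised. A rough sketch, assuming the coordinates are held in two numeric vectors; the function name mean_pairwise_dist is hypothetical, not part of any package:

```r
# Mean distance over unordered pairs of distinct points (i < j),
# keeping only a running total in memory
mean_pairwise_dist <- function(x, y) {
  n <- length(x)
  total <- 0
  for (i in seq_len(n - 1)) {
    j <- (i + 1):n   # only points after i, so each pair is counted once
    total <- total + sum(sqrt((x[j] - x[i])^2 + (y[j] - y[i])^2))
  }
  total / (n * (n - 1) / 2)   # divide by the number of unordered pairs
}

# Usage on illustrative data, checked against dist()
set.seed(42)
x <- rnorm(600)
y <- rnorm(600)
all.equal(mean_pairwise_dist(x, y), mean(dist(cbind(x, y))))
#> [1] TRUE
```

This never holds more than one row's worth of distances at a time, so it should scale to far more points than a full pair table would.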
