I am working with the R programming language.
Suppose I have the following two data frames:
set.seed(123)
df_1 <- data.frame(
name_1 = c("john", "david", "alex", "kevin", "trevor", "xavier", "tom", "michael", "troy", "kelly", "chris", "henry", "taylor", "ryan", "peter"),
lon = rnorm(15, mean = -74.0060, sd = 0.01),
lat = rnorm(15, mean = 40.7128, sd = 0.01)
)
df_2 <- data.frame(
name_2 = c("matthew", "tyler", "sebastian", "julie", "anna", "tim", "david", "nigel", "sarah", "steph", "sylvia", "boris", "theo", "malcolm"),
lon = rnorm(14, mean = -74.0060, sd = 0.01),
lat = rnorm(14, mean = 40.7128, sd = 0.01)
)
My Problem: For each person in df_1, I am trying to find out the 5 closest people (haversine distance) to this person in df_1 and record various distance statistics (e.g. mean, median, max, min standard deviation).
This my attempt:
First I defined the distance function:
library(geosphere)
haversine_distance <- function(lon1, lat1, lon2, lat2) {
distHaversine(c(lon1, lat1), c(lon2, lat2))
}
Then, I calculated the distance between each person in df_1 and ALL people in df_2:
# Create a matrix to store results
distances <- matrix(nrow = nrow(df_1), ncol = nrow(df_2))
# calculate the distances
for (i in 1:nrow(df_1)) {
for (j in 1:nrow(df_2)) {
distances[i, j] <- haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
}
}
# Create final
final <- data.frame(
name_1 = rep(df_1$name_1, each = nrow(df_2)),
lon_1 = rep(df_1$lon, each = nrow(df_2)),
lat_1 = rep(df_1$lat, each = nrow(df_2)),
name_2 = rep(df_2$name_2, nrow(df_1)),
lon_2 = rep(df_2$lon, nrow(df_1)),
lat_2 = rep(df_2$lat, nrow(df_1)),
distance = c(distances)
)
Finally, for each person in df_1, I kept the 5 minimum distances and recorded the distance statistics:
# Keep only first 5 rows for each unique value of final$name_1
final <- final[order(final$name_1, final$distance), ]
final <- final[ave(final$distance, final$name_1, FUN = seq_along) <= 5, ]
# Calculate summary statistics for each unique person in final$name_1
final_summary <- aggregate(distance ~ name_1,
data = final,
FUN = function(x) c(min = min(x),
max = max(x),
mean = mean(x),
median = median(x),
sd = sd(x)))
final_summary <- do.call(data.frame, final_summary)
names(final_summary)[-(1)] <- c("min_distance", "max_distance", "mean_distance", "median_distance", "sd_distance")
final_summary$closest_people <- tapply(final$name_2,
final$name_1,
FUN = function(x) paste(sort(x), collapse = ", "))
# break closest_people column into multiple columns
n <- 5
closest_people_split <- strsplit(final_summary$closest_people, ", ")
final_summary[paste0("closest_", seq_len(n))] <- do.call(rbind, closest_people_split)
The final results look like this:
name_1 min_distance max_distance mean_distance median_distance sd_distance closest_people closest_1 closest_2 closest_3 closest_4 closest_5
1 alex 342.8375 1158.1408 717.0810 650.9167 358.7439 boris, david, matthew, nigel, sarah boris david matthew nigel sarah
2 chris 195.4891 1504.8199 934.6618 895.8301 489.5175 boris, david, malcolm, nigel, steph boris david malcolm nigel steph
3 david 549.4500 830.2758 716.3839 807.6626 143.9571 matthew, sarah, steph, sylvia, tim matthew sarah steph sylvia tim
4 henry 423.1875 975.1733 639.5657 560.1101 223.2389 anna, boris, matthew, sebastian, tim anna boris matthew sebastian tim
5 john 415.8956 1174.1631 849.4313 965.2928 313.2616 boris, julie, matthew, theo, tyler boris julie matthew theo tyler
6 kelly 489.7949 828.5550 657.5908 658.7015 120.6485 david, julie, matthew, sebastian, steph david julie matthew sebastian steph
My Question: Although this code seems to run without errors, I have a feeling that this code will start to take a long time to run when the sizes of df_1 and df_2 begin to grow. As such, I am looking for ways to improve the efficiency of this code - can someone please suggest how to fix this for larger data frames?
Thanks!