Moran's I for point data

Hi, I have a dataframe of house prices (1,500 entries) with longitude and latitude coordinates. I'm trying to find the Moran's I to measure spatial autocorrelation between houses. So far, I've mainly seen people use polygons with summary statistics, whereas I have individual property coordinates and prices.

  • I have attempted to use dist() to define an inverse distance weight matrix and ape::Moran.I for the coefficient calculation, but this doesn't work; I think the dataset is too big.

Can you please advise me as to how I can measure the spatial autocorrelation in this point data?

Hi @alstudento ,

since you didn't post any code or a link to the dataset, I can offer you the following code snippet:

# some data to work with
# data from
library(sf)
house <- sf::read_sf("path \\clev_sls_154_core.shp",
                     options = "ENCODING=UTF-8")
# get coordinates in nice format
houseLongLat <- house |>
  st_transform(4326) |>
  st_coordinates()
# rename columns
colnames(houseLongLat) <- c("long", "lat")
# get final dataframe
houseDF <- cbind(house, houseLongLat)

### Here the moran stuff starts
# following
houseDF.dists <- as.matrix(dist(cbind(houseDF$long, houseDF$lat)))
houseDF.dists.inv <- 1/houseDF.dists
diag(houseDF.dists.inv) <- 0

ape::Moran.I(houseDF$sale_price, houseDF.dists.inv)
# $observed
# [1] 0.1100791
# $expected
# [1] -0.004901961
# $sd
# [1] 0.01453089
# $p.value
# [1] 2.442491e-15

It basically follows the instructions from How can I calculate Moran’s I in R? | R FAQ, using data that can be downloaded here: Cleveland Home Sales (2015).

I tested it with the 2014-15 Home Sales in King County, WA, which has ~21,000 entries. There the distance matrix grows to ~4 GB, so I get your pain. Some kind of clustering would be one approach to get your dataset down to a manageable number. Alternatively, Calculating Moran's I on large point dataframe in R - Geographic Information Systems Stack Exchange looks promising, since the example below

# moranfast() comes from the {moranfast} package (GitHub, not on CRAN)
houseDF <- read.csv("path//kc_house_data.csv")
moranfast(houseDF$price, houseDF$long, houseDF$lat)

took 2 seconds on my machine.
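For scale, a dense double-precision distance matrix for n points needs roughly 8·n² bytes; a quick back-of-the-envelope check (pure arithmetic, nothing assumed beyond the ~21,000 rows mentioned above):

```r
# memory for a dense n x n matrix of 8-byte doubles, in GiB
n <- 21000
n^2 * 8 / 1024^3  # roughly 3.3 GiB, before dist()'s intermediates
```

The moranfast approach avoids materialising that matrix at all, which is why it stays fast.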

Hope it can get you at least started.

Thank you very much for your reply.

I ran the code on my data (long, lat, price) and received the following error:

Error in if (obs <= ei) 2 * pv else 2 * (1 - pv) :
missing value where TRUE/FALSE needed

I see there is no clear solution on Stack Exchange. Do you know how to resolve this?

I think I fixed it!!

Seems there were Inf values in the matrix, so I added this code:

df.dists.inv[is.infinite(df.dists.inv)] <- 0

Your question is interesting, because Moran's I in its raw form does not require a distance matrix. What it requires is a contiguity matrix (which observations are neighbors?) and a weights matrix (giving a quantification to said neighborhood).

The inverse distance weights calculated using dist() are a special case of this: legitimate, but by no means the only valid choice of weights, and a distance matrix of a longish object can get out of hand quickly.

What I suggest is:

  • calculating the weights matrices and Moran's I within the framework of {spdep} instead of {ape}; spdep has a special object for weight matrices and stores them in a sparse form as a list. This comes in handy when working with largish datasets
  • calculating the neighbors using either KNN (each point has exactly N neighbors) or distance-based neighbors (all points within X distance units from a point are its neighbors); both of these can be used with house price data and should handle 1,500 observations with relative ease

I have written a blog piece about creating the contiguity / weights matrices a while back; you may find it interesting as it includes code: Spatial Neighborhoods · Jindra Lacko


Hi @jlacko,

I've mainly seen this approach applied to polygon neighbourhoods,

My data is coordinate points representing individual houses. Will this still work for me?

This will depend on what you are trying to achieve 🙂

But in principle I see no reason why it should not. The technique doesn't care about the nature of the weights: I have seen Euclidean distances, road distances, phylogeny distances (not spatial at all; very weird!) and graph-based networks (i.e. neighboring polygons) used. The weights have to be present and have to be summable (so infinity messes the calculation up). But the exact nature of the weights is up to you.

My suggestion would be to consider distance based neighbors technique, i.e. spdep::dnearneigh(), with distance cutoff depending on the nature of your data.
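A minimal sketch of what that could look like, using simulated points; the coordinates, sample size, and the 2 km cutoff are all made up for illustration:

```r
library(spdep)

# simulated stand-in for real house data: 1,500 points with a price
set.seed(42)
pts   <- cbind(long = runif(1500, -81.8, -81.5),
               lat  = runif(1500, 41.4, 41.6))
price <- rnorm(1500, mean = 2e5, sd = 5e4)

# neighbours = all points within 2 km (longlat = TRUE -> distances in km)
nb <- dnearneigh(pts, d1 = 0, d2 = 2, longlat = TRUE)
# row-standardised weights; zero.policy in case a point has no neighbours
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
moran.test(price, lw, zero.policy = TRUE)
```

With random prices the test should come back insignificant; on real data the cutoff is the main knob to tune.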

Hi @jlacko , I have implemented moran.test with spdep::knearneigh() (since my data is sparse in some locations), and received a different observed Moran's I from my initial result using @vedoa's suggested approach.

Here's my code.
The first approach uses the 10 nearest neighbours to calculate the weights, and results in (Moran's I = 0.3631838298, expected = -0.0006680027).

### Using the nearest neighbours approach 
# extract longitude and latitude from the data (hd)
longlats <- cbind(long = hd$longitude, lat = hd$latitude)

nb_list <- knn2nb(knearneigh(longlats, k=10, longlat = TRUE, use_kd_tree=FALSE))
nb_weights <- nb2listw(nb_list, style="W")
spdep::moran.test(hd$price, nb_weights)

The second approach uses the inverse distance matrix; results = (observed = 0.1266674, expected = -0.0006680027).

## Using distance matrix approach 
# creating distance matrix 
dist_matrix <- as.matrix(dist(longlats))
inv_dist_matrix <- 1/dist_matrix
diag(inv_dist_matrix) <- 0
inv_dist_matrix[is.infinite(inv_dist_matrix)] <- 0

# Morans I for distance. 
ape::Moran.I(hd$price, inv_dist_matrix)

My understanding is that the observed Moran's I values differ because the first approach only considers 10 neighbours, whereas the second approach considers all of the data.
Can you provide any support on which approach is preferred?

For context: I am looking to report Moran's I on residuals for different spatial models. I'm new to this so I really appreciate your support, and any advice on my approach!

It is difficult to give a definite comment on your results without having access to your data, but the figures are broadly what you would expect. Your code is formally correct, and the two results look roughly in line.

You are seeing a mild positive autocorrelation of property prices: expensive properties tend to be surrounded by other expensive ones, cheap by cheap ones. This makes intuitive sense.

What you want to do now are two things:

  • have a look at the Z score; in the realm of {spdep} it is called the Moran I statistic standard deviate. I am uncertain what {ape} calls this figure, but I am certain it supports it (somehow). The Z score has a normal distribution, so it will give you an indication of significance (p-value)
  • compare the Moran's I statistic for raw prices with the same statistic for residuals after modelling. A nice outcome would be material autocorrelation of raw prices that goes away for residuals (it does not always work out, but it looks nice when it does).
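The second point could be sketched as below; the data frame `hd`, the model formula, and the `lw` weights object are placeholders, and `spdep::lm.morantest()` is a ready-made helper for residuals of a linear model:

```r
library(spdep)

# sketch: Moran's I of raw prices vs. model residuals
# assumes hd (data frame) and lw (spdep listw object) already exist;
# the covariates here are hypothetical
fit <- lm(price ~ sqft_living + bedrooms, data = hd)

moran.test(hd$price, lw)        # raw prices
moran.test(residuals(fit), lw)  # residuals of the fitted model
lm.morantest(fit, lw)           # same idea, accounts for the fitted model
```

If the residual statistic drops toward the expected value, the model has absorbed most of the spatial structure.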

A small issue that kind of nags me is the line where you replace infinity values with zeroes. I understand why you need it, but it implies that you have at least one pair of points with zero distance; this may mean data quality issues, so you may want to double check it.

Should you confirm that this is a valid and material issue (zero distances are valid => infinite inverted distances are correct, and it is not a matter of a negligible number of points), you probably don't want to use inverse distance weights: you are overriding a very high number (infinity) with zero. At the very least I would consider some replacement other than zero.
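One possible alternative, sketched assuming the `dist_matrix` from the earlier snippet: replace the infinite weights with the largest finite inverse distance rather than with zero, so coincident points stay heavily weighted instead of being dropped.

```r
inv_dist_matrix <- 1 / dist_matrix
diag(inv_dist_matrix) <- 0
# largest finite weight = inverse of the smallest non-zero distance
max_finite <- max(inv_dist_matrix[is.finite(inv_dist_matrix)])
# coincident points get the maximum weight instead of zero
inv_dist_matrix[is.infinite(inv_dist_matrix)] <- max_finite
```

Whether this capping is sensible depends on how many coincident pairs there are and why they exist in the data.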
