Hi, I have a dataframe of house prices (1,500 entries) with longitude and latitude coordinates. I'm trying to compute Moran's I to measure spatial autocorrelation between houses. So far, I've mainly seen people use polygons with summary statistics, whereas I have individual property coordinates and prices.

I have attempted to use dist() to define an inverse distance weight matrix and ape::Moran.I for the coefficient calculation, but this doesn't work; I think the dataset is too big.

Can you please advise me as to how I can measure the spatial autocorrelation in this point data?

Your question is interesting, because Moran's I in its raw form does not require a distance matrix. What it requires is a contiguity matrix (which observations are neighbors?) and a weights matrix (quantifying said neighborhood).

The inverse distance weights calculated using dist() are a special case of this: legitimate, but by no means the only valid choice of weights, and a full distance matrix of a longish object can get out of hand quickly.
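To make the role of the weights concrete, here is a base-R sketch of Moran's I computed by hand on a made-up toy dataset (the values, the coordinates, and the choice of inverse distance weighting are all invented for illustration):

```r
# Moran's I by hand:
# I = (n / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
x <- c(10, 12, 9, 30, 28, 31)                # two clusters of similar values
coords <- cbind(c(0, 1, 0, 10, 11, 10),
                c(0, 0, 1, 10, 10, 11))      # two spatial clusters to match

w <- 1 / as.matrix(dist(coords))             # inverse distance weights...
diag(w) <- 0                                 # ...with self-weights zeroed out

n  <- length(x)
S0 <- sum(w)                                 # sum of all weights
z  <- x - mean(x)                            # deviations from the mean
I  <- (n / S0) * sum(w * outer(z, z)) / sum(z^2)
I                                            # well above 0: strong positive autocorrelation
```

Swap in any other non-negative, finite weights matrix (KNN, distance bands, shared polygon borders) and the same formula applies; only w changes.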

What I suggest is:

calculating the weights matrices and Moran's I within the framework of {spdep} instead of {ape}; spdep has a special object for weights matrices and stores them in sparse form as a list, which comes in handy when working with largish datasets

calculating the neighbors using either KNN (each point has exactly N neighbors) or distance-based neighbors (all points within X distance units of a point are its neighbors); both can be used with house price data and should handle 1,500 observations with relative ease

I wrote a blog piece about creating the contiguity / weights matrices a while back; you may find it interesting, as it includes code: Spatial Neighborhoods · Jindra Lacko

This will depend on what you are trying to achieve.

But in principle I see no reason why it should not. The technique doesn't care about the nature of the weights: I have seen Euclidean distances, road distances, phylogeny distances (not spatial at all, very weird!) and graph-based networks (i.e. neighboring polygons) used. The weights have to be present and have to be summable (so infinity messes the calculation up), but their exact nature is up to you.

My suggestion would be to consider the distance-based neighbors technique, i.e. spdep::dnearneigh(), with a distance cutoff depending on the nature of your data.
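A minimal sketch of that suggestion; the data frame, the coordinate ranges, and the 2 km cutoff below are all made-up placeholders, so substitute your own data and a cutoff that fits its density:

```r
library(spdep)
set.seed(1)

# made-up stand-in for the real house price data
hd <- data.frame(longitude = runif(100, 14.2, 14.6),
                 latitude  = runif(100, 49.9, 50.2),
                 price     = rnorm(100, 5e6, 1e6))
coords <- cbind(hd$longitude, hd$latitude)

# all points within 2 km of each other become neighbors (longlat = TRUE => km)
nb <- dnearneigh(coords, d1 = 0, d2 = 2, longlat = TRUE)

# row-standardised weights; zero.policy = TRUE tolerates isolated points
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
spdep::moran.test(hd$price, lw, zero.policy = TRUE)
```

With a distance band, points in sparse areas may end up with no neighbors at all, which is why zero.policy is set here; KNN avoids that by construction, which matches the concern about sparse locations raised below.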

Hi @jlacko , I have implemented spdep::moran.test() with spdep::knearneigh() (since my data is sparse in some locations), and received a different observed Moran's I from my initial result using vedoa 's suggested approach.

Here's my code.
The first approach uses the 10 nearest neighbours to calculate the weights, and results in (Moran's I = 0.3631838298, expected = -0.0006680027).

### Using the nearest neighbours approach
library(spdep)
library(magrittr)  # for the %>% pipe
# extract longitude and latitude from the data (hd)
longlats <- cbind(long = hd$longitude, lat = hd$latitude) %>% as.data.frame()
nb_list <- knn2nb(knearneigh(longlats, k = 10, longlat = TRUE, use_kd_tree = FALSE))
nb_weights <- nb2listw(nb_list, style = "W")
spdep::moran.test(hd$price, nb_weights)

The second approach uses the inverse distance matrix, and results in (observed = 0.1266674, expected = -0.0006680027).

## Using distance matrix approach
# creating distance matrix
dist_matrix <- as.matrix(dist(longlats))
inv_dist_matrix <- 1/dist_matrix
diag(inv_dist_matrix) <- 0
inv_dist_matrix[is.infinite(inv_dist_matrix)] <- 0
# Morans I for distance.
ape::Moran.I(hd$price, inv_dist_matrix)

My understanding is that the observed Moran's I values differ because the first approach only considers 10 neighbours, whereas the second approach considers all of the data.
Can you provide any support on which approach is preferred?

For context: I am looking to report Moran's I on residuals for different spatial models. I'm new to this so I really appreciate your support, and any advice on my approach!

It is difficult to give a definite comment on your results without having access to your data, but the figures are broadly what you would expect. Your code is formally correct, and the two results look roughly in line.

You are seeing a mild positive autocorrelation of property prices: expensive properties tend to be surrounded by other expensive ones, cheap ones by cheap ones. This makes intuitive sense.

What you want to do now are two things:

have a look at the Z score; in the realm of {spdep} it is called the Moran I statistic standard deviate. I am uncertain what {ape} calls this figure, but I am certain it reports it (somehow). The Z score has a normal distribution, so it will give you an indication of significance (a p-value)

compare the Moran's I statistic for raw prices with the same statistic for residuals after modelling. A nice outcome would be material autocorrelation of raw prices that goes away for residuals (it does not always work out, but it looks nice when it does).
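The second point can be sketched as follows; everything below is made-up stand-in data (in practice you would reuse your hd data frame and nb_weights object, and `area` stands in for whatever predictors your model uses). spdep::lm.morantest() is the residual-specific test; it uses the correct expectation and variance for regression residuals, which differ from the -1/(n-1) expectation of the plain test:

```r
library(spdep)
set.seed(42)

# made-up stand-in data; `area` is a hypothetical predictor
hd <- data.frame(longitude = runif(200), latitude = runif(200))
hd$area  <- runif(200, 40, 200)
hd$price <- 2000 * hd$area + rnorm(200, sd = 1e4)

nb_weights <- nb2listw(knn2nb(knearneigh(cbind(hd$longitude, hd$latitude), k = 10)))

# Moran's I of raw prices...
moran.test(hd$price, nb_weights)

# ...versus Moran's I of the residuals of the fitted model
fit <- lm(price ~ area, data = hd)
lm.morantest(fit, nb_weights)
```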

A small issue that kind of nags me is the line where you replace infinity values with zeroes. I understand why you need it, but it implies that you have at least one pair of points with zero distance; this may point to data quality issues, so you may want to double-check it.

Should you confirm that this is a valid and material issue (zero distances are genuine, so infinite inverted distances are correct, and it is not a matter of a negligible number of points), you probably don't want to use inverse distance weights: you would be overriding a very high number (infinity) with zero. At the very least I would consider some replacement other than zero.
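A quick base-R way to double-check for coinciding points; the coordinates below are invented for illustration, and in practice you would run the same checks on your longlats object:

```r
# made-up coordinates with one duplicated location
longlats <- data.frame(long = c(14.42, 14.42, 14.50),
                       lat  = c(50.08, 50.08, 50.10))

# count exact repeats of an earlier coordinate pair
sum(duplicated(longlats))                    # 1 here: rows 1 and 2 coincide

# or list every zero-distance pair explicitly
d <- as.matrix(dist(longlats))
which(d == 0 & upper.tri(d), arr.ind = TRUE)
```

If the duplicates turn out to be genuine (e.g. several flats at one address), KNN or distance-band neighbors sidestep the infinity problem entirely, since neither divides by the distance.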