Spatial autocorrelation

In the last section, you learned how to encode spatial relationships between geometries into weights matrices represented by Graph objects, and you started touching on spatial autocorrelation through the spatial lag and the Moran plot. This section explores spatial autocorrelation further, in its global and local variants.

Spatial autocorrelation concerns the degree to which the similarity in values between observations in a dataset is related to the similarity in their locations. It is not completely unlike the traditional correlation between two variables (which informs us about how the values in one variable change as a function of those in the other) and is analogous to its time-series counterpart (which relates the value of a variable at a given point in time to its values in previous periods): spatial autocorrelation relates the value of the variable of interest in a given location to the values of the same variable in surrounding locations.

A key idea in this context is that of spatial randomness: a situation in which the location of an observation gives no information whatsoever about its value. In other words, a variable is spatially random if it is distributed over space following no discernible pattern. Spatial autocorrelation can thus be formally defined as the “absence of spatial randomness”, which leaves room for two main classes of autocorrelation, similar to the traditional case: positive spatial autocorrelation, when similar values tend to group together in similar locations; and negative spatial autocorrelation, when similar values tend to be dispersed and further apart from each other.
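To make the contrast concrete, here is a hypothetical one-dimensional "map" of ten locations (a toy example, not part of the election data used below): under positive autocorrelation, neighbouring values are similar, so the average difference between neighbours is small; under negative autocorrelation, similar values are pushed apart, so the difference is large. Spatial randomness would fall in between.

```python
import numpy as np

# Toy 1-D "maps": ten locations in a row, neighbours are adjacent cells.
clustered = np.array([1, 1, 1, 1, 1, 9, 9, 9, 9, 9], dtype=float)  # positive
dispersed = np.array([1, 9, 1, 9, 1, 9, 1, 9, 1, 9], dtype=float)  # negative

def mean_neighbour_diff(values):
    """Average absolute difference between adjacent locations."""
    return np.abs(np.diff(values)).mean()

print(mean_neighbour_diff(clustered))  # ~0.89: similar values sit together
print(mean_neighbour_diff(dispersed))  # 8.0: similar values pushed apart
```

Both arrays contain exactly the same values; only their arrangement over "space" differs, which is precisely what spatial autocorrelation measures.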

In this session, you will learn how to explore spatial autocorrelation in a given dataset, interrogating the data about its presence, nature, and strength. To do this, you will use a set of tools collectively known as Exploratory Spatial Data Analysis (ESDA), specifically designed for this purpose. The range of ESDA methods is very wide and spans from less sophisticated approaches like choropleths and general table querying to more advanced and robust methodologies that include statistical inference and explicit recognition of the geographical dimension of the data. The purpose of this session is to dip your toes into the latter group.

ESDA techniques are usually divided into two main groups: tools for analysing global spatial autocorrelation and tools for analysing local spatial autocorrelation. The former consider the overall trend that the location of values follows and make possible statements about the degree of clustering in the dataset. Do values generally follow a particular pattern in their geographical distribution? Are similar values closer to other similar values than you would expect from pure chance? These are some of the questions that tools for global spatial autocorrelation allow you to answer. You will practise global spatial autocorrelation with the Join Counts statistic and Moran’s \(I\) statistic.
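As a preview of the mechanics behind Moran’s \(I\), the sketch below computes the statistic by hand on hypothetical toy data (four locations on a line with row-standardised adjacency weights; these numbers are illustrative, not from the election dataset). In practice, esda computes the statistic, together with its inference, for you; the point here is only that \(I\) compares each mean-centred value with the weighted average of its neighbours’ values.

```python
import numpy as np

# Hypothetical toy data: 4 locations on a line.
y = np.array([2.0, 4.0, 6.0, 8.0])

# Binary adjacency of the line, then row-standardised.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W = W / W.sum(axis=1, keepdims=True)

n = len(y)
z = y - y.mean()  # mean-centred values
# Moran's I: spatial cross-product over variance, scaled by the total weight.
I = (n / W.sum()) * (z @ W @ z) / (z @ z)
print(I)  # 0.4: positive, as values increase smoothly along the line
```

A positive \(I\) reflects that each location’s value resembles its neighbours’ values; spatially random data would yield a value close to zero (slightly negative in expectation).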

Tools for local spatial autocorrelation instead focus on spatial instability: the departure of parts of a map from the general trend. The idea here is that, even though there is a given trend for the data in terms of the nature and strength of spatial association, some particular areas can diverge quite substantially from the general pattern. Regardless of the overall degree of concentration in the values, you can observe pockets of unusually high (low) values close to other high (low) values in what you will call hot (cold) spots. Additionally, it is also possible to observe some high (low) values surrounded by low (high) values, and you will name these “spatial outliers”. The main technique you will review in this session to explore local spatial autocorrelation is the Local Indicators of Spatial Association (LISA).
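To see how a local statistic decomposes a global one, here is a minimal numpy sketch of the local Moran statistic on the same kind of hypothetical toy data as before (illustrative numbers only; in practice you would use esda’s local Moran implementation, which also provides permutation-based inference). Each location gets its own value: positive where a value resembles its neighbours (a potential hot or cold spot), negative where it differs from them (a potential spatial outlier).

```python
import numpy as np

# Hypothetical toy data: 4 locations on a line, row-standardised adjacency.
y = np.array([2.0, 4.0, 6.0, 8.0])
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W = W / W.sum(axis=1, keepdims=True)

z = y - y.mean()
m2 = (z @ z) / len(y)   # variance-like scaling term
lag = W @ z             # spatial lag of the mean-centred values
I_local = z * lag / m2  # one statistic per location
print(I_local)          # [0.6 0.2 0.2 0.6]: all positive, no outliers
print(I_local.mean())   # 0.4: with row-standardised weights, the local
                        # values average to the global Moran's I
```

The sign of `z` and `lag` at each location is what separates the four LISA categories: high value among high neighbours, low among low, high among low, and low among high.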

import geopandas as gpd
import esda
import matplotlib.pyplot as plt
import seaborn as sns

from libpysal import graph

Data

For this session, you will look at election data: in particular, the results of the second round of the 2023 presidential election in Czechia, between Petr Pavel and Andrej Babiš, at the level of municipalities. Each polygon carries the percentage of votes for each of the candidates, as well as some additional information. The election data are provided by the Czech Statistical Office, and the geometries are retrieved from ČÚZK. The dataset has been preprocessed for the purpose of this course. If you want to see how the table was created, a notebook is available here.

To make things easier, you will read data from a file posted online, so you do not need to download any dataset:

elections = gpd.read_file(
    "https://martinfleischmann.net/sds/autocorrelation/data/cz_elections_2023.gpkg"
)
elections = elections.set_index("name")
elections.explore(
    "PetrPavel",
    cmap="coolwarm",
    vmin=0,
    vmax=100,
    prefer_canvas=True,
    tiles="CartoDB Positron",
)
1. Use the name of each municipality as an index. It will help you link them to the weights matrix.
2. Create a plot to explore the data.
3. "PetrPavel" is the name of the column with the proportion of votes for Petr Pavel.
4. Use the "coolwarm" diverging colormap to distinguish between municipalities where Petr Pavel won and those where Andrej Babiš did.
5. Normalise the colormap between 0 and 100 to ensure that the split between blue and red is at 50%.
6. With larger tables, using Canvas rendering instead of the default one is helpful.
7. Use less intense background tiles than the default OpenStreetMap.