When trying to determine the effect of some (independent) variables on the outcome of phenomena (dependent variable), you often use regression to model such an outcome and understand the influence each of the variables has in the model. With spatial regression, it is the same. You just need to use the spatial dimension in a mindful way.

This session provides an introduction to ways of incorporating space into regression models, from spatial variables in standard linear regression to geographically weighted regression.

import esda
import geopandas as gpd
import matplotlib.pyplot as plt
import mgwr
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as sm
from libpysal import graph

You will work with the same data you already used in the session on spatial autocorrelation - the results of the second round of the presidential elections in Czechia in 2023, between Petr Pavel and Andrej Babiš, on a level of municipalities. You can read the election data directly from the original location.

Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:

Download the file by right-clicking on this link and saving the file

Place the file in the same folder as the notebook where you intend to read it

The election results give you the dependent variable - you will look at the percentage of votes Petr Pavel, the winner, received. From the map of the results and the analysis you did when exploring spatial autocorrelation, you already know that there are some significant spatial patterns. Let’s look at whether these patterns correspond to the composition of education levels within each municipality.

You can use the data from the Czech Statistical Office reflecting the situation during the Census 2021. The original table has been preprocessed and is available as a CSV.

Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:

Download the file by right-clicking on this link and saving the file

Place the file in the same folder as the notebook where you intend to read it

Replace the code in the cell above with:

education = pd.read_csv("education.csv")

The first thing you need to do is to merge the two tables, so that you have both dependent and independent variables together. The municipality code in the elections table is in the "nationalCode" column, while in the education table it is in the "uzemi_kod" column.
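The merge keyed on those two differently named columns can be sketched like this (toy data for illustration; the real elections table is a GeoDataFrame with many more columns):

```python
import pandas as pd

# Toy stand-ins for the two tables; values and codes are made up.
elections = pd.DataFrame(
    {"nationalCode": [500011, 500020], "PetrPavel": [61.2, 48.9]}
)
education = pd.DataFrame(
    {"uzemi_kod": [500011, 500020], "masters_degree": [0.12, 0.07]}
)

# Join on the municipality code, which has a different name in each table.
elections_data = elections.merge(
    education, left_on="nationalCode", right_on="uzemi_kod"
)
```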

That is all sorted and ready to be used in a regression.

Non-spatial linear regression

Before jumping into spatial regression, let’s start with the standard linear regression. A useful start is to explore the data using an ordinary least squares (OLS) linear regression model.

OLS model

While this course is not formula-heavy, in this case, it is useful to show the formula to explain the logic of the algorithm. OLS models the dependent variable \(y\) as a linear combination of the independent variables \(x_1, x_2, ..., x_n\):

\[
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_n x_{ni} + \epsilon_i
\]

where \(\epsilon_{i}\) represents unobserved random error and \(\alpha\) represents an intercept - a constant. You know \(y_i\) and all of the \(x_i\) and try to estimate the \(\beta\) coefficients. In Python, you can run linear regression using implementations from more than one package (e.g., statsmodels, scikit-learn, spreg). This course covers the statsmodels approach as it has a nice API to work with.

First, you need a list of names of the independent variables. That is equal to the table's column names, excluding the few columns that represent other data (such as identifiers and the dependent variable).

In the formula, specify the dependent variable ("PetrPavel") as a function of ("~") independent variables ("undetermined + incomplete_primary_education + ...").
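Putting the two steps together, the formula string can be assembled like this (a small sketch; the column names and the set of excluded columns here are illustrative):

```python
import pandas as pd

# Illustrative frame: identifier and outcome columns plus two predictors.
df = pd.DataFrame(
    columns=[
        "nationalCode",
        "PetrPavel",
        "undetermined",
        "incomplete_primary_education",
    ]
)

# Keep only the columns that serve as independent variables.
independent_variables = [
    c for c in df.columns if c not in {"nationalCode", "PetrPavel"}
]

# Dependent variable as a function ("~") of the sum of independent variables.
formula = "PetrPavel ~ " + " + ".join(independent_variables)
```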

With the formula ready, you can fit the model and estimate all the \(\beta\) coefficients and \(\epsilon\).

ols = sm.ols(formula, data=elections_data).fit()
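To see what the fit estimates, here is a self-contained toy example (synthetic data, not the election data) where the true intercept is 1 and the true slope is 2; the fitted parameters should land close to those values:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic data with a known relationship: y = 1 + 2x + small noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=200)})
toy["y"] = 1 + 2 * toy["x"] + rng.normal(scale=0.1, size=200)

# Fit OLS; .params holds the estimated intercept (alpha) and slope (beta).
model = sm.ols("y ~ x", data=toy).fit()
```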

The ols object offers a handy summary() function providing most of the results from the fitting in one place.

ols.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:              PetrPavel   R-squared:                       0.423
Model:                            OLS   Adj. R-squared:                  0.422
Method:                 Least Squares   F-statistic:                     352.6
Date:                Tue, 08 Oct 2024   Prob (F-statistic):               0.00
Time:                        13:24:11   Log-Likelihood:                -22397.
No. Observations:                6254   AIC:                         4.482e+04
Df Residuals:                    6240   BIC:                         4.492e+04
Df Model:                          13
Covariance Type:            nonrobust
=====================================================================================================
                                                 coef   std err        t    P>|t|    [0.025    0.975]
-----------------------------------------------------------------------------------------------------
Intercept                                      0.1283     0.006   19.748    0.000     0.116     0.141
without_education                              0.3621     0.093    3.914    0.000     0.181     0.543
undetermined                                   0.1879     0.041    4.542    0.000     0.107     0.269
incomplete_primary_education                  -0.0881     0.119   -0.737    0.461    -0.322     0.146
lower_secondary_and_secondary_education        0.2890     0.013   21.435    0.000     0.263     0.315
further_education                              0.9665     0.116    8.312    0.000     0.739     1.194
post_maturita_studies                          1.3528     0.204    6.635    0.000     0.953     1.752
bachelors_degree                               1.1634     0.092   12.581    0.000     0.982     1.345
doctoral_degree                                1.2223     0.220    5.550    0.000     0.791     1.654
masters_degree                                 1.1231     0.036   31.201    0.000     1.053     1.194
higher_vocational_education                    1.7312     0.132   13.124    0.000     1.473     1.990
higher_vocational_education_in_a_conservatory  2.7664     0.577    4.796    0.000     1.636     3.897
primary_education                              0.0723     0.033    2.213    0.027     0.008     0.136
complete_secondary_vocational_education        0.8683     0.032   27.316    0.000     0.806     0.931
complete_secondary_general_education           0.8121     0.038   21.247    0.000     0.737     0.887
==============================================================================
Omnibus:                      130.315   Durbin-Watson:                   1.680
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              215.929
Skew:                           0.189   Prob(JB):                     1.29e-47
Kurtosis:                       3.828   Cond. No.                     6.06e+17
==============================================================================

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 3.61e-29. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.

It is clear that education composition has a significant effect on the outcome of the elections but explains only about 42% of its variance (adjusted \(R^2\) is 0.422). A higher share of residents with only primary education tends to lower Pavel’s share of votes, while a higher share of residents with university degrees tends to increase it. That is nothing unexpected. However, let’s make use of geography and unpack these results a bit.

Spatial exploration of the model (hidden structures)

Start with the visualisation of the prediction the OLS model produces using the coefficients shown above.

The plotting code does the following:

Plot the predicted data on the elections_data geometry.

Plot the original results.

Set titles for the axes in the subplot.

Remove the axes borders.

The general patterns are captured, but there are some areas of the country which seem to be quite off. The actual error between the prediction and the dependent variable is captured as residuals, which are directly available in ols as the ols.resid attribute. Let’s plot them to get a better comparison.

Assign residuals as a column. This is not needed for the plot, but it will be useful later.

Identify the maximum residual value based on absolute value to specify the vmin and vmax values of the colormap.

Plot the data using a diverging colormap centred around 0.
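The symmetric colour limits can be computed like this (a minimal sketch with made-up residual values standing in for ols.resid):

```python
import numpy as np

# Made-up residuals standing in for ols.resid.
resid = np.array([-0.30, 0.12, 0.25])

# Centre the diverging colormap on 0 via the largest absolute residual.
max_abs = np.abs(resid).max()
vmin, vmax = -max_abs, max_abs
# Pass vmin and vmax (with a diverging cmap such as "RdBu") to the plot call.
```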

All of the municipalities in blue (residual above 0) have reported higher gains for Petr Pavel than the model assumes based on education structure, while all in red reported lower gains than what is expected. However, as data scientists, we have better tools to analyse the spatial structure of residuals than eyeballing it. Let’s recall the session on spatial autocorrelation again and figure out the spatial clusters of residuals.

First, create a contiguity graph and row-normalise it.