When trying to determine the effect of some (independent) variables on the outcome of a phenomenon (the dependent variable), you often use regression to model that outcome and to understand the influence each variable has in the model. With spatial regression, it is the same. You just need to use the spatial dimension in a mindful way.
This session provides an introduction to ways of incorporating space into regression models, from spatial variables in standard linear regression to geographically weighted regression.
import esda
import geopandas as gpd
import matplotlib.pyplot as plt
import mgwr
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as sm

from libpysal import graph
You will work with the same data you already used in the session on spatial autocorrelation - the results of the second round of the presidential elections in Czechia in 2023, between Petr Pavel and Andrej Babiš, on a level of municipalities. You can read the election data directly from the original location.
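The code cell loading the table is not reproduced here; a minimal sketch, assuming the data is distributed as a GeoPackage readable by geopandas (the path below is a placeholder, not the actual location):

elections = gpd.read_file(
    "https://example.com/cz_elections_2023.gpkg"  # placeholder path - substitute the original location
)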
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:
Download the file by right-clicking on this link and saving the file
Place the file in the same folder as the notebook where you intend to read it
The election results give you the dependent variable - you will look at the percentage of votes Petr Pavel, the winner, received. From the map of the results and the analysis you did when exploring spatial autocorrelation, you already know that there are some significant spatial patterns. Let’s look at whether these patterns correspond to the composition of education levels within each municipality.
You can use the data from the Czech Statistical Office reflecting the situation during the Census 2021. The original table has been preprocessed and is available as a CSV.
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:
Download the file by right-clicking on this link and saving the file
Place the file in the same folder as the notebook where you intend to read it
Replace the code in the cell above with:
education = pd.read_csv("education.csv")
The first thing you need to do is merge the two tables, so that both the dependent and independent variables are together. The municipality code is stored in the "nationalCode" column of the elections table and in the "uzemi_kod" column of the education table.
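A minimal sketch of the merge, assuming the two tables are loaded as elections (a GeoDataFrame) and education (a DataFrame); the merged result is stored as elections_data, the name used in the rest of the session:

# join the education composition onto the election results by municipality code
elections_data = elections.merge(
    education, left_on="nationalCode", right_on="uzemi_kod"
)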
That is all sorted and ready to be used in a regression.
Non-spatial linear regression
Before jumping into spatial regression, start with standard linear regression and explore the data using an ordinary least squares (OLS) model.
OLS model
While this course is not formula-heavy, in this case it is useful to use the formula to explain the logic of the algorithm. OLS tries to model the dependent variable \(y\) as a linear combination of the independent variables \(x_1, x_2, \dots, x_n\):

\[
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_n x_{ni} + \epsilon_i,
\]

where \(\epsilon_{i}\) represents unobserved random variation and \(\alpha\) represents an intercept - a constant. You know \(y_i\) and all of the \(x_i\), and you try to estimate the intercept and the \(\beta\) coefficients. In Python, you can run linear regression using implementations from more than one package (e.g., statsmodels, scikit-learn, spreg). This course covers the statsmodels approach as it has a nice API to work with.
First, you need a list of names of independent variables. That is equal to column names without a few of the columns that represent other data.
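One way to build that list, assuming the education table holds the municipality code plus one column per education category (drop any additional non-education columns your table may contain):

# keep only the education-share columns by dropping the merge key
independent_variables = education.columns.drop("uzemi_kod")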
In the formula, specify the dependent variable ("PetrPavel") as a function of ("~") independent variables ("undetermined + incomplete_primary_education + ...").
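Assembling the formula string from that list could look like this:

formula = f"PetrPavel ~ {' + '.join(independent_variables)}"

This produces a string of the form "PetrPavel ~ <variable> + <variable> + ...", which statsmodels can parse.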
With the formula ready, you can fit the model and estimate all the \(\beta\) coefficients and \(\epsilon\).
ols = sm.ols(formula, data=elections_data).fit()
The ols object offers a handy summary() function providing most of the results from the fitting in one place.
ols.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:              PetrPavel   R-squared:                       0.423
Model:                            OLS   Adj. R-squared:                  0.422
Method:                 Least Squares   F-statistic:                     352.6
Date:                Tue, 08 Oct 2024   Prob (F-statistic):               0.00
Time:                        13:24:11   Log-Likelihood:                -22397.
No. Observations:                6254   AIC:                         4.482e+04
Df Residuals:                    6240   BIC:                         4.492e+04
Df Model:                          13
Covariance Type:            nonrobust
=================================================================================================================
                                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------
Intercept                                       0.1283      0.006     19.748      0.000       0.116       0.141
without_education                               0.3621      0.093      3.914      0.000       0.181       0.543
undetermined                                    0.1879      0.041      4.542      0.000       0.107       0.269
incomplete_primary_education                   -0.0881      0.119     -0.737      0.461      -0.322       0.146
lower_secondary_and_secondary_education         0.2890      0.013     21.435      0.000       0.263       0.315
further_education                               0.9665      0.116      8.312      0.000       0.739       1.194
post_maturita_studies                           1.3528      0.204      6.635      0.000       0.953       1.752
bachelors_degree                                1.1634      0.092     12.581      0.000       0.982       1.345
doctoral_degree                                 1.2223      0.220      5.550      0.000       0.791       1.654
masters_degree                                  1.1231      0.036     31.201      0.000       1.053       1.194
higher_vocational_education                     1.7312      0.132     13.124      0.000       1.473       1.990
higher_vocational_education_in_a_conservatory   2.7664      0.577      4.796      0.000       1.636       3.897
primary_education                               0.0723      0.033      2.213      0.027       0.008       0.136
complete_secondary_vocational_education         0.8683      0.032     27.316      0.000       0.806       0.931
complete_secondary_general_education            0.8121      0.038     21.247      0.000       0.737       0.887
==============================================================================
Omnibus:                      130.315   Durbin-Watson:                   1.680
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              215.929
Skew:                           0.189   Prob(JB):                     1.29e-47
Kurtosis:                       3.828   Cond. No.                     6.06e+17
==============================================================================
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 3.61e-29. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
It is clear that education composition has a significant effect on the outcome of the elections but can explain only about 42% of its variance (adjusted \(R^2\) is 0.422). A higher share of residents with only primary education tends to lower Pavel’s gain, while a higher share of university degrees tends to increase the number of votes he received. That is nothing unexpected. However, let’s make use of geography and unpack these results a bit.
Spatial exploration of the model (hidden structures)
Start with a visualisation of the prediction the OLS model produces using the coefficients shown above. The plot involves a few steps, outlined below and sketched in code after the list.
Plot the predicted data on the elections_data geometry.
Plot the original results.
Set titles for axes in the subplot.
Remove axes borders.
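A minimal sketch of these steps, assuming elections_data and ols from above (the figure layout and the "predicted" column name are illustrative):

# store the predicted values from the OLS model
elections_data["predicted"] = ols.fittedvalues

f, axs = plt.subplots(2, 1, figsize=(7, 8))
elections_data.plot("predicted", ax=axs[0], legend=True)  # plot the predicted data
elections_data.plot("PetrPavel", ax=axs[1], legend=True)  # plot the original results
axs[0].set_title("OLS prediction")  # set titles for axes in the subplot
axs[1].set_title("Election results")
for ax in axs:
    ax.set_axis_off()  # remove axes borders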
The general patterns are captured, but some areas of the country seem to be quite off. The actual error between the prediction and the dependent variable is captured as residuals, which are directly available in ols as the ols.resid attribute. Let’s plot them to get a better comparison, following the steps below (sketched in code after the list).
Assign residuals as a column. This is not needed for the plot but it will be useful later.
Identify the maximum residual value based on absolute value to specify the vmin and vmax values of the colormap.
Plot the data using a diverging colormap centred around 0.
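A sketch of the residual map under the same assumptions (the "residual" column name and the RdBu colormap are illustrative choices):

elections_data["residual"] = ols.resid  # assign residuals as a column
max_residual = elections_data["residual"].abs().max()  # largest absolute residual for a symmetric scale

ax = elections_data.plot(
    "residual",
    cmap="RdBu",  # diverging colormap, centred around 0 thanks to symmetric vmin and vmax
    vmin=-max_residual,
    vmax=max_residual,
    legend=True,
)
ax.set_axis_off()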
All of the municipalities in blue (residual above 0) have reported higher gains for Petr Pavel than the model assumes based on education structure, while all in red reported lower gains than what is expected. However, as data scientists, we have better tools to analyse the spatial structure of residuals than eyeballing it. Let’s recall the session on spatial autocorrelation again and figure out the spatial clusters of residuals.
First, create a contiguity graph and row-normalise it.
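A sketch using the graph module from libpysal imported above:

# build a (queen) contiguity graph between municipalities and row-normalise the weights
contiguity = graph.Graph.build_contiguity(elections_data)
contiguity_r = contiguity.transform("r")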