import esda
import geopandas as gpd
import matplotlib.pyplot as plt
import mgwr
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as sm
from libpysal import graph
Regression and geography
When trying to determine the effect of some (independent) variables on the outcome of a phenomenon (the dependent variable), you often use regression to model that outcome and to understand the influence each variable has in the model. Spatial regression is no different - you just need to incorporate the spatial dimension in a mindful way.
This session provides an introduction to ways of incorporating space into regression models, from spatial variables in standard linear regression to geographically weighted regression.
Data
You will work with the same data you already used in the session on spatial autocorrelation - the results of the second round of the 2023 presidential elections in Czechia, contested by Petr Pavel and Andrej Babiš, at the level of municipalities. You can read the election data directly from its original location.
elections = gpd.read_file(
    "https://martinfleischmann.net/sds/autocorrelation/data/cz_elections_2023.gpkg"
)
elections = elections.set_index("name")
elections.head()
nationalCode | PetrPavel | AndrejBabis | sourceOfName | geometry | |
---|---|---|---|---|---|
name | |||||
Želechovice nad Dřevnicí | 500011 | 61.73 | 38.26 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-518835.63 -1170505.52, -51790... |
Petrov nad Desnou | 500020 | 49.07 | 50.92 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-559207.18 -1075123.25, -55624... |
Libhošť | 500046 | 47.78 | 52.21 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-486502.29 -1123694.06, -48717... |
Krhová | 500062 | 58.79 | 41.20 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-494514.83 -1136457.23, -49539... |
Poličná | 500071 | 58.20 | 41.79 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-499678.51 -1143457.52, -49799... |
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:
- Download the file by right-clicking on this link and saving the file
- Place the file in the same folder as the notebook where you intend to read it
- Replace the code in the cell above with:
elections = gpd.read_file(
    "cz_elections_2023.gpkg",
)
The election results give you the dependent variable - you will look at the percentage of votes received by Petr Pavel, the winner. From the map of the results and the analysis you did when exploring spatial autocorrelation, you already know that there are significant spatial patterns. Let’s look at whether these patterns correspond to the composition of education levels within each municipality.
You can use the data from the Czech Statistical Office reflecting the situation during the Census 2021. The original table has been preprocessed and is available as a CSV.
education = pd.read_csv(
    "https://martinfleischmann.net/sds/regression/data/education.csv"
)
education.head()
uzemi_kod | without_education | undetermined | incomplete_primary_education | lower_secondary_and_secondary_education | further_education | post_maturita_studies | bachelors_degree | doctoral_degree | masters_degree | higher_vocational_education | higher_vocational_education_in_a_conservatory | primary_education | complete_secondary_vocational_education | complete_secondary_general_education | okres | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 500011 | 0.570704 | 3.741281 | 1.141408 | 34.242232 | 1.775523 | 0.507292 | 2.853519 | 0.634115 | 12.935954 | 1.395054 | 0.126823 | 11.350666 | 17.945466 | 10.779962 | Zlín |
1 | 500020 | 0.885827 | 3.346457 | 1.968504 | 40.157480 | 2.066929 | 0.885827 | 1.771654 | 0.492126 | 6.299213 | 1.574803 | 0.000000 | 15.059055 | 16.338583 | 9.153543 | Šumperk |
2 | 500046 | 0.359195 | 3.232759 | 0.790230 | 39.152299 | 2.514368 | 0.790230 | 3.520115 | 0.215517 | 10.632184 | 1.364943 | 0.143678 | 9.770115 | 15.301724 | 12.212644 | Nový Jičín |
3 | 500062 | 0.238237 | 3.573556 | 1.072067 | 32.757594 | 2.084574 | 1.131626 | 3.037522 | 0.178678 | 13.281715 | 0.714711 | 0.119119 | 11.316260 | 18.701608 | 11.792734 | Vsetín |
4 | 500071 | 0.412939 | 2.890571 | 1.238816 | 34.067447 | 1.720578 | 0.757054 | 3.028217 | 0.137646 | 11.080523 | 0.894701 | 0.000000 | 9.772884 | 20.027529 | 13.971094 | Vsetín |
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:
- Download the file by right-clicking on this link and saving the file
- Place the file in the same folder as the notebook where you intend to read it
- Replace the code in the cell above with:
education = pd.read_csv(
    "education.csv",
)
The first thing you need to do is to merge the two tables, so that both the dependent and independent variables sit together. The municipality code in the elections table is in the "nationalCode" column, while in the education table it is in the "uzemi_kod" column.
elections_data = elections.merge(
    education, left_on="nationalCode", right_on="uzemi_kod"
)
elections_data.head()
nationalCode | PetrPavel | AndrejBabis | sourceOfName | geometry | uzemi_kod | without_education | undetermined | incomplete_primary_education | lower_secondary_and_secondary_education | ... | post_maturita_studies | bachelors_degree | doctoral_degree | masters_degree | higher_vocational_education | higher_vocational_education_in_a_conservatory | primary_education | complete_secondary_vocational_education | complete_secondary_general_education | okres | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 500011 | 61.73 | 38.26 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-518835.63 -1170505.52, -51790... | 500011 | 0.570704 | 3.741281 | 1.141408 | 34.242232 | ... | 0.507292 | 2.853519 | 0.634115 | 12.935954 | 1.395054 | 0.126823 | 11.350666 | 17.945466 | 10.779962 | Zlín |
1 | 500020 | 49.07 | 50.92 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-559207.18 -1075123.25, -55624... | 500020 | 0.885827 | 3.346457 | 1.968504 | 40.157480 | ... | 0.885827 | 1.771654 | 0.492126 | 6.299213 | 1.574803 | 0.000000 | 15.059055 | 16.338583 | 9.153543 | Šumperk |
2 | 500046 | 47.78 | 52.21 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-486502.29 -1123694.06, -48717... | 500046 | 0.359195 | 3.232759 | 0.790230 | 39.152299 | ... | 0.790230 | 3.520115 | 0.215517 | 10.632184 | 1.364943 | 0.143678 | 9.770115 | 15.301724 | 12.212644 | Nový Jičín |
3 | 500062 | 58.79 | 41.20 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-494514.83 -1136457.23, -49539... | 500062 | 0.238237 | 3.573556 | 1.072067 | 32.757594 | ... | 1.131626 | 3.037522 | 0.178678 | 13.281715 | 0.714711 | 0.119119 | 11.316260 | 18.701608 | 11.792734 | Vsetín |
4 | 500071 | 58.20 | 41.79 | Český úřad zeměměřický a katastrální | MULTIPOLYGON (((-499678.51 -1143457.52, -49799... | 500071 | 0.412939 | 2.890571 | 1.238816 | 34.067447 | ... | 0.757054 | 3.028217 | 0.137646 | 11.080523 | 0.894701 | 0.000000 | 9.772884 | 20.027529 | 13.971094 | Vsetín |
5 rows × 21 columns
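One caveat worth knowing about: merge() performs an inner join by default, so municipalities whose codes appear in only one of the two tables are dropped silently. A small self-contained sketch (with toy data, not the real tables) of guarding the merge with pandas' validate option and a row-count check:

```python
import pandas as pd

# Toy frames mimicking the two tables: the join keys have different names.
left = pd.DataFrame({"nationalCode": [1, 2, 3], "votes": [61.7, 49.1, 47.8]})
right = pd.DataFrame({"uzemi_kod": [1, 2, 3], "education": [12.9, 6.3, 10.6]})

merged = left.merge(
    right,
    left_on="nationalCode",
    right_on="uzemi_kod",
    validate="1:1",  # raise if either key column contains duplicates
)

# An inner join silently drops non-matching rows, so compare row
# counts to catch any municipalities lost in the merge.
assert len(merged) == len(left)
```

If the counts did differ, re-running the merge with indicator=True would show which side each unmatched row came from.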
That is all sorted and ready to be used in a regression.
Non-spatial linear regression
Before jumping into spatial regression, let’s start with the standard linear regression. A useful start is to explore the data using an ordinary least squares (OLS) linear regression model.
OLS model
While this course is not formula-heavy, in this case, it is useful to use the formula to explain the logic of the algorithm. OLS tries to model the dependent variable \(y\) as a linear combination of independent variables \(x_1, x_2, \dots, x_p\):
\[y_{i}=\alpha+\beta _{1}\ x_{i1}+\beta _{2}\ x_{i2}+\cdots +\beta _{p}\ x_{ip}+\varepsilon _{i}\]
where \(\varepsilon_{i}\) represents unobserved random error and \(\alpha\) represents an intercept - a constant. You know \(y_i\) and all of the \(x_{ij}\), and try to estimate the intercept and the \(\beta\) coefficients. In Python, you can run linear regression using implementations from more than one package (e.g., statsmodels, scikit-learn, spreg). This course covers the statsmodels approach as it has a nice API to work with.
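Under the hood, fitting OLS means finding the \(\alpha\) and \(\beta\) values that minimise the sum of squared errors \(\sum_i \varepsilon_i^2\). A minimal NumPy sketch of that least-squares fit on synthetic data (the two-variable setup and the coefficient values are illustrative, not from the session):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1.0 + 2.0 * x1 - 0.5 * x2 + noise
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones representing the intercept alpha.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: minimise ||y - X @ beta||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # approximately [1.0, 2.0, -0.5]
```

statsmodels does essentially this (plus the inference around it), while letting you describe the model as a formula string instead of building the design matrix by hand.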
First, you need a list of names of independent variables. That is equal to column names without a few of the columns that represent other data.
independent_names = education.columns.drop(["uzemi_kod", "okres"])
independent_names
Index(['without_education', 'undetermined', 'incomplete_primary_education',
'lower_secondary_and_secondary_education', 'further_education',
'post_maturita_studies', 'bachelors_degree', 'doctoral_degree',
'masters_degree', 'higher_vocational_education',
'higher_vocational_education_in_a_conservatory', 'primary_education',
'complete_secondary_vocational_education',
'complete_secondary_general_education'],
dtype='object')
statsmodels (imported above as sm) offers an intuitive formula API to define the linear regression.

formula = f"PetrPavel ~ {' + '.join(independent_names)}"
formula

In the formula, you specify the dependent variable ("PetrPavel") as a function of ("~") the independent variables ("undetermined + incomplete_primary_education + ...").
'PetrPavel ~ without_education + undetermined + incomplete_primary_education + lower_secondary_and_secondary_education + further_education + post_maturita_studies + bachelors_degree + doctoral_degree + masters_degree + higher_vocational_education + higher_vocational_education_in_a_conservatory + primary_education + complete_secondary_vocational_education + complete_secondary_general_education'
With the formula ready, you can fit the model and estimate all betas and \(\varepsilon\).
ols = sm.ols(formula, data=elections_data).fit()
The ols object offers a handy summary() function providing most of the results from the fitting in one place.
ols.summary()
Dep. Variable: | PetrPavel | R-squared: | 0.423 |
Model: | OLS | Adj. R-squared: | 0.422 |
Method: | Least Squares | F-statistic: | 352.6 |
Date: | Tue, 18 Mar 2025 | Prob (F-statistic): | 0.00 |
Time: | 11:15:13 | Log-Likelihood: | -22397. |
No. Observations: | 6254 | AIC: | 4.482e+04 |
Df Residuals: | 6240 | BIC: | 4.492e+04 |
Df Model: | 13 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
Intercept | 0.1283 | 0.006 | 19.748 | 0.000 | 0.116 | 0.141 |
without_education | 0.3621 | 0.093 | 3.914 | 0.000 | 0.181 | 0.543 |
undetermined | 0.1879 | 0.041 | 4.542 | 0.000 | 0.107 | 0.269 |
incomplete_primary_education | -0.0881 | 0.119 | -0.737 | 0.461 | -0.322 | 0.146 |
lower_secondary_and_secondary_education | 0.2890 | 0.013 | 21.435 | 0.000 | 0.263 | 0.315 |
further_education | 0.9665 | 0.116 | 8.312 | 0.000 | 0.739 | 1.194 |
post_maturita_studies | 1.3528 | 0.204 | 6.635 | 0.000 | 0.953 | 1.752 |
bachelors_degree | 1.1634 | 0.092 | 12.581 | 0.000 | 0.982 | 1.345 |
doctoral_degree | 1.2223 | 0.220 | 5.550 | 0.000 | 0.791 | 1.654 |
masters_degree | 1.1231 | 0.036 | 31.201 | 0.000 | 1.053 | 1.194 |
higher_vocational_education | 1.7312 | 0.132 | 13.124 | 0.000 | 1.473 | 1.990 |
higher_vocational_education_in_a_conservatory | 2.7664 | 0.577 | 4.796 | 0.000 | 1.636 | 3.897 |
primary_education | 0.0723 | 0.033 | 2.213 | 0.027 | 0.008 | 0.136 |
complete_secondary_vocational_education | 0.8683 | 0.032 | 27.316 | 0.000 | 0.806 | 0.931 |
complete_secondary_general_education | 0.8121 | 0.038 | 21.247 | 0.000 | 0.737 | 0.887 |
Omnibus: | 130.315 | Durbin-Watson: | 1.680 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 215.929 |
Skew: | 0.189 | Prob(JB): | 1.29e-47 |
Kurtosis: | 3.828 | Cond. No. | 6.06e+17 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.61e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
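Note [2] flags near-perfect multicollinearity, which is expected here: the education shares within each municipality sum to (roughly) 100%, so any one column is a linear combination of the others plus the intercept. A small sketch with hypothetical compositional data showing how this blows up the condition number of the design matrix, and how dropping one category restores a well-conditioned matrix:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical compositional data: three shares per row summing to 100,
# mimicking the structure of the education percentages.
raw = rng.uniform(size=(200, 3))
shares = 100 * raw / raw.sum(axis=1, keepdims=True)

# With an intercept column, the full set of shares is (near-)perfectly
# collinear: ones = (s1 + s2 + s3) / 100, so the matrix is singular.
X_full = np.column_stack([np.ones(200), shares])
X_drop = np.column_stack([np.ones(200), shares[:, :2]])  # drop one category

print(np.linalg.cond(X_full))  # astronomically large
print(np.linalg.cond(X_drop))  # moderate
```

This is why the coefficients above should be read with care: with compositional predictors, each effect is only interpretable relative to the omitted mix of the remaining categories.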
It is clear that education composition has a significant effect on the outcome of the elections but can explain only about 42% of its variance (adjusted \(R^2\) is 0.422). A higher share of residents with only primary education tends to lower Pavel’s share of the vote, while a higher share of university degrees tends to increase the number of votes he received. That is nothing unexpected. However, let’s make use of geography and unpack these results a bit.
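The figures quoted above (adjusted \(R^2\), coefficients) can also be pulled out of the fitted results object programmatically, which becomes handy once you want to map residuals or feed estimates into further analysis. A minimal sketch on synthetic data (the variable names are illustrative, not from the session):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

rng = np.random.default_rng(1)

# Synthetic stand-in for elections_data: one predictor with a known slope.
df = pd.DataFrame({"x": rng.normal(size=300)})
df["y"] = 3.0 + 1.5 * df["x"] + rng.normal(scale=0.5, size=300)

fit = sm.ols("y ~ x", data=df).fit()

coefficients = fit.params  # Series indexed by term name, incl. Intercept
r2_adj = fit.rsquared_adj  # the adjusted R-squared shown in the summary
residuals = fit.resid      # one residual per observation, same index

# Residuals align with the input rows, so they can be assigned back
# to a (Geo)DataFrame column and, for example, mapped.
assert len(residuals) == len(df)
```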