Regression and geography

When trying to determine the effect of some (independent) variables on the outcome of a phenomenon (the dependent variable), you often use regression to model that outcome and understand the influence each variable has in the model. Spatial regression works the same way; you just need to incorporate the spatial dimension in a mindful way.

This session provides an introduction to ways of incorporating space into regression models, from spatial variables in standard linear regression to geographically weighted regression.

import esda
import geopandas as gpd
import matplotlib.pyplot as plt
import mgwr
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as sm
from libpysal import graph

Data

You will work with the same data you already used in the session on spatial autocorrelation: the results of the second round of the 2023 presidential elections in Czechia, between Petr Pavel and Andrej Babiš, at the level of municipalities. You can read the election data directly from its original location.

elections = gpd.read_file(
    "https://martinfleischmann.net/sds/autocorrelation/data/cz_elections_2023.gpkg"
)
elections = elections.set_index("name")
elections.head()
nationalCode PetrPavel AndrejBabis sourceOfName geometry
name
Želechovice nad Dřevnicí 500011 61.73 38.26 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-518835.63 -1170505.52, -51790...
Petrov nad Desnou 500020 49.07 50.92 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-559207.18 -1075123.25, -55624...
Libhošť 500046 47.78 52.21 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-486502.29 -1123694.06, -48717...
Krhová 500062 58.79 41.20 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-494514.83 -1136457.23, -49539...
Poličná 500071 58.20 41.79 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-499678.51 -1143457.52, -49799...

Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:

  1. Download the file by right-clicking on this link and saving the file
  2. Place the file in the same folder as the notebook where you intend to read it
  3. Replace the code in the cell above with:
elections = gpd.read_file(
    "cz_elections_2023.gpkg",
)

The election results give you the dependent variable: you will look at the percentage of votes Petr Pavel, the winner, received. From the map of the results and the analysis you did when exploring spatial autocorrelation, you already know that there are some significant spatial patterns. Let's see whether these patterns correspond to the composition of education levels within each municipality.

You can use the data from the Czech Statistical Office reflecting the situation at the time of the 2021 Census. The original table has been preprocessed and is available as a CSV.

education = pd.read_csv(
    "https://martinfleischmann.net/sds/regression/data/education.csv"
)
education.head()
uzemi_kod without_education undetermined incomplete_primary_education lower_secondary_and_secondary_education further_education post_maturita_studies bachelors_degree doctoral_degree masters_degree higher_vocational_education higher_vocational_education_in_a_conservatory primary_education complete_secondary_vocational_education complete_secondary_general_education okres
0 500011 0.570704 3.741281 1.141408 34.242232 1.775523 0.507292 2.853519 0.634115 12.935954 1.395054 0.126823 11.350666 17.945466 10.779962 Zlín
1 500020 0.885827 3.346457 1.968504 40.157480 2.066929 0.885827 1.771654 0.492126 6.299213 1.574803 0.000000 15.059055 16.338583 9.153543 Šumperk
2 500046 0.359195 3.232759 0.790230 39.152299 2.514368 0.790230 3.520115 0.215517 10.632184 1.364943 0.143678 9.770115 15.301724 12.212644 Nový Jičín
3 500062 0.238237 3.573556 1.072067 32.757594 2.084574 1.131626 3.037522 0.178678 13.281715 0.714711 0.119119 11.316260 18.701608 11.792734 Vsetín
4 500071 0.412939 2.890571 1.238816 34.067447 1.720578 0.757054 3.028217 0.137646 11.080523 0.894701 0.000000 9.772884 20.027529 13.971094 Vsetín

Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:

  1. Download the file by right-clicking on this link and saving the file
  2. Place the file in the same folder as the notebook where you intend to read it
  3. Replace the code in the cell above with:
education = pd.read_csv(
    "education.csv",
)

The first thing you need to do is merge the two tables so that you have both dependent and independent variables together. The municipality code in the elections table is in the "nationalCode" column, while in the education table it is in the "uzemi_kod" column.

elections_data = elections.merge(education, left_on="nationalCode", right_on="uzemi_kod")
elections_data.head()
nationalCode PetrPavel AndrejBabis sourceOfName geometry uzemi_kod without_education undetermined incomplete_primary_education lower_secondary_and_secondary_education ... post_maturita_studies bachelors_degree doctoral_degree masters_degree higher_vocational_education higher_vocational_education_in_a_conservatory primary_education complete_secondary_vocational_education complete_secondary_general_education okres
0 500011 61.73 38.26 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-518835.63 -1170505.52, -51790... 500011 0.570704 3.741281 1.141408 34.242232 ... 0.507292 2.853519 0.634115 12.935954 1.395054 0.126823 11.350666 17.945466 10.779962 Zlín
1 500020 49.07 50.92 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-559207.18 -1075123.25, -55624... 500020 0.885827 3.346457 1.968504 40.157480 ... 0.885827 1.771654 0.492126 6.299213 1.574803 0.000000 15.059055 16.338583 9.153543 Šumperk
2 500046 47.78 52.21 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-486502.29 -1123694.06, -48717... 500046 0.359195 3.232759 0.790230 39.152299 ... 0.790230 3.520115 0.215517 10.632184 1.364943 0.143678 9.770115 15.301724 12.212644 Nový Jičín
3 500062 58.79 41.20 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-494514.83 -1136457.23, -49539... 500062 0.238237 3.573556 1.072067 32.757594 ... 1.131626 3.037522 0.178678 13.281715 0.714711 0.119119 11.316260 18.701608 11.792734 Vsetín
4 500071 58.20 41.79 Český úřad zeměměřický a katastrální MULTIPOLYGON (((-499678.51 -1143457.52, -49799... 500071 0.412939 2.890571 1.238816 34.067447 ... 0.757054 3.028217 0.137646 11.080523 0.894701 0.000000 9.772884 20.027529 13.971094 Vsetín

5 rows × 21 columns

That is all sorted and ready to be used in a regression.
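Since each municipality should appear exactly once in both tables, it can be worth guarding the merge against duplicated keys. A minimal sketch with synthetic codes (the column names mirror the real tables, the values are made up) using the validate parameter of pandas' merge:

```python
import pandas as pd

# Toy stand-ins for the elections and education tables; values are made up
left = pd.DataFrame({"nationalCode": [500011, 500020], "PetrPavel": [61.73, 49.07]})
right = pd.DataFrame({"uzemi_kod": [500011, 500020], "masters_degree": [12.9, 6.3]})

# validate="1:1" raises MergeError if either key column contains
# duplicates, guarding against silent row multiplication
merged = left.merge(
    right, left_on="nationalCode", right_on="uzemi_kod", validate="1:1"
)
```

If a code were duplicated in either table, the merge would fail loudly instead of quietly inflating the row count.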

Non-spatial linear regression

Before jumping into spatial regression, let’s start with the standard linear regression. A useful start is to explore the data using an ordinary least squares (OLS) linear regression model.

OLS model

While this course is not formula-heavy, in this case it is useful to use the formula to explain the logic of the algorithm. OLS models the dependent variable \(y\) as a linear combination of the independent variables \(x_1, x_2, \dots, x_p\):

\[y_{i}=\alpha+\beta _{1}\ x_{i1}+\beta _{2}\ x_{i2}+\cdots +\beta _{p}\ x_{ip}+\varepsilon _{i}\]

where \(\varepsilon_{i}\) represents unobserved random error and \(\alpha\) represents an intercept, a constant. You know \(y_i\) and all of the \(x_{ij}\) and try to estimate the coefficients. In Python, you can run linear regression using implementations from more than one package (e.g., statsmodels, scikit-learn, spreg). This course covers the statsmodels approach, as it has a nice API to work with.
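To make the formula concrete, here is a toy sketch (synthetic data, two independent variables) that estimates \(\alpha\) and the \(\beta\) coefficients by minimising the sum of squared errors with plain numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# y follows the OLS model with alpha=2.0, beta1=1.5, beta2=-0.5 plus noise
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones representing the intercept alpha
X = np.column_stack([np.ones(n), x1, x2])

# The least-squares solution of X b = y minimises ||y - X b||^2
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The estimated coefficients land close to the true values used to generate the data; statsmodels does the same estimation while adding diagnostics on top.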

First, you need a list of the names of the independent variables. That is equal to the education table's column names minus the few columns that contain other data.

independent_names = education.columns.drop(["uzemi_kod", "okres"])
independent_names
Index(['without_education', 'undetermined', 'incomplete_primary_education',
       'lower_secondary_and_secondary_education', 'further_education',
       'post_maturita_studies', 'bachelors_degree', 'doctoral_degree',
       'masters_degree', 'higher_vocational_education',
       'higher_vocational_education_in_a_conservatory', 'primary_education',
       'complete_secondary_vocational_education',
       'complete_secondary_general_education'],
      dtype='object')

statsmodels (imported above as sm) offers an intuitive formula API to define the linear regression.

formula = f"PetrPavel ~ {' + '.join(independent_names)}"
formula

In the formula, specify the dependent variable ("PetrPavel") as a function of ("~") the independent variables ("without_education + undetermined + ...").

'PetrPavel ~ without_education + undetermined + incomplete_primary_education + lower_secondary_and_secondary_education + further_education + post_maturita_studies + bachelors_degree + doctoral_degree + masters_degree + higher_vocational_education + higher_vocational_education_in_a_conservatory + primary_education + complete_secondary_vocational_education + complete_secondary_general_education'

With the formula ready, you can fit the model and estimate the intercept \(\alpha\) and all the \(\beta\) coefficients.

ols = sm.ols(formula, data=elections_data).fit()

The ols object offers a handy summary() method providing most of the results from the fitting in one place.

ols.summary()
OLS Regression Results
Dep. Variable: PetrPavel R-squared: 0.423
Model: OLS Adj. R-squared: 0.422
Method: Least Squares F-statistic: 352.6
Date: Tue, 08 Oct 2024 Prob (F-statistic): 0.00
Time: 13:24:11 Log-Likelihood: -22397.
No. Observations: 6254 AIC: 4.482e+04
Df Residuals: 6240 BIC: 4.492e+04
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1283 0.006 19.748 0.000 0.116 0.141
without_education 0.3621 0.093 3.914 0.000 0.181 0.543
undetermined 0.1879 0.041 4.542 0.000 0.107 0.269
incomplete_primary_education -0.0881 0.119 -0.737 0.461 -0.322 0.146
lower_secondary_and_secondary_education 0.2890 0.013 21.435 0.000 0.263 0.315
further_education 0.9665 0.116 8.312 0.000 0.739 1.194
post_maturita_studies 1.3528 0.204 6.635 0.000 0.953 1.752
bachelors_degree 1.1634 0.092 12.581 0.000 0.982 1.345
doctoral_degree 1.2223 0.220 5.550 0.000 0.791 1.654
masters_degree 1.1231 0.036 31.201 0.000 1.053 1.194
higher_vocational_education 1.7312 0.132 13.124 0.000 1.473 1.990
higher_vocational_education_in_a_conservatory 2.7664 0.577 4.796 0.000 1.636 3.897
primary_education 0.0723 0.033 2.213 0.027 0.008 0.136
complete_secondary_vocational_education 0.8683 0.032 27.316 0.000 0.806 0.931
complete_secondary_general_education 0.8121 0.038 21.247 0.000 0.737 0.887
Omnibus: 130.315 Durbin-Watson: 1.680
Prob(Omnibus): 0.000 Jarque-Bera (JB): 215.929
Skew: 0.189 Prob(JB): 1.29e-47
Kurtosis: 3.828 Cond. No. 6.06e+17


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.61e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

It is clear that education composition has a significant effect on the outcome of the elections but can explain only about 42% of its variance (adjusted \(R^2\) is 0.422). A higher share of residents with only primary education tends to lower Pavel's gain, while a higher share of university degrees tends to increase the number of votes he received. That is nothing unexpected. However, let's make use of geography and unpack these results a bit.
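The \(R^2\) reported in the summary is derived from the variance left in the residuals. As a toy illustration of the underlying computation, using the observed results and OLS predictions of the first five municipalities from the outputs in this section (a five-row sample, so the number will differ from the full model's 0.423):

```python
import numpy as np

# Observed results and OLS predictions for the first five municipalities
actual = np.array([61.73, 49.07, 47.78, 58.79, 58.20])
predicted = np.array([59.78, 50.88, 58.58, 60.67, 60.37])
residual = actual - predicted

# R^2 = 1 - SS_res / SS_tot: the share of variance explained by the model
ss_res = (residual ** 2).sum()
ss_tot = ((actual - actual.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
```

Most of the squared error in this small sample comes from a single poorly predicted municipality (Libhošť), which is exactly the kind of local misfit the residual map below helps to locate.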

Spatial exploration of the model (hidden structures)

Start with a visualisation of the prediction the OLS model produces using the coefficients shown above.

predicted = ols.predict(elections_data)
predicted.head()

Use the predict() method with the original data to get the prediction from the model.

0    59.782801
1    50.879659
2    58.582439
3    60.674179
4    60.372217
dtype: float64

Make a plot comparing the prediction with the actual results.

f, axs = plt.subplots(2, 1, figsize=(7, 8))
elections_data.plot(
    predicted, legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[0]
)
elections_data.plot(
    "PetrPavel", legend=True, cmap="coolwarm", vmin=0, vmax=100, ax=axs[1]
)
axs[0].set_title("OLS prediction")
axs[1].set_title("Actual results")

axs[0].set_axis_off()
axs[1].set_axis_off()

  1. Create a subplot with two axes.
  2. Plot the predicted data on the elections_data geometry.
  3. Plot the original results.
  4. Set titles for the axes in the subplot.
  5. Remove axes borders.

OLS prediction and the actual outcome

The general patterns are captured, but some areas of the country seem to be quite off. The actual error between the prediction and the dependent variable is captured as residuals, which are directly available in ols as the ols.resid attribute. Let's plot them to get a better comparison.

elections_data["residual"] = ols.resid
max_residual = ols.resid.abs().max()
ax = elections_data.plot(
    "residual", legend=True, cmap="RdBu", vmin=-max_residual, vmax=max_residual
)
ax.set_axis_off()

  1. Assign residuals as a column. This is not needed for the plot, but it will be useful later.
  2. Identify the maximum absolute residual value to specify the vmin and vmax values of the colormap.
  3. Plot the data using a diverging colormap centred around 0.

Residuals of the OLS prediction

All of the municipalities in blue (residual above 0) reported higher gains for Petr Pavel than the model assumes based on education structure, while all those in red reported lower gains than expected. However, as data scientists, you have better tools to analyse the spatial structure of residuals than eyeballing it. Let's recall the session on spatial autocorrelation again and figure out the spatial clusters of residuals.

First, create a contiguity graph and row-normalise it.

contiguity_r = graph.Graph.build_contiguity(elections_data).transform("r")

Then you can generate a Moran plot of residuals. For that, you will need the lag of residuals.

elections_data["residual_lag"] = contiguity_r.lag(elections_data["residual"])
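With a row-normalised graph, the spatial lag of each observation is simply the mean of its neighbours' values. A toy sketch of the same operation with a hand-built adjacency matrix (synthetic values, plain numpy instead of libpysal):

```python
import numpy as np

# Binary contiguity among four toy areas (1 = neighbours)
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# Row-normalise so each row sums to 1; the lag then becomes a neighbour mean
W_r = W / W.sum(axis=1, keepdims=True)

values = np.array([1.0, 2.0, 3.0, 5.0])
lag = W_r @ values  # spatial lag: the average of each area's neighbours
print(lag)  # [2.5 3.  3.  2.5]
```

The graph.lag() call above performs the equivalent matrix-vector product on the full contiguity structure of the municipalities.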

And then you can use the code from the earlier session to generate a Moran scatterplot using seaborn.

f, ax = plt.subplots(1, figsize=(6, 6))
sns.regplot(
    x="residual",
    y="residual_lag",
    data=elections_data,
    marker=".",
    scatter_kws={"alpha": 0.2},
    line_kws=dict(color="lightcoral")
)
plt.axvline(0, c="black", alpha=0.5)
plt.axhline(0, c="black", alpha=0.5);

Moran Plot
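The slope of the fitted line in a Moran scatterplot corresponds to global Moran's I. As a toy illustration of the statistic's definition (five synthetic areas arranged on a chain, row-standardised weights, plain numpy rather than esda):

```python
import numpy as np

# Row-standardised weights for five areas arranged on a line
W = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # smoothly increasing: autocorrelated
z = y - y.mean()

# Moran's I = (n / S0) * z'Wz / z'z; with row-standardised W, S0 = n
I = (z @ W @ z) / (z @ z)
print(round(I, 3))  # 0.6
```

The strongly positive value reflects the smooth spatial trend in the toy data, just as the positive slope above reflects the clustering of residuals.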

That looks like a pretty strong relationship. Use the local version of Moran's statistic to identify the clusters.

lisa = esda.Moran_Local(elections_data["residual"], contiguity_r)

Use the Moran_Local function from esda.

Let’s visualise the results.

lisa.explore(elections_data, prefer_canvas=True, tiles="CartoDB Positron")
LISA clusters of the OLS residuals