import geopandas as gpd
import pandas as pd
import seaborn as sns
from libpysal import graph
from sklearn import cluster
Clustering and regionalisation
This session is all about finding groups of similar observations in data using clustering techniques.
Many questions and topics are complex phenomena that involve several dimensions and are hard to summarise in a single variable. In statistical terms, this family of problems is called multivariate, as opposed to univariate cases, where only a single variable is considered in the analysis. Clustering tackles this kind of question by reducing dimensionality (the number of relevant variables the analyst needs to look at) and converting it into a more intuitive set of classes that even non-technical audiences can look at and make sense of. For this reason, it is widely used in applied contexts such as policymaking or marketing. In addition, since these methods do not require many preliminary assumptions about the structure of the data, clustering is a commonly used exploratory tool, as it can quickly give clues about the shape, form and content of a dataset.
The basic idea of statistical clustering is to summarise the information contained in several variables by creating a relatively small number of categories. Each observation in the dataset is then assigned to one, and only one, category depending on its values for the variables originally considered in the classification. If done correctly, the exercise reduces the complexity of a multi-dimensional problem while retaining the meaningful information contained in the original dataset. This is because, once classified, the analyst only needs to look at the category each observation falls into, instead of considering the multiple values associated with each of the variables and trying to figure out how to put them together in a coherent sense. When the clustering is performed on observations that represent areas, the technique is often called geodemographic analysis.
Although there exist many techniques to statistically group observations in a dataset, all of them are based on the same premise: using a set of attributes to define classes or categories of observations that are similar within each group but different between groups. How similarity within groups and dissimilarity between them is defined, and how the classification algorithm is operationalised, is what makes techniques differ and also what makes each of them particularly well suited for specific problems or types of data.
In the case of analysing spatial data, there is a subset of methods that are of particular interest for many common cases in Spatial Data Science: the so-called regionalisation techniques. Regionalisation methods can take many forms but, at their core, they all involve statistical clustering of observations with the additional constraint that observations need to be geographical neighbours to fall into the same category. Because of this, rather than category, you will use the term area for each observation and region for each category; hence regionalisation, the construction of regions from smaller areas.
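To make this concrete before the detailed walkthrough, below is a minimal sketch of what a regionalisation can look like in code, using one common approach: agglomerative clustering constrained by a contiguity graph. It is illustrative only; gdf and cols are hypothetical placeholders for a GeoDataFrame and a list of attribute columns, while graph and cluster are the modules imported at the top of this session.

# sketch only: `gdf` and `cols` are hypothetical placeholders
contiguity = graph.Graph.build_contiguity(gdf)  # which areas share a border
regions = cluster.AgglomerativeClustering(
    n_clusters=5,
    connectivity=contiguity.sparse,  # merges allowed only between neighbours
)
regions.fit(gdf[cols])  # the resulting labels form contiguous regions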
The Python package you will use for clustering today is called scikit-learn and can be imported as sklearn.
Attribute-based clustering
In this session, you will be working with another dataset you should already be familiar with: the Scottish Index of Multiple Deprivation. This time, you will focus only on a subset covering the area of Edinburgh, prepared for this course.
As always, the table can be read from the site:
simd = gpd.read_file(
    "https://martinfleischmann.net/sds/clustering/data/edinburgh_simd_2020.gpkg"
)
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:
- Download the file by right-clicking on this link and saving the file
- Place the file in the same folder as the notebook where you intend to read it
- Replace the code in the cell above with:
simd = gpd.read_file(
    "edinburgh_simd_2020.gpkg",
)
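If you want the same cell to work in both situations, one optional convenience (not part of the original workflow) is to fall back to the remote file whenever a local copy is missing:

from pathlib import Path

local_file = Path("edinburgh_simd_2020.gpkg")
if local_file.exists():
    # use the manually downloaded copy
    simd = gpd.read_file(local_file)
else:
    # otherwise read directly from the web as above
    simd = gpd.read_file(
        "https://martinfleischmann.net/sds/clustering/data/edinburgh_simd_2020.gpkg"
    )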
Inspect the structure of the table:
simd.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 597 entries, 0 to 596
Data columns (total 52 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DataZone 597 non-null object
1 DZName 597 non-null object
2 LAName 597 non-null object
3 SAPE2017 597 non-null int64
4 WAPE2017 597 non-null int64
5 Rankv2 597 non-null int64
6 Quintilev2 597 non-null int64
7 Decilev2 597 non-null int64
8 Vigintilv2 597 non-null int64
9 Percentv2 597 non-null int64
10 IncRate 597 non-null object
11 IncNumDep 597 non-null int64
12 IncRankv2 597 non-null float64
13 EmpRate 597 non-null object
14 EmpNumDep 597 non-null int64
15 EmpRank 597 non-null float64
16 HlthCIF 597 non-null int64
17 HlthAlcSR 597 non-null int64
18 HlthDrugSR 597 non-null int64
19 HlthSMR 597 non-null int64
20 HlthDprsPc 597 non-null object
21 HlthLBWTPc 597 non-null object
22 HlthEmergS 597 non-null int64
23 HlthRank 597 non-null int64
24 EduAttend 597 non-null object
25 EduAttain 597 non-null float64
26 EduNoQuals 597 non-null int64
27 EduPartici 597 non-null object
28 EduUniver 597 non-null object
29 EduRank 597 non-null int64
30 GAccPetrol 597 non-null float64
31 GAccDTGP 597 non-null float64
32 GAccDTPost 597 non-null float64
33 GAccDTPsch 597 non-null float64
34 GAccDTSsch 597 non-null float64
35 GAccDTRet 597 non-null float64
36 GAccPTGP 597 non-null float64
37 GAccPTPost 597 non-null float64
38 GAccPTRet 597 non-null float64
39 GAccBrdbnd 597 non-null object
40 GAccRank 597 non-null int64
41 CrimeCount 597 non-null int64
42 CrimeRate 597 non-null int64
43 CrimeRank 597 non-null float64
44 HouseNumOC 597 non-null int64
45 HouseNumNC 597 non-null int64
46 HouseOCrat 597 non-null object
47 HouseNCrat 597 non-null object
48 HouseRank 597 non-null float64
49 Shape_Leng 597 non-null float64
50 Shape_Area 597 non-null float64
51 geometry 597 non-null geometry
dtypes: float64(16), geometry(1), int64(22), object(13)
memory usage: 242.7+ KB
Before you jump into exploring the data, there is one additional step that will come in handy down the line. Not every variable in the table is an attribute you will want for the clustering. In particular, you are interested in the sub-ranks based on individual SIMD domains, so you will consider only those. Hence, first list them manually so they are easier to subset:
subranks = [
    "IncRankv2",
    "EmpRank",
    "HlthRank",
    "EduRank",
    "GAccRank",
    "CrimeRank",
    "HouseRank",
]
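As a quick optional sanity check, you can verify that the selection works and inspect the value ranges of the seven columns:

# summary statistics of the sub-rank columns used for clustering
simd[subranks].describe()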
You can quickly familiarise yourself with those variables by plotting a few maps like the one below to build your intuition about what is going to happen.
"IncRankv2", "geometry"]].explore("IncRankv2", tiles="CartoDB Positron", tooltip=False) simd[[
You can see a decent degree of spatial variation between different sub-ranks. Even though you only have seven variables, it is very hard to “mentally overlay” all of them to come up with an overall assessment of the nature of each part of Edinburgh. For bivariate relationships, a useful tool is the scatter plot matrix, available in seaborn:
_ = sns.pairplot(simd[subranks], height=1, plot_kws={"s": 1})
This is helpful for univariate and bivariate questions such as: what is the relationship between the ranks? Is health correlated with income? However, sometimes this is not enough, and you are interested in more sophisticated questions that are truly multivariate; in these cases, the figure above cannot help. For example, it is not straightforward to answer questions like: what are the main characteristics of the South of Edinburgh? What areas are similar to the core of the city? Are the East and West of Edinburgh similar in terms of deprivation levels? For these kinds of multi-dimensional questions (involving multiple variables at the same time), you require a truly multidimensional method like statistical clustering.
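As an optional numeric complement to the scatter plot matrix, the bivariate questions above can also be answered at a glance with a correlation heatmap:

# pairwise correlations between sub-ranks as an annotated heatmap
sns.heatmap(
    simd[subranks].corr(),
    annot=True,  # print each coefficient in its cell
    fmt=".2f",
    vmin=-1,
    vmax=1,
    cmap="vlag",  # diverging palette centred on zero
)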
K-Means
A cluster analysis involves the classification of the areas that make up a geographical map into groups, or categories, of observations that are similar within each group but different between groups. The classification is carried out using a statistical clustering algorithm that takes as input a set of attributes and returns the group (“labels” in the terminology) each observation belongs to. Depending on the particular algorithm employed, additional parameters, such as the desired number of clusters or more advanced tuning parameters (e.g. bandwidth, radius, etc.), also need to be entered as inputs. For your classification of SIMD in Edinburgh, you will start with one of the most popular clustering algorithms: K-means. This technique only requires as input the observation attributes and the final number of groups you want it to cluster the observations into. In your case, you will use five to begin with, as this will allow you to have a closer look at each of them.
Although the underlying algorithm is not trivial, running K-means in Python is streamlined thanks to scikit-learn. As with the rest of the extensive set of algorithms available in the library, its computation is a matter of two lines of code. First, you need to specify the parameters in the KMeans method (which is part of scikit-learn’s cluster submodule). Note that, at this point, you do not even need to pass the data:
kmeans5 = cluster.KMeans(n_clusters=5, random_state=42)

- n_clusters specifies the number of clusters you want to get, and random_state sets the random generator to a known state, ensuring that the result is always the same.
This sets up an object that holds all the parameters required to run the algorithm. To actually run the algorithm on the attributes, you need to call the fit method on kmeans5:
kmeans5.fit(simd[subranks])

- fit() takes an array of data; therefore, pass the columns of simd with the sub-ranks and run the clustering algorithm on them.
KMeans(n_clusters=5, random_state=42)
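As an aside, scikit-learn also offers fit_predict(), which collapses the two steps of fitting the model and extracting the labels into a single call; this sketch is equivalent to the workflow used here:

# fit the model and return the cluster labels in one go
labels = cluster.KMeans(n_clusters=5, random_state=42).fit_predict(simd[subranks])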
The kmeans5 object now contains several components that can be useful for an analysis. For now, you will use the labels, which represent the different categories in which you have grouped the data. Remember, in Python, life starts at zero, so the group labels go from zero to four. Labels can be extracted as follows:
kmeans5.labels_
array([2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 3, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3, 1, 1,
1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1,
1, 3, 0, 3, 0, 3, 3, 3, 3, 3, 1, 1, 3, 3, 0, 3, 3, 1, 0, 0, 0, 0,
4, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 4, 2, 2, 2, 2, 2, 1, 1, 3,
3, 3, 3, 1, 1, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 2, 3, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 3, 1, 1, 1, 3, 1, 1,
1, 3, 4, 2, 2, 4, 4, 4, 2, 3, 1, 1, 1, 3, 1, 1, 2, 2, 2, 2, 2, 3,
2, 4, 1, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 4, 2, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 4, 4, 0, 0, 0, 0, 3, 4, 0, 4, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 3, 1, 3, 3, 3, 1, 3, 3,
3, 0, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 1, 3,
3, 3, 3, 1, 3, 1, 3, 3, 3, 0, 3, 3, 4, 0, 0, 2, 2, 4, 1, 2, 1, 3,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 2, 2, 2, 4,
4, 4, 4, 3, 4, 3, 3, 1, 3, 3, 4, 4, 4, 4, 3, 4, 4, 3, 1, 3, 0, 1,
4, 4, 4, 4, 3, 1, 1, 3, 1, 1, 1, 1, 1, 1, 3, 0, 0, 3, 3, 4, 3, 1,
3, 0, 3, 3, 3, 1, 3, 1, 1, 3, 0, 3, 1, 3, 0, 3, 1, 3, 1, 3, 1, 1,
3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 0, 3, 3, 3, 0, 0, 0, 0,
1, 3, 3, 3, 3, 0, 0, 1, 0, 4, 4, 4, 4, 3, 4, 0, 4, 4, 4, 2, 4, 4,
4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0,
0, 4, 0, 0, 4, 0, 4, 2, 4, 0, 4, 0, 0, 0, 0, 0, 4, 4, 0, 4, 4, 3,
4, 4, 2, 4, 2, 2, 4, 2, 2, 2, 0, 2, 2, 4, 4, 2, 4, 2, 4, 4, 1, 1,
3, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 3, 1, 1,
1, 1, 1, 3, 3, 1, 1, 1, 1, 3, 1, 1, 4, 4, 0, 2, 4, 4, 2, 2, 2, 2,
2, 2, 2, 4, 3, 3, 1, 1, 1, 3, 2, 2, 3, 1, 3, 2, 2, 2, 2, 4, 4, 2,
4, 2, 2, 4, 2, 2, 4, 4, 3, 4, 4, 0, 3, 3, 4, 3, 3, 3, 3, 4, 4, 3,
0, 0, 2, 3, 2, 3, 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 4, 3, 3, 4, 1, 4,
2, 2, 2], dtype=int32)
Each number represents a different category, so two observations with the same number belong to the same group. The labels are returned in the same order as the input attributes were passed in, which means you can append them to the original table of data as an additional column:
"kmeans_5"] = kmeans5.labels_
simd["kmeans_5"].head() simd[
0 2
1 2
2 2
3 2
4 3
Name: kmeans_5, dtype: int32
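Before mapping the groups, a quick optional check shows how many data zones fall into each cluster:

# frequency of each cluster label
simd["kmeans_5"].value_counts()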
It is useful to display the categories created on a map to better understand the classification you have just performed. For this, you will use a unique values choropleth, which will automatically assign a different colour to each category:
"kmeans_5", 'geometry']].explore("kmeans_5", categorical=True, tiles="CartoDB Positron") simd[[