Clustering and regionalisation

This session is all about finding groups of similar observations in data using clustering techniques.

Many questions and topics are complex phenomena that involve several dimensions and are hard to summarise in a single variable. In statistical terms, this family of problems is called multivariate, as opposed to univariate cases, where only a single variable is considered in the analysis. Clustering tackles this kind of question by reducing its dimensionality (the number of relevant variables the analyst needs to look at) and converting it into a more intuitive set of classes that even non-technical audiences can look at and make sense of. For this reason, it is widely used in applied contexts such as policymaking or marketing. In addition, since these methods do not require many preliminary assumptions about the structure of the data, clustering is a commonly used exploratory tool, as it can quickly give clues about the shape, form and content of a dataset.

The basic idea of statistical clustering is to summarise the information contained in several variables by creating a relatively small number of categories. Each observation in the dataset is then assigned to one, and only one, category depending on its values for the variables originally considered in the classification. If done correctly, the exercise reduces the complexity of a multi-dimensional problem while retaining most of the meaningful information contained in the original dataset. This is because, once the observations are classified, the analyst only needs to look at which category each observation falls into, instead of considering the multiple values associated with each of the variables and trying to figure out how to put them together in a coherent way. When the clustering is performed on observations that represent areas, the technique is often called geodemographic analysis.
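The idea of assigning each observation to one, and only one, category can be sketched on a toy table. The snippet below is a minimal illustration using scikit-learn, the package used throughout this session. K-means serves here only as one example algorithm, and the six observations are invented for the purpose, not taken from any dataset in this course.

```python
import numpy as np
from sklearn import cluster

# Hypothetical toy data: six observations described by two variables.
# The values are invented for illustration only.
observations = np.array(
    [
        [1.0, 1.2],
        [0.8, 1.0],
        [1.1, 0.9],
        [8.0, 8.3],
        [7.9, 8.1],
        [8.2, 7.8],
    ]
)

# Ask for two categories; random_state only makes the run reproducible.
kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(observations)

# One label per observation: the two-column table is now summarised
# by a single categorical variable.
print(labels)
```

The numeric value of each label is arbitrary. What matters is which observations share a label.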

Although there exist many techniques to statistically group observations in a dataset, all of them are based on the premise of using a set of attributes to define classes or categories of observations that are similar within each group but differ between groups. Techniques differ in how they define similarity within groups and dissimilarity between them, and in how the classification algorithm is operationalised. These choices are also what makes each technique particularly well suited to specific problems or types of data.

In the case of analysing spatial data, there is a subset of methods that are of particular interest for many common cases in Spatial Data Science. These are the so-called regionalisation techniques. Regionalisation methods can also take many forms but, at their core, they all involve statistical clustering of observations with the additional constraint that observations need to be geographical neighbours to be in the same category. Because of this, rather than observation and category, you will use the terms area and region, respectively. Hence regionalisation: the construction of regions from smaller areas.
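The effect of the contiguity constraint can be sketched on a toy grid. The example below is one possible way to impose it with scikit-learn, passing a sparse connectivity matrix to `AgglomerativeClustering`. It is an illustration under stated assumptions, not necessarily the method used later in this session. The 4 x 4 grid, its attribute values and the helper `is_contiguous` are all hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn import cluster
from sklearn.feature_extraction.image import grid_to_graph

# Hypothetical setting: 16 areas on a 4 x 4 grid with one invented attribute.
# High values sit in two opposite corners that only touch diagonally.
values = np.array(
    [
        [9.0, 9.0, 1.0, 1.0],
        [9.0, 9.0, 1.0, 1.0],
        [1.0, 1.0, 9.0, 9.0],
        [1.0, 1.0, 9.0, 9.0],
    ]
)
X = values.reshape(-1, 1)

# Areas are neighbours when they share an edge of the grid.
connectivity = grid_to_graph(4, 4).tocsr()


def is_contiguous(labels, connectivity):
    """Return True if every label forms a single connected set of areas."""
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        sub = connectivity[members][:, members]
        n_components, _ = connected_components(sub, directed=False)
        if n_components != 1:
            return False
    return True


# Attribute-only clustering: similar areas are grouped wherever they are.
unconstrained = cluster.AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Regionalisation: only neighbouring areas may be merged.
constrained = cluster.AgglomerativeClustering(
    n_clusters=2, connectivity=connectivity
).fit_predict(X)

print(is_contiguous(unconstrained, connectivity))
print(is_contiguous(constrained, connectivity))
```

Without the constraint, the two high-value corners end up in the same category even though they are not neighbours. With it, each region is a single connected block, at the cost of mixing high and low values within a region.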

The Python package you will use for clustering today is called scikit-learn and can be imported as sklearn.

import geopandas as gpd
import pandas as pd
import seaborn as sns
from libpysal import graph
from sklearn import cluster

Attribute-based clustering

In this session, you will be working with another dataset you should already be familiar with - the Scottish Index of Multiple Deprivation. This time, you will focus only on the area of Edinburgh prepared for this course.

Scottish Index of Multiple Deprivation

As always, the table can be read from the site:

simd = gpd.read_file(
    "https://martinfleischmann.net/sds/clustering/data/edinburgh_simd_2020.gpkg"
)

Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:

  1. Download the file by right-clicking on this link and saving the file
  2. Place the file in the same folder as the notebook where you intend to read it
  3. Replace the code in the cell above with:
simd = gpd.read_file(
    "edinburgh_simd_2020.gpkg",
)

Inspect the structure of the table:

simd.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 597 entries, 0 to 596
Data columns (total 52 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   DataZone    597 non-null    object  
 1   DZName      597 non-null    object  
 2   LAName      597 non-null    object  
 3   SAPE2017    597 non-null    int64   
 4   WAPE2017    597 non-null    int64   
 5   Rankv2      597 non-null    int64   
 6   Quintilev2  597 non-null    int64   
 7   Decilev2    597 non-null    int64   
 8   Vigintilv2  597 non-null    int64   
 9   Percentv2   597 non-null    int64   
 10  IncRate     597 non-null    object  
 11  IncNumDep   597 non-null    int64   
 12  IncRankv2   597 non-null    float64 
 13  EmpRate     597 non-null    object  
 14  EmpNumDep   597 non-null    int64   
 15  EmpRank     597 non-null    float64 
 16  HlthCIF     597 non-null    int64   
 17  HlthAlcSR   597 non-null    int64   
 18  HlthDrugSR  597 non-null    int64   
 19  HlthSMR     597 non-null    int64   
 20  HlthDprsPc  597 non-null    object  
 21  HlthLBWTPc  597 non-null    object  
 22  HlthEmergS  597 non-null    int64   
 23  HlthRank    597 non-null    int64   
 24  EduAttend   597 non-null    object  
 25  EduAttain   597 non-null    float64 
 26  EduNoQuals  597 non-null    int64   
 27  EduPartici  597 non-null    object  
 28  EduUniver   597 non-null    object  
 29  EduRank     597 non-null    int64   
 30  GAccPetrol  597 non-null    float64 
 31  GAccDTGP    597 non-null    float64 
 32  GAccDTPost  597 non-null    float64 
 33  GAccDTPsch  597 non-null    float64 
 34  GAccDTSsch  597 non-null    float64 
 35  GAccDTRet   597 non-null    float64 
 36  GAccPTGP    597 non-null    float64 
 37  GAccPTPost  597 non-null    float64 
 38  GAccPTRet   597 non-null    float64 
 39  GAccBrdbnd  597 non-null    object  
 40  GAccRank    597 non-null    int64   
 41  CrimeCount  597 non-null    int64   
 42  CrimeRate   597 non-null    int64   
 43  CrimeRank   597 non-null    float64 
 44  HouseNumOC  597 non-null    int64   
 45  HouseNumNC  597 non-null    int64   
 46  HouseOCrat  597 non-null    object  
 47  HouseNCrat  597 non-null    object  
 48  HouseRank   597 non-null    float64 
 49  Shape_Leng  597 non-null    float64 
 50  Shape_Area  597 non-null    float64 
 51  geometry    597 non-null    geometry
dtypes: float64(16), geometry(1), int64(22), object(13)
memory usage: 242.7+ KB

Before you jump into exploring the data, there is one additional step that will come in handy down the line. Not every variable in the table is an attribute that you will want for the clustering. In particular, you are interested in the sub-ranks based on individual SIMD domains, so you will only consider those. Hence, first write their names out manually so they are easier to subset:

subranks = [
    "IncRankv2",
    "EmpRank",
    "HlthRank",
    "EduRank",
    "GAccRank",
    "CrimeRank",
    "HouseRank"
]
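A list like this can be used both to check that every expected column is present and to subset the table. The sketch below shows the pattern on a small stand-in `DataFrame` with invented values. The names `table`, `missing` and `attributes` are hypothetical and only serve the illustration.

```python
import pandas as pd

subranks = [
    "IncRankv2",
    "EmpRank",
    "HlthRank",
    "EduRank",
    "GAccRank",
    "CrimeRank",
    "HouseRank",
]

# Hypothetical stand-in for the SIMD table: same column names, invented values.
table = pd.DataFrame({name: [1.0, 2.0, 3.0] for name in subranks})
table["DataZone"] = ["A", "B", "C"]  # invented identifiers

# Catch typos in the list before they surface as a KeyError later on.
missing = [name for name in subranks if name not in table.columns]
print(missing)

# Keep only the clustering attributes, in the order given by the list.
attributes = table[subranks]
print(attributes.shape)
```

With the real data, `table` would be replaced by `simd`.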

You can quickly familiarise yourself with those variables by plotting a few maps like the one below to build your intuition about what is going to happen.

simd[["IncRankv2", "geometry"]].explore("IncRankv2", tiles="CartoDB Positron", tooltip=False)