{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clustering and regionalisation\n",
"\n",
"This session is all about finding groups of similar observations in data\n",
"using clustering techniques.\n",
"\n",
"Many questions and topics are complex phenomena that involve several\n",
"dimensions and are hard to summarise into a single variable. In\n",
"statistical terms, you call this family of problems *multivariate*, as\n",
"opposed to *univariate* cases where only a single variable is considered\n",
"in the analysis. Clustering tackles this kind of questions by reducing\n",
"their dimensionality -the number of relevant variables the analyst needs\n",
"to look at - and converting it into a more intuitive set of classes that\n",
"even non-technical audiences can look at and make sense of. For this\n",
"reason, it is widely used in applied contexts such as policymaking or\n",
"marketing. In addition, since these methods do not require many\n",
"preliminary assumptions about the structure of the data, it is a\n",
"commonly used exploratory tool, as it can quickly give clues about the\n",
"shape, form and content of a dataset.\n",
"\n",
"The basic idea of statistical clustering is to summarise the information\n",
"contained in several variables by creating a relatively small number of\n",
"categories. Each observation in the dataset is then assigned to one, and\n",
"only one, category depending on its values for the variables originally\n",
"considered in the classification. If done correctly, the exercise\n",
"reduces the complexity of a multi-dimensional problem while retaining\n",
"all the meaningful information contained in the original dataset. This\n",
"is because once classified, the analyst only needs to look at in which\n",
"category every observation falls into, instead of considering the\n",
"multiple values associated with each of the variables and trying to\n",
"figure out how to put them together in a coherent sense. When the\n",
"clustering is performed on observations that represent areas, the\n",
"technique is often called geodemographic analysis.\n",
"\n",
"Although there exist many techniques to statistically group observations\n",
"in a dataset, all of them are based on the premise of using a set of\n",
"attributes to define classes or categories of observations that are\n",
"similar *within* each of them, but differ *between* groups. How\n",
"similarity within groups and dissimilarity between them is defined and\n",
"how the classification algorithm is operationalised is what makes\n",
"techniques differ and also what makes each of them particularly well\n",
"suited for specific problems or types of data.\n",
"\n",
"In the case of analysing spatial data, there is a subset of methods that\n",
"are of particular interest for many common cases in Spatial Data\n",
"Science. These are the so-called *regionalisation* techniques.\n",
"Regionalisation methods can also take many forms and faces but, at their\n",
"core, they all involve statistical clustering of observations with the\n",
"additional constraint that observations need to be geographical\n",
"neighbours to be in the same category. Because of this, rather than\n",
"category, you will use the term *area* for each observation and *region*\n",
"for each category, hence regionalisation, the construction of regions\n",
"from smaller areas.\n",
"\n",
"The Python package you will use for clustering today is called\n",
"`scikit-learn` and can be imported as `sklearn`."
],
"id": "82a63b22-af46-48cd-ba8f-338321e0653b"
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import geopandas as gpd\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"from libpysal import graph\n",
"from sklearn import cluster"
],
"id": "d4f98be2"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attribute-based clustering\n",
"\n",
"In this session, you will be working with another dataset you should\n",
"already be familiar with - the Scottish Index of Multiple Deprivation.\n",
"This time, you will focus only on the area of Edinburgh prepared for\n",
"this course.\n",
"\n",
"### Scottish Index of Multiple Deprivation\n",
"\n",
"As always, the table can be read from the site:"
],
"id": "e14c71a1-69d8-4e96-8ae0-b127cc6b504e"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"simd = gpd.read_file(\n",
" \"https://martinfleischmann.net/sds/clustering/data/edinburgh_simd_2020.gpkg\"\n",
")"
],
"id": "b72a69d1"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Alternative**\n",
">\n",
"> Instead of reading the file directly off the web, it is possible to\n",
"> download it manually, store it on your computer, and read it locally.\n",
"> To do that, you can follow these steps:\n",
">\n",
"> 1. Download the file by right-clicking on [this\n",
"> link](https://martinfleischmann.net/sds/clustering/data/edinburgh_simd_2020.gpkg)\n",
"> and saving the file\n",
"> 2. Place the file in the same folder as the notebook where you intend\n",
"> to read it\n",
"> 3. Replace the code in the cell above with:\n",
">\n",
"> ``` python\n",
"> simd = gpd.read_file(\n",
"> \"edinburgh_simd_2020.gpkg\",\n",
"> )\n",
"> ```\n",
"\n",
"Inspect the structure of the table:"
],
"id": "23be92b6-81a4-4b98-a269-e00e8d9a3df3"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"RangeIndex: 597 entries, 0 to 596\n",
"Data columns (total 52 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 DataZone 597 non-null object \n",
" 1 DZName 597 non-null object \n",
" 2 LAName 597 non-null object \n",
" 3 SAPE2017 597 non-null int64 \n",
" 4 WAPE2017 597 non-null int64 \n",
" 5 Rankv2 597 non-null int64 \n",
" 6 Quintilev2 597 non-null int64 \n",
" 7 Decilev2 597 non-null int64 \n",
" 8 Vigintilv2 597 non-null int64 \n",
" 9 Percentv2 597 non-null int64 \n",
" 10 IncRate 597 non-null object \n",
" 11 IncNumDep 597 non-null int64 \n",
" 12 IncRankv2 597 non-null float64 \n",
" 13 EmpRate 597 non-null object \n",
" 14 EmpNumDep 597 non-null int64 \n",
" 15 EmpRank 597 non-null float64 \n",
" 16 HlthCIF 597 non-null int64 \n",
" 17 HlthAlcSR 597 non-null int64 \n",
" 18 HlthDrugSR 597 non-null int64 \n",
" 19 HlthSMR 597 non-null int64 \n",
" 20 HlthDprsPc 597 non-null object \n",
" 21 HlthLBWTPc 597 non-null object \n",
" 22 HlthEmergS 597 non-null int64 \n",
" 23 HlthRank 597 non-null int64 \n",
" 24 EduAttend 597 non-null object \n",
" 25 EduAttain 597 non-null float64 \n",
" 26 EduNoQuals 597 non-null int64 \n",
" 27 EduPartici 597 non-null object \n",
" 28 EduUniver 597 non-null object \n",
" 29 EduRank 597 non-null int64 \n",
" 30 GAccPetrol 597 non-null float64 \n",
" 31 GAccDTGP 597 non-null float64 \n",
" 32 GAccDTPost 597 non-null float64 \n",
" 33 GAccDTPsch 597 non-null float64 \n",
" 34 GAccDTSsch 597 non-null float64 \n",
" 35 GAccDTRet 597 non-null float64 \n",
" 36 GAccPTGP 597 non-null float64 \n",
" 37 GAccPTPost 597 non-null float64 \n",
" 38 GAccPTRet 597 non-null float64 \n",
" 39 GAccBrdbnd 597 non-null object \n",
" 40 GAccRank 597 non-null int64 \n",
" 41 CrimeCount 597 non-null int64 \n",
" 42 CrimeRate 597 non-null int64 \n",
" 43 CrimeRank 597 non-null float64 \n",
" 44 HouseNumOC 597 non-null int64 \n",
" 45 HouseNumNC 597 non-null int64 \n",
" 46 HouseOCrat 597 non-null object \n",
" 47 HouseNCrat 597 non-null object \n",
" 48 HouseRank 597 non-null float64 \n",
" 49 Shape_Leng 597 non-null float64 \n",
" 50 Shape_Area 597 non-null float64 \n",
" 51 geometry 597 non-null geometry\n",
"dtypes: float64(16), geometry(1), int64(22), object(13)\n",
"memory usage: 242.7+ KB"
]
}
],
"source": [
"simd.info()"
],
"id": "bc950d0a"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before you jump into exploring the data, one additional step that will\n",
"come in handy down the line. Not every variable in the table is an\n",
"attribute that you will want for the clustering. In particular, you are\n",
"interested in sub-ranks based on individual SIMD domains, so you will\n",
"only consider those. Hence, first manually write them so they are easier\n",
"to subset:"
],
"id": "dcd79284-e623-4a4e-94ed-cfc63b84edb2"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"subranks = [\n",
" \"IncRankv2\",\n",
" \"EmpRank\",\n",
" \"HlthRank\",\n",
" \"EduRank\",\n",
" \"GAccRank\",\n",
" \"CrimeRank\",\n",
" \"HouseRank\"\n",
"]"
],
"id": "40771015"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can quickly familiarise yourself with those variables by plotting a\n",
"few maps like the one below to build your intuition about what is going\n",
"to happen."
],
"id": "845cb7c4-7cc4-4730-a042-b327f9e3a322"
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": [
"

Make this Notebook Trusted to load map: File -> Trust Notebook