You will work with the demographic characteristics of Chicago in 1918, linked to influenza mortality during the pandemic of that year. The data come from the research paper by Grantz et al. (2016) that you used before in the Data wrangling session, but this time they include geometries.
import geopandas as gpd

chicago = gpd.read_file("https://martinfleischmann.net/sds/clustering/data/chicago_influenza_1918.geojson")
chicago.explore()
Before working with clustering, do you remember that note about data standardisation? The demographic variables in the table do not share the same scale, so you need to standardise them before using K-Means.
I’ll let you check the Data section of the chapter Clustering and Regionalization of Geographic Data Science with Python by Rey, Arribas-Bel, and Wolf (2023) by yourself for an explanation of the scaling applied here. In short, you take the variables and scale them using the robust scaler, which puts all of them on the same scale without letting outliers affect the scaling.
Scale the selected columns and assign them back to the table.
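A minimal sketch of this step, assuming the robust scaler meant here is scikit-learn's `robust_scale`, which subtracts each column's median and divides by its interquartile range. The `toy` table and the `demographics` list are made up for illustration and stand in for `chicago` and its demographic columns.

```python
import pandas as pd
from sklearn.preprocessing import robust_scale

# Toy stand-in for the Chicago table; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "illit": [0.5, 1.0, 2.0, 3.5, 40.0],  # the last value is an outlier
        "unemployed_pct": [2.0, 4.0, 6.0, 8.0, 10.0],
        "influenza": [29, 30, 12, 7, 55],  # not a clustering input, left unscaled
    }
)

# Hypothetical selection of columns to scale.
demographics = ["illit", "unemployed_pct"]

# robust_scale centres each column on its median and divides by its
# interquartile range, so a single outlier does not distort the scaling.
toy[demographics] = robust_scale(toy[demographics])

# After scaling, the median of every scaled column is 0.
print(toy[demographics].median())
```

Applied to `chicago` with the list of demographic column names, the same call gives a table like the one that follows.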
| | geography_code | gross_acres | illit | unemployed_pct | ho_pct | agecat1 | agecat2 | agecat3 | agecat4 | agecat5 | agecat6 | agecat7 | influenza | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G17003100388 | -0.194708 | 0.465558 | 0.121762 | -0.577117 | -0.145658 | -0.430586 | -0.320059 | -0.367526 | -0.419623 | -0.304649 | -0.397324 | 29 | POLYGON ((358405.051 570342.347, 358371.811 57... |
| 1 | G17003100197 | -0.201697 | 2.413302 | -0.701498 | -0.629427 | 0.627451 | 0.434924 | 0.408555 | 0.358749 | -0.009057 | -0.143917 | -0.205352 | 30 | POLYGON ((356903.353 580393.561, 356895.319 58... |
If you check the values now, you will see that they are all distributed around 0.
Once this is ready, get to work with the following tasks:
- Pick a number of clusters
- Run K-Means for that number of clusters
- Plot the different clusters on a map
- Analyse the results:
  - What do you find?
  - What are the main characteristics of each cluster?
  - How are clusters distributed geographically?
  - Can you identify some groups concentrated on particular areas?
- Create spatially lagged K-Means.
  - How did the result change?
- Develop a regionalisation using agglomerative clustering
  - How did the result change compared to the previous two?
- Generate a geography that contains only the boundaries of each region and visualise it.
- Rinse and repeat with several combinations of variables and number of clusters
- Pick your best. Why have you selected it? What does it show? What are the main groups of areas based on their demographic characteristics?
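To illustrate the difference between the first and the third task without giving away results for Chicago, here is a minimal sketch on synthetic data. It assumes scikit-learn's `KMeans` and `AgglomerativeClustering`. The 6 x 6 grid of "areas", the two random variables, the choice of four clusters, and the helper `is_contiguous` are all hypothetical. On the real table, the connectivity matrix would have to be derived from the polygons rather than from a regular grid.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.image import grid_to_graph

# Synthetic stand-in: a 6 x 6 grid of areas with two already-scaled variables.
rng = np.random.default_rng(0)
n_rows, n_cols = 6, 6
X = rng.normal(size=(n_rows * n_cols, 2))

# K-Means groups areas on attribute similarity only and ignores location.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

# Sparse matrix linking every grid cell to the cells it shares an edge with.
connectivity = grid_to_graph(n_rows, n_cols)

# Agglomerative clustering with a connectivity constraint only merges
# neighbouring groups, so every resulting region is spatially contiguous.
agg = AgglomerativeClustering(n_clusters=4, linkage="ward", connectivity=connectivity)
region_labels = agg.fit_predict(X)


def is_contiguous(labels, graph):
    """Return True if every label forms a single connected piece of the graph."""
    graph = graph.tocsr()
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        subgraph = graph[members][:, members]
        n_components, _ = connected_components(subgraph, directed=False)
        if n_components != 1:
            return False
    return True


print("K-Means clusters contiguous:", is_contiguous(kmeans_labels, connectivity))
print("Regions contiguous:", is_contiguous(region_labels, connectivity))
```

On the Chicago table, the labels could be stored in a new column (the name `cluster` is hypothetical), mapped with `chicago.explore(column="cluster", categorical=True)`, and merged into one geometry per region with `chicago.dissolve(by="cluster")`.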
Acknowledgements
This section is derived from A Course on Geographic Data Science by Arribas-Bel (2019), licensed under CC-BY-SA 4.0.
References
Arribas-Bel, Dani. 2019. “A Course on Geographic Data Science.” The Journal of Open Source Education 2 (14). https://doi.org/10.21105/jose.00042.
Grantz, Kyra H, Madhura S Rane, Henrik Salje, Gregory E Glass, Stephen E Schachterle, and Derek AT Cummings. 2016. “Disparities in Influenza Mortality and Transmission Related to Sociodemographic Factors Within Chicago in the Pandemic of 1918.” Proceedings of the National Academy of Sciences 113 (48): 13839–44.
Rey, Sergio, Dani Arribas-Bel, and Levi John Wolf. 2023. Geographic Data Science with Python. Chapman & Hall/CRC Texts in Statistical Science. London, England: Taylor & Francis.