Find me a cluster

Now, it is up to you to find some clusters.

Chicago 1918

You will work with the demographic characteristics of Chicago in 1918, linked to the influenza mortality recorded during the pandemic of that year. The data come from the research paper by Grantz et al. (2016), which you used before in the Data wrangling session, but this time they include geometries.

import geopandas as gpd

chicago = gpd.read_file(
    "https://martinfleischmann.net/sds/clustering/data/chicago_influenza_1918.geojson"
)
chicago.explore()

Before working with clustering, do you remember that note about data standardisation? The demographic variables in the table are not using the same scale, so you need to do something about it before using K-means.

I’ll let you check the Data section of the chapter Clustering and Regionalization in Geographic Data Science with Python by Rey, Arribas-Bel, and Wolf (2023) by yourself for an explanation of what is happening below. In short, you take the variables and scale them using the robust scaler, which puts all of them on the same scale while keeping the scaling itself unaffected by outliers.

# Import the preprocessing module of scikit-learn.
from sklearn import preprocessing

# Specify a list of demographic variables.
demographics = [
    "gross_acres",
    "illit",
    "unemployed_pct",
    "ho_pct",
    "agecat1",
    "agecat2",
    "agecat3",
    "agecat4",
    "agecat5",
    "agecat6",
    "agecat7",
]
# Scale the selected columns and assign them back to the table.
chicago[demographics] = preprocessing.robust_scale(chicago[demographics])
chicago.head(2)
   geography_code  gross_acres     illit  unemployed_pct     ho_pct    agecat1    agecat2    agecat3    agecat4    agecat5    agecat6    agecat7  influenza  geometry
0    G17003100388    -0.194708  0.465558        0.121762  -0.577117  -0.145658  -0.430586  -0.320059  -0.367526  -0.419623  -0.304649  -0.397324         29  POLYGON ((358405.051 570342.347, 358371.811 57...
1    G17003100197    -0.201697  2.413302       -0.701498  -0.629427   0.627451   0.434924   0.408555   0.358749  -0.009057  -0.143917  -0.205352         30  POLYGON ((356903.353 580393.561, 356895.319 58...

If you check the values now, you will see that they are all centred around 0, because the robust scaler subtracts the median of each column and divides by its interquartile range.
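To make the effect of the robust scaler concrete, here is a minimal sketch on a made-up column. The numbers are illustrative and are not taken from the Chicago data. It assumes the default settings of robust_scale (centre on the median, scale by the interquartile range) and compares the result with the same computation done by hand in NumPy.

```python
import numpy as np
from sklearn import preprocessing

# Made-up single column with one obvious outlier (not the Chicago data).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# scikit-learn's robust scaler with its default settings.
scaled = preprocessing.robust_scale(values)

# The same computation by hand: subtract the median, divide by the IQR.
median = np.median(values, axis=0)
q25, q75 = np.percentile(values, [25, 75], axis=0)
manual = (values - median) / (q75 - q25)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The outlier remains far from the rest, but it has no influence on the median or the interquartile range, so the other values keep a sensible spread around 0.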

Once this is ready, get to work with the following tasks:

  1. Pick a number of clusters
  2. Run K-Means for that number of clusters
  3. Plot the different clusters on a map
  4. Analyse the results:
    • What do you find?
    • What are the main characteristics of each cluster?
    • How are clusters distributed geographically?
    • Can you identify some groups concentrated on particular areas?
  5. Create spatially lagged K-Means.
    • How did the result change?
  6. Develop a regionalisation using agglomerative clustering
    • How did the result change compared to the previous two?
  7. Generate a geography that contains only the boundaries of each region and visualise it.
  • Rinse and repeat with several combinations of variables and number of clusters
  • Pick your best. Why have you selected it? What does it show? What are the main groups of areas based on their demographic characteristics?
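If you are unsure where to start, the sketch below walks through the mechanics of the K-means and agglomerative steps on a synthetic table rather than on the Chicago data. Everything in it is an assumption made for illustration: the 10 x 10 grid of areas, the column names var_a and var_b, the choice of two clusters, and the use of grid_to_graph from scikit-learn as a stand-in for a contiguity matrix that you would normally derive from the polygon geometries.

```python
import numpy as np
import pandas as pd
from sklearn import cluster
from sklearn.feature_extraction.image import grid_to_graph

# Synthetic stand-in for scaled variables on a 10 x 10 grid of areas
# (NOT the Chicago data): low values on one half, high values on the other.
side = 10
rng = np.random.default_rng(0)
column = np.tile(np.arange(side), side)
base = np.where(column < side // 2, -2.0, 2.0)
variables = ["var_a", "var_b"]  # hypothetical column names
table = pd.DataFrame(
    {name: base + rng.normal(scale=0.3, size=side * side) for name in variables}
)

# Pick a number of clusters and run K-means on the selected variables.
kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=42)
table["kmeans"] = kmeans.fit_predict(table[variables])

# Summarise each cluster by the mean of its variables.
profile = table.groupby("kmeans")[variables].mean()

# Agglomerative clustering restricted to neighbouring cells.
# grid_to_graph is a stand-in for a contiguity matrix built from polygons.
connectivity = grid_to_graph(side, side)
agg = cluster.AgglomerativeClustering(n_clusters=2, connectivity=connectivity)
table["region"] = agg.fit_predict(table[variables])

print(profile)
print(table["region"].value_counts())
```

On the real table, the same calls would presumably take chicago[demographics] as input, with a contiguity matrix derived from the geometries in place of grid_to_graph. The resulting labels can then be mapped with the explore method you used above, and areas sharing a label can be merged with the dissolve method of the GeoDataFrame.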

Acknowledgements

This section is derived from A Course on Geographic Data Science by Arribas-Bel (2019), licensed under CC-BY-SA 4.0.

References

Arribas-Bel, Dani. 2019. “A Course on Geographic Data Science.” The Journal of Open Source Education 2 (14). https://doi.org/10.21105/jose.00042.
Grantz, Kyra H, Madhura S Rane, Henrik Salje, Gregory E Glass, Stephen E Schachterle, and Derek AT Cummings. 2016. “Disparities in Influenza Mortality and Transmission Related to Sociodemographic Factors Within Chicago in the Pandemic of 1918.” Proceedings of the National Academy of Sciences 113 (48): 13839–44.
Rey, Sergio, Dani Arribas-Bel, and Levi John Wolf. 2023. Geographic Data Science with Python. Chapman & Hall/CRC Texts in Statistical Science. London, England: Taylor & Francis.