xyzservices: a unified source of XYZ tile providers in Python

A Python ecosystem offers numerous tools for the visualisation of data on a map. A lot of them depend on XYZ tiles, providing a base map layer, either from OpenStreetMap, satellite or other sources. The issue is that each package that offers XYZ support manages its own list of supported providers.

We have built xyzservices package to support any Python library making use of XYZ tiles. I’ll try to explain the rationale why we did that, without going into the details of the package. If you want those details, check its documentation.

The situation

Let me quickly look at a few popular packages and their approach to tile management – contextily, folium, ipyleaflet and holoviews.

contextily

contextily brings contextual base maps to static geopandas plots. It comes with a dedicated contextily.providers module, which contains a hard-coded list of providers scraped from the list used by leaflet (as of version 1.1.0).

folium

folium is a lightweight interface to a JavaScript leaflet library. It providers built-in support for 6 types of tiles and allows passing any XYZ URL and its attribution to a map. It means that it mostly relies on external sources of tile providers.

ipyleaflet

ipyleaflet brings leaflet support to Jupyter notebooks and comes with a bit more options than folium. It has a very similar approach as contextily does – it has a hard-coded list of about 37 providers in its basemaps module.

holoviews

holoviews provides a Python interface to the Bokeh library and its list of supported base maps is also hard-coded.

A similar situation is in other packages like geemap or leafmap.

Each package has to maintain the list of base maps, ensure that they all work, respond to users requiring more, update links… That is a lot of duplicated maintenance burden. We think it is avoidable.

The vision

All XYZ tile providers have a single lightweight home and a clean API supporting the rest of the ecosystem. All the other packages use the same resource, one which is tested and expanded by a single group of maintainers.

We have designed xyzservices to be exactly that. It is a Python package that has no dependencies and only a single purpose – to collect and process metadata of tile providers.

We envisage a few potential use cases.

The first – packages like contextily and geopandas will directly support xyzservices.TileProvider object when specifying tiles. Nothing else is needed, contextily will fetch the data it needs (final tile URL, an attribution, zoom and extent limits) from the object. In the code form:

import xyzservices.providers as xyz
from contextily import add_basemap

add_basemap(ax, source=xyz.CartoDB.Positron)

The second option is wrapping xyzservices.providers into a custom API providing, for example, an interactive selection of tiles.

The third one is using different parts of a TileProvider individually when passing the information. This option can be currently used, for example, with folium:

import folium
import xyzservices.providers as xyz

tiles = xyz.CartoDB.Positron

folium.Map(
    location=[53.4108, -2.9358],
    tiles=tiles.build_url(),
    attr=tiles.attribution,
)

The last one is the most versatile. The xyzservices comes with a JSON file used as a storage of all the metadata. The JSON is automatically installed to share/xyzservices/providers.json where it is available for any other package without depending on xyzservices directly.

We hope to cooperate with maintainers of other existing packages and move most of the functionality around XYZ tiles that can be reused to xyzservices. We think that it will:

  1. Remove the burden from individual developers. Any package will just implement an interface to Python or JSON API of xyzservices.
  2. Expand the list easy-to-use tiles for users. xyzservices currently has over 200 providers, all of which should be available for users across the ecosystem, without the need to individually hard-code them in every package.

While this discussion started in May 2020 (thanks @darribas!), the initial version of the package is out now and installable from PyPI and conda-forge. We hope to have as many developers as possible on board to allow for the consolidation of the ecosystem in the future.

Evolution of Urban Patterns: Urban Morphology as an Open Reproducible Data Science

We have a new paper published in the Geographical Analysis on the opportunities current developments in geographic data science within the Python ecosystem offer to urban morphology. To sum up – there’s a lot to play with and if you’re interested in the quantification of urban form, there’s no better choice for you at the moment.

Urban morphology (study of urban form) is historically a qualitative discipline that only recently expands into more data science-ish waters. We believe that there’s a lot of potential in this direction and illustrate it on the case study looking into the evolution of urban patterns, i.e. how different aspects of urban form has changed over time.

The paper is open access (yay!) (links to the online version and a PDF), the research is fully reproducible (even in your browser thanks to amazing MyBinder!) with all code on GitHub.

Short summary

We have tried to map all the specialised open-source tools urban morphologists can use these days, which resulted in this nice table. The main conclusion? Most of them are plug-ins for QGIS or ArcGIS, hence depend on pointing and clicking. A tricky thing to reproduce properly.

Table 1 from the paper

We prefer code-based science, so we took the Python ecosystem and put it in the test. We have designed a fully reproducible workflow based on GeoPandas, OSMnx, PySAL and momepy to sample and study 42 places around the world, developed at very different times.

We measured a bunch of morphometric characters (indicators for individual aspects of form) and looked at their change in time. And there is a lot to look at. There are significant differences not only in scale (on the figure below) but in other aspects as well. One interesting observation – it seems that we have forgotten how to make a properly connected, dense grid.

As we said in the paper, “Switching to a code-based analysis may be associated with a steep learning curve. However, not everyone needs to reach the developer level as the data science ecosystem aims to provide a middle ground user level. That is a bit like Lego—the researcher learns how to put pieces together and then find pieces they need to build a house.

We think that moving from QGIS to Python (or R), as daunting as it may seem to some, is worth it. It helps us overcome the reproducibility crisis science is going through, the crisis caused, among other things, by relying on pointing and clicking too much.

The open research paradigm, based on open platforms and transparent community-led governance, has the potential to democratize science and remove unnecessary friction caused by the lack of cooperation between research groups while bringing additional transparency to research methods and outputs.

page 18

Abstract

The recent growth of geographic data science (GDS) fuelled by increasingly available open data and open source tools has influenced urban sciences across a multitude of fields. Yet there is limited application in urban morphology—a science of urban form. Although quantitative approaches to morphological research are finding momentum, existing tools for such analyses have limited scope and are predominantly implemented as plug-ins for standalone geographic information system software. This inherently restricts transparency and reproducibility of research. Simultaneously, the Python ecosystem for GDS is maturing to the point of fully supporting highly specialized morphological analysis. In this paper, we use the open source Python ecosystem in a workflow to illustrate its capabilities in a case study assessing the evolution of urban patterns over six historical periods on a sample of 42 locations. Results show a trajectory of change in the scale and structure of urban form from pre-industrial development to contemporary neighborhoods, with a peak of highest deviation during the post-World War II era of modernism, confirming previous findings. The wholly reproducible method is encapsulated in computational notebooks, illustrating how modern GDS can be applied to urban morphology research to promote open, collaborative, and transparent science, independent of proprietary or otherwise limited software.

Fleischmann, M., Feliciotti, A. and Kerr, W. (2021), Evolution of Urban Patterns: Urban Morphology as an Open Reproducible Data Science. Geogr Anal. https://doi.org/10.1111/gean.12302

Talk at ISUF 2021: Classifying urban form at a national scale

I had a chance to present our ongoing work on the classification of the (built) environment in Great Britain during the International Seminar on Urban Form 2021, which was held virtually in Glasgow. I was presenting the classification of urban form, one component of Spatial Signatures we’re developing as part of the Urban Grammar AI project together with Dani Arribas-Bel. The video of the presentation is attached below, as well as the abstract.

Classifying urban form at a national scale: The case of Great Britain

There is a pressing need to monitor urban form and function in ways that can feed into better planning and management of cities. Both academic and policymaking communities have identified the need for more spatially and temporally detailed, consistent, and scalable evidence on the nature and evolution of urban form. Despite impressive progress, the literature can achieve only two of those characteristics simultaneously. Detailed and consistent studies do not scale well because they tend to rely on small-scale, ad-hoc datasets that offer limited coverage. Until recently, consistent and scalable research has only been possible by using simplified measures that inevitably miss much of the nuance and richness behind the concept of urban form.

This paper outlines the notion of “spatial signatures”, a characterisation of space based on form and function, and will specifically focus on its form component. Whilst spatial signature sits between the purely morphological and purely functional description of the built environment, its form-based component reflects the morphometric definition of urban tissue, the distinct structurally homogenous area of a settlement. The proposed method employs concepts of “enclosures” and “enclosed tessellation” to derive indivisible hierarchical geographies based on physical boundaries (streets, railway, rivers, coastline) and building footprints to delineate such tissues in the built fabric. Each unit is then characterised by a comprehensive set of data-driven morphometric characters feeding into an explicitly spatial contextual layer, which is used as an input of cluster analysis.

The classification based on spatial signatures is applied to the entirety of Great Britain on a fine grain scale of individual tessellation cells and released as a fully reproducible open data product. The results provide a unique input for local authorities to drive planning and decision-making and for the wider research community as data input.

Video

The Urban Atlas: Methodological Foundation of a Morphometric Taxonomy of Urban Form

The Urban Atlas: Methodological Foundation of a Morphometric Taxonomy of Urban Form is the title of my PhD Thesis defended in January 2021 at the University of Strathclyde.

Thanks to Ombretta and Sergio for guiding me along the way!

Abstract

No two cities in the world are alike. Each urban environment is characterised by a unique variety and heterogeneity as a result of its evolution and transformation, reflecting the differences in needs human populations have had over time manifested, in space, by a plethora of urban patterns.

Traditionally, the study of these patterns over time and across space is the domain o urban morphology, a field of research stretching from geography to architecture. Whilst urban morphology has considerably advanced the current understanding of processes of formation, transformation and differentiation of many such patterns, predominantly through qualitative approaches, it has yet to fully take advantage of quantitative approaches and data-driven methods recently made possible by advances in geographic data science and expansion of available mapping products. Although relatively new, these methods hold immense potential in expanding our capacity to identify, characterise and compare urban patterns: these can be rich in terms of information, scalable (applicable to the large scale of extent, regional and national) and replicable, drastically improving the potential of comparative analysis and classification.

Different disciplines with more profound quantitative methods can help in the development of data-driven urban morphology, as now, for the first time, we are in the position where we can rely on a large amount of data on the built environment, unthinkable just a decade ago. This thesis, therefore, aims to link urban morphology and methodologically strong area of quantitative biological systematics, adapting its concepts and methods to the context of built-up fabric. That creates an infrastructure for numerical description of urban form, known as urban morphometrics, and a subsequent classification of urban types.

Conceptually building on the theory of numerical taxonomy, this research progresses the development of urban morphometrics to automate processes of urban form characterisation and classification. Whilst many available methods are characterised by significant limitation in applicability due to difficulties in obtaining necessary data, the proposed method employs only minimal data input – street network and building footprints – and overcomes limitations in the delineation of plots by identifying an alternative spatial unit of analysis, the morphological tessellation, a derivative of Voronoi tessellation partitioning the space based on a composition of building footprints. As tessellation covers the entirety of urban space, its inherent contiguity then constitutes a basis of a relational framework aimed at the comprehensive characterisation of individual elements of urban form and their relationships. Resulting abundant numerical description of all features is further utilised in cluster analysis delineating urban tissue types in an unrestricted urban fabric, shaping an input for hierarchical classification of urban form – a taxonomy.

The proposed method is applied to the historical heterogeneous city of Prague, Czechia and validated using supplementary non-morphological data reflecting the variation of built-up patterns. Furthermore, its cross-cultural and morphological validity and expandability are tested by assessment of Amsterdam, Netherlands and a combination of both cases into a unified taxonomy of their urban patterns. The research is accompanied by a bespoke open-source software momepy for quantitative assessment of urban form, providing infrastructure for replicability and further community-led development.

The work builds a basis for morphometric research of urban environment, providing operational tools and frameworks for its application and further development, eventually leading to a coherent taxonomy of urban form.

Keywords: urban morphometrics, taxonomy, classification, measuring, urban form, quantitative analysis, urban morphology, software


PDF of the thesis is available from my Dropbox before it will be visible in the University repository. The code repository will be shared once I’ll manage to clean it :).

Clustergam: visualisation of cluster analysis

In this post, I introduce a new Python package to generate clustergrams from clustering solutions. The library has been developed as part of the Urban Grammar research project, and it is compatible with scikit-learn and GPU-enabled libraries such as cuML or cuDF within RAPIDS.AI.


When we want to do some cluster analysis to identify groups in our data, we often use algorithms like K-Means, which require the specification of a number of clusters. But the issue is that we usually don’t know how many clusters there are.

There are many methods on how to determine the correct number, like silhouettes or elbow plot, to name a few. But they usually don’t give much insight into what is happening between different options, so the numbers are a bit abstract.

Matthias Schonlau proposed another approach – a clustergram. Clustergram is a two-dimensional plot capturing the flows of observations between classes as you add more clusters. It tells you how your data reshuffles and how good your splits are. Tal Galili later implemented clustergram for K-Means in R. And I have used Tal’s implementation, ported it to Python and created clustergram – a Python package to make clustergrams.

clustergram currently supports K-Means and using scikit-learn (inlcuding Mini-Batch implementation) and RAPIDS.AI cuML (if you have a CUDA-enabled GPU), Gaussian Mixture Model (scikit-learn only) and hierarchical clustering based on scipy.hierarchy. Alternatively, we can create clustergram based on labels and data derived from alternative custom clustering algorithms. It provides a sklearn-like API and plots clustergram using matplotlib, which gives it a wide range of styling options to match your publication style.

Install

You can install clustergram from conda or pip:

conda install clustergram -c conda-forge

or

pip install clustergram

In any case, you still need to install your selected backend (scikit-learn and scipy or cuML).

from clustergram import Clustergram
import urbangrammar_graphics as ugg
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

sns.set(style='whitegrid')

Let us look at some examples to understand how clustergram looks and what to do with it.

Iris flower data set

The first example which we try to analyse using clustergram is the famous Iris flower data set. It contains data on three species of Iris flowers measuring sepal width and length and petal width and length. We can start with some exploration:

iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species", palette=ugg.COLORS[1:4])
g.fig.suptitle("Iris flowers", y=1.01)

It seems that setosa is a relatively well-defined group, while the difference between versicolor and virginica is smaller as they partially overlap (or entirely in the case of sepal width).

Okay, so we know how the data looks. Now we can check how does clustergram look. Remember that we know that there are three clusters, and we should ideally be able to recognise this from clustergram. I am saying ideally because even though there are known labels, it does not mean that our data or clustering method are able to distinguish between those classes.

Let’s start with K-Means clustering. To get a stable result, we can run a clustergram with 1000 initialisations.

data = scale(iris.drop(columns=['species']))

cgram = Clustergram(range(1, 10), n_init=1000)
cgram.fit(data)

ax = cgram.plot(
    figsize=(10, 8),
    line_style=dict(color=ugg.COLORS[1]),
    cluster_style={"color": ugg.COLORS[2]},
)
ax.yaxis.grid(False)
sns.despine(offset=10)
ax.set_title('K-Means (scikit-learn)')

On the x axis, we can see the number of clusters. Points represent a centre of each cluster (by default) weighted by the first principal component (that helps with the diagram’s readability). The lines connecting points and their thickness represent observations moving between clusters. Therefore, we can read when new clusters are formed as a split of a single existing class and when they are formed based on observations from two clusters.

We’re looking for the separation, i.e., did an additional cluster bring any meaningful split? The step from one cluster to two is a big one – nice and clear separation. From two to three, we also have quite a nice split in the top branch. But from 3 to 4, there is no visible difference because the new fourth cluster is almost the same as the existing bottom branch. Although it is now separated into two, this split does not give us much information. Therefore, we could conclude that the ideal number of clusters for Iris data is three.

We can also check some additional information, like a silhouette score or Calinski-Harabazs score.

fig, axs = plt.subplots(2, figsize=(10, 10), sharex=True)
cgram.silhouette_score().plot(
    xlabel="Number of clusters (k)",
    ylabel="Silhouette score",
    color=ugg.COLORS[1],
    ax=axs[0],
)
cgram.calinski_harabasz_score().plot(
    xlabel="Number of clusters (k)",
    ylabel="Calinski-Harabasz score",
    color=ugg.COLORS[1],
    ax=axs[1],
)
sns.despine(offset=10)

These plots would suggest 3-4 clusters, similarly to clustergram, but they are not very conclusive.

Palmer penguins data set

Now let’s try different data, one where clusters are a bit more complicated to assess. Palmer penguins contain similar data as Iris example, but it measures several attributes of three species of penguins.

penguins = sns.load_dataset("penguins")

g = sns.pairplot(penguins, hue="species", palette=ugg.COLORS[3:])
g.fig.suptitle("Palmer penguins", y=1.01)

Looking at the situation, we see that the overlap between species is much higher than before. It will likely be much more complicated to identify them. Again, we know that there are three clusters, but that does not mean that data has the power to distinguish between them. In this case, it may be especially tricky to differentiate between Adelie and Chinstrap penguins.

data = scale(penguins.drop(columns=['species', 'island', 'sex']).dropna())

cgram = Clustergram(range(1, 10), n_init=1000)
cgram.fit(data)

ax = cgram.plot(
    figsize=(10, 8),
    line_style=dict(color=ugg.COLORS[1]),
    cluster_style={"color": ugg.COLORS[2]},
)
ax.yaxis.grid(False)
sns.despine(offset=10)
ax.set_title("K-Means (scikit-learn)")

We’re looking for separations, and this clustergram shows plenty. It is actually quite complicated to determine the optimal number of clusters. However, since we know what happens between different options, we can play with that. If we have a reason to be conservative, we can go with 4 clusters (I know, it is already more than the initial species). But further splits are also reasonable, which indicates that even higher granularity may provide useful insight, that there might be meaningful groups.

Can we say it is three? Since we know it should be three… Well, not really. The difference between the split from 2 – 3 and that from 3 – 4 is slight. However, the culprit here is K-Means, not clustergram. It just simply cannot correctly cluster these data due to the overlaps and the overall structure.

Let’s have a look at how the Gaussian Mixture Model does.

cgram = Clustergram(range(1, 10), n_init=100, method="gmm")
cgram.fit(data)

ax = cgram.plot(
    figsize=(10, 8),
    line_style=dict(color=ugg.COLORS[1]),
    cluster_style={"color": ugg.COLORS[2]},
)
ax.yaxis.grid(False)
sns.despine(offset=10)
ax.set_title("Gaussian Mixture Model (scikit-learn)")

The result is very similar, though the difference between the third and fourth split is more pronounced. Even here, I would probably go with a four cluster solution.

A situation like this happens very often. The ideal case does not exist. We ultimately need to make a decision on the optimal number of clusters. Clustergam gives us additional insights into what happens between different options, how it splits. We can tell that the four-cluster option in Iris data is not helpful. We can tell that Palmer penguins may be tricky to cluster using K-Means, that there is no decisive right solution. Clustergram does not give an easy answer, but it gives us additional insight, and it is upon us how we interpret it.

You can install clustergram using conda install clustergram -c conda-forge or pip install clustergram. In any case, you will still need to install a clustering backend, either scikit-learn or cuML. The documentation is available at clustergram.readthedocs.io, and the source code is on github.com/martinfleis/clustergram, released under MIT license.

If you want to play with the examples used in this article, the Jupyter notebook is on GitHub. You can also run it in an interactive binder environment in your browser.

For more information, check Tal Galili’s blog post and original papers by Matthias Schonlau.

Give it a go!

Spatial Analytics + Data Talk

On March 30, 2021, I had a chance to deliver a talk as part of the Spatial Analytics + Data Seminar Series organised by the University of Newcastle (Rachel Franklin), the University of Bristol (Levi Wolf) and the Alan Turing Institute. The recording of the event is now available on YouTube.

Spatial Signatures: Dynamic classification of the built environment

This talk introduces the notion of “spatial signatures”, a characterisation of space based on form and function. We know little about how cities are organised over space influences social, economic and environmental outcomes, in part because it is hard to measure. It presents the first stage of the Urban Grammar AI research project, which develops a conceptual framework to characterise urban structure through the notions of spatial signatures and urban grammar and will deploy it to generate open data products and insight about the evolution of cities.

The slides are available online at https://urbangrammarai.github.io/talks/202103_sad/.

3 – 10 = 65529. What?

Yes, the formula above is correct. Well, it depends on what we mean by correct.

NDVI does not make sense

Imagine the following situation. We have fetched a cloud-free mosaic of Sentinel 2 satellite data and want to measure NDVI (Normalised difference vegetation index), which uses red and near-infrared bands within this simple formula.

    \[\mathrm{NDVI}=\frac{(\mathrm{NIR}-\mathrm{Red})}{(\mathrm{NIR}+\mathrm{Red})}\]

The results are normalised, which in this case means that they lie between -1 and 1. Always.

We open out raster data using xarray and have all 4 bands in a single xarray.DataSet. The code to measure NDVI is then simple.

>>> red = data.sel(band=1)  # select red band
>>> nir = data.sel(band=4)  # select near-infrared band

>>> ndvi = (nir - red) / (nir + red)  # compute NDVI

And, a surprise! Our results are between 0 and 170. That is certainly not correct. What has happened?

16-bit unsigned integer

The data coming from Sentinel 2 are stored as 16-bit unsigned integer (uint16). That means that the value the array can hold can be anything between 0 and 216 – 1 (65,535). Remember that NDVI is between -1 and 1. Does it mean that uint16 cannot represent NDVI values? Yes, precisely.

Look at this toy example to understand what is happening during the computation of NDVI in uint16. Let’s have an array with four numbers and subtract 10 from each of them.

>>> array = numpy.array([1, 3, 6, 9], dtype='uint16')
>>> array - 10

array([65527, 65529, 65532, 65535], dtype=uint16)

Yes, as weird as it is, it is correct. The result should be negative, but we can’t have negative values in uint16. So what happens is that the counter rolls over and subtracts the remaining value from the maximum it can represent (65,535 – x).

It is exactly like a rollover of the odometer. We ran out of values, so they started over. The only difference is that we have 16 binary values encoding each number, not decimal.

Odometer rollover. By Hellbus - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=3089111

The fix is easy. We have to use data type which does not limit you like this, like a 64-bit integer.

>>> array.astype('int64') - 10

array([-9, -7, -4, -1])

Compared to a 64-bit integer, a 16-bit integer is efficient since the resulting file will be much smaller (that is why it is used in the first place) but it can be limiting.

Be aware of your data types, so you don’t make the same mistake we did ;).

The journey of an algorithm from QGIS to GeoPandas

This is a short story of one open-source algorithm and its journey from QGIS to mapclassify, to be used within GeoPandas. I am writing it to illustrate the flow within the open-source community because even though this happens all the time, we normally don’t talk about it. And we should.

The story

Sometimes last year, I asked myself a question. How hard would it be to port topological colouring tool from QGIS to be used with GeoPandas? Considering that this particular QGIS tool is written in Python, it seemed to be an easy task.

For those of you who never used it, the aim of topological colouring is to assign colours to (usually) polygons in such a way, that no two adjacent polygons share the same colour (see the illustration in the embedded tweet below).

The adaptation of the Nyall Dawson’s original algorithm was quite straightforward, the logic of Python algorithms for QGIS and for GeoPandas is the same. So in October, I have asked the others what would be the ideal way of sharing it.

The original license was not compatible with the one we use in GeoPandas and I was not sure if GeoPandas itself is actually the right place for it. So while thinking about it, Nyall himself made the situation easier and offered to relicense the original code.

However, there was no clear consensus what is the best way at that time and the whole idea was set aside, until the end of the year, when I decided to keep it simple and release the script as a tiny Python package. And greedy was born.

In the end, greedy offered the original QGIS algorithm for topological colouring and a few other options on top of that, applicable to all geometry types (unlike QGIS, which is limited to polygons). It was a simple solution to release it like that, but it was not optimal, because the more scattered the ecosystem is, the more complicated is to navigate in it and people just easily miss things.

We could have ended the story here, but then Levi Wolf came with an idea.

After some discussion, splot was not seen as the best place, but mapclassify was. And after a couple of months, I made a pull request merging greedy into mapclassify.

It is a very simple story, you may say. Yes, but it shows one thing very clearly, and that is a willingness of the community to collaborate across different projects. A lot of people were involved in it, everyone willing to find the best solution. I think it is worth sharing tiny stories like this.

To see the capability of mapclassify.greedy, here is a Jupyter notebook for you. Thanks everyone involved!

The Code

This is just a quick appendix to the story, outlining the translation of the code from QGIS to GeoPandas-compatible version.

First thing to say – it is easy! Easier that expected, to be honest. I might have been lucky with the algorithm I’ve chosen, but I believe that there is a scope for other processing tools to be ported this way.

The code of greedy is here and original code here.

QGIS code has a lot of stuff related to QGIS interface, which can be happily ignored. The core is in def processAlgorithm(self, parameters, context, feedback) and that is the part we should focus on.

Nyall’s code can be roughly divided into three steps (ignoring some interface code):

  1. Compute graph to understand topological relationships between polygons
  2. Compute balanced topological colouring
  3. Assign colours to features

To compute graph, Nyall defines a new class holding the topology and checks which polygons intersect with which. I did not want to have a specific class, because we have libpysal‘s weights object taking care of it. Moreover, it comes with an alternative option of topology inferring as contiguity weights. No need to expensively compute intersections anymore (unless we want to, I kept the option there).

Balanced colouring is, in the end, the only part of the code, which is almost entirely original. I made only a few modifications.

Because topological colouring is a know Graph problem, there is a selection of algorithms in networkx library dealing with the problem, so I just simply linked them in.

Finally, the function now returns pandas.Series, making it simple to assign resulting values to GeoDataFrame. The most simple usage is then a single line.

gdf['colors'] = mapclassify.greedy(gdf)

Morphological tessellation

Imagine you are trying to analyse a city, and you want primarily to understand its structure. You look at buildings, their dimensions and patterns they form, you look at a street network, and then you want to understand detailed patterns of density. This last point requires a specification of an aerial unit, and morphological tessellation is one of the most detailed options.

I don’t want to talk only about density, but let’s use it as an illustration.

Let’s say that we want as high resolution of our data as possible, so we need the smallest spatial unit we can get. That is often a plot.

But the reality is not so simple. Firstly, you need to get a layer representing plots. Here you may be lucky and find one, nice and clean, which you can use for the analysis. However, if you try to the same thing for another city on another continent, you quickly figure out that the situation is becoming tricky. Plots have various definitions – each country captures them differently 1Kropf, K. (2018) ‘Plots, property and behaviour’, Urban Morphology, 22(1), pp. 5–14.. So, in the end, your plots from different places are not representing the same concept, and your results are not comparable.2An additional issue with plots comes with modernist housing development, where the plot does not have a structural role, but that is a whole another story.

One possible solution is to use morphological tessellation as a spatial unit instead. Its principle is simple – it is a Voronoi tessellation based on building footprint polygons. In practice, each building polygon is represented as a series of points capturing its boundaries. Points are used to generate Voronoi, and relevant cells are dissolved to get a single cell representing tessellation cell based on a single building.3There are algorithms which can do tessellation based on polygon directly (e.g. CGAL library), but point-based implementation is much easier to do in Python, R or even QGIS and the result is almost the same.

In our recent paper, we have tested morphological tessellation from two perspectives – 1) what are the optimal parameters for its creation, and 2) how good proxy of a plot tessellation is.

There are a few important findings to mention. First, there are contexts where tessellation can directly replace a plot in the analysis. One example is the density mentioned above – both coverage area ratio, and floor area ratio could be easily measured based on tessellation. Second, tessellation has some advantages over the plot – it is consistent and contiguous.

Tessellation captures any kind of urban form (historical, modernist, industrial…) in the same, consistent manner. That makes the tessellation cells across the case studies perfectly comparable, even in situations when plots are not available at all (like some informal settlements) or capture non-morphological spatial division (like contemporary large-scale development with multiple buildings on a single plot).

Unlike plots, morphological tessellation is contiguous, allowing for the definition of adjacency between buildings in situations, where it normally would not be so clear.

That, in turn, allows spatial aggregation, based on the aerial topology of space. Just pick a number of topological steps you like.

Our implementation of morphological tessellation is a part of momepy Python package, coming with a tutorial on how to generate tessellation on your data or on data from OpenStreetMap.

The published version of our article Morphological tessellation as a way of partitioning space: Improving consistency in urban morphology at the plot scale is available from CEUS website, the repository with the Python code and data we have used within the research is available on GitHub and archived here.

The second image in this article is reproduced from Kropf, K. (2018) ‘Plots, property and behaviour’, Urban Morphology, 22(1), pp. 5–14.


Fleischmann, M., Feliciotti, A., Romice, O. and Porta, S. (2020) ‘Morphological tessellation as a way of partitioning space: Improving consistency in urban morphology at the plot scale’, Computers, Environment and Urban Systems, 80, p. 101441. doi: 10.1016/j.compenvurbsys.2019.101441.

Line simplification algorithms

Sometimes our lines and polygons are way too complicated for the purpose. Let’s say that we have a beautiful shape of Europe, and we want to make an interactive online map using that shape. Soon we’ll figure out that the polygon has too many points, it takes ages to load, it consumes a lot of memory and, in the end, we don’t even see the full detail. To make things easier, we decide to simplify my polygon.

Simplification means that we want to express the same geometry, using fewer points, but trying to preserve the original shape as much as we can. The easiest way is to open QGIS and use its Simplify processing tool. Now we face the choice – which simplification method should we use? Douglas-Peucker or Visvalingam? How do they work? What is the difference? What does a “tolerance” mean?

This short post aims to answer these questions. I’ll try to explain both of these most popular algorithms, so you can make proper decisions while using them.

First let’s see both how algorithms simplify the following line.

Relatively complex line with 11 points, which needs to be simplified.

Douglas-Peucker

Douglas-Peucker, or sometimes Ramer–Douglas–Peucker algorithm, is the better known of the two. Its main aim is to identify those points, which are less important for the overall shape of the line and remove them. It does not generate any new point.

The algorithm accepts typically one parameter, tolerance, sometimes called epsilon. To explain how is epsilon used, it is the best to start with the principle. Douglas-Peucker is an iterative algorithm – it removes the point, splits the line and starts again until there is no point which could be removed. In the first step, it makes a line between the first and the last points of the line, as illustrated in the figure below. Then it identifies the point on the line, which is the furthest from this line connecting endpoints. If the distance between the line and the point is less than epsilon, the point is discarded, and the algorithm starts again until there is no point between endpoints.

If the distance between the point and the line is larger than epsilon, the first and the furthest points are connected with another line and every point, which is closer than epsilon to this line gets discarded. Every time a new furthest point is identified, our original line splits in two and the algorithm continues on each part separately. The animation below shows the whole procedure of simplification of the line above using the Douglas-Peucker algorithm.

Visvalingam-Whyatt

Visvalingam-Whyatt shares the aim with Douglas-Peucker – identify points which could be removed. However, the principle is different. Tolerance, or epsilon, in this case, is an area, not a distance.

Visvalingam-Whyatt, in the first step, generates triangles between points, as illustrated in the figure below.

Then it identifies the smallest of these triangles and checks if its area is smaller or larger than the epsilon. If it is smaller, the point associated with the triangle gets discarded, and we start again – generate new triangles, identify the smallest one, check and repeat. The algorithm stops when all generated triangles are larger than the epsilon. See the whole simplification process below.

A great explanation of Visvalingam-Whyatt algorithm with an interactive visualisation made Mike Bostock.

Which one is better?

You can see from the example above, that the final line is the same, but that is not always true, and both algorithms can result in different geometries. Visvalingam-Whyatt tends to produce nicer geometry and is often preferred for simplification of natural features. Douglas-Peucker tends to produce spiky lines at specific configurations. You can compare the actual behaviour of both at this great example by Michael Barry.

Which one is faster?

Let’s figure it out. I will use a long randomised line and Python package simplification, which implements both algorithms. The results may vary based on the actual implementation, but using the same package seems fair. I generate randomised line based on 5000 points and then simplify if using both algorithms with the epsilon fine-tuned to return a similar number of points.

import numpy as np
from simplification.cutil import (
    simplify_coords, # this is Douglas-Peucker 
    simplify_coords_vw,  # this is Visvalingam-Whyatt
)

# generate coords of 5000 ordered points as a line
coords = np.sort(np.random.rand(5000, 2), axis=0)

# how many coordinates returns DP with eps=0.01?
simplify_coords(coords, .0025).shape
# 30 / 5000

# how many coordinates returns VW with eps=0.001?
simplify_coords_vw(coords, .0001).shape
# 28 / 500

%%timeit
simplify_coords(coords, .0025)

%%timeit
simplify_coords_vw(coords, .0001)

And the winner is – Douglas-Peucker. By a significant margin.

Douglas-Peucker:

74.1 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Visvalingam-Whyatt:

2.17 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Douglas-Peucker is clearly more performant, but Visvalingam-Whyatt can produce nicer-looking geometry, pick the one you prefer.

Percentage instead of epsilon

Some implementations of simplification algorithms do not offer tolerance / epsilon parameter, but ask for a percentage. How many points do you want to keep? One example of this approach is mapshaper by Matthew Bloch. Based on the iterative nature of both, you can figure out how that works :).

What about topology?

It may happen, that the algorithm (any of them) returns invalid self-intersecting line. Be aware that it may happen. Some implementations (like GEOS used by Shapely and GeoPandas) provide optional slower version preserving topology, but some don’t, so be careful.

I have gaps between my polygons

If you are trying to simplify GeoDataFrame or shapefile, you may be surprised that the simplification makes gaps between the polygons where there should not be any. The reason for that is simple – the algorithm simplifies each polygon separately, so you will easily get something like this.

If you want nice simplification which preserves topology between all polygons, like mapshaper does, look for TopoJSON. Without explaining how that works, as it deserves its own post, see the example below for yourself as the last bit of this text.

import topojson as tp

topo = tp.Topology(df, prequantize=False)
topo.toposimplify(5).to_gdf()

If there’s something inaccurate or confusing, let me know.