Yes, the formula above is correct. Well, it depends on what we mean by correct.

NDVI does not make sense

Imagine the following situation. We have fetched a cloud-free mosaic of Sentinel 2 satellite data and want to measure NDVI (Normalised difference vegetation index), which uses red and near-infrared bands within this simple formula.

The results are normalised, which in this case means that they lie between -1 and 1. Always.

We open out raster data using xarray and have all 4 bands in a single xarray.DataSet. The code to measure NDVI is then simple.

>>> red = data.sel(band=1) # select red band
>>> nir = data.sel(band=4) # select near-infrared band
>>> ndvi = (nir - red) / (nir + red) # compute NDVI

And, a surprise! Our results are between 0 and 170. That is certainly not correct. What has happened?

16-bit unsigned integer

The data coming from Sentinel 2 are stored as 16-bit unsigned integer (uint16). That means that the value the array can hold can be anything between 0 and 2^{16} – 1 (65,535). Remember that NDVI is between -1 and 1. Does it mean that uint16 cannot represent NDVI values? Yes, precisely.

Look at this toy example to understand what is happening during the computation of NDVI in uint16. Let’s have an array with four numbers and subtract 10 from each of them.

Yes, as weird as it is, it is correct. The result should be negative, but we can’t have negative values in uint16. So what happens is that the counter rolls over and subtracts the remaining value from the maximum it can represent (65,535 – x).

It is exactly like a rollover of the odometer. We ran out of values, so they started over. The only difference is that we have 16 binary values encoding each number, not decimal.

The fix is easy. We have to use data type which does not limit you like this, like a 64-bit integer.

Compared to a 64-bit integer, a 16-bit integer is efficient since the resulting file will be much smaller (that is why it is used in the first place) but it can be limiting.

Be aware of your data types, so you don’t make the same mistake we did ;).

This is a short story of one open-source algorithm and its journey from QGIS to mapclassify, to be used within GeoPandas. I am writing it to illustrate the flow within the open-source community because even though this happens all the time, we normally don’t talk about it. And we should.

The story

Sometimes last year, I asked myself a question. How hard would it be to port topological colouring tool from QGIS to be used with GeoPandas? Considering that this particular QGIS tool is written in Python, it seemed to be an easy task.

For those of you who never used it, the aim of topological colouring is to assign colours to (usually) polygons in such a way, that no two adjacent polygons share the same colour (see the illustration in the embedded tweet below).

The adaptation of the Nyall Dawson’s original algorithm was quite straightforward, the logic of Python algorithms for QGIS and for GeoPandas is the same. So in October, I have asked the others what would be the ideal way of sharing it.

The original license was not compatible with the one we use in GeoPandas and I was not sure if GeoPandas itself is actually the right place for it. So while thinking about it, Nyall himself made the situation easier and offered to relicense the original code.

However, there was no clear consensus what is the best way at that time and the whole idea was set aside, until the end of the year, when I decided to keep it simple and release the script as a tiny Python package. And greedy was born.

In the end, greedy offered the original QGIS algorithm for topological colouring and a few other options on top of that, applicable to all geometry types (unlike QGIS, which is limited to polygons). It was a simple solution to release it like that, but it was not optimal, because the more scattered the ecosystem is, the more complicated is to navigate in it and people just easily miss things.

We could have ended the story here, but then Levi Wolf came with an idea.

After some discussion, splot was not seen as the best place, but mapclassify was. And after a couple of months, I made a pull request merging greedy into mapclassify.

It is a very simple story, you may say. Yes, but it shows one thing very clearly, and that is a willingness of the community to collaborate across different projects. A lot of people were involved in it, everyone willing to find the best solution. I think it is worth sharing tiny stories like this.

To see the capability of mapclassify.greedy, here is a Jupyter notebook for you. Thanks everyone involved!

The Code

This is just a quick appendix to the story, outlining the translation of the code from QGIS to GeoPandas-compatible version.

First thing to say – it is easy! Easier that expected, to be honest. I might have been lucky with the algorithm I’ve chosen, but I believe that there is a scope for other processing tools to be ported this way.

The code of greedy is here and original code here.

QGIS code has a lot of stuff related to QGIS interface, which can be happily ignored. The core is in def processAlgorithm(self, parameters, context, feedback) and that is the part we should focus on.

Nyall’s code can be roughly divided into three steps (ignoring some interface code):

Compute graph to understand topological relationships between polygons

Compute balanced topological colouring

Assign colours to features

To compute graph, Nyall defines a new class holding the topology and checks which polygons intersect with which. I did not want to have a specific class, because we have libpysal‘s weights object taking care of it. Moreover, it comes with an alternative option of topology inferring as contiguity weights. No need to expensively compute intersections anymore (unless we want to, I kept the option there).

Balanced colouring is, in the end, the only part of the code, which is almost entirely original. I made only a few modifications.

Because topological colouring is a know Graph problem, there is a selection of algorithms in networkx library dealing with the problem, so I just simply linked them in.

Finally, the function now returns pandas.Series, making it simple to assign resulting values to GeoDataFrame. The most simple usage is then a single line.

Imagine you are trying to analyse a city, and you want primarily to understand its structure. You look at buildings, their dimensions and patterns they form, you look at a street network, and then you want to understand detailed patterns of density. This last point requires a specification of an aerial unit, and morphological tessellation is one of the most detailed options.

I don’t want to talk only about density, but let’s use it as an illustration.

Let’s say that we want as high resolution of our data as possible, so we need the smallest spatial unit we can get. That is often a plot.

But the reality is not so simple. Firstly, you need to get a layer representing plots. Here you may be lucky and find one, nice and clean, which you can use for the analysis. However, if you try to the same thing for another city on another continent, you quickly figure out that the situation is becoming tricky. Plots have various definitions – each country captures them differently ^{1}Kropf, K. (2018) ‘Plots, property and behaviour’, Urban Morphology, 22(1), pp. 5–14.. So, in the end, your plots from different places are not representing the same concept, and your results are not comparable.^{2}An additional issue with plots comes with modernist housing development, where the plot does not have a structural role, but that is a whole another story.

One possible solution is to use morphological tessellation as a spatial unit instead. Its principle is simple – it is a Voronoi tessellation based on building footprint polygons. In practice, each building polygon is represented as a series of points capturing its boundaries. Points are used to generate Voronoi, and relevant cells are dissolved to get a single cell representing tessellation cell based on a single building.^{3}There are algorithms which can do tessellation based on polygon directly (e.g. CGAL library), but point-based implementation is much easier to do in Python, R or even QGIS and the result is almost the same.

In our recent paper, we have tested morphological tessellation from two perspectives – 1) what are the optimal parameters for its creation, and 2) how good proxy of a plot tessellation is.

There are a few important findings to mention. First, there are contexts where tessellation can directly replace a plot in the analysis. One example is the density mentioned above – both coverage area ratio, and floor area ratio could be easily measured based on tessellation. Second, tessellation has some advantages over the plot – it is consistent and contiguous.

Tessellation captures any kind of urban form (historical, modernist, industrial…) in the same, consistent manner. That makes the tessellation cells across the case studies perfectly comparable, even in situations when plots are not available at all (like some informal settlements) or capture non-morphological spatial division (like contemporary large-scale development with multiple buildings on a single plot).

Unlike plots, morphological tessellation is contiguous, allowing for the definition of adjacency between buildings in situations, where it normally would not be so clear.

That, in turn, allows spatial aggregation, based on the aerial topology of space. Just pick a number of topological steps you like.

Our implementation of morphological tessellation is a part of momepy Python package, coming with a tutorial on how to generate tessellation on your data or on data from OpenStreetMap.

The published version of our article Morphological tessellation as a way of partitioning space: Improving consistency in urban morphology at the plot scale is available from CEUS website, the repository with the Python code and data we have used within the research is available on GitHub and archived here.

The second image in this article is reproduced from Kropf, K. (2018) ‘Plots, property and behaviour’, Urban Morphology, 22(1), pp. 5–14.

Fleischmann, M., Feliciotti, A., Romice, O. and Porta, S. (2020) ‘Morphological tessellation as a way of partitioning space: Improving consistency in urban morphology at the plot scale’, Computers, Environment and Urban Systems, 80, p. 101441. doi: 10.1016/j.compenvurbsys.2019.101441.

Sometimes our lines and polygons are way too complicated for the purpose. Let’s say that we have a beautiful shape of Europe, and we want to make an interactive online map using that shape. Soon we’ll figure out that the polygon has too many points, it takes ages to load, it consumes a lot of memory and, in the end, we don’t even see the full detail. To make things easier, we decide to simplify my polygon.

Simplification means that we want to express the same geometry, using fewer points, but trying to preserve the original shape as much as we can. The easiest way is to open QGIS and use its Simplify processing tool. Now we face the choice – which simplification method should we use? Douglas-Peucker or Visvalingam? How do they work? What is the difference? What does a “tolerance” mean?

This short post aims to answer these questions. I’ll try to explain both of these most popular algorithms, so you can make proper decisions while using them.

First let’s see both how algorithms simplify the following line.

Douglas-Peucker

Douglas-Peucker, or sometimes Ramer–Douglas–Peucker algorithm, is the better known of the two. Its main aim is to identify those points, which are less important for the overall shape of the line and remove them. It does not generate any new point.

The algorithm accepts typically one parameter, tolerance, sometimes called epsilon. To explain how is epsilon used, it is the best to start with the principle. Douglas-Peucker is an iterative algorithm – it removes the point, splits the line and starts again until there is no point which could be removed. In the first step, it makes a line between the first and the last points of the line, as illustrated in the figure below. Then it identifies the point on the line, which is the furthest from this line connecting endpoints. If the distance between the line and the point is less than epsilon, the point is discarded, and the algorithm starts again until there is no point between endpoints.

If the distance between the point and the line is larger than epsilon, the first and the furthest points are connected with another line and every point, which is closer than epsilon to this line gets discarded. Every time a new furthest point is identified, our original line splits in two and the algorithm continues on each part separately. The animation below shows the whole procedure of simplification of the line above using the Douglas-Peucker algorithm.

Visvalingam-Whyatt

Visvalingam-Whyatt shares the aim with Douglas-Peucker – identify points which could be removed. However, the principle is different. Tolerance, or epsilon, in this case, is an area, not a distance.

Visvalingam-Whyatt, in the first step, generates triangles between points, as illustrated in the figure below.

Then it identifies the smallest of these triangles and checks if its area is smaller or larger than the epsilon. If it is smaller, the point associated with the triangle gets discarded, and we start again – generate new triangles, identify the smallest one, check and repeat. The algorithm stops when all generated triangles are larger than the epsilon. See the whole simplification process below.

A great explanation of Visvalingam-Whyatt algorithm with an interactive visualisation made Mike Bostock.

Which one is better?

You can see from the example above, that the final line is the same, but that is not always true, and both algorithms can result in different geometries. Visvalingam-Whyatt tends to produce nicer geometry and is often preferred for simplification of natural features. Douglas-Peucker tends to produce spiky lines at specific configurations. You can compare the actual behaviour of both at this great example by Michael Barry.

Which one is faster?

Let’s figure it out. I will use a long randomised line and Python package simplification, which implements both algorithms. The results may vary based on the actual implementation, but using the same package seems fair. I generate randomised line based on 5000 points and then simplify if using both algorithms with the epsilon fine-tuned to return a similar number of points.

import numpy as np
from simplification.cutil import (
simplify_coords, # this is Douglas-Peucker
simplify_coords_vw, # this is Visvalingam-Whyatt
)
# generate coords of 5000 ordered points as a line
coords = np.sort(np.random.rand(5000, 2), axis=0)
# how many coordinates returns DP with eps=0.01?
simplify_coords(coords, .0025).shape
# 30 / 5000
# how many coordinates returns VW with eps=0.001?
simplify_coords_vw(coords, .0001).shape
# 28 / 500
%%timeit
simplify_coords(coords, .0025)
%%timeit
simplify_coords_vw(coords, .0001)

And the winner is – Douglas-Peucker. By a significant margin.

Douglas-Peucker:

74.1 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Visvalingam-Whyatt:

2.17 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Douglas-Peucker is clearly more performant, but Visvalingam-Whyatt can produce nicer-looking geometry, pick the one you prefer.

Percentage instead of epsilon

Some implementations of simplification algorithms do not offer tolerance / epsilon parameter, but ask for a percentage. How many points do you want to keep? One example of this approach is mapshaper by Matthew Bloch. Based on the iterative nature of both, you can figure out how that works :).

What about topology?

It may happen, that the algorithm (any of them) returns invalid self-intersecting line. Be aware that it may happen. Some implementations (like GEOS used by Shapely and GeoPandas) provide optional slower version preserving topology, but some don’t, so be careful.

I have gaps between my polygons

If you are trying to simplify GeoDataFrame or shapefile, you may be surprised that the simplification makes gaps between the polygons where there should not be any. The reason for that is simple – the algorithm simplifies each polygon separately, so you will easily get something like this.

If you want nice simplification which preserves topology between all polygons, like mapshaper does, look for TopoJSON. Without explaining how that works, as it deserves its own post, see the example below for yourself as the last bit of this text.

import topojson as tp
topo = tp.Topology(df, prequantize=False)
topo.toposimplify(5).to_gdf()

If there’s something inaccurate or confusing, let me know.

This is a short introduction of our recently published paper Measuring urban form: Overcoming terminological inconsistencies for a quantitative and comprehensive morphologic analysis of cities, which is essentially one of the background chapters of my PhD (hopefully finished later this year).

When I started my work, which is focusing on measuring of urban form (or urban morphometrics) – see momepy – one of the first things I wanted to do was to understand what people were people measuring so far — the natural thing to do. However, I have soon figured out that it would not be so easy as there is a minimal consensus on how to call measurable characters. What one calls connectivity other names intersection density and even then you have a little idea – density based on what? Intersections per hectare, per kilometre? I have collected almost 500 measurable characters to find out that it is one big mess.

Before moving on, I had to do a slight detour trying to understand what all of them were about, which were called differently but were, in fact, the same and which were called the same, but were different things. Both situations happen A LOT. This paper is proposing a 1) framework for naming measurable characters to avoid these cases and 2) classification of characters, to make a more structured sense of the vast number of options we can measure.

Renaming is based on the Index of Element principle. Each character has an Index – the measure that it calculates and Element – the element of urban form that it measures. So the case of connectivity above could be called a weighted number of intersections (Index) of a pedestrian network (Element). Yes, it is longer, but also unambiguous. The room for the interpretation is much narrower than in the previous example.

We have used this principle to rename all of those ~500 characters, which eliminated a lot of duplications, leaving us a bit more than 350 unique ones. At that point, we could continue with a classification, which was the first aim.

Classification of characters, which is loosely used in momepy as well, categorises characters into six groups based on the nature of the Index part of the name: dimension, shape, spatial distribution, intensity, connectivity, and diversity. The second layer of classification is based on the notion of scale – what is the grain of the resulting information and the extent of the element. To keep it simple, we are using conceptual S, M, L bins for scale, where “Small (S) represent- ing the spatial extent of building, plot, street or block (and similar), Medium (M) represent- ing the scale of the sanctuary area, neighbourhood, walkable distance (5 or 10min) or district (and similar) and Large (L), representing the city, urban area, metropolitan area or similar” (taken from the paper, p.7). In the end, each character has its category, scale of the grain and scale of the extent. Taking an example of Closeness Centrality of Street Network, it falls into connectivity category (as closeness centrality is a network measure), its grain is S because each node has its value, but its extent is L as it can be measured on large networks.

This framework helped us figure out what is the current situation in the field, e.g., that there is a lot of work focusing on dimension, shape, or intensity but very little on diversity (which was surprising as theoretical urban research talks about diversity all the time). Plus a couple of other interesting findings.

Btw, the paper also includes this map of the quantitative research in urban morphology and a complete database of morphometric character we have worked with (also on GitHub).

The whole paper of trying to talk inwards to the community, saying guys, this is a mess, let’s do something about it. We found it fascinating how such a small field as urban morphometrics can produce so much confusion in a way how we use language.

You can find the original paper at Environment and Planning B website (open access) or accepted manuscript at the University of Strathclyde PURE portal (open access). There is very little code involved in this. Still, related Jupyter notebooks and CSVs containing the database of literature as well as the database of measurable characters are all in this GitHub repository.

Fleischmann, M, Romice, O & Porta, S 2020, ‘Measuring urban form: overcoming terminological inconsistencies for a quantitative and comprehensive morphologic analysis of cities’, Environment and Planning B: Urban Analytics and City Science, pp. 1-18. https://doi.org/10.1177/2399808320910444