How to create a vector-based web map hosted on GitHub

This is the map we have created for the Urban Grammar AI project. It is built using an open-source software stack and hosted on GitHub, for free.

This post will walk you through the whole process of generating the map, step by step, so you can create your own. It is a bit longer than usual, so here is a quick outline for better orientation:

  1. Vector tiles
    1. Zoom levels
    2. Tile format and compression
    3. Generate tiles
  2. Leaflet.js map
    1. Load leaflet.js
    2. Create the map
    3. Add a base map
    4. Add vector tiles
    5. Style the tiles
  3. GitHub Pages
  4. Further styling and interactivity
    1. Labels on top of the map
    2. Popup information
    3. Conditional formatting based on the zoom level
    4. Legend and details

By the end of this tutorial, you will be able to take your vector data and turn them into a fully fledged vector map hosted on GitHub at no cost, for everyone to enjoy.

Vector tiles

Web maps are (usually) based on tiles. That means that every dataset is turned into a large set of tiny squares, each representing a small portion of the map. And if you want to zoom in and out, you need those tiles for every zoom level. So the first step in our process is the creation of vector tiles.

The easiest option is to use tippecanoe, a tool created by Mapbox that allows you to create vector tiles from any GeoJSON input. You should export only the columns you need and make sure that the geometry is in the EPSG:4326 projection (lat/long), which is what tippecanoe expects. Something like this will do if you have your data in a geopandas.GeoDataFrame:

gdf[["signature_type", "geometry"]].to_crs(4326).to_file(
    "signatures_4326.geojson"
)

Now you have a signatures_4326.geojson with the geometry and a column encoding the signature type. That means it is time to use tippecanoe (check the installation instructions).

tippecanoe is a command-line tool, so fire up your favourite terminal, or use it within a Jupyter notebook (as we did). There are some options to set, but let’s talk about the most important ones (you can check the rest in the documentation).

Zoom levels

Every map has a range of zoom levels at which the data shown are meaningful. If you’re comparing states, you don’t need to zoom to the level of a single street, and if you’re showing building-level data, you don’t need a zoom level showing the whole continent.

Zoom level 0 means you are zoomed out to see the whole world in a single tile. Zoom level 1 is a bit closer and is split into 4 tiles. Zoom level 2 has 16 tiles, level 3 has 64, and so on. Zoom level 16, which shows street-level detail, consists of 4 294 967 296 tiles if you want to create them for the whole world. That is a lot of data, so you need to be mindful of how much detail you actually need, because the difference between zooming up to level 15 and up to level 16 is huge. OpenStreetMap has a nice guide on zoom levels if you are interested in more details.
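
Each additional level quadruples the number of tiles (4^z at level z), which you can verify with quick arithmetic:

# number of XYZ tiles covering the whole world at zoom level z
for z in [0, 1, 2, 3, 15, 16]:
    print(f"zoom {z}: {4 ** z:,} tiles")
# zoom 15 -> 1,073,741,824 tiles; zoom 16 -> 4,294,967,296 tiles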

For the signature geometry, which represents a neighbourhood-level classification, level 15 feels like a reasonable maximum, so you have the first option – -z15.

Tile format and compression

In later steps, you will use the Leaflet.VectorGrid plugin to load the tiles. To make that work, you need them as individual uncompressed protobuf files rather than a compressed .mbtiles archive, which gives you another two options – --no-tile-compression to disable the compression and --output-to-directory to tell tippecanoe to write a directory of tiles instead of an mbtiles file.

We have used a few more options, but you may not need those.

Generate tiles

You have all you need to create the tiles:

tippecanoe -z15 \
           --no-tile-compression \
           --output-to-directory=tiles/ \
           signatures_4326.geojson

The command above sets the maximum zoom to 15 so you don’t create overly detailed tiles (-z15), disables tile compression (--no-tile-compression) and saves the result to the tiles/ directory (--output-to-directory=tiles/), based on your GeoJSON file (signatures_4326.geojson).

This step may take a bit of time, but once it is done, you will see thousands of files in the tiles/ folder. It is time to show them on a map.

Leaflet.js map

One of the best tools for interactive maps is leaflet.js, an open-source JavaScript library. With the right plugins, it is an ideal tool for almost any web map.

The whole map and the JavaScript code will be served from a single HTML file. Let’s start with some basic structure, with comments as placeholders you will fill in later:

<!DOCTYPE html>
<html>

<head>
    <title>Spatial Signatures in Great Britain</title>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <!-- load leaflet.js -->
</head>

<body style='margin:0'>
        <!-- div containing map -->
        <!-- specification of leaflet map -->
</body>

</html>

Load leaflet.js

To load leaflet.js, you need its source script and its CSS styles. Both are usually fetched over the web. You can then replace <!-- load leaflet.js --> with the following snippet:

<!-- load leaflet.js -->
<!-- load CSS -->
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.7.1/dist/leaflet.css"/>

<!-- load JS -->
<script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js"></script>

However, basic leaflet.js cannot work with vector tiles, so you need to load the Leaflet.VectorGrid extension on top. Just add this to the snippet:

<!-- load VectorGrid extension -->
<script src="https://unpkg.com/leaflet.vectorgrid@1.3.0/dist/Leaflet.VectorGrid.bundled.js"></script> 

At this moment, you have everything you need to create the map in your file.

Create the map

You need two things – a div object where the map will live and a bit of JavaScript specifying what the map should do. The former is easy:

<!-- div containing map -->
<div id="map" style="width: 100vw; height: 100vh; background: #fdfdfd"></div>

This div is empty, but the important bit is id="map", which will link it to the JavaScript. It also has some basic styles ensuring that it covers the whole screen (width: 100vw; height: 100vh;) and has an almost-white background (background: #fdfdfd).

The second part is the actual JavaScript generating the map.

<script>
// create map
var map = L.map('map', {
    center: [53.4107, -2.9704],
    minZoom: 6,
    maxZoom: 15,
    zoomControl: true,
    zoom: 12,
});

// other map specifications
</script>

Let’s unpack this. The code creates a variable called map (var map), which is a leaflet map object (L.map). The object is linked to the HTML element (in this case the div) with the id map you created above ('map'). It also takes a dictionary with some basic specifications: setting the centre of the map to Liverpool (center: [53.4107, -2.9704], where the numbers are latitude and longitude), limiting the zoom levels to 6–15 (minZoom: 6 and maxZoom: 15), turning on the zoom control (the +/- buttons) (zoomControl: true) and setting the default zoom to 12 (zoom: 12). Zoom level 6, selected as the minimum, is approximately the level at which you see the whole of Great Britain. Since spatial signatures cover only GB, there’s no need to zoom further out. The maximum zoom of 15 mirrors the maximum zoom of the tiles generated earlier; you would get no tiles if you zoomed in closer than that.

The map object is ready! There’s nothing there now, but that will change soon.

Add a base map

It is nice to have a base map in there before adding other layers, just to make sure everything works. You can add the default OpenStreetMap or something more subtle, such as the CartoDB Positron map.

// add background basemap
var mapBaseLayer = L.tileLayer(
    'https://{s}.basemaps.cartocdn.com/rastertiles/light_all/{z}/{x}/{y}.png', {
        attribution: '(C) OpenStreetMap contributors (C) CARTO'
    }
).addTo(map);

This creates a leaflet tile layer (L.tileLayer) with the URL of the CartoDB raster XYZ tiles and the attribution of their origin. Finally, it adds the layer to our map (.addTo(map)). Now you can actually see a real map.

Add vector tiles

Now for the key part – vector tiles. We need to define quite a few things here.

URL

Let’s first set a URL template for the tiles.

// get vector tiles URL
var mapUrl = "tiles/{z}/{x}/{y}.pbf";

Tiles are in the tiles/ folder where tippecanoe dumped them. They are organised into folders following the XYZ tiling standard, meaning they are sorted by zoom level ({z}) and tile coordinates within each level ({x} and {y}). You therefore specify the URL with placeholders that leaflet.js fills in automatically. Here you save the URL template to the variable mapUrl.
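
If you are curious which tile a given coordinate falls into, the standard XYZ scheme is easy to compute yourself. A small Python sketch (the helper name is mine, for illustration):

import math

def deg2tile(lat, lon, zoom):
    # standard XYZ (slippy map) tiling formula
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# which tile covers the centre of our map (Liverpool) at zoom 12?
print(deg2tile(53.4107, -2.9704, 12))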

Options

Another variable you may want to specify is a dictionary with options for the tiles.

// define options of vector tiles
var mapVectorTileOptions = {
    rendererFactory: L.canvas.tile,
    interactive: true,
    attribution: '(C) Martin Fleischmann',
    maxNativeZoom: 15,
    minZoom: 6,
};

The snippet creates a variable (mapVectorTileOptions) and specifies that the tiles should be rendered to a canvas (rendererFactory: L.canvas.tile) and should allow interactivity (interactive: true), which you will need later. It also sets the attribution (attribution: '(C) Martin Fleischmann') and the min and max zoom levels.

Vector tile layer

Now you can define the vector tile layer itself.

var mapPbfLayer = new L.VectorGrid.Protobuf(
    mapUrl, mapVectorTileOptions
)

This creates a new variable (mapPbfLayer) storing a VectorGrid protobuf layer (new L.VectorGrid.Protobuf) based on the URL and the tile options specified above.

Add to map

Finally, you can add the created layer to the map.

// add VectorGrid layer to map
mapPbfLayer.addTo(map);

At this point, you should be able to see your vector tiles on the map, with no styles yet. However, to view the map locally, you need to start a server first. An easy way is to do it with Python. In the terminal, navigate to the folder with your HTML file (something like cd path/to/my_folder) and start the HTTP server:

python -m http.server 8000

If you now open http://localhost:8000 in your browser, you should be able to open your HTML file (it will open automatically if it is called index.html) and see your map.

Style the tiles

The GeoJSON used to create these tiles has a single attribute column called signature_type. Now you want to use this column to style the map to your liking.

Let’s assume that you know which colour (as a hex code) should apply to which signature type code. Save those in a dictionary, so you can use them later.

// mapping of colors to signature types
const cmap = {
    "0_0": "#f2e6c7",
    "1_0": "#8fa37e",
    "3_0": "#d7a59f",
    "4_0": "#d7ded1",
    "5_0": "#c3abaf",
    "6_0": "#e4cbc8",
    "7_0": "#c2d0d9",
    "8_0": "#f0d17d",
    "2_0": "#678ea6",
    "2_1": "#94666e",
    "2_2": "#efc758",
    "9_0": "#3b6e8c",
    "9_1": "#333432",
    "9_2": "#ab888e",
    "9_3": "#c1c1c0",
    "9_4": "#bc5b4f",
    "9_5": "#a7b799",
    "9_6": "#c1c1c0",
    "9_7": "#c1c1c0",
    "9_8": "#c1c1c0"
};

This dictionary is simple – the left side encodes the signature type code, the right side the colour as a hex code.

Having the colours, let’s define a new variable with the styles and unpack it a bit.

// define styling of vector tiles
var vectorTileStyling = {
    signatures_4326: function(properties) {
        return ({
            fill: true,
            fillColor: cmap[properties.signature_type],
            fillOpacity: 0.9,
            weight: 1,
            color: "#ffffff",
            opacity: 1.0,
        });
    }
}

You need to create a function mapping the styles to geometries based on the signature type property. For the layer signatures_4326 (the name matches the input GeoJSON), you define a function that receives the layer properties (function(properties)) and specifies that each polygon should be filled with a colour (fill: true), that the fill colour should use the signature type to retrieve the colour from the dictionary defined above (fillColor: cmap[properties.signature_type]), and that the fill opacity should be 0.9 so a bit of the background shows through (fillOpacity: 0.9). It also specifies the outline of each polygon as a white line with a weight of 1.

Then you just need to edit the vector tile options (the mapVectorTileOptions dictionary from above) and add tile styling.

        // ...
        minZoom: 6,
        vectorTileLayerStyles: vectorTileStyling,  // this line is new
};

At this point, the basic map is done, and you can try to get it online or style it a bit more (scroll below for that).

This is the whole code of the page for reference.

<!DOCTYPE html>
<html>

<head>
    <title>Spatial Signatures in Great Britain</title>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <!-- load leaflet.js -->
    <!-- load CSS -->
    <link rel="stylesheet" href="https://unpkg.com/leaflet@1.7.1/dist/leaflet.css" />

    <!-- load JS -->
    <script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js"></script>

    <!-- load VectorGrid extension -->
    <script src="https://unpkg.com/leaflet.vectorgrid@1.3.0/dist/Leaflet.VectorGrid.bundled.js"></script>
</head>

<body style='margin:0'>
    <!-- div containing map -->
    <div id="map" style="width: 100vw; height: 100vh; background: #fdfdfd"></div>
    <!-- specification of leaflet map -->
    <script>
        // create map
        var map = L.map('map', {
            center: [53.4107, -2.9704],
            minZoom: 6,
            maxZoom: 15,
            zoomControl: true,
            zoom: 12,
        });

        // add background basemap
        var mapBaseLayer = L.tileLayer(
            'https://{s}.basemaps.cartocdn.com/rastertiles/light_all/{z}/{x}/{y}.png', {
                attribution: '(C) OpenStreetMap contributors (C) CARTO'
            }
        ).addTo(map);

        // get vector tiles URL
        var mapUrl = "tiles/{z}/{x}/{y}.pbf";

        // mapping of colors to signature types
        const cmap = {
            "0_0": "#f2e6c7",
            "1_0": "#8fa37e",
            "3_0": "#d7a59f",
            "4_0": "#d7ded1",
            "5_0": "#c3abaf",
            "6_0": "#e4cbc8",
            "7_0": "#c2d0d9",
            "8_0": "#f0d17d",
            "2_0": "#678ea6",
            "2_1": "#94666e",
            "2_2": "#efc758",
            "9_0": "#3b6e8c",
            "9_1": "#333432",
            "9_2": "#ab888e",
            "9_3": "#c1c1c0",
            "9_4": "#bc5b4f",
            "9_5": "#a7b799",
            "9_6": "#c1c1c0",
            "9_7": "#c1c1c0",
            "9_8": "#c1c1c0"
        };

        // define styling of vector tiles
        var vectorTileStyling = {
            signatures_4326: function(properties) {
                return ({
                    fill: true,
                    fillColor: cmap[properties.signature_type],
                    fillOpacity: 0.9,
                    weight: 1,
                    color: "#ffffff",
                    opacity: 1.0,
                });
            }
        }

        // define options of vector tiles
        var mapVectorTileOptions = {
            rendererFactory: L.canvas.tile,
            interactive: true,
            attribution: '(C) Martin Fleischmann',
            maxNativeZoom: 15,
            minZoom: 6,
            vectorTileLayerStyles: vectorTileStyling,
        };

        // create VectorGrid layer
        var mapPbfLayer = new L.VectorGrid.Protobuf(
            mapUrl, mapVectorTileOptions
        )

        // add VectorGrid layer to map
        mapPbfLayer.addTo(map);
    </script>
</body>

</html>

GitHub Pages

You can use GitHub Pages to host the whole map online. GitHub limits individual files to 100 MB. Considering that the original GeoJSON is 2.2 GB, that may sound like a problem, but remember that the data are now cut into small tiles, each only a few KB in size. Create a repository on GitHub, upload your tiles/ folder and the HTML file you have just created (named index.html), and set up GitHub Pages hosting.

Once your data are uploaded, create an empty file called .nojekyll, which tells GitHub Pages to serve the repository as it is (what it really says is “do not use Jekyll to compile the page”), and go to Settings > Pages to enable GitHub Pages.

🎉 Your web map is live and accessible to anyone online. The GitHub Pages settings show you the exact link. It is as easy as that.

Tip: Once you have everything set up on GitHub, use github.dev to make further changes to the HTML file and avoid downloading and indexing the whole repository (just hit . when you are in your repository on GitHub).

Our map lives in the urbangrammarai/great-britain repository if you want to check the source code, and it is available at https://urbangrammarai.xyz/great-britain/ (we use a custom domain with GitHub Pages).

Further styling and interactivity

You may have noticed that our map has a couple of features on top of this. Let’s quickly go through them.

Labels on top of the map

You can use another XYZ base map that has just labels and load it on top of your tiles to see the names of places (or roads or something else). The order of layers matches the order in which you use .addTo(map).

Add the CartoDB Positron labels (this needs to come after mapPbfLayer.addTo(map)):

// add labels basemap (a new variable name avoids redeclaring mapBaseLayer)
var mapLabelLayer = L.tileLayer(
    'https://{s}.basemaps.cartocdn.com/rastertiles/light_only_labels/{z}/{x}/{y}.png', {
        attribution: '(C) OpenStreetMap contributors (C) CARTO'
    }
).addTo(map);

Popup information

You may want to show some additional information when you click on a geometry. Let’s show the name of the signature type based on its code. First, create a dictionary with the info:

// mapping of names to signature types to be shown in popup on click
const popup_info = {
    "0_0": "Countryside agriculture",
    "1_0": "Accessible suburbia",
    "3_0": "Open sprawl",
    "4_0": "Wild countryside",
    "5_0": "Warehouse/Park land",
    "6_0": "Gridded residential quarters",
    "7_0": "Urban buffer",
    "8_0": "Disconnected suburbia",
    "2_0": "Dense residential neighbourhoods",
    "2_1": "Connected residential neighbourhoods",
    "2_2": "Dense urban neighbourhoods",
    "9_0": "Local urbanity",
    "9_1": "Concentrated urbanity",
    "9_2": "Regional urbanity",
    "9_4": "Metropolitan urbanity",
    "9_5": "Hyper concentrated urbanity",
};

Then you need to attach the on-click popup to the VectorGrid object.

// create VectorGrid layer
var mapPbfLayer = new L.VectorGrid.Protobuf(
    mapUrl, mapVectorTileOptions
).on('click', function (e) { // this line and below
    L.popup()
        .setContent(popup_info[e.layer.properties.signature_type])
        .setLatLng(e.latlng)
        .openOn(map);
});

Conditional formatting based on the zoom level

The white outline can be a bit thick when you zoom out, but there is a way to control that. You can make the line weight conditional on the zoom level, store it in a variable and pass that variable to the tile styling dictionary.

// define styling of vector tiles
var vectorTileStyling = {
    signatures_4326: function(properties, zoom) {
        // set line weight conditionally based on zoom
        var weight = 0;
        if (zoom > 12) {
            weight = 1.0;
        }
        return ({
            fill: true,
            fillColor: cmap[properties.signature_type],
            fillOpacity: 0.9,
            weight: weight, // pass the weight variable instead of a value
            color: "#ffffff",
            opacity: 1.0,
        });
    }
}

The styling function now has a second parameter, zoom, that you can use in the condition.

Legend and details

A legend and details are just custom HTML objects detached from the leaflet map. You can check the source code to see how they’re done.

Introducing Dask-GeoPandas for scalable spatial analysis in Python

Using Python for data science is usually a great experience, but if you’ve ever worked with pandas or GeoPandas, you may have noticed that they use only a single core of your processor. Especially on larger machines, that is a bit of a sad situation.

Developers have come up with many solutions to scale pandas, but the one that seems to have taken the lead is Dask. Dask (specifically dask.dataframe, as Dask can do much more) creates a partitioned data frame, where each partition is a single pandas.DataFrame. Each partition can then be processed in parallel and combined when necessary. On top of that, the whole pipeline can be scaled to a cluster of machines and can deal with out-of-core computation, i.e. with datasets that do not fit in memory.

Today, we announce the release of Dask-GeoPandas 0.1.0, a new Python package that extends dask.dataframe in the same way GeoPandas extends pandas, bringing support for geospatial data to Dask. That means geometry columns and spatial operations, but also spatial partitioning, which ensures that geometries close in space end up within the same partition, a prerequisite for efficient spatial indexing.

The project has been in development for quite some time. The original exploration of bridging Dask and GeoPandas was started almost 5 years ago by Matt Rocklin, the author of Dask. Later, in 2020, Julia Signell revised the idea and created the foundations of the current project. Since then, the GeoPandas maintainers have taken over and led the recent development.

What is awesome about Dask-GeoPandas? First, you can do your spatial analysis in parallel, making sure all available resources are used (no more sad idle cores!) and turning your workflows into faster, more efficient ones. You can also use Dask-GeoPandas to process data that do not fit in your machine’s memory, as Dask comes with support for out-of-core computation. Finally, you can distribute the work across many machines in a cluster. And all that with almost the same familiar GeoPandas API.

The latest evolution of the underlying libraries powering GeoPandas ensures that it is efficient in its use of resources and performant within each partition. For example, unlike GeoPandas, where the use of pygeos, a new vectorised interface to GEOS, is optional, Dask-GeoPandas requires it. Similarly, it depends on pyogrio, a vectorised interface to GDAL, to read geospatial file formats.

At this moment, Dask-GeoPandas can do a lot of what GeoPandas can, with some limitations. When your code involves individual geometries without assessing relationships between them (like computing a centroid or an area), you should be able to use it directly. When you need to work out relationships, you can try the (still somewhat limited) sjoin or make use of spatial partitions and spatial indexing, as sketched below.
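
To give a flavour, here is a minimal sketch, assuming two hypothetical polygon datasets in left.gpkg and right.gpkg and the 0.1.0 API:

import geopandas
import dask_geopandas

left = dask_geopandas.from_geopandas(
    geopandas.read_file("left.gpkg"), npartitions=8
)
right = dask_geopandas.from_geopandas(
    geopandas.read_file("right.gpkg"), npartitions=8
)

# element-wise operations need no spatial partitioning
areas = left.geometry.area

# computing the spatial extent of each partition allows sjoin
# to compare only partitions that can actually intersect
left.calculate_spatial_partitions()
right.calculate_spatial_partitions()

joined = dask_geopandas.sjoin(left, right, predicate="intersects")
result = joined.compute()  # nothing is computed until this call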

But not everything is ready. For example, the overlapping computation needed for use cases like accessibility or K-nearest-neighbour analyses is not yet implemented, PostGIS IO is not done, and some overlay operations are implemented only partially (sjoin) or not at all (overlay, sjoin_nearest). But the 0.1.0 release is just a start.

You can try it yourself, installing via conda (or mamba) or from PyPI (but see the instructions – GeoPandas can be tricky to install using pip).

mamba install dask-geopandas
pip install dask-geopandas 

The best starting point to learn how Dask-GeoPandas works is the documentation, but this is the gist:

import geopandas
import dask_geopandas

df = geopandas.read_file(
    geopandas.datasets.get_path("naturalearth_lowres")
)
dask_df = dask_geopandas.from_geopandas(df, npartitions=8)

dask_df.geometry.area.compute()

The code creates a dask_geopandas.GeoDataFrame with 8 partitions (because I have 8 cores) and computes each polygon’s area in parallel, giving an almost 8x speedup compared to vanilla GeoPandas.

You can also check my latest post comparing Dask-GeoPandas performance on a large spatial join with PostGIS and cuSpatial (GPU) implementations.

If you want to help, have questions or ideas, you are always welcome. Just head over to GitHub or Gitter and say hi!

Evolution of Urban Patterns: Urban Morphology as an Open Reproducible Data Science

We have a new paper published in Geographical Analysis on the opportunities that current developments in geographic data science within the Python ecosystem offer to urban morphology. To sum up – there’s a lot to play with, and if you’re interested in the quantification of urban form, there’s no better choice at the moment.

Urban morphology (the study of urban form) is historically a qualitative discipline that has only recently expanded into more data-science-ish waters. We believe there is a lot of potential in this direction and illustrate it with a case study looking into the evolution of urban patterns, i.e. how different aspects of urban form have changed over time.

The paper is open access (yay!) (links to the online version and a PDF), and the research is fully reproducible (even in your browser, thanks to the amazing MyBinder!), with all code on GitHub.

Short summary

We have tried to map all the specialised open-source tools urban morphologists can use these days, which resulted in this nice table. The main conclusion? Most of them are plug-ins for QGIS or ArcGIS and hence depend on pointing and clicking – a tricky thing to reproduce properly.

Table 1 from the paper

We prefer code-based science, so we took the Python ecosystem and put it to the test. We designed a fully reproducible workflow based on GeoPandas, OSMnx, PySAL and momepy to sample and study 42 places around the world, developed at very different times.

We measured a bunch of morphometric characters (indicators of individual aspects of form) and looked at their change over time. And there is a lot to look at. There are significant differences not only in scale (in the figure below) but in other aspects as well. One interesting observation – it seems that we have forgotten how to make a properly connected, dense grid.

As we said in the paper, “Switching to a code-based analysis may be associated with a steep learning curve. However, not everyone needs to reach the developer level as the data science ecosystem aims to provide a middle ground user level. That is a bit like Lego—the researcher learns how to put pieces together and then find pieces they need to build a house.”

We think that moving from QGIS to Python (or R), as daunting as it may seem to some, is worth it. It helps us overcome the reproducibility crisis science is going through – a crisis caused, among other things, by relying too much on pointing and clicking.

The open research paradigm, based on open platforms and transparent community-led governance, has the potential to democratize science and remove unnecessary friction caused by the lack of cooperation between research groups while bringing additional transparency to research methods and outputs.

page 18

Abstract

The recent growth of geographic data science (GDS) fuelled by increasingly available open data and open source tools has influenced urban sciences across a multitude of fields. Yet there is limited application in urban morphology—a science of urban form. Although quantitative approaches to morphological research are finding momentum, existing tools for such analyses have limited scope and are predominantly implemented as plug-ins for standalone geographic information system software. This inherently restricts transparency and reproducibility of research. Simultaneously, the Python ecosystem for GDS is maturing to the point of fully supporting highly specialized morphological analysis. In this paper, we use the open source Python ecosystem in a workflow to illustrate its capabilities in a case study assessing the evolution of urban patterns over six historical periods on a sample of 42 locations. Results show a trajectory of change in the scale and structure of urban form from pre-industrial development to contemporary neighborhoods, with a peak of highest deviation during the post-World War II era of modernism, confirming previous findings. The wholly reproducible method is encapsulated in computational notebooks, illustrating how modern GDS can be applied to urban morphology research to promote open, collaborative, and transparent science, independent of proprietary or otherwise limited software.

Fleischmann, M., Feliciotti, A. and Kerr, W. (2021), Evolution of Urban Patterns: Urban Morphology as an Open Reproducible Data Science. Geogr Anal. https://doi.org/10.1111/gean.12302

Spatial Analytics + Data Talk

On March 30, 2021, I had a chance to deliver a talk as part of the Spatial Analytics + Data Seminar Series organised by the University of Newcastle (Rachel Franklin), the University of Bristol (Levi Wolf) and the Alan Turing Institute. The recording of the event is now available on YouTube.

Spatial Signatures: Dynamic classification of the built environment

This talk introduces the notion of “spatial signatures”, a characterisation of space based on form and function. We know little about how the way cities are organised over space influences social, economic and environmental outcomes, in part because it is hard to measure. The talk presents the first stage of the Urban Grammar AI research project, which develops a conceptual framework to characterise urban structure through the notions of spatial signatures and urban grammar, and will deploy it to generate open data products and insight into the evolution of cities.

The slides are available online at https://urbangrammarai.github.io/talks/202103_sad/.

The journey of an algorithm from QGIS to GeoPandas

This is a short story of one open-source algorithm and its journey from QGIS to mapclassify, to be used within GeoPandas. I am writing it to illustrate the flow within the open-source community, because even though this happens all the time, we normally don’t talk about it. And we should.

The story

Sometime last year, I asked myself a question: how hard would it be to port the topological colouring tool from QGIS to be used with GeoPandas? Considering that this particular QGIS tool is written in Python, it seemed to be an easy task.

For those of you who have never used it, the aim of topological colouring is to assign colours to (usually) polygons in such a way that no two adjacent polygons share the same colour (see the illustration in the embedded tweet below).

Adapting Nyall Dawson’s original algorithm was quite straightforward – the logic of Python algorithms for QGIS and for GeoPandas is the same. So in October, I asked the others what would be the ideal way of sharing it.

The original license was not compatible with the one we use in GeoPandas, and I was not sure if GeoPandas itself was actually the right place for it. While I was thinking about it, Nyall himself made the situation easier and offered to relicense the original code.

However, there was no clear consensus on the best way forward at that time, and the whole idea was set aside until the end of the year, when I decided to keep it simple and release the script as a tiny Python package. And greedy was born.

In the end, greedy offered the original QGIS algorithm for topological colouring plus a few other options on top of that, applicable to all geometry types (unlike QGIS, which is limited to polygons). Releasing it as a standalone package was the simple solution, but not the optimal one – the more scattered the ecosystem is, the more complicated it is to navigate, and people easily miss things.

We could have ended the story here, but then Levi Wolf came up with an idea.

After some discussion, splot was not seen as the best place, but mapclassify was. And after a couple of months, I made a pull request merging greedy into mapclassify.

It is a very simple story, you may say. Yes, but it shows one thing very clearly – the willingness of the community to collaborate across different projects. A lot of people were involved, everyone willing to find the best solution. I think it is worth sharing tiny stories like this.

To see what mapclassify.greedy can do, here is a Jupyter notebook for you. Thanks to everyone involved!

The Code

This is just a quick appendix to the story, outlining the translation of the code from QGIS to a GeoPandas-compatible version.

First thing to say – it is easy! Easier than expected, to be honest. I might have been lucky with the algorithm I chose, but I believe there is scope for other processing tools to be ported this way.

The code of greedy is here and the original code is here.

The QGIS code has a lot of stuff related to the QGIS interface, which can be happily ignored. The core is in def processAlgorithm(self, parameters, context, feedback), and that is the part we should focus on.

Nyall’s code can be roughly divided into three steps (ignoring some interface code):

  1. Compute a graph to understand topological relationships between polygons
  2. Compute a balanced topological colouring
  3. Assign colours to features

To compute the graph, Nyall defines a new class holding the topology and checks which polygons intersect which. I did not want a specific class, because we have libpysal’s weights object taking care of that. Moreover, it comes with an alternative way of inferring the topology as contiguity weights. No need to expensively compute intersections any more (unless we want to – I kept the option there).
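
A rough sketch of the contiguity approach with libpysal (the input file name is hypothetical):

import geopandas
from libpysal import weights

gdf = geopandas.read_file("polygons.gpkg")  # hypothetical input

# contiguity weights infer the adjacency graph from shared
# boundaries, without computing pairwise intersections
w = weights.Queen.from_dataframe(gdf)

# w.neighbors maps each polygon to the list of its neighbours
print(w.neighbors[0])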

The balanced colouring is, in the end, the only part of the code that remains almost entirely original. I made only a few modifications.

Because topological colouring is a known graph problem, there is a selection of algorithms in the networkx library dealing with it, so I simply linked them in.

Finally, the function now returns a pandas.Series, making it simple to assign the resulting values to a GeoDataFrame. The simplest usage is then a single line.

gdf['colors'] = mapclassify.greedy(gdf)

Line simplification algorithms

Sometimes our lines and polygons are way too complicated for their purpose. Let’s say that we have a beautiful shape of Europe, and we want to make an interactive online map using that shape. Soon we figure out that the polygon has too many points: it takes ages to load, it consumes a lot of memory and, in the end, we don’t even see the full detail. To make things easier, we decide to simplify our polygon.

Simplification means that we want to express the same geometry using fewer points, while trying to preserve the original shape as much as we can. The easiest way is to open QGIS and use its Simplify processing tool. Then we face the choice – which simplification method should we use? Douglas-Peucker or Visvalingam? How do they work? What is the difference? What does “tolerance” mean?

This short post aims to answer these questions. I’ll try to explain both of these popular algorithms, so that you can make informed decisions while using them.

First, let’s see how both algorithms simplify the following line.

Relatively complex line with 11 points, which needs to be simplified.

Douglas-Peucker

Douglas-Peucker, or sometimes the Ramer–Douglas–Peucker algorithm, is the better known of the two. Its main aim is to identify the points that are less important for the overall shape of the line and remove them. It does not generate any new points.

The algorithm typically accepts one parameter, tolerance, sometimes called epsilon. To explain how epsilon is used, it is best to start with the principle. Douglas-Peucker is an iterative algorithm – it removes a point, splits the line and starts again until there is no point left that could be removed. In the first step, it draws a line between the first and the last point, as illustrated in the figure below. Then it identifies the point which lies furthest from this line connecting the endpoints. If the distance between the line and that point is less than epsilon, the point is discarded, and the algorithm starts again until there is no point left between the endpoints.

If the distance between the point and the line is larger than epsilon, the first point and the furthest point are connected with another line, and every point closer than epsilon to this line gets discarded. Every time a new furthest point is identified, the original line splits in two and the algorithm continues on each part separately. The animation below shows the whole procedure of simplifying the line above with the Douglas-Peucker algorithm.
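
To make the procedure concrete, here is a compact, unoptimised Python sketch of the recursive variant (illustrative only – library implementations are far more optimised):

import numpy as np

def douglas_peucker(points, epsilon):
    # points is an (n, 2) array; the endpoints are always kept
    start, end = points[0], points[-1]
    line = end - start
    norm = np.hypot(line[0], line[1])
    if norm == 0:
        # degenerate case: the start and end points coincide
        dists = np.hypot(points[:, 0] - start[0], points[:, 1] - start[1])
    else:
        # perpendicular distance of every point from the start-end line
        dists = np.abs(
            (points[:, 0] - start[0]) * line[1]
            - (points[:, 1] - start[1]) * line[0]
        ) / norm
    furthest = np.argmax(dists)
    if dists[furthest] < epsilon:
        # everything is close enough: keep only the endpoints
        return np.array([start, end])
    # otherwise split at the furthest point and recurse on both halves
    left = douglas_peucker(points[: furthest + 1], epsilon)
    right = douglas_peucker(points[furthest:], epsilon)
    return np.vstack([left[:-1], right])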

Visvalingam-Whyatt

Visvalingam-Whyatt shares its aim with Douglas-Peucker – identify points that can be removed. However, the principle is different. The tolerance, or epsilon, is in this case an area, not a distance.

Visvalingam-Whyatt, in the first step, generates triangles between points, as illustrated in the figure below.

Then it identifies the smallest of these triangles and checks whether its area is smaller or larger than epsilon. If it is smaller, the point associated with that triangle gets discarded, and we start again – generate new triangles, identify the smallest one, check and repeat. The algorithm stops when all the generated triangles are larger than epsilon. See the whole simplification process below.
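
Again, a naive Python sketch of the idea (production implementations keep the triangle areas in a priority queue rather than recomputing them each round):

import numpy as np

def triangle_area(a, b, c):
    # area of the triangle formed by three consecutive points
    return 0.5 * abs(
        (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    )

def visvalingam_whyatt(points, epsilon):
    pts = [tuple(pt) for pt in points]
    while len(pts) > 2:
        # area of the triangle around every interior point
        areas = [
            triangle_area(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)
        ]
        smallest = int(np.argmin(areas))
        if areas[smallest] >= epsilon:
            break  # every remaining triangle is larger than the tolerance
        del pts[smallest + 1]  # drop the point of the smallest triangle
    return np.array(pts)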

Mike Bostock made a great explanation of the Visvalingam-Whyatt algorithm with an interactive visualisation.

Which one is better?

You can see from the example above that the final line is the same, but that is not always true – the two algorithms can produce different geometries. Visvalingam-Whyatt tends to produce nicer geometry and is often preferred for the simplification of natural features. Douglas-Peucker tends to produce spiky lines in specific configurations. You can compare the actual behaviour of both in this great example by Michael Barry.

Which one is faster?

Let’s figure it out. I will use a long randomised line and the Python package simplification, which implements both algorithms. The results may vary based on the actual implementation, but using the same package seems fair. I generate a randomised line of 5000 points and then simplify it using both algorithms, with the epsilon of each fine-tuned to return a similar number of points.

import numpy as np
from simplification.cutil import (
    simplify_coords, # this is Douglas-Peucker 
    simplify_coords_vw,  # this is Visvalingam-Whyatt
)

# generate coords of 5000 ordered points as a line
coords = np.sort(np.random.rand(5000, 2), axis=0)

# how many coordinates does DP return with eps=0.0025?
simplify_coords(coords, .0025).shape
# 30 / 5000

# how many coordinates does VW return with eps=0.0001?
simplify_coords_vw(coords, .0001).shape
# 28 / 5000

%%timeit
simplify_coords(coords, .0025)

%%timeit
simplify_coords_vw(coords, .0001)

And the winner is – Douglas-Peucker. By a significant margin.

Douglas-Peucker:

74.1 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Visvalingam-Whyatt:

2.17 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Douglas-Peucker is clearly more performant, but Visvalingam-Whyatt can produce nicer-looking geometry – pick the one you prefer.

Percentage instead of epsilon

Some implementations of simplification algorithms do not offer a tolerance/epsilon parameter but instead ask for a percentage – how many points do you want to keep? One example of this approach is mapshaper by Matthew Bloch. Given the iterative nature of both algorithms, you can figure out how that works :) – or see the sketch below.
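
One way to emulate the percentage approach on top of an epsilon-based implementation is a simple binary search over the tolerance. A sketch using the simplification package from above (assuming coordinates normalised to the unit square; mapshaper itself works differently, ranking points by their effective area):

import numpy as np
from simplification.cutil import simplify_coords

def simplify_to_fraction(coords, fraction, iterations=30):
    # binary-search epsilon until roughly `fraction` of points remain
    target = len(coords) * fraction
    lo, hi = 0.0, 1.0  # epsilon range suitable for unit-square coordinates
    for _ in range(iterations):
        eps = (lo + hi) / 2
        if len(simplify_coords(coords, eps)) > target:
            lo = eps  # too many points kept: simplify more aggressively
        else:
            hi = eps
    return simplify_coords(coords, (lo + hi) / 2)

coords = np.sort(np.random.rand(5000, 2), axis=0)
print(len(simplify_to_fraction(coords, 0.01)))  # roughly 50 points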

What about topology?

It may happen that either algorithm returns an invalid self-intersecting line. Be aware of that. Some implementations (like GEOS, used by Shapely and GeoPandas) provide an optional slower variant that preserves topology, but some don’t, so be careful.

I have gaps between my polygons

If you are trying to simplify a GeoDataFrame or a shapefile, you may be surprised that the simplification creates gaps between polygons where there should not be any. The reason is simple – the algorithm simplifies each polygon separately, so the shared boundaries no longer match.

If you want a nice simplification that preserves topology between all the polygons, like mapshaper does, look into TopoJSON. Without explaining how it works, as it deserves its own post, see the example below for yourself as the last bit of this text.

import topojson as tp

topo = tp.Topology(df, prequantize=False)
topo.toposimplify(5).to_gdf()

If there’s something inaccurate or confusing, let me know.