How to create a vector-based web map hosted on GitHub

This is the map we have created for the Urban Grammar AI project. It was created using an open-source software stack and is hosted on GitHub, for free.

This post will walk you through the whole process of generating the map, step by step, so you can create your own. It is a bit longer than usual, so here is a quick outline for better orientation:

  1. Vector tiles
    1. Zoom levels
    2. Tile format and compression
    3. Generate tiles
  2. Leaflet.js map
    1. Load leaflet.js
    2. Create the map
    3. Add a base map
    4. Add vector tiles
    5. Style the tiles
  3. GitHub Pages
  4. Further styling and interactivity
    1. Labels on top of the map
    2. Popup information
    3. Conditional formatting based on the zoom level
    4. Legend and details

By the end of this tutorial, you will be able to take your vector data and turn them into a fully-fledged vector map hosted on GitHub with no cost involved, for everyone to enjoy.

Vector tiles

Web maps are (usually) based on tiles. That means that every dataset is turned into a large set of tiny squares representing small portions of the map. And if you want to zoom in and out, you need those tiles for every zoom level. So the first step in our process will be the creation of vector tiles.

The easiest option is to use tippecanoe, a tool created by Mapbox that allows you to create vector tiles from any GeoJSON input. You should export only the columns you need and make sure that the geometry is in EPSG:4326 projection (lat/long) – that is what tippecanoe expects. Something like this will do if you have your data in geopandas.GeoDataFrame:

gdf[["signature_type", "geometry"]].to_crs(4326).to_file(
    "signatures_4326.geojson"
)

Now you have a signatures_4326.geojson with geometry and a column encoding the signature type, which means it is time to use tippecanoe (check the installation instructions).

tippecanoe is a command-line tool, so fire up your favourite terminal, or use it within a Jupyter notebook (as we did). There are some options to set but let’s talk about the most important ones (you can check the rest in the documentation).

Zoom levels

Every map has a range of zoom levels at which the shown data is meaningful. If you’re comparing states, you don’t need to zoom to a level of a street, and if you’re showing building-level data, you don’t need a zoom to see the whole continent.

Zoom level 0 means you are zoomed out to see the whole world. It is a single tile. Zoom level 1 is a bit closer and is split into 4 tiles. Zoom level 2 has 16 tiles, level 3 has 64 and so on. Zoom level 16, which shows a street-level detail, consists of 4 294 967 296 tiles if you want to create them for the whole world. That is a lot of data, so you need to be mindful of which detail you actually need because the difference between zooming up to level 15 and up to level 16 is huge. OpenStreetMap has a nice guide on zoom levels if you are interested in more details.
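The quadrupling is easy to check yourself; a quick back-of-the-envelope sketch in Python:

```python
# Number of XYZ tiles per zoom level: each level splits every
# tile into four, so the count is 4 ** zoom.
def tiles_at_zoom(zoom):
    return 4 ** zoom

for zoom in (0, 1, 2, 3, 15, 16):
    print(zoom, tiles_at_zoom(zoom))
# zoom 16 -> 4294967296 tiles for the whole world
```

The jump from zoom 15 (about 1 billion tiles) to zoom 16 (over 4 billion) is exactly why the maximum zoom matters.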

For the signature geometry that represents neighbourhood-level classification, level 15 feels like the reasonable maximum, so you have the first option – -z15.

Tile format and compression

In later steps, you will use the Leaflet.VectorGrid plugin to load the tiles. To make that work, you need them as uncompressed protobuf files (not an .mbtiles archive), which gives you another two options – --no-tile-compression to disable the compression and --output-to-directory telling tippecanoe to write a directory of tiles instead of an mbtiles file.

We have used a few more options, but you may not need those.

Generate tiles

You have all you need to create the tiles:

tippecanoe -z15 \
           --no-tile-compression \
           --output-to-directory=tiles/ \
           signatures_4326.geojson

The command above sets the maximum zoom to 15 so you don’t create overly detailed tiles (z15), disables tile compression (--no-tile-compression) and saves the result to the tiles/ directory (--output-to-directory=tiles/), based on your GeoJSON file (signatures_4326.geojson).

This step may take a bit of time, but once it is done, you will see thousands of files in the tiles/ folder. It is time to show them on a map.

Leaflet.js map

One of the best tools for interactive maps is leaflet.js, an open-source JavaScript library. With the right plugins, it is an ideal tool for almost any web map.

The whole map and its JavaScript code will be served from a single HTML file. Let's start with some basic structure, with comments as placeholders you will fill in later:

<!DOCTYPE html>
<html>

<head>
    <title>Spatial Signatures in Great Britain</title>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <!-- load leaflet.js -->
</head>

<body style='margin:0'>
        <!-- div containing map -->
        <!-- specification of leaflet map -->
</body>

</html>

Load leaflet.js

To load the leaflet.js, you need to get its source script and CSS styles. Both are usually fetched over the web. You can then replace <!-- load leaflet.js --> with the following snippet:

<!-- load leaflet.js -->
<!-- load CSS -->
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.7.1/dist/leaflet.css"/>

<!-- load JS -->
<script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js"></script>

However, the basic leaflet.js is not able to work with vector tiles, so you need to load a Leaflet.VectorGrid extension on top. Just add this to the snippet:

<!-- load VectorGrid extension -->
<script src="https://unpkg.com/leaflet.vectorgrid@1.3.0/dist/Leaflet.VectorGrid.bundled.js"></script> 

At this moment, you have everything you need to create the map in your file.

Create the map

You need two things – a div object where the map will live and a bit of JavaScript specifying what the map should do. The former is easy:

<!-- div containing map -->
<div id="map" style="width: 100vw; height: 100vh; background: #fdfdfd"></div>

This div is empty, but the important bit is id="map", which will link it to the JavaScript. It also has some basic styles ensuring that it covers the whole screen (width: 100vw; height: 100vh;) and has an almost-white background (background: #fdfdfd).

The second part is the actual JavaScript generating the map.

<script>
// create map
var map = L.map('map', {
    center: [53.4107, -2.9704],
    minZoom: 6,
    maxZoom: 15,
    zoomControl: true,
    zoom: 12,
});

// other map specifications
</script>

Let’s unpack this. It creates a variable called map (var map), which is a leaflet map object (L.map). The object is linked to the HTML element (in this case the div) with the id map you created above (“map”). It also takes a dictionary of basic options: it sets the centre of the map to Liverpool (center: [53.4107, -2.9704], where the numbers are latitude and longitude), limits the zoom levels to 6 to 15 (minZoom: 6 and maxZoom: 15), turns on the zoom control (the +- buttons) (zoomControl: true) and sets the default zoom to 12 (zoom: 12). Zoom level 6, selected as the minimum, is approximately the level at which you see the whole of Great Britain. Since spatial signatures cover only GB, there is no need to zoom further out. The maximum zoom of 15 mirrors the maximum zoom of the tiles generated earlier; you would get no tiles if you zoomed in closer than that.

The map object is ready! There’s nothing there now, but that will change soon.

Add a base map

It is nice to have a base map in place before adding other layers, just to make sure everything works. You can add the default OpenStreetMap or something more subtle such as the CartoDB Positron map.

// add background basemap
var mapBaseLayer = L.tileLayer(
    'https://{s}.basemaps.cartocdn.com/rastertiles/light_all/{z}/{x}/{y}.png', {
        attribution: '(C) OpenStreetMap contributors (C) CARTO'
    }
).addTo(map);

This creates a leaflet tile layer (L.tileLayer) with the URL of CartoDB raster XYZ tiles and attribution of the origin of tiles. Finally, it adds the layer to our map (.addTo(map)). Now you can actually see a real map.

Add vector tiles

Now for the key part – vector tiles. We need to define quite a few things here.

URL

Let’s first set the URL pointing to each tile.

// get vector tiles URL
var mapUrl = "tiles/{z}/{x}/{y}.pbf";

Tiles are in the tiles/ folder where tippecanoe dumped them. They are organised into folders, following the XYZ tiling standard. That means that they are organised according to zoom level ({z}) and tile coordinates within each level ({x} and {y}). You then need to specify the URL to tiles with placeholders that will be automatically filled by leaflet.js. Here you save the URL to the variable mapUrl.
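If you are curious which tile a given point falls into, the standard slippy-map formula can be sketched in Python (the coordinates below are the map's Liverpool centre; the printed indices tell you which tiles/{z}/{x}/{y}.pbf file leaflet.js would request):

```python
import math

def latlon_to_tile(lat, lon, zoom):
    """XYZ tile indices containing a lat/lon point (slippy map scheme)."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# The map's centre (Liverpool) at zoom 12:
print(latlon_to_tile(53.4107, -2.9704, 12))
```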

Options

Another variable you may want to specify is a dictionary with options for the tiles.

// define options of vector tiles
var mapVectorTileOptions = {
    rendererFactory: L.canvas.tile,
    interactive: true,
    attribution: '(C) Martin Fleischmann',
    maxNativeZoom: 15,
    minZoom: 6,
};

The snippet creates a variable (mapVectorTileOptions) specifying that the tiles should be rendered onto a canvas (rendererFactory: L.canvas.tile), that they should allow interactivity (interactive: true) as we will need it later, and it sets the attribution (attribution: '(C) Martin Fleischmann') and the min and max zoom levels.

Vector tile layer

Now you can define the vector tile layer itself.

var mapPbfLayer = new L.VectorGrid.Protobuf(
    mapUrl, mapVectorTileOptions
)

This creates a new variable (mapPbfLayer) to store vector grid protobuf layer (new L.VectorGrid.Protobuf) based on the URL and tile options specified above.

Add to map

Finally, you can add the created layer to the map.

// add VectorGrid layer to map
mapPbfLayer.addTo(map);

At this point, you should be able to see your vector tiles on your map with no styles. However, to be able to use it locally, you need to start a server first. An easy way is to do it with Python. In the terminal, navigate to the folder with your html file (something like cd path/to/my_folder) and start the http server:

python -m http.server 8000

If you now open http://localhost:8000 in your browser, you should be able to open your html file (it will open automatically if it is called index.html) and see your map. It should look like this:

Style the tiles

The GeoJSON used to create these tiles has a single attribute column called signature_type. What you want to do now is to use this column to style the map to your liking.

Let’s assume that you know which RGB colours should apply to which signature type code. Let’s save those in a dictionary, so you can use them later.

// mapping of colors to signature types
const cmap = {
    "0_0": "#f2e6c7",
    "1_0": "#8fa37e",
    "3_0": "#d7a59f",
    "4_0": "#d7ded1",
    "5_0": "#c3abaf",
    "6_0": "#e4cbc8",
    "7_0": "#c2d0d9",
    "8_0": "#f0d17d",
    "2_0": "#678ea6",
    "2_1": "#94666e",
    "2_2": "#efc758",
    "9_0": "#3b6e8c",
    "9_1": "#333432",
    "9_2": "#ab888e",
    "9_3": "#c1c1c0",
    "9_4": "#bc5b4f",
    "9_5": "#a7b799",
    "9_6": "#c1c1c0",
    "9_7": "#c1c1c0",
    "9_8": "#c1c1c0"
};

This dictionary is simple: the left side encodes the signature type code, the right side the colour as a HEX code.

Having colours, let’s define a new variable with styles and unpack it a bit.

// define styling of vector tiles
var vectorTileStyling = {
    signatures_4326: function(properties) {
        return ({
            fill: true,
                        fillColor: cmap[properties.signature_type],
            fillOpacity: 0.9,
            weight: 1,
            color: "#ffffff",
            opacity: 1.0,
        });
    }
}

You need to create a function mapping the styles to geometries based on the signature type property. For the layer signatures_4326 (the name matches the input GeoJSON), you define a function that receives the layer properties (function(properties)) and specifies that each polygon should be filled with a colour (fill: true), that the fill colour should be retrieved from the dictionary defined above using the signature type (fillColor: cmap[properties.signature_type]), with a fill opacity of 0.9 so a bit of the background shows through (fillOpacity: 0.9). It also specifies the outline of each polygon as a white line of weight 1.

Then you just need to edit the vector tile options (the mapVectorTileOptions dictionary from above) and add tile styling.

        // ...
        minZoom: 6,
        vectorTileLayerStyles: vectorTileStyling,  // this line is new
};

At this point, the basic map is done, and you can try to get it online or style it a bit more (scroll below for that).

This is the whole code of the page for reference.

<!DOCTYPE html>
<html>

<head>
    <title>Spatial Signatures in Great Britain</title>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <!-- load leaflet.js -->
    <!-- load CSS -->
    <link rel="stylesheet" href="https://unpkg.com/leaflet@1.7.1/dist/leaflet.css" />

    <!-- load JS -->
    <script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js"></script>

    <!-- load VectorGrid extension -->
    <script src="https://unpkg.com/leaflet.vectorgrid@1.3.0/dist/Leaflet.VectorGrid.bundled.js"></script>
</head>

<body style='margin:0'>
    <!-- div containing map -->
    <div id="map" style="width: 100vw; height: 100vh; background: #fdfdfd"></div>
    <!-- specification of leaflet map -->
    <script>
        // create map
        var map = L.map('map', {
            center: [53.4107, -2.9704],
            minZoom: 6,
            maxZoom: 15,
            zoomControl: true,
            zoom: 12,
        });

        // add background basemap
        var mapBaseLayer = L.tileLayer(
            'https://{s}.basemaps.cartocdn.com/rastertiles/light_all/{z}/{x}/{y}.png', {
                attribution: '(C) OpenStreetMap contributors (C) CARTO'
            }
        ).addTo(map);

        // get vector tiles URL
        var mapUrl = "tiles/{z}/{x}/{y}.pbf";

        // mapping of colors to signature types
        const cmap = {
            "0_0": "#f2e6c7",
            "1_0": "#8fa37e",
            "3_0": "#d7a59f",
            "4_0": "#d7ded1",
            "5_0": "#c3abaf",
            "6_0": "#e4cbc8",
            "7_0": "#c2d0d9",
            "8_0": "#f0d17d",
            "2_0": "#678ea6",
            "2_1": "#94666e",
            "2_2": "#efc758",
            "9_0": "#3b6e8c",
            "9_1": "#333432",
            "9_2": "#ab888e",
            "9_3": "#c1c1c0",
            "9_4": "#bc5b4f",
            "9_5": "#a7b799",
            "9_6": "#c1c1c0",
            "9_7": "#c1c1c0",
            "9_8": "#c1c1c0"
        };

        // define styling of vector tiles
        var vectorTileStyling = {
            signatures_4326: function(properties) {
                return ({
                    fill: true,
                    fillColor: cmap[properties.signature_type],
                    fillOpacity: 0.9,
                    weight: 1,
                    color: "#ffffff",
                    opacity: 1.0,
                });
            }
        }

        // define options of vector tiles
        var mapVectorTileOptions = {
            rendererFactory: L.canvas.tile,
            interactive: true,
            attribution: '(C) Martin Fleischmann',
            maxNativeZoom: 15,
            minZoom: 6,
            vectorTileLayerStyles: vectorTileStyling,
        };

        // create VectorGrid layer
        var mapPbfLayer = new L.VectorGrid.Protobuf(
            mapUrl, mapVectorTileOptions
        )

        // add VectorGrid layer to map
        mapPbfLayer.addTo(map);
    </script>
</body>

</html>

GitHub Pages

You can use GitHub Pages to host the whole map online. GitHub has a size limit of 100 MB for individual files. Considering that the original GeoJSON is 2.2 GB, that may sound like a problem. But remember that the data are now cut into small tiles, each only a few KB in size. Create a repository on GitHub, upload your tiles/ folder and the HTML file you have just created (named index.html), and set up GitHub Pages hosting.

Once your data are uploaded, create an empty file called .nojekyll, which tells GitHub Pages to serve the repository contents as they are (what it really says is “do not use Jekyll to compile the page”), then go to Settings > Pages and enable GitHub Pages.

🎉 Your web map is live and accessible by anyone online. GitHub Pages settings show you the exact link. It is as easy as that.

Tip: Once you have everything set up on GitHub, use github.dev to make further changes to the HTML file without downloading and indexing the whole repository (just hit . when you are in your repository on GitHub).

Our map lives in the urbangrammarai/great-britain repository if you want to check the source code and is available on https://urbangrammarai.xyz/great-britain/ (we use the custom domain on GitHub Pages).

Further styling and interactivity

You may have noticed that our map has a couple of features on top of this. Let’s quickly go through those.

Labels on top of the map

You can use another XYZ base map that has just labels and load it on top of your tiles to see the names of places (or roads or something else). The order of layers matches the order in which you use .addTo(map).

Adding CartoDB Positron labels (this needs to come after mapPbfLayer.addTo(map)):

// add labels basemap
var mapBaseLayer = L.tileLayer(
    'https://{s}.basemaps.cartocdn.com/rastertiles/light_only_labels/{z}/{x}/{y}.png', {
        attribution: '(C) OpenStreetMap contributors (C) CARTO'
    }
).addTo(map);

Popup information

You may want to show some additional information when you click on a geometry. Let’s show the name of the signature type based on its code. First, create a dictionary with the info:

// mapping of names to signature types to be shown in popup on click
const popup_info = {
    "0_0": "Countryside agriculture",
    "1_0": "Accessible suburbia",
    "3_0": "Open sprawl",
    "4_0": "Wild countryside",
    "5_0": "Warehouse/Park land",
    "6_0": "Gridded residential quarters",
    "7_0": "Urban buffer",
    "8_0": "Disconnected suburbia",
    "2_0": "Dense residential neighbourhoods",
    "2_1": "Connected residential neighbourhoods",
    "2_2": "Dense urban neighbourhoods",
    "9_0": "Local urbanity",
    "9_1": "Concentrated urbanity",
    "9_2": "Regional urbanity",
    "9_4": "Metropolitan urbanity",
    "9_5": "Hyper concentrated urbanity",
};

Then you need to link the on click popup element to the VectorGrid object.

// create VectorGrid layer
var mapPbfLayer = new L.VectorGrid.Protobuf(
    mapUrl, mapVectorTileOptions
).on('click', function (e) { // this line and below
    L.popup()
        .setContent(popup_info[e.layer.properties.signature_type])
        .setLatLng(e.latlng)
        .openOn(map);
});

Conditional formatting based on the zoom level

The white outline can be a bit thick when you zoom out, but there is a way to control that. You can specify the condition in a variable and pass the variable to the tile styling dictionary.

// define styling of vector tiles
var vectorTileStyling = {
    signatures_4326: function(properties, zoom) {
                // set line weight conditionally based on zoom
        var weight = 0;
        if (zoom > 12) {
            weight = 1.0;
        }
        return ({
            fill: true,
            fillColor: cmap[properties.signature_type],
            fillOpacity: 0.9,
            weight: weight, // pass the weight variable instead of a value
            color: "#ffffff",
            opacity: 1.0,
        });
    }
}

The function signature has a new zoom attribute you can use in the condition.

Legend and details

A legend and details are just custom HTML objects detached from the leaflet map. You can check the source code to see how they’re done.

Understanding the structure of cities through the lens of data

The workshop, organised together with James D. Gaboardi during the Spatial Data Science Symposium 2022, is now available online. See the recording below and access the workshop material on GitHub, from which you can even run the code online, in your browser.

Annotation

Martin & James will walk you through the fundamentals of analysis of the structure of cities. You will learn what can be measured, how to do that in Python, and how to use those numbers to capture the patterns that make up cities. Using a real-life example, you will learn every step of the method, from data retrieval to detection of individual patterns.

Capturing the Structure of Cities with Data Science

During the Spatial Data Science Conference 2021, I had a chance to deliver a workshop illustrating the application of PySAL and momepy in understanding the structure of cities. The recording is now available for everyone. The materials are available on my GitHub and you can even run the whole notebook in your browser using the MyBinder service.

Clustergam: visualisation of cluster analysis

In this post, I introduce a new Python package to generate clustergrams from clustering solutions. The library has been developed as part of the Urban Grammar research project, and it is compatible with scikit-learn and GPU-enabled libraries such as cuML or cuDF within RAPIDS.AI.


When we want to do some cluster analysis to identify groups in our data, we often use algorithms like K-Means, which require the specification of a number of clusters. But the issue is that we usually don’t know how many clusters there are.

There are many methods to determine the correct number, like silhouette scores or the elbow plot, to name a few. But they usually don’t give much insight into what is happening between the different options, so the numbers are a bit abstract.

Matthias Schonlau proposed another approach – a clustergram. Clustergram is a two-dimensional plot capturing the flows of observations between classes as you add more clusters. It tells you how your data reshuffles and how good your splits are. Tal Galili later implemented clustergram for K-Means in R. And I have used Tal’s implementation, ported it to Python and created clustergram – a Python package to make clustergrams.

clustergram currently supports K-Means using scikit-learn (including the Mini-Batch implementation) and RAPIDS.AI cuML (if you have a CUDA-enabled GPU), Gaussian Mixture Model (scikit-learn only) and hierarchical clustering based on scipy.cluster.hierarchy. Alternatively, you can create a clustergram from labels and data derived from custom clustering algorithms. It provides a sklearn-like API and plots the clustergram using matplotlib, which gives it a wide range of styling options to match your publication style.

Install

You can install clustergram from conda or pip:

conda install clustergram -c conda-forge

or

pip install clustergram

In any case, you still need to install your selected backend (scikit-learn and scipy or cuML).

from clustergram import Clustergram
import urbangrammar_graphics as ugg
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

sns.set(style='whitegrid')

Let us look at some examples to understand how clustergram looks and what to do with it.

Iris flower data set

The first example we try to analyse using a clustergram is the famous Iris flower data set. It contains data on three species of Iris flowers, measuring sepal width and length and petal width and length. We can start with some exploration:

iris = sns.load_dataset("iris")
g = sns.pairplot(iris, hue="species", palette=ugg.COLORS[1:4])
g.fig.suptitle("Iris flowers", y=1.01)

It seems that setosa is a relatively well-defined group, while the difference between versicolor and virginica is smaller as they partially overlap (or entirely in the case of sepal width).

Okay, so we know how the data look. Now we can check what the clustergram looks like. Remember that we know there are three clusters, and we should ideally be able to recognise this from the clustergram. I say ideally because even though there are known labels, it does not mean that our data or clustering method are able to distinguish between those classes.

Let’s start with K-Means clustering. To get a stable result, we can run a clustergram with 1000 initialisations.

data = scale(iris.drop(columns=['species']))

cgram = Clustergram(range(1, 10), n_init=1000)
cgram.fit(data)

ax = cgram.plot(
    figsize=(10, 8),
    line_style=dict(color=ugg.COLORS[1]),
    cluster_style={"color": ugg.COLORS[2]},
)
ax.yaxis.grid(False)
sns.despine(offset=10)
ax.set_title('K-Means (scikit-learn)')

On the x axis, we can see the number of clusters. Points represent a centre of each cluster (by default) weighted by the first principal component (that helps with the diagram’s readability). The lines connecting points and their thickness represent observations moving between clusters. Therefore, we can read when new clusters are formed as a split of a single existing class and when they are formed based on observations from two clusters.

We’re looking for the separation, i.e., did an additional cluster bring any meaningful split? The step from one cluster to two is a big one – nice and clear separation. From two to three, we also have quite a nice split in the top branch. But from 3 to 4, there is no visible difference because the new fourth cluster is almost the same as the existing bottom branch. Although it is now separated into two, this split does not give us much information. Therefore, we could conclude that the ideal number of clusters for Iris data is three.

We can also check some additional information, like the silhouette score or the Calinski-Harabasz score.

fig, axs = plt.subplots(2, figsize=(10, 10), sharex=True)
cgram.silhouette_score().plot(
    xlabel="Number of clusters (k)",
    ylabel="Silhouette score",
    color=ugg.COLORS[1],
    ax=axs[0],
)
cgram.calinski_harabasz_score().plot(
    xlabel="Number of clusters (k)",
    ylabel="Calinski-Harabasz score",
    color=ugg.COLORS[1],
    ax=axs[1],
)
sns.despine(offset=10)

These plots would suggest 3-4 clusters, similarly to clustergram, but they are not very conclusive.

Palmer penguins data set

Now let’s try different data, where the clusters are a bit more complicated to assess. The Palmer penguins data set contains data similar to the Iris example, but it measures several attributes of three species of penguins.

penguins = sns.load_dataset("penguins")

g = sns.pairplot(penguins, hue="species", palette=ugg.COLORS[3:])
g.fig.suptitle("Palmer penguins", y=1.01)

Looking at the pairplot, we see that the overlap between species is much higher than before. It will likely be much more complicated to identify them. Again, we know that there are three clusters, but that does not mean the data have the power to distinguish between them. In this case, it may be especially tricky to differentiate between Adelie and Chinstrap penguins.

data = scale(penguins.drop(columns=['species', 'island', 'sex']).dropna())

cgram = Clustergram(range(1, 10), n_init=1000)
cgram.fit(data)

ax = cgram.plot(
    figsize=(10, 8),
    line_style=dict(color=ugg.COLORS[1]),
    cluster_style={"color": ugg.COLORS[2]},
)
ax.yaxis.grid(False)
sns.despine(offset=10)
ax.set_title("K-Means (scikit-learn)")

We’re looking for separations, and this clustergram shows plenty. It is actually quite complicated to determine the optimal number of clusters. However, since we know what happens between the different options, we can play with that. If we have a reason to be conservative, we can go with 4 clusters (I know, it is already more than the initial number of species). But further splits are also reasonable, which indicates that even higher granularity may provide useful insight and that there might be meaningful groups.

Can we say it is three? Since we know it should be three… Well, not really. The difference between the split from 2 – 3 and that from 3 – 4 is slight. However, the culprit here is K-Means, not clustergram. It just simply cannot correctly cluster these data due to the overlaps and the overall structure.

Let’s have a look at how the Gaussian Mixture Model does.

cgram = Clustergram(range(1, 10), n_init=100, method="gmm")
cgram.fit(data)

ax = cgram.plot(
    figsize=(10, 8),
    line_style=dict(color=ugg.COLORS[1]),
    cluster_style={"color": ugg.COLORS[2]},
)
ax.yaxis.grid(False)
sns.despine(offset=10)
ax.set_title("Gaussian Mixture Model (scikit-learn)")

The result is very similar, though the difference between the third and fourth split is more pronounced. Even here, I would probably go with a four cluster solution.

A situation like this happens very often. The ideal case does not exist. We ultimately need to make a decision on the optimal number of clusters. Clustergram gives us additional insight into what happens between the different options and how the data split. We can tell that the four-cluster option in the Iris data is not helpful. We can tell that Palmer penguins may be tricky to cluster using K-Means and that there is no decisive right solution. Clustergram does not give an easy answer, but it gives us additional insight, and it is up to us how we interpret it.

You can install clustergram using conda install clustergram -c conda-forge or pip install clustergram. In any case, you will still need to install a clustering backend, either scikit-learn or cuML. The documentation is available at clustergram.readthedocs.io, and the source code is on github.com/martinfleis/clustergram, released under MIT license.

If you want to play with the examples used in this article, the Jupyter notebook is on GitHub. You can also run it in an interactive binder environment in your browser.

For more information, check Tal Galili’s blog post and original papers by Matthias Schonlau.

Give it a go!

3 – 10 = 65529. What?

Yes, the formula above is correct. Well, it depends on what we mean by correct.

NDVI does not make sense

Imagine the following situation. We have fetched a cloud-free mosaic of Sentinel 2 satellite data and want to measure NDVI (Normalised difference vegetation index), which uses red and near-infrared bands within this simple formula.

    \[\mathrm{NDVI}=\frac{(\mathrm{NIR}-\mathrm{Red})}{(\mathrm{NIR}+\mathrm{Red})}\]

The results are normalised, which in this case means that they lie between -1 and 1. Always.

We open our raster data using xarray and have all 4 bands in a single xarray.Dataset. The code to measure NDVI is then simple.

>>> red = data.sel(band=1)  # select red band
>>> nir = data.sel(band=4)  # select near-infrared band

>>> ndvi = (nir - red) / (nir + red)  # compute NDVI

And, a surprise! Our results are between 0 and 170. That is certainly not correct. What has happened?

16-bit unsigned integer

The data coming from Sentinel 2 are stored as 16-bit unsigned integer (uint16). That means that the values the array can hold can be anything between 0 and 2^16 - 1 (65,535). Remember that NDVI is between -1 and 1. Does it mean that uint16 cannot represent NDVI values? Yes, precisely.
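You can confirm the range directly with numpy:

```python
import numpy as np

# The representable range of a 16-bit unsigned integer:
info = np.iinfo(np.uint16)
print(info.min, info.max)  # 0 65535
```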

Look at this toy example to understand what is happening during the computation of NDVI in uint16. Let’s have an array with four numbers and subtract 10 from each of them.

>>> array = numpy.array([1, 3, 6, 9], dtype='uint16')
>>> array - 10

array([65527, 65529, 65532, 65535], dtype=uint16)

Yes, as weird as it is, it is correct. The result should be negative, but we can’t have negative values in uint16. So the counter rolls over: the arithmetic wraps around modulo 65,536, so 3 – 10 = -7 becomes 65,536 – 7 = 65,529.

It is exactly like the rollover of an odometer. We ran out of values, so they started over. The only difference is that each number is encoded with 16 binary digits, not decimal ones.

Odometer rollover. By Hellbus - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=3089111
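In other words, the rollover is just arithmetic modulo 2^16, which you can reproduce in plain Python and in numpy:

```python
import numpy as np

# The rollover is arithmetic modulo 2**16:
print((3 - 10) % 2**16)  # 65529

# numpy's uint16 does exactly this:
a = np.array([3], dtype="uint16")
print(a - 10)  # [65529]
```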

The fix is easy. We have to use a data type that does not limit us like this, like a 64-bit integer.

>>> array.astype('int64') - 10

array([-9, -7, -4, -1])

Compared to a 64-bit integer, a 16-bit integer is efficient since the resulting file is much smaller (that is why it is used in the first place), but it can be limiting.
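A minimal numpy sketch of the fix, using made-up reflectance values for two pixels; the same astype call works on an xarray object:

```python
import numpy as np

# Hypothetical uint16 reflectance values for two pixels:
red = np.array([600, 1200], dtype="uint16")
nir = np.array([3000, 900], dtype="uint16")

# Casting to a float type first lets the subtraction go negative
# and the division produce fractions:
ndvi = (nir.astype("float64") - red) / (nir.astype("float64") + red)
print(ndvi)  # both values now fall within [-1, 1]
```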

Be aware of your data types, so you don’t make the same mistake we did ;).