Population as a raster grid

Until now, all the data you were working with were tables. However, not everything is a table. Raster data are not that common in social geography, but spatial data science is full of them, from satellite imagery to population grids. In this session, you will learn how to work with spatial raster data in Python and how to link raster to vector data using the ecosystem around the xarray package.

Arrays and their many dimensions

Raster data are represented as arrays. Those can take many forms and shapes. You already know pandas data structures, so let’s start with those.

A pandas.Series is a 1-dimensional array with an index. A typical array contains values of the same data type (e.g. float numbers), as does a typical Series.
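A minimal illustration with made-up values:

import pandas as pd

# A 1-dimensional array of floats, aligned with an index.
pd.Series([1.5, 3.2, 0.0, 7.8], index=["a", "b", "c", "d"], name="population")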

When it comes to geospatial raster data, one dimension is not enough. Even the most basic raster, something like a digital terrain model (DTM), requires two dimensions. One represents longitude (or x when projected), while the other latitude (or y), resulting in a 2-dimensional array.

But you don’t have to stop there. Take a typical satellite image. The longitude and latitude dimensions are still present, but you have different bands representing blue, green, red and often near-infrared frequencies, resulting in a 3-dimensional array (lon, lat, band). Throw in time, and you’re now dealing with a 4-dimensional array (lon, lat, band, time).

All these use cases fall under the umbrella of N-dimensional array handling covered by the xarray package. Whereas a pandas.Series is a 1-dimensional array with an index, xarray.DataArray is an N-dimensional array with N indexes. Combining multiple Series gives you a pandas.DataFrame, where each column can have a different data type (e.g. one holding numbers, another holding names). Combining multiple xarray.DataArrays gives you an xarray.Dataset, where each array can have a different data type. There’s a lot of similarity between pandas and xarray, but also some differences.
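To make the analogy concrete, here is a minimal sketch of a tiny 2-dimensional DataArray and a Dataset wrapping it (the values are made up and purely illustrative):

import xarray as xr

# A 2-dimensional array with one index (coordinate) per dimension.
tiny = xr.DataArray(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    coords={"y": [50.1, 50.0], "x": [14.0, 14.1, 14.2]},
    dims=("y", "x"),
    name="elevation",
)
tiny.to_dataset()  # a Dataset holding this single named DataArray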

Let’s read some raster data and explore xarray objects in practice.

import datashader as ds
import geopandas as gpd
import rioxarray
import xarray as xr
import osmnx as ox
import pandas as pd
import matplotlib.pyplot as plt
import xvec

Population grids

Today, you will be working with data from the Global Human Settlement Layer (GHSL) developed by the Joint Research Centre of the European Commission. Unlike in all previous hands-on sessions, the data is not pre-processed, and you could read it directly from the open data repository. However, since that repository seems to be a bit unstable lately, use the copy stored as part of this course.

The first layer you will open is a population grid. GHSL covers the whole world divided into a set of tiles, each covering an area of 1,000 by 1,000 km at a resolution of 100 m per pixel. The link below points to a single tile covering most of Eastern Europe.

1pop_url = (
    "https://martinfleischmann.net/sds/raster_data/data/"
    "GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
      )
pop_url
1
The URL is long. It may be better to write it as a multi-line string to avoid a long line.
'https://martinfleischmann.net/sds/raster_data/data/GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip'

You can, alternatively, try reading the original data directly using this URL:

pop_url = (
    "https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/"
    "GHS_POP_GLOBE_R2023A/GHS_POP_E2030_GLOBE_R2023A_54009_100/"
    "V1-0/tiles/GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
      )

The pop_url points to a ZIP file. Within that ZIP file is a GeoTIFF containing the actual raster. There is often no need to download and unzip the file as there’s a good chance you can read it directly.

Reading rasters with rioxarray

xarray, like pandas, is a domain-agnostic library. It is designed for N-dimensional arrays, but not necessarily geospatial ones (although that is often the case). This means that, by default, it is not able to read geospatial file formats like GeoTIFF. That is where rioxarray comes in. It adds support for the usual geo-specific needs, such as reading geospatial file formats or handling the CRS.

1p = f"zip+{pop_url}!GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.tif"
2population = rioxarray.open_rasterio(p, masked=True)
population
1
Create a path to the file inside the ZIP. Add the "zip+" prefix and then the path to the actual file inside the archive, starting with "!".
2
Use rioxarray to open the file via the lower-level rasterio package. Passing masked=True ensures that missing values are properly masked out.
<xarray.DataArray (band: 1, y: 10000, x: 10000)> Size: 800MB
[100000000 values with dtype=float64]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 80kB 9.59e+05 9.592e+05 ... 1.959e+06 1.959e+06
  * y            (y) float64 80kB 6e+06 6e+06 6e+06 6e+06 ... 5e+06 5e+06 5e+06
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0

Above, you can see the representation of the population grid as a DataArray. It has three dimensions ("band", "y", "x") with a shape of 1x10,000x10,000 and float values.

rioxarray gives you a handy .rio accessor on xarray objects, allowing you to access geospatial-specific tools, such as retrieving the CRS.

population.rio.crs
CRS.from_wkt('PROJCS["World_Mollweide",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]],PROJECTION["Mollweide"],PARAMETER["central_meridian",0],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]')

Or the extent of the raster (in the CRS shown above).

population.rio.bounds()
(959000.0, 5000000.0, 1959000.0, 6000000.0)
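The same accessor exposes more of the raster’s spatial metadata, for example the pixel resolution or the affine transform (a quick sketch; the exact printed values depend on the data you loaded):

print(population.rio.resolution())  # pixel size in CRS units; for this 100 m grid, roughly (100.0, -100.0)
print(population.rio.transform())   # affine transform mapping array indices to coordinates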

The missing, masked data can be represented as a specific value, especially when dealing with integer arrays. You can check which one is used:

population.rio.nodata
nan

Plotting with datashader

Plotting a raster with a resolution of 10,000x10,000 pixels can be tricky. Often, the resolution is even larger than that. The best way to plot such data is to resample it to a smaller resolution that better fits the screen. A handy tool that can do that quickly is datashader. Let’s use it to plot the array as 600x600 pixels.

1canvas = ds.Canvas(plot_width=600, plot_height=600)
2agg = canvas.raster(population.where(population>0).sel(band=1))
agg
1
Create a canvas with a specific resolution.
2
Keep only pixels with a population greater than 0 (population.where(population>0)), select a single band to get a 2-dimensional array (.sel(band=1)), and pass the result to the canvas.
<xarray.DataArray (y: 600, x: 600)> Size: 3MB
array([[4.21513869, 3.52179019, 6.29294423, ..., 1.66938128, 0.82322911,
               nan],
       [1.04041781, 1.57566844, 4.50548627, ...,        nan, 0.15449866,
               nan],
       [       nan, 3.18057789, 2.21502486, ...,        nan,        nan,
               nan],
       ...,
       [       nan,        nan,        nan, ..., 4.05967611, 1.9045163 ,
        3.70262405],
       [       nan,        nan,        nan, ..., 7.2545469 , 6.44937184,
        1.5507871 ],
       [       nan,        nan,        nan, ..., 0.76526973, 1.16737383,
        0.21985548]])
Coordinates:
  * x        (x) float64 5kB 9.598e+05 9.615e+05 ... 1.957e+06 1.958e+06
  * y        (y) float64 5kB 5.999e+06 5.998e+06 ... 5.002e+06 5.001e+06
Attributes:
    res:      100.0
    x_range:  (959000.0, 1959000.0)
    y_range:  (5000000.0, 6000000.0)

You can see that the result is a new xarray.DataArray with a resolution of 600x600. The built-in matplotlib-based plotting can easily handle that.

_ = agg.plot()

Population grid resampled to 600x600 pixels

Clipping based on geometry

Dealing with large rasters is often impractical if you are interested only in a small subset, for example, one representing a single city.

Functional Urban Areas

In this case, you may want to work only with the data covering Budapest, Hungary, defined by its functional urban area (FUA), available as another GHSL data product. FUAs are distributed as a single GeoPackage with vector geometries.

1fua_url = (
    "https://martinfleischmann.net/sds/raster_data/data/"
    "GHS_FUA_UCDB2015_GLOBE_R2019A_54009_1K_V1_0.zip"
)
2p = f"zip+{fua_url}!GHS_FUA_UCDB2015_GLOBE_R2019A_54009_1K_V1_0.gpkg"
3fuas = gpd.read_file(p)
4budapest = fuas.query("eFUA_name == 'Budapest'")
budapest.explore()
1
Get the URL.
2
Specify the path to read the file from the ZIP.
3
Read the table with geopandas.
4
Filter only Budapest.
Interactive map of the Budapest FUA

You can, alternatively, try reading the original data directly using this URL:

fua_url = (
    "https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/"
    "GHS_FUA_UCDB2015_GLOBE_R2019A/V1-0/"
    "GHS_FUA_UCDB2015_GLOBE_R2019A_54009_1K_V1_0.zip"
)

If you want to clip the population raster to the extent of the Budapest FUA, you can use the clip method from the rioxarray extension of xarray.

1population_bud = population.rio.clip(
2    budapest.to_crs(population.rio.crs).geometry
)
population_bud
1
Use .rio.clip to clip the geospatial raster to the extent of a geometry.
2
Ensure that budapest is in the same CRS as population and pass its geometry.
<xarray.DataArray (band: 1, y: 840, x: 830)> Size: 6MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]])
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06
  * y            (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0

The raster is no longer 10,000x10,000 pixels but only 840x830, covering the extent of the Budapest FUA. You can easily check that by plotting the clipped array.

_ = population_bud.plot()

Population grid clipped to Budapest FUA

Array manipulation

While this is technically a 3-dimensional array, the dimension "band" has only one value. Normally, you would get a 2-dimensional array representing a selected band using the .sel() method.

population_bud.sel(band=1)
<xarray.DataArray (y: 840, x: 830)> Size: 6MB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
    band         int64 8B 1
  * x            (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06
  * y            (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0

But if you have only one band, you can squeeze the array and get rid of the dimension you don’t need.

population_bud = population_bud.drop_vars("band").squeeze()
population_bud
<xarray.DataArray (y: 840, x: 830)> Size: 6MB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * x            (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06
  * y            (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0

Now, a lot of what you know from pandas works equally well in xarray. Getting the minimum:

population_bud.min()
<xarray.DataArray ()> Size: 8B
array(0.)
Coordinates:
    spatial_ref  int64 8B 0

As expected, there are some cells with no inhabitants.

population_bud.max()
<xarray.DataArray ()> Size: 8B
array(603.90405273)
Coordinates:
    spatial_ref  int64 8B 0

The densest cell, on the other hand, has more than 600 people per hectare.

population_bud.mean()
<xarray.DataArray ()> Size: 8B
array(6.77593517)
Coordinates:
    spatial_ref  int64 8B 0

The mean, however, is below 7.

population_bud.median()
<xarray.DataArray ()> Size: 8B
array(0.)
Coordinates:
    spatial_ref  int64 8B 0

The median is 0, meaning that a lot of cells contain no population at all.

DataArray vs scalar

Notice that xarray always returns another DataArray even with a single value. If you want to get that scalar value, you can use .item().

population_bud.mean().item()
6.775935172506021

You can plot the distribution of values across the array.

_ = population_bud.plot.hist(bins=100)

Histogram of population counts

Indeed, there are a lot of zeros. Let’s filter them out and check the distribution again.

_ = population_bud.where(population_bud>0).plot.hist(bins=100)

Histogram of population counts excluding 0

As with many observations in urban areas, this follows a power-law-like distribution, with many observations having tiny values and only a few having large ones.
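One way to see such a heavy-tailed distribution more clearly is to look at it on a logarithmic scale, for example (a quick sketch using numpy, whose functions work directly on DataArrays):

import numpy as np

# Log-transform the non-zero cells; the NaNs introduced by .where() are ignored by the histogram.
_ = np.log10(population_bud.where(population_bud > 0)).plot.hist(bins=100)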

Array operations

Let’s assume that you want to normalise population counts by the built-up volume, which is available as another GHSL product, again distributed as a grid.

volume_url = (
    "https://martinfleischmann.net/sds/raster_data/data/"
    "GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
)
volume_url
'https://martinfleischmann.net/sds/raster_data/data/GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip'

You can, alternatively, try reading the original data directly using this URL:

volume_url = (
    "https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/"
    "GHS_BUILT_V_GLOBE_R2023A/GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100/V1-0/tiles/"
    "GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
)

This all works the same as before. You read the GeoTIFF as a DataArray.

p = f"zip+{volume_url}!GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.tif"
built_up = rioxarray.open_rasterio(p, masked=True).drop_vars("band").squeeze()
built_up
<xarray.DataArray (y: 10000, x: 10000)> Size: 800MB
[100000000 values with dtype=float64]
Coordinates:
  * x            (x) float64 80kB 9.59e+05 9.592e+05 ... 1.959e+06 1.959e+06
  * y            (y) float64 80kB 6e+06 6e+06 6e+06 6e+06 ... 5e+06 5e+06 5e+06
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0

And clip it to the same extent.

built_up_bud = built_up.rio.clip(budapest.to_crs(built_up.rio.crs).geometry)
built_up_bud
<xarray.DataArray (y: 840, x: 830)> Size: 6MB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * x            (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06
  * y            (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0

You can quickly check what it looks like.

_ = built_up_bud.plot(cmap="magma_r")

Built-up volume

The two grids are aligned, meaning that pixels with the same coordinates represent the same area. This allows you to perform array algebra directly. Again, you know this from pandas.

1pop_density = population_bud / built_up_bud
pop_density
1
Divide the population by built-up volume to get a normalised value.
<xarray.DataArray (y: 840, x: 830)> Size: 6MB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * x            (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06
  * y            (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06
    spatial_ref  int64 8B 0

The result is a new array that inherits spatial information (spatial_ref) but contains newly computed values.

_ = pop_density.plot(cmap="cividis_r")

Population grid normalised by the built-up volume

The resulting array can then be saved to a GeoTIFF using rioxarray.

pop_density.rio.to_raster("population_density.tif")
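If you want to verify the export, you can read the file back the same way you read the original grids (assuming it was written to the current working directory):

pop_density_check = rioxarray.open_rasterio("population_density.tif", masked=True)
pop_density_check.rio.crs  # the CRS is carried over to the GeoTIFF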

Extracting values for locations

A common need is to extract values from raster data for specific locations of interest. That is the first type of interaction between raster and vector data (points in this case) covered here. To illustrate the use case, create a set of random points covering the area of budapest.

1locations = budapest.sample_points(1000).explode(ignore_index=True)
locations.head()
1
The sample_points() method creates a random sample of the selected size within each geometry of a GeoSeries. Each sample set is a MultiPoint, but in this case, you want individual points. That is where explode() is useful, since it explodes each multi-part geometry into individual points. Because you are not interested in the original index, you can use ignore_index=True to get the default pd.RangeIndex.
0    POINT (1466052.866 5632350.866)
1    POINT (1467284.892 5629615.691)
2     POINT (1467757.67 5634626.745)
3    POINT (1468163.478 5627029.796)
4    POINT (1468463.242 5627577.834)
Name: sampled_points, dtype: geometry

Check how the sample looks on a map.

locations.explore()
Interactive map of the sampled points
Random sampling and reproducibility

The points sampled from budapest will be different every time you run the sample_points() method. If you want to fix the result, you can pass a seed value to a random number generator as rng=42. With the same seed value, the result will always be the same, which is useful if you are interested in the reproducibility of your code.
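For example, a reproducible version of the sampling above could look like this (locations_fixed is just an illustrative name):

# With a fixed seed, repeated runs produce identical points.
locations_fixed = budapest.sample_points(1000, rng=42).explode(ignore_index=True)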

The xarray ecosystem offers many ways of extracting point values. Below, you will use the implementation from the xvec package. Create a new xarray.DataArray combining all three arrays you have created so far to see the benefits of using xvec.

1bud_cube = xr.concat(
2    [pop_density, population_bud, built_up_bud],
3    dim=pd.Index(
4        ["density", "population", "built-up volume"],
5        name="measurement",
    )
)
bud_cube
1
Use the xr.concat function to concatenate all the arrays together.
2
Specify which arrays shall be concatenated.
3
Define a new dimension along the axis of concatenation. Use pd.Index to create an index along this dimension.
4
Specify coordinates along the new dimension.
5
Give the dimension a name.
<xarray.DataArray (measurement: 3, y: 840, x: 830)> Size: 17MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]])
Coordinates:
  * x            (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06
  * y            (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06
    spatial_ref  int64 8B 0
  * measurement  (measurement) object 24B 'density' ... 'built-up volume'

The resulting DataArray is 3-dimensional, compared to the 2-dimensional arrays used before. Apart from x and y, you now have measurement as well. Using the new index created above, you can use the .sel() method to get the original arrays back.

bud_cube.sel(measurement="density")
<xarray.DataArray (y: 840, x: 830)> Size: 6MB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * x            (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06
  * y            (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06
    spatial_ref  int64 8B 0
    measurement  <U7 28B 'density'

Now it is time to take this cube and derive another one based on your points. That can be done using the .xvec accessor and its extract_points method.

1vector_cube = bud_cube.drop_vars("spatial_ref").xvec.extract_points(
2    points=locations.geometry,
3    x_coords="x",
    y_coords="y",
)
vector_cube
1
Drop spatial_ref, as it is not needed for point extraction, and use .xvec.extract_points().
2
Specify the points for which you want to extract the values.
3
Specify which dimension of the bud_cube DataArray represents the x-coordinates of the geometries and which represents the y-coordinates, so the points can be matched to the array.
<xarray.DataArray (measurement: 3, geometry: 1000)> Size: 24kB
array([[1.69309007e-03,            nan,            nan, ...,
        1.67070983e-03, 1.71016553e-03,            nan],
       [1.40780439e+01, 0.00000000e+00, 0.00000000e+00, ...,
        1.05789347e+01, 1.13726008e+00, 0.00000000e+00],
       [8.31500000e+03, 0.00000000e+00, 0.00000000e+00, ...,
        6.33200000e+03, 6.65000000e+02, 0.00000000e+00]])
Coordinates:
  * measurement  (measurement) object 24B 'density' ... 'built-up volume'
  * geometry     (geometry) object 8kB POINT (1466052.8664912267 5632350.8657...
Indexes:
    geometry  GeometryIndex (crs=ESRI:54009)

The resulting object is still a DataArray, but a slightly different one. It is now 2-dimensional rather than 3-dimensional: one dimension is measurement, still holding 'density', 'population' and 'built-up volume', and the other is geometry, containing the points of interest. With xvec, the two spatial dimensions (x, y) are reduced to a single geometry dimension, while the remaining dimensionality of the original array is preserved.
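Selection works along the new dimensions just as before. For example, you can pull out the density values for all points:

vector_cube.sel(measurement="density")  # a 1-dimensional array of densities indexed by geometry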

You can then convert the data into a geopandas.GeoDataFrame and work with it as usual.

location_data = vector_cube.xvec.to_geopandas()
location_data.head()
measurement geometry density population built-up volume
0 POINT (1466052.866 5632350.866) 0.001693 14.078044 8315.0
1 POINT (1467284.892 5629615.691) NaN 0.000000 0.0
2 POINT (1467757.67 5634626.745) NaN 0.000000 0.0
3 POINT (1468163.478 5627029.796) NaN 0.000000 0.0
4 POINT (1468463.242 5627577.834) NaN 0.000000 0.0

Check the result on a map to verify that everything worked as expected.

location_data.explore("density", cmap="cividis_r", tiles="CartoDB Positron")
Interactive map of density at the sampled locations
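Since location_data is an ordinary geopandas.GeoDataFrame, the usual pandas tooling applies. For example, you could quickly summarise the extracted values (a sketch using the column names shown above):

# Summary statistics of the values extracted at the sampled points.
location_data[["density", "population", "built-up volume"]].describe()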