import datashader as ds
import geopandas as gpd
import rioxarray
import xarray as xr
import osmnx as ox
import pandas as pd
import matplotlib.pyplot as plt
import xvec
Population as a raster grid
Until now, all the data you were working with were tables. However, not everything is a table. Raster data are not that common in social geography, but spatial data science is full of it, from satellite imagery to population grids. In this session, you will learn how to work with spatial raster data in Python and how to link raster to vector using the ecosystem around the xarray
package.
Arrays and their many dimensions
Raster data are represented as arrays. Those can take many forms and shapes. You already know pandas
data structures, so let’s start with those.
A pandas.Series
is a 1-dimensional array with an index. A typical array contains values of the same data type (e.g. float
numbers), as does a typical Series
.
When it comes to geospatial raster data, one dimension is not enough. Even the most basic raster, something like a digital terrain model (DTM), requires two dimensions. One represents longitude (or x when projected), while the other latitude (or y), resulting in a 2-dimensional array.
But you don’t have to stop there. Take a typical satellite image. The longitude and latitude dimensions are still present, but you have different bands representing blue, green, red and often near-infra-red frequencies, resulting in a 3-dimensional array (lon
, lat
, band
). Throw in time, and you’re now dealing with a 4-dimensional array (lon
, lat
, band
, time
).
All these use cases fall under the umbrella of N-dimensional array handling covered by the xarray
package. Whereas a pandas.Series
is a 1-dimensional array with an index, xarray.DataArray
is an N-dimensional array with N indexes. Combining multiple Series
gives you a pandas.DataFrame
, where each column can have a different data type (e.g. one numbers, other names). Combining multiple xarray.DataArray
s gives you a xarray.Dataset
, where each array can have a different data type. There’s a lot of similarity between pandas
and xarray
, but also some differences.
Let’s read some raster and explore xarray
objects in practice.
Population grids
Today, you will be working with the data from the Global Human Settlement Layer (GHSL) developed by the Joint Research Centre of the European Commission. Unlike in all previous hands-on sessions, the data is not pre-processed and you could read it directly from the open data repository. However, since that seems to be a bit unstable lately, use the copy stored as part of this course.
The first layer you will open is a population grid. GHSL covers the whole world divided into a set of tiles, each covering an area of 1,000 by 1,000 km at a resolution of 100m per pixel. The link below points to a single tile1 covering most of Eastern Europe.
1= (
pop_url "https://martinfleischmann.net/sds/raster_data/data/"
"GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
) pop_url
- 1
- The URL is long. It may be better to write it as a multi-line string to avoid a long line.
'https://martinfleischmann.net/sds/raster_data/data/GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip'
You can, alternatively, try reading the original data directly using this URL:
= (
pop_url "https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/"
"GHS_POP_GLOBE_R2023A/GHS_POP_E2030_GLOBE_R2023A_54009_100/"
"V1-0/tiles/GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
)
The pop_url
points to a ZIP file. Within that ZIP file is a GeoTIFF containing the actual raster. There is often no need to download and unzip the file as there’s a good chance you can read it directly.
Reading rasters with rioxarray
xarray
, like pandas
is an agnostic library. It is designed for N-dimensional arrays but not necessarily geospatial arrays (although that is often the case…). It means that by default, it is not able to read geospatial file formats like GeoTIFF. That is where rioxarray
comes in. It comes with the support of the usual geo-specific things like specific file formats or CRS.
1= f"zip+{pop_url}!GHS_POP_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.tif"
p 2= rioxarray.open_rasterio(p, masked=True)
population population
- 1
-
Create a path to the file inside the ZIP. Add the
"zip+"
prefix and then the path to the actual file inside the archive, starting with"!"
. - 2
-
Use
rioxarray
to open the file using the lower-levelrasterio
package. Withmasked=True
ensure, that the missing values are properly masked out.
<xarray.DataArray (band: 1, y: 10000, x: 10000)> Size: 800MB [100000000 values with dtype=float64] Coordinates: * band (band) int64 8B 1 * x (x) float64 80kB 9.59e+05 9.592e+05 ... 1.959e+06 1.959e+06 * y (y) float64 80kB 6e+06 6e+06 6e+06 6e+06 ... 5e+06 5e+06 5e+06 spatial_ref int64 8B 0 Attributes: AREA_OR_POINT: Area scale_factor: 1.0 add_offset: 0.0
Above, you can see the representation of the population grid as a DataArray
. It has three dimensions ("band"
, "x"
, "y"
) with a resolution 1x10,000x10,000 and values as float
.
rioxarray
gives you a handy .rio
accessor on xarray
objects, allowing you to access geospatial-specific tools. Like retrieval of CRS.
population.rio.crs
CRS.from_wkt('PROJCS["World_Mollweide",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]],PROJECTION["Mollweide"],PARAMETER["central_meridian",0],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]')
Or the extent of the raster (in the CRS shown above).
population.rio.bounds()
(959000.0, 5000000.0, 1959000.0, 6000000.0)
The missing, masked data can be represented as as specific value, especially when dealing with integer arrays. You can check which one:
population.rio.nodata
nan
Plotting with datashader
Plotting a raster with a resolution of 10,000x10,000 pixels can be tricky. Often, the resolution is even larger than that. The best way to plot is to resample the data to a smaller resolution that better fits the screen. A handy tool that can do that quickly is datashader
. Let’s use it to plot the array as 600x600 pixels.
1= ds.Canvas(plot_width=600, plot_height=600)
canvas 2= canvas.raster(population.where(population>0).sel(band=1))
agg agg
- 1
- Create a canvas with a specific resolution.
- 2
-
Select pixels with a population more than 0 (
population.where(population>0)
), select a single band to get 2-dimensional array (.sel(band=1)
) and pass the result to the canvas.
<xarray.DataArray (y: 600, x: 600)> Size: 3MB array([[4.21513869, 3.52179019, 6.29294423, ..., 1.66938128, 0.82322911, nan], [1.04041781, 1.57566844, 4.50548627, ..., nan, 0.15449866, nan], [ nan, 3.18057789, 2.21502486, ..., nan, nan, nan], ..., [ nan, nan, nan, ..., 4.05967611, 1.9045163 , 3.70262405], [ nan, nan, nan, ..., 7.2545469 , 6.44937184, 1.5507871 ], [ nan, nan, nan, ..., 0.76526973, 1.16737383, 0.21985548]]) Coordinates: * x (x) float64 5kB 9.598e+05 9.615e+05 ... 1.957e+06 1.958e+06 * y (y) float64 5kB 5.999e+06 5.998e+06 ... 5.002e+06 5.001e+06 Attributes: res: 100.0 x_range: (959000.0, 1959000.0) y_range: (5000000.0, 6000000.0)
You can see that the result is a new xarray.DataArray
with a resolution 600x600. The built-in matplotlib-based plotting can easily handle that.
= agg.plot() _
Clipping based on geometry
Daling with large rasters is often impractical if you are interested in a small subset, for example, representing a single city.
Functional Urban Areas
In this case, you may want to work only with the data covering Budapest, Hungary, defined by its functional urban area (FUA), available as another data product on GHSL. FUAs are available as a single GeoPackage with vector geometries.
1= (
fua_url "https://martinfleischmann.net/sds/raster_data/data/"
"GHS_FUA_UCDB2015_GLOBE_R2019A_54009_1K_V1_0.zip"
)2= f"zip+{fua_url}!GHS_FUA_UCDB2015_GLOBE_R2019A_54009_1K_V1_0.gpkg"
p 3= gpd.read_file(p)
fuas 4= fuas.query("eFUA_name == 'Budapest'")
budapest budapest.explore()
- 1
- Get the URL.
- 2
- Specify the path to read the file from the ZIP.
- 3
-
Read the table with
geopandas
. - 4
- Filter only Budapest.
You can, alternatively, try reading the original data directly using this URL:
= (
fua_url "https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/"
"GHS_FUA_UCDB2015_GLOBE_R2019A/V1-0/"
"GHS_FUA_UCDB2015_GLOBE_R2019A_54009_1K_V1_0.zip"
)
If you want to clip the population
raster to the extent of Budapest FUA, you can use the clip
method from the rioxarray
extension of xarray
.
1= population.rio.clip(
population_bud 2
budapest.to_crs(population.rio.crs).geometry
) population_bud
- 1
-
Use
.rio.clip
to clip the geospatial raster to the extent of a geometry. - 2
-
Ensure the
budapest
is in the same CRS as thepopulation
and pass its geometry.
<xarray.DataArray (band: 1, y: 840, x: 830)> Size: 6MB array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]]) Coordinates: * band (band) int64 8B 1 * x (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06 * y (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06 spatial_ref int64 8B 0 Attributes: AREA_OR_POINT: Area scale_factor: 1.0 add_offset: 0.0
The raster is no longer 10,000x10,000 pixels but only 840x830, covering the extent of Budapest FUA. You can easily check that by plotting the clipped array.
= population_bud.plot() _
Array manipulation
While this is technically a 3-dimensional array, the dimension "band"
has only one value. Normally, you would get a 2-dimensional array representing a selected band using the .sel()
method.
=1) population_bud.sel(band
<xarray.DataArray (y: 840, x: 830)> Size: 6MB array([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]) Coordinates: band int64 8B 1 * x (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06 * y (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06 spatial_ref int64 8B 0 Attributes: AREA_OR_POINT: Area scale_factor: 1.0 add_offset: 0.0
But if you have only one band, you can squeeze the array and get rid of that dimension that is not needed.
= population_bud.drop_vars("band").squeeze()
population_bud population_bud
<xarray.DataArray (y: 840, x: 830)> Size: 6MB array([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]) Coordinates: * x (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06 * y (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06 spatial_ref int64 8B 0 Attributes: AREA_OR_POINT: Area scale_factor: 1.0 add_offset: 0.0
Now a lot what you know from pandas
works equally in xarray
. Getting the minimum:
min() population_bud.
<xarray.DataArray ()> Size: 8B array(0.) Coordinates: spatial_ref int64 8B 0
As expected, there are some cells with no inhabitants.
max() population_bud.
<xarray.DataArray ()> Size: 8B array(603.90405273) Coordinates: spatial_ref int64 8B 0
The densest cell, on the other hand, has more than 600 people per hectare.
population_bud.mean()
<xarray.DataArray ()> Size: 8B array(6.77593517) Coordinates: spatial_ref int64 8B 0
Mean is, however, only below 7.
population_bud.median()
<xarray.DataArray ()> Size: 8B array(0.) Coordinates: spatial_ref int64 8B 0
While the median is 0, there are a lot of cells with 0.
Notice that xarray
always returns another DataArray
even with a single value. If you want to get that scalar value, you can use .item()
.
population_bud.mean().item()
6.775935172506021
You can plot the distribution of values across the array.
= population_bud.plot.hist(bins=100) _
Indeed, there are a lot of zeros. Let’s filter them out and check the distribution again.
= population_bud.where(population_bud>0).plot.hist(bins=100) _
As with many observations in urban areas, this follows a power-law-like distribution with a lot of observations with tiny values and only a few with large ones.
Array operations
Let’s assume that you want to normalise population counts by the built-up volume, which is available as another GHSL product. This time, on a grid again.
= (
volume_url "https://martinfleischmann.net/sds/raster_data/data/"
"GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
) volume_url
'https://martinfleischmann.net/sds/raster_data/data/GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip'
You can, alternatively, try reading the original data directly using this URL:
= (
volume_url "https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/"
"GHS_BUILT_V_GLOBE_R2023A/GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100/V1-0/tiles/"
"GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.zip"
)
All work the same as before. You read the GeoTIFF as a DataArray
.
= f"zip+{volume_url}!GHS_BUILT_V_E2030_GLOBE_R2023A_54009_100_V1_0_R4_C20.tif"
p = rioxarray.open_rasterio(p, masked=True).drop_vars("band").squeeze()
built_up built_up
<xarray.DataArray (y: 10000, x: 10000)> Size: 800MB [100000000 values with dtype=float64] Coordinates: * x (x) float64 80kB 9.59e+05 9.592e+05 ... 1.959e+06 1.959e+06 * y (y) float64 80kB 6e+06 6e+06 6e+06 6e+06 ... 5e+06 5e+06 5e+06 spatial_ref int64 8B 0 Attributes: AREA_OR_POINT: Area scale_factor: 1.0 add_offset: 0.0
And clip it to the same extent.
= built_up.rio.clip(budapest.to_crs(built_up.rio.crs).geometry)
built_up_bud built_up_bud
<xarray.DataArray (y: 840, x: 830)> Size: 6MB array([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]) Coordinates: * x (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06 * y (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06 spatial_ref int64 8B 0 Attributes: AREA_OR_POINT: Area scale_factor: 1.0 add_offset: 0.0
You can quickly check what it looks like.
= built_up_bud.plot(cmap="magma_r") _
The two grids are aligned, meaning that pixels with the same coordinates represent the same area. This allows us to directly perform array algebra. Again, you know this from pandas
.
1= population_bud / built_up_bud
pop_density pop_density
- 1
- Divide the population by built-up volume to get a normalised value.
<xarray.DataArray (y: 840, x: 830)> Size: 6MB array([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]) Coordinates: * x (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06 * y (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06 spatial_ref int64 8B 0
The result is a new array that inherits spatial information (spatial_ref
) but contains newly computed values.
= pop_density.plot(cmap="cividis_r") _
The resulting array can then be saved to a GeoTIFF using rioxarray
.
"population_density.tif") pop_density.rio.to_raster(
Extracting values for locations
A common need is to extract values from raster data for a specific location of interest. That is a first type of interaction between raster and vector data (points in this case). To illustrate the use case, create a set of random points covering the area of budapest
.
1= budapest.sample_points(1000).explode(ignore_index=True)
locations locations.head()
- 1
-
sample_points()
method creates a random sample of the selected size within each geometry in aGeoSeries
. Each sample set is aMultiPoint
but in this case, you want individual points. That is whenexplode()
is useful, since it explodes each multi-part geometry into individual points. Because you are not interested in the original index, you can useignore_index=True
to get the defaultpd.RangeIndex
.
0 POINT (1466052.866 5632350.866)
1 POINT (1467284.892 5629615.691)
2 POINT (1467757.67 5634626.745)
3 POINT (1468163.478 5627029.796)
4 POINT (1468463.242 5627577.834)
Name: sampled_points, dtype: geometry
Check how the sample looks on a map.
locations.explore()
The points sampled from budapest
will be different every time you run the sample_points()
method. If you want to fix the result, you can pass a seed value to a random number generator as rng=42
. With the same seed value, the result will be always the same. This is useful, especially if you are interested in the reproducibility of your code.
The xarray
ecosystem offers many ways of extracting point values. Below, you will use the implementation from the xvec
package. Create a new xarray.DataArray
with all three arrays you created so far to see the benefits of using xvec
below.
1= xr.concat(
bud_cube 2
[pop_density, population_bud, built_up_bud],3=pd.Index(
dim4"density", "population", "built-up volume"],
[5="measurement",
name
)
) bud_cube
- 1
-
Use
xr.concat
function to concatenate all arrays together. - 2
- Specify which arrays shall be concatenated.
- 3
-
Define a new dimension created along the axis of concatenation. Use the
pd.Index
to create a new index along this dimension. - 4
- Specify coordinates along the new dimension.
- 5
- Give the dimension a name.
<xarray.DataArray (measurement: 3, y: 840, x: 830)> Size: 17MB array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], [[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], [[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]]) Coordinates: * x (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06 * y (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06 spatial_ref int64 8B 0 * measurement (measurement) object 24B 'density' ... 'built-up volume'
The resulting DataArray
is 3-dimensional, compared to 2-dimensional arrays used before. Apart from x
and y
, you now have measurement
as well. Using the new index created above, you can use the sel()
method to get the original arrays.
="density") bud_cube.sel(measurement
<xarray.DataArray (y: 840, x: 830)> Size: 6MB array([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]) Coordinates: * x (x) float64 7kB 1.465e+06 1.465e+06 ... 1.548e+06 1.548e+06 * y (y) float64 7kB 5.648e+06 5.648e+06 ... 5.564e+06 5.564e+06 spatial_ref int64 8B 0 measurement <U7 28B 'density'
Now it is time to take this cube and create another based on your points. That can be done using the .xvec
accessor and its extract_points
method.
1= bud_cube.drop_vars("spatial_ref").xvec.extract_points(
vector_cube 2=locations.geometry,
points3="x",
x_coords="y",
y_coords
) vector_cube
- 1
-
Drop
spatial_ref
because it is not interesting for point extraction and use.xvec.extract_points()
- 2
- Specify the points for which you want to extract the values.
- 3
-
Specify which dimension of the
bud_cube
DataArray
represents x-coordinate dimension of geometries and which represents the y-coordinate dimension to match points to the array.
<xarray.DataArray (measurement: 3, geometry: 1000)> Size: 24kB array([[1.69309007e-03, nan, nan, ..., 1.67070983e-03, 1.71016553e-03, nan], [1.40780439e+01, 0.00000000e+00, 0.00000000e+00, ..., 1.05789347e+01, 1.13726008e+00, 0.00000000e+00], [8.31500000e+03, 0.00000000e+00, 0.00000000e+00, ..., 6.33200000e+03, 6.65000000e+02, 0.00000000e+00]]) Coordinates: * measurement (measurement) object 24B 'density' ... 'built-up volume' * geometry (geometry) object 8kB POINT (1466052.8664912267 5632350.8657... Indexes: geometry GeometryIndex (crs=ESRI:54009)
The resulting object is still a DataArray
but a bit different. It is no longer 3-dimensional, although all dimensions of interest ('density', 'population', 'built-up volume'
) are still there, but 2-dimensional. One dimension is measurement
, and the other is geometry
, containing the points of interest. With xvec
, the spatial dimension is reduced, but the remaining dimensionality of the original array is preserved.
You can then convert the data into a geopandas.GeoDataFrame
and work with it as usual.
= vector_cube.xvec.to_geopandas()
location_data location_data.head()
measurement | geometry | density | population | built-up volume |
---|---|---|---|---|
0 | POINT (1466052.866 5632350.866) | 0.001693 | 14.078044 | 8315.0 |
1 | POINT (1467284.892 5629615.691) | NaN | 0.000000 | 0.0 |
2 | POINT (1467757.67 5634626.745) | NaN | 0.000000 | 0.0 |
3 | POINT (1468163.478 5627029.796) | NaN | 0.000000 | 0.0 |
4 | POINT (1468463.242 5627577.834) | NaN | 0.000000 | 0.0 |
Check the result on a map to verify that all worked as expected.
"density", cmap="cividis_r", tiles="CartoDB Positron") location_data.explore(