Dask-GeoPandas vs PostGIS vs GPU: Performance and Spatial Joins

Paul Ramsey saw a spatial join done on a GPU and tried the same task with PostGIS, checking how it compares to the GPU-based RAPIDS.AI solution. Since Paul used parallelisation in PostGIS, I got curious how fast Dask-GeoPandas is on the same task.

So, I gave it a go.

import download
import geopandas
import dask_geopandas
import dask.dataframe
from dask.distributed import Client, LocalCluster

Let’s download the data using Paul’s query, to ensure we work with the same CSV.

curl "https://phl.carto.com/api/v2/sql?filename=parking_violations&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20parking_violations%20WHERE%20issue_datetime%20%3E=%20%272012-01-01%27%20AND%20issue_datetime%20%3C%20%272017-12-31%27" > phl_parking.csv

And then download and unzip the neighbourhoods shapefile.

download.download(
    "https://github.com/azavea/geo-data/raw/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.zip",
    "Neighborhoods_Philadelphia", 
    kind="zip"
)

Paul used a machine with 8 cores. Since I use a machine with 16 cores, I’ll create a local cluster limited to 8 workers. That should be as close to Paul’s machine as I can get without spinning up a virtual one. Keep in mind that this still distorts the benchmark, as we use different processors with different performance. But the point here is to get a sense of how fast a Dask-based solution can be compared to PostGIS and the original GPU code.

client = Client(
    LocalCluster(
        n_workers=8, 
        threads_per_worker=1
    )
)

With Dask, we first build the whole pipeline as a task graph and then run it all at once, so we won’t have timings for the individual steps, just the total one.

Read the parking data CSV into a partitioned data frame (~25 MB per partition).

ddf = dask.dataframe.read_csv(
    "phl_parking.csv", 
    blocksize=25e6, 
    assume_missing=True
)

Create point geometry and assign it to the data frame, creating a dask_geopandas.GeoDataFrame.

ddf = ddf.set_geometry(
    dask_geopandas.points_from_xy(
        ddf, 
        x="lon", 
        y="lat", 
        crs=4326
    )
)

Read the neighbourhood polygons and reproject them to EPSG:4326 (the same CRS as the parking data).

neigh = geopandas.read_file(
    "Neighborhoods_Philadelphia"
).to_crs(4326)

Create the spatial join.

joined = dask_geopandas.sjoin(ddf, neigh, predicate="within")

Finally, let’s compute the result.

%%time
r = joined.compute()

Time on a local cluster with 8 workers and 1 thread per worker to pretend it is an 8-core CPU:

CPU times: user 9.34 s, sys: 2.09 s, total: 11.4 s
Wall time: 21.3 s

The complete pipeline took 21.3 seconds, including sending all the data to a single process at the end to create a single-partition joined GeoDataFrame. That last step is usually unnecessary, as you keep working with the data directly in Dask, and judging from the Dask Dashboard, it takes a few seconds on its own.
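For illustration, a more typical workflow would aggregate within Dask and only collect the small result. A minimal sketch, assuming the neighbourhood layer has a "name" column holding the neighbourhood name (the actual column name in the shapefile may differ):

# count violations per neighbourhood inside the workers,
# so only the aggregated series is sent to the client
counts = (
    joined.groupby("name")  # hypothetical column with the neighbourhood name
    .size()
    .compute()
)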

Let’s compare it to the PostGIS solution:

  • Reading in the 9M records from CSV takes about 29 seconds
  • Making a second copy while creating a geometry column takes about 24 seconds
  • The final query running with 4 workers takes 24 seconds

That gives us a total of 77 seconds, compared to 21 seconds using Dask-GeoPandas. It is still slower than the 13 seconds of RAPIDS.AI, although that figure covers only the join itself, not reading the CSV and creating geometries, so my sense is that the totals would be almost equal. One aspect that makes a difference between Dask and PostGIS is that our pipeline is parallelised at every step: reading the CSV, creating points, generating the spatial index (done under the hood in sjoin) and the join itself.
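If you want to see that parallelism, you can check how many partitions each step works with (an illustrative check, not part of the benchmark); each partition is processed as a separate task on the workers:

print(ddf.npartitions)     # number of CSV chunks read and processed in parallel
print(joined.npartitions)  # the join typically gives one output partition per input partition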

While Paul was using an 8-core machine, PostGIS actually utilised only 4 cores (I am not sure why). Let’s run our code limited to 4 workers as well.
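A minimal sketch of how that might look, shutting down the 8-worker cluster and starting a new one (the rest of the pipeline stays the same):

client.shutdown()  # stop the previous 8-worker cluster

client = Client(
    LocalCluster(
        n_workers=4, 
        threads_per_worker=1
    )
)

Re-running the pipeline on this smaller cluster gives: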

CPU times: user 9.53 s, sys: 2 s, total: 11.5 s
Wall time: 28.4 s

28 seconds is a bit slower than before, but still quite fast!

When comparing the PostGIS and GPU solutions, Paul says:

Basically, it is very hard to beat a bespoke performance solution with a general-purpose tool. Yet, PostgreSQL/PostGIS comes within “good enough” range of a high end GPU solution, so that counts as a “win” to me.

At the moment, Dask-GeoPandas is somewhere between PostGIS and bespoke solutions. It does not offer as many functions as PostGIS, but it is designed as a general-purpose tool. So I would say that we are all winners here.

The notebook is available here.

EDIT (Mar 24, 2022): See also Dewey Dunnington’s follow-up expanding the comparison to R.

Methodological Foundation of a Numerical Taxonomy of Urban Form

The final paper based on my PhD thesis is (finally!) out in Environment and Planning B: Urban Analytics and City Science. We looked into ways of identifying patterns of urban form and came up with the Methodological foundation of a numerical taxonomy of urban form. You can read it on the journal website (open access).

We use urban morphometrics (i.e. data-driven methods) to derive a classification of urban form in Prague and Amsterdam, and you can check the results in online interactive maps – http://martinfleis.github.io/numerical-taxonomy-maps/ or below. (Check the layers toggle!)

The paper explores a method that can eventually support the creation of a taxonomy of urban types, similar to the one you know from biology. We even borrowed the foundations of the method from biological systematics.

We measure many variables based on building footprints and street networks (using the momepy Python package) and use Gaussian Mixture Model clustering to get urban tissue types independently in both cities. Then we apply Ward’s hierarchical clustering to build a taxonomy of these types.
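To give a rough sense of the mechanics, here is a heavily simplified sketch of that two-stage idea using scikit-learn and SciPy. It runs on random data with made-up parameters; the paper's actual morphometric characters, contextual aggregation and model selection are far more involved:

import numpy as np
from scipy.cluster import hierarchy
from sklearn.mixture import GaussianMixture

# hypothetical matrix of morphometric characters, one row per spatial unit
X = np.random.default_rng(0).normal(size=(1000, 20))

# stage 1: cluster spatial units into urban tissue types with a Gaussian Mixture Model
gmm = GaussianMixture(n_components=10, random_state=0).fit(X)
types = gmm.predict(X)

# summarise each tissue type by the mean of its characters
centroids = np.vstack([X[types == k].mean(axis=0) for k in range(10)])

# stage 2: build a taxonomy of the types with Ward's hierarchical clustering
linkage = hierarchy.ward(centroids)
tree = hierarchy.dendrogram(linkage, no_plot=True)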

The code is available at https://github.com/martinfleis/numerical-taxonomy-paper, and the repo even includes a clean Jupyter notebook with the complete method, so you can apply it to your own data if you wish. If you would rather play with our data, it is available as well: https://doi.org/10.6084/m9.figshare.16897102.

Abstract

Cities are complex products of human culture, characterised by a startling diversity of visible traits. Their form is constantly evolving, reflecting changing human needs and local contingencies, manifested in space by many urban patterns. Urban morphology laid the foundation for understanding many such patterns, largely relying on qualitative research methods to extract distinct spatial identities of urban areas. However, the manual, labour-intensive and subjective nature of such approaches represents an impediment to the development of a scalable, replicable and data-driven urban form characterisation. Recently, advances in geographic data science and the availability of digital mapping products open the opportunity to overcome such limitations. And yet, our current capacity to systematically capture the heterogeneity of spatial patterns remains limited in terms of spatial parameters included in the analysis and hardly scalable due to the highly labour-intensive nature of the task. In this paper, we present a method for numerical taxonomy of urban form derived from biological systematics, which allows the rigorous detection and classification of urban types. Initially, we produce a rich numerical characterisation of urban space from minimal data input, minimising limitations due to inconsistent data quality and availability. These are street network, building footprint and morphological tessellation, a spatial unit derivative of Voronoi tessellation, obtained from building footprints. Hence, we derive homogeneous urban tissue types and, by determining overall morphological similarity between them, generate a hierarchical classification of urban form. After framing and presenting the method, we test it on two cities – Prague and Amsterdam – and discuss potential applications and further developments. The proposed classification method represents a step towards the development of an extensive, scalable numerical taxonomy of urban form and opens the way to more rigorous comparative morphological studies and explorations into the relationship between urban space and phenomena as diverse as environmental performance, health and place attractiveness.