spatialpandas¶

Pandas and Dask extensions for vectorized spatial and geometric operations using numba, initially focused on the needs of data visualization libraries.

Goal¶

The goal of spatialpandas is to provide a foundation for implementing custom vectorized geometric algorithms using numba or cython. The initial motivation for the project was to support data visualization libraries. Datashader, for example, builds on spatialpandas to support vectorized polygon rendering using numba.

Comparison to geopandas¶

This project shares some of the goals (and some of the API) of the geopandas project. However, unlike geopandas, spatialpandas is not focused on geographic functionality. Instead, this project is focused on geometric use-cases and is explicitly independent of what space these spatial data structures and operations cover. As such, spatialpandas has no dependencies on the traditional geographic stack (GDAL, fiona, GEOS, etc.). Instead, it is built solely on top of general-purpose SciPy and PyData libraries and tools, which means that is should be easily installable using Conda or pip without any trickly binary-compatibility issues.

All algorithms are implemented in pure Python, which is then compiled to machine code at runtime with Numba. To support scaling to large problems, operations can run distributed and/or out of core using Dask.

Future direction¶

In the future, this library could grow to support most of the geometric operations offered by geopandas. It could also be extended to include 3-dimensional geometric primitives and operations.

Geometry Extension Arrays¶

spatialpandas provides pandas extension arrays for storing arrays of geometry objects as DataFrame columns. Unlike geopandas, where a geometry column may contain a mix of geometry types, all elements of a spatialpandas geometry array must be of the same geometry type.

In [1]:

from spatialpandas.geometry import (
    PointArray, MultiPointArray, LineArray,
    MultiLineArray, PolygonArray, MultiPolygonArray
)

PointArray¶

The PointArray stores the coordinates of a single point per element. A PointArray can be constructed from several data structures:

A 2D list or array with two columns, the first column containing the x coordinates and the second the y coordinates.

In [2]:

point_array = PointArray([
    [1, 2],
    [3, 4],
    [5, 6],
])
point_array

Out[2]:

<PointArray>
[Point([1, 2]), Point([3, 4]), Point([5, 6])]
Length: 3, dtype: point[int64]

A 1D list or array of the interleaved x and y coordinates of each point

In [3]:

point_array = PointArray([
    1, 2,
    3, 4,
    5, 6,
])
point_array

Out[3]:

<PointArray>
[Point([1, 2]), Point([3, 4]), Point([5, 6])]
Length: 3, dtype: point[int64]

A tuple of two 1D lists or arrays, the first containing the x coordinates and the second the y coordinates.

In [4]:

point_array = PointArray(
    ([1, 3, 5], [2, 4, 6])
)
point_array

Out[4]:

<PointArray>
[Point([1, 2]), Point([3, 4]), Point([5, 6])]
Length: 3, dtype: point[int64]

The x and y coordinates of a PointArray can be accessed using the x and y properties.

In [5]:

point_array.x

Out[5]:

array([1., 3., 5.])

In [6]:

point_array.y

Out[6]:

array([2., 4., 6.])

All coordinates can be accessed as a single array of interleaved x and y coordinates using the flat_values property.

In [7]:

point_array.flat_values

Out[7]:

array([1, 2, 3, 4, 5, 6])

MultiPointArray¶

A MultiPointArray stores the coordinates of zero or more points per element. A MultiPointArray is constructed from a list of lists, where the inner lists contain the interleaved x and y coordinates of the points in that element.

In [8]:

multipoint_array = MultiPointArray([
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10],
])
multipoint_array

Out[8]:

<MultiPointArray>
[MultiPoint([1, 2, 3, 4, 5, 6]), MultiPoint([7, 8, 9, 10])]
Length: 2, dtype: multipoint[int64]

All of the coordinates can be accessed as a single flat array of interleaved x and y coordinates using the buffer_values property.

In [9]:

multipoint_array.buffer_values

Out[9]:

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

A length one tuple with an array of indices into buffer_values that separate the coordinates of individual elements is available as the buffer_offsets property.

In [10]:

multipoint_array.buffer_offsets

Out[10]:

(array([ 0,  6, 10], dtype=uint32),)

LineArray¶

A LineArray stores the coordinates of one line per element. Like the MultiPointArray, a LineArray is constructed from a list of lists, where the inner lists contain the interleaved x and y coordinates of the vertices of the line in that element.

In [11]:

line_array = LineArray([
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10],
])
line_array

Out[11]:

<LineArray>
[Line([1, 2, 3, 4, 5, 6]), Line([7, 8, 9, 10])]
Length: 2, dtype: line[int64]

MultiLineArray¶

A MultiLineArray stores the coordinates of zero or more lines per element. A MultiLineArray is constructed from a doubly nested list of the interleaved x and y coordinates.

In [12]:

multiline_array = MultiLineArray([
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]],
    [[11, 12]]
])
multiline_array

Out[12]:

<MultiLineArray>
[MultiLine([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]), MultiLine([[11, 12]])]
Length: 2, dtype: multiline[int64]

All of the coordinates can be accessed as a single flat array of interleaved x and y coordinates using the buffer_values property.

In [13]:

multiline_array.buffer_values

Out[13]:

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

A tuple of arrays of indices into buffer_values that separate the coordinates of individual lines and multi-line elements are available as the buffer_offsets property.

In [14]:

multiline_array.buffer_offsets

Out[14]:

(array([0, 2, 3], dtype=uint32), array([ 0,  6, 10, 12], dtype=uint32))

PolygonArray¶

A PolygonArray stores the coordinates of one polygon (with zero or more holes) per element. Like the MultiLineArray, a PolygonArray is constructed from a doubly nested list of interleaved x and y coordinates. Each inner list contains the interleaved x and y coordinates of a closed ring. The first ring of each element represents the filled polygon outline and should be in a counter-clockwise order. The following rings, if any, represent holes in the polygon and should be in a clockwise order.

In [15]:

# Square from (0, 0) to (1, 1) in CCW order
outline0 = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]

# Square from (2, 2) to (5, 5) in CCW order
outline1 = [2, 2, 5, 2, 5, 5, 2, 5, 2, 2]

# Triangle hole in CW order
hole1 = [3, 3, 4, 3, 3, 4, 3, 3]

polygon_array = PolygonArray([
    [outline0],
    [outline1, hole1]
])
polygon_array

Out[15]:

<PolygonArray>
[Polygon([[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]]), Polygon([[2, 2, 5, 2, 5, 5, 2, 5, 2, 2], [3, 3, 4, 3, 3, 4, 3, 3]])]
Length: 2, dtype: polygon[int64]

All of the coordinates can be accessed as a single flat array of interleaved x and y coordinates using the buffer_values property.

In [16]:

polygon_array.buffer_values

Out[16]:

array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 2, 2, 5, 2, 5, 5, 2, 5, 2, 2, 3, 3,
       4, 3, 3, 4, 3, 3])

A tuple of arrays of indices into buffer_values that separate the coordinates of individual rings and polygon elements is available as the buffer_offsets property.

In [17]:

polygon_array.buffer_offsets

Out[17]:

(array([0, 1, 3], dtype=uint32), array([ 0, 10, 20, 28], dtype=uint32))

MultiPolygonArray¶

A MultiPolygonArray stores the coordinates of one or more multi-polygons per element, where a multi-polygon consists of zero or more polygons that may each have zero or more holes. The MultiPolygonArray adds one additional level of nesting compared to the PolygonArray.

In [18]:

# Triangle hole in CW order
outline3 = [3, 1, 4, 1, 3, 2, 3, 1]

multipolygon_array = MultiPolygonArray([
    [[outline0], [outline1, hole1]],
    [[outline3]]
])
multipolygon_array

Out[18]:

<MultiPolygonArray>
[MultiPolygon([[[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]], [[2, 2, 5, 2, 5, 5, 2, 5, 2, 2], [3, 3, 4, 3, 3, 4, 3, 3]]]), MultiPolygon([[[3, 1, 4, 1, 3, 2, 3, 1]]])]
Length: 2, dtype: multipolygon[int64]

In [19]:

multipolygon_array.buffer_values

Out[19]:

array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 2, 2, 5, 2, 5, 5, 2, 5, 2, 2, 3, 3,
       4, 3, 3, 4, 3, 3, 3, 1, 4, 1, 3, 2, 3, 1])

In [20]:

multipolygon_array.buffer_offsets

Out[20]:

(array([0, 2, 3], dtype=uint32),
 array([0, 1, 3, 4], dtype=uint32),
 array([ 0, 10, 20, 28, 36], dtype=uint32))

spatialpandas GeoSeries and GeoDataFrame¶

While spatialpandas geometry arrays can be stored in standard pandas Series and DataFrame objects, additional functionality is provided by the GeoSeries and GeoDataFrame subclasses. These classes are designed to implement a subset of the API of the GeoSeries and GeoDataFrame classes provided by the geopandas library.

In [21]:

from spatialpandas import GeoSeries, GeoDataFrame

In [22]:

geo_series = GeoSeries(multipoint_array)
geo_series

Out[22]:

0    MultiPoint([1, 2, 3, 4, 5, 6])
1         MultiPoint([7, 8, 9, 10])
dtype: multipoint[int64]

In [23]:

geo_df = GeoDataFrame({
    'multipoint': multiline_array,
    'line': line_array,
    'multiline': multiline_array,
    'polygon': polygon_array,
    'multipolygon': multipolygon_array
})
geo_df

Out[23]:

	multipoint	line	multiline	polygon	multipolygon
0	MultiLine([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]])	Line([1, 2, 3, 4, 5, 6])	MultiLine([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]])	Polygon([[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]])	MultiPolygon([[[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]]...
1	MultiLine([[11, 12]])	Line([7, 8, 9, 10])	MultiLine([[11, 12]])	Polygon([[2, 2, 5, 2, 5, 5, 2, 5, 2, 2], [3, 3...	MultiPolygon([[[3, 1, 4, 1, 3, 2, 3, 1]]])

GeoPandas conversions¶

GeoSeries and GeoDataFrame objects from geopandas can be converted to spatialpandas objects by passing them to the constructor of the spatialpandas equivalent.

Here is an example of loading the naturalearth_lowres dataset using geopandas as a GeoDataFrame, and then converting it to a spatialpandas GeoDataFrame.

Note: The following examples require the geopandas, matplotlib, descartes, datashader, and holoviews packages, which are not automatically installed with spatialpandas. These can be installed using pip...

$ pip install geopandas matplotlib descartes datashader holoviews

or conda...

$ conda install -c pyviz geopandas matplotlib descartes datashader holoviews

In [24]:

import geopandas
from download import download_map

world_gp = geopandas.read_file(
    download_map('naturalearth_lowres')
)
world_gp = world_gp.sort_values('name').reset_index(drop=True)
world_gp = world_gp[['pop_est', 'continent', 'name', 'geometry']]
world_gp

Out[24]:

	pop_est	continent	name	geometry
0	34124811	Asia	Afghanistan	POLYGON ((66.51861 37.36278, 67.07578 37.35614...
1	3047987	Europe	Albania	POLYGON ((21.02004 40.84273, 20.99999 40.58000...
2	40969443	Africa	Algeria	POLYGON ((-8.68440 27.39574, -8.66512 27.58948...
3	29310273	Africa	Angola	MULTIPOLYGON (((12.99552 -4.78110, 12.63161 -4...
4	4050	Antarctica	Antarctica	MULTIPOLYGON (((-48.66062 -78.04702, -48.15140...
...	...	...	...	...
172	603253	Africa	W. Sahara	POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
173	28036829	Asia	Yemen	POLYGON ((52.00001 19.00000, 52.78218 17.34974...
174	15972000	Africa	Zambia	POLYGON ((30.74001 -8.34001, 31.15775 -8.59458...
175	13805084	Africa	Zimbabwe	POLYGON ((31.19141 -22.25151, 30.65987 -22.151...
176	1467152	Africa	eSwatini	POLYGON ((32.07167 -26.73382, 31.86806 -27.177...

177 rows × 4 columns

In [25]:

world_df = GeoDataFrame(world_gp)
world_df

Out[25]:

	pop_est	continent	name	geometry
0	34124811	Asia	Afghanistan	MultiPolygon([[[66.51860680528867, 37.36278432...
1	3047987	Europe	Albania	MultiPolygon([[[21.0200403174764, 40.842726955...
2	40969443	Africa	Algeria	MultiPolygon([[[-8.684399786809053, 27.3957441...
3	29310273	Africa	Angola	MultiPolygon([[[12.995517205465177, -4.7811032...
4	4050	Antarctica	Antarctica	MultiPolygon([[[-48.66061601418252, -78.047018...
...	...	...	...	...
172	603253	Africa	W. Sahara	MultiPolygon([[[-8.665589565454809, 27.6564258...
173	28036829	Asia	Yemen	MultiPolygon([[[52.00000980002224, 19.00000336...
174	15972000	Africa	Zambia	MultiPolygon([[[30.740009731422095, -8.3400059...
175	13805084	Africa	Zimbabwe	MultiPolygon([[[31.19140913262129, -22.2515096...
176	1467152	Africa	eSwatini	MultiPolygon([[[32.07166548028107, -26.7338200...

177 rows × 4 columns

A spatialpandas GeoDataFrame or GeoSeries can be converted to geopandas using the to_geopandas method.

In [26]:

world_df.iloc[[4, 2, 0]].to_geopandas()

Out[26]:

	pop_est	continent	name	geometry
4	4050	Antarctica	Antarctica	MULTIPOLYGON (((-48.66062 -78.04702, -48.66062...
2	40969443	Africa	Algeria	MULTIPOLYGON (((-8.68440 27.39574, -4.92334 24...
0	34124811	Asia	Afghanistan	MULTIPOLYGON (((66.51861 37.36278, 66.21738 37...

Visualization¶

Visualizing geometry elements using shapely¶

Individual elements of a spatialpandas geometry array can be converted to shapely shapes using the to_shapely method, and shapely objects automatically display themselves in the Jupyter notebook.

In [27]:

world_df.geometry[0].to_shapely()

Out[27]:

Visualizing GeoSeries with matplotlib using geopandas¶

A spatialpandas GeoSeries can be converted to a geopandas GeoSeries using the to_geopandas method, and geopandas GeoSeries objects can be visualized with matplotlib using the plot method.

In [28]:

world_df.geometry.to_geopandas().plot()

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f22a53bd588>

Visualizing GeoSeries with Datashader¶

A spatialpandas GeoDataFrame can be visualized with datashader. Here is an example of visualizing a multi-polygon GeoSeries using datashader. In this case, the polygons are shaded to reflect the estimated population of each country.

In [29]:

import datashader as ds
cvs = ds.Canvas(plot_width=650, plot_height=400)
agg = cvs.polygons(world_df, geometry='geometry', agg=ds.mean('pop_est'))
ds.transfer_functions.shade(agg)

Out[29]:

Visualizing geometry arrays interactively with Datashader and HoloViews¶

By combining datashader and a HoloViews DynamicMap, spatialpandas geometry arrays can be rendered interactively.

In [30]:

import holoviews as hv
hv.extension("bokeh")

In [31]:

def callback(x_range, y_range):
    cvs = ds.Canvas(plot_width=650, plot_height=400, x_range=x_range, y_range=y_range)
    agg = cvs.polygons(world_df, geometry='geometry', agg=ds.mean('pop_est'))
    return hv.Image(agg).opts(cmap="viridis").opts(width=650, height=400)

hv.DynamicMap(
    callback, streams=[hv.streams.RangeXY()]
)

Out[31]:

Spatial selection¶

As with geopandas, the spatialpandas GeoSeries and GeoDataFrame provide a spatial indexer, cx, that can be used to select the entries that reside in a particular spatial region.

Here is an example of using the cx indexer to select those countries that reside, at least partially, in the southern hemisphere.

In [32]:

selected_world_df = world_df.geometry.cx[:, :0]
selected_world_df.to_geopandas().plot()

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f226c3c7e80>

Spatial intersection¶

Spatialpandas has limited support for computing intersections between shapes. Currently, intersects can be called on Point arrays and can used to compare to any other scalar geometry type.

For this example, load the naturalearth_cities dataset included with geopandas. This dataset contains the name and location of one city per row. The intersection calculation can be used to identify which cities intersect with the a particular country polygon.

In [33]:

cities_gp = geopandas.read_file(
    download_map('naturalearth_cities')
)
cities_df = GeoDataFrame(cities_gp)
cities_df.head()

Out[33]:

	name	geometry
0	Vatican City	Point([12.453386544971766, 41.903282179960115])
1	San Marino	Point([12.441770157800141, 43.936095834768004])
2	Vaduz	Point([9.516669472907267, 47.13372377429357])
3	Luxembourg	Point([6.130002806227083, 49.611660379121076])
4	Palikir	Point([158.1499743237623, 6.916643696007725])

In [34]:

country_ind = 0
country_polygon = world_df.geometry.iloc[country_ind]
print("Cities in %s" % world_df.name.iloc[country_ind])
cities_df[cities_df.geometry.intersects(country_polygon)]

Cities in Afghanistan

Out[34]:

	name	geometry
183	Kabul	Point([69.18131419070505, 34.51863614490031])

Spatial join¶

Spatialpandas has limited spatial join support using the sjoin function. Currently, the left_df argument to sjoin must be a GeoDataFrame or DaskGeoDataFrame with an active Point geometry column. The right_df argument must be a GeoDataFrame (not DaskGeoDataFrame), but it may have any geometry type.

Here is an example of using sjoin to join the cities_df and world_df GeoDataFrames on the intersection of the city Point in cities_df and the country MultiPolygon in world_df.

In [35]:

from spatialpandas import sjoin
sjoin(cities_df, world_df)

Out[35]:

	name_left	geometry	index_right	pop_est	continent	name_right
0	Vatican City	Point([12.453386544971766, 41.903282179960115])	77	62137802	Europe	Italy
1	San Marino	Point([12.441770157800141, 43.936095834768004])	77	62137802	Europe	Italy
192	Rome	Point([12.481312562873995, 41.89790148509894])	77	62137802	Europe	Italy
2	Vaduz	Point([9.516669472907267, 47.13372377429357])	8	8754413	Europe	Austria
184	Vienna	Point([16.364693096743736, 48.20196113681686])	8	8754413	Europe	Austria
...	...	...	...	...	...	...
195	Jakarta	Point([106.82749176247012, -6.172471846798885])	72	260580739	Asia	Indonesia
196	Bogota	Point([-74.08528981377441, 4.598369421147822])	32	47698524	South America	Colombia
197	Cairo	Point([31.248022361126118, 30.051906205103705])	45	97041072	Africa	Egypt
198	Tokyo	Point([139.74946157054467, 35.686962764371174])	79	126451398	Asia	Japan
200	Santiago	Point([-70.66898671317483, -33.448067956934096])	30	17789267	South America	Chile

172 rows × 6 columns

For comparison, here is the equivalent spatial join performed using geopandas

In [36]:

from geopandas import sjoin as gp_sjoin
gp_sjoin(cities_gp, world_gp)

Out[36]:

	name_left	geometry	index_right	pop_est	continent	name_right
0	Vatican City	POINT (12.45339 41.90328)	77	62137802	Europe	Italy
1	San Marino	POINT (12.44177 43.93610)	77	62137802	Europe	Italy
192	Rome	POINT (12.48131 41.89790)	77	62137802	Europe	Italy
2	Vaduz	POINT (9.51667 47.13372)	8	8754413	Europe	Austria
184	Vienna	POINT (16.36469 48.20196)	8	8754413	Europe	Austria
...	...	...	...	...	...	...
195	Jakarta	POINT (106.82749 -6.17247)	72	260580739	Asia	Indonesia
196	Bogota	POINT (-74.08529 4.59837)	32	47698524	South America	Colombia
197	Cairo	POINT (31.24802 30.05191)	45	97041072	Africa	Egypt
198	Tokyo	POINT (139.74946 35.68696)	79	126451398	Asia	Japan
200	Santiago	POINT (-70.66899 -33.44807)	30	17789267	South America	Chile

172 rows × 6 columns

Benchmark spatial join¶

Here is a benchmark comparing the spatial join performance of geopandas, spatialpandas between two GeoDataFrames, and spatialpandas between a DaskGeoDataFrame and a GeoDataFrame.

First, start up a local Dask cluster with 16 workers.

In [37]:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=16, threads_per_worker=1)
client = Client(cluster)

Then, replicate the cities GeoDataFrame 10,000 times to create data frames with approximately 2 million Point locations.

In [38]:

import pandas as pd
import dask.dataframe as dd
reps = 10000
# Large geopandas GeoDataFrame
cities_large_gp = pd.concat([cities_gp] * reps, axis=0)

# Large spatialpandas GeoDataFrame
cities_large_df = pd.concat([cities_df] * reps, axis=0)

# Large spatialpandas DaskGeoDataFrame with 16 partitions
cities_large_ddf = dd.from_pandas(cities_large_df, npartitions=16).persist()

# Precompute the partition-level spatial index
cities_large_ddf.partition_sindex

Out[38]:

<spatialpandas.spatialindex.rtree.HilbertRtree at 0x7f2294e7ea90>

In [39]:

print("Number of Point rows: %s" % len(cities_large_df))
print("Number of MultiPolygon rows: %s" % len(world_df))

Number of Point rows: 2020000
Number of MultiPolygon rows: 177

Then, compute the number of rows in the spatial join result.

In [40]:

%%timeit
len(gp_sjoin(cities_large_gp, world_gp))

42.2 s ± 1.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [41]:

%%timeit
len(sjoin(cities_large_df, world_df))

4.71 s ± 40.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [42]:

%%timeit
len(sjoin(cities_large_ddf, world_df))

604 ms ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Length and area calculations¶

spatialpandas provides GeoSeries length and area properties to compute the length and area of each geometry element.

In [43]:

world_df['area'] = world_df.geometry.area
world_df['length'] = world_df.geometry.length
world_df

Out[43]:

	pop_est	continent	name	geometry	area	length
0	34124811	Asia	Afghanistan	MultiPolygon([[[66.51860680528867, 37.36278432...	63.593500	46.043309
1	3047987	Europe	Albania	MultiPolygon([[[21.0200403174764, 40.842726955...	3.185163	8.146598
2	40969443	Africa	Algeria	MultiPolygon([[[-8.684399786809053, 27.3957441...	213.602772	65.867357
3	29310273	Africa	Angola	MultiPolygon([[[12.995517205465177, -4.7811032...	103.599439	54.986811
4	4050	Antarctica	Antarctica	MultiPolygon([[[-48.66061601418252, -78.047018...	6028.836194	1041.993521
...	...	...	...	...	...	...
172	603253	Africa	W. Sahara	MultiPolygon([[[-8.665589565454809, 27.6564258...	8.603984	27.662143
173	28036829	Asia	Yemen	MultiPolygon([[[52.00000980002224, 19.00000336...	38.475618	29.050429
174	15972000	Africa	Zambia	MultiPolygon([[[30.740009731422095, -8.3400059...	62.789498	44.450846
175	13805084	Africa	Zimbabwe	MultiPolygon([[[31.19140913262129, -22.2515096...	32.280371	23.632987
176	1467152	Africa	eSwatini	MultiPolygon([[[32.07166548028107, -26.7338200...	1.639983	4.763277

177 rows × 6 columns

Parquet support¶

spatialpandas GeoDataFrames containing geometry arrays can be saved and loaded from parquet files using the to_parquet and read_parquet methods from the spatialpandas.io module.

In [44]:

from spatialpandas.io import to_parquet, read_parquet

In [45]:

to_parquet(world_df, 'world.parq')

In [46]:

read_parquet('world.parq')

Out[46]:

	pop_est	continent	name	geometry	area	length
0	34124811	Asia	Afghanistan	MultiPolygon([[[66.51860680528867, 37.36278432...	63.593500	46.043309
1	3047987	Europe	Albania	MultiPolygon([[[21.0200403174764, 40.842726955...	3.185163	8.146598
2	40969443	Africa	Algeria	MultiPolygon([[[-8.684399786809053, 27.3957441...	213.602772	65.867357
3	29310273	Africa	Angola	MultiPolygon([[[12.995517205465177, -4.7811032...	103.599439	54.986811
4	4050	Antarctica	Antarctica	MultiPolygon([[[-48.66061601418252, -78.047018...	6028.836194	1041.993521
...	...	...	...	...	...	...
172	603253	Africa	W. Sahara	MultiPolygon([[[-8.665589565454809, 27.6564258...	8.603984	27.662143
173	28036829	Asia	Yemen	MultiPolygon([[[52.00000980002224, 19.00000336...	38.475618	29.050429
174	15972000	Africa	Zambia	MultiPolygon([[[30.740009731422095, -8.3400059...	62.789498	44.450846
175	13805084	Africa	Zimbabwe	MultiPolygon([[[31.19140913262129, -22.2515096...	32.280371	23.632987
176	1467152	Africa	eSwatini	MultiPolygon([[[32.07166548028107, -26.7338200...	1.639983	4.763277

177 rows × 6 columns

Dask support¶

spatialpandas geometry arrays can be processed in parallel by Dask using the DaskGeoSeries and DaskGeoDataFrame classes from the spatialpandas.dask module.

Construction¶

Dask will automatically create a DaskGeoDataFrame when a GeoDataFrame is passed to the dask.dataframe.from_pandas function.

In [47]:

import dask.dataframe as dd
world_ddf = dd.from_pandas(world_df, npartitions=20)
print(type(world_ddf))
world_ddf

<class 'spatialpandas.dask.DaskGeoDataFrame'>

Out[47]:

Dask DataFrame Structure:

	pop_est	continent	name	geometry	area	length
npartitions=20
0	int64	object	object	multipolygon[float64]	float64	float64
9	...	...	...	...	...	...
...	...	...	...	...	...	...
171	...	...	...	...	...	...
176	...	...	...	...	...	...

Dask Name: from_pandas, 20 tasks

Pack partitions¶

Dask parallelizes data frame operations by grouping collections of the DataFrame's rows into separate partitions. In addition, the DaskGeoDataFrame keeps track of the spatial extent of all geometry objects in each partition. Many spatial operations, including spatial indexing with cx, can be performed more efficiently if geometry objects in each partition are close together, and if there is minimal overlap in the spatial extent of different partitions.

The DaskGeoDataFrame.pack_partitions method can be used to create a new data frame with spatially optimized partitions using a Hilbert R-tree packing method. This operation involves a full shuffle of the Dask data frame and can be computationally expensive, so it's a good idea to save the resulting packed DaskGeoDataFrame to a parquet file for later use. See the Distributed parquet support section below for more information.

To demonstrate the effect of packing, we will plot the countries dataset using Datashader before and after packing. In each case, the dataset is grouped into 20 partitions. The plot_partitions helper function below plots the dataset, with each country colored by its partition.

The original world_df GeoDataFrame that was used to create the world_ddf DaskGeoDataFrame was sorted in alphabetical order by country name. Not surprisingly, this ordering does a poor job of grouping nearby countries into the same partition. Instead, we see that countries of each color (i.e. partition) are scattered across the world.

In [48]:

import numpy as np
import pandas as pd
def plot_partitions(ddf):
    # Get divisions array
    divs = np.array(ddf.divisions)[:-1]
    
    # Add categorical "partition" column
    ddf2 = ddf.map_partitions(
        lambda df: df.assign(
            partition=pd.Categorical(np.searchsorted(divs, df.index, side="right"))
        )
    ).compute()
    
    # Create Datashader image, coloring countries by partition
    cvs = ds.Canvas(plot_width=650, plot_height=400)
    agg = cvs.polygons(ddf2, geometry='geometry', agg=ds.count_cat('partition'))
    return ds.transfer_functions.shade(agg)

In [49]:

plot_partitions(world_ddf)

Out[49]:

After packing, however, the partitions contain countries that are much more localized.

In [50]:

world_ddf_packed = world_ddf.pack_partitions(npartitions=20)
plot_partitions(world_ddf_packed)

Out[50]:

Distributed spatial selection¶

Like GeoDataFrame, the DaskGeoDataFrame class supports spatial selection using the cx indexer.

Here is an example of using the cx indexer to select the countries in the southern region of Africa.

In [51]:

southern_aftrica = world_ddf.cx[-15:40, -60:0]
southern_aftrica.geometry.compute().to_geopandas().plot()

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f212a0314a8>

Distributed parquet support¶

DaskGeoDataFrame objects can be written to parquet files using the to_parquet method and read using the spatialpandas.io.read_parquet_dask function.

The DaskGeoDataFrame.to_parquet method stores the bounding box extents of each partition as parquet metadata, and the read_parquet_dask function loads this metadata. This makes it possible to perform efficient spatial queries on datasets without the need to first load all of the partitions into memory to compute their spatial extents.

In [52]:

from spatialpandas.io import read_parquet_dask
world_ddf_packed.to_parquet('world_packed.parq')

In [53]:

world_packed_loaded = read_parquet_dask('world_packed.parq')
world_packed_loaded

Out[53]:

Dask DataFrame Structure:

	pop_est	continent	name	geometry	area	length
npartitions=20
	int64	object	object	multipolygon[float64]	float64	float64
	...	...	...	...	...	...
...	...	...	...	...	...	...
	...	...	...	...	...	...
	...	...	...	...	...	...

Dask Name: from-delayed, 40 tasks