Pandas and Dask extensions for vectorized spatial and geometric operations using numba, initially focused on the needs of data visualization libraries.
The goal of spatialpandas is to provide a foundation for implementing custom vectorized geometric algorithms using numba or cython. The initial motivation for the project was to support data visualization libraries. Datashader, for example, builds on spatialpandas to support vectorized polygon rendering using numba.
This project shares some of the goals (and some of the API) of the geopandas
project. However, unlike geopandas
, spatialpandas
is not focused on geographic functionality. Instead, this project is focused on geometric use-cases and is explicitly independent of what space these spatial data structures and operations cover. As such, spatialpandas
has no dependencies on the traditional geographic stack (GDAL, fiona, GEOS, etc.). Instead, it is built solely on top of general-purpose SciPy and PyData libraries and tools, which means that is should be easily installable using Conda or pip without any trickly binary-compatibility issues.
All algorithms are implemented in pure Python, which is then compiled to machine code at runtime with Numba. To support scaling to large problems, operations can run distributed and/or out of core using Dask.
In the future, this library could grow to support most of the geometric operations offered by geopandas
. It could also be extended to include 3-dimensional geometric primitives and operations.
spatialpandas
provides pandas extension arrays for storing arrays of geometry objects as DataFrame columns. Unlike geopandas
, where a geometry column may contain a mix of geometry types, all elements of a spatialpandas
geometry array must be of the same geometry type.
from spatialpandas.geometry import (
PointArray, MultiPointArray, LineArray,
MultiLineArray, PolygonArray, MultiPolygonArray
)
The PointArray
stores the coordinates of a single point per element. A PointArray
can be constructed from several data structures:
A 2D list or array with two columns, the first column containing the x coordinates and the second the y coordinates.
point_array = PointArray([
[1, 2],
[3, 4],
[5, 6],
])
point_array
<PointArray> [Point([1, 2]), Point([3, 4]), Point([5, 6])] Length: 3, dtype: point[int64]
A 1D list or array of the interleaved x and y coordinates of each point
point_array = PointArray([
1, 2,
3, 4,
5, 6,
])
point_array
<PointArray> [Point([1, 2]), Point([3, 4]), Point([5, 6])] Length: 3, dtype: point[int64]
A tuple of two 1D lists or arrays, the first containing the x coordinates and the second the y coordinates.
point_array = PointArray(
([1, 3, 5], [2, 4, 6])
)
point_array
<PointArray> [Point([1, 2]), Point([3, 4]), Point([5, 6])] Length: 3, dtype: point[int64]
The x and y coordinates of a PointArray
can be accessed using the x
and y
properties.
point_array.x
array([1., 3., 5.])
point_array.y
array([2., 4., 6.])
All coordinates can be accessed as a single array of interleaved x and y coordinates using the flat_values
property.
point_array.flat_values
array([1, 2, 3, 4, 5, 6])
A MultiPointArray
stores the coordinates of zero or more points per element. A MultiPointArray
is constructed from a list of lists, where the inner lists contain the interleaved x and y coordinates of the points in that element.
multipoint_array = MultiPointArray([
[1, 2, 3, 4, 5, 6],
[7, 8, 9, 10],
])
multipoint_array
<MultiPointArray> [MultiPoint([1, 2, 3, 4, 5, 6]), MultiPoint([7, 8, 9, 10])] Length: 2, dtype: multipoint[int64]
All of the coordinates can be accessed as a single flat array of interleaved x and y coordinates using the buffer_values
property.
multipoint_array.buffer_values
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
A length one tuple with an array of indices into buffer_values
that separate the coordinates of individual elements is available as the buffer_offsets
property.
multipoint_array.buffer_offsets
(array([ 0, 6, 10], dtype=uint32),)
A LineArray
stores the coordinates of one line per element. Like the MultiPointArray
, a LineArray
is constructed from a list of lists, where the inner lists contain the interleaved x and y coordinates of the vertices of the line in that element.
line_array = LineArray([
[1, 2, 3, 4, 5, 6],
[7, 8, 9, 10],
])
line_array
<LineArray> [Line([1, 2, 3, 4, 5, 6]), Line([7, 8, 9, 10])] Length: 2, dtype: line[int64]
A MultiLineArray
stores the coordinates of zero or more lines per element. A MultiLineArray
is constructed from a doubly nested list of the interleaved x and y coordinates.
multiline_array = MultiLineArray([
[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]],
[[11, 12]]
])
multiline_array
<MultiLineArray> [MultiLine([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]), MultiLine([[11, 12]])] Length: 2, dtype: multiline[int64]
All of the coordinates can be accessed as a single flat array of interleaved x and y coordinates using the buffer_values
property.
multiline_array.buffer_values
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
A tuple of arrays of indices into buffer_values
that separate the coordinates of individual lines and multi-line elements are available as the buffer_offsets
property.
multiline_array.buffer_offsets
(array([0, 2, 3], dtype=uint32), array([ 0, 6, 10, 12], dtype=uint32))
A PolygonArray
stores the coordinates of one polygon (with zero or more holes) per element. Like the MultiLineArray
, a PolygonArray
is constructed from a doubly nested list of interleaved x and y coordinates. Each inner list contains the interleaved x and y coordinates of a closed ring. The first ring of each element represents the filled polygon outline and should be in a counter-clockwise order. The following rings, if any, represent holes in the polygon and should be in a clockwise order.
# Square from (0, 0) to (1, 1) in CCW order
outline0 = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
# Square from (2, 2) to (5, 5) in CCW order
outline1 = [2, 2, 5, 2, 5, 5, 2, 5, 2, 2]
# Triangle hole in CW order
hole1 = [3, 3, 4, 3, 3, 4, 3, 3]
polygon_array = PolygonArray([
[outline0],
[outline1, hole1]
])
polygon_array
<PolygonArray> [Polygon([[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]]), Polygon([[2, 2, 5, 2, 5, 5, 2, 5, 2, 2], [3, 3, 4, 3, 3, 4, 3, 3]])] Length: 2, dtype: polygon[int64]
All of the coordinates can be accessed as a single flat array of interleaved x and y coordinates using the buffer_values
property.
polygon_array.buffer_values
array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 2, 2, 5, 2, 5, 5, 2, 5, 2, 2, 3, 3, 4, 3, 3, 4, 3, 3])
A tuple of arrays of indices into buffer_values
that separate the coordinates of individual rings and polygon elements is available as the buffer_offsets
property.
polygon_array.buffer_offsets
(array([0, 1, 3], dtype=uint32), array([ 0, 10, 20, 28], dtype=uint32))
A MultiPolygonArray
stores the coordinates of one or more multi-polygons per element, where a multi-polygon consists of zero or more polygons that may each have zero or more holes. The MultiPolygonArray
adds one additional level of nesting compared to the PolygonArray
.
# Triangle hole in CW order
outline3 = [3, 1, 4, 1, 3, 2, 3, 1]
multipolygon_array = MultiPolygonArray([
[[outline0], [outline1, hole1]],
[[outline3]]
])
multipolygon_array
<MultiPolygonArray> [MultiPolygon([[[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]], [[2, 2, 5, 2, 5, 5, 2, 5, 2, 2], [3, 3, 4, 3, 3, 4, 3, 3]]]), MultiPolygon([[[3, 1, 4, 1, 3, 2, 3, 1]]])] Length: 2, dtype: multipolygon[int64]
multipolygon_array.buffer_values
array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 2, 2, 5, 2, 5, 5, 2, 5, 2, 2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 1, 4, 1, 3, 2, 3, 1])
multipolygon_array.buffer_offsets
(array([0, 2, 3], dtype=uint32), array([0, 1, 3, 4], dtype=uint32), array([ 0, 10, 20, 28, 36], dtype=uint32))
While spatialpandas
geometry arrays can be stored in standard pandas Series
and DataFrame
objects, additional functionality is provided by the GeoSeries
and GeoDataFrame
subclasses. These classes are designed to implement a subset of the API of the GeoSeries
and GeoDataFrame
classes provided by the geopandas
library.
from spatialpandas import GeoSeries, GeoDataFrame
geo_series = GeoSeries(multipoint_array)
geo_series
0 MultiPoint([1, 2, 3, 4, 5, 6]) 1 MultiPoint([7, 8, 9, 10]) dtype: multipoint[int64]
geo_df = GeoDataFrame({
'multipoint': multiline_array,
'line': line_array,
'multiline': multiline_array,
'polygon': polygon_array,
'multipolygon': multipolygon_array
})
geo_df
multipoint | line | multiline | polygon | multipolygon | |
---|---|---|---|---|---|
0 | MultiLine([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]) | Line([1, 2, 3, 4, 5, 6]) | MultiLine([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]) | Polygon([[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]]) | MultiPolygon([[[0, 0, 1, 0, 1, 1, 0, 1, 0, 0]]... |
1 | MultiLine([[11, 12]]) | Line([7, 8, 9, 10]) | MultiLine([[11, 12]]) | Polygon([[2, 2, 5, 2, 5, 5, 2, 5, 2, 2], [3, 3... | MultiPolygon([[[3, 1, 4, 1, 3, 2, 3, 1]]]) |
GeoSeries
and GeoDataFrame
objects from geopandas
can be converted to spatialpandas
objects by passing them to the constructor of the spatialpandas
equivalent.
Here is an example of loading the naturalearth_lowres
dataset using geopandas
as a GeoDataFrame
, and then converting it to a spatialpandas
GeoDataFrame
.
Note: The following examples require the
geopandas
,matplotlib
,descartes
,datashader
, andholoviews
packages, which are not automatically installed withspatialpandas
. These can be installed using pip...
$ pip install geopandas matplotlib descartes datashader holoviews
or conda...
$ conda install -c pyviz geopandas matplotlib descartes datashader holoviews
import geopandas
from download import download_map
world_gp = geopandas.read_file(
download_map('naturalearth_lowres')
)
world_gp = world_gp.sort_values('name').reset_index(drop=True)
world_gp = world_gp[['pop_est', 'continent', 'name', 'geometry']]
world_gp
pop_est | continent | name | geometry | |
---|---|---|---|---|
0 | 34124811 | Asia | Afghanistan | POLYGON ((66.51861 37.36278, 67.07578 37.35614... |
1 | 3047987 | Europe | Albania | POLYGON ((21.02004 40.84273, 20.99999 40.58000... |
2 | 40969443 | Africa | Algeria | POLYGON ((-8.68440 27.39574, -8.66512 27.58948... |
3 | 29310273 | Africa | Angola | MULTIPOLYGON (((12.99552 -4.78110, 12.63161 -4... |
4 | 4050 | Antarctica | Antarctica | MULTIPOLYGON (((-48.66062 -78.04702, -48.15140... |
... | ... | ... | ... | ... |
172 | 603253 | Africa | W. Sahara | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
173 | 28036829 | Asia | Yemen | POLYGON ((52.00001 19.00000, 52.78218 17.34974... |
174 | 15972000 | Africa | Zambia | POLYGON ((30.74001 -8.34001, 31.15775 -8.59458... |
175 | 13805084 | Africa | Zimbabwe | POLYGON ((31.19141 -22.25151, 30.65987 -22.151... |
176 | 1467152 | Africa | eSwatini | POLYGON ((32.07167 -26.73382, 31.86806 -27.177... |
177 rows × 4 columns
world_df = GeoDataFrame(world_gp)
world_df
pop_est | continent | name | geometry | |
---|---|---|---|---|
0 | 34124811 | Asia | Afghanistan | MultiPolygon([[[66.51860680528867, 37.36278432... |
1 | 3047987 | Europe | Albania | MultiPolygon([[[21.0200403174764, 40.842726955... |
2 | 40969443 | Africa | Algeria | MultiPolygon([[[-8.684399786809053, 27.3957441... |
3 | 29310273 | Africa | Angola | MultiPolygon([[[12.995517205465177, -4.7811032... |
4 | 4050 | Antarctica | Antarctica | MultiPolygon([[[-48.66061601418252, -78.047018... |
... | ... | ... | ... | ... |
172 | 603253 | Africa | W. Sahara | MultiPolygon([[[-8.665589565454809, 27.6564258... |
173 | 28036829 | Asia | Yemen | MultiPolygon([[[52.00000980002224, 19.00000336... |
174 | 15972000 | Africa | Zambia | MultiPolygon([[[30.740009731422095, -8.3400059... |
175 | 13805084 | Africa | Zimbabwe | MultiPolygon([[[31.19140913262129, -22.2515096... |
176 | 1467152 | Africa | eSwatini | MultiPolygon([[[32.07166548028107, -26.7338200... |
177 rows × 4 columns
A spatialpandas
GeoDataFrame
or GeoSeries
can be converted to geopandas
using the to_geopandas
method.
world_df.iloc[[4, 2, 0]].to_geopandas()
pop_est | continent | name | geometry | |
---|---|---|---|---|
4 | 4050 | Antarctica | Antarctica | MULTIPOLYGON (((-48.66062 -78.04702, -48.66062... |
2 | 40969443 | Africa | Algeria | MULTIPOLYGON (((-8.68440 27.39574, -4.92334 24... |
0 | 34124811 | Asia | Afghanistan | MULTIPOLYGON (((66.51861 37.36278, 66.21738 37... |
world_df.geometry[0].to_shapely()
A spatialpandas
GeoSeries
can be converted to a geopandas
GeoSeries
using the to_geopandas
method, and geopandas
GeoSeries
objects can be visualized with matplotlib
using the plot
method.
world_df.geometry.to_geopandas().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f22a53bd588>
A spatialpandas
GeoDataFrame
can be visualized with datashader
. Here is an example of visualizing a multi-polygon GeoSeries
using datashader
. In this case, the polygons are shaded to reflect the estimated population of each country.
import datashader as ds
cvs = ds.Canvas(plot_width=650, plot_height=400)
agg = cvs.polygons(world_df, geometry='geometry', agg=ds.mean('pop_est'))
ds.transfer_functions.shade(agg)
By combining datashader
and a HoloViews
DynamicMap
, spatialpandas
geometry arrays can be rendered interactively.
import holoviews as hv
hv.extension("bokeh")
def callback(x_range, y_range):
cvs = ds.Canvas(plot_width=650, plot_height=400, x_range=x_range, y_range=y_range)
agg = cvs.polygons(world_df, geometry='geometry', agg=ds.mean('pop_est'))
return hv.Image(agg).opts(cmap="viridis").opts(width=650, height=400)
hv.DynamicMap(
callback, streams=[hv.streams.RangeXY()]
)
As with geopandas
, the spatialpandas
GeoSeries
and GeoDataFrame
provide a spatial indexer, cx
, that can be used to select the entries that reside in a particular spatial region.
Here is an example of using the cx
indexer to select those countries that reside, at least partially, in the southern hemisphere.
selected_world_df = world_df.geometry.cx[:, :0]
selected_world_df.to_geopandas().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f226c3c7e80>
Spatialpandas has limited support for computing intersections between shapes. Currently, intersects
can be called on Point
arrays and can used to compare to any other scalar geometry type.
For this example, load the naturalearth_cities
dataset included with geopandas. This dataset contains the name and location of one city per row. The intersection calculation can be used to identify which cities intersect with the a particular country polygon.
cities_gp = geopandas.read_file(
download_map('naturalearth_cities')
)
cities_df = GeoDataFrame(cities_gp)
cities_df.head()
name | geometry | |
---|---|---|
0 | Vatican City | Point([12.453386544971766, 41.903282179960115]) |
1 | San Marino | Point([12.441770157800141, 43.936095834768004]) |
2 | Vaduz | Point([9.516669472907267, 47.13372377429357]) |
3 | Luxembourg | Point([6.130002806227083, 49.611660379121076]) |
4 | Palikir | Point([158.1499743237623, 6.916643696007725]) |
country_ind = 0
country_polygon = world_df.geometry.iloc[country_ind]
print("Cities in %s" % world_df.name.iloc[country_ind])
cities_df[cities_df.geometry.intersects(country_polygon)]
Cities in Afghanistan
name | geometry | |
---|---|---|
183 | Kabul | Point([69.18131419070505, 34.51863614490031]) |
Spatialpandas has limited spatial join support using the sjoin
function. Currently, the left_df
argument to sjoin
must be a GeoDataFrame
or DaskGeoDataFrame
with an active Point
geometry column. The right_df
argument must be a GeoDataFrame
(not DaskGeoDataFrame
), but it may have any geometry type.
Here is an example of using sjoin
to join the cities_df
and world_df
GeoDataFrames
on the intersection of the city Point
in cities_df
and the country MultiPolygon
in world_df
.
from spatialpandas import sjoin
sjoin(cities_df, world_df)
name_left | geometry | index_right | pop_est | continent | name_right | |
---|---|---|---|---|---|---|
0 | Vatican City | Point([12.453386544971766, 41.903282179960115]) | 77 | 62137802 | Europe | Italy |
1 | San Marino | Point([12.441770157800141, 43.936095834768004]) | 77 | 62137802 | Europe | Italy |
192 | Rome | Point([12.481312562873995, 41.89790148509894]) | 77 | 62137802 | Europe | Italy |
2 | Vaduz | Point([9.516669472907267, 47.13372377429357]) | 8 | 8754413 | Europe | Austria |
184 | Vienna | Point([16.364693096743736, 48.20196113681686]) | 8 | 8754413 | Europe | Austria |
... | ... | ... | ... | ... | ... | ... |
195 | Jakarta | Point([106.82749176247012, -6.172471846798885]) | 72 | 260580739 | Asia | Indonesia |
196 | Bogota | Point([-74.08528981377441, 4.598369421147822]) | 32 | 47698524 | South America | Colombia |
197 | Cairo | Point([31.248022361126118, 30.051906205103705]) | 45 | 97041072 | Africa | Egypt |
198 | Tokyo | Point([139.74946157054467, 35.686962764371174]) | 79 | 126451398 | Asia | Japan |
200 | Santiago | Point([-70.66898671317483, -33.448067956934096]) | 30 | 17789267 | South America | Chile |
172 rows × 6 columns
For comparison, here is the equivalent spatial join performed using geopandas
from geopandas import sjoin as gp_sjoin
gp_sjoin(cities_gp, world_gp)
name_left | geometry | index_right | pop_est | continent | name_right | |
---|---|---|---|---|---|---|
0 | Vatican City | POINT (12.45339 41.90328) | 77 | 62137802 | Europe | Italy |
1 | San Marino | POINT (12.44177 43.93610) | 77 | 62137802 | Europe | Italy |
192 | Rome | POINT (12.48131 41.89790) | 77 | 62137802 | Europe | Italy |
2 | Vaduz | POINT (9.51667 47.13372) | 8 | 8754413 | Europe | Austria |
184 | Vienna | POINT (16.36469 48.20196) | 8 | 8754413 | Europe | Austria |
... | ... | ... | ... | ... | ... | ... |
195 | Jakarta | POINT (106.82749 -6.17247) | 72 | 260580739 | Asia | Indonesia |
196 | Bogota | POINT (-74.08529 4.59837) | 32 | 47698524 | South America | Colombia |
197 | Cairo | POINT (31.24802 30.05191) | 45 | 97041072 | Africa | Egypt |
198 | Tokyo | POINT (139.74946 35.68696) | 79 | 126451398 | Asia | Japan |
200 | Santiago | POINT (-70.66899 -33.44807) | 30 | 17789267 | South America | Chile |
172 rows × 6 columns
Here is a benchmark comparing the spatial join performance of geopandas
, spatialpandas
between two GeoDataFrames
, and spatialpandas
between a DaskGeoDataFrame
and a GeoDataFrame
.
First, start up a local Dask cluster with 16 workers.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=16, threads_per_worker=1)
client = Client(cluster)
Then, replicate the cities GeoDataFrame
10,000 times to create data frames with approximately 2 million Point
locations.
import pandas as pd
import dask.dataframe as dd
reps = 10000
# Large geopandas GeoDataFrame
cities_large_gp = pd.concat([cities_gp] * reps, axis=0)
# Large spatialpandas GeoDataFrame
cities_large_df = pd.concat([cities_df] * reps, axis=0)
# Large spatialpandas DaskGeoDataFrame with 16 partitions
cities_large_ddf = dd.from_pandas(cities_large_df, npartitions=16).persist()
# Precompute the partition-level spatial index
cities_large_ddf.partition_sindex
<spatialpandas.spatialindex.rtree.HilbertRtree at 0x7f2294e7ea90>
print("Number of Point rows: %s" % len(cities_large_df))
print("Number of MultiPolygon rows: %s" % len(world_df))
Number of Point rows: 2020000 Number of MultiPolygon rows: 177
Then, compute the number of rows in the spatial join result.
%%timeit
len(gp_sjoin(cities_large_gp, world_gp))
42.2 s ± 1.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
len(sjoin(cities_large_df, world_df))
4.71 s ± 40.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
len(sjoin(cities_large_ddf, world_df))
604 ms ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
spatialpandas
provides GeoSeries
length
and area
properties to compute the length and area of each geometry element.
world_df['area'] = world_df.geometry.area
world_df['length'] = world_df.geometry.length
world_df
pop_est | continent | name | geometry | area | length | |
---|---|---|---|---|---|---|
0 | 34124811 | Asia | Afghanistan | MultiPolygon([[[66.51860680528867, 37.36278432... | 63.593500 | 46.043309 |
1 | 3047987 | Europe | Albania | MultiPolygon([[[21.0200403174764, 40.842726955... | 3.185163 | 8.146598 |
2 | 40969443 | Africa | Algeria | MultiPolygon([[[-8.684399786809053, 27.3957441... | 213.602772 | 65.867357 |
3 | 29310273 | Africa | Angola | MultiPolygon([[[12.995517205465177, -4.7811032... | 103.599439 | 54.986811 |
4 | 4050 | Antarctica | Antarctica | MultiPolygon([[[-48.66061601418252, -78.047018... | 6028.836194 | 1041.993521 |
... | ... | ... | ... | ... | ... | ... |
172 | 603253 | Africa | W. Sahara | MultiPolygon([[[-8.665589565454809, 27.6564258... | 8.603984 | 27.662143 |
173 | 28036829 | Asia | Yemen | MultiPolygon([[[52.00000980002224, 19.00000336... | 38.475618 | 29.050429 |
174 | 15972000 | Africa | Zambia | MultiPolygon([[[30.740009731422095, -8.3400059... | 62.789498 | 44.450846 |
175 | 13805084 | Africa | Zimbabwe | MultiPolygon([[[31.19140913262129, -22.2515096... | 32.280371 | 23.632987 |
176 | 1467152 | Africa | eSwatini | MultiPolygon([[[32.07166548028107, -26.7338200... | 1.639983 | 4.763277 |
177 rows × 6 columns
spatialpandas
GeoDataFrames
containing geometry arrays can be saved and loaded from parquet
files using the to_parquet
and read_parquet
methods from the spatialpandas.io
module.
from spatialpandas.io import to_parquet, read_parquet
to_parquet(world_df, 'world.parq')
read_parquet('world.parq')
pop_est | continent | name | geometry | area | length | |
---|---|---|---|---|---|---|
0 | 34124811 | Asia | Afghanistan | MultiPolygon([[[66.51860680528867, 37.36278432... | 63.593500 | 46.043309 |
1 | 3047987 | Europe | Albania | MultiPolygon([[[21.0200403174764, 40.842726955... | 3.185163 | 8.146598 |
2 | 40969443 | Africa | Algeria | MultiPolygon([[[-8.684399786809053, 27.3957441... | 213.602772 | 65.867357 |
3 | 29310273 | Africa | Angola | MultiPolygon([[[12.995517205465177, -4.7811032... | 103.599439 | 54.986811 |
4 | 4050 | Antarctica | Antarctica | MultiPolygon([[[-48.66061601418252, -78.047018... | 6028.836194 | 1041.993521 |
... | ... | ... | ... | ... | ... | ... |
172 | 603253 | Africa | W. Sahara | MultiPolygon([[[-8.665589565454809, 27.6564258... | 8.603984 | 27.662143 |
173 | 28036829 | Asia | Yemen | MultiPolygon([[[52.00000980002224, 19.00000336... | 38.475618 | 29.050429 |
174 | 15972000 | Africa | Zambia | MultiPolygon([[[30.740009731422095, -8.3400059... | 62.789498 | 44.450846 |
175 | 13805084 | Africa | Zimbabwe | MultiPolygon([[[31.19140913262129, -22.2515096... | 32.280371 | 23.632987 |
176 | 1467152 | Africa | eSwatini | MultiPolygon([[[32.07166548028107, -26.7338200... | 1.639983 | 4.763277 |
177 rows × 6 columns
spatialpandas
geometry arrays can be processed in parallel by Dask using the DaskGeoSeries
and DaskGeoDataFrame
classes from the spatialpandas.dask
module.
Dask will automatically create a DaskGeoDataFrame
when a GeoDataFrame
is passed to the dask.dataframe.from_pandas
function.
import dask.dataframe as dd
world_ddf = dd.from_pandas(world_df, npartitions=20)
print(type(world_ddf))
world_ddf
<class 'spatialpandas.dask.DaskGeoDataFrame'>
pop_est | continent | name | geometry | area | length | |
---|---|---|---|---|---|---|
npartitions=20 | ||||||
0 | int64 | object | object | multipolygon[float64] | float64 | float64 |
9 | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... |
171 | ... | ... | ... | ... | ... | ... |
176 | ... | ... | ... | ... | ... | ... |
Dask parallelizes data frame operations by grouping collections of the DataFrame's rows into separate partitions. In addition, the DaskGeoDataFrame
keeps track of the spatial extent of all geometry objects in each partition. Many spatial operations, including spatial indexing with cx
, can be performed more efficiently if geometry objects in each partition are close together, and if there is minimal overlap in the spatial extent of different partitions.
The DaskGeoDataFrame.pack_partitions
method can be used to create a new data frame with spatially optimized partitions using a Hilbert R-tree packing method. This operation involves a full shuffle of the Dask data frame and can be computationally expensive, so it's a good idea to save the resulting packed DaskGeoDataFrame
to a parquet file for later use. See the Distributed parquet support section below for more information.
To demonstrate the effect of packing, we will plot the countries dataset using Datashader before and after packing. In each case, the dataset is grouped into 20 partitions. The plot_partitions
helper function below plots the dataset, with each country colored by its partition.
The original world_df
GeoDataFrame
that was used to create the world_ddf
DaskGeoDataFrame
was sorted in alphabetical order by country name. Not surprisingly, this ordering does a poor job of grouping nearby countries into the same partition. Instead, we see that countries of each color (i.e. partition) are scattered across the world.
import numpy as np
import pandas as pd
def plot_partitions(ddf):
# Get divisions array
divs = np.array(ddf.divisions)[:-1]
# Add categorical "partition" column
ddf2 = ddf.map_partitions(
lambda df: df.assign(
partition=pd.Categorical(np.searchsorted(divs, df.index, side="right"))
)
).compute()
# Create Datashader image, coloring countries by partition
cvs = ds.Canvas(plot_width=650, plot_height=400)
agg = cvs.polygons(ddf2, geometry='geometry', agg=ds.count_cat('partition'))
return ds.transfer_functions.shade(agg)
plot_partitions(world_ddf)
After packing, however, the partitions contain countries that are much more localized.
world_ddf_packed = world_ddf.pack_partitions(npartitions=20)
plot_partitions(world_ddf_packed)
Like GeoDataFrame
, the DaskGeoDataFrame
class supports spatial selection using the cx
indexer.
Here is an example of using the cx
indexer to select the countries in the southern region of Africa.
southern_aftrica = world_ddf.cx[-15:40, -60:0]
southern_aftrica.geometry.compute().to_geopandas().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f212a0314a8>
DaskGeoDataFrame
objects can be written to parquet files using the to_parquet
method and read using the spatialpandas.io.read_parquet_dask
function.
The DaskGeoDataFrame.to_parquet
method stores the bounding box extents of each partition as parquet metadata, and the read_parquet_dask
function loads this metadata. This makes it possible to perform efficient spatial queries on datasets without the need to first load all of the partitions into memory to compute their spatial extents.
from spatialpandas.io import read_parquet_dask
world_ddf_packed.to_parquet('world_packed.parq')
world_packed_loaded = read_parquet_dask('world_packed.parq')
world_packed_loaded
pop_est | continent | name | geometry | area | length | |
---|---|---|---|---|---|---|
npartitions=20 | ||||||
int64 | object | object | multipolygon[float64] | float64 | float64 | |
... | ... | ... | ... | ... | ... | |
... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | |
... | ... | ... | ... | ... | ... |