Spatial joins¶

Goals of this notebook:

Based on the countries and cities dataframes, determine for each city the country in which it is located.
To solve this problem, we will use the the concept of a 'spatial join' operation: combining information of geospatial datasets based on their spatial relationship.

In [1]:

%matplotlib inline

import pandas as pd
import geopandas

pd.options.display.max_rows = 10

In [2]:

countries = geopandas.read_file("zip://./data/ne_110m_admin_0_countries.zip")
cities = geopandas.read_file("zip://./data/ne_110m_populated_places.zip")
rivers = geopandas.read_file("zip://./data/ne_50m_rivers_lake_centerlines.zip")

Recap - joining dataframes¶

Pandas provides functionality to join or merge dataframes in different ways, see https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/ for an overview and https://pandas.pydata.org/pandas-docs/stable/merging.html for the full documentation.

To illustrate the concept of joining the information of two dataframes with pandas, let's take a small subset of our cities and countries datasets:

In [3]:

cities2 = cities[cities['name'].isin(['Bern', 'Brussels', 'London', 'Paris'])].copy()
cities2['iso_a3'] = ['CHE', 'BEL', 'GBR', 'FRA']

In [4]:

cities2

Out[4]:

	name	geometry	iso_a3
26	Bern	POINT (7.466975462482424 46.91668275866772)	CHE
170	Brussels	POINT (4.33137074969045 50.83526293533032)	BEL
219	London	POINT (-0.1186677024759319 51.5019405883275)	GBR
235	Paris	POINT (2.33138946713035 48.86863878981461)	FRA

In [5]:

countries2 = countries[['iso_a3', 'name', 'continent']]
countries2.head()

Out[5]:

	iso_a3	name	continent
0	AFG	Afghanistan	Asia
1	AGO	Angola	Africa
2	ALB	Albania	Europe
3	ARE	United Arab Emirates	Asia
4	ARG	Argentina	South America

We added a 'iso_a3' column to the cities dataset, indicating a code of the country of the city. This country code is also present in the countries dataset, which allows us to merge those two dataframes based on the common column.

Joining the cities dataframe with countries will transfer extra information about the countries (the full name, the continent) to the cities dataframe, based on a common key:

In [6]:

cities2.merge(countries2, on='iso_a3')

Out[6]:

	name_x	geometry	iso_a3	name_y	continent
0	Bern	POINT (7.466975462482424 46.91668275866772)	CHE	Switzerland	Europe
1	Brussels	POINT (4.33137074969045 50.83526293533032)	BEL	Belgium	Europe
2	London	POINT (-0.1186677024759319 51.5019405883275)	GBR	United Kingdom	Europe
3	Paris	POINT (2.33138946713035 48.86863878981461)	FRA	France	Europe

But, for this illustrative example, we added the common column manually, it is not present in the original dataset. However, we can still know how to join those two datasets based on their spatial coordinates.

Recap - spatial relationships between objects¶

In the previous notebook 02-spatial-relationships.ipynb, we have seen the notion of spatial relationships between geometry objects: within, contains, intersects, ...

In this case, we know that each of the cities is located within one of the countries, or the other way around that each country can contain multiple cities.

We can test such relationships using the methods we have seen in the previous notebook:

In [7]:

france = countries.loc[countries['name'] == 'France', 'geometry'].squeeze()

In [8]:

cities.within(france)

Out[8]:

0      False
1      False
2      False
3      False
4      False
       ...  
238    False
239    False
240    False
241    False
242    False
Length: 243, dtype: bool

The above gives us a boolean series, indicating for each point in our cities dataframe whether it is located within the area of France or not.
Because this is a boolean series as result, we can use it to filter the original dataframe to only show those cities that are actually within France:

In [9]:

cities[cities.within(france)]

Out[9]:

	name	geometry
10	Monaco	POINT (7.406913173465057 43.73964568785249)
13	Andorra	POINT (1.51648596050552 42.5000014435459)
186	Geneva	POINT (6.140028034091699 46.21000754707626)
235	Paris	POINT (2.33138946713035 48.86863878981461)

We could now repeat the above analysis for each of the countries, and add a column to the cities dataframe indicating this country. However, that would be tedious to do manually, and is also exactly what the spatial join operation provides us.

(note: the above result is incorrect, but this is just because of the coarse-ness of the countries dataset)

Spatial join operation¶

SPATIAL JOIN = transferring attributes from one layer to another based on their spatial relationship

Different parts of this operations:

The GeoDataFrame to which we want add information
The GeoDataFrame that contains the information we want to add
The spatial relationship we want to use to match both datasets ('intersects', 'contains', 'within')
The type of join: left or inner join

In this case, we want to join the cities dataframe with the information of the countries dataframe, based on the spatial relationship between both datasets.

We use the geopandas.sjoin function:

In [10]:

joined = geopandas.sjoin(cities, countries, op='within', how='left')

In [11]:

joined

Out[11]:

	name_left	geometry	index_right	iso_a3	name_right	continent	pop_est	gdp_md_est
0	Vatican City	POINT (12.45338654497177 41.90328217996012)	79.0	ITA	Italy	Europe	6.213780e+07	2221000.0
1	San Marino	POINT (12.44177015780014 43.936095834768)	79.0	ITA	Italy	Europe	6.213780e+07	2221000.0
2	Vaduz	POINT (9.516669472907267 47.13372377429357)	9.0	AUT	Austria	Europe	8.754413e+06	416600.0
3	Lobamba	POINT (31.19999710971274 -26.46666746135247)	152.0	SWZ	Swaziland	Africa	1.467152e+06	11060.0
4	Luxembourg	POINT (6.130002806227083 49.61166037912108)	97.0	LUX	Luxembourg	Europe	5.941300e+05	58740.0
...	...	...	...	...	...	...	...	...
238	Rio de Janeiro	POINT (-43.22696665284366 -22.92307731561596)	22.0	BRA	Brazil	South America	2.073534e+08	3081000.0
239	São Paulo	POINT (-46.62696583905523 -23.55673372837896)	22.0	BRA	Brazil	South America	2.073534e+08	3081000.0
240	Sydney	POINT (151.1832339501475 -33.91806510862875)	8.0	AUS	Australia	Oceania	2.323241e+07	1189000.0
241	Singapore	POINT (103.853874819099 1.294979325105942)	111.0	MYS	Malaysia	Asia	3.138199e+07	863000.0
242	Hong Kong	POINT (114.183063458463 22.30692675357551)	30.0	CHN	China	Asia	1.379303e+09	21140000.0

243 rows × 8 columns

In [12]:

joined['continent'].value_counts()

Out[12]:

Asia             59
Africa           57
Europe           46
North America    26
South America    14
Oceania           8
Name: continent, dtype: int64

Lets's practice!¶

We will again use the Paris datasets to do some exercises. Let's start importing them again:

In [13]:

districts = geopandas.read_file("data/paris_districts_utm.geojson")
stations = geopandas.read_file("data/paris_sharing_bike_stations_utm.geojson")

EXERCISE: Make a plot of the density of bike stations by district

Determine for each bike station in which district it is located (using a spatial join!). Call the result `joined`.
Based on this result, calculate the number of bike stations in each district (e.g. using `groupby` method; you can use the `size` size method to know the size of each group).
- Make sure the result is a DataFrame called `counts` with the columns 'district_name' and 'n_bike_stations'.
- To go from a Series to a DataFrame, you can use the `reset_index` or `to_frame` method (both have a `name` keyword to specify a column name for the original Series values.
Add those counts to the original `districts` dataframe, creating a new `districts2` dataframe (tip: this is a merge operation).
Calculate a new column 'n_bike_stations_by_area'.
Make a plot showing the density in bike stations of the districts.

In [14]:

joined = geopandas.sjoin(stations, districts[['district_name', 'geometry']], op='within')

In [15]:

joined.head()

Out[15]:

	name	bike_stands	available_bikes	geometry	index_right	district_name
0	14002 - RASPAIL QUINET	44	4	POINT (450804.448740735 5409797.268203795)	52	Montparnasse
143	14112 - FAUBOURG SAINT JACQUES CASSINI	16	0	POINT (451419.446715647 5409421.528587255)	52	Montparnasse
293	14033 - DAGUERRE GASSENDI	38	1	POINT (450708.2275807534 5409406.941172979)	52	Montparnasse
346	14006 - SAINT JACQUES TOMBE ISSOIRE	22	0	POINT (451340.0264470892 5409124.574548723)	52	Montparnasse
429	14111 - DENFERT-ROCHEREAU CASSINI	24	8	POINT (451274.5111513372 5409609.730783217)	52	Montparnasse

In [16]:

# counting the number of stations in each district
counts = joined.groupby('district_name').size()
counts.head()

Out[16]:

district_name
Amérique           17
Archives            4
Arsenal             7
Arts-et-Metiers     4
Auteuil            21
dtype: int64

In [17]:

# convert the above in a DataFrame with two columns
counts = counts.reset_index(name='n_bike_stations')
counts.head()

Out[17]:

	district_name	n_bike_stations
0	Amérique	17
1	Archives	4
2	Arsenal	7
3	Arts-et-Metiers	4
4	Auteuil	21

In [18]:

# merging those counts back into the districts dataset
districts2 = districts.merge(counts, on='district_name')
districts2.head()

Out[18]:

	id	district_name	population	geometry	n_bike_stations
0	1	St-Germain-l'Auxerrois	1672	POLYGON ((451922.1333912524 5411438.484355546,...	4
1	2	Halles	8984	POLYGON ((452278.4194036503 5412160.89282334, ...	13
2	3	Palais-Royal	3195	POLYGON ((451553.8057660239 5412340.522224233,...	6
3	4	Place-Vendôme	3044	POLYGON ((451004.907944323 5412654.094913081, ...	5
4	5	Gaillon	1345	POLYGON ((451328.7522686935 5412991.278156867,...	4

In [19]:

# calculating the relative number of bike stations by area
districts2['n_bike_stations_by_area'] = (
    districts2['n_bike_stations'] / districts2.geometry.area)

In [20]:

ax = districts2.plot(column='n_bike_stations_by_area',
                    figsize=(15, 6))
ax.set_axis_off()

The overlay operation¶

In the spatial join operation above, we are not changing the geometries itself. We are not joining geometries, but joining attributes based on a spatial relationship between the geometries. This also means that the geometries need to at least overlap partially.

If you want to create new geometries based on joining (combining) geometries of different dataframes into one new dataframe (eg by taking the intersection of the geometries), you want an overlay operation.

In [21]:

africa = countries[countries['continent'] == 'Africa']

In [22]:

africa.plot()

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9e890c1f98>

In [23]:

cities['geometry'] = cities.buffer(2)

In [24]:

geopandas.overlay(africa, cities, how='difference').plot()

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9e80dbfb00>

REMEMBER

Spatial join: transfer attributes from one dataframe to another based on the spatial relationship
Spatial overlay: construct new geometries based on spatial operation between both dataframes (and combining attributes of both dataframes)