Goals of this notebook:
countries
and cities
dataframes, determine for each city the country in which it is located.%matplotlib inline
import pandas as pd
import geopandas
pd.options.display.max_rows = 10
countries = geopandas.read_file("zip://./data/ne_110m_admin_0_countries.zip")
cities = geopandas.read_file("zip://./data/ne_110m_populated_places.zip")
rivers = geopandas.read_file("zip://./data/ne_50m_rivers_lake_centerlines.zip")
Pandas provides functionality to join or merge dataframes in different ways, see https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/ for an overview and https://pandas.pydata.org/pandas-docs/stable/merging.html for the full documentation.
To illustrate the concept of joining the information of two dataframes with pandas, let's take a small subset of our cities
and countries
datasets:
cities2 = cities[cities['name'].isin(['Bern', 'Brussels', 'London', 'Paris'])].copy()
cities2['iso_a3'] = ['CHE', 'BEL', 'GBR', 'FRA']
cities2
name | geometry | iso_a3 | |
---|---|---|---|
26 | Bern | POINT (7.466975462482424 46.91668275866772) | CHE |
170 | Brussels | POINT (4.33137074969045 50.83526293533032) | BEL |
219 | London | POINT (-0.1186677024759319 51.5019405883275) | GBR |
235 | Paris | POINT (2.33138946713035 48.86863878981461) | FRA |
countries2 = countries[['iso_a3', 'name', 'continent']]
countries2.head()
iso_a3 | name | continent | |
---|---|---|---|
0 | AFG | Afghanistan | Asia |
1 | AGO | Angola | Africa |
2 | ALB | Albania | Europe |
3 | ARE | United Arab Emirates | Asia |
4 | ARG | Argentina | South America |
We added a 'iso_a3' column to the cities
dataset, indicating a code of the country of the city. This country code is also present in the countries
dataset, which allows us to merge those two dataframes based on the common column.
Joining the cities
dataframe with countries
will transfer extra information about the countries (the full name, the continent) to the cities
dataframe, based on a common key:
cities2.merge(countries2, on='iso_a3')
name_x | geometry | iso_a3 | name_y | continent | |
---|---|---|---|---|---|
0 | Bern | POINT (7.466975462482424 46.91668275866772) | CHE | Switzerland | Europe |
1 | Brussels | POINT (4.33137074969045 50.83526293533032) | BEL | Belgium | Europe |
2 | London | POINT (-0.1186677024759319 51.5019405883275) | GBR | United Kingdom | Europe |
3 | Paris | POINT (2.33138946713035 48.86863878981461) | FRA | France | Europe |
But, for this illustrative example, we added the common column manually, it is not present in the original dataset. However, we can still know how to join those two datasets based on their spatial coordinates.
In the previous notebook 02-spatial-relationships.ipynb, we have seen the notion of spatial relationships between geometry objects: within, contains, intersects, ...
In this case, we know that each of the cities is located within one of the countries, or the other way around that each country can contain multiple cities.
We can test such relationships using the methods we have seen in the previous notebook:
france = countries.loc[countries['name'] == 'France', 'geometry'].squeeze()
cities.within(france)
0 False 1 False 2 False 3 False 4 False ... 238 False 239 False 240 False 241 False 242 False Length: 243, dtype: bool
The above gives us a boolean series, indicating for each point in our cities
dataframe whether it is located within the area of France or not.
Because this is a boolean series as result, we can use it to filter the original dataframe to only show those cities that are actually within France:
cities[cities.within(france)]
name | geometry | |
---|---|---|
10 | Monaco | POINT (7.406913173465057 43.73964568785249) |
13 | Andorra | POINT (1.51648596050552 42.5000014435459) |
186 | Geneva | POINT (6.140028034091699 46.21000754707626) |
235 | Paris | POINT (2.33138946713035 48.86863878981461) |
We could now repeat the above analysis for each of the countries, and add a column to the cities
dataframe indicating this country. However, that would be tedious to do manually, and is also exactly what the spatial join operation provides us.
(note: the above result is incorrect, but this is just because of the coarse-ness of the countries dataset)
SPATIAL JOIN = transferring attributes from one layer to another based on their spatial relationship
Different parts of this operations:
In this case, we want to join the cities
dataframe with the information of the countries
dataframe, based on the spatial relationship between both datasets.
We use the geopandas.sjoin
function:
joined = geopandas.sjoin(cities, countries, op='within', how='left')
joined
name_left | geometry | index_right | iso_a3 | name_right | continent | pop_est | gdp_md_est | |
---|---|---|---|---|---|---|---|---|
0 | Vatican City | POINT (12.45338654497177 41.90328217996012) | 79.0 | ITA | Italy | Europe | 6.213780e+07 | 2221000.0 |
1 | San Marino | POINT (12.44177015780014 43.936095834768) | 79.0 | ITA | Italy | Europe | 6.213780e+07 | 2221000.0 |
2 | Vaduz | POINT (9.516669472907267 47.13372377429357) | 9.0 | AUT | Austria | Europe | 8.754413e+06 | 416600.0 |
3 | Lobamba | POINT (31.19999710971274 -26.46666746135247) | 152.0 | SWZ | Swaziland | Africa | 1.467152e+06 | 11060.0 |
4 | Luxembourg | POINT (6.130002806227083 49.61166037912108) | 97.0 | LUX | Luxembourg | Europe | 5.941300e+05 | 58740.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
238 | Rio de Janeiro | POINT (-43.22696665284366 -22.92307731561596) | 22.0 | BRA | Brazil | South America | 2.073534e+08 | 3081000.0 |
239 | São Paulo | POINT (-46.62696583905523 -23.55673372837896) | 22.0 | BRA | Brazil | South America | 2.073534e+08 | 3081000.0 |
240 | Sydney | POINT (151.1832339501475 -33.91806510862875) | 8.0 | AUS | Australia | Oceania | 2.323241e+07 | 1189000.0 |
241 | Singapore | POINT (103.853874819099 1.294979325105942) | 111.0 | MYS | Malaysia | Asia | 3.138199e+07 | 863000.0 |
242 | Hong Kong | POINT (114.183063458463 22.30692675357551) | 30.0 | CHN | China | Asia | 1.379303e+09 | 21140000.0 |
243 rows × 8 columns
joined['continent'].value_counts()
Asia 59 Africa 57 Europe 46 North America 26 South America 14 Oceania 8 Name: continent, dtype: int64
We will again use the Paris datasets to do some exercises. Let's start importing them again:
districts = geopandas.read_file("data/paris_districts_utm.geojson")
stations = geopandas.read_file("data/paris_sharing_bike_stations_utm.geojson")
joined = geopandas.sjoin(stations, districts[['district_name', 'geometry']], op='within')
joined.head()
name | bike_stands | available_bikes | geometry | index_right | district_name | |
---|---|---|---|---|---|---|
0 | 14002 - RASPAIL QUINET | 44 | 4 | POINT (450804.448740735 5409797.268203795) | 52 | Montparnasse |
143 | 14112 - FAUBOURG SAINT JACQUES CASSINI | 16 | 0 | POINT (451419.446715647 5409421.528587255) | 52 | Montparnasse |
293 | 14033 - DAGUERRE GASSENDI | 38 | 1 | POINT (450708.2275807534 5409406.941172979) | 52 | Montparnasse |
346 | 14006 - SAINT JACQUES TOMBE ISSOIRE | 22 | 0 | POINT (451340.0264470892 5409124.574548723) | 52 | Montparnasse |
429 | 14111 - DENFERT-ROCHEREAU CASSINI | 24 | 8 | POINT (451274.5111513372 5409609.730783217) | 52 | Montparnasse |
# counting the number of stations in each district
counts = joined.groupby('district_name').size()
counts.head()
district_name Amérique 17 Archives 4 Arsenal 7 Arts-et-Metiers 4 Auteuil 21 dtype: int64
# convert the above in a DataFrame with two columns
counts = counts.reset_index(name='n_bike_stations')
counts.head()
district_name | n_bike_stations | |
---|---|---|
0 | Amérique | 17 |
1 | Archives | 4 |
2 | Arsenal | 7 |
3 | Arts-et-Metiers | 4 |
4 | Auteuil | 21 |
# merging those counts back into the districts dataset
districts2 = districts.merge(counts, on='district_name')
districts2.head()
id | district_name | population | geometry | n_bike_stations | |
---|---|---|---|---|---|
0 | 1 | St-Germain-l'Auxerrois | 1672 | POLYGON ((451922.1333912524 5411438.484355546,... | 4 |
1 | 2 | Halles | 8984 | POLYGON ((452278.4194036503 5412160.89282334, ... | 13 |
2 | 3 | Palais-Royal | 3195 | POLYGON ((451553.8057660239 5412340.522224233,... | 6 |
3 | 4 | Place-Vendôme | 3044 | POLYGON ((451004.907944323 5412654.094913081, ... | 5 |
4 | 5 | Gaillon | 1345 | POLYGON ((451328.7522686935 5412991.278156867,... | 4 |
# calculating the relative number of bike stations by area
districts2['n_bike_stations_by_area'] = (
districts2['n_bike_stations'] / districts2.geometry.area)
ax = districts2.plot(column='n_bike_stations_by_area',
figsize=(15, 6))
ax.set_axis_off()
In the spatial join operation above, we are not changing the geometries itself. We are not joining geometries, but joining attributes based on a spatial relationship between the geometries. This also means that the geometries need to at least overlap partially.
If you want to create new geometries based on joining (combining) geometries of different dataframes into one new dataframe (eg by taking the intersection of the geometries), you want an overlay operation.
africa = countries[countries['continent'] == 'Africa']
africa.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9e890c1f98>
cities['geometry'] = cities.buffer(2)
geopandas.overlay(africa, cities, how='difference').plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9e80dbfb00>