# Spatial joins¶

Goals of this notebook:

• Based on the countries and cities dataframes, determine for each city the country in which it is located.
• To solve this problem, we will use the the concept of a 'spatial join' operation: combining information of geospatial datasets based on their spatial relationship.
In [ ]:
%matplotlib inline

import pandas as pd
import geopandas

In [ ]:
countries = geopandas.read_file("zip://./data/ne_110m_admin_0_countries.zip")


## Recap - joining dataframes¶

Pandas provides functionality to join or merge dataframes in different ways, see https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/ for an overview and https://pandas.pydata.org/pandas-docs/stable/merging.html for the full documentation.

To illustrate the concept of joining the information of two dataframes with pandas, let's take a small subset of our cities and countries datasets:

In [ ]:
cities2 = cities[cities['name'].isin(['Bern', 'Brussels', 'London', 'Paris'])].copy()
cities2['iso_a3'] = ['CHE', 'BEL', 'GBR', 'FRA']

In [ ]:
cities2

In [ ]:
countries2 = countries[['iso_a3', 'name', 'continent']]


We added a 'iso_a3' column to the cities dataset, indicating a code of the country of the city. This country code is also present in the countries dataset, which allows us to merge those two dataframes based on the common column.

Joining the cities dataframe with countries will transfer extra information about the countries (the full name, the continent) to the cities dataframe, based on a common key:

In [ ]:
cities2.merge(countries2, on='iso_a3')


But, for this illustrative example, we added the common column manually, it is not present in the original dataset. However, we can still know how to join those two datasets based on their spatial coordinates.

## Recap - spatial relationships between objects¶

In the previous notebook 02-spatial-relationships.ipynb, we have seen the notion of spatial relationships between geometry objects: within, contains, intersects, ...

In this case, we know that each of the cities is located within one of the countries, or the other way around that each country can contain multiple cities.

We can test such relationships using the methods we have seen in the previous notebook:

In [ ]:
france = countries.loc[countries['name'] == 'France', 'geometry'].squeeze()

In [ ]:
cities.within(france)


The above gives us a boolean series, indicating for each point in our cities dataframe whether it is located within the area of France or not.
Because this is a boolean series as result, we can use it to filter the original dataframe to only show those cities that are actually within France:

In [ ]:
cities[cities.within(france)]


We could now repeat the above analysis for each of the countries, and add a column to the cities dataframe indicating this country. However, that would be tedious to do manually, and is also exactly what the spatial join operation provides us.

(note: the above result is incorrect, but this is just because of the coarse-ness of the countries dataset)

## Spatial join operation¶

**SPATIAL JOIN** = *transferring attributes from one layer to another based on their spatial relationship*
Different parts of this operations: * The GeoDataFrame to which we want add information * The GeoDataFrame that contains the information we want to add * The spatial relationship we want to use to match both datasets ('intersects', 'contains', 'within') * The type of join: left or inner join ![](img/illustration-spatial-join.svg)

In this case, we want to join the cities dataframe with the information of the countries dataframe, based on the spatial relationship between both datasets.

We use the geopandas.sjoin function:

In [ ]:
joined = geopandas.sjoin(cities, countries, op='within', how='left')

In [ ]:
joined

In [ ]:
joined['continent'].value_counts()


## Lets's practice!¶

We will again use the Paris datasets to do some exercises. Let's start importing them again:

In [ ]:
districts = geopandas.read_file("data/paris_districts.geojson").to_crs(epsg=2154)

**EXERCISE:** * Determine for each bike station in which district it is located (using a spatial join!). Call the result joined.
In [ ]:
# %load _solved/solutions/04-spatial-joins1.py

In [ ]:
# %load _solved/solutions/04-spatial-joins2.py

**EXERCISE: Map of tree density by district (I)** Using a dataset of all trees in public spaces in Paris, the goal is to make a map of the tree density by district. For this, we first need to find out how many trees each district contains, which we will do in this exercise. In the following exercise, we will then use this result to calculate the density and create a map. To obtain the tree count by district, we first need to know in which district each tree is located, which we can do with a spatial join. Then, using the result of the spatial join, we will calculate the number of trees located in each district using the pandas 'group-by' functionality. - Import the trees dataset "paris_trees.gpkg" and call the result trees. Also read the districts dataset we have seen previously ("paris_districts.geojson"), and call this districts. Convert the districts dataset to the same CRS as the trees dataset. - Add a column with the 'district_name' to the trees dataset using a spatial join. Call the result joined.
Hints - Remember, we can perform a spatial join with the geopandas.sjoin() function. - geopandas.sjoin() takes as first argument the dataframe to which we want to add information, and as second argument the dataframe that contains this additional information. - The op argument is used to specify which spatial relationship between both dataframes we want to use for joining (options are 'intersects', 'contains', 'within').
In [ ]:
# %load _solved/solutions/04-spatial-joins3.py

In [ ]:
# %load _solved/solutions/04-spatial-joins4.py

In [ ]:
# %load _solved/solutions/04-spatial-joins5.py

**EXERCISE: Map of tree density by district (II)** - Calculate the number of trees located in each district: group the joined DataFrame by the 'district_name' column, and calculate the size of each group. We convert the resulting Series trees_by_district to a DataFrame for the next exercise.
Hints - The general group-by syntax in pandas is: df.groupby('key').aggregation_method(), substituting 'key' and 'aggregation_method' with the appropriate column name and method. - To know the size of groups, we can use the .size() method.
In [ ]:
# %load _solved/solutions/04-spatial-joins6.py

In [ ]:
# %load _solved/solutions/04-spatial-joins7.py

In [ ]:
# %load _solved/solutions/04-spatial-joins8.py

**EXERCISE: Map of tree density by district (III)** Now we have obtained the number of trees by district, we can make the map of the districts colored by the tree density. For this, we first need to merge the number of trees in each district we calculated in the previous step (trees_by_district) back to the districts dataset. We will use the [pd.merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) function to join two dataframes based on a common column. Since not all districts have the same size, it is a fairer comparison to visualize the tree density: the number of trees relative to the area. - Use the pd.merge() function to merge districts and trees_by_district dataframes on the 'district_name' column. Call the result districts_trees. - Add a column 'n_trees_per_area' to the districts_trees dataframe, based on the 'n_trees' column divided by the area. - Make a plot of the districts_trees dataframe, using the 'n_trees_per_area' column to determine the color of the polygons.
Hints - The pandas pd.merge() function takes the two dataframes you want to merge as the first two arguments. - The column name on which you want to merge both datasets can be specified with the on keyword. - Accessing a column of a DataFrame can be done with df['col'], while adding a column to a DataFrame can be done with df['new_col'] = values where values can be the result of a computation. - Remember, the area of each geometry in a GeoSeries or GeoDataFrame can be retrieved using the area attribute. So considering a GeoDataFrame gdf, then gdf.geometry.area will return a Series with the area of each geometry. - We can use the .plot() method of a GeoDataFrame to make a visualization of the geometries. - For using one of the columns of the GeoDataFrame to determine the fill color, use the column= keyword.
In [ ]:
# %load _solved/solutions/04-spatial-joins9.py

In [ ]:
# %load _solved/solutions/04-spatial-joins10.py

In [ ]:
# %load _solved/solutions/04-spatial-joins11.py


## The overlay operation¶

In the spatial join operation above, we are not changing the geometries itself. We are not joining geometries, but joining attributes based on a spatial relationship between the geometries. This also means that the geometries need to at least overlap partially.

If you want to create new geometries based on joining (combining) geometries of different dataframes into one new dataframe (eg by taking the intersection of the geometries), you want an overlay operation.

In [ ]:
africa = countries[countries['continent'] == 'Africa']

In [ ]:
africa.plot()

In [ ]:
cities['geometry'] = cities.buffer(2)

In [ ]:
geopandas.overlay(africa, cities, how='difference').plot()

REMEMBER
* **Spatial join**: transfer attributes from one dataframe to another based on the spatial relationship * **Spatial overlay**: construct new geometries based on spatial operation between both dataframes (and combining attributes of both dataframes)
**EXERCISE: Exploring a Land Use dataset** For the following exercises, we first introduce a new dataset: a dataset about the land use of Paris (a simplified version based on the open European [Urban Atlas](https://land.copernicus.eu/local/urban-atlas)). The land use indicates for what kind of activity a certain area is used, such as residential area or for recreation. It is a polygon dataset, with a label representing the land use class for different areas in Paris. In this exercise, we will read the data, explore it visually, and calculate the total area of the different classes of land use in the area of Paris. * Read in the 'paris_land_use.shp' file and assign the result to a variable land_use. * Make a plot of land_use, using the 'class' column to color the polygons. We also add a legend. Note: it might take a few seconds for the plot to generate because there are a lot of polygons. * Add a new column 'area' with the area of each polygon. * Calculate the total area in km² for each 'class' using the groupby() method, and print the result.
Hints * Reading a file can be done with the geopandas.read_file() function. * To use a column to color the geometries, use the column keyword to indicate the column name. * The area of each geometry can be accessed with the area attribute of the geometry of the GeoDataFrame. * The groupby() method takes the column name on which you want to group as the first argument.
In [ ]:
# %load _solved/solutions/04-spatial-joins12.py

In [ ]:
# %load _solved/solutions/04-spatial-joins13.py

In [ ]:
# %load _solved/solutions/04-spatial-joins14.py

In [ ]:
# %load _solved/solutions/04-spatial-joins15.py

**EXERCISE: Intersection of two polygons** For this exercise, we are going to use 2 individual polygons: the district of Muette extracted from the districts dataset, and the green urban area of Boulogne, a large public park in the west of Paris, extracted from the land_use dataset. The two polygons have already been assigned to the muette and park_boulogne variables. We first visualize the two polygons. You will see that they overlap, but the park is not fully located in the district of Muette. Let's determine the overlapping part. * Plot the two polygons in a single map to examine visually the degree of overlap * Calculate the intersection of the park_boulogne and muette polygons. * Print the proportion of the area of the district that is occupied by the park.
Hints * The intersection of to scalar polygons can be calculated with the intersection() method of one of the polygons, and passing the other polygon as the argument to that method.
In [ ]:
land_use = geopandas.read_file("zip://./data/paris_land_use.zip")

In [ ]:
# extract polygons
land_use['area'] = land_use.geometry.area
park_boulogne = land_use[land_use['class'] == "Green urban areas"].sort_values('area').geometry.iloc[-1]
muette = districts[districts.district_name == 'Muette'].geometry.squeeze()

In [ ]:
# Plot the two polygons
geopandas.GeoSeries([park_boulogne, muette]).plot(alpha=0.5, color=['green', 'blue'])

In [ ]:
# %load _solved/solutions/04-spatial-joins16.py

In [ ]:
# %load _solved/solutions/04-spatial-joins17.py

In [ ]:
# %load _solved/solutions/04-spatial-joins18.py

**EXERCISE: Intersecting a GeoDataFrame with a Polygon** Combining the land use dataset and the districts dataset, we can now investigate what the land use is in a certain district. For that, we first need to determine the intersection of the land use dataset with a given district. Let's take again the *Muette* district as example case. * Calculate the intersection of the land_use polygons with the single muette polygon. Call the result land_use_muette. * Make a quick plot of this intersection, and pass edgecolor='black' to more clearly see the boundaries of the different polygons. * Print the first five rows of land_use_muette.
Hints * The intersection of each geometry of a GeoSeries with another single geometry can be performed with the intersection() method of a GeoSeries. * The intersection() method takes as argument the geometry for which to calculate the intersection.
In [ ]:
land_use = geopandas.read_file("zip://./data/paris_land_use.zip")
muette = districts[districts.district_name == 'Muette'].geometry.squeeze()

In [ ]:
# %load _solved/solutions/04-spatial-joins19.py

In [ ]:
# %load _solved/solutions/04-spatial-joins20.py

In [ ]:
# Print the first five rows of the intersection


You can see in the plot that we now only have a subset of the full land use dataset. The land_use_muette still has the same number of rows as the original land_use, though. But many of the rows, as you could see by printing the first rows, consist now of empty polygons when it did not intersect with the Muette district.

In [ ]:
land_use_muette = land_use.copy()
land_use_muette['geometry'] = land_use.geometry.intersection(muette)
land_use_muette = land_use_muette[~land_use_muette.is_empty]

In [ ]:
land_use_muette.dissolve(by='class')

In [ ]:
land_use_muette.dissolve(by='class').reset_index().plot(column='class')

**EXERCISE: Overlaying spatial datasets** We will now combine both datasets in an overlay operation. Create a new GeoDataFrame consisting of the intersection of the land use polygons wich each of the districts, but make sure to bring the attribute data from both source layers. Once we created the overlay of the land use and districts datasets, we can more easily inspect the land use for the different districts. Let's get back to the example district of Muette, and inspect the land use of that district. * Create a new GeoDataFrame from the intersections of land_use and districts. Assign the result to a variable combined. * Print the first rows the resulting GeoDataFrame (combined). * Add a new column 'area' with the area of each polygon to the combined GeoDataFrame. * Create a subset called land_use_muette where the 'district_name' is equal to "Muette". * Make a plot of land_use_muette, using the 'class' column to color the polygons. * Calculate the total area for each 'class' of land_use_muette using the groupby() method, and print the result.
Hints * The intersection of two GeoDataFrames can be calculated with the geopandas.overlay() function. * The overlay() functions takes first the two GeoDataFrames to combine, and a third how keyword indicating how to combine the two layers. * For making an overlay based on the intersection, you can pass how='intersection'. * The area of each geometry can be accessed with the area attribute of the geometry of the GeoDataFrame. * To use a column to color the geometries, pass its name to the column keyword. * The groupby() method takes the column name on which you want to group as the first argument. * The total area for each class can be calculated by taking the sum() of the area.
In [ ]:
land_use = geopandas.read_file("zip://./data/paris_land_use.zip")

In [ ]:
# %load _solved/solutions/04-spatial-joins21.py

In [ ]:
# %load _solved/solutions/04-spatial-joins22.py

In [ ]:
# %load _solved/solutions/04-spatial-joins23.py

In [ ]:
# %load _solved/solutions/04-spatial-joins24.py

In [ ]:
# %load _solved/solutions/04-spatial-joins25.py

In [ ]:
# %load _solved/solutions/04-spatial-joins26.py

In [ ]: