tobler
This notebook walks through one of the main functionalities of tobler: the `harmonize` function. It allows users to pass several different sets of polygons, such as US census tracts from different years whose boundaries were modified. With `harmonize` you can choose among different harmonization methods and estimate multiple variables (either extensive or intensive) at once.
Note: to run this notebook properly, we recommend having up-to-date versions of the packages. For example, you need recent versions of numpy, scipy and statsmodels, and you may have to install statsmodels from source by running `pip install git+http://github.com/statsmodels/statsmodels.git@master` on the command line.
from tobler.area_weighted import *
from tobler import harmonize
import os
import time
import numpy as np
import pandas as pd
import geopandas as gpd
from quilt.data.spatialucr import census
from shapely.wkb import loads
import warnings
warnings.filterwarnings("ignore")
import rasterio
from rasterio.plot import show
import matplotlib.pyplot as plt
%matplotlib inline
This example uses census data, so you must provide your own copy of the external database. A step-by-step procedure for downloading the data can be found here: https://github.com/spatialucr/geosnap/tree/master/geosnap/data. After downloading the zip files, you must provide the path to them.
import geosnap
from geosnap.data.data import read_ltdb
#os.chdir('path_to_zipfiles')
os.chdir('C:\\Users\\renan\\Desktop\\osnap_data')
sample = "LTDB_Std_All_Sample.zip"
full = "LTDB_Std_All_fullcount.zip"
read_ltdb(sample = sample, fullcount = full)
df = geosnap.data.db.ltdb
Philadelphia County has the code '42101' (State Code + County Code): https://en.wikipedia.org/wiki/List_of_counties_in_Pennsylvania
`n_total_pop` (total population), `n_white_under_15` (total population of white people under 15 years old) and `p_white_under_15` (percentage of white people under 15 years old in a given region) are variables present in all of the census data to be harmonized. They therefore make a good example set for harmonizing both extensive and intensive variables at once.
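To make the extensive/intensive distinction concrete, here is a toy areal-weighting sketch (hypothetical numbers, independent of tobler): counts are split among target polygons in proportion to area share, while rates are carried over as area-weighted averages.

```python
# Toy illustration of areal weighting (hypothetical numbers, not census data).

# Extensive (counts): a source tract's population is split among the
# target tracts in proportion to the area of each intersection.
source_pop = 1000
shares = [0.75, 0.25]                      # area share of each intersection
split = [source_pop * s for s in shares]   # pieces sum back to 1000

# Intensive (rates): a target tract's percentage is the area-weighted
# average of the rates of the source tracts that cover it.
source_rates = [10.0, 30.0]                # % white under 15 in two source tracts
overlap_areas = [2.0, 6.0]                 # area each source contributes to the target
total = sum(overlap_areas)
target_rate = sum(r * a / total for r, a in zip(source_rates, overlap_areas))
```

Note how the extensive pieces preserve the source total, while the intensive estimate stays between the source rates.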
df['geoid'] = df.index
df['state'] = df['geoid'].str[0:2]
variables = ['geoid', 'n_total_pop', 'n_white_under_15', 'p_white_under_15']
df_phili_2010 = df[(df['year'] == 2010) & (df.geoid.str[0:5] == '42101')][variables]
We also need the raw census data of 2000, since we will merge it with the polygons on the geoid.
df_2000 = geosnap.data.db.census_00[variables]
df_2000['geoid'] = df_2000['geoid'].astype(str).str.zfill(11)
df_phili_2000 = df_2000[df_2000.geoid.str[0:5] == '42101']
Next we get the raw census tract polygons for all years of the analysis (this may take a few minutes). Starting with 1990:
aux = census.tracts_1990()
aux['geometry'] = aux.wkb.apply(lambda x: loads(x, hex=True))
aux = aux.drop(['wkb', 'index'], axis = 1)
gdf_1990 = gpd.GeoDataFrame(aux)
gdf_1990['year'] = '1990'
gdf_1990.head()
geoid | geometry | year | |
---|---|---|---|
0 | 01023009567 | POLYGON ((-88.414064 32.179518, -88.414226 32.... | 1990 |
1 | 01023009568 | POLYGON ((-88.32355099999999 32.02982, -88.324... | 1990 |
2 | 01023009569 | POLYGON ((-88.457246 31.81052, -88.458158 31.8... | 1990 |
3 | 01023009570 | POLYGON ((-88.188908 31.698461, -88.194 31.698... | 1990 |
4 | 01037009610 | POLYGON ((-86.09742900000001 32.912974, -86.09... | 1990 |
... and now, 2000:
aux = census.tracts_2000()
aux['geometry'] = aux.wkb.apply(lambda x: loads(x, hex=True))
aux = aux.drop(['wkb'], axis = 1)
gdf_2000 = gpd.GeoDataFrame(aux)
gdf_2000['year'] = '2000'
gdf_2000.head()
geoid | geometry | year | |
---|---|---|---|
0 | 01071950100 | POLYGON ((-85.659972 34.755957, -85.660201 34.... | 2000 |
1 | 01071950200 | POLYGON ((-85.771604 34.917734, -85.772938 34.... | 2000 |
2 | 01071950300 | POLYGON ((-86.02624400000001 34.783033, -86.02... | 2000 |
3 | 01071950400 | POLYGON ((-86.32297199999999 34.741224, -86.32... | 2000 |
4 | 01071950500 | POLYGON ((-86.171466 34.621673, -86.1737300000... | 2000 |
... lastly, 2010:
aux = census.tracts_2010()
aux['geometry'] = aux.wkb.apply(lambda x: loads(x, hex=True))
aux = aux.drop(['wkb'], axis = 1)
gdf_2010 = gpd.GeoDataFrame(aux)
gdf_2010['geoid'] = gdf_2010['GEOID']
gdf_2010['year'] = '2010'
gdf_2010.head()
GEOID | geometry | geoid | year | |
---|---|---|---|---|
0 | 47157007300 | POLYGON ((-89.94575399999999 35.115323, -89.94... | 47157007300 | 2010 |
1 | 47157021320 | POLYGON ((-89.870952 35.089518, -89.8707709999... | 47157021320 | 2010 |
2 | 47157020222 | POLYGON ((-89.92942699999999 35.262985, -89.92... | 47157020222 | 2010 |
3 | 47157022410 | POLYGON ((-90.127607 34.994684, -90.127318 34.... | 47157022410 | 2010 |
4 | 47157005300 | POLYGON ((-90.09169199999999 35.084691, -90.09... | 47157005300 | 2010 |
Merging the data with the original tract polygons of 2000 and 2010:
full_phili_2000 = gdf_2000.merge(df_phili_2000, on = 'geoid')
full_phili_2010 = gdf_2010.merge(df_phili_2010, on = 'geoid')
phili_1990 = gdf_1990[gdf_1990.geoid.str[0:5] == '42101']
ax = phili_1990.plot(figsize = (10,10), alpha = 0.5, edgecolor = 'black')
ax.set_title("Philadelphia County in 1990 (target polygons)", fontsize = 20)
plt.axis('off')
(-75.29683499999999, -74.939665, 39.85221, 40.15119)
ax = full_phili_2000.plot(column = 'n_total_pop', figsize = (10,10), alpha = 0.5, edgecolor = 'black')
ax.set_title("Total Population of Philadelphia County in 2000 (1º source polygons)", fontsize = 20)
plt.axis('off')
(-75.29646379999998, -74.9392762, 39.8523232, 40.1512988)
ax = full_phili_2010.plot(column = 'n_total_pop', figsize = (10,10), alpha = 0.5, edgecolor = 'black')
ax.set_title("Total Population of Philadelphia County in 2010 (2º source polygons)", fontsize = 20)
plt.axis('off')
(-75.29649115, -74.93953785, 39.85345459999999, 40.1515414)
Although these maps look similar, they do not represent the same set of polygons: there are differences if we look closer. This is confirmed by the different number of tracts in each year:
print('N. of tracts in 1990 (target): {} \nN. of tracts in 2000 (source): {} \nN. of tracts in 2010 (source): {}'.format(len(phili_1990), len(full_phili_2000), len(full_phili_2010)))
N. of tracts in 1990 (target): 367
N. of tracts in 2000 (source): 381
N. of tracts in 2010 (source): 384
Since we'd like to harmonize different sets of tracts, we need to make sure that all inputs are in the same Coordinate Reference System (CRS). (Note that newer versions of geopandas deprecate the `{'init': 'epsg:4326'}` dict form in favor of a string such as `"EPSG:4326"`.)
phili_1990.crs = full_phili_2000.crs = full_phili_2010.crs = {'init': 'epsg:4326'}
The input of the `harmonize` function is a list of GeoDataFrames containing both the target dataset and all source datasets. All variables that are going to be harmonized must have the same name across all source datasets. We call this set a community.
raw_community = [phili_1990, full_phili_2000, full_phili_2010]
from tobler import harmonize
help(harmonize)
Help on function harmonize in module tobler.harmonize:

harmonize(raw_community, target_year_of_reference, weights_method='area', extensive_variables=[], intensive_variables=[], allocate_total=True, raster_path=None, codes=[21, 22, 23, 24], force_crs_match=True)
    Harmonize Multiple GeoData Sources with different approaches

    Parameters
    ----------
    raw_community : list
        Multiple GeoDataFrames given by a list (see (1) in Notes).
    target_year_of_reference : string
        The target year that represents the boundaries of all datasets generated in the harmonization. Could be, for example, '2010'.
    weights_method : string
        The method by which the harmonization will be conducted. This can be set to:
            "area"                          : harmonization according to area weights.
            "land_type_area"                : harmonization according to the Land Types considered 'populated' areas.
            "land_type_Poisson_regression"  : NOT YET INTRODUCED.
            "land_type_Gaussian_regression" : NOT YET INTRODUCED.
    extensive_variables : list
        The names of variables in each dataset of raw_community that contain extensive variables to be harmonized (see (2) in Notes).
    intensive_variables : list
        The names of variables in each dataset of raw_community that contain intensive variables to be harmonized (see (2) in Notes).
    allocate_total : boolean
        True if the total value of the source area should be allocated. False if the denominator is the area of i. Note that the two cases are identical when the area of the source polygon is exhausted by intersections. See (3) in Notes for more details.
    raster_path : the path to the associated raster image that has the types of each pixel in the spatial context.
        Only taken into consideration for raster-based harmonization.
    codes : an integer list of code values that should be considered as 'populated'.
        Since this draws inspiration from the National Land Cover Database (NLCD), the default is 21 (Developed, Open Space), 22 (Developed, Low Intensity), 23 (Developed, Medium Intensity) and 24 (Developed, High Intensity). The description of each code can be found here: https://www.mrlc.gov/sites/default/files/metadata/landcover.html. Only taken into consideration for raster-based harmonization.
    force_crs_match : bool. Default is True.
        Whether the Coordinate Reference System (CRS) of the polygon will be reprojected to the CRS of the raster file. It is recommended to leave this argument as True. Only taken into consideration for raster-based harmonization.

    Notes
    -----
    1) Each GeoDataFrame of raw_community is assumed to have a 'year' column. Also, all GeoDataFrames must have the same Coordinate Reference System (CRS).
    2) A quick explanation of extensive and intensive variables can be found here: http://ibis.geog.ubc.ca/courses/geob370/notes/intensive_extensive.htm
    3) For an extensive variable, the estimate at target polygon j (default case) is:

        v_j = \sum_i v_i w_{i,j}
        w_{i,j} = a_{i,j} / \sum_k a_{i,k}

    If the area of the source polygon is not exhausted by intersections with target polygons and there is reason not to allocate the complete value of an extensive attribute, then setting allocate_total=False will use the following weights:

        v_j = \sum_i v_i w_{i,j}
        w_{i,j} = a_{i,j} / a_i

    where a_i is the total area of source polygon i.

    For an intensive variable, the estimate at target polygon j is:

        v_j = \sum_i v_i w_{i,j}
        w_{i,j} = a_{i,j} / \sum_k a_{k,j}
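The weighting formulas in the docstring can be written as matrix operations. The following sketch uses a made-up intersection-area matrix (not computed from real tracts) to mirror the extensive and intensive cases:

```python
import numpy as np

# Hypothetical intersection-area matrix: entry (i, j) is the area of
# overlap between source polygon i and target polygon j.
A = np.array([[4.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])

v_ext = np.array([100.0, 80.0])   # an extensive variable on the sources
v_int = np.array([10.0, 30.0])    # an intensive variable on the sources

# Extensive: w_{i,j} = a_{i,j} / sum_k a_{i,k}  (each row of W sums to 1,
# so the source totals are preserved)
W_ext = A / A.sum(axis=1, keepdims=True)
est_ext = v_ext @ W_ext

# Intensive: w_{i,j} = a_{i,j} / sum_k a_{k,j}  (each column of W sums to 1,
# so each target gets an area-weighted average of source rates)
W_int = A / A.sum(axis=0, keepdims=True)
est_int = v_int @ W_int
```

With these numbers the extensive estimates sum back to 180, the total of `v_ext`, while each intensive estimate lies between the source rates.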
t0 = time.time()
harmonized_area = harmonize(raw_community,
target_year_of_reference = '1990',
extensive_variables = ['n_total_pop', 'n_white_under_15'],
intensive_variables = ['p_white_under_15'],
weights_method = 'area')
t1 = time.time()
print('Time taken (in seconds): {}'.format(t1 - t0))
Starting to Harmonize the year of 2000...
nan values in variable: p_white_under_15, replacing with 0.0
Harmonizing extensive variable n_total_pop of the year 2000.
Harmonizing extensive variable n_white_under_15 of the year 2000.
Harmonizing intensive variable p_white_under_15 of the year 2000.
Starting to Harmonize the year of 2010...
nan values in variable: p_white_under_15, replacing with 0.0
Harmonizing extensive variable n_total_pop of the year 2010.
Harmonizing extensive variable n_white_under_15 of the year 2010.
Harmonizing intensive variable p_white_under_15 of the year 2010.
Time taken (in seconds): 2.0715413093566895
The output of the function is a single GeoDataFrame with all interpolated variables from all source years, keeping the original names prefixed with `interpolated_`.
harmonized_area.head()
geoid | geometry | year | interpolated_n_total_pop | interpolated_n_white_under_15 | interpolated_p_white_under_15 | |
---|---|---|---|---|---|---|
0 | 42101000001 | POLYGON ((-75.1478 39.9486, -75.14919999999999... | 2000 | 2692.555587 | 37.926295 | 1.316683 |
1 | 42101000002 | POLYGON ((-75.15900000000001 39.9536, -75.1594... | 2000 | 1364.565915 | 6.785932 | 0.460593 |
2 | 42101000003 | POLYGON ((-75.17749999999999 39.9551, -75.1776... | 2000 | 2656.200162 | 97.529551 | 3.765605 |
3 | 42101000004 | POLYGON ((-75.1782 39.9523, -75.17919999999999... | 2000 | 4289.591198 | 40.594815 | 0.938504 |
4 | 42101000005 | POLYGON ((-75.15170000000001 39.9517, -75.152 ... | 2000 | 1295.854409 | 2.704585 | 0.134690 |
This method is similar to the area method above. However, instead of using the entire polygon area, it uses only the portion of the polygon considered "inhabited" according to the land type, aiming to improve the estimates. In this method the proxy for the polygon area is the number of pixels considered "populated".
The first step is to read a file from the NLCD. We chose to use the data for the state of Pennsylvania, as in the example on the GeoPySpark homepage, and downloaded the file with these commands:
curl -o /tmp/NLCD2011_LC_Pennsylvania.zip "https://s3-us-west-2.amazonaws.com/prd-tnm/StagedProducts/NLCD/data/2011/landcover/states/NLCD2011_LC_Pennsylvania.zip?ORIG=513_SBDDG"
unzip -d /tmp /tmp/NLCD2011_LC_Pennsylvania.zip
Then, in the same directory as this notebook, run:
ln -s /tmp/NLCD2011_LC_Pennsylvania.tif
Now, you need to specify the file path of the unzipped file, which will look something like this:
raster_filepath = '~/NLCD2011_LC_Pennsylvania.tif'
raster_filepath = 'C:\\Users\\renan\\Desktop\\harmonization_osnap\\NLCD2011_LC_Pennsylvania\\NLCD2011_LC_Pennsylvania.tif'
Just for the sake of visualization, we read the raster file:
penn_raster = rasterio.open(raster_filepath)
show(penn_raster, cmap='terrain')
<matplotlib.axes._subplots.AxesSubplot at 0x1f6c7099240>
Now we can use the `land_type_area` option of `weights_method`. We also need to pass the raster file through the `raster_path` argument.

Other options are available for the codes of the pixel types. Since this function draws inspiration from the NLCD data, the default codes are 21 (Developed, Open Space), 22 (Developed, Low Intensity), 23 (Developed, Medium Intensity) and 24 (Developed, High Intensity).
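As a rough illustration of how the pixel codes enter the weighting, the sketch below counts "populated" pixels in a hypothetical clip of class codes (toy values, not read from the actual NLCD raster):

```python
import numpy as np

# Hypothetical clip of NLCD class codes for one polygon (values are
# legend codes, e.g. 11 = open water, 41 = forest, 21-24 = developed).
pixels = np.array([[11, 21, 22],
                   [41, 23, 24],
                   [11, 11, 21]])

codes = [21, 22, 23, 24]            # the classes treated as 'populated'
populated = np.isin(pixels, codes)  # boolean mask of populated pixels
n_populated = int(populated.sum())  # this count stands in for polygon area
```

In the `land_type_area` method, counts like `n_populated` (computed per intersection) replace the raw intersection areas in the weights.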
t0 = time.time()
harmonized_nlcd = harmonize(raw_community,
target_year_of_reference = '1990',
extensive_variables = ['n_total_pop', 'n_white_under_15'],
intensive_variables = ['p_white_under_15'],
weights_method = 'land_type_area',
raster_path = raster_filepath)
t1 = time.time()
print('Time taken (in seconds): {}'.format(t1 - t0))
Starting to Harmonize the year of 2000...
Harmonizing extensive variable n_total_pop of the year 2000.
Harmonizing extensive variable n_white_under_15 of the year 2000.
Harmonizing intensive variable p_white_under_15 of the year 2000.
Starting to Harmonize the year of 2010...
Harmonizing extensive variable n_total_pop of the year 2010.
Harmonizing extensive variable n_white_under_15 of the year 2010.
Harmonizing intensive variable p_white_under_15 of the year 2010.
Time taken (in seconds): 16.739511728286743
Once again, the output is a single GeoDataFrame with all interpolated variables from all source years, keeping the original names prefixed with `interpolated_`.
harmonized_nlcd.head()
geoid | geometry | year | interpolated_n_total_pop | interpolated_n_white_under_15 | interpolated_p_white_under_15 | |
---|---|---|---|---|---|---|
0 | 42101000001 | POLYGON ((-75.1478 39.9486, -75.14919999999999... | 2000 | 2666.167205 | 37.135867 | 1.307928 |
1 | 42101000002 | POLYGON ((-75.15900000000001 39.9536, -75.1594... | 2000 | 1391.014345 | 6.948442 | 0.461179 |
2 | 42101000003 | POLYGON ((-75.17749999999999 39.9551, -75.1776... | 2000 | 2694.413349 | 98.151988 | 3.749538 |
3 | 42101000004 | POLYGON ((-75.1782 39.9523, -75.17919999999999... | 2000 | 4295.455408 | 40.858452 | 0.943535 |
4 | 42101000005 | POLYGON ((-75.15170000000001 39.9517, -75.152 ... | 2000 | 1303.176395 | 2.765546 | 0.136141 |
As we can see in the graphs below, both methods produce similar results.
plt.scatter(harmonized_area.interpolated_n_total_pop, harmonized_nlcd.interpolated_n_total_pop)
plt.title('Interpolated Total Population', fontsize = 20)
plt.xlabel('Area', fontsize = 18)
plt.ylabel('NLCD (area)', fontsize = 18)
Text(0, 0.5, 'NLCD (area)')
plt.scatter(harmonized_area.interpolated_n_white_under_15, harmonized_nlcd.interpolated_n_white_under_15)
plt.title('Interpolated White Under 15 years Population', fontsize = 20)
plt.xlabel('Area', fontsize = 18)
plt.ylabel('NLCD (area)', fontsize = 18)
Text(0, 0.5, 'NLCD (area)')
plt.scatter(harmonized_area.interpolated_p_white_under_15, harmonized_nlcd.interpolated_p_white_under_15)
plt.title('Interpolated Per. of White Under 15 years', fontsize = 20)
plt.xlabel('Area', fontsize = 18)
plt.ylabel('NLCD (area)', fontsize = 18)
Text(0, 0.5, 'NLCD (area)')
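Beyond eyeballing the scatter plots, the agreement between the two methods can be summarized numerically. The sketch below uses the first five interpolated total-population values from the tables above as a stand-in sample:

```python
import numpy as np

# First five interpolated_n_total_pop values from the two outputs above
# (rounded); a full comparison would use the whole columns.
area_est = np.array([2692.6, 1364.6, 2656.2, 4289.6, 1295.9])
nlcd_est = np.array([2666.2, 1391.0, 2694.4, 4295.5, 1303.2])

r = np.corrcoef(area_est, nlcd_est)[0, 1]                 # Pearson correlation
max_rel_diff = np.max(np.abs(area_est - nlcd_est) / area_est)
```

Even on this small sample the two methods agree closely (correlation near 1, relative differences of a few percent).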
As a sanity check, the interpolated totals of the extensive variables should match the original source totals for each year:

harmonized_area[harmonized_area.year == '2000'].interpolated_n_total_pop.sum()
1517550.0041874615
harmonized_nlcd[harmonized_nlcd.year == '2000'].interpolated_n_total_pop.sum()
1517550.0
full_phili_2000.n_total_pop.sum()
1517550
harmonized_area[harmonized_area.year == '2000'].interpolated_n_white_under_15.sum()
95054.00082107751
harmonized_nlcd[harmonized_nlcd.year == '2000'].interpolated_n_white_under_15.sum()
95054.0
full_phili_2000.n_white_under_15.sum()
95054
harmonized_area[harmonized_area.year == '2010'].interpolated_n_total_pop.sum()
1525811.0035805448
harmonized_nlcd[harmonized_nlcd.year == '2010'].interpolated_n_total_pop.sum()
1525811.0
full_phili_2010.n_total_pop.sum()
1525811.0
harmonized_area[harmonized_area.year == '2010'].interpolated_n_white_under_15.sum()
64390.00004099
harmonized_nlcd[harmonized_nlcd.year == '2010'].interpolated_n_white_under_15.sum()
64389.999999999985
full_phili_2010.n_white_under_15.sum()
64390.0
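The repeated sum checks above can be wrapped in a small helper. This is a sketch with toy frames standing in for the notebook's GeoDataFrames (the column names follow the harmonize output convention):

```python
import pandas as pd

def totals_preserved(harmonized, source, var, year, tol=1e-6):
    """Check that a harmonized extensive variable sums to the source total."""
    est = harmonized.loc[harmonized['year'] == year, 'interpolated_' + var].sum()
    ref = source[var].sum()
    return abs(est - ref) < tol * max(1.0, abs(ref))

# Toy frames standing in for harmonized_area and full_phili_2000:
harm = pd.DataFrame({'year': ['2000', '2000'],
                     'interpolated_n_total_pop': [700.0, 300.0]})
src = pd.DataFrame({'n_total_pop': [400, 600]})
ok = totals_preserved(harm, src, 'n_total_pop', '2000')
```

The same call pattern would apply to the real frames, e.g. `totals_preserved(harmonized_area, full_phili_2000, 'n_total_pop', '2000')`.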
Finally, the bounding box of the 2010 data:

full_phili_2010.total_bounds
array([-75.280266, 39.867004, -74.955763, 40.137992])