There are many sources of GIS data. Here are some useful links:
See also Wikipedia links
Let's import the packages we will use and set the paths for outputs.
# Let's import pandas and some other basic packages we will use
from __future__ import division
%pylab --no-import-all
%matplotlib inline
import pandas as pd
import numpy as np
import os, sys
Using matplotlib backend: MacOSX Populating the interactive namespace from numpy and matplotlib
# GIS packages
import geopandas as gpd
from geopandas.tools import overlay
from shapely.geometry import Polygon, Point
import georasters as gr
# Alias for Geopandas
gp = gpd
# Plotting
import matplotlib as mpl
import seaborn as sns
# Setup seaborn
sns.set()
# Paths
pathout = './data/'
if not os.path.exists(pathout):
os.mkdir(pathout)
pathgraphs = './graphs/'
if not os.path.exists(pathgraphs):
os.mkdir(pathgraphs)
Let's download a shapefile with all the polygons for countries so we can visualize and analyze some of the data we have downloaded in other notebooks. Natural Earth provides lots of free data so let's use that one.
For shapefiles and other polygon type data geopandas
is the most useful package. geopandas
is to GIS what pandas
is to other data. Since gepandas
extends the functionality of pandas
to a GIS dataset, all the nice functions and properties of pandas
are also available in geopandas
. Of course, geopandas
includes functions and properties unique to GIS data.
Next we will use it to download the shapefile (which is contained in a zip archive). geopandas
extends pandas
for use with GIS data. We can use many functions and properties of the GeoDataFrame
to analyze our data.
import requests
import io
#headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
url = 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip'
r = requests.get(url, headers=headers)
countries = gp.read_file(io.BytesIO(r.content))
#countries = gpd.read_file('https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip')
Let's look inside this GeoDataFrame
countries
featurecla | scalerank | LABELRANK | SOVEREIGNT | SOV_A3 | ADM0_DIF | LEVEL | TYPE | ADMIN | ADM0_A3 | ... | FCLASS_TR | FCLASS_ID | FCLASS_PL | FCLASS_GR | FCLASS_IT | FCLASS_NL | FCLASS_SE | FCLASS_BD | FCLASS_UA | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Admin-0 country | 0 | 2 | Indonesia | IDN | 0 | 2 | Sovereign country | Indonesia | IDN | ... | None | None | None | None | None | None | None | None | None | MULTIPOLYGON (((117.70361 4.16341, 117.70361 4... |
1 | Admin-0 country | 0 | 3 | Malaysia | MYS | 0 | 2 | Sovereign country | Malaysia | MYS | ... | None | None | None | None | None | None | None | None | None | MULTIPOLYGON (((117.70361 4.16341, 117.69711 4... |
2 | Admin-0 country | 0 | 2 | Chile | CHL | 0 | 2 | Sovereign country | Chile | CHL | ... | None | None | None | None | None | None | None | None | None | MULTIPOLYGON (((-69.51009 -17.50659, -69.50611... |
3 | Admin-0 country | 0 | 3 | Bolivia | BOL | 0 | 2 | Sovereign country | Bolivia | BOL | ... | None | None | None | None | None | None | None | None | None | POLYGON ((-69.51009 -17.50659, -69.51009 -17.5... |
4 | Admin-0 country | 0 | 2 | Peru | PER | 0 | 2 | Sovereign country | Peru | PER | ... | None | None | None | None | None | None | None | None | None | MULTIPOLYGON (((-69.51009 -17.50659, -69.63832... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
253 | Admin-0 country | 0 | 4 | China | CH1 | 1 | 2 | Country | Macao S.A.R | MAC | ... | None | None | None | None | None | None | None | None | None | MULTIPOLYGON (((113.55860 22.16303, 113.56943 ... |
254 | Admin-0 country | 6 | 5 | Australia | AU1 | 1 | 2 | Dependency | Ashmore and Cartier Islands | ATC | ... | None | None | None | None | None | None | None | None | None | POLYGON ((123.59702 -12.42832, 123.59775 -12.4... |
255 | Admin-0 country | 6 | 8 | Bajo Nuevo Bank (Petrel Is.) | BJN | 0 | 2 | Indeterminate | Bajo Nuevo Bank (Petrel Is.) | BJN | ... | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | POLYGON ((-79.98929 15.79495, -79.98782 15.796... |
256 | Admin-0 country | 6 | 5 | Serranilla Bank | SER | 0 | 2 | Indeterminate | Serranilla Bank | SER | ... | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | Unrecognized | POLYGON ((-78.63707 15.86209, -78.64041 15.864... |
257 | Admin-0 country | 6 | 6 | Scarborough Reef | SCR | 0 | 2 | Indeterminate | Scarborough Reef | SCR | ... | None | None | None | None | None | None | None | None | None | POLYGON ((117.75389 15.15437, 117.75569 15.151... |
258 rows × 162 columns
Each row contains the information for one country.
Each column is one property or variable.
Unlike pandas
DataFrame
s, geopandas
always must have a geometry
column.
Let's plot this data
%matplotlib inline
fig, ax = plt.subplots(figsize=(15,10))
countries.plot(ax=ax)
ax.set_title("WGS84 (lat/lon)", fontdict={'fontsize':34})
Text(0.5, 1.0, 'WGS84 (lat/lon)')
We can also get some additional information on this data. For example its projection
countries.crs
<Geographic 2D CRS: EPSG:4326> Name: WGS 84 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: World. - bounds: (-180.0, -90.0, 180.0, 90.0) Datum: World Geodetic System 1984 ensemble - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
We can reproject the data from its current WGS84 projection to other ones. Let's do this and plot the results so we can see how different projections distort results.
fig, ax = plt.subplots(figsize=(15,10))
countries_merc = countries.to_crs(epsg=3395)
countries_merc.plot(ax=ax)
ax.set_title("Mercator", fontdict={'fontsize':34})
Text(0.5, 1.0, 'Mercator')
cea = {'datum': 'WGS84',
'lat_ts': 0,
'lon_0': 0,
'no_defs': True,
'over': True,
'proj': 'cea',
'units': 'm',
'x_0': 0,
'y_0': 0}
fig, ax = plt.subplots(figsize=(15,10))
countries_cea = countries.to_crs(crs=cea)
countries_cea.plot(ax=ax)
ax.set_title("Cylindrical Equal Area", fontdict={'fontsize':34})
Text(0.5, 1.0, 'Cylindrical Equal Area')
Notice that each projection shows the world in a very different manner, distoring areas, distances etc. So you need to take care when doing computations to use the correct projection. An important issue to remember is that you need a projected (not geographical) projection to compute areas and distances. Let's compare these three a bit. Start with the boundaries of each.
print('[xmin, ymin, xmax, ymax] in three projections')
print(countries.total_bounds)
print(countries_merc.total_bounds)
print(countries_cea.total_bounds)
[xmin, ymin, xmax, ymax] in three projections [-180. -90. 180. 83.63410065] [-2.00375083e+07 -2.25002355e+08 2.00375083e+07 1.83863917e+07] [-20037508.34278923 -6363885.33192604 20037508.34278924 6324296.52646162]
Let's describe the areas of these countries in the three projections
print('Area distribution in WGS84')
print(countries.area.describe(), '\n')
Area distribution in WGS84 count 258.000000 mean 83.053683 std 443.786684 min 0.000001 25% 0.065859 50% 5.857276 75% 37.279026 max 6049.574693 dtype: float64
/var/folders/q1/7qsx8kmj439d81kr4f_k_wbr0000gp/T/ipykernel_88099/1371744286.py:2: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation. print(countries.area.describe(), '\n')
print('Area distribution in Mercator')
print(countries_merc.area.describe(), '\n')
Area distribution in Mercator count 2.580000e+02 mean 3.422771e+13 std 5.295871e+14 min 2.196509e+04 25% 9.736132e+08 50% 8.654572e+10 75% 5.377488e+11 max 8.507019e+15 dtype: float64
print('Area distribution in CEA')
print(countries_cea.area.describe(), '\n')
Area distribution in CEA count 2.580000e+02 mean 5.690945e+11 std 1.826917e+12 min 1.220383e+04 25% 6.986659e+08 50% 5.148888e+10 75% 3.544773e+11 max 1.698019e+13 dtype: float64
Let's compare the area of each country in the two projected projections
countries_merc = countries_merc.set_index('ADM0_A3')
countries_cea = countries_cea.set_index('ADM0_A3')
countries_merc['ratio_area'] = countries_merc.area / countries_cea.area
countries_cea['ratio_area'] = countries_merc.area / countries_cea.area
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_context("talk")
fig, ax = plt.subplots()
sns.scatterplot(x=countries_cea.area/1e6, y=countries_merc.area/1e6, ax=ax)
sns.lineplot(x=countries_cea.area/1e6, y=countries_cea.area/1e6, color='r', ax=ax)
ax.set_ylabel('Mercator')
ax.set_xlabel('CEA')
ax.set_title("Areas")
Text(0.5, 1.0, 'Areas')
Now, how do we know what is correct? Let's get some data from WDI to compare the areas of countries in these projections to what the correct area should be (notice that each country usually will use a local projection that ensures areas are correctly computed, so their data should be closer to the truth than any of our global ones).
Here we use some of what we learned before in this notebook.
from pandas_datareader import data, wb
wbcountries = wb.get_countries()
wbcountries['name'] = wbcountries.name.str.strip()
wdi = wb.download(indicator=['AG.LND.TOTL.K2'], country=wbcountries.iso2c.values, start=2017, end=2017)
wdi.columns = ['WDI_area']
wdi = wdi.reset_index()
wdi = wdi.merge(wbcountries[['iso3c', 'iso2c', 'name']], left_on='country', right_on='name')
countries_cea['CEA_area'] = countries_cea.area / 1e6
countries_merc['MERC_area'] = countries_merc.area / 1e6
areas = pd.merge(countries_cea['CEA_area'], countries_merc['MERC_area'], left_index=True, right_index=True)
/Users/ozak/anaconda3/envs/GeoPython39env/lib/python3.9/site-packages/pandas_datareader/wb.py:592: UserWarning: Non-standard ISO country codes: 1A, 1W, 4E, 6F, 6N, 6X, 7E, 8S, A4, A5, A9, B1, B2, B3, B4, B6, B7, B8, C4, C5, C6, C7, C8, C9, D2, D3, D4, D5, D6, D7, D8, D9, EU, F1, F6, JG, M1, M2, N6, OE, R6, S1, S2, S3, S4, T2, T3, T4, T5, T6, T7, V1, V2, V3, V4, XC, XD, XE, XF, XG, XH, XI, XJ, XK, XL, XM, XN, XO, XP, XQ, XT, XU, XY, Z4, Z7, ZB, ZF, ZG, ZH, ZI, ZJ, ZQ, ZT warnings.warn(
Let's merge the WDI data with what we have computed before.
wdi = wdi.merge(areas, left_on='iso3c', right_index=True)
wdi
country | year | WDI_area | iso3c | iso2c | name | CEA_area | MERC_area | |
---|---|---|---|---|---|---|---|---|
0 | Aruba | 2017 | 180.0 | ABW | AW | Aruba | 1.697662e+02 | 1.780777e+02 |
2 | Afghanistan | 2017 | 652860.0 | AFG | AF | Afghanistan | 6.421811e+05 | 9.306783e+05 |
4 | Angola | 2017 | 1246700.0 | AGO | AO | Angola | 1.244652e+06 | 1.307625e+06 |
5 | Albania | 2017 | 27400.0 | ALB | AL | Albania | 2.833579e+04 | 4.983393e+04 |
6 | Andorra | 2017 | 470.0 | AND | AD | Andorra | 4.522394e+02 | 8.305222e+02 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
260 | Samoa | 2017 | 2830.0 | WSM | WS | Samoa | 2.780425e+03 | 2.945930e+03 |
262 | Yemen, Rep. | 2017 | 527970.0 | YEM | YE | Yemen, Rep. | 4.530748e+05 | 4.899480e+05 |
263 | South Africa | 2017 | 1213090.0 | ZAF | ZA | South Africa | 1.219825e+06 | 1.597733e+06 |
264 | Zambia | 2017 | 743390.0 | ZMB | ZM | Zambia | 7.519143e+05 | 7.960516e+05 |
265 | Zimbabwe | 2017 | 386850.0 | ZWE | ZW | Zimbabwe | 3.893382e+05 | 4.355822e+05 |
213 rows × 8 columns
How correlated are these measures?
wdi.corr()
WDI_area | CEA_area | MERC_area | |
---|---|---|---|
WDI_area | 1.000000 | 0.997178 | 0.821555 |
CEA_area | 0.997178 | 1.000000 | 0.852409 |
MERC_area | 0.821555 | 0.852409 | 1.000000 |
Let's change the shape of the data so we can plot it using seaborn
.
wdi2 = wdi.melt(id_vars=['iso3c', 'iso2c', 'name', 'country', 'year', 'WDI_area'], value_vars=['CEA_area', 'MERC_area'])
wdi2
iso3c | iso2c | name | country | year | WDI_area | variable | value | |
---|---|---|---|---|---|---|---|---|
0 | ABW | AW | Aruba | Aruba | 2017 | 180.0 | CEA_area | 1.697662e+02 |
1 | AFG | AF | Afghanistan | Afghanistan | 2017 | 652860.0 | CEA_area | 6.421811e+05 |
2 | AGO | AO | Angola | Angola | 2017 | 1246700.0 | CEA_area | 1.244652e+06 |
3 | ALB | AL | Albania | Albania | 2017 | 27400.0 | CEA_area | 2.833579e+04 |
4 | AND | AD | Andorra | Andorra | 2017 | 470.0 | CEA_area | 4.522394e+02 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
421 | WSM | WS | Samoa | Samoa | 2017 | 2830.0 | MERC_area | 2.945930e+03 |
422 | YEM | YE | Yemen, Rep. | Yemen, Rep. | 2017 | 527970.0 | MERC_area | 4.899480e+05 |
423 | ZAF | ZA | South Africa | South Africa | 2017 | 1213090.0 | MERC_area | 1.597733e+06 |
424 | ZMB | ZM | Zambia | Zambia | 2017 | 743390.0 | MERC_area | 7.960516e+05 |
425 | ZWE | ZW | Zimbabwe | Zimbabwe | 2017 | 386850.0 | MERC_area | 4.355822e+05 |
426 rows × 8 columns
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_context("talk")
fig, ax = plt.subplots()
sns.scatterplot(x='WDI_area', y='value', data=wdi2, hue='variable', ax=ax)
#sns.scatterplot(x='WDI_area', y='MERC_area', data=wdi, ax=ax)
sns.lineplot(x='WDI_area', y='WDI_area', data=wdi, color='r', ax=ax)
ax.set_ylabel('Other')
ax.set_xlabel('WDI')
ax.set_title("Areas")
ax.legend()
<matplotlib.legend.Legend at 0x197119b20>
We could use other data to compare, e.g. data from the CIA Factbook.
cia_area = pd.read_csv('https://web.archive.org/web/20201116182145if_/https://www.cia.gov/LIBRARY/publications/the-world-factbook/rankorder/rawdata_2147.txt', sep='\t', header=None)
cia_area = pd.DataFrame(cia_area[0].str.strip().str.split('\s\s+').tolist(), columns=['id', 'Name', 'area'])
cia_area.area = cia_area.area.str.replace(',', '').astype(int)
cia_area
id | Name | area | |
---|---|---|---|
0 | 1 | Russia | 17098242 |
1 | 2 | Antarctica | 14000000 |
2 | 3 | Canada | 9984670 |
3 | 4 | United States | 9833517 |
4 | 5 | China | 9596960 |
... | ... | ... | ... |
249 | 250 | Spratly Islands | 5 |
250 | 251 | Ashmore and Cartier Islands | 5 |
251 | 252 | Coral Sea Islands | 3 |
252 | 253 | Monaco | 2 |
253 | 254 | Holy See (Vatican City) | 0 |
254 rows × 3 columns
print('CEA area for Russia', countries_cea.area.loc['RUS'] / 1e6)
print('MERC area for Russia', countries_merc.area.loc['RUS'] / 1e6)
print('WDI area for Russia', wdi.loc[wdi.iso3c=='RUS', 'WDI_area'])
print('CIA area for Russia', cia_area.loc[cia_area.Name=='Russia', 'area'])
CEA area for Russia 16980189.499200087 MERC area for Russia 82883546.83332941 WDI area for Russia 202 16376870.0 Name: WDI_area, dtype: float64 CIA area for Russia 0 17098242 Name: area, dtype: int64
Again very similar result. CEA
is closest to both WDI
and CIA
.
CIA
data with the wdi data. You need to get correct codes for the countries to allow for the merge or correct the names to ensure they are compatible.wdi2
and plot the association between these measuresLet's use the geoplot
package to plot data in a map. As usual we can do it in many ways, but geoplot
makes our life very easy. Let's import the various packages we will use.
import geoplot as gplt
import geoplot.crs as gcrs
import mapclassify as mc
import textwrap
Let's import some of the data we had downloaded before. Specifically, let's import the Penn World Tables data.
pwt = pd.read_stata(pathout + 'pwt91.dta')
pwt_xls = pd.read_excel(pathout + 'pwt91.xlsx')
pwt
countrycode | country | currency_unit | year | rgdpe | rgdpo | pop | emp | avh | hc | ... | csh_x | csh_m | csh_r | pl_c | pl_i | pl_g | pl_x | pl_m | pl_n | pl_k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ABW | Aruba | Aruban Guilder | 1950 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ABW | Aruba | Aruban Guilder | 1951 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | ABW | Aruba | Aruban Guilder | 1952 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | ABW | Aruba | Aruban Guilder | 1953 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | ABW | Aruba | Aruban Guilder | 1954 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12371 | ZWE | Zimbabwe | US Dollar | 2013 | 28086.937500 | 28329.810547 | 15.054506 | 7.914061 | NaN | 2.504635 | ... | 0.169638 | -0.426188 | 0.090225 | 0.577488 | 0.582022 | 0.448409 | 0.723247 | 0.632360 | 0.383488 | 0.704313 |
12372 | ZWE | Zimbabwe | US Dollar | 2014 | 29217.554688 | 29355.759766 | 15.411675 | 8.222112 | NaN | 2.550258 | ... | 0.141791 | -0.340442 | 0.051500 | 0.600760 | 0.557172 | 0.392895 | 0.724510 | 0.628352 | 0.349735 | 0.704991 |
12373 | ZWE | Zimbabwe | US Dollar | 2015 | 30091.923828 | 29150.750000 | 15.777451 | 8.530669 | NaN | 2.584653 | ... | 0.137558 | -0.354298 | -0.023353 | 0.622927 | 0.580814 | 0.343926 | 0.654940 | 0.564430 | 0.348472 | 0.713156 |
12374 | ZWE | Zimbabwe | US Dollar | 2016 | 30974.292969 | 29420.449219 | 16.150362 | 8.839398 | NaN | 2.616257 | ... | 0.141248 | -0.310446 | 0.003050 | 0.640176 | 0.599462 | 0.337853 | 0.657060 | 0.550084 | 0.346553 | 0.718671 |
12375 | ZWE | Zimbabwe | US Dollar | 2017 | 32693.474609 | 30940.816406 | 16.529903 |