In this tutorial, we will read and write data structures in several common data formats.
The formats that support unstructured (non-relational) data include JSON (JavaScript Object Notation). Here is an example JSON document:
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  }
}
The character encoding is UTF-8. The data types in JSON files may be numbers, strings, booleans, arrays, objects (collections of name-value pairs), and null. More information on JSON is available in the EarthDataScience course.
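As a minimal sketch using Python's built-in json module, we can parse the example document above and check how JSON types map to Python types:
import json

doc = '''
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {"streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021-3100"}
}
'''
person = json.loads(doc)          # JSON object -> Python dict
print(type(person['firstName']))  # JSON string  -> str
print(type(person['isAlive']))    # JSON boolean -> bool
print(type(person['age']))        # JSON number  -> int
print(type(person['address']))    # JSON object  -> dict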
The main formats that support pixelated raster data are:
GeoTIFF: a metadata standard that embeds georeferencing information within a TIFF (Tagged Image File Format) file. GeoTIFF can be extended to be cloud optimized (the Cloud-Optimized GeoTIFF, or COG).
The formats that support tabular data include, for example, CSV (comma-separated values); see the short pandas sketch after this list.
The data formats for big heterogeneous data (different data types) are the hierarchical formats, such as HDF5, netCDF4, and Zarr, covered later in this tutorial.
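For the tabular case, here is a quick round-trip illustration with pandas (the file name stations.csv and its contents are invented for the example):
import pandas as pd

df = pd.DataFrame({'station': ['A', 'B'], 'elevation_m': [120.5, 340.0]})
df.to_csv('stations.csv', index=False)  # write a small CSV file
print(pd.read_csv('stations.csv'))      # read it back into a DataFrame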
import io
import os
import zipfile

import folium
import geopandas as gpd
import h5py
import matplotlib.pyplot as plt
import netCDF4 as nc
import numpy as np
import pandas as pd
# import pycrs
import rasterio
import requests
import wget
from folium.plugins import MarkerCluster
from rasterio.mask import mask
from rasterio.plot import show
Raster data is any pixelated (or gridded) data where each pixel is associated with a specific geographical location. The value of a pixel can be continuous (e.g. elevation) or categorical (e.g. land use).
The Python package rasterio (see its documentation here) reads raster formats such as GeoTIFF; for vector formats such as GeoJSON we will use geopandas below.
See additional introductory materials from EarthDataScience, and tutorials from the GeoHackweek.
We will download topography files found on this page, but stored in a Dropbox folder. The file is named HYP_50M_SR and comes as a zip archive.
# Download the data using wget.
fname = 'HYP_50M_SR'
# "?dl=1" at the end of the URL asks Dropbox for the file itself;
# out= pins the local file name so the next cell can find it
wget.download("https://www.dropbox.com/s/r75ecms0bvyqaca/" + fname + "?dl=1", out=fname + ".zip")
The data file is saved in the current working directory; we want to move it into a data folder:
os.makedirs("./data", exist_ok=True)  # make sure the destination folder exists
os.replace(fname + ".zip", "./data/" + fname + ".zip")  # keep the .zip extension so the next cell can open it
Unzip the file:
os.makedirs("./data/"+fname+"/",exist_ok=True)
# wget.download(url,out="HYP_50M_SR") # this does not work on the hub
z = zipfile.ZipFile('./data/'+fname+".zip")
z.extractall("./data/"+fname+"/")
Now let's get the Digital Elevation Map. We open the unzipped file using the package rasterio.
elevation = rasterio.open("./data/" + fname + "/" + fname + ".tif")
print(elevation.meta)  # dataset metadata; rasterio datasets have no .variables (that is netCDF4 API)
Let us have a look at the dimensions of the data:
elevation.height
elevation.width
elevation.indexes
Can you guess how to query the data types of the file entries?
# type below
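If you are stuck, one possibility is the dtypes attribute of rasterio datasets:
print(elevation.dtypes)  # data type of each band, e.g. ('uint8', 'uint8', 'uint8')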
and at the boundaries of the dataset:
elevation.bounds
print(elevation.transform * (0, 0)) # North West corner
print(elevation.transform * (elevation.width, elevation.height)) # South East corner
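As a convenience, rasterio can also apply the affine transform for us: the xy() method maps a (row, col) pixel index to map coordinates (the pixel center by default):
print(elevation.xy(0, 0))                                       # center of the upper-left pixel
print(elevation.xy(elevation.height - 1, elevation.width - 1))  # center of the lower-right pixel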
Here is the projection used for the data:
elevation.crs
How do we interpret the data? There are three bands, one for each of the colors red, green, and blue:
print(elevation.colorinterp[0])
print(elevation.colorinterp[1])
print(elevation.colorinterp[2])
print(np.min(elevation.read(1)), np.max(elevation.read(1)))
print(np.min(elevation.read(2)), np.max(elevation.read(2)))
print(np.min(elevation.read(3)), np.max(elevation.read(3)))
Let us now plot the data:
image = elevation.read()
show(image)
GeoTIFF is not the only georeferenced file format that we can read in Python. Let us look at an example of reading data from a GeoJSON file (a special case of a JSON file with geographic coordinates), this time with geopandas.
url = 'https://www.nps.gov/lib/npmap.js/4.0.0/examples/data/national-parks.geojson'
parks = gpd.read_file(url)
parks.head()
Let us plot the data.
folium is a nice Python package for visualization. The GeoHackweek tutorial on Folium is also informative.
m = folium.Map(location=[40, -100], zoom_start=4)
folium.GeoJson(parks).add_to(m)
m
We are going to focus on the parks in Washington State:
parks_WA = parks.iloc[[94, 127, 187, 228, 286, 294, 295, 297, 299, 300, 302]].reset_index()  # row indices of the Washington State parks
We create a list of locations to add popups to the map.
locations = []
for index in range(len(parks_WA)):
    location = [parks_WA['geometry'][index].y, parks_WA['geometry'][index].x]
    locations.append(location)
m = folium.Map(location=[47, -121], zoom_start=7)
marker_cluster = MarkerCluster().add_to(m)
for point in range(len(locations)):
    folium.Marker(location=locations[point], popup=parks_WA['Name'].iloc[point]).add_to(marker_cluster)
m
Hierarchical data formats are designed to store large amounts of data in a single file. They mimic a file system (e.g., a tree-like structure with nested directories) within a single file. There are two dominant hierarchical data formats (HDF5 and netCDF4), and one emerging cloud-oriented format (Zarr). Hierarchical formats in general can store many data types (numeric as well as string).
The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data. HDF5 uses a "file directory"-like structure that allows you to organize data within the file in many different structured ways, as you might do with files on your computer. HDF5 files are also self-describing: every element (the file itself, groups, and datasets) can carry metadata that describes its contents.
HDF Structure Example:
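To make this tree structure concrete, here is a minimal sketch with h5py (the file name demo.h5 and its contents are invented for the example):
with h5py.File('./data/demo.h5', 'w') as f:
    f.attrs['description'] = 'toy example'          # file-level metadata
    grp = f.create_group('station1')                # a "directory" inside the file
    dset = grp.create_dataset('temperature', data=np.random.rand(24))
    dset.attrs['units'] = 'degC'                    # dataset-level metadata

with h5py.File('./data/demo.h5', 'r') as f:
    f.visit(print)                                  # walks the tree: station1, station1/temperature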
The Network Common Data Form, or netCDF, was created in the early 1990s to solve some of the challenges of working with N-dimensional arrays. NetCDF is a collection of self-describing, machine-independent binary data formats and software tools that facilitate the creation, access, and sharing of scientific data stored in N-dimensional arrays, along with metadata describing the contents of each array. NetCDF was built by the climate science community at a time when regional climate models were beginning to produce larger and larger output files. NetCDF version 4 is built on top of HDF5: a netCDF4 file is also a valid HDF5 file.
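The netCDF4 package offers a similarly self-describing structure; here is a minimal writing sketch (the file demo.nc, its dimension, and its variable are invented for the example):
with nc.Dataset('./data/demo.nc', 'w', format='NETCDF4') as ds:
    ds.createDimension('time', 10)                   # a named dimension
    var = ds.createVariable('t2m', 'f4', ('time',))  # a float32 variable over that dimension
    var.units = 'K'                                  # attribute metadata
    var[:] = np.arange(10, dtype='f4')               # write the data values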
The netCDF and HDF5 formats have no limit on file size. However, any analysis tool that reads data from a netCDF array into memory for a computational operation is limited by that particular machine's available memory.
When reading a hierarchical file, the whole tree of the data structure is scanned from the root node down. Since this has to be done each time a user makes an inquiry, reading HDF5 and netCDF files can be slow. There are faster, cloud-optimized alternatives, such as Zarr (introduced below).
We will download a geological map stored in netCDF format in a Dropbox folder. The original data can be found in the USGS database (https://doi.org/10.3133/ofr20191081).
# Download the geological framework
file1 = wget.download("https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_GeologicFrameworkGrids.nc?dl=1")
# Download the coordinate grids
file2 = wget.download("https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_SpatialGrid.nc?dl=1")
# move data
os.replace(file1,'./data/'+file1)
os.replace(file2,'./data/'+file2)
# read data
geology = nc.Dataset('./data/'+file1)
grid = nc.Dataset('./data/'+file2)
geology
geology['Surface Elevation']
np.shape(geology['Surface Elevation'])
geology['Surface Elevation'][3246, 1234]
x = grid['x'][0:4901, 0:3201]
y = grid['y'][0:4901, 0:3201]
elevation = geology['Surface Elevation'][0:4901, 0:3201]  # note: this overwrites the rasterio dataset named elevation above
plt.contourf(x, y, elevation)
plt.colorbar(label='Surface Elevation')
plt.show()
Zarr is a cloud-optimized data format that handles heterogeneous data sets as chunked, compressed N-dimensional arrays.
In the following exercise, we will use the xarray sample data set air_temperature and save it into a Zarr store.
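Here is a minimal sketch of that workflow (xr.tutorial.load_dataset downloads a small sample file, so it needs internet access, and to_zarr needs the zarr package installed):
import xarray as xr

ds = xr.tutorial.load_dataset('air_temperature')     # small sample data set shipped with the xarray tutorials
ds.to_zarr('./data/air_temperature.zarr', mode='w')  # write it as a chunked, compressed Zarr store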
Let's work in groups to: