Intake is a lightweight package for finding, investigating, loading and disseminating data. This notebook illustrates the usefulness of Intake for a "Data User". Intake simplifies loading data from many formats into familiar Python objects like Pandas DataFrames or Xarray Datasets. Intake is especially useful for remote datasets: it lets us skip downloading the data and instead load it directly into a Python object for analysis.
Let's say we want to save a version of the data from our geopandas.ipynb tutorial for easy sharing and future use. Intake supports CSV by default, but to load data with geopandas we need to make sure the intake_geopandas plugin is installed.
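Before opening a catalog, it can help to confirm that the plugin package is importable. The following is a minimal sketch using only the standard library. The helper name `plugin_available` is hypothetical, and it only checks that a module such as `intake_geopandas` can be found on the import path, not that its drivers are registered with Intake.

```python
from importlib.util import find_spec


def plugin_available(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


# Hypothetical check for the packages this notebook relies on.
for name in ("intake", "intake_geopandas"):
    status = "installed" if plugin_available(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing, install it with your usual package manager before running the cells below.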
import intake
import xarray
print(intake.__version__)
xarray.set_options(display_style="html")
# Save data locally from our queries
import pandas as pd
import geopandas as gpd
server = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?'
query = 'service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
df = pd.read_csv(server+query)
df.to_csv('votw.csv', index=False)
# Or save as GeoJSON:
# load the query results as JSON directly in geopandas, then write them to disk
query = 'service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=json'
gf = gpd.read_file(server+query)
gf.to_file('votw.geojson', driver='GeoJSON')
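The two query strings above differ only in `outputFormat`. As a sketch, the URL could instead be assembled from a dictionary of WFS parameters with the standard library. `build_wfs_url` is a hypothetical helper, not part of pandas or geopandas, and the colon in `typeName` is deliberately left unescaped so the result matches the hand-written strings.

```python
from urllib.parse import urlencode

SERVER = "https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?"
TYPE_NAME = "GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes"


def build_wfs_url(output_format: str, server: str = SERVER) -> str:
    """Assemble a WFS GetFeature URL for the requested output format."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeName": TYPE_NAME,
        "outputFormat": output_format,
    }
    # safe=":" keeps the namespace separator in typeName readable
    return server + urlencode(params, safe=":")


csv_url = build_wfs_url("csv")
json_url = build_wfs_url("json")
print(csv_url)
print(json_url)
```

Either URL can then be passed to `pd.read_csv` or `gpd.read_file` exactly as in the cell above.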
%%writefile votw-intake-catalog.yaml
metadata:
  version: 1
sources:
  votw_pandas:
    args:
      csv_kwargs:
        blocksize: null #prevent reading in parallel with dask
      #urlpath: 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
      urlpath: './votw.csv'
    description: 'Smithsonian_VOTW_Holocene_Volcanoes 4.8.4'
    driver: csv
    metadata:
      citation: 'Global Volcanism Program, 2013. Volcanoes of the World, v. 4.8.4. Venzke, E (ed.). Smithsonian Institution. Downloaded 06 Dec 2019. https://doi.org/10.5479/si.GVP.VOTW4-2013'
      plots:
        last_eruption_year:
          kind: violin
          by: 'Region'
          y: 'Last_Eruption_Year'
          invert: True
          width: 700
          height: 500
  votw_geopandas:
    args:
      #urlpath: 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=json'
      urlpath: './votw.geojson'
    description: 'Smithsonian_VOTW_Holocene_Volcanoes 4.8.4'
    driver: geojson
    metadata:
      citation: 'Global Volcanism Program, 2013. Volcanoes of the World, v. 4.8.4. Venzke, E (ed.). Smithsonian Institution. Downloaded 06 Dec 2019. https://doi.org/10.5479/si.GVP.VOTW4-2013'
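Because the catalog is just nested mappings, it can be sanity-checked before sharing. The sketch below works on an already-parsed mapping (for example, the result of loading the YAML file) and uses only the standard library. `check_catalog` is a hypothetical helper, and the two rules it enforces (every source has a `driver` and an `args.urlpath`) are assumptions that suit this catalog rather than requirements of every Intake driver.

```python
from typing import Any, Dict, List


def check_catalog(catalog: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed catalog mapping."""
    sources = catalog.get("sources")
    if not isinstance(sources, dict) or not sources:
        return ["catalog has no 'sources' mapping"]
    problems = []
    for name, entry in sources.items():
        if "driver" not in entry:
            problems.append(f"{name}: missing 'driver'")
        if "urlpath" not in entry.get("args", {}):
            problems.append(f"{name}: missing 'args.urlpath'")
    return problems


# The core of the catalog above, written as a Python mapping.
catalog = {
    "metadata": {"version": 1},
    "sources": {
        "votw_pandas": {
            "driver": "csv",
            "args": {"urlpath": "./votw.csv", "csv_kwargs": {"blocksize": None}},
        },
        "votw_geopandas": {
            "driver": "geojson",
            "args": {"urlpath": "./votw.geojson"},
        },
    },
}

print(check_catalog(catalog))  # an empty list means no problems were found
```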
# put this catalog, votw.csv, and votw.geojson, in a public place like GitHub!
# This facilitates sharing and version controlled analysis
cat = intake.open_catalog('votw-intake-catalog.yaml')
print(list(cat))
cat.votw_pandas.description
# Loading the data is now very straightforward.
# The source's container attribute tells us the data will be read into a Pandas DataFrame:
cat.votw_pandas.container
df = cat.votw_pandas.read()
df.head()
# Notice we also specified some pre-defined plots in the catalog
# This requires hvplot
import hvplot.pandas
source = cat.votw_pandas
source.plot.last_eruption_year()
# Load a different dataset in the same catalog
source = cat.votw_geopandas
source.description
gf = source.read()
test = gf.loc[:,['Last_Eruption_Year', 'Volcano_Name', 'geometry']]
test.hvplot.points(geo=True, hover_cols=['Volcano_Name'], color='Last_Eruption_Year')
We've seen a plugin that loads geospatial vector data into geopandas GeoDataFrames. There is also a plugin that loads geospatial raster data into xarray DataArrays: https://github.com/intake/intake-xarray
# load a catalog stored on github
xcat = intake.open_catalog('https://raw.githubusercontent.com/intake/intake-xarray/master/examples/catalog.yml')
display(list(xcat))
The use of the intake catalog is much the same as above, except that the data container has switched to xarray objects.
geotiff = xcat.geotiff
geotiff.plot.band_image()
da = geotiff.read() # to xarray.DataArray
da.max('band')
Instead of creating your own metadata catalogs from scratch as YAML files, you can use intake plugins that read catalogs in other formats. For example, for geospatial data on the web, SpatioTemporal Asset Catalogs (STAC) are emerging as a standard way to describe data that you want to search for based on georeferenced location, time, and perhaps other metadata fields. The intake-stac plugin greatly facilitates loading datasets referenced in STAC catalogs into Python Xarray objects for analysis. https://github.com/pangeo-data/intake-stac
stac_cat = intake.open_stac_catalog(
'https://storage.googleapis.com/pdd-stac/disasters/catalog.json',
name='planet-disaster-data'
)
display(list(stac_cat))
print(stac_cat['Houston-East-20170831-103f-100d-0f4f-RGB'])
Entries in the catalog are accessed just like above. Below we pull the thumbnail image from the Hurricane Harvey composite image.
da = stac_cat['Houston-East-20170831-103f-100d-0f4f-RGB']['thumbnail'].to_dask()
da
da.plot.imshow(rgb='channel')
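To make the idea of searching a STAC catalog by location and time concrete, here is a minimal pure-Python sketch. It does not use intake-stac or any STAC client. It only relies on the `id`, `bbox` (`[west, south, east, north]`) and `properties.datetime` fields of STAC items. The item ids, coordinates, and helper names are hypothetical, and boxes that cross the antimeridian are not handled.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Sequence


def bboxes_intersect(a: Sequence[float], b: Sequence[float]) -> bool:
    """Test overlap of two [west, south, east, north] boxes in degrees."""
    return not (a[2] < b[0] or a[0] > b[2] or a[3] < b[1] or a[1] > b[3])


def search_items(
    items: List[Dict[str, Any]],
    bbox: Sequence[float],
    start: datetime,
    end: datetime,
) -> List[str]:
    """Return ids of items that overlap bbox and fall within [start, end]."""
    matches = []
    for item in items:
        when = datetime.fromisoformat(item["properties"]["datetime"])
        if bboxes_intersect(item["bbox"], bbox) and start <= when <= end:
            matches.append(item["id"])
    return matches


# Hypothetical items; ids and coordinates are made up for illustration.
items = [
    {"id": "scene-a", "bbox": [-95.5, 29.5, -95.0, 30.0],
     "properties": {"datetime": "2017-08-31T00:00:00+00:00"}},
    {"id": "scene-b", "bbox": [10.0, 45.0, 11.0, 46.0],
     "properties": {"datetime": "2017-08-31T00:00:00+00:00"}},
    {"id": "scene-c", "bbox": [-95.5, 29.5, -95.0, 30.0],
     "properties": {"datetime": "2019-01-01T00:00:00+00:00"}},
]

houston_area = [-96.0, 29.0, -94.5, 30.5]
found = search_items(
    items,
    houston_area,
    start=datetime(2017, 8, 1, tzinfo=timezone.utc),
    end=datetime(2017, 9, 30, tzinfo=timezone.utc),
)
print(found)
```

Only the first item overlaps the search box and falls within the date range, so it is the only id returned.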