#!/usr/bin/env python
# coding: utf-8

# ## Reading Tabular Data
#
# The Planetary Computer provides tabular data in the [Apache Parquet](https://parquet.apache.org/) file format, a standardized, high-performance columnar storage format.
#
# When working from Python, there are several options for reading parquet datasets. The right choice depends on the size and kind of the data you're reading. When reading geospatial data, with one or more columns containing vector geometries, we recommend using [geopandas](https://geopandas.org/) for small datasets and [dask-geopandas](https://github.com/geopandas/dask-geopandas) for large datasets. For non-geospatial tabular data, we recommend [pandas](https://pandas.pydata.org/) for small datasets and [Dask](https://dask.org/) for large datasets.
#
# Regardless of which library you're using to read the data, we recommend using [STAC](https://stacspec.org/) to discover which datasets are available, and which options should be provided when reading the data.
#
# In this example we'll work with data from the US Forest Service's [Forest Inventory and Analysis](https://planetarycomputer.microsoft.com/dataset/fia) dataset. This includes a collection of tables providing information about forest health and location in the United States.

# In[1]:


import pystac_client
import planetary_computer

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
fia = catalog.get_collection("fia")
fia


# The FIA Collection has a number of items, each of which represents a different table stored in Parquet format.

# In[2]:


list(fia.get_all_items())


# To load a single table, get its item and extract the `href` from the `data` asset. The "boundary" table, which provides information about subplots, is relatively small and doesn't contain a geospatial geometry column, so it can be read with pandas.

# In[3]:


import pandas as pd
import planetary_computer

boundary = fia.get_item(id="boundary")
asset = boundary.assets["data"]
df = pd.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    columns=["CN", "AZMLEFT", "AZMCORN"],
)
df.head()


# There are a few important pieces to highlight:
#
# 1. As usual with the Planetary Computer, we signed the STAC item so that we could access the data. See [Using tokens for data access](https://planetarycomputer.microsoft.com/docs/concepts/sas/) for more.
# 2. We relied on the asset to provide all the information necessary to load the data, like the `href` and the `storage_options`. All we needed to know was the ID of the Collection and Item.
# 3. We used pandas' and Parquet's ability to select subsets of the data with the `columns` keyword.
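# To see point 3 in isolation, here is a small self-contained sketch (not part of the
# original quickstart) that writes a scratch Parquet file and reads it back with both the
# `columns` keyword and a `filters` predicate, which the pyarrow engine uses to skip
# non-matching data at read time. The file name and values are invented for the example.

import pandas as pd

# Write a tiny scratch file so the read below has something local to work against.
pd.DataFrame(
    {"CN": [1, 2, 3], "AZMLEFT": [10, 20, 30], "AZMCORN": [5, 15, 25]}
).to_parquet("example_subset.parquet", engine="pyarrow")

# `columns` prunes columns at read time; `filters` passes a row predicate to pyarrow.
pd.read_parquet(
    "example_subset.parquet",
    engine="pyarrow",
    columns=["CN", "AZMLEFT"],
    filters=[("AZMLEFT", ">=", 20)],
)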
# Larger datasets can be read using [Dask](https://dask.org/). For example, the `cpdata/raw/fia/tree.parquet` folder contains about 160 individual Parquet files, totalling about 22 million rows. In this case, pass the path to the directory to `dask.dataframe.read_parquet`.

# In[4]:


import dask.dataframe as dd

tree = fia.get_item(id="tree")
asset = tree.assets["data"]

df = dd.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    columns=["SPCD", "CARBON_BG", "CARBON_AG"],
    engine="pyarrow",
)
df


# That lazily loads the data into a Dask DataFrame. We can operate on the DataFrame with pandas-like methods and call `.compute()` to get the result.
# In this case, we'll compute the average amount of carbon sequestered above and below ground for each tree, grouped by species type. To cut down on execution time, we'll select just the first partition.

# In[5]:


result = df.get_partition(0).groupby("SPCD").mean().compute()  # group by species
result


# ### Geospatial parquet datasets
#
# The `us-census` collection has some items that include a `geometry` column, and so can be loaded with `geopandas`. All parquet datasets hosted by the Planetary Computer with one or more geospatial columns use the [geoparquet](https://github.com/opengeospatial/geoparquet) standard for encoding the geospatial metadata.

# In[6]:


import geopandas

item = catalog.get_collection("us-census").get_item("2020-cb_2020_us_state_500k")
asset = item.assets["data"]
df = geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df.head()


# With this, we can visualize the boundaries for the continental United States.

# In[7]:


import contextily

drop = ["GU", "AK", "MP", "HI", "VI", "PR", "AS"]

ax = df[~df.STUSPS.isin(drop)].plot()
contextily.add_basemap(
    ax, crs=df.crs.to_string(), source=contextily.providers.Esri.NatGeoWorldMap
)


# ### Learn more
#
# This quickstart briefly introduced how to access tabular data on the Planetary Computer. For more, see
#
# * The [pandas documentation](https://pandas.pydata.org/docs/user_guide/io.html#parquet) for an introduction to Parquet
# * [Scale with Dask](https://planetarycomputer.microsoft.com/docs/quickstarts/scale-with-dask/) for more on using Dask to work with large datasets in parallel
# * The [Forest Inventory and Analysis](https://planetarycomputer.microsoft.com/dataset/fia) catalog page.
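# The introduction also mentions [dask-geopandas](https://github.com/geopandas/dask-geopandas)
# for geospatial datasets that are too large for a single GeoDataFrame. As a closing sketch
# (not part of the original quickstart), the same state-boundaries asset from above can be
# read lazily with `dask_geopandas.read_parquet`; this assumes `storage_options` is forwarded
# to the underlying filesystem the same way as in `dask.dataframe.read_parquet`.

import dask_geopandas

# `asset` here is the us-census `data` asset loaded in the geopandas example above.
dgdf = dask_geopandas.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
)
dgdf  # a lazy, partitioned GeoDataFrame; call .compute() to materialize it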