#!/usr/bin/env python
# coding: utf-8

# ## Zonal Statistics
#
# In this tutorial, you'll learn how to use [xarray-spatial](https://xarray-spatial.org/) to work with zonal statistics. Zonal statistics help you better understand data from one source by analyzing it for different zones defined by another source. This operation uses two datasets: one dataset, the *zones raster*, defines one or more zones. A second dataset, the *values raster*, contains the data you want to analyze for each of the zones defined by the first dataset. If you're familiar with SQL or with dataframe libraries like [pandas](https://pandas.pydata.org/), this is similar to a group-by operation: you essentially group the *values* raster by the *zones* raster.
#
# In this tutorial, we'll use the [Esri 10-Meter Land Cover](https://planetarycomputer.microsoft.com/dataset/io-lulc) dataset, which classifies pixels into one of ten classes, as the zones raster. Our values raster will be the [Normalized difference vegetation index (NDVI)](https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index), derived from [Sentinel-2 Level 2-A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a).
#
# ---

# In[1]:

import numpy as np
import planetary_computer
import pystac_client
import stackstac
import rasterio
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from xrspatial.multispectral import true_color
from xrspatial.classify import equal_interval
from xrspatial.zonal import stats as zonal_stats
from xrspatial.zonal import crosstab as zonal_crosstab

# ### Preparation: Create a local Dask cluster
#
# We'll make a local Dask cluster to process the data in parallel.
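# Before we start the cluster, here is the group-by analogy in miniature: a minimal sketch using plain NumPy and pandas on tiny hypothetical 3x3 arrays (not part of this tutorial's data). Flattening both rasters and grouping the values by zone is conceptually what `xrspatial.zonal.stats` does at scale.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 rasters: integer zone labels, and values to summarize per zone.
zones = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [3, 3, 3]])
values = np.array([[0.1, 0.2, 0.8],
                   [0.3, 0.7, 0.9],
                   [0.4, 0.5, 0.6]])

# Zonal statistics = group the flattened values raster by the flattened zones raster.
df = pd.DataFrame({"zone": zones.ravel(), "value": values.ravel()})
per_zone_mean = df.groupby("zone")["value"].mean()
print(per_zone_mean)
```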
# In[2]:

from dask.distributed import Client

client = Client()
print(f"/proxy/{client.scheduler_info()['services']['dashboard']}/status")

# To follow the progress of your computation, you can [access the Dask Dashboard](https://planetarycomputer.microsoft.com/docs/quickstarts/scale-with-dask/#Open-the-dashboard) at the URL from the previous cell's output.

# ### Load and coregister data
#
# The area of interest covers Lake Bridgeport, Texas, USA. We'll use `pystac-client` to find items covering that area, and `stackstac` to load the items into a `DataArray`.

# In[3]:

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    modifier=planetary_computer.sign_inplace,
)

bounds = (-98.00080760573508, 32.99921674609716, -96.9991860639418, 34.000729644613706)

landcover_search = catalog.search(collections=["io-lulc"], bbox=bounds)
landcover_items = landcover_search.item_collection()

landcover_data = (
    stackstac.stack(
        landcover_items,
        epsg=3857,
        bounds_latlon=bounds,
        dtype="int8",
        fill_value=0,
        chunksize=2048,
        resolution=100,
    )
    .pipe(stackstac.mosaic, nodata=0)
    .squeeze()
    .rename("Landcover")
    .persist()
)
landcover_data

# The landcover data includes some class labels, which we'll use to label the outputs later on.

# In[4]:

landcover_labels = dict(
    enumerate(landcover_data.coords["label:classes"].item()["classes"])
)
landcover_labels

# Finally, the landcover COGs also include a colormap.

# In[5]:

class_count = len(landcover_labels)

with rasterio.open(landcover_items[0].assets["data"].href) as src:
    landcover_colormap_def = src.colormap(1)  # get metadata colormap for band 1
    landcover_colormap = [
        np.array(landcover_colormap_def[i]) / 255 for i in range(class_count)
    ]
landcover_cmap = ListedColormap(landcover_colormap)

# #### Sentinel data
#
# Now we'll do the same thing, only for `sentinel-2-l2a` items from July 2020. We'll also try to select some not-too-cloudy items and mosaic over time to remove as many clouds as possible.
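# The cloud-removal idea can be sketched with NumPy alone: stack the scenes along a time axis and take the per-pixel median, ignoring NaN (masked or nodata) pixels. This is a toy illustration with hypothetical 2x2 scenes, not the stackstac pipeline used in this tutorial.

```python
import numpy as np

# Three hypothetical 2x2 scenes over time; NaN marks a masked/nodata pixel.
t0 = np.array([[0.20, np.nan], [0.30, 0.40]])
t1 = np.array([[0.25, 0.50], [np.nan, 0.45]])
t2 = np.array([[0.90, 0.55], [0.35, np.nan]])  # 0.9: a bright cloud slipped through

stack = np.stack([t0, t1, t2])           # shape (time, y, x)
composite = np.nanmedian(stack, axis=0)  # per-pixel median over time
print(composite)
```

# The median keeps the "typical" surface reflectance per pixel, so a single bright cloudy observation (like the 0.9 above) does not survive into the composite.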
# In[6]:

sentinel_search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bounds,
    datetime="2020-07-01/2020-07-30",
    query={"eo:cloud_cover": {"lt": 10}},
)
sentinel_items = sentinel_search.item_collection()

# We'll need the zones and values rasters to be in the same CRS and resolution, so we'll make sure to pass those values while we're loading the data with `stackstac` (see [Coregistration](coregistration.ipynb) for more).

# In[7]:

sentinel_data = (
    stackstac.stack(
        sentinel_items,
        resolution=landcover_data.resolution,
        epsg=landcover_data.spec.epsg,
        bounds=landcover_data.spec.bounds,
        assets=["B02", "B03", "B04", "B08"],  # blue, green, red, nir
        chunksize=2048,
    )
    .assign_coords(band=lambda x: x.common_name.rename("band"))  # use common names
    .where(lambda x: x > 0, other=np.nan)  # Sentinel-2 uses 0 as nodata
    .median(dim="time", keep_attrs=True)
    .persist()
)
sentinel_data

# Create a true color image for the Sentinel data:

# In[8]:

sentinel_img = true_color(
    sentinel_data.sel(band="red"),
    sentinel_data.sel(band="green"),
    sentinel_data.sel(band="blue"),
    c=30,
    th=0.075,
    name="True color (Sentinel)",
)

# In[9]:

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18, 8))
landcover_data.plot.imshow(ax=ax1, cmap=landcover_cmap)
sentinel_img.plot.imshow(ax=ax2)
plt.tight_layout()

# As a final bit of setup, we'll compute NDVI for the Sentinel-2 scene. This will be our values raster when we compute zonal statistics later on.

# In[10]:

red, nir = sentinel_data.sel(band=["red", "nir"])
ndvi = ((nir - red) / (nir + red)).persist()
ndvi

# ### Compute zonal statistics
#
# Now that we have our zones raster (the landcover DataArray) and our values raster (NDVI derived from the Sentinel composite), we're ready to compute zonal stats.
# We'll use [xrspatial.zonal.stats](https://xarray-spatial.org/reference/_autosummary/xrspatial.zonal.stats.html) to find a few summary statistics for each landcover class.
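# First, a quick sanity check of the NDVI formula on hypothetical reflectance values with plain NumPy: NDVI is (NIR - red) / (NIR + red), so it is bounded by [-1, 1], with dense vegetation near the top of the range and water near the bottom.

```python
import numpy as np

# Hypothetical surface reflectances for three surfaces: vegetation, soil, water-ish.
red = np.array([0.05, 0.20, 0.30])
nir = np.array([0.50, 0.30, 0.20])

ndvi = (nir - red) / (nir + red)
print(ndvi)
```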
# In[11]:

ndvi_stats = (
    zonal_stats(
        zones=landcover_data,
        values=ndvi,
        stats_funcs=["mean", "max", "min", "count"],
    )
    .compute()
    .set_index("zone")
    .rename(landcover_labels)
)
ndvi_stats.style.background_gradient()

# Unsurprisingly, landcover classes like "trees" and "grass" have relatively high NDVI values. Classes like "water" and "bare" are relatively low.

# ### Compute zonal cross-tabulation statistics
#
# [xrspatial.zonal.crosstab](https://xarray-spatial.org/reference/_autosummary/xrspatial.zonal.crosstab.html) calculates cross-tabulated areas between the zones raster and the values raster. This function requires a 2D zones raster and a categorical values raster of 2D or 3D data.
#
# The `crosstab` function can compute several cross-tabulation statistics; its `agg` parameter selects which aggregation method to use. It returns a DataFrame where each row represents a zone from the zones raster and each column represents a category from the values raster.
#
# This example uses NASADEM elevation zones as its zones raster and the Esri Land Cover data as its categorical values raster. The resulting DataFrame will show the percentage of each land cover category for each of the five elevation zones.

# In[12]:

nasadem_search = catalog.search(collections=["nasadem"], bbox=bounds)
nasadem_items = nasadem_search.item_collection()

nasadem_data = (
    stackstac.stack(
        nasadem_items,
        epsg=landcover_data.spec.epsg,
        resolution=landcover_data.resolution,
        bounds=landcover_data.spec.bounds,
        chunksize=2048,
    )
    .pipe(stackstac.mosaic)
    .squeeze()
    .rename("Elevation (NASADEM)")
    .persist()
)
nasadem_data

# Now let's discretize it into five zones.
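# Equal-interval classification splits the data range into k bins of equal width and labels each pixel with its bin index. Here is a NumPy sketch of that definition on hypothetical elevations (a conceptual illustration, not xrspatial's implementation):

```python
import numpy as np

# Hypothetical elevations in meters; classify into k equal-width bins, labeled 0..k-1.
elevation = np.array([200.0, 250.0, 400.0, 650.0, 900.0, 1000.0])
k = 5

edges = np.linspace(elevation.min(), elevation.max(), k + 1)  # 6 edges -> 5 bins
zones = np.clip(np.digitize(elevation, edges) - 1, 0, k - 1)  # top edge folds into last bin
print(zones)  # -> [0 0 1 2 4 4]
```

# Note that equal-*interval* bins have equal width, not equal pixel counts; a quantile classifier would instead balance the number of pixels per zone.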
# In[13]:

num_zones = 5
elevation_cmap = ListedColormap(plt.get_cmap("Set1").colors[:num_zones])

elevation_zones = equal_interval(
    nasadem_data, k=num_zones, name="Zones (Classified Elevation - NASADEM)"
)

fig, ax = plt.subplots(figsize=(8, 8))
img = elevation_zones.plot.imshow(cmap=elevation_cmap, ax=ax)
img.colorbar.set_ticks([0.4, 1.2, 2.0, 2.8, 3.6])
img.colorbar.set_ticklabels(range(num_zones))
ax.set_axis_off()

# Finally, calculate the cross-tabulation statistics and display a table showing how the land cover categories are distributed over each of the five elevation zones. Set values for `zone_ids` and `cat_ids` to restrict the zones and categories of interest. In this example, we'll use all the available zones and categories.

# In[14]:

crosstab = (
    zonal_crosstab(
        zones=elevation_zones,
        values=landcover_data,
        agg="percentage",
    )
    .rename(columns=landcover_labels)
    .compute()
    .set_index("zone")
)
crosstab.style.background_gradient()

# ### Next steps: analyze different datasets
#
# The [Planetary Computer Data Catalog](https://planetarycomputer.microsoft.com/catalog) includes petabytes of environmental monitoring data. All datasets are available in consistent, analysis-ready formats. You can access them through APIs as well as directly via [Azure Storage](https://docs.microsoft.com/en-us/azure/storage/).
#
# Try using [xrspatial.zonal.stats](https://xarray-spatial.org/reference/_autosummary/xrspatial.zonal.stats.html) and [xrspatial.zonal.crosstab](https://xarray-spatial.org/reference/_autosummary/xrspatial.zonal.crosstab.html) with different datasets from the [Data Catalog](https://planetarycomputer.microsoft.com/catalog). For example, try using the land cover categories from the [Esri 10m Land Cover](https://www.arcgis.com/home/item.html?id=d6642f8a4f6d4685a24ae2dc0c73d4ac) dataset as a zones raster. Or try using the [Map of Biodiversity Importance (MoBI)](https://planetarycomputer.microsoft.com/dataset/mobi) dataset as a values raster.
#
# There are also [other zonal functions in xarray-spatial](https://xarray-spatial.org/reference/zonal.html) to use with your datasets.