This dataset contains global estimates of 10-class land use/land cover for 2020, derived from ESA Sentinel-2 imagery at 10m resolution. In this notebook, we'll demonstrate how to access and work with this data through the Planetary Computer.
This notebook works with or without an API key, but you will be given more permissive access to the data with an API key. The Planetary Computer Hub is pre-configured to use your API key.
import dask.distributed
from matplotlib.colors import ListedColormap
import pystac_client
from pystac.extensions.projection import ProjectionExtension as proj
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import planetary_computer
import rasterio
import rasterio.features
import stackstac
# Set the environment variable PC_SDK_SUBSCRIPTION_KEY, or set it here.
# The Hub sets PC_SDK_SUBSCRIPTION_KEY automatically.
# planetary_computer.settings.set_subscription_key(<YOUR API Key>)
The datasets hosted by the Planetary Computer are available from Azure Blob Storage. We'll use pystac-client to search the Planetary Computer's STAC API for the subset of the data that we care about, and then we'll load the data directly from Azure Blob Storage. We'll specify a modifier so that we can access the data stored in the Planetary Computer's private Blob Storage Containers. See Reading from the STAC API and Using tokens for data access for more.
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1",
modifier=planetary_computer.sign_inplace,
)
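As a side note, signing can also be done explicitly instead of through a modifier; here is a minimal sketch (the variable names are ours, and max_items just keeps the example small):

# Hypothetical alternative: skip the modifier and sign results yourself.
# planetary_computer.sign accepts items, assets, and plain hrefs.
unsigned_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
first_item = next(
    unsigned_catalog.search(collections=["io-lulc"], max_items=1).items()
)
signed_item = planetary_computer.sign(first_item)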
We'll pick an area surrounding Manila, Philippines and use the STAC API to find what data items are available. We won't select a date range since this dataset contains items from a single timeframe in 2020.
area_of_interest = {
"type": "Polygon",
"coordinates": [
[
[120.88256835937499, 14.466596475463248],
[121.34948730468749, 14.466596475463248],
[121.34948730468749, 14.81737062015525],
[120.88256835937499, 14.817370620155254],
[120.88256835937499, 14.466596475463248],
]
],
}
search = catalog.search(collections=["io-lulc"], intersects=area_of_interest)
# Check how many items were returned
items = search.item_collection()
print(f"Returned {len(items)} Items")
Returned 4 Items
We found 4 items that intersect our area of interest, which means the data we want is spread across 4 non-overlapping GeoTIFF files in Blob Storage. To merge them, we could open each item, clip it to our area of interest, reproject everything to a common CRS (the items may span multiple UTM zones), and mosaic the results manually with rasterio, as sketched below.
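For reference, that manual path might look roughly like the following sketch (not run here; the target CRS and the use of WarpedVRT are our assumptions, not part of this notebook's workflow):

# Rough sketch of the manual alternative.
from rasterio.merge import merge
from rasterio.vrt import WarpedVRT
from rasterio.warp import transform_bounds

target_crs = "EPSG:32651"  # UTM zone 51N, which covers Manila
bounds = transform_bounds(
    "EPSG:4326", target_crs, *rasterio.features.bounds(area_of_interest)
)
datasets = [rasterio.open(item.assets["data"].href) for item in items]
vrts = [WarpedVRT(ds, crs=target_crs) for ds in datasets]  # reproject lazily
mosaic, mosaic_transform = merge(vrts, bounds=bounds, nodata=0)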
Instead, we'll use the stackstac library to read, merge, and reproject in a single step - all without loading the file data we don't need.
# The STAC metadata contains some information we'll want to use when creating
# our merged dataset. Get the EPSG code of the first item and the nodata value.
item = items[0]
epsg = proj.ext(item).epsg
nodata = item.assets["data"].extra_fields["raster:bands"][0]["nodata"]
bounds_latlon = rasterio.features.bounds(area_of_interest)
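As an aside, pystac's typed raster extension can read the same nodata value without touching extra_fields directly; a minimal sketch, assuming a pystac version that includes the extension:

# Optional: typed access to the same raster:bands metadata via pystac.
from pystac.extensions.raster import RasterExtension

nodata_alt = RasterExtension.ext(item.assets["data"]).bands[0].nodata
assert nodata_alt == nodata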
# Create a single DataArray from our multiple results with the corresponding
# rasters projected to a single CRS. Note that we set the dtype to ubyte, which
# matches our data, since stackstac will use float64 by default.
stack = stackstac.stack(
items, epsg=epsg, dtype=np.ubyte, fill_value=nodata, bounds_latlon=bounds_latlon
)
stack
<xarray.DataArray 'stackstac-5c28b96685cb1458a25519df2b6bfa98' (time: 1, band: 1, y: 3924, x: 5063)>
dask.array<fetch_raster_window, shape=(1, 1, 3924, 5063), dtype=uint8, chunksize=(1, 1, 1024, 1024), chunktype=numpy.ndarray>
Coordinates: (12/20)
  * time            (time) datetime64[ns] 2020-06-01
    id              (time) <U8 '51P-2020'
  * band            (band) <U4 'data'
  * x               (x) float64 2.718e+05 2.718e+05 ... 3.224e+05 3.224e+05
  * y               (y) float64 1.639e+06 1.639e+06 ... 1.6e+06 1.6e+06
    proj:epsg       int64 32651
    ...              ...
    proj:transform  object {0.0, 137987.1216733564, 10.0, 1799862.99727356...
    proj:bbox       object {908347.1216733564, 137987.1216733564, 790452.9...
    label:type      <U6 'raster'
    file:size       int64 103458531
    raster:bands    object {'nodata': 0, 'spatial_resolution': 10}
    epsg            int64 32651
Attributes:
    spec:        RasterSpec(epsg=32651, bounds=(271760.0, 1599970.0, 322390.0...
    crs:         epsg:32651
    transform:   | 10.00, 0.00, 271760.00|\n| 0.00,-10.00, 1639210.00|\n| 0.0...
    resolution:  10.0
Start up a local Dask cluster to allow us to do parallel reads. Use the following URL to open a dashboard in the Hub's Dask Extension.
client = dask.distributed.Client(processes=False)
print(f"/proxy/{client.scheduler_info()['services']['dashboard']}/status")
/srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/node.py:183: UserWarning: Port 8787 is already in use. Perhaps you already have a cluster running? Hosting the HTTP server on port 46119 instead
  warnings.warn(
/proxy/46119/status
So far, we haven't read in any data. Stackstac has used the STAC metadata to construct a DataArray that will contain our Item data. Let's mosaic the rasters across the time dimension (remember, they're all from a single synthesized "time" from 2020) and drop the single band dimension. Finally, we ask Dask to read the actual data by calling .compute().
merged = stackstac.mosaic(stack, dim="time", axis=None, nodata=0).squeeze().compute()
merged
<xarray.DataArray 'stackstac-5c28b96685cb1458a25519df2b6bfa98' (y: 3924, x: 5063)>
array([[5, 5, 5, ..., 2, 2, 2],
       [5, 5, 5, ..., 2, 2, 2],
       [5, 5, 5, ..., 2, 2, 2],
       ...,
       [7, 7, 7, ..., 2, 2, 2],
       [7, 7, 7, ..., 2, 2, 2],
       [7, 7, 7, ..., 2, 2, 2]], dtype=uint8)
Coordinates: (12/18)
    band            <U4 'data'
  * x               (x) float64 2.718e+05 2.718e+05 ... 3.224e+05 3.224e+05
  * y               (y) float64 1.639e+06 1.639e+06 ... 1.6e+06 1.6e+06
    proj:epsg       int64 32651
    label:classes   object {'name': '', 'classes': ['nodata', 'water', 'tr...
    start_datetime  <U20 '2020-01-01T00:00:00Z'
    ...              ...
    proj:transform  object {0.0, 137987.1216733564, 10.0, 1799862.99727356...
    proj:bbox       object {908347.1216733564, 137987.1216733564, 790452.9...
    label:type      <U6 'raster'
    file:size       int64 103458531
    raster:bands    object {'nodata': 0, 'spatial_resolution': 10}
    epsg            int64 32651
Attributes:
    spec:        RasterSpec(epsg=32651, bounds=(271760.0, 1599970.0, 322390.0...
    crs:         epsg:32651
    transform:   | 10.00, 0.00, 271760.00|\n| 0.00,-10.00, 1639210.00|\n| 0.0...
    resolution:  10.0
Now a quick plot to check that we've got the data we want.
merged.plot()
plt.show()
It looks good, but it doesn't look like a land cover map. The source GeoTIFFs contain a colormap and the STAC metadata contains the class names. We'll open one of the source files just to read this metadata and construct the right colors and names for our plot.
class_names = merged.coords["label:classes"].item()["classes"]
class_count = len(class_names)
with rasterio.open(item.assets["data"].href) as src:
colormap_def = src.colormap(1) # get metadata colormap for band 1
colormap = [
np.array(colormap_def[i]) / 255 for i in range(class_count)
] # transform to matplotlib color format
cmap = ListedColormap(colormap)
fig, ax = plt.subplots(
figsize=(12, 6), dpi=125, subplot_kw=dict(projection=ccrs.epsg(epsg)), frameon=False
)
p = merged.plot(
ax=ax,
transform=ccrs.epsg(epsg),
cmap=cmap,
add_colorbar=False,
vmin=0,
vmax=class_count,
)
ax.set_title("Manila 2020 LULC")
cbar = plt.colorbar(p)
cbar.set_ticks(range(class_count))
cbar.set_ticklabels(class_names)
That looks better. Let's also plot a histogram of the pixel values to see the distribution of land cover types within our area of interest. We can reuse the colormap we generated to help tie the two visualizations together.
colors = list(cmap.colors)
ax = (
    pd.Series(merged.data.ravel())
    .value_counts(sort=False)
.sort_index()
.reindex(range(len(colors)), fill_value=0)
.rename(dict(enumerate(class_names)))
.plot.barh(color=colors, rot=0, width=0.9)
)
ax.set(
title="Distribution of Land Cover classes",
ylabel="Landcover class",
xlabel="Pixel count",
);
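If you'd rather compute the tally with NumPy directly, np.bincount gives the same counts; a small equivalent sketch:

# Same counts via NumPy; minlength pads classes absent from this AoI
# so the result still lines up with class_names.
counts = np.bincount(merged.data.ravel(), minlength=class_count)[:class_count]
pd.Series(counts, index=class_names).plot.barh(color=colors, rot=0, width=0.9)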