#!/usr/bin/env python
# coding: utf-8

# # Data Ingestion - Geospatial-Specific Tooling

# ![PySTAC](images/pystac.png "PySTAC")

# ---

# ## Overview

# In this notebook, you will ingest Landsat data for use in machine learning. Machine learning tasks often involve a lot of data, and in Python, data is typically stored in memory as simple [NumPy](https://foundations.projectpythia.org/core/numpy.html) arrays. However, higher-level containers built on top of NumPy arrays provide more functionality for multidimensional gridded data ([xarray](http://xarray.pydata.org)) or out-of-core and distributed data ([Dask](http://dask.pydata.org)). Our goal for data ingestion will be to load specific Landsat data of interest into one of these higher-level containers.
# 
# [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/docs/overview/about) is one of several providers of [Landsat data](https://planetarycomputer.microsoft.com/dataset/group/landsat). We are using it together with [pystac-client](https://pystac-client.readthedocs.io/en/stable/index.html) and [odc-stac](https://odc-stac.readthedocs.io/en/latest/index.html) because together they provide a convenient Python API for searching and loading data by specific criteria such as spatial area, datetime, Landsat mission, and cloud coverage.
# 
# Earth science datasets are often stored on remote servers and may be too large to download locally. Therefore, in this cookbook, we will focus primarily on ingestion approaches that load small portions of data from a remote source, as needed. However, the approach for your own work will depend not only on data size and location but also on the intended analysis, so in a follow-up notebook, you will see an alternative approach for generalized data access and management.

# ## Prerequisites
# 
# | Concepts | Importance | Notes |
# | --- | --- | --- |
# | [Intro to Landsat](./0.0_Intro_Landsat.ipynb) | Necessary | Background |
# | [About the Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/docs/overview/about) | Helpful | Background |
# | [pystac-client Usage](https://pystac-client.readthedocs.io/en/stable/usage.html) | Helpful | Consult as needed |
# | [odc.stac.load Reference](https://odc-stac.readthedocs.io/en/latest/_api/odc.stac.load.html) | Helpful | Consult as needed |
# | [xarray](https://foundations.projectpythia.org/core/xarray.html) | Necessary |  |
# | [Intro to Dask Array](https://docs.dask.org/en/stable/array.html) | Helpful | |
# | [Panel Getting Started Guide](https://panel.holoviz.org/getting_started/build_app.html) | Helpful | |
# 
# - **Time to learn**: 10 minutes

# ## Imports

# In[ ]:


import odc.stac
import pandas as pd
import planetary_computer
import pystac_client
import xarray as xr
from pystac.extensions.eo import EOExtension as eo

# Viz
import hvplot.xarray
import panel as pn

pn.extension()


# ## Open and read the root of the STAC catalog

# In[ ]:


catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
catalog.title


# Microsoft Planetary Computer exposes public STAC metadata, but the actual data assets are in private Azure Blob Storage containers and require authentication. `pystac-client` provides a `modifier` keyword, and passing `planetary_computer.sign_inplace` to it signs each item as it is returned. Otherwise, we'd get an error when trying to access the assets.

# ## Search for Landsat Data

# Let's say that an analysis we want to run requires Landsat data over a specific region and from a specific time period. We can use our catalog to search for assets that fit our search criteria.

# First, let's find the name of the Landsat dataset. [This page](https://planetarycomputer.microsoft.com/catalog) is a nice resource for browsing the available collections, but we can also just search the catalog for 'landsat':

# In[ ]:


all_collections = [i.id for i in catalog.get_collections()]
landsat_collections = [
    collection for collection in all_collections if "landsat" in collection
]
landsat_collections


# We'll use the `landsat-c2-l2` dataset, which stands for Collection 2 Level-2. It contains data from several Landsat missions and, unlike Level-1 (`landsat-c2-l1`), provides atmospherically corrected surface reflectance and surface temperature products. Microsoft Planetary Computer has descriptions of [Level 1](https://planetarycomputer.microsoft.com/dataset/landsat-c2-l1) and [Level 2](https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2), but a direct and succinct comparison can be found in [this community post](https://gis.stackexchange.com/questions/439767/landsat-collections), and the information can be verified with [USGS](https://www.usgs.gov/landsat-missions/landsat-collection-2).

# Now, let's set our search parameters. You may already know the bounding box (region/area of interest) coordinates, but if you don't, there are many useful tools like [bboxfinder.com](http://bboxfinder.com/) that can help.

# In[ ]:


bbox = [-118.89, 38.54, -118.57, 38.84]  # Region over a lake in Nevada, USA
datetime = "2017-06-01/2017-09-30"  # Summer months of 2017
collection = "landsat-c2-l2"
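

# Before searching, it can help to confirm that the bounding box is ordered as `[min_lon, min_lat, max_lon, max_lat]` and covers roughly the area you expect. The cell below is a minimal sketch of such a check: the helper name `describe_bbox` is hypothetical (it is not part of any library used here), the size estimate uses a simple approximation of about 111 km per degree of latitude, and boxes that cross the antimeridian are deliberately not handled.

# In[ ]:


import math


def describe_bbox(bounds):
    """Validate a [min_lon, min_lat, max_lon, max_lat] box and estimate its size in km."""
    min_lon, min_lat, max_lon, max_lat = bounds
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError("expected -180 <= min_lon < max_lon <= 180")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError("expected -90 <= min_lat < max_lat <= 90")
    km_per_degree = 111.32  # approximate length of one degree of latitude
    mid_lat = math.radians((min_lat + max_lat) / 2)
    # Degrees of longitude shrink with the cosine of the latitude
    width_km = (max_lon - min_lon) * km_per_degree * math.cos(mid_lat)
    height_km = (max_lat - min_lat) * km_per_degree
    return width_km, height_km


# Same coordinates as the `bbox` defined above, repeated so this cell runs on its own
width_km, height_km = describe_bbox([-118.89, 38.54, -118.57, 38.84])
print(f"Approximate extent: {width_km:.1f} km wide by {height_km:.1f} km tall")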


# We can also specify other parameters in the query, such as a specific Landsat mission and the maximum percent of cloud cover:

# In[ ]:


platform = "landsat-8"
cloudy_less_than = 1  # percent


# Now we run the search and list the results:

# In[ ]:


search = catalog.search(
    collections=[collection],
    bbox=bbox,
    datetime=datetime,
    query={"eo:cloud_cover": {"lt": cloudy_less_than}, "platform": {"in": [platform]}},
)
items = search.item_collection()
print(f"Returned {len(items)} Items:")
item_id = {(i, item.id): i for i, item in enumerate(items)}
item_id


# It looks like there were three image stacks taken by Landsat 8 over this spatial region during the summer months of 2017 that have less than 1 percent cloud cover.
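
# To make the `query` argument more concrete, the cell below applies the same two conditions (`lt` on `eo:cloud_cover` and `in` on `platform`) to a few hand-written property dictionaries. This is only an illustrative sketch: the records, the `OPERATORS` table, and the `matches` helper are made up for this example, only the two operators used in this notebook are covered, and the real filtering is performed by the STAC API rather than in local Python.

# In[ ]:


# Hypothetical item properties; real values come from the STAC API
example_items = [
    {"id": "scene-a", "eo:cloud_cover": 0.2, "platform": "landsat-8"},
    {"id": "scene-b", "eo:cloud_cover": 12.5, "platform": "landsat-8"},
    {"id": "scene-c", "eo:cloud_cover": 0.5, "platform": "landsat-7"},
]

# Only the two operators used in this notebook are sketched here
OPERATORS = {
    "lt": lambda value, reference: value < reference,
    "in": lambda value, reference: value in reference,
}


def matches(properties, query):
    """Return True when every condition in `query` holds for `properties`."""
    for field, conditions in query.items():
        if field not in properties:
            return False
        for operator, reference in conditions.items():
            if not OPERATORS[operator](properties[field], reference):
                return False
    return True


example_query = {"eo:cloud_cover": {"lt": 1}, "platform": {"in": ["landsat-8"]}}
matching_ids = [p["id"] for p in example_items if matches(p, example_query)]
print(matching_ids)
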

# ## Preview Results and Select a Dataset

# Before loading one of the available image stacks, it would be useful to get a visual check of the results. Many datasets have a rendered preview or thumbnail image that can be accessed without having to load the full resolution data.
# 
# We can create a simple interactive application using the [Panel](https://panel.holoviz.org/index.html) library to access and display rendered PNG previews of our search results. Note that these pre-rendered images are of large tiles that span beyond our bounding box of interest. In the next steps, we will only be loading in a small area around the lake.

# In[ ]:


item_sel = pn.widgets.Select(value=1, options=item_id, name="item")

def get_preview(i):
    return pn.panel(items[i].assets["rendered_preview"].href, height=300)


pn.Row(item_sel, pn.bind(get_preview, item_sel))


# In[ ]:


selected_item = items[1]
selected_item


# ## Access the Data

# Now that we have selected a dataset from our catalog, we can proceed to access the data. We want to be very selective about the data that we read and when we read it because the amount of downloaded data can quickly get out of hand. Therefore, let's select only a subset of images.
# 
# First, we'll preview the different image assets (or [Bands](https://github.com/stac-extensions/eo)) available in the Landsat item.

# In[ ]:


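assets = []
for asset in selected_item.assets.values():
    try:
        assets.append(asset.extra_fields["eo:bands"][0])
    except (KeyError, IndexError):
        # Skip assets that carry no band metadata (e.g. previews or metadata files)
        continue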

cols_ordered = [
    "common_name",
    "description",
    "name",
    "center_wavelength",
    "full_width_half_max",
]
bands = pd.DataFrame.from_dict(assets)[cols_ordered]
bands


# Then we will select a few bands (images) of interest:

# In[ ]:


bands_of_interest = ["red", "green", "blue"]


# Finally, we lazily load the selected data. We will use the `odc-stac` package (imported as `odc.stac`), which allows us to load only a specific region of interest (bounding box or 'bbox') and specific bands (images) of interest. We will also use the `chunks` argument to load the data as Dask arrays; this loads the metadata now and delays loading the pixel values until we actually use the data, or until we force the data to be loaded by using `.compute()`.

# In[ ]:


ds = odc.stac.stac_load(
    [selected_item],
    bands=bands_of_interest,
    bbox=bbox,
    chunks={},  # <-- use Dask
).isel(time=0)
ds


# Let's combine the bands of the dataset into a single DataArray that has the band names as coordinates of a new 'band' dimension, and also call `.compute()` to finally load the data.

# In[ ]:


da = ds.to_array(dim="band").compute()
da


# ## Visualize the data

# Often, data ingestion involves quickly visualizing your raw data to check that things are proceeding as expected. As we have created an array with red, green, and blue bands, we can quickly display a natural color image of the lake using the `.plot.imshow()` function of `xarray`. We'll use the `robust=True` argument because the data values are outside the range of typical RGB images.

# In[ ]:


da.plot.imshow(robust=True, size=3)
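

# To give a feel for what `robust=True` does, the cell below sketches a percentile-based contrast stretch in plain Python. It assumes, based on the xarray plotting documentation, that the color range is taken from the 2nd and 98th percentiles instead of the extreme values; the helper names `percentile` and `robust_stretch` and the synthetic pixel values are made up for illustration, and this is not xarray's actual implementation.

# In[ ]:


def percentile(values, q):
    """Linearly interpolated percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def robust_stretch(values, low_q=2, high_q=98):
    """Rescale values to 0..1 between two percentiles, clipping anything outside."""
    vmin = percentile(values, low_q)
    vmax = percentile(values, high_q)
    return [min(max((v - vmin) / (vmax - vmin), 0.0), 1.0) for v in values]


# Synthetic pixel values: an even ramp plus one very bright outlier
raw_values = list(range(7000, 12000, 50)) + [65535]
stretched = robust_stretch(raw_values)

# The outlier is clipped to 1.0 instead of compressing the rest of the range
print(min(stretched), max(stretched))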


# Now, let's use `hvplot` to provide an interactive visualization of the individual bands in our array.

# In[ ]:


ds


# In[ ]:


da.hvplot.image(x="x", y="y", cmap="viridis", aspect=1)


# Let's plot the bands as separate columns by specifying a dimension to expand with `col='band'`. We can also set `rasterize=True` to use [Datashader](https://datashader.org/) (another HoloViz tool) to render large data into a 2D histogram, where every array cell counts the data points falling into that pixel, as set by the resolution of your screen. This is especially important for large and high-resolution images that would otherwise cause issues when attempting to render in a browser.

# In[ ]:


da.hvplot.image(
    x="x",
    y="y",
    col="band",
    cmap="viridis",
    xaxis=False,
    yaxis=False,
    colorbar=False,
    rasterize=True,
)


# Select the zoom tool and zoom in on one of the plots to see that all the images are automatically linked!

# ## Retain Attributes

# When working with many image arrays, it's critical to retain the data properties as xarray attributes:

# In[ ]:


da.attrs = selected_item.properties
da


# Notice that you can now expand the `Attributes: ` dropdown to see the properties of this data.

# ## Set the `crs` attribute

# As the data is in 'meter' units from a reference point, we can plot in the commonly used longitude and latitude coordinates with `.hvplot(geo=True)` if our array has a valid coordinate reference system (CRS) attribute. This value is provided by Microsoft Planetary Computer as the `proj:epsg` property, so we just need to copy it to a new attribute `crs` so that hvPlot can automatically find it, without us having to specify anything further in our plotting code.
# 
# Note, this CRS is referenced by an EPSG code that can be accessed from the metadata of our selected catalog search result. We can see more about this dataset's specific code at [EPSG.io/32611](https://epsg.io/32611). You can also read more about EPSG codes in general in this [Coordinate Reference Systems: EPSG codes](https://pygis.io/docs/d_understand_crs_codes.html#epsg-codes) online book chapter. 

# In[ ]:


da.attrs["crs"] = f"epsg:{selected_item.properties['proj:epsg']}"
da.attrs["crs"]
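

# As a cross-check on the EPSG code, WGS 84 / UTM codes follow a simple pattern: 32600 plus the zone number in the northern hemisphere and 32700 plus the zone number in the southern hemisphere, with zones that are 6 degrees of longitude wide. The cell below is a rough sketch that derives the expected code from the center of our bounding box; the helper name `utm_epsg_from_lonlat` is hypothetical, and the special-case zones around Norway and Svalbard are ignored.

# In[ ]:


import math


def utm_epsg_from_lonlat(lon, lat):
    """Return the WGS 84 / UTM EPSG code for a longitude/latitude in degrees."""
    zone = int(math.floor((lon + 180) / 6)) + 1
    zone = min(zone, 60)  # lon == 180 would otherwise give zone 61
    base = 32600 if lat >= 0 else 32700
    return base + zone


# Center of the bounding box used in the search above
center_lon = (-118.89 + -118.57) / 2
center_lat = (38.54 + 38.84) / 2
expected_epsg = utm_epsg_from_lonlat(center_lon, center_lat)
print(expected_epsg)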


# Now we can use `.hvplot(geo=True)` to plot in longitude and latitude coordinates. Informing `hvPlot` that this is geographic data also allows us to overlay data on aligned geographic tiles using the `tiles` parameter.

# In[ ]:


da.hvplot.image(
    x="x",
    y="y",
    cmap="viridis",
    geo=True,
    alpha=0.9,
    tiles="ESRI",
    xlabel="Longitude",
    ylabel="Latitude",
    colorbar=False,
    aspect=1,
)


# ___

# ## Summary
# The data access approach should adapt to features of the data and your intended analysis. As Landsat data is large and multidimensional, a good approach is to use [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/docs/overview/about), [pystac-client](https://pystac-client.readthedocs.io/en/stable/index.html), and [odc-stac](https://odc-stac.readthedocs.io/en/latest/index.html) together for searching the metadata catalog and lazily loading specific data chunks. Once you have accessed data, visualize it with hvPlot to ensure that it matches your expectations.
# 
# ### What's next?
# Before we proceed to workflow examples, we can explore an alternate way of accessing data using generalized tooling.

# ## Resources and References
# - Authored by Demetris Roumis circa Jan, 2023
# - Guidance for parts of this notebook was provided by Microsoft in ['Reading Data from the STAC API'](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/)
# - The image used in the banner is from an announcement about PySTAC from Azavea
