#!/usr/bin/env python
# coding: utf-8

# ## Accessing US Census data with the Planetary Computer STAC API
#
# The [US Census](https://planetarycomputer.microsoft.com/dataset/us-census) collection provides information on population, demographics, and administrative boundaries at various levels of cartographic aggregation for the United States. It consists of many tabular datasets, one for each level of cartographic aggregation, each stored in [Apache Parquet](https://parquet.apache.org/) format. In this notebook, we'll use [geopandas](https://geopandas.org/) and dask-geopandas to read the files, which preserves the `geometry` column with administrative boundaries.

# In[1]:

import geopandas
import dask_geopandas
import contextily as ctx
import seaborn as sns
import planetary_computer
import pystac_client

# ### Data access
#
# The datasets hosted by the Planetary Computer are available from [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/). We'll use [pystac-client](https://pystac-client.readthedocs.io/) to search the Planetary Computer's [STAC API](https://planetarycomputer.microsoft.com/api/stac/v1/docs) for the subset of the data that we care about, and then we'll load the data directly from Azure Blob Storage. We'll specify a `modifier` so that we can access the data stored in the Planetary Computer's private Blob Storage containers. See [Reading from the STAC API](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/) and [Using tokens for data access](https://planetarycomputer.microsoft.com/docs/concepts/sas/) for more.

# In[2]:

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# Each item in the `us-census` collection represents a single table, aggregating the census data at some level (Congressional district, state, etc.).

# In[3]:

search = catalog.search(collections=["us-census"])
items = {item.id: item for item in search.items()}
list(items)

# ### Read Congressional districts
#
# The `2020-cb_2020_us_cd116_500k` dataset contains geometries for Congressional Districts for the 116th Congress.

# In[4]:

item = items["2020-cb_2020_us_cd116_500k"]
item

# Each of the items contains a single asset, `data`, with the URL to the Parquet dataset and all the information necessary to load it.

# In[5]:

asset = item.assets["data"]
asset

# This is an [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) URL, which is used by libraries like pandas, geopandas, and Dask to work with files from remote storage like Azure Blob Storage. We already signed this asset to include a `credential`. If you use an unsigned asset, you'll see a `ClientAuthenticationError` when trying to open the dataset.

# In[6]:

df = geopandas.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
)
df.head()

# We'll select a single district (Maryland's 2nd) and plot it.

# In[7]:

ax = (
    df[df.GEOID == "2402"]
    .to_crs(epsg=3857)
    .plot(figsize=(10, 10), alpha=0.5, edgecolor="k")
)
ax.set_title(
    "2nd Congressional District: Maryland",
    fontdict={"fontsize": "20", "fontweight": "2"},
)
ctx.add_basemap(ax, source=ctx.providers.Esri.NatGeoWorldMap)
ax.set_axis_off()

# ### Read Census Block data
#
# Census blocks are the smallest cartographic unit available from the Census Bureau. There are over 8 million census blocks.
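# Before loading anything that large, it can help to confirm that the block-level data is stored as a partitioned Parquet dataset (many part files rather than a single file). The sketch below is illustrative rather than part of the original workflow: it reuses the signed `storage_options` from the district asset above, relies on `fsspec` dispatching `abfs` to the `adlfs` package, and assumes the container and path match the URL used in the next cell.

# In[ ]:

import fsspec

# Build an Azure Blob filesystem from the signed storage options (account name + SAS
# credential), then list the first few part files of the census-blocks geometry dataset.
fs = fsspec.filesystem("abfs", **asset.extra_fields["table:storage_options"])
fs.ls("us-census/2020/census_blocks_geo.parquet")[:5]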
# In[8]:

geo = dask_geopandas.read_parquet(
    "abfs://us-census/2020/census_blocks_geo.parquet",
    storage_options=asset.extra_fields["table:storage_options"],
    calculate_divisions=True,
)
geo

# In[9]:

import dask.dataframe

pop = dask.dataframe.read_parquet(
    "abfs://us-census/2020/census_blocks_population.parquet",
    storage_options=asset.extra_fields["table:storage_options"],
    calculate_divisions=True,
)
pop

# In[10]:

# Partition 39 holds Rhode Island's census blocks; join the geometries with the
# total-population column (P0010001) and keep only blocks with more than 10 people.
ri = (
    geo.get_partition(39)
    .compute()
    .join(pop[["P0010001"]].get_partition(39).compute(), how="inner")
)
ri = ri[ri.P0010001 > 10]

# In[11]:

ax = ri.to_crs(epsg=3857).plot(figsize=(10, 10), alpha=0.5, edgecolor="k")
ax.set_title(
    "Census Blocks with Population Greater than 10: Providence County, RI",
    fontdict={"fontsize": "20", "fontweight": "2"},
)
ctx.add_basemap(ax, source=ctx.providers.Esri.NatGeoWorldMap)
ax.set_axis_off()

# Having filtered out the blocks with little or no reported population, we can plot the distribution of people per census block.

# In[12]:

sns.displot(ri.P0010001, log_scale=True);

# Or we can plot the relationship between the population and the size of the census block.

# In[13]:

import numpy as np

sns.jointplot(x=ri.ALAND, y=np.log(ri.P0010001), marker=".");

# ### Next Steps
#
# Now that you've seen an introduction to working with US Census data from the Planetary Computer, learn more with
#
# * The [US Census data tutorial](../../tutorials/census-data.ipynb), which includes examples for accessing data at each available level of cartographic aggregation
# * The [Reading tabular data quickstart](../../quickstarts/reading-tabular-data.ipynb), which introduces how to use tabular data with the Planetary Computer
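# As a small taste of that tutorial, the same pattern used above for Congressional districts works for the other aggregation levels in the collection. The sketch below assumes a state-level item id of `2020-cb_2020_us_state_500k`; the exact ids available can be confirmed with `list(items)` from earlier in this notebook.

# In[ ]:

# Hypothetical example: load state boundaries the same way we loaded the districts.
# The item id here is an assumption; check it against `list(items)` above.
state_item = items["2020-cb_2020_us_state_500k"]
state_asset = state_item.assets["data"]
states = geopandas.read_parquet(
    state_asset.href,
    storage_options=state_asset.extra_fields["table:storage_options"],
)
states.head()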