The Project Eclipse Network is a low-cost air quality sensing network for cities and a research project led by the Urban Innovation Group at Microsoft Research.
Project Eclipse data are distributed as a set of Parquet files, one per week. We can use the STAC API to search for the file covering a specific week.
import pystac_client
import planetary_computer
# Open the Planetary Computer STAC API, signing asset URLs as they are accessed
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
search = catalog.search(collections=["eclipse"], datetime="2022-03-01")
items = search.item_collection()
print(f"Found {len(items)} item")
item = items[0]
item
Found 1 item
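A quick way to see what the item exposes is to loop over its assets (a small sketch; the Parquet data lives under the "data" key):
# List the item's assets; the Parquet file is exposed under the "data" key
for key, asset in item.assets.items():
    print(key, "->", asset.media_type)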
We'll load the Parquet file with pandas.
import geopandas
import pandas as pd

# The "data" asset's extra fields include the storage options that fsspec needs to read the file
asset = item.assets["data"]
df = pd.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df
| | City | DeviceId | LocationName | Latitude | Longitude | ReadingDateTimeUTC | PM25 | CalibratedPM25 | CalibratedO3 | CalibratedNO2 | CO | Temperature | Humidity | BatteryLevel | PercentBattery | CellSignal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Chicago | 2002 | State & Garfield (SB) | 41.794921 | -87.625857 | 2022-02-27 00:04:04 | 9.126071 | 10.79 | 17.44 | 10.49 | 0.105193 | -0.526352 | 59.703064 | 4.143906 | 91.634804 | -83.0 |
1 | Chicago | 2002 | State & Garfield (SB) | 41.794921 | -87.625857 | 2022-02-27 00:09:14 | 10.927937 | 11.83 | 15.63 | 8.46 | 0.114015 | -0.649185 | 60.223389 | 4.142812 | 91.634804 | -80.0 |
2 | Chicago | 2002 | State & Garfield (SB) | 41.794921 | -87.625857 | 2022-02-27 00:14:24 | 10.395282 | 11.38 | 18.29 | 5.25 | 0.096386 | -0.627823 | 60.884094 | 4.141094 | 91.634804 | -82.0 |
3 | Chicago | 2002 | State & Garfield (SB) | 41.794921 | -87.625857 | 2022-02-27 00:19:33 | 9.431242 | 10.85 | 15.11 | 8.53 | 0.119355 | -0.809402 | 61.984253 | 4.142969 | 91.385475 | -81.0 |
4 | Chicago | 2002 | State & Garfield (SB) | 41.794921 | -87.625857 | 2022-02-27 00:24:44 | 9.648221 | 11.05 | 13.05 | 9.92 | 0.125682 | -0.809402 | 62.377930 | 4.142344 | 91.385475 | -82.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
187062 | Chicago | 2214 | EPA COM ED Maintenance Bldg F | 41.751200 | -87.713490 | 2022-03-05 22:33:08 | 0.747107 | 8.14 | 14.76 | 13.74 | 0.285773 | 19.898682 | 53.863525 | 4.184062 | 89.785156 | -110.0 |
187063 | Chicago | 2214 | EPA COM ED Maintenance Bldg F | 41.751200 | -87.713490 | 2022-03-05 22:53:22 | 0.733524 | 12.95 | 14.37 | 20.18 | 1.288297 | 20.149689 | 54.324341 | 4.187969 | 89.785156 | -111.0 |
187064 | Chicago | 2214 | EPA COM ED Maintenance Bldg F | 41.751200 | -87.713490 | 2022-03-05 23:13:37 | 0.000000 | 12.95 | 15.69 | 17.94 | 2.921081 | 20.024185 | 53.430176 | 4.188281 | 89.785156 | -111.0 |
187065 | Chicago | 2214 | EPA COM ED Maintenance Bldg F | 41.751200 | -87.713490 | 2022-03-05 23:33:57 | 3.208157 | 14.79 | 14.00 | 20.47 | 0.784464 | 20.149689 | 53.668213 | 4.183750 | 89.785156 | -109.0 |
187066 | Chicago | 2214 | EPA COM ED Maintenance Bldg F | 41.751200 | -87.713490 | 2022-03-05 23:54:09 | 0.000000 | 8.09 | 13.44 | 16.40 | 0.331082 | 20.299225 | 53.356934 | 4.184219 | 89.683594 | -110.0 |
187067 rows × 16 columns
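Each file holds roughly one week of readings. As a quick sanity check (a sketch), we can look at the time span covered by this file:
# Confirm the time span covered by this week's file
print(df["ReadingDateTimeUTC"].min(), "to", df["ReadingDateTimeUTC"].max())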
There is a small amount of bad data in the Latitude / Longitude columns, so we keep only readings with plausible longitudes near Chicago (for this particular week, no rows are dropped).
df = df[(df.Longitude > -89) & (df.Longitude < -86)]
len(df)
187067
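To check whether another week needs this kind of cleaning, a minimal sketch is to look at the coordinate extremes before filtering:
# Inspect the coordinate range for implausible GPS fixes
print(df[["Latitude", "Longitude"]].agg(["min", "max"]))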
Individual measurement columns are regular pandas Series; for example, the calibrated ozone readings:
df.CalibratedO3
0         17.44
1         15.63
2         18.29
3         15.11
4         13.05
          ...
187062    14.76
187063    14.37
187064    15.69
187065    14.00
187066    13.44
Name: CalibratedO3, Length: 187067, dtype: float64
Resampling to hourly means gives a time series for each of the calibrated measurements, which we can plot.
ts = df.resample("h", on="ReadingDateTimeUTC")[
    ["CalibratedPM25", "Humidity", "CalibratedO3", "CalibratedNO2", "CO"]
].mean()
ts.plot(subplots=True, sharex=True, figsize=(12, 12));
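The hourly means are still fairly noisy. One option, sketched here with an arbitrarily chosen window, is to smooth the resampled series with a rolling mean:
# Smooth each series with a 24-hour rolling mean before plotting
ts.rolling(window=24, min_periods=1).mean().plot(subplots=True, sharex=True, figsize=(12, 12));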
The dataset contains many observations from each sensor. We can plot each sensor's location with geopandas by selecting just its first observation.
gdf = geopandas.GeoDataFrame(
    df, geometry=geopandas.points_from_xy(df.Longitude, df.Latitude), crs="epsg:4326"
)
gdf[["LocationName", "geometry"]].drop_duplicates(
    subset="LocationName"
).dropna().explore(marker_kwds=dict(radius=8))
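Since we have the full week of observations, we can also check how many readings each sensor contributed (a quick sketch):
# Count the number of readings per sensor location
df.groupby("LocationName").size().sort_values(ascending=False).head()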
Using a named aggregation, we can compute a summary per sensor and plot it on a map. Hover over the markers to see the average calibrated PM2.5 (CalibratedPM25) per sensor.
average_pm25 = geopandas.GeoDataFrame(
    gdf.groupby("LocationName").agg(
        mean_pm25=("CalibratedPM25", "mean"), geometry=("geometry", "first")
    ),
    crs="epsg:4326",
)
average_pm25.explore(
    marker_kwds=dict(radius=10),
)
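explore can also color the markers by a column rather than only reporting the value in the tooltip; a sketch (the colormap choice is arbitrary):
average_pm25.explore(
    column="mean_pm25",  # color each marker by its mean calibrated PM2.5
    cmap="YlOrRd",  # arbitrary colormap
    marker_kwds=dict(radius=10),
)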
So far we've looked at a single week. The eclipse collection itself also has a data asset, which points at the container holding every weekly file; we can list those files and read them all with Dask.
import adlfs
import dask.dataframe as dd

# The collection-level "data" asset points at the folder with all the weekly files
eclipse = catalog.get_collection("eclipse")
asset = planetary_computer.sign(eclipse.assets["data"])

# List every weekly Parquet file in the container
fs = adlfs.AzureBlobFileSystem(**asset.extra_fields["table:storage_options"])
files = [f"az://{x}" for x in fs.ls(asset.href)]

# Lazily read all of the files into a single Dask DataFrame
ddf = dd.read_parquet(
    files, storage_options=asset.extra_fields["table:storage_options"]
)
ddf
| | City | DeviceId | LocationName | Latitude | Longitude | ReadingDateTimeUTC | PM25 | CalibratedPM25 | CalibratedO3 | CalibratedNO2 | CO | Temperature | Humidity | BatteryLevel | PercentBattery | CellSignal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=89 | | | | | | | | | | | | | | | | |
| | string | int32 | string | float64 | float64 | datetime64[ns] | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
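The Dask DataFrame is lazy: nothing is read until we ask for a result. As a sketch of an out-of-core aggregation (note this downloads all 89 weekly files, so it takes a while):
# Compute the mean calibrated PM2.5 per sensor across the entire archive
ddf.groupby("LocationName").CalibratedPM25.mean().compute().sort_values(ascending=False).head()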