Non-linear dimension reduction and clustering are powerful tools to quickly see where the most significant differences between spectra lie and what is driving those differences. We propose here to use UMAP as the dimension reduction algorithm and to plot the 2D representation for different scenarios. We will start exactly as we did in the other notebook (data_analysis_with_pypam), so we can make sure we have the HMB data downloaded.
First we need to install all the packages required to execute this notebook. You don't need to do this if you're using mybinder.
!pip install pvlib
!pip install lifewatch-pypam==0.3.2
!pip install minio
We will download some processed HMB data stored in the cloud to give some examples of how it can be used. These data will be downloaded into this JupyterLab space, and you will be able to find them under the folder you specify below, organized by station. Please change this line depending on where you want to store the data.
We first start importing the packages we need for this part of the notebook
# Import the necessary packages
import minio
import os
import pathlib
from datetime import datetime, timedelta
import re
local_path = '../data'
max_days = 10
start_date_time = '2021-01-01'
end_date_time = '2021-02-01'
# Update end_date_time if the defined temporal range exceeds max_days
time_delta = datetime.fromisoformat(end_date_time) - datetime.fromisoformat(start_date_time)
if time_delta.days > max_days:
    end_date_time = str(datetime.fromisoformat(start_date_time) + timedelta(days=max_days))
    print(f'end_date_time updated to {end_date_time}')
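The clamping logic above can be exercised on its own as a sanity check: a 31-day request with `max_days = 10` should collapse to a 10-day window. This is a small standalone sketch of the same logic (the helper name `clamp_end` is ours, not part of the notebook's workflow):

```python
from datetime import datetime, timedelta

def clamp_end(start: str, end: str, max_days: int) -> str:
    """Return `end`, shortened so the span never exceeds `max_days` days."""
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    if (end_dt - start_dt).days > max_days:
        return str(start_dt + timedelta(days=max_days))
    return end

print(clamp_end('2021-01-01', '2021-02-01', 10))  # → 2021-01-11 00:00:00
```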
We will then define a function to download data ONLY from the period we are interested in, which can then be reused for different stations (so we don't need to write as much!)
# First we create a function to download data
def download_data_station(station_name,
                          client_obj,
                          bucket_str,
                          prefix_str,
                          data_path,
                          name_format,
                          start_datetime,
                          end_datetime):
    start_datetime_obj = datetime.fromisoformat(start_datetime)
    end_datetime_obj = datetime.fromisoformat(end_datetime)
    station_folder = pathlib.Path(data_path).joinpath(station_name)
    if not station_folder.exists():
        os.mkdir(station_folder)
    objects = list(client_obj.list_objects(bucket_str, prefix=prefix_str))
    ct = 0
    for obj in objects:
        object_name = obj.object_name
        path_name = pathlib.Path(object_name).name
        # Only netCDF files with a date inside the selected period are downloaded
        in_period = False
        if (not path_name.startswith('.')) and path_name.endswith('.nc'):
            match = re.findall(r"_(\d+)", path_name)[-1]
            file_date = datetime.strptime(match, name_format)
            in_period = (file_date >= start_datetime_obj) and (file_date <= end_datetime_obj)
        if in_period:
            download_path = str(station_folder.joinpath(path_name))
            if os.path.isfile(download_path):
                print('Already downloaded: ' + download_path)
            else:
                print('Download ' + str(ct) + ' of ' + str(len(objects)) + ': ' + download_path)
                # Use the client and bucket passed as arguments, not the globals
                object_data = client_obj.get_object(bucket_str, object_name)
                with open(download_path, 'wb') as file_data:
                    for data in object_data:
                        file_data.write(data)
        else:
            print('Ignored, out of selected period or not a netCDF file ' + path_name)
        ct += 1
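The date filtering above hinges on pulling the last `_<digits>` group out of the file name and parsing it with the station's `name_format`. A quick check on a hypothetical file name (the name is made up for illustration):

```python
import re
from datetime import datetime

path_name = 'MARS_20210115.nc'  # hypothetical file name
match = re.findall(r"_(\d+)", path_name)[-1]  # last run of digits after an underscore
file_date = datetime.strptime(match, '%Y%m%d')
print(file_date)  # → 2021-01-15 00:00:00
```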
Then we will download the data for two stations. For this workshop we have chosen MARS and NRS11, but feel free to change the station names here if you wish to play around with other stations from the same providers.
Let's start with MARS. We first need to describe where the data are located
# Set up the download for MARS data
client = minio.Minio( "s3.us-west-2.amazonaws.com", secure=False)
bucket = 'pacific-sound-spectra'
prefix = '2021/'
station = 'MARS'
name_format = '%Y%m%d'
download_data_station(station, client, bucket, prefix, local_path, name_format=name_format, start_datetime=start_date_time, end_datetime=end_date_time)
Then we move to the NRS11 data. We repeat the same process, just with different parameters. These are the daily millidecade band data available from SanctSound.
# Set up the download for NRS11 data
client = minio.Minio('storage.googleapis.com')
bucket = 'noaa-passive-bioacoustic'
station = 'NRS11'
prefix = 'soundcoop/%s/' % station
name_format = '%Y%m%d'
download_data_station(station, client, bucket, prefix, local_path, name_format=name_format, start_datetime=start_date_time, end_datetime=end_date_time)
First we set some style guidelines for plotting
import matplotlib.pyplot as plt
%matplotlib inline
# Clear out default notebook settings
%config InlineBackend.print_figure_kwargs = {'bbox_inches':None}
# Set figure size and layout
plt.rcParams["figure.figsize"] = [12.00, 5.00]
plt.rcParams["figure.autolayout"] = True
# Import pypam modules
import pypam.utils
import pypam.plots
# Import other tools
import pathlib
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
RESAMPLE_RESOLUTION = '1h' # we could also set it to 1d for daily resolution for example
min_freq = 10
max_freq = 10000
# Only get data from 2021
def load_data_from_station_year(station, year, data_vars):
    deployment_path = pathlib.Path(local_path).joinpath(station)
    print('loading station %s...' % station)
    aggregated_ds = pypam.utils.join_all_ds_output_deployment(deployment_path, data_vars=data_vars,
                                                              datetime_coord='time',
                                                              join_only_if_contains='_%s' % year, load=True,
                                                              parallel=False, freq_band=[min_freq, max_freq],
                                                              freq_coord='frequency',
                                                              time_resample=RESAMPLE_RESOLUTION,
                                                              drop_variables=['psd_image_colormap',
                                                                              'psd_image',
                                                                              'percentile_image',
                                                                              'percentile_image_colormap'])
    return aggregated_ds  # an xarray dataset
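pypam handles the joining and resampling internally; conceptually, the `time_resample` step aggregates the PSD levels into one value per time bin, much like a pandas resample over the time axis (the exact aggregation pypam applies may differ). A minimal pandas sketch on synthetic data (column names and values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute PSD levels for two frequency bins (values are made up)
idx = pd.date_range('2021-01-01', periods=120, freq='1min')
psd = pd.DataFrame({'63 Hz': np.linspace(70, 75, 120),
                    '125 Hz': np.linspace(80, 85, 120)}, index=idx)

hourly = psd.resample('1h').mean()  # one row per hour
print(hourly.shape)  # → (2, 2)
```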
# Load the data for each station
# Quality flag is only available for NRS11, not for MARS. We can decide to ignore it or load it by removing it from the data_vars list
ds_collection = {}
ds_collection['MARS'] = load_data_from_station_year('MARS', 2021, data_vars=['psd'])
ds_collection['NRS11'] = load_data_from_station_year('NRS11', 2021, data_vars=['psd', 'quality_flag'])
For the dimension reduction we need an extra package, umap-learn (PCA from sklearn would also work, but UMAP better captures non-linear relationships).
You don't need to run this cell on mybinder, as this has been installed for you.
!pip install umap-learn
import umap
import pandas as pd
import seaborn as sns
# Select the same frequency range (up to 2 kHz) for both stations
ds1 = pypam.utils.select_frequency_range(ds_collection['NRS11'],
min_freq=min_freq,
max_freq=2000, freq_coord='frequency')
ds2 = pypam.utils.select_frequency_range(ds_collection['MARS'],
min_freq=min_freq,
max_freq=2000, freq_coord='frequency')
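`pypam.utils.select_frequency_range` trims the dataset to the requested band; with plain xarray, the same cut could be sketched as a label-based slice along the frequency coordinate (this is a sketch of the idea, not pypam's implementation):

```python
import numpy as np
import xarray as xr

# Synthetic dataset: 4 time steps x 6 frequency bins (values are made up)
ds = xr.Dataset({'psd': (('time', 'frequency'), np.random.rand(4, 6))},
                coords={'time': np.arange(4),
                        'frequency': [10, 100, 500, 2000, 5000, 10000]})

band = ds.sel(frequency=slice(10, 2000))  # keep 10 Hz to 2 kHz, inclusive
print(band.frequency.values)  # the bins from 10 to 2000 Hz
```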
We first need to convert the data from xarray to pandas format, as this is what the UMAP algorithm takes as input.
df1 = ds1['psd'].to_pandas()
frequency_columns = df1.columns
df1['station'] = 'NRS11'
df2 = ds2['psd'].to_pandas()
df2.columns = frequency_columns
df2['station'] = 'MARS'
df = pd.concat([df1, df2])
df['embedding'] = df[frequency_columns].values.tolist()
df['month'] = df.index.month
df['hour'] = df.index.hour
df.to_pickle('df_example_soundcoop.pkl')
df = df.dropna()
df1 = df1.dropna()
df2 = df2.dropna()
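The `to_pickle` call above keeps the full dataframe on disk so the UMAP step can be rerun later without re-reading the netCDF files; `pd.read_pickle` restores it intact. A tiny roundtrip sketch (the file path and column names here are made up):

```python
import os
import tempfile
import pandas as pd

df_small = pd.DataFrame({'psd_63': [70.0, 71.0], 'station': ['MARS', 'NRS11']})
path = os.path.join(tempfile.mkdtemp(), 'df_example.pkl')
df_small.to_pickle(path)           # serialize to disk
restored = pd.read_pickle(path)    # load it back unchanged
print(restored.equals(df_small))   # → True
```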
Then we perform the dimension reduction
umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)
umap_box.fit(df[frequency_columns].values)
embedding = umap_box.transform(df[frequency_columns])
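As noted above, PCA would also work as a (linear) baseline to compare against the UMAP embedding. A self-contained numpy sketch of a 2-component PCA via SVD, if you want to try that comparison (this is generic PCA, not a pypam or umap API):

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal axes in Vt
    return Xc @ Vt[:2].T                               # scores on PC1 and PC2

X = np.random.default_rng(0).normal(size=(100, 30))
emb = pca_2d(X)
print(emb.shape)  # → (100, 2)
```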
We plot the results, coloring by station
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df['station'],
legend=True)
ax.set_facecolor('white')
plt.show()
We can perform another dimension reduction, this time only for one station (NRS11)
umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)
umap_box.fit(df1[frequency_columns].values)
embedding = umap_box.transform(df1[frequency_columns])
And we can plot it coloring by hour, for example
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df1.index.hour,
legend=True)
ax.set_facecolor('white')
plt.show()
Or by month
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df1.index.month,
legend=True)
ax.set_facecolor('white')
plt.show()
We can also run a supervised dimension reduction on the MARS data, passing the month as a label to guide the embedding
# Supervised UMAP: passing the month as a target (y) guides the embedding
umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)
umap_box.fit(df2[frequency_columns].values, y=df2.index.month)
embedding = umap_box.transform(df2[frequency_columns])
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df2.index.month,
legend=True)
ax.set_facecolor('white')
plt.show()