Non-linear dimension reduction and clustering are powerful tools to quickly see where the most significant differences between spectra lie and what is driving those differences. We propose here to use UMAP as the dimension reduction algorithm and to plot the 2D representation for different scenarios. We will start exactly as we did in the other notebook (data_analysis_with_pypam), so we can make sure we have the HMB data downloaded.
First we need to install all the packages required to execute this notebook. You don't need to do this if you're using mybinder.
!pip install pvlib
!pip install lifewatch-pypam==0.3.2
!pip install minio
We will download some processed HMB data stored in the cloud to give some examples of how it can be used. These data will be downloaded into this JupyterLab space, and you will be able to find them under the folder you specify below, organized by station. Please change this line depending on where you want to store the data.
We first start importing the packages we need for this part of the notebook
# Import the necessary packages
import minio
import os
import pathlib
from datetime import datetime, timedelta
import re
local_path = '../data'
max_days = 10
start_date_time = '2021-01-01'
end_date_time = '2021-02-01'
# Update end_date_time if the defined temporal range exceeds max_days
time_delta = datetime.fromisoformat(end_date_time) - datetime.fromisoformat(start_date_time)
if time_delta.days > max_days:
    end_date_time = str(datetime.fromisoformat(start_date_time) + timedelta(days=max_days))
    print(f'end_date_time updated to {end_date_time}')
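The clamping logic above can be exercised on its own as a sanity check: a 31-day request with `max_days = 10` should collapse to a 10-day window. This is a small standalone sketch of the same logic (the helper name `clamp_end` is ours, not part of the notebook's workflow):

```python
from datetime import datetime, timedelta

def clamp_end(start: str, end: str, max_days: int) -> str:
    """Return `end`, shortened so the span never exceeds `max_days` days."""
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    if (end_dt - start_dt).days > max_days:
        return str(start_dt + timedelta(days=max_days))
    return end

print(clamp_end('2021-01-01', '2021-02-01', 10))  # → 2021-01-11 00:00:00
```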
We will then define a function to download data ONLY from the period we are interested in, which can then be reused for different stations (so we don't need to write as much!)
# First we create a function to download data
def download_data_station(station_name,
                          client_obj,
                          bucket_str,
                          prefix_str,
                          data_path,
                          name_format,
                          start_datetime,
                          end_datetime):
    start_datetime_obj = datetime.fromisoformat(start_datetime)
    end_datetime_obj = datetime.fromisoformat(end_datetime)
    station_folder = pathlib.Path(data_path).joinpath(station_name)
    if not station_folder.exists():
        os.mkdir(station_folder)
    objects = list(client_obj.list_objects(bucket_str, prefix=prefix_str))
    ct = 0
    for obj in objects:
        object_name = obj.object_name
        path_name = pathlib.Path(object_name).name
        # Only netCDF files with a date inside the selected period are downloaded
        in_period = False
        if (not path_name.startswith('.')) and path_name.endswith('.nc'):
            match = re.findall(r"_(\d+)", path_name)[-1]
            file_date = datetime.strptime(match, name_format)
            in_period = (file_date >= start_datetime_obj) and (file_date <= end_datetime_obj)
        if in_period:
            download_path = str(station_folder.joinpath(path_name))
            if os.path.isfile(download_path):
                print('Already downloaded: ' + download_path)
            else:
                print('Download ' + str(ct) + ' of ' + str(len(objects)) + ': ' + download_path)
                # Use the client and bucket passed as arguments, not the globals
                object_data = client_obj.get_object(bucket_str, object_name)
                with open(download_path, 'wb') as file_data:
                    for data in object_data:
                        file_data.write(data)
        else:
            print('Ignored, out of selected period or not a netCDF file ' + path_name)
        ct += 1
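The date filtering above hinges on pulling the last `_<digits>` group out of the file name and parsing it with the station's `name_format`. A quick check on a hypothetical file name (the name is made up for illustration):

```python
import re
from datetime import datetime

path_name = 'MARS_20210115.nc'  # hypothetical file name
match = re.findall(r"_(\d+)", path_name)[-1]  # last run of digits after an underscore
file_date = datetime.strptime(match, '%Y%m%d')
print(file_date)  # → 2021-01-15 00:00:00
```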
Then we will download the data for two stations. For this workshop we have chosen MARS and NRS11, but feel free to change the station names here if you wish to play around with other stations from the same providers.
Let's start with MARS. We first need to describe where the data are located
# Set up the download for MARS data
client = minio.Minio( "s3.us-west-2.amazonaws.com", secure=False)
bucket = 'pacific-sound-spectra'
prefix = '2021/'
station = 'MARS'
name_format = '%Y%m%d'
download_data_station(station, client, bucket, prefix, local_path, name_format=name_format, start_datetime=start_date_time, end_datetime=end_date_time)
Then we move to the NRS11 data. We repeat the same process, just with different parameters. These are the daily millidecade band data available from SanctSound.
# Set up the download for NRS11 data
client = minio.Minio('storage.googleapis.com')
bucket = 'noaa-passive-bioacoustic'
station = 'NRS11'
prefix = 'soundcoop/%s/' % station
name_format = '%Y%m%d'
download_data_station(station, client, bucket, prefix, local_path, name_format=name_format, start_datetime=start_date_time, end_datetime=end_date_time)
First we set some style guidelines for plotting
import matplotlib.pyplot as plt
%matplotlib inline
# Clear out default notebook settings
%config InlineBackend.print_figure_kwargs = {'bbox_inches':None}
# Set figure size and layout
plt.rcParams["figure.figsize"] = [12.00, 5.00]
plt.rcParams["figure.autolayout"] = True
# Import pypam modules
import pypam.utils
import pypam.plots
# Import other tools
import pathlib
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
RESAMPLE_RESOLUTION = '1h' # we could also set it to 1d for daily resolution for example
min_freq = 10
max_freq = 10000
# Only get data from 2021
def load_data_from_station_year(station, year, data_vars):
    deployment_path = pathlib.Path(local_path).joinpath(station)
    print('loading station %s...' % station)
    aggregated_ds = pypam.utils.join_all_ds_output_deployment(deployment_path, data_vars=data_vars,
                                                              datetime_coord='time',
                                                              join_only_if_contains='_%s' % year, load=True,
                                                              parallel=False, freq_band=[min_freq, max_freq],
                                                              freq_coord='frequency',
                                                              time_resample=RESAMPLE_RESOLUTION,
                                                              drop_variables=['psd_image_colormap',
                                                                              'psd_image',
                                                                              'percentile_image',
                                                                              'percentile_image_colormap'])
    return aggregated_ds  # an xarray dataset
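pypam handles the joining and resampling internally; conceptually, the `time_resample` step aggregates the PSD levels into one value per time bin, much like a pandas resample over the time axis (the exact aggregation pypam applies may differ). A minimal pandas sketch on synthetic data (column names and values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute PSD levels for two frequency bins (values are made up)
idx = pd.date_range('2021-01-01', periods=120, freq='1min')
psd = pd.DataFrame({'63 Hz': np.linspace(70, 75, 120),
                    '125 Hz': np.linspace(80, 85, 120)}, index=idx)

hourly = psd.resample('1h').mean()  # one row per hour
print(hourly.shape)  # → (2, 2)
```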
# Load the data for each station
# Quality flag is only available for NRS11, not for MARS. We can decide to ignore it or load it by removing it from the data_vars list
ds_collection = {}
ds_collection['MARS'] = load_data_from_station_year('MARS', 2021, data_vars=['psd'])
ds_collection['NRS11'] = load_data_from_station_year('NRS11', 2021, data_vars=['psd', 'quality_flag'])
For the dimension reduction we need an extra package, umap-learn (PCA from sklearn would also work, but UMAP better captures non-linear relationships).
You don't need to run this cell on mybinder, as this has been installed for you.
!pip install umap-learn
import umap
import pandas as pd
import seaborn as sns
# Select the same frequency range (up to 2 kHz) for both stations
ds1 = pypam.utils.select_frequency_range(ds_collection['NRS11'],
min_freq=min_freq,
max_freq=2000, freq_coord='frequency')
ds2 = pypam.utils.select_frequency_range(ds_collection['MARS'],
min_freq=min_freq,
max_freq=2000, freq_coord='frequency')
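`pypam.utils.select_frequency_range` trims the dataset to the requested band; with plain xarray, the same cut could be sketched as a label-based slice along the frequency coordinate (this is a sketch of the idea, not pypam's implementation):

```python
import numpy as np
import xarray as xr

# Synthetic dataset: 4 time steps x 6 frequency bins (values are made up)
ds = xr.Dataset({'psd': (('time', 'frequency'), np.random.rand(4, 6))},
                coords={'time': np.arange(4),
                        'frequency': [10, 100, 500, 2000, 5000, 10000]})

band = ds.sel(frequency=slice(10, 2000))  # keep 10 Hz to 2 kHz, inclusive
print(band.frequency.values)  # the bins from 10 to 2000 Hz
```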
We first need to convert the data from xarray to pandas format, as this is what the UMAP algorithm takes as input.
df1 = ds1['psd'].to_pandas()
frequency_columns = df1.columns
df1['station'] = 'NRS11'
df2 = ds2['psd'].to_pandas()
df2.columns = frequency_columns
df2['station'] = 'MARS'
df = pd.concat([df1, df2])
df['embedding'] = df[frequency_columns].values.tolist()
df['month'] = df.index.month
df['hour'] = df.index.hour
df.to_pickle('df_example_soundcoop.pkl')
df = df.dropna()
df1 = df1.dropna()
df2 = df2.dropna()
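The `to_pickle` call above keeps the full dataframe on disk so the UMAP step can be rerun later without re-reading the netCDF files; `pd.read_pickle` restores it intact. A tiny roundtrip sketch (the file path and column names here are made up):

```python
import os
import tempfile
import pandas as pd

df_small = pd.DataFrame({'psd_63': [70.0, 71.0], 'station': ['MARS', 'NRS11']})
path = os.path.join(tempfile.mkdtemp(), 'df_example.pkl')
df_small.to_pickle(path)           # serialize to disk
restored = pd.read_pickle(path)    # load it back unchanged
print(restored.equals(df_small))   # → True
```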
Then we perform the dimension reduction
umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)
umap_box.fit(df[frequency_columns].values)
embedding = umap_box.transform(df[frequency_columns])
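As noted above, PCA would also work as a (linear) baseline to compare against the UMAP embedding. A self-contained numpy sketch of a 2-component PCA via SVD, if you want to try that comparison (this is generic PCA, not a pypam or umap API):

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal axes in Vt
    return Xc @ Vt[:2].T                               # scores on PC1 and PC2

X = np.random.default_rng(0).normal(size=(100, 30))
emb = pca_2d(X)
print(emb.shape)  # → (100, 2)
```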
We plot the results, coloring by station
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df['station'],
legend=True)
ax.set_facecolor('white')
plt.show()
We can perform another dimension reduction, this time only for one station (NRS11)
umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)
umap_box.fit(df1[frequency_columns].values)
embedding = umap_box.transform(df1[frequency_columns])
And we can plot it coloring by hour, for example
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df1.index.hour,
legend=True)
ax.set_facecolor('white')
plt.show()
Or by month
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df1.index.month,
legend=True)
ax.set_facecolor('white')
plt.show()
We can also run a supervised dimension reduction on the MARS data, passing the month as a label to guide the embedding
# Supervised UMAP: passing the month as a target (y) guides the embedding
umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)
umap_box.fit(df2[frequency_columns].values, y=df2.index.month)
embedding = umap_box.transform(df2[frequency_columns])
ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
s=2, alpha=0.9, hue=df2.index.month,
legend=True)
ax.set_facecolor('white')
plt.show()