Pivot the row-based data in a STOQS database to fit into a column-based dataframe
This Notebook explores options raised by this GitHub Issue. We want to be able to efficiently consume large quantities of data from a STOQS database and have it organized for efficient data analysis and visualization using modern dataframe-oriented tools.
Executing this Notebook requires a personal STOQS server. It can be run from either a Docker installation or from a development Vagrant Virtual Machine.
Install and start the software as
detailed in the README. (Note that on MacOS you will need to modify settings in your docker-compose.yml
and .env
files — look for comments referencing 'HOST_UID'.)
Then, from your $STOQS_HOME/docker
directory start the Jupyter Notebook server - you can query from the remote database or from a copy that you've made to your local system:
Start the Jupyter Notebook server pointing to MBARI's master STOQS database server. (Note: firewall rules limit unprivileged access to such resources):
docker-compose exec \
-e DATABASE_URL=postgis://everyone:guest@kraken.shore.mbari.org:5432/stoqs \
stoqs stoqs/manage.py shell_plus --notebook
Restore the stoqs_canon_october2020
database from MBARI's server onto your local database and start the Jupyter Notebook server using the default DATABASE_URL, which should point to your local system. Also make sure that your Docker Desktop has at least 16 GB of RAM allocated to it:
cd $STOQS_HOME/docker
docker-compose exec stoqs createdb -U postgres stoqs_canon_october2020
curl -k https://stoqs.shore.mbari.org/media/pg_dumps/stoqs_canon_october2020.pg_dump | \
docker exec -i stoqs pg_restore -Fc -U postgres -d stoqs_canon_october2020
docker-compose exec stoqs stoqs/manage.py shell_plus --notebook
Following execution of the stoqs/manage.py shell_plus --notebook
command, a message is displayed giving a URL for you to use in a browser on your host, e.g.:
http://127.0.0.1:8888/?token=<a_token_generated_upon_server_start>
In the browser window opened to this URL navigate to this file (select_data_in_columns_for_data_science.ipynb
) and open it. You will then be able to execute the cells and modify the code to suit your needs.
The information in the output cells results from execution on a 2019 MacBook Pro with a 2.4 GHz 8-Core Intel Core i9 processor and 32 GB 2667 MHz DDR4 RAM, running Docker Desktop 3.1.0 with 4 CPUs and 16 GB of RAM allocated.
import os
import time
# Prevent SynchronousOnlyOperation exceptions
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
# Set do_plots to True for visualizations (do not commit with this setting)
do_plots = False
t_start = time.time()
# Use a recent database available at DATABASE_URL
db = 'stoqs_canon_october2020'
Collect all the sea_water_temperature and sea_water_salinity data into dictionaries keyed by platform name. This is to examine the landscape of data we are querying.
# To make sure we collect temperatures and salinities that are properly associated
# we will first find all the Platforms that have T & S and then from each Measurement
# from the Platform collect the temperatures and salinities into lists for plotting.
# Assume that Platforms that have sea_water_salinity also have sea_water_temperature.
platforms = (ActivityParameter.objects.using(db)
.filter(parameter__standard_name='sea_water_salinity')
.values_list('activity__platform__name', flat=True)
.distinct().order_by('activity__platform__name'))
temps = {}
salts = {}
for platform in platforms:
print(f"Collecting data for: {platform:23}", end=' ')
mps = (MeasuredParameter.objects.using(db)
.filter(measurement__instantpoint__activity__platform__name=platform))
temps[platform] = (mps.filter(parameter__standard_name='sea_water_temperature')
.values_list('datavalue', flat=True))
salts[platform] = (mps.filter(parameter__standard_name='sea_water_salinity')
.values_list('datavalue', flat=True))
print(f"#temps: {len(temps[platform]):6} #salts: {len(salts[platform]):6}", end='')
if len(temps[platform]) != len(salts[platform]):
print(' - not equal')
else:
print()
print('Done')
Collecting data for: dorado #temps: 169159 #salts: 169159 Collecting data for: M1_Mooring #temps: 6600 #salts: 6600 Collecting data for: makai #temps: 154793 #salts: 154797 - not equal Collecting data for: makai_ESPmv1_filtering #temps: 29857 #salts: 29857 Collecting data for: makai_Sipper #temps: 595 #salts: 595 Collecting data for: NPS_Glider_29 #temps: 33143 #salts: 33143 Collecting data for: NPS_Glider_34 #temps: 32759 #salts: 32759 Collecting data for: pontus #temps: 132945 #salts: 132947 - not equal Collecting data for: wg_Tiny_Glider #temps: 386 #salts: 386 Done
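The unequal counts for makai and pontus above show why pairing the two value lists by position would silently misalign data; pairing through the shared measurement context (as the SQL in the next approach does) avoids this. A minimal pandas sketch of the idea, with invented values:

```python
import pandas as pd

# Synthetic measurements keyed by a measurement id; salinity has two
# extra rows, as with makai above
temp = pd.Series([14.3, 14.1, 13.9], index=[1, 2, 3], name="temp")
salt = pd.Series([33.5, 33.4, 33.3, 33.2, 33.1], index=[1, 2, 3, 4, 5], name="salt")

# An inner join on the index keeps only measurements that have both values,
# whereas zipping by position would pair unrelated measurements
aligned = pd.concat([temp, salt], axis=1, join="inner")
print(aligned.shape)  # (3, 2)
```

Only the three measurements present in both series survive the join.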
if do_plots:
# Make a T/S plot of data from all the platforms
import pylab as plt
plt.rcParams['figure.figsize'] = (18, 6)
for platform in temps.keys():
if len(temps[platform]) == len(salts[platform]):
plt.scatter(salts[platform], temps[platform], s=1, label=platform)
plt.xlabel('Salinity')
plt.ylabel('Temperature (DegC)')
plt.legend();
A sample SQL statement was copied from the STOQS UI and then modified to select sea_water_temperature and sea_water_salinity from all platforms.
sql_multp = '''SELECT DISTINCT
stoqs_platform.name,
stoqs_instantpoint.timevalue,
stoqs_measurement.depth,
mp_salt.datavalue AS salt,
mp_temp.datavalue AS temp
FROM stoqs_measuredparameter
INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)
INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)
INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)
INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)
INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id
INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id
INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id
INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id
INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id
INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id
WHERE (p_salt.standard_name = 'sea_water_salinity')
AND (p_temp.standard_name = 'sea_water_temperature')
AND stoqs_platform.name IN ({})
ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth'''
# Build the SQL with optional selection of platforms to use
db = 'stoqs_canon_october2020'
platforms = (ActivityParameter.objects.using(db)
.filter(parameter__standard_name='sea_water_salinity')
.values_list('activity__platform__name', flat=True)
.order_by('activity__platform__name').distinct())
# Continue to omit some platforms for shorter execution times
plat_list = [p for p in platforms if p not in ('M1_Mooring', 'makai', 'pontus')]
plats = ','.join(f"'{p}'" for p in plat_list)
sql = sql_multp.format(plats)
print(sql)
SELECT DISTINCT stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, mp_salt.datavalue AS salt, mp_temp.datavalue AS temp FROM stoqs_measuredparameter INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id) INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id) INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id) INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id) INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id WHERE (p_salt.standard_name = 'sea_water_salinity') AND (p_temp.standard_name = 'sea_water_temperature') AND stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider') ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth
import pandas as pd
from django.db import connections
# It takes under 10 seconds to read the ~0.27 million result rows from the local STOQS database.
%time df1 = pd.read_sql_query(sql, connections[db], index_col=['name', 'timevalue', 'depth'])
##%time df1 = pd.read_sql_query(sql, connections[db])
print(df1.shape)
df1.head()
CPU times: user 614 ms, sys: 91.6 ms, total: 706 ms Wall time: 8.11 s (266285, 2)
salt | temp | |||
---|---|---|---|---|
name | timevalue | depth | ||
wg_Tiny_Glider | 2020-10-05 23:20:00 | -1.0 | 33.499126 | 14.309250 |
0.5 | 33.499126 | 14.309250 | ||
2020-10-05 23:30:00 | -1.0 | 33.495251 | 14.408250 | |
0.5 | 33.495251 | 14.408250 | ||
2020-10-05 23:40:00 | -1.0 | 33.491249 | 14.349625 |
# Writing the Parquet file takes about 0.6 seconds
%time df1.to_parquet('all_plats.parquet')
CPU times: user 132 ms, sys: 37.2 ms, total: 170 ms Wall time: 233 ms
# Reading the Parquet file takes about 0.4 seconds
%time df1b = pd.read_parquet('all_plats.parquet')
df1b.shape
CPU times: user 164 ms, sys: 20.9 ms, total: 185 ms Wall time: 226 ms
(266285, 2)
df1
salt | temp | |||
---|---|---|---|---|
name | timevalue | depth | ||
wg_Tiny_Glider | 2020-10-05 23:20:00 | -1.000000 | 33.499126 | 14.309250 |
0.500000 | 33.499126 | 14.309250 | ||
2020-10-05 23:30:00 | -1.000000 | 33.495251 | 14.408250 | |
0.500000 | 33.495251 | 14.408250 | ||
2020-10-05 23:40:00 | -1.000000 | 33.491249 | 14.349625 | |
... | ... | ... | ... | ... |
dorado | 2020-10-28 14:42:43 | 1.638631 | 33.487351 | 14.753714 |
2020-10-28 14:42:44 | 1.356102 | 33.488457 | 14.751194 | |
2020-10-28 14:42:46 | 0.791729 | 33.489155 | 14.750566 | |
2020-10-28 14:42:47 | 0.513772 | 33.489155 | 14.744095 | |
2020-10-28 14:42:48 | 0.169383 | 33.489155 | 14.743799 |
266285 rows × 2 columns
# Datashader plots must be left-justified on the last line; use this variable to do that
ts_points = None
if do_plots:
# See: http://holoviews.org/user_guide/Large_Data.html
# https://stackoverflow.com/a/18835121/1281657
import colorcet
import holoviews as hv
from holoviews.operation.datashader import rasterize
hv.extension("bokeh")
ropts = dict(height=380, width=300, colorbar=True, colorbar_position="bottom", cmap=colorcet.fire)
plots = [(rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['salt', 'temp']))
.opts(**ropts).relabel(p)) for p in plat_list]
ts_points = hv.Layout(plots).cols(3)
ts_points
This approach could be used in a general way to extract all Parameters for each Platform by dynamically generating the SQL (with dozens more self joins) and executing it. However, we need more scalable methods than .read_sql_query()
and .to_parquet()
, which read and write all the data through allocated random access memory at once. This is why Docker Desktop needs at least 16 GB of RAM for this query; larger data requests would require proportionally more memory, which does not scale.
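The dynamic SQL generation mentioned above could be sketched as follows; the alias scheme (`m_…`, `mp_…`, `p_…`, abbreviated from the standard_name) is an assumption modeled on the hand-written template earlier in this notebook, not an existing STOQS function:

```python
# Hedged sketch: generate the per-parameter self-join clauses that the
# sql_multp template above writes by hand, one triple of joins per parameter
def join_clauses(std_names):
    tmpl = (
        "INNER JOIN stoqs_measurement m_{a} ON m_{a}.instantpoint_id = stoqs_instantpoint.id\n"
        "INNER JOIN stoqs_measuredparameter mp_{a} ON mp_{a}.measurement_id = m_{a}.id\n"
        "INNER JOIN stoqs_parameter p_{a} ON mp_{a}.parameter_id = p_{a}.id"
    )
    # Derive a short alias from the standard_name, e.g. 'temp' or 'sali'
    return "\n".join(tmpl.format(a=n.split("sea_water_")[-1][:4]) for n in std_names)

print(join_clauses(["sea_water_salinity", "sea_water_temperature"]))
```

Each parameter added to the SELECT would also need a matching `p_….standard_name = '…'` predicate in the WHERE clause.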
This SQL is not forgiving in terms of ad hoc modification; for example, adding a column to the SELECT can increase the volume of results by unexpectedly returning a type of cross join with repeated salt and temp values.
This sort of self-join query also returns many duplicate records (a partial cross join) for the 'M1_Mooring' platform, whose 'stationprofile' CF featureType gives a different relational cardinality that would require special treatment.
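The cardinality problem can be seen in a toy pandas example (table and column names invented for illustration): when one instantpoint carries several salinity and several temperature measurements, joining both back through the instantpoint pairs every salt with every temp:

```python
import pandas as pd

# Toy stationprofile-like data: one instantpoint with two salts and two temps
salts = pd.DataFrame({"instantpoint_id": [1, 1], "salt": [33.5, 33.6]})
temps = pd.DataFrame({"instantpoint_id": [1, 1], "temp": [14.3, 14.4]})

# The SQL self join behaves like this merge: every salt row pairs with
# every temp row that shares the instantpoint
joined = salts.merge(temps, on="instantpoint_id")
print(len(joined))  # 4 rows from 2 salts x 2 temps
```

For a trajectory featureType each instantpoint has one value per parameter, so the join stays one-to-one.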
# It takes about 5 minutes to read in 0.17 million dorado CSV rows and convert using trimSTOQS
##! time wget https://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=dorado \
## -q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, > october2020_dorado_parms.csv
##df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_dorado_parms.csv')
# It takes about 40 seconds (on a fast network) to read in just 0.033 million NPS_Glider_29 CSV rows and convert using trimSTOQS
! time wget http://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=NPS_Glider_29 \
-q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, \
> /srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.csv
df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.csv')
print(df2.shape)
df2.head()
0.12user 0.80system 0:11.25elapsed 8%CPU (0avgtext+0avgdata 6744maxresident)k 0inputs+0outputs (0major+363minor)pagefaults 0swaps (33176, 11)
timevalue | depth | geom.x | geom.y | altitude | TEMP (Celsius) | PSAL (0.001) | OXYG (ml/L) | sigmat | spice | NPS_Glider_29 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-10-06 19:07:40.800002 | 0.000000 | -121.853996 | 36.803799 | 280.567732 | 13.956 | 33.462002 | 6.002 | 25.007821 | 0.682620 | NaN |
1 | 2020-10-06 19:07:57.931036 | 0.794097 | -121.854050 | 36.803795 | 280.170345 | 13.845 | 33.466999 | 5.918 | 25.034645 | 0.663049 | NaN |
2 | 2020-10-06 19:08:15.062071 | 1.885975 | -121.854103 | 36.803787 | 279.502588 | 13.731 | 33.467999 | 5.815 | 25.058888 | 0.639810 | NaN |
3 | 2020-10-06 19:08:32.193105 | 2.878586 | -121.854149 | 36.803780 | 278.858728 | 13.626 | 33.472000 | 5.716 | 25.083488 | 0.620954 | NaN |
4 | 2020-10-06 19:08:49.324140 | 3.970453 | -121.854202 | 36.803772 | 278.152412 | 13.590 | 33.469002 | 5.624 | 25.088541 | 0.611061 | NaN |
The advantage of this approach is that all parameters get transformed into the columns we want. The disadvantage is that it takes a long time to extract the data in CSV format: Approach 1 reads at about 30,000 rows/sec while Approach 2 reads at about 1,000 rows/sec - more than an order of magnitude slower.
Need to do this on the database first:
% docker-compose exec postgis psql -U postgres
postgres=# \c stoqs_canon_october2020
stoqs_canon_october2020=# CREATE EXTENSION IF NOT EXISTS tablefunc;
CREATE EXTENSION
# Base query that's similar to the one behind the api/measuredparameter.csv request
sql_base = '''SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name,
stoqs_instantpoint.timevalue, stoqs_measurement.depth,
ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude,
stoqs_parameter.name, standard_name, datavalue
FROM public.stoqs_measuredparameter
INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)
INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)
INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)
INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)
INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id)
WHERE stoqs_platform.name IN ({})
ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name'''
sql = sql_base.format(plats)
print(sql)
SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude, stoqs_parameter.name, standard_name, datavalue FROM public.stoqs_measuredparameter INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id) INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id) INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id) INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id) INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id) WHERE stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider') ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name
# Identify the columns used as the context (index) for the measurements
context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']
# It takes about 1 minute to read all the Parameters for the selected platforms - about 3.7 million rows
%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)
print(df3a.shape)
df3a.head()
CPU times: user 12.4 s, sys: 2.39 s, total: 14.8 s Wall time: 41.8 s (3683189, 4)
activity__name | name | standard_name | datavalue | |||||
---|---|---|---|---|---|---|---|---|
platform | timevalue | depth | latitude | longitude | ||||
dorado | 2020-10-08 20:08:11 | -0.028264 | 36.796672 | -121.849978 | Dorado389_2020_282_01_282_01_decim.nc | altitude | height_above_sea_floor | 2.707653e+02 |
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | bbp420 (m-1) | None | 1.269240e-03 | ||||
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | bbp700 (m-1) | None | 8.974295e-03 | ||||
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | biolume | None | 3.691703e+09 | ||||
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | fl700_uncorr | None | 1.431676e-03 |
context = ['platform', 'activity__name', 'timevalue', 'depth', 'latitude', 'longitude']
# It takes about 1 minute to read all the Parameters for the selected platforms - about 3.7 million rows
%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)
print(df3a.shape)
df3a.head()
CPU times: user 14.6 s, sys: 1.59 s, total: 16.2 s Wall time: 43.1 s (3683189, 3)
name | standard_name | datavalue | ||||||
---|---|---|---|---|---|---|---|---|
platform | activity__name | timevalue | depth | latitude | longitude | |||
dorado | Dorado389_2020_282_01_282_01_decim.nc | 2020-10-08 20:08:11 | -0.028264 | 36.796672 | -121.849978 | altitude | height_above_sea_floor | 2.707653e+02 |
-121.849978 | bbp420 (m-1) | None | 1.269240e-03 | |||||
-121.849978 | bbp700 (m-1) | None | 8.974295e-03 | |||||
-121.849978 | biolume | None | 3.691703e+09 | |||||
-121.849978 | fl700_uncorr | None | 1.431676e-03 |
import pandas.io.sql as sqlio
import psycopg2
# Use psycopg2 for direct from Postgres query - still explodes stoqs container RAM as this Notebook runs there
conn = psycopg2.connect("host='{}' port={} dbname='{}' user={} password={}".format(
'postgis', 5432, 'stoqs_canon_october2020', 'stoqsadm', 'CHANGEME'))
# Takes about 1 minute to read about 3.7 million rows
%time df3b = sqlio.read_sql_query(sql, conn, index_col=context)
print(df3b.shape)
df3b.head()
CPU times: user 27.3 s, sys: 1.51 s, total: 28.8 s Wall time: 56.6 s (3683189, 3)
name | standard_name | datavalue | ||||||
---|---|---|---|---|---|---|---|---|
platform | activity__name | timevalue | depth | latitude | longitude | |||
dorado | Dorado389_2020_282_01_282_01_decim.nc | 2020-10-08 20:08:11+00:00 | -0.028264 | 36.796672 | -121.849978 | altitude | height_above_sea_floor | 2.707653e+02 |
-121.849978 | bbp420 (m-1) | None | 1.269240e-03 | |||||
-121.849978 | bbp700 (m-1) | None | 8.974295e-03 | |||||
-121.849978 | biolume | None | 3.691703e+09 | |||||
-121.849978 | fl700_uncorr | None | 1.431676e-03 |
# Setting chunksize doesn't help reduce memory requirements in the stoqs or stoqs-postgis containers
# See: https://stackoverflow.com/a/31843091/1281657
# https://github.com/pandas-dev/pandas/issues/12265#issuecomment-181809005
# https://github.com/pandas-dev/pandas/issues/35689
df3c = pd.DataFrame()
# Still takes about 2.5 minutes to read 3.7 million rows (chunking happens only on the client in Pandas)
##%time chunker = pd.read_sql_query(sql, connections[db], index_col=context, chunksize=1000000)
##for chunk in chunker:
##print(chunk.shape)
##df3c.add(chunk.pivot_table(index=context, columns='name', values='datavalue'))
print("It would be nice if chunksize helped with memory usage in docker, but it doesn't.")
It would be nice if chunksize helped with memory usage in docker, but it doesn't.
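If chunked pivoting is attempted anyway, note that the commented code above would not accumulate correctly: DataFrame.add aligns and sums overlapping values, and its result is never assigned. A hedged sketch of one correct accumulation pattern, on synthetic long-format chunks like those read_sql_query(..., chunksize=...) would yield:

```python
import pandas as pd

# Two synthetic long-format chunks (column names invented for illustration)
chunks = [
    pd.DataFrame({"tv": [1, 1], "name": ["salt", "temp"], "datavalue": [33.5, 14.3]}),
    pd.DataFrame({"tv": [2, 2], "name": ["salt", "temp"], "datavalue": [33.4, 14.1]}),
]

# Pivot each chunk to wide format, then combine; the groupby collapses any
# index values whose rows were split across chunk boundaries
pieces = [c.pivot_table(index="tv", columns="name", values="datavalue") for c in chunks]
wide = pd.concat(pieces).groupby(level=0).first()
print(wide.shape)  # (2, 2)
```

This keeps only one chunk's pivot in flight at a time on the client, though as noted above it does not reduce memory pressure inside the Docker containers.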
The syntax of crosstab() is arcane, and it will take some work to figure out how to preserve datetime objects as they are read into a DataFrame. Performing the pivot close to the database is likely to be more performant than doing it after reading records into a DataFrame.
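On the datetime-preservation point: pandas' read_sql_query() accepts a parse_dates argument that converts text columns to datetime64 as they are read, which may help when a database-side crosstab() returns timevalue as text. A runnable sketch using an in-memory SQLite table as a stand-in for the Postgres connection used elsewhere in this notebook:

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the Postgres connection; table and columns invented
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (timevalue TEXT, temp REAL)")
conn.execute("INSERT INTO obs VALUES ('2020-10-08 20:08:11', 14.3)")

# parse_dates converts the text column to datetime64 on the way in
df = pd.read_sql_query("SELECT * FROM obs", conn, parse_dates=["timevalue"])
print(df["timevalue"].dtype)  # datetime64[ns]
```

The same keyword works with the Django connections[db] object passed to read_sql_query() in the cells above.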
Similar to Approach 2, but this may be more efficient as conversion to and from CSV text format is avoided.
# Identify the columns used as the index for the pivot
context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']
%time df4 = df3a.pivot_table(index=context, columns='name', values='datavalue')
print(df4.shape)
df4.head()
CPU times: user 4.9 s, sys: 53.1 ms, total: 4.96 s Wall time: 5.02 s (312470, 51)
name | O2_conc (umol/kg) | OXYG | OXYG (ml/L) | PAR (umol/s/m2) | PSAL (0.001) | TEMP (Celsius) | air_temp (C) | altitude | atm_press (mbar) | avg_wind_spd (m/s) | ... | salinity | salinity (psu) | sepCountList (count) | sigmat | spice | temperature (Celsius) | temperature (degC) | water_temp (C) | wind_dir (deg (True)) | yaw (degree) | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
platform | timevalue | depth | latitude | longitude | |||||||||||||||||||||
NPS_Glider_29 | 2020-10-06 19:07:40.800002 | 0.000000 | 36.803799 | -121.853996 | NaN | NaN | 6.002 | NaN | 33.462002 | 13.956 | NaN | 280.567732 | NaN | NaN | ... | NaN | NaN | NaN | 25.007821 | 0.682620 | NaN | NaN | NaN | NaN | NaN |
2020-10-06 19:07:57.931036 | 0.794097 | 36.803795 | -121.854050 | NaN | NaN | 5.918 | NaN | 33.466999 | 13.845 | NaN | 280.170345 | NaN | NaN | ... | NaN | NaN | NaN | 25.034645 | 0.663049 | NaN | NaN | NaN | NaN | NaN | |
2020-10-06 19:08:15.062071 | 1.885975 | 36.803787 | -121.854103 | NaN | NaN | 5.815 | NaN | 33.467999 | 13.731 | NaN | 279.502588 | NaN | NaN | ... | NaN | NaN | NaN | 25.058888 | 0.639810 | NaN | NaN | NaN | NaN | NaN | |
2020-10-06 19:08:32.193105 | 2.878586 | 36.803780 | -121.854149 | NaN | NaN | 5.716 | NaN | 33.472000 | 13.626 | NaN | 278.858728 | NaN | NaN | ... | NaN | NaN | NaN | 25.083488 | 0.620954 | NaN | NaN | NaN | NaN | NaN | |
2020-10-06 19:08:49.324140 | 3.970453 | 36.803772 | -121.854202 | NaN | NaN | 5.624 | NaN | 33.469002 | 13.590 | NaN | 278.152412 | NaN | NaN | ... | NaN | NaN | NaN | 25.088541 | 0.611061 | NaN | NaN | NaN | NaN | NaN |
5 rows × 51 columns
This approach looks promising. Some advantages:
Some disadvantages:
These experiments now lead to the script stoqs/contrib/parquet/extract_columns.py which will implement this capability at the command line. This notebook can still serve as a "playground" for testing out various ways to get STOQS data into modern data science tools.
if do_plots:
# See: https://datashader.org/getting_started/Pipeline.html
import colorcet
import holoviews as hv
from holoviews.operation.datashader import datashade
hv.extension("bokeh")
pts1 = hv.Points(df1, kdims=['salt', 'temp'])
pts2 = hv.Points(df2, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])
pts4a = hv.Points(df4, kdims=['salinity', 'temperature (Celsius)'])
pts4b = hv.Points(df4, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])
ts_points = ( datashade(pts1, cmap=colorcet.fire).opts(title='df1')
+ datashade(pts2, cmap=colorcet.fire).opts(title='df2')
+ datashade(pts4a, cmap=colorcet.fire).opts(title='df4a')
+ datashade(pts4b, cmap=colorcet.fire).opts(title='df4b'))
ts_points
if do_plots:
# See: http://holoviews.org/user_guide/Large_Data.html
from holoviews.operation.datashader import rasterize
##ropts = dict(tools=["pan,wheel_zoom,box_zoom"], height=380, width=330, colorbar=True, colorbar_position="bottom")
ropts = dict(height=380, width=330, colorbar=True, colorbar_position="bottom")
ts_points = hv.Layout([rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['temp', 'salt'])).opts(**ropts).relabel(p) for p in plat_list])
ts_points
print(f"Time to execute this notebook: {(time.time() - t_start):.1f} seconds")
Time to execute this notebook: 176.5 seconds
Monitoring with docker stats
shows that the stoqs container required 7.5 GB of memory to execute this notebook.