Pivot the row-based data in a STOQS database to fit into a column-based dataframe
This Notebook explores options raised by this GitHub Issue. We want to be able to efficiently consume large quantities of data from a STOQS database and have it organized for efficient data analysis and visualization using modern dataframe-oriented tools.
Executing this Notebook requires a personal STOQS server. It can be run from either a Docker installation or from a development Vagrant Virtual Machine.
Install and start the software as
detailed in the README. (Note that on MacOS you will need to modify settings in your docker-compose.yml
and .env
files — look for comments referencing 'HOST_UID'.)
Then, from your $STOQS_HOME/docker
directory start the Jupyter Notebook server - you can query from the remote database or from a copy that you've made to your local system:
Start the Jupyter Notebook server pointing to MBARI's master STOQS database server. (Note: firewall rules limit unprivileged access to such resources):
docker-compose exec \
-e DATABASE_URL=postgis://everyone:guest@kraken.shore.mbari.org:5432/stoqs \
stoqs stoqs/manage.py shell_plus --notebook
Restore the stoqs_canon_october2020
database from MBARI's server onto your local database and start the Jupyter Notebook server using the default DATABASE_URL, which should point to your local system. Also make sure that your Docker Desktop has at least 16 GB of RAM allocated to it:
cd $STOQS_HOME/docker
docker-compose exec stoqs createdb -U postgres stoqs_canon_october2020
curl -k https://stoqs.shore.mbari.org/media/pg_dumps/stoqs_canon_october2020.pg_dump | \
docker exec -i stoqs pg_restore -Fc -U postgres -d stoqs_canon_october2020
docker-compose exec stoqs stoqs/manage.py shell_plus --notebook
Following execution of the stoqs/manage.py shell_plus --notebook
command, a message is displayed giving a URL for you to use in a browser on your host, e.g.:
http://127.0.0.1:8888/?token=<a_token_generated_upon_server_start>
In the browser window opened to this URL navigate to this file (select_data_in_columns_for_data_science.ipynb
) and open it. You will then be able to execute the cells and modify the code to suit your needs.
The information in the output cells results from execution on a 2019 MacBook Pro with a 2.4 GHz 8-Core Intel Core i9 processor and 32 GB 2667 MHz DDR4 RAM, running Docker Desktop 3.1.0 with 4 CPUs and 16 GB of RAM allocated.
import os
import time
# Prevent SynchronousOnlyOperation exceptions
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
# Set do_plots to True for visualizations (do not commit with this setting)
do_plots = False
t_start = time.time()
# Use a recent database available at DATABASE_URL
db = 'stoqs_canon_october2020'
Collect all the sea_water_temperature and sea_water_salinity data into dictionaries keyed by platform name. This is to examine the landscape of data we are querying.
# To make sure we collect temperatures and salinities that are properly associated
# we will first find all the Platforms that have T & S and then from each Measurement
# from the Platform collect the temperatures and salinities into lists for plotting.
# Assume that Platforms that have sea_water_salinity also have sea_water_temperature.
platforms = (ActivityParameter.objects.using(db)
.filter(parameter__standard_name='sea_water_salinity')
.values_list('activity__platform__name', flat=True)
.distinct().order_by('activity__platform__name'))
temps = {}
salts = {}
for platform in platforms:
print(f"Collecting data for: {platform:23}", end=' ')
mps = (MeasuredParameter.objects.using(db)
.filter(measurement__instantpoint__activity__platform__name=platform))
temps[platform] = (mps.filter(parameter__standard_name='sea_water_temperature')
.values_list('datavalue', flat=True))
salts[platform] = (mps.filter(parameter__standard_name='sea_water_salinity')
.values_list('datavalue', flat=True))
print(f"#temps: {len(temps[platform]):6} #salts: {len(salts[platform]):6}", end='')
if len(temps[platform]) != len(salts[platform]):
print(' - not equal')
else:
print()
print('Done')
Collecting data for: dorado #temps: 169159 #salts: 169159 Collecting data for: M1_Mooring #temps: 6600 #salts: 6600 Collecting data for: makai #temps: 154793 #salts: 154797 - not equal Collecting data for: makai_ESPmv1_filtering #temps: 29857 #salts: 29857 Collecting data for: makai_Sipper #temps: 595 #salts: 595 Collecting data for: NPS_Glider_29 #temps: 33143 #salts: 33143 Collecting data for: NPS_Glider_34 #temps: 32759 #salts: 32759 Collecting data for: pontus #temps: 132945 #salts: 132947 - not equal Collecting data for: wg_Tiny_Glider #temps: 386 #salts: 386 Done
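The unequal counts for makai and pontus above show why pairing the two value lists by position would silently misalign data; pairing through the shared measurement context (as the SQL in the next approach does) avoids this. A minimal pandas sketch of the idea, with invented values:

```python
import pandas as pd

# Synthetic measurements keyed by a measurement id; salinity has two
# extra rows, as with makai above
temp = pd.Series([14.3, 14.1, 13.9], index=[1, 2, 3], name="temp")
salt = pd.Series([33.5, 33.4, 33.3, 33.2, 33.1], index=[1, 2, 3, 4, 5], name="salt")

# An inner join on the index keeps only measurements that have both values,
# whereas zipping by position would pair unrelated measurements
aligned = pd.concat([temp, salt], axis=1, join="inner")
print(aligned.shape)  # (3, 2)
```

Only the three measurements present in both series survive the join.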
if do_plots:
# Make a T/S plot of data from all the platforms
import pylab as plt
plt.rcParams['figure.figsize'] = (18, 6)
for platform in temps.keys():
if len(temps[platform]) == len(salts[platform]):
plt.scatter(salts[platform], temps[platform], s=1, label=platform)
plt.xlabel('Salinity')
plt.ylabel('Temperature (DegC)')
plt.legend();
A sample SQL statement was copied from the STOQS UI and then modified to select sea_water_temperature and sea_water_salinity from all platforms.
sql_multp = '''SELECT DISTINCT
stoqs_platform.name,
stoqs_instantpoint.timevalue,
stoqs_measurement.depth,
mp_salt.datavalue AS salt,
mp_temp.datavalue AS temp
FROM stoqs_measuredparameter
INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)
INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)
INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)
INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)
INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id
INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id
INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id
INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id
INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id
INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id
WHERE (p_salt.standard_name = 'sea_water_salinity')
AND (p_temp.standard_name = 'sea_water_temperature')
AND stoqs_platform.name IN ({})
ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth'''
# Build the SQL with optional selection of platforms to use
db = 'stoqs_canon_october2020'
platforms = (ActivityParameter.objects.using(db)
.filter(parameter__standard_name='sea_water_salinity')
.values_list('activity__platform__name', flat=True)
.order_by('activity__platform__name').distinct())
# Continue to omit some platforms for shorter execution times
plat_list = [p for p in platforms if p not in ('M1_Mooring', 'makai', 'pontus')]
plats = ','.join(f"'{p}'" for p in plat_list)
sql = sql_multp.format(plats)
print(sql)
SELECT DISTINCT stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, mp_salt.datavalue AS salt, mp_temp.datavalue AS temp FROM stoqs_measuredparameter INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id) INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id) INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id) INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id) INNER JOIN stoqs_measurement m_salt ON m_salt.instantpoint_id = stoqs_instantpoint.id INNER JOIN stoqs_measuredparameter mp_salt ON mp_salt.measurement_id = m_salt.id INNER JOIN stoqs_parameter p_salt ON mp_salt.parameter_id = p_salt.id INNER JOIN stoqs_measurement m_temp ON m_temp.instantpoint_id = stoqs_instantpoint.id INNER JOIN stoqs_measuredparameter mp_temp ON mp_temp.measurement_id = m_temp.id INNER JOIN stoqs_parameter p_temp ON mp_temp.parameter_id = p_temp.id WHERE (p_salt.standard_name = 'sea_water_salinity') AND (p_temp.standard_name = 'sea_water_temperature') AND stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider') ORDER BY stoqs_instantpoint.timevalue, stoqs_measurement.depth
import pandas as pd
from django.db import connections
# It takes under 10 seconds to read the ~0.27 million result rows from the local STOQS database.
%time df1 = pd.read_sql_query(sql, connections[db], index_col=['name', 'timevalue', 'depth'])
##%time df1 = pd.read_sql_query(sql, connections[db])
print(df1.shape)
df1.head()
CPU times: user 614 ms, sys: 91.6 ms, total: 706 ms Wall time: 8.11 s (266285, 2)
salt | temp | |||
---|---|---|---|---|
name | timevalue | depth | ||
wg_Tiny_Glider | 2020-10-05 23:20:00 | -1.0 | 33.499126 | 14.309250 |
0.5 | 33.499126 | 14.309250 | ||
2020-10-05 23:30:00 | -1.0 | 33.495251 | 14.408250 | |
0.5 | 33.495251 | 14.408250 | ||
2020-10-05 23:40:00 | -1.0 | 33.491249 | 14.349625 |
# Writing the Parquet file takes about 0.6 seconds
%time df1.to_parquet('all_plats.parquet')
CPU times: user 132 ms, sys: 37.2 ms, total: 170 ms Wall time: 233 ms
# Reading the Parquet file takes about 0.4 seconds
%time df1b = pd.read_parquet('all_plats.parquet')
df1b.shape
CPU times: user 164 ms, sys: 20.9 ms, total: 185 ms Wall time: 226 ms
(266285, 2)
df1
salt | temp | |||
---|---|---|---|---|
name | timevalue | depth | ||
wg_Tiny_Glider | 2020-10-05 23:20:00 | -1.000000 | 33.499126 | 14.309250 |
0.500000 | 33.499126 | 14.309250 | ||
2020-10-05 23:30:00 | -1.000000 | 33.495251 | 14.408250 | |
0.500000 | 33.495251 | 14.408250 | ||
2020-10-05 23:40:00 | -1.000000 | 33.491249 | 14.349625 | |
... | ... | ... | ... | ... |
dorado | 2020-10-28 14:42:43 | 1.638631 | 33.487351 | 14.753714 |
2020-10-28 14:42:44 | 1.356102 | 33.488457 | 14.751194 | |
2020-10-28 14:42:46 | 0.791729 | 33.489155 | 14.750566 | |
2020-10-28 14:42:47 | 0.513772 | 33.489155 | 14.744095 | |
2020-10-28 14:42:48 | 0.169383 | 33.489155 | 14.743799 |
266285 rows × 2 columns
# Datashader plots must be left-justified on the last line; use this variable to do that
ts_points = None
if do_plots:
# See: http://holoviews.org/user_guide/Large_Data.html
# https://stackoverflow.com/a/18835121/1281657
import colorcet
import holoviews as hv
from holoviews.operation.datashader import rasterize
hv.extension("bokeh")
ropts = dict(height=380, width=300, colorbar=True, colorbar_position="bottom", cmap=colorcet.fire)
plots = [(rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['salt', 'temp']))
.opts(**ropts).relabel(p)) for p in plat_list]
ts_points = hv.Layout(plots).cols(3)
ts_points
This approach could be used in a general way to extract all Parameters for each Platform by dynamically generating the SQL (with dozens more self joins) and executing it. However, we need more scalable methods than .read_sql_query()
and .to_parquet()
, which read and write all the data through allocated random access memory at once. This is why Docker Desktop needs at least 16 GB of RAM for this query; larger data requests would require proportionally more memory, which does not scale.
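The dynamic SQL generation mentioned above could be sketched as follows; the alias scheme (`m_…`, `mp_…`, `p_…`, abbreviated from the standard_name) is an assumption modeled on the hand-written template earlier in this notebook, not an existing STOQS function:

```python
# Hedged sketch: generate the per-parameter self-join clauses that the
# sql_multp template above writes by hand, one triple of joins per parameter
def join_clauses(std_names):
    tmpl = (
        "INNER JOIN stoqs_measurement m_{a} ON m_{a}.instantpoint_id = stoqs_instantpoint.id\n"
        "INNER JOIN stoqs_measuredparameter mp_{a} ON mp_{a}.measurement_id = m_{a}.id\n"
        "INNER JOIN stoqs_parameter p_{a} ON mp_{a}.parameter_id = p_{a}.id"
    )
    # Derive a short alias from the standard_name, e.g. 'temp' or 'sali'
    return "\n".join(tmpl.format(a=n.split("sea_water_")[-1][:4]) for n in std_names)

print(join_clauses(["sea_water_salinity", "sea_water_temperature"]))
```

Each parameter added to the SELECT would also need a matching `p_….standard_name = '…'` predicate in the WHERE clause.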
This SQL is not forgiving in terms of ad hoc modification; for example, adding a column to the SELECT can increase the volume of results by unexpectedly returning a type of cross join with repeated salt and temp values.
This sort of self-join query also returns many duplicate records (a partial cross join) for the 'M1_Mooring' platform, whose 'stationprofile' CF featureType gives a different relational cardinality that would require special treatment.
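The cardinality problem can be seen in a toy pandas example (table and column names invented for illustration): when one instantpoint carries several salinity and several temperature measurements, joining both back through the instantpoint pairs every salt with every temp:

```python
import pandas as pd

# Toy stationprofile-like data: one instantpoint with two salts and two temps
salts = pd.DataFrame({"instantpoint_id": [1, 1], "salt": [33.5, 33.6]})
temps = pd.DataFrame({"instantpoint_id": [1, 1], "temp": [14.3, 14.4]})

# The SQL self join behaves like this merge: every salt row pairs with
# every temp row that shares the instantpoint
joined = salts.merge(temps, on="instantpoint_id")
print(len(joined))  # 4 rows from 2 salts x 2 temps
```

For a trajectory featureType each instantpoint has one value per parameter, so the join stays one-to-one.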
# It takes about 5 minutes to read in 0.17 million dorado CSV rows and convert using trimSTOQS
##! time wget https://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=dorado \
## -q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, > october2020_dorado_parms.csv
##df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_dorado_parms.csv')
# It takes about 40 seconds (on a fast network) to read in just 0.033 million NPS_Glider_29 CSV rows and convert using trimSTOQS
! time wget http://stoqs.mbari.org/stoqs_canon_october2020/api/measuredparameter.csv?measurement__instantpoint__activity__platform__name=NPS_Glider_29 \
-q -O - | /srv/stoqs/contrib/trimSTOQS/trimSTOQS parameter__name --separator=, \
> /srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.csv
df2 = pd.read_csv('/srv/stoqs/contrib/trimSTOQS/october2020_NPS_Glider_29_parms.csv')
print(df2.shape)
df2.head()
0.12user 0.80system 0:11.25elapsed 8%CPU (0avgtext+0avgdata 6744maxresident)k 0inputs+0outputs (0major+363minor)pagefaults 0swaps (33176, 11)
timevalue | depth | geom.x | geom.y | altitude | TEMP (Celsius) | PSAL (0.001) | OXYG (ml/L) | sigmat | spice | NPS_Glider_29 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-10-06 19:07:40.800002 | 0.000000 | -121.853996 | 36.803799 | 280.567732 | 13.956 | 33.462002 | 6.002 | 25.007821 | 0.682620 | NaN |
1 | 2020-10-06 19:07:57.931036 | 0.794097 | -121.854050 | 36.803795 | 280.170345 | 13.845 | 33.466999 | 5.918 | 25.034645 | 0.663049 | NaN |
2 | 2020-10-06 19:08:15.062071 | 1.885975 | -121.854103 | 36.803787 | 279.502588 | 13.731 | 33.467999 | 5.815 | 25.058888 | 0.639810 | NaN |
3 | 2020-10-06 19:08:32.193105 | 2.878586 | -121.854149 | 36.803780 | 278.858728 | 13.626 | 33.472000 | 5.716 | 25.083488 | 0.620954 | NaN |
4 | 2020-10-06 19:08:49.324140 | 3.970453 | -121.854202 | 36.803772 | 278.152412 | 13.590 | 33.469002 | 5.624 | 25.088541 | 0.611061 | NaN |
The advantage of this approach is that all parameters get transformed into the columns we want. The disadvantage is that it takes a long time to extract the data in CSV format: Approach 1 reads at about 30,000 rows/sec while Approach 2 reads at about 1,000 rows/sec - more than an order of magnitude slower.
Need to do this on the database first:
% docker-compose exec postgis psql -U postgres
postgres=# \c stoqs_canon_october2020
stoqs_canon_october2020=# CREATE EXTENSION IF NOT EXISTS tablefunc;
CREATE EXTENSION
# Base query that's similar to the one behind the api/measuredparameter.csv request
sql_base = '''SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name,
stoqs_instantpoint.timevalue, stoqs_measurement.depth,
ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude,
stoqs_parameter.name, standard_name, datavalue
FROM public.stoqs_measuredparameter
INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id)
INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id)
INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id)
INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id)
INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id)
WHERE stoqs_platform.name IN ({})
ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name'''
sql = sql_base.format(plats)
print(sql)
SELECT stoqs_platform.name as platform, stoqs_activity.name as activity__name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, ST_X(stoqs_measurement.geom) as longitude, ST_Y(stoqs_measurement.geom) as latitude, stoqs_parameter.name, standard_name, datavalue FROM public.stoqs_measuredparameter INNER JOIN stoqs_measurement ON (stoqs_measuredparameter.measurement_id = stoqs_measurement.id) INNER JOIN stoqs_instantpoint ON (stoqs_measurement.instantpoint_id = stoqs_instantpoint.id) INNER JOIN stoqs_activity ON (stoqs_instantpoint.activity_id = stoqs_activity.id) INNER JOIN stoqs_platform ON (stoqs_activity.platform_id = stoqs_platform.id) INNER JOIN stoqs_parameter ON (stoqs_measuredparameter.parameter_id = stoqs_parameter.id) WHERE stoqs_platform.name IN ('dorado','makai_ESPmv1_filtering','makai_Sipper','NPS_Glider_29','NPS_Glider_34','wg_Tiny_Glider') ORDER BY stoqs_platform.name, stoqs_instantpoint.timevalue, stoqs_measurement.depth, stoqs_parameter.name
# Identify the columns used as the context (index) for the measurements
context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']
# It takes about 1 minute to read all the Parameters for the selected platforms - about 3.7 million rows
%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)
print(df3a.shape)
df3a.head()
CPU times: user 12.4 s, sys: 2.39 s, total: 14.8 s Wall time: 41.8 s (3683189, 4)
activity__name | name | standard_name | datavalue | |||||
---|---|---|---|---|---|---|---|---|
platform | timevalue | depth | latitude | longitude | ||||
dorado | 2020-10-08 20:08:11 | -0.028264 | 36.796672 | -121.849978 | Dorado389_2020_282_01_282_01_decim.nc | altitude | height_above_sea_floor | 2.707653e+02 |
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | bbp420 (m-1) | None | 1.269240e-03 | ||||
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | bbp700 (m-1) | None | 8.974295e-03 | ||||
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | biolume | None | 3.691703e+09 | ||||
-121.849978 | Dorado389_2020_282_01_282_01_decim.nc | fl700_uncorr | None | 1.431676e-03 |
context = ['platform', 'activity__name', 'timevalue', 'depth', 'latitude', 'longitude']
# It takes about 1 minute to read all the Parameters for the selected platforms - about 3.7 million rows
%time df3a = pd.read_sql_query(sql, connections[db], index_col=context)
print(df3a.shape)
df3a.head()
CPU times: user 14.6 s, sys: 1.59 s, total: 16.2 s Wall time: 43.1 s (3683189, 3)
name | standard_name | datavalue | ||||||
---|---|---|---|---|---|---|---|---|
platform | activity__name | timevalue | depth | latitude | longitude | |||
dorado | Dorado389_2020_282_01_282_01_decim.nc | 2020-10-08 20:08:11 | -0.028264 | 36.796672 | -121.849978 | altitude | height_above_sea_floor | 2.707653e+02 |
-121.849978 | bbp420 (m-1) | None | 1.269240e-03 | |||||
-121.849978 | bbp700 (m-1) | None | 8.974295e-03 | |||||
-121.849978 | biolume | None | 3.691703e+09 | |||||
-121.849978 | fl700_uncorr | None | 1.431676e-03 |
import pandas.io.sql as sqlio
import psycopg2
# Use psycopg2 for direct from Postgres query - still explodes stoqs container RAM as this Notebook runs there
conn = psycopg2.connect("host='{}' port={} dbname='{}' user={} password={}".format(
'postgis', 5432, 'stoqs_canon_october2020', 'stoqsadm', 'CHANGEME'))
# Takes about 1 minute to read about 3.7 million rows
%time df3b = sqlio.read_sql_query(sql, conn, index_col=context)
print(df3b.shape)
df3b.head()
CPU times: user 27.3 s, sys: 1.51 s, total: 28.8 s Wall time: 56.6 s (3683189, 3)
name | standard_name | datavalue | ||||||
---|---|---|---|---|---|---|---|---|
platform | activity__name | timevalue | depth | latitude | longitude | |||
dorado | Dorado389_2020_282_01_282_01_decim.nc | 2020-10-08 20:08:11+00:00 | -0.028264 | 36.796672 | -121.849978 | altitude | height_above_sea_floor | 2.707653e+02 |
-121.849978 | bbp420 (m-1) | None | 1.269240e-03 | |||||
-121.849978 | bbp700 (m-1) | None | 8.974295e-03 | |||||
-121.849978 | biolume | None | 3.691703e+09 | |||||
-121.849978 | fl700_uncorr | None | 1.431676e-03 |
# Setting chunksize doesn't help reduce memory requirements in the stoqs or stoqs-postgis containers
# See: https://stackoverflow.com/a/31843091/1281657
# https://github.com/pandas-dev/pandas/issues/12265#issuecomment-181809005
# https://github.com/pandas-dev/pandas/issues/35689
df3c = pd.DataFrame()
# Still takes about 2.5 minutes to read 3.7 million rows (chunking happens only on the client in Pandas)
##%time chunker = pd.read_sql_query(sql, connections[db], index_col=context, chunksize=1000000)
##for chunk in chunker:
##print(chunk.shape)
##df3c.add(chunk.pivot_table(index=context, columns='name', values='datavalue'))
print("It would be nice if chunksize helped with memory usage in docker, but it doesn't.")
It would be nice if chunksize helped with memory usage in docker, but it doesn't.
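If chunked pivoting is attempted anyway, note that the commented code above would not accumulate correctly: DataFrame.add aligns and sums overlapping values, and its result is never assigned. A hedged sketch of one correct accumulation pattern, on synthetic long-format chunks like those read_sql_query(..., chunksize=...) would yield:

```python
import pandas as pd

# Two synthetic long-format chunks (column names invented for illustration)
chunks = [
    pd.DataFrame({"tv": [1, 1], "name": ["salt", "temp"], "datavalue": [33.5, 14.3]}),
    pd.DataFrame({"tv": [2, 2], "name": ["salt", "temp"], "datavalue": [33.4, 14.1]}),
]

# Pivot each chunk to wide format, then combine; the groupby collapses any
# index values whose rows were split across chunk boundaries
pieces = [c.pivot_table(index="tv", columns="name", values="datavalue") for c in chunks]
wide = pd.concat(pieces).groupby(level=0).first()
print(wide.shape)  # (2, 2)
```

This keeps only one chunk's pivot in flight at a time on the client, though as noted above it does not reduce memory pressure inside the Docker containers.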
The syntax of crosstab() is arcane, and it will take some work to figure out how to preserve datetime objects as they are read into a DataFrame. Performing the pivot close to the database is likely to be more performant than doing it after reading records into a DataFrame.
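On the datetime-preservation point: pandas' read_sql_query() accepts a parse_dates argument that converts text columns to datetime64 as they are read, which may help when a database-side crosstab() returns timevalue as text. A runnable sketch using an in-memory SQLite table as a stand-in for the Postgres connection used elsewhere in this notebook:

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the Postgres connection; table and columns invented
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (timevalue TEXT, temp REAL)")
conn.execute("INSERT INTO obs VALUES ('2020-10-08 20:08:11', 14.3)")

# parse_dates converts the text column to datetime64 on the way in
df = pd.read_sql_query("SELECT * FROM obs", conn, parse_dates=["timevalue"])
print(df["timevalue"].dtype)  # datetime64[ns]
```

The same keyword works with the Django connections[db] object passed to read_sql_query() in the cells above.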
Similar to Approach 2, but this may be more efficient as conversion to and from CSV text format is avoided.
# Identify the columns used as the index for the pivot
context = ['platform', 'timevalue', 'depth', 'latitude', 'longitude']
%time df4 = df3a.pivot_table(index=context, columns='name', values='datavalue')
print(df4.shape)
df4.head()
CPU times: user 4.9 s, sys: 53.1 ms, total: 4.96 s Wall time: 5.02 s (312470, 51)
name | O2_conc (umol/kg) | OXYG | OXYG (ml/L) | PAR (umol/s/m2) | PSAL (0.001) | TEMP (Celsius) | air_temp (C) | altitude | atm_press (mbar) | avg_wind_spd (m/s) | ... | salinity | salinity (psu) | sepCountList (count) | sigmat | spice | temperature (Celsius) | temperature (degC) | water_temp (C) | wind_dir (deg (True)) | yaw (degree) | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
platform | timevalue | depth | latitude | longitude | |||||||||||||||||||||
NPS_Glider_29 | 2020-10-06 19:07:40.800002 | 0.000000 | 36.803799 | -121.853996 | NaN | NaN | 6.002 | NaN | 33.462002 | 13.956 | NaN | 280.567732 | NaN | NaN | ... | NaN | NaN | NaN | 25.007821 | 0.682620 | NaN | NaN | NaN | NaN | NaN |
2020-10-06 19:07:57.931036 | 0.794097 | 36.803795 | -121.854050 | NaN | NaN | 5.918 | NaN | 33.466999 | 13.845 | NaN | 280.170345 | NaN | NaN | ... | NaN | NaN | NaN | 25.034645 | 0.663049 | NaN | NaN | NaN | NaN | NaN | |
2020-10-06 19:08:15.062071 | 1.885975 | 36.803787 | -121.854103 | NaN | NaN | 5.815 | NaN | 33.467999 | 13.731 | NaN | 279.502588 | NaN | NaN | ... | NaN | NaN | NaN | 25.058888 | 0.639810 | NaN | NaN | NaN | NaN | NaN | |
2020-10-06 19:08:32.193105 | 2.878586 | 36.803780 | -121.854149 | NaN | NaN | 5.716 | NaN | 33.472000 | 13.626 | NaN | 278.858728 | NaN | NaN | ... | NaN | NaN | NaN | 25.083488 | 0.620954 | NaN | NaN | NaN | NaN | NaN | |
2020-10-06 19:08:49.324140 | 3.970453 | 36.803772 | -121.854202 | NaN | NaN | 5.624 | NaN | 33.469002 | 13.590 | NaN | 278.152412 | NaN | NaN | ... | NaN | NaN | NaN | 25.088541 | 0.611061 | NaN | NaN | NaN | NaN | NaN |
5 rows × 51 columns
This approach looks promising. Some advantages:
Some disadvantages:
These experiments now lead to the script stoqs/contrib/parquet/extract_columns.py which will implement this capability at the command line. This notebook can still serve as a "playground" for testing out various ways to get STOQS data into modern data science tools.
if do_plots:
# See: https://datashader.org/getting_started/Pipeline.html
import colorcet
import holoviews as hv
from holoviews.operation.datashader import datashade
hv.extension("bokeh")
pts1 = hv.Points(df1, kdims=['salt', 'temp'])
pts2 = hv.Points(df2, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])
pts4a = hv.Points(df4, kdims=['salinity', 'temperature (Celsius)'])
pts4b = hv.Points(df4, kdims=['PSAL (0.001)', 'TEMP (Celsius)'])
ts_points = ( datashade(pts1, cmap=colorcet.fire).opts(title='df1')
+ datashade(pts2, cmap=colorcet.fire).opts(title='df2')
+ datashade(pts4a, cmap=colorcet.fire).opts(title='df4a')
+ datashade(pts4b, cmap=colorcet.fire).opts(title='df4b'))
ts_points
if do_plots:
# See: http://holoviews.org/user_guide/Large_Data.html
from holoviews.operation.datashader import rasterize
##ropts = dict(tools=["pan,wheel_zoom,box_zoom"], height=380, width=330, colorbar=True, colorbar_position="bottom")
ropts = dict(height=380, width=330, colorbar=True, colorbar_position="bottom")
ts_points = hv.Layout([rasterize(hv.Points(df1.iloc[df1.index.get_level_values('name') == p], kdims=['temp', 'salt'])).opts(**ropts).relabel(p) for p in plat_list])
ts_points
print(f"Time to execute this notebook: {(time.time() - t_start):.1f} seconds")
Time to execute this notebook: 176.5 seconds
Monitoring with docker stats
shows that the stoqs container required 7.5 GB of memory to execute this notebook.