Visualize Parquet data

Use the Measured Parameter Data Access Parquet format to visualize STOQS data

Executing this Notebook requires a personal STOQS server - these instructions are for a Docker installation. This Notebook builds on issues raised in

Docker Instructions

Install and start the software as detailed in the README. (Note that on MacOS you will need to modify settings in your docker-compose.yml and .env files — look for comments referencing 'HOST_UID'.)

Then, from your $STOQS_HOME/docker directory start the Jupyter Notebook server - you can query from the remote database or from a copy that you've made to your local system:

Option A: Query from MBARI's master database

Start the Jupyter Notebook server pointing to MBARI's master STOQS database server. (Note: firewall rules limit unprivileged access to such resources):

docker-compose exec \
    -e DATABASE_URL=postgis://everyone:[email protected]:5432/stoqs \
    stoqs stoqs/ shell_plus --notebook

Option B: Query from your local Docker Desktop

Restore a database of your choice from, for example below is how to make a local copy of the stoqs_september2013 database from MBARI's server onto your local database and then start the Jupyter Notebook server using the default DATABASE_URL (which should be your local system) also make sure that your Docker Desktop has at least 16 GB of RAM allocated to it:

cd $STOQS_HOME/docker
docker-compose exec stoqs createdb -U postgres stoqs_september2013
curl -k | \
    docker exec -i stoqs pg_restore -Fc -U postgres -d stoqs_september2013
docker-compose exec stoqs stoqs/ shell_plus --notebook

Opening this Notebook

Following execution of the stoqs/ shell_plus --notebook command a message is displayed giving a URL for you to use in a browser on your host, e.g.:<a_token_generated_upon_server_start>

In the browser window opened to this URL navigate to this file (visualize_parquet.ipynb) and open it. You will then be able to execute the cells and modify the code to suit your needs.

The information in the output cells result from execution on a 2019 MacBook Pro with a 2.4 GHz 8-Core Intel Core i9 processor, 32 GB 2667 MHz DDR4 RAM, running Docker Desktop 3.1.0 with 16 GB with 4 CPUs and 16 GB allocated.

In a browser navigate to and make selections as shown in this screen grab: Constructing a parquet download URL

We will attempt to recreate this image from Issue 227: biplot

but this time using Datashader which can handle a lot more data.

In [1]:
import time
t_start = time.time()

# Issuing a STOQS api request from inside the stoqs container - where this
# notebook is running - is not really possible.  For testing with a 
# host='localhost' url you need to make that request from your system and 
# then copy the .parquet file to this directory: stoqs/contrib/notebooks.

# We have make two downloads as lrauvs and dorado have different Parameter names.
# It's theoretically possible to download all Parameter names from all three
# platforms, but that exceeds the container's RAM in my 16 GB Docker machine.
# It's more efficient to download just what we need.

##host = 'localhost'
host = ''
url_dorado = (f'https://{host}/stoqs_september2013/api/measuredparameter.parquet?'
##print(url_dorado)   # Uncoment for 'localhost' download from system browser
url_lrauvs  = (f'https://{host}/stoqs_september2013/api/measuredparameter.parquet?'
##print(url_lrauv)   # Uncoment for 'localhost' download from system browser

!time wget --no-check-certificate -O stoqs_september2013_dorado.parquet "{url_dorado}"
!time wget --no-check-certificate -O stoqs_september2013_lrauvs.parquet "{url_lrauvs}"
--2021-03-17 17:52:07--
Resolving (
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8334441 (7.9M) [application/octet-stream]
Saving to: ‘stoqs_september2013_dorado.parquet’

stoqs_september2013 100%[===================>]   7.95M  4.80MB/s    in 1.7s    

2021-03-17 17:52:16 (4.80 MB/s) - ‘stoqs_september2013_dorado.parquet’ saved [8334441/8334441]

0.01user 0.11system 0:08.53elapsed 1%CPU (0avgtext+0avgdata 6860maxresident)k
0inputs+0outputs (0major+364minor)pagefaults 0swaps
--2021-03-17 17:52:16--
Resolving (
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60694551 (58M) [application/octet-stream]
Saving to: ‘stoqs_september2013_lrauvs.parquet’

stoqs_september2013 100%[===================>]  57.88M  4.42MB/s    in 14s     

2021-03-17 17:53:46 (4.18 MB/s) - ‘stoqs_september2013_lrauvs.parquet’ saved [60694551/60694551]

0.14user 0.76system 1:29.96elapsed 1%CPU (0avgtext+0avgdata 6948maxresident)k
0inputs+0outputs (0major+363minor)pagefaults 0swaps
In [2]:
import pandas as pd

%time df_dorado = pd.read_parquet('stoqs_september2013_dorado.parquet')
print(f"dorado data: {df_dorado.shape}")

%time df_lrauvs = pd.read_parquet('stoqs_september2013_lrauvs.parquet')
print(f"lrauv data: {df_lrauvs.shape}")

# Combine into single DataFrame for more generalized follow-on processing
df = df_dorado.append(df_lrauvs)
CPU times: user 152 ms, sys: 32 ms, total: 184 ms
Wall time: 226 ms
dorado data: (120505, 4)
CPU times: user 1.7 s, sys: 204 ms, total: 1.9 s
Wall time: 2.05 s
lrauv data: (1531962, 4)
In [3]:
# Commit with do_plots = False, change to True for plots, but don't check it in that way
do_plots = False
plots = None
if do_plots:
    import colorcet
    import holoviews as hv
    from holoviews.operation.datashader import datashade


    pts_dorado = hv.Points(df, kdims=['bbp420', 'fl700_uncorr'])
    pts_daphne = hv.Points(df, kdims=['bb470', 'chlorophyll'])
    pts_tethys = hv.Points(df, kdims=['bb470', 'chlorophyll'])

    plots = ( datashade(pts_dorado,'dorado')
            + datashade(pts_daphne,'daphne') 
            + datashade(pts_tethys,'tethys') )
In [4]:
print(f"Time to execute this notebook: {(time.time() - t_start):.1f} seconds")
Time to execute this notebook: 103.8 seconds