Create a classifier for different kinds of plankton using supervised machine learning
Executing this Notebook requires a personal STOQS database. Follow the steps to build your own development system; this takes about an hour and requires a good connection to the Internet. Once your server is up, log into it (after a cd ~/Vagrants/stoqsvm) and activate your virtual environment with the usual commands:
vagrant ssh -- -X
cd /vagrant/dev/stoqsgit
source venv-stoqs/bin/activate
Then load the stoqs_september2013 database with the commands:
cd stoqs
ln -s mbari_campaigns.py campaigns.py
export DATABASE_URL=postgis://stoqsadm:CHANGEME@127.0.0.1:5432/stoqs
loaders/load.py --db stoqs_september2013
loaders/load.py --db stoqs_september2013 --updateprovenance
Loading this database can take over a day as there are over 40 million measurements from 22 different platforms. You may want to edit the stoqs/loaders/CANON/loadCANON_september2013.py file and comment out all but the loadDorado() method calls at the end of the file. You can also set a stride value or use the --test option to create a stoqs_september2013_t database, in which case you'll need to set the STOQS_CAMPAIGNS environment variable:
export STOQS_CAMPAIGNS=stoqs_september2013_t
Use the stoqs/contrib/analysis/classify.py script to create some labeled data that we will learn from:
contrib/analysis/classify.py --createLabels --groupName Plankton \
--database stoqs_september2013 --platform dorado \
--start 20130916T124035 --end 20130919T233905 \
--inputs bbp700 fl700_uncorr --discriminator salinity \
--labels diatom dino1 dino2 sediment \
--mins 33.33 33.65 33.70 33.75 --maxes 33.65 33.70 33.75 33.93 --clobber -v
A little explanation is probably warranted here. The Dorado missions on 16-19 September 2013 sampled distinct water types in Monterey Bay that are easily identified by ranges of salinity. These water types contain different kinds of particles as identified by bbp700 (backscatter) and fl700_uncorr (chlorophyll). The previous command "labeled" MeasuredParameters in the database according to our understanding of the optical properties of diatoms, dinoflagellates, and sediment. This works for this data set because of the particular oceanographic conditions at the time.
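The salinity-range labeling described above can be sketched as follows. This is a minimal illustration, not the classify.py implementation: the label_by_salinity helper and the sample salinity values are hypothetical, while the labels, mins, and maxes come from the command above.

```python
# Sketch of mapping salinity ranges to water-type labels
# (hypothetical helper; classify.py does this against the database).
labels = ['diatom', 'dino1', 'dino2', 'sediment']
mins = [33.33, 33.65, 33.70, 33.75]
maxes = [33.65, 33.70, 33.75, 33.93]

def label_by_salinity(salinity):
    """Return the label whose [min, max) range contains the salinity, else None."""
    for label, lo, hi in zip(labels, mins, maxes):
        if lo <= salinity < hi:
            return label
    return None

print(label_by_salinity(33.50))   # -> diatom
print(label_by_salinity(33.80))   # -> sediment
```

Each MeasuredParameter whose discriminator (salinity) falls in a range gets the corresponding label; values outside every range are left unlabeled.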
This Notebook demonstrates creating a classification algorithm from these labeled data and addresses Issue 227 on GitHub. To execute the cells and experiment with different algorithms and parameters, launch Jupyter Notebook with:
cd contrib/notebooks
../../manage.py shell_plus --notebook
navigate to this file and open it. You will then be able to execute the cells and experiment with different settings and code.
Use code from the classify module to read data from the database:
from contrib.analysis.classify import Classifier
c = Classifier()
Build up command-line parameters so that we can call methods on our Classifier() object c:
from argparse import Namespace
ns = Namespace()
ns.database = 'stoqs_september2013_t'
ns.classifier='Decision_Tree'
ns.inputs=['bbp700', 'fl700_uncorr']
ns.labels=['diatom', 'dino1', 'dino2', 'sediment']
ns.test_size=0.4
ns.train_size=0.4
ns.verbose=True
c.args = ns
Load the labeled data, normalize, and split into train and test sets (borrowing from classify.py's createClassifier() method):
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X, y = c.loadLabeledData('Labeled Plankton', classes=('diatom', 'sediment'))
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=c.args.test_size, train_size=c.args.train_size)
count = 100 for label = diatom
count = 66 for label = sediment
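The per-class counts reported above can be reproduced for any label array with a quick tally. This is a sketch on synthetic labels; in the notebook the real labels come from loadLabeledData().

```python
from collections import Counter

# Tally class membership the way the loader reports it
# (synthetic labels here, matching the counts shown above).
y_example = ['diatom'] * 100 + ['sediment'] * 66
for label, count in sorted(Counter(y_example).items()):
    print(f'count = {count} for label = {label}')
```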
Setup plotting
%pylab inline
import pylab as plt
from matplotlib.colors import ListedColormap
plt.rcParams['figure.figsize'] = (27, 3)
Populating the interactive namespace from numpy and matplotlib
Plot classifier comparisons as in http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
for i, (name, clf) in enumerate(c.classifiers.items()):
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
np.arange(y_min, y_max, .02))
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot(1, len(c.classifiers) + 1, i + 1)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
# Plot also the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
# and testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
alpha=0.6)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
size=15, horizontalalignment='right')
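The mesh-evaluation technique used in the loop above can be exercised on its own with synthetic data. This is a minimal sketch: the two Gaussian blobs and the DecisionTreeClassifier settings are assumptions standing in for the bbp700/fl700_uncorr features and the classifiers dictionary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Two synthetic 2-D classes standing in for the optical features
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.4, random_state=0)
clf = DecisionTreeClassifier(max_depth=3).fit(Xtr, ytr)

# Evaluate class probabilities on a grid, as the comparison loop does,
# then reshape back to the grid for contourf plotting
xx, yy = np.meshgrid(np.arange(-3, 6, 0.5), np.arange(-3, 6, 0.5))
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)
print(Z.shape)   # -> (18, 18)
print(clf.score(Xte, yte))
```

The reshaped Z is what contourf colors in each panel; classifiers without predict_proba use decision_function instead, exactly as the loop's hasattr check selects.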