Department of Chemistry & Physics
College of Sciences and Mathematics
Re: Teaching PHY/BSA 3895: "Deep Learning and AI Ethics," Fall 2021.
Scholarship of Teaching & Learning Symposium, Belmont University, April 28, 2021
import numpy as np
from mrspuff.viz import *
from mrspuff.utils import calc_prob, one_hot, softmax
from mrspuff.scrape import exhibit_urls
I like & teach machine learning (ML), esp. deep learning (DL). Lots of ML is classification. How humans & machines do it differently is the topic of my popular-level eBook-in-progress, which uses interactive visualizations (instead of "math").
We'll compare & contrast "traditional" machine learning (ML) classification with so-called "zero-shot" classifiers, which use contrastive losses to embed semantically meaningful features as clusters in space. Zero-shot methods are increasingly prevalent in the literature, and have the nice property that, unlike traditional ML classifiers, they don't need to be re-trained when new classes are added. To understand them, it helps to first view traditional classification as an embedding method of its own.
*see my blog post "Naughty by Numbers: Classifications at Christmas"
I said "no math," but:
$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$.
3D gives you all of softmax's complexity & you can still picture it!
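As a quick check of that formula, here's a minimal numpy version (the softmax imported from mrspuff.utils above plays this role in later cells; this sketch is just for illustration and its signature may differ):
def softmax_demo(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax_demo(np.array([2.0, 1.0, 0.1])))  # -> three probabilities summing to 1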
Data/sound viz has been a big part of my career, further influenced by Yang Hann Kim's 2016 Rossing Prize Lecture in Acoustics Education on visualization in teaching STEM.
In 3D the representations are exact!
The mrspuff library: built for teaching via visualization, running on Google Colab, and leveraging fast.ai.
labels = ['cat','dog','horse']
data = np.array([[0.7,0.2,0.1],[0.15,0.6,0.25],[0.05,0.15,0.8],[1,0,0],[0,1,0],[0,0,1]])
# The following generated the images loaded in the next cell, saving them to ./images/.
# You can ignore it.
if False:
    from plotly.io import write_image
    for i in range(len(labels)):
        fig = image_and_bars(data[i], labels, CDH_SAMPLE_URLS[i])
        fig.show()
        fname = f'images/{labels[i]}_bars.png'
        try:
            write_image(fig, fname)
        except ValueError:
            print("Sorry, you need the plotly-orca binary installed to auto-save images.")
            print("Click on the camera icon above to save manually to ", fname)
# Save cropped versions of the images; you can ignore this too.
if False:
    from PIL import Image
    for i in range(len(labels)):
        f = f'images/{labels[i]}'
        fin = f + '_bars.png'
        img = Image.open(fin)
        img = img.crop((15, 25, 250, 200))
        img.save(f + '.png')
Traditionally, we use triplets of numbers to denote 3 classes.
"Ground truth" values are "one-hot encoded":
cat: (1,0,0) dog: (0,1,0) horse: (0,0,1)
Model predicts class probabilities for images:
People who are not mathematicians, physicists, data scientists, etc. may be unaccustomed to this talk of "dimensions" when dealing with data. Let's dive into the specific case of three-class classification. Say we're developing a computer program to guess ("predict") whether a given image contains a cat, a dog, or a horse. Traditionally the model produces a set of 3 probabilities, one for each class, while the "ground truth" target values are one-hot encoded: a 1 in the slot of the true class and 0s everywhere else, so a cat image gets the target (1,0,0), a dog (0,1,0), and a horse (0,0,1).
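For concreteness, here's one-hot encoding in plain numpy (mrspuff.utils also provides a one_hot helper, imported above; its exact signature may differ from this sketch):
def one_hot_demo(class_index, nclasses=3):
    v = np.zeros(nclasses)   # start with all zeros...
    v[class_index] = 1.0     # ...and put a 1 in the true class's slot
    return v

print(one_hot_demo(0), one_hot_demo(1), one_hot_demo(2))  # cat, dog, horse targets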
The predicted probabilities can be viewed as the strength of an attribute in an image, e.g. measures of cat-ness, dog-ness, and horse-ness (or measures of the likelihood of being a cat, dog, or horse, respectively), where a value of 1 means 100% of that property. Notice that in each case the three "class" probabilities add up to 1. This is always the case: probabilities have to sum to 1, i.e. 1 is "100% certainty" that gets split among the 3 classes. (This summing to 1 is an important property that we'll come back to in a bit.)
One thing that scientists like to do is take different variables and view them as coordinates of a single point in a multi-dimensional space. So for 3 classes we have 3 coordinates for 3 dimensions: we could make the "cat-ness" prediction probability the "x" coordinate, "dog-ness" the "y" coordinate, and "horse-ness" the "z". Then instead of drawing bar graphs, we could plot points in 3D space, where the coordinates of each point tell us the predictions:
All the 3D plots in this post can be rotated & zoomed with the mouse or your finger. Try it!
"Embedding": treat each triplet as the (x,y,z) coordinates of a point in 3D space:
# All the 3D plots in this talk are interactive. Use your mouse to rotate, etc!
TrianglePlot3D_Plotly(data, targ=None, labels=labels*2, show_bounds=False).do_plot()
(Here we also used the 3 class probabilities to set the R,G,B color values of the points. There's no new information contained in this; it just looks cool.)
What scientists tend to do, even in cases where there are more than 3 variables (say, 10), is regard these as dimensions in some fancy abstract mathematical space whose laws may or may not conform to those of our universe -- for example, the idea of "distance" may be totally up for grabs. In cases where the number of values is infinite (say, as coefficients in an infinite series, or as a function of a continuous variable) we might even work in infinite dimensions! Often when we talk like this, it doesn't mean that we're actually picturing geometrical spaces in our heads -- we can't, for anything beyond 3 dimensions -- but it's a handy way of encapsulating a particular way of viewing the data or functions involved. And sometimes we do try to see what kinds of geometrical insights we can glean -- which is what we're going to do here!
Remember when we said that the individual class probabilities have to add up to 1? Look what happens when we plot a lot of such points...
Let's plot LOTS of points...
prob, targ = calc_prob(n=400)
TrianglePlot3D_Plotly(prob, targ=None, labels=labels, show_bounds=False).do_plot()
Note that even though these are points in 3D space, they make up a triangle which lies along a plane -- a 2D "subspace" of 3D. This is a consequence of the "constraint" that all class probabilities add up to 1.
We can color the points by their expected class values by choosing the triangle point (or "pole") that they're nearest to -- i.e. by which "bar" is largest among the class probabilities. And we can include the boundaries between classes:
Let's color them by their target value / label, and show class boundaries:
TrianglePlot3D_Plotly(prob, targ=targ, labels=labels, show_bounds=True).do_plot()
Since the points lie in a plane, we can change coordinates & plot in 2D instead
(mouse hover = show image!)
Note: right now the mouse-over images below are scraped from DuckDuckGo on the fly and then rendered via requests. Work in progress. Real-time tracking while training with fastai is on the TODO list!
urls = exhibit_urls(targ, labels)
TrianglePlot2D_Bokeh(prob, targ=targ, labels=labels, show_bounds=True, urls=urls).do_plot()
# generate and save images that we'll load in the next cell
import matplotlib.pyplot as plt
# generate data along boundaries
def gen_bound(x, y, z, n=20, ind0=1):  # ind0=1 skips the "first point"
    return np.linspace(np.array([x[0], y[0], z[0]]), np.array([x[1], y[1], z[1]]), num=n+ind0)[ind0:]

def gen_bound_data(n_per=20, ind0=0):
    bdata = np.zeros((n_per*3, 3))
    bdata[:n_per]        = gen_bound(x=[0.333, 0.5], y=[0.333, 0.5], z=[0.333, 0],   n=n_per, ind0=ind0)
    bdata[n_per:2*n_per] = gen_bound(x=[0.333, 0],   y=[0.333, 0.5], z=[0.333, 0.5], n=n_per, ind0=ind0)
    bdata[-n_per:]       = gen_bound(x=[0.333, 0.5], y=[0.333, 0],   z=[0.333, 0.5], n=n_per, ind0=ind0)
    return bdata

def gen_near_bound_data(n_per=50, scale=7, eps=0.01):
    bdata = gen_bound_data(n_per=n_per)
    lower, right, left = bdata[0:n_per, :], bdata[n_per:2*n_per, :], bdata[-n_per:, :]
    # shift the data a bit off each boundary, toward one class or the other
    lower_catty  = softmax( scale*(lower + np.array([eps, 0, 0])) )
    lower_doggy  = softmax( scale*(lower + np.array([0, eps, 0])) )
    left_catty   = softmax( scale*(left  + np.array([eps, 0, 0])) )
    left_horsey  = softmax( scale*(left  + np.array([0, 0, eps])) )
    right_horsey = softmax( scale*(right + np.array([0, 0, eps])) )
    right_doggy  = softmax( scale*(right + np.array([0, eps, 0])) )
    return np.vstack((lower_catty, lower_doggy, left_catty, left_horsey, right_horsey, right_doggy))
# shift the boundary points a bit toward the "correct" side
eps = 0.007
acc_data = gen_near_bound_data(eps=eps)
btarg = np.argmax(acc_data, axis=-1)
TrianglePlot2D_MPL(acc_data, targ=btarg, show_bounds=True, labels=labels, comment='100% Accuracy:').do_plot()
plt.savefig("images/acc_100.png")
# shift the boundary points a bit toward the "wrong" side (keeping labels the same as before)
inacc_data = gen_near_bound_data(eps=-eps)
ibtarg = btarg.copy()
TrianglePlot2D_MPL(inacc_data, targ=ibtarg, show_bounds=True, labels=labels, comment='0% Accuracy:').do_plot()
plt.savefig("images/acc_0.png")
Loss: distance from target (continuous)
Accuracy: % of points on the correct side of decision boundary (discontinuous)
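To make those two definitions concrete, here's a minimal sketch in plain numpy, applied to the near-boundary data just plotted. (I use squared Euclidean distance as a stand-in for "distance from target"; a real training loss would more likely be cross-entropy.)
def mse_loss(p, t):
    # Loss: mean squared distance from the one-hot target -- varies continuously
    onehots = np.eye(p.shape[1])[t]
    return np.mean(np.sum((p - onehots)**2, axis=1))

def accuracy(p, t):
    # Accuracy: fraction of points on the correct side of the boundary -- jumps discontinuously
    return np.mean(np.argmax(p, axis=1) == t)

# nearly identical losses, yet opposite accuracies:
print(mse_loss(acc_data, btarg),    accuracy(acc_data, btarg))     # accuracy -> 1.0
print(mse_loss(inacc_data, ibtarg), accuracy(inacc_data, ibtarg))  # accuracy -> 0.0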
np.random.seed(1)
prob4, targ4 = calc_prob(n=500, s=2.7, dim=4) # 4d probabilities
prob4, targ4 = np.vstack((np.eye(4),prob4)), np.hstack((np.arange(4),targ4)) # tack on poles b4 pca
prob3 = pca_proj(prob4) # use PCA for coordinate transformation to 3D hyperplane
plot = TrianglePlot3D_Plotly(prob3, targ=targ4, labels=labels+['bird'], show_labels=True, show_axes=False, poles_included=True)
plot.fig.update_layout(scene_camera=dict( eye=dict(x=1.5, y=1, z=0.7)))
plot.do_plot()
...map similar points near each other, dissimilar points far away. => Clusters:
import plotly.graph_objects as go

def noop(x): return x

def plot_clusters(dim=3, nclasses=4, nper=100, func=noop):
    np.random.seed(6)
    clusters = np.zeros((nclasses*nper, dim))
    colors = ['red', 'green', 'blue', 'orange'] + ['black']*max(nclasses-4, 0)
    labels = ['cat', 'dog', 'horse', 'bird'] + ['aux']*max(nclasses-4, 0)
    fig = go.Figure()
    for i in range(nclasses):
        mean, cov = 0.8*np.random.rand(dim), 0.002*np.eye(dim)
        cluster = func(np.random.multivariate_normal(mean, cov, nper))
        clusters[i*nper:(i+1)*nper] = cluster
        fig.add_trace( go.Scatter3d(x=cluster[:,0], y=cluster[:,1], z=cluster[:,2], hovertext=labels[i], name=labels[i],
                                    mode='markers', marker=dict(size=5, opacity=0.6, color=colors[i])) )
    fig.update_layout(margin_t=0, scene_camera=dict(eye=dict(x=0.7, y=0.7, z=0.7)))
    fig.show(config={'displayModeBar': False})
    return clusters
clusters = plot_clusters()
Like attracts like; "Opposites" repel:
Tends to group things in "semantically meaningful" ways.
This picture of springs is the essence of a "contrastive loss" function. Unlike traditional ML classification, where the loss is based on the "distance" to a "target" (or "ground truth") value, with these metric-based methods we send in two (or even three) data points together, and then either let them attract or repel each other, over and over and over, until we reach some stopping criterion. Eventually, what we'll have is a space that contains clusters of similar points, separated by a "margin" distance that we specify. (See the sketch below.)
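Here's a minimal sketch of such a pairwise contrastive loss (in the style of Hadsell et al.; the names are illustrative, not fastai's API):
def contrastive_loss(z1, z2, same_class, margin=1.0):
    d = np.linalg.norm(z1 - z2)       # distance between the two embedded points
    if same_class:
        return d**2                   # "like attracts like": pull the pair together
    return max(0.0, margin - d)**2    # "opposites repel", until they're a margin apart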
Identical twin network branches map to points ("feature vectors") in N-dim space:
Example of a Siamese Network (source: Sundin et al.)
Traditional vs. zero(/few)-shot methods: which one wins? It depends.
From high-scoring Kaggle competition entry using "entity embeddings":
"Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables."
Let's look at "PETS," via my mod of FastAI's tutorial on Siamese Networks...
So, for example, an embedding learned for grouping images of cats, dogs, and horses would likely map images of birds to their own nearby cluster in the space, even though it never saw birds during training. Then "all we have to do" to predict a class is see whether a new instance is "nearby" (according to some distance measure we choose) to other similar points. We could even take the "center point" of each cluster, regard it as a "class prototype," and use that in the future.
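As a hedged illustration, here's nearest-prototype classification in plain numpy, reusing the `clusters` array generated earlier (4 classes, 100 points each); the names here are mine, not mrspuff's or fastai's:
ctarg = np.repeat(np.arange(4), 100)   # class labels for `clusters` from above
protos = np.stack([clusters[ctarg == c].mean(axis=0) for c in range(4)])  # per-class mean = "prototype"

def predict_class(z):
    return np.argmin(np.linalg.norm(protos - z, axis=1))  # nearest prototype wins

print(predict_class(clusters[0]))  # -> 0, i.e. 'cat', for these well-separated clusters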
This fits (somewhat) with the notions of "prototypes" in human classification advanced by Eleanor Rosch in her revolutionary psychology work in the early 1970s. We can say more about this later. ;-)
This same method of contrastive losses and metrics is used not just for classification per se, but for things like photographic identity verification (an example given in Andrew Ng's Machine Learning course on Coursera). Say you want a facial recognition system (highly problematic for ethical reasons, but it's a good example of the method, so bear with me) for a company with employee turnover. You probably don't want to train a traditional classifier with a separate class for each employee, because then you'd have to re-train it every time someone joins or leaves the company. Instead, you can store an image of each employee, and when they appear in front of a camera for identity verification, compare the "distance" between the embedded data point for the new photo and the data point for the stored photo(s). If the distance is small enough, you can have confidence it's the same person.
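A minimal sketch of that verification step, assuming some trained `embed` function that maps a photo to a feature vector (the function name and threshold here are hypothetical):
def same_person(embed, stored_photo, new_photo, threshold=0.6):
    d = np.linalg.norm(embed(stored_photo) - embed(new_photo))  # distance in embedding space
    return d < threshold   # small distance -> confident it's the same person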
What's nice about this is that, after you've trained your embedding system, it can typically still be used to measure similarity between pairs of things it's never seen before, because in the process of training it was forced to learn "semantically meaningful" ways of grouping points together. This use of the linguistic word "semantic" is not accidental: the language-model systems that rely on "word embeddings" can learn to group similar words together, and even exhibit mathematical-like relationships in analogies (e.g., gender: "king - man + woman = queen", or countries-and-capitals: "Russia - Moscow + France = Paris") by treating the embedded data points as vectors that point from the origin of the coordinate system to the data point. We can say more about this and the distance metric they use ("cosine similarity") another time.
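For reference, here's cosine similarity in numpy, with the analogy trick sketched in comments (the word-vector lookup `vec` is hypothetical):
def cosine_sim(a, b):
    # cosine of the angle between two embedding vectors: 1.0 means "same direction"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# analogy sketch: query = vec('king') - vec('man') + vec('woman');
# the vocabulary word maximizing cosine_sim(query, vec(word)) should be 'queen'.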
So in using metric-based learning for classification, we're essentially taking this identity-verification approach and applying it to entire classes instead of individuals.
Built with the mrspuff library. Thanks to Zach Mueller, Tanishq Abraham, and Isaac Flath for help with fastai!