This notebook is a fun and simple illustration of the Singular Value Decomposition (SVD), a tool that is incredibly useful in many data applications, including dimensionality reduction, recommender systems, and data compression. Here we use the SVD for data compression, examining low-rank approximations of a photograph.
'''
Much thanks to this person for the tutorial on image processing
http://nbviewer.ipython.org/gist/frankcleary/4d2bd178708503b556b0
'''
import matplotlib.pyplot as plt
import numpy as np
import time
from PIL import Image
#
# Load the image, convert to grayscale ('LA' = luminance + alpha), and keep the luminance band as a matrix of pixel intensities
img = Image.open('C:/Users/kevin/Documents/GitHub/DS_course/datasets/panda.jpg').convert('LA')
imgmat = np.array(list(img.getdata(band=0)), float)
imgmat.shape = (img.size[1], img.size[0])
imgmat = np.matrix(imgmat)
We've converted the image to grayscale and represented it as a matrix whose entries are the numeric pixel intensities. Let's plot it just to show that it works.
plt.imshow(imgmat, cmap='gray')
plt.show()
The SVD is easy to compute with one line of Python code. Remember, for an $N \times M$ matrix $X$, we want the decomposition $X=U \Sigma V^T$,
where $U$ is an $N \times N$ orthogonal matrix, $\Sigma$ is an $N \times M$ diagonal matrix containing the singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq 0$, and $V$ is an $M \times M$ orthogonal matrix.
(Tip: The following function returns $V^T$, not $V$, and returns the singular values as a 1-D array rather than the full matrix $\Sigma$.)
U, sig, Vt = np.linalg.svd(imgmat)
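As a quick sanity check (a minimal sketch reusing the `imgmat`, `U`, `sig`, and `Vt` defined above), we can confirm the shapes of the factors and verify that multiplying them back together reproduces the original matrix:
# U is N x N, sig is a 1-D array of the min(N, M) singular values, Vt is M x M
print(U.shape, sig.shape, Vt.shape)
# Rebuild X from the factors and compare it to the original pixel matrix
r = len(sig)
reconstructed = np.matrix(U[:, :r]) * np.diag(sig) * np.matrix(Vt[:r, :])
print(np.allclose(imgmat, reconstructed))  # should print True (up to floating-point error)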
Sometimes it is useful to look at the relative magnitudes of the singular values to see how much redundancy there is in the data. Recall that the square root of the total sum-of-squares of the matrix entries (also known as the Frobenius norm
$\|X\|_F=\sqrt{\sum\limits_{i=1}^N \sum\limits_{j=1}^M x_{ij}^2}$) is equal to the same quantity computed from the singular values.
I.e., $\|X\|_F=\sqrt{\sum\limits_{i=1}^{\min(M,N)} \sigma_i^2}$.
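A one-line numerical check of this identity (a small sketch using the `imgmat` and `sig` computed above):
# The Frobenius norm of the entries equals the root-sum-of-squares of the singular values
print(np.linalg.norm(imgmat, 'fro'), np.sqrt(np.sum(sig**2)))  # the two numbers should match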
In the section below, we plot the square root of the sum-of-squares of the first $k$ singular values, normalized by $\|X\|_F$. I.e., $\frac{\sqrt{\sum\limits_{i=1}^k \sigma_i^2}}{\|X\|_F}$
#Plot the spectrum of the image and the cumulative sum-of-squares captured by the first k singular values
norm = np.sqrt(sum(sig*sig))  # Frobenius norm of the image matrix
energy_k = np.sqrt(np.cumsum(sig*sig)) / norm  # normalized cumulative "energy" at each k
plt.figure()
ax1 = plt.subplot(211)
plt.plot(sig)
plt.title('Kth Singular Value')
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax2 = plt.subplot(212)
plt.plot(energy_k)
plt.title('Normalized Cumulative Energy of Kth Singular Value')
ax2.set_xlabel('Kth Singular Value')
plt.show()
If we call the normalized value of $\sqrt{\sum\limits_{i=1}^k \sigma_i^2}$ the "cumulative energy" of the matrix, we can see that for this image, most of the "energy" or information in the matrix can be represented with just a few (i.e., way less than 200) singular values/vectors.
To illustrate this principle, we can form a low-rank approximation of the matrix $X$, which we will call $X_k$. This is defined as $X_k=U_k \Sigma_k V_k^T$, where $U_k$ is the first $k$ columns of $U$, $\Sigma_k$ is the top-left $k \times k$ block of $\Sigma$, and $V_k^T$ is the first $k$ rows of $V^T$.
A well-known result, the Eckart-Young-Mirsky theorem, states that among all rank-$k$ matrices, the truncated SVD $X_k$ minimizes the sum-of-squares of the entries of $X-X_k$. Further, this minimum error (i.e., the Frobenius norm $\|X-X_k\|_F$) is given by $\sqrt{\sum\limits_{i=k+1}^{\min(M,N)} \sigma_i^2}$.
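We can verify the error formula numerically at, say, rank 25 (a sketch reusing `U`, `sig`, `Vt`, and `imgmat` from above; the choice of $k=25$ is arbitrary):
k = 25
Xk = np.matrix(U[:, :k]) * np.diag(sig[:k]) * np.matrix(Vt[:k, :])
err_direct = np.linalg.norm(imgmat - Xk, 'fro')  # ||X - X_k||_F computed entry-by-entry
err_formula = np.sqrt(np.sum(sig[k:]**2))        # square root of the sum of the discarded sigma_i^2
print(err_direct, err_formula)                   # the two values should agree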
The bottom plot above then shows $\frac{\sqrt{\sum\limits_{i=1}^k \sigma_i^2}}{\|X\|_F}=\sqrt{1-\frac{\|X-X_k\|_F^2}{\|X\|_F^2}}$ for each value $k$. We can see that for this given photo, we can get most of the information from the first dozen or so singular values. Let's see if our eyes agree with this!
#Now work on an approximate image
def plotApprox(U, sig, Vt, k):
    # Reconstruct the rank-k approximation X_k = U_k * Sigma_k * V_k^T and display it
    reconstimg = np.matrix(U[:, :k]) * np.diag(sig[:k]) * np.matrix(Vt[:k, :])
    plt.imshow(reconstimg, cmap='gray');
def plotSeq(U, sig, Vt, seq_k):
    # Plot up to 8 rank-k approximations side by side
    fig = plt.figure()
    for i, k in enumerate(seq_k[:8]):
        ax = fig.add_subplot(2, 4, i+1)
        plotApprox(U, sig, Vt, k)
        plt.title('Rank {}'.format(k))
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
plotSeq(U,sig,Vt,[1,2,3,4,10,25,50,200])
plt.show()
#plotApprox(U, sig, Vt, 10)
It looks like with a rank of just 4 (out of 200) we can tell it's an animal (but not what kind). At about 10-25 you might think that they are pandas, but at 50 it becomes clear that they are actually crocodiles.
And why do we bother with this? If you can get away with a low-rank approximation where $k$ is much smaller than $M$ and $N$, you can save a considerable amount of space for storing, using, or sending data: instead of the $NM$ entries of $X$, you only need the $k(N+M+1)$ numbers that make up $U_k$, the first $k$ singular values, and $V_k^T$.
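As a rough illustration (reusing `imgmat` from above; the rank $k=25$ is again an arbitrary choice), we can compute the storage ratio directly:
#Compare the number of values needed for the full image vs. a rank-k approximation
N, M = imgmat.shape
k = 25
full_size = N * M              # entries of the original matrix
approx_size = k * (N + M + 1)  # U_k (N*k values) + k singular values + V_k^T (k*M values)
print('Rank-{} storage is {:.1%} of the original'.format(k, float(approx_size) / full_size))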