#!/usr/bin/env python
# coding: utf-8

# # Weekend Project: Principal Component Analysis (PCA)
#
# The goal of principal component analysis (PCA) is to take a set of data in $D$-dimensional space and compress it into as few dimensions as possible while still retaining the maximum amount of information. The reduced dimensions attempt to maximize retained information, measured as explained variance, through linear combinations of the original features. Successful applications of PCA allow us to greatly reduce the problem space, in particular for data with many superfluous dimensions, where most of the variation can be explained through a few orthogonal linear combinations of features.
#
# Through PCA we create a matrix that projects our original data $\mathbf{x}$ onto a $k$-dimensional subspace where (ideally) our data's important characteristics still remain. Our projection matrix $\mathbf{U}_a$ is defined by the $k$ leftmost columns of the matrix $\mathbf{U}$ (more detail below), where $k$ is the number of selected principal components. This process is analogous to lossy compression: the projection onto a lower-dimensional subspace provides significant computational benefit or allows for simpler modeling, but we will be unable to fully recover the original data after decompression. In the limiting case, when we use all the available dimensions, decompressing fully recovers the original data, but no dimensionality reduction has been accomplished.
#
# ## Computation of PCA and Inverse PCA
#
# For now, it is only important to know that each column of $\mathbf{U}$ is an eigenvector of the feature correlation matrix of $\mathbf{x}$, and that the columns are orthonormal. We justify this statement in the following section.
#
# For the block matrix
# $$\mathbf{U} = \begin{bmatrix}
# \overbrace{\mathbf{U}_{a}}^{\text{PC's}} & \overbrace{\mathbf{U}_{b}}^{\text{excluded PC's}}
# \end{bmatrix}
# $$
# we can create a projection $\mathbf{p}$ of our original data vector $\mathbf{x} \in \mathbb{R}^{D \times 1}$ onto our new subspace through the operation
#
# $$ \mathbf{p}=\mathbf{x}^T \mathbf{U}_a \tag{1.1}$$
#
# We posit that our transformation matrix $\mathbf{U} \in \mathbb{R}^{D \times D}$ is an orthonormal basis, and therefore orthogonal, such that $\mathbf{U}\mathbf{U}^T=\mathbf{I}_D$ ($\mathbf{I}_D$ is the $D$-dimensional identity matrix). This property implies that by adding more and more PC's, $\mathbf{U}_a \rightarrow \mathbf{U}$ and $\mathbf{U}_a \mathbf{U}_a^T \rightarrow \mathbf{I}_D$. We will use this fact to compute the inverse PCA transformation of our projected data $\mathbf{p}$, through the operation:
#
# $$\hat{\mathbf{x}} = \mathbf{p}\mathbf{U}_a^T \tag{1.2}$$
#
# Under a well-suited PCA use case, $\hat{\mathbf{x}}$, our reconstructed data point, will be close to $\mathbf{x}$ even when we only use a small number of columns to build $\mathbf{U}_a$. We can measure this distance, known as the 'reconstruction error', through the mean squared error (MSE) over our $N$ data vectors:
#
# $$MSE=\frac{\sum_{i=1}^{N}\|\hat{\mathbf{x}}_{i}-\mathbf{x}_{i}\|^2}{N}$$
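# Before deriving $\mathbf{U}$, here is a minimal NumPy sketch of eqs $(1.1)$ and $(1.2)$ on synthetic data, assuming a centered data matrix with one sample per row; the names (`X`, `U_a`, `k`) are illustrative and are not part of the implementation further below.

# In[ ]:


import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))     # 500 samples, D = 10 features
X = X - X.mean(axis=0)             # center the data

# eigendecomposition of the feature covariance (columns of U are eigenvectors)
eigvals, U = np.linalg.eigh(np.cov(X.T))
order = np.argsort(eigvals)[::-1]  # sort eigenvalues in descending order
U = U[:, order]

k = 3
U_a = U[:, :k]                     # keep the k leading principal components

P = X @ U_a                        # eq (1.1): project each sample onto the subspace
X_hat = P @ U_a.T                  # eq (1.2): reconstruct from the projections

mse = np.mean(np.sum((X_hat - X) ** 2, axis=1))
print(f"reconstruction MSE with k={k}: {mse:.4f}")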
# ## Mathematical Derivation
#
# PCA can be achieved through the eigendecomposition of the feature correlation matrix, and this explained-variance maximization is equivalent to the minimization of the MSE. The proof is as follows:
#
# Start with a vector $\mathbf{x}$ and its reconstruction $\text{PCA}(\mathbf{x}) = \mathbf{p} \rightarrow \text{Inverse PCA}(\mathbf{p}) = \hat{\mathbf{x}}$. To minimize the MSE over our $N$ samples we set up the following unconstrained optimization:
#
# $$\text{minimize} \quad {\frac{\sum_{i=1}^{N}||\hat{x_i}-x_i||^2}{N}}$$
# where $\hat{x_i}$ equals the projection of $x_i$ onto the unit vector $\mathbf{u} \in \mathbf{U}_a$, multiplied by $\mathbf{u}$, i.e.:
# $$\hat{x_i} = (x_i \cdot \mathbf{u}) \mathbf{u}$$
# The squared error for a single sample therefore becomes
#
# $$||\hat{x_i}-x_i||^2 = {||(x_i \cdot \mathbf{u}) \mathbf{u} - x_i||^2}$$
# $$= ((x_i \cdot \mathbf{u}) \mathbf{u} - x_i)\cdot((x_i \cdot \mathbf{u}) \mathbf{u} - x_i) $$
# $$= ||(x_i \cdot \mathbf{u}) \mathbf{u}||^2 \underbrace{- x_i \cdot (x_i \cdot \mathbf{u}) \mathbf{u} - (x_i \cdot \mathbf{u}) \mathbf{u} \cdot x_i}_{\text{rearrange dot products} \rightarrow -2(x_i \cdot \mathbf{u})(x_i \cdot \mathbf{u})=-2(x_i \cdot \mathbf{u})^2} + \underbrace{x_i\cdot x_i}_{=||x_i||^2} $$
# $$= (x_i \cdot \mathbf{u})^2 \underbrace{\mathbf{u} \cdot \mathbf{u}}_{||\mathbf{u}||^2=1^2=1} - 2(x_i \cdot \mathbf{u})^2 + ||x_i||^2$$
# $$= (x_i \cdot \mathbf{u})^2 - 2(x_i \cdot \mathbf{u})^2 + ||x_i||^2$$
# $$= ||x_i||^2 - (x_i \cdot \mathbf{u})^2$$
#
# Over all terms, the mean squared error is then
#
# $$\frac{\sum_{i=1}^{N}||x_i||^2 - (x_i \cdot \mathbf{u})^2}{N} $$
# $$ = \frac{\sum_{i=1}^{N}||x_i||^2}{N} - \frac{\sum_{i=1}^{N}(x_i \cdot \mathbf{u})^2}{N} \tag{2.1}$$
#
# Note that the first term in eq. $(2.1)$ is always non-negative and does not depend on our choice of $\mathbf{U}_a$, meaning that the problem
# $$\text{minimize} \quad {\frac{\sum_{i=1}^{N}||\hat{x_i}-x_i||^2}{N}}$$
# is equivalent to maximizing the second term in eq. $(2.1)$:
# $$\text{maximize} \quad \frac{\sum_{i=1}^{N}(x_i \cdot \mathbf{u})^2}{N} \tag{2.2}$$
# Here we make note of the variance formula for a vector $\mathbf{v}$: $Var(\mathbf{v})=E[\mathbf{v}^2]-E[\mathbf{v}]^2$. For our row-stacked, centered data matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$ and a single principal direction $\mathbf{u}$, the vector of projections is
# $\mathbf{p} = \mathbf{X} \mathbf{u} = \begin{bmatrix}
# x_1 \cdot \mathbf{u}\\
# x_2 \cdot \mathbf{u}\\
# ... \\
# x_N \cdot \mathbf{u}\\
# \end{bmatrix} $ where, because the data has been centered so that $E[x_i]=0$,
# $E[\mathbf{p}] = \begin{bmatrix}
# E[x_1 \cdot \mathbf{u}]\\
# E[x_2 \cdot \mathbf{u}]\\
# ... \\
# E[x_N \cdot \mathbf{u}]\\
# \end{bmatrix} = \begin{bmatrix}
# E[x_1] \cdot \mathbf{u}\\
# E[x_2] \cdot \mathbf{u}\\
# ... \\
# E[x_N] \cdot \mathbf{u}\\
# \end{bmatrix} = \begin{bmatrix}
# 0\\
# 0\\
# ... \\
# 0\\
# \end{bmatrix} $
# so the variance of the projections is exactly $E[\mathbf{p}^2]$, which is exactly what eq. $(2.2)$ is maximizing.
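# As a quick numerical check of eq. $(2.1)$: for any unit vector $\mathbf{u}$, the mean squared reconstruction error plus the mean squared projection equals the (constant) mean squared norm of the samples, so minimizing one is the same as maximizing the other. This is a minimal sketch on synthetic data; the names (`X`, `u`) are illustrative only.

# In[ ]:


import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
X = X - X.mean(axis=0)                          # centered data matrix, one sample per row

mean_sq_norm = np.mean(np.sum(X ** 2, axis=1))  # first term of eq (2.1): constant in u

for _ in range(3):
    u = rng.normal(size=5)
    u = u / np.linalg.norm(u)                   # random unit vector
    proj = X @ u                                # projections x_i . u
    X_hat = np.outer(proj, u)                   # reconstructions (x_i . u) u
    mse = np.mean(np.sum((X_hat - X) ** 2, axis=1))
    var = np.mean(proj ** 2)                    # second term of eq (2.1) / eq (2.2)
    print(f"MSE + projected variance = {mse + var:.6f} (mean squared norm = {mean_sq_norm:.6f})")

# previewing the result derived next: the leading eigenvector of the sample
# covariance attains the largest projected variance of any unit vector
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
u_star = eigvecs[:, -1]                         # eigh sorts eigenvalues in ascending order
print(f"projected variance along top eigenvector = {np.mean((X @ u_star) ** 2):.6f}")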
# ### Maximizing Variance/Minimizing MSE
# For a single vector $\mathbf{u}$:
# $$\text{maximize} \quad \sigma^2=\frac{\sum_{i=1}^{N}(x_i \cdot \mathbf{u})^2}{N}$$
# $$\textrm{s.t.} \quad \mathbf{u}^T\mathbf{u}=1$$
# Our constraint ensures $\mathbf{u}$, our projection vector, is of unit length.
#
# We start by restating our cost function as
# $$\sigma^2=\frac{\sum_{i=1}^{N}(x_i \cdot \mathbf{u})^2}{N} = \frac{(\mathbf{X}\mathbf{u})^T(\mathbf{X}\mathbf{u})}{N} = \frac{\mathbf{u}^T\mathbf{X}^T\mathbf{X}\mathbf{u}}{N} \tag{2.3}$$
# We can make use of the fact that our data is zero mean and that, for a zero-mean random vector $\mathbf{y}$, its correlation and covariance matrices coincide:
# $$R_y = E[(\mathbf{y}-\mu_y)(\mathbf{y}-\mu_y)^T] = E[\mathbf{y}\mathbf{y}^T]$$
# For our centered data, the feature correlation matrix is estimated by the sample average
# $$R_{\mathbf{x}} = E[\mathbf{x}\mathbf{x}^T] \approx \frac{\mathbf{X}^T\mathbf{X}}{N}$$
# Thus, eq. $(2.3)$ simplifies to
# $$\mathbf{u}^TR_{\mathbf{x}}\mathbf{u}$$
# To optimize, we write the Lagrangian as follows:
# $$\mathbb{L} = \mathbf{u}^TR_{\mathbf{x}}\mathbf{u} - \nu(\mathbf{u}^T\mathbf{u}-1)$$
# where nu, $\nu$, is our Lagrange multiplier for the equality constraint. We differentiate with respect to $\mathbf{u}$ and set to zero to find the optimum:
#
# $$\frac{\partial\mathbb{L}}{\partial{\mathbf{u}}}=\frac{\partial{\mathbf{u}^TR_{\mathbf{x}}\mathbf{u}}}{\partial{\mathbf{u}}}-\frac{\partial{(\nu \mathbf{u}^T\mathbf{u}-\nu)}}{\partial{\mathbf{u}}}=2R_{\mathbf{x}}\mathbf{u}-2\nu \mathbf{u}=0$$
#
# $$\therefore R_{\mathbf{x}}\mathbf{u}=\nu \mathbf{u}$$
#
# We know that for a scalar $\lambda$, a matrix $A$ and vector $\mathbf{v}$, if $A\mathbf{v}=\lambda \mathbf{v}$ then $\mathbf{v}$ is an eigenvector of $A$ with corresponding eigenvalue $\lambda$. At any such point the objective equals $\mathbf{u}^TR_{\mathbf{x}}\mathbf{u} = \nu\mathbf{u}^T\mathbf{u} = \nu$, so the eigenvector of $R_{\mathbf{x}}$ with the largest corresponding eigenvalue $\nu$ maximizes our optimization problem. Q.E.D.
#
# Helpful resources:
# - Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
# - https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf

# In[1]:


import pandas as pd
import scipy.io
import numpy as np
import glob
import os
from PIL import Image
from tqdm import tqdm
import warnings
import matplotlib.pyplot as plt
import json

with open('dir.json', 'r') as f:
    root = json.load(f)['root']

assert os.path.exists(root)


# The code should work on any dataset of png images laid out as follows. Simply update `label_names` in the Dataset constructor below.
#
# +-- root
# | +-- label_1
# | +-- +-- sample_c1_1.png
# | +-- +-- sample_c1_2.png
# | +-- +-- ...
# | +-- label_2
# | +-- +-- sample_c2_1.png
# | +-- +-- sample_c2_2.png
# | +-- +-- ...
# | ...
# | +-- label_N
# | +-- +-- sample_cN_1.png
# | +-- +-- sample_cN_2.png
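# As an optional sanity check of this layout (using the default label names from the `Dataset` class below), we can count the png files found under each label directory:

# In[ ]:


# Illustrative only: counts the .png files per label directory under `root`.
for label in ['circle', 'square', 'star', 'triangle']:
    n_files = len(glob.glob(os.path.join(root, label, '*.png')))
    print(f"{label}: {n_files} png files")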
# In[2]:


class Dataset:
    def __init__(
        self,
        root,
        img_shape=200,
        label_names={
            'circle': 0,
            'square': 1,
            'star': 2,
            'triangle': 3,
        },
        downsample=True,
        downsample_granularity=4
    ):
        if img_shape % downsample_granularity:
            warnings.warn("Image shape is not divisible by downsample granularity. Sampling may be off-center")
        self.root = root
        self.fps = []
        self.data = []
        self.labels = []
        self.downsample = downsample

        # center downsampling: keep the center pixel of each
        # downsample_granularity x downsample_granularity block
        downsampled_idxs = np.arange(0, img_shape, downsample_granularity) + downsample_granularity // 2
        self.downsample_grid = tuple(np.meshgrid(downsampled_idxs, downsampled_idxs))

        # store map of shapes
        self.label_names = label_names

        # with the appropriate directory structure, this will read all the data filepaths
        for label in self.label_names:
            fps_of_class_i = glob.glob(self.root + f'/{label}/*.png')
            self.fps += fps_of_class_i
            # attach label to it
            self.labels += [self.label_names[label]] * len(fps_of_class_i)
        self.labels = np.array(self.labels)

    def downsample_img(self, img):
        """
        Simple downsampling for computational gain
        """
        return img[self.downsample_grid]

    def load_data(self):
        # read each image from file
        for fp in tqdm(self.fps):
            # convert png to np array
            img = np.asarray(Image.open(fp))
            if self.downsample:
                img = self.downsample_img(img)
            self.data.append(img.flatten())
        self.data = np.array(self.data)
        # ensure equal # of labels and data
        assert self.data.shape[0] == self.labels.shape[0]

    def sample(self, n=100):
        """
        provide a size n random sample of data and labels from dataset
        args:
            - n (int): sample size
        """
        indices = np.random.choice(len(self.data), n)
        return self.data[indices], self.labels[indices]

    def __getitem__(self, i):
        """
        For easy indexing
        """
        return self.data[i], self.labels[i]

    def show(self, i=None, img=None):
        """
        show image of sample
        args:
            - i (int): index in dataset
        """
        if i is None and img is None:
            i = np.random.choice(self.data.shape[0])
        if img is None:
            img = self[i][0]
        square_img = img.reshape(int(np.sqrt(img.shape[0])), -1)
        Image.fromarray(square_img).show()
        return
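
# A toy check (hypothetical 8x8 "image", granularity 4) of the center-downsampling
# index trick used in Dataset above: the kept row/column indices are [2, 6], the
# centers of each 4x4 block, so the downsampled result is 2x2.
_toy_shape, _granularity = 8, 4
_toy_idxs = np.arange(0, _toy_shape, _granularity) + _granularity // 2  # array([2, 6])
_toy_grid = tuple(np.meshgrid(_toy_idxs, _toy_idxs))
_toy_img = np.arange(_toy_shape * _toy_shape).reshape(_toy_shape, _toy_shape)
assert _toy_img[_toy_grid].shape == (2, 2)
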
class PCA:
    def __call__(self, data):
        """
        eq 1.1 in description. Center the data with the mean learned in fit,
        then project onto the selected principal components.

        Args:
            - data (np.array): data to be projected

        P = (x - mean) @ U_a
        """
        return np.matmul(data - self.data_mean, self.U_a)

    def __inverse__(self, data):
        """
        eq 1.2 in description. Invert the transformation and re-add the
        original data's mean.

        Args:
            - data (np.array): projections to be transformed back,
              x_hat = P @ U_a^T + mean
        """
        return np.matmul(data, self.U_a.T) + self.data_mean

    def __init__(self, num_components=3):
        """
        Constructor

        Args:
            - num_components (int): number of selected principal components, 'k' in our derivation
        """
        self.num_components = num_components

    def fit(self, dataset, verbose=True):
        """
        Calculate the feature covariance matrix R_x and eigendecompose it.

        Original drawback: very computationally expensive if not optimized,
        i.e. doing it manually like I am:
        MemoryError: Unable to allocate 23.8 GiB for an array with shape
        (40000, 40000) and data type complex128
        We brute-force solved this by downsampling.

        Args:
            - dataset (np.array): data to be de-meaned and fit
            - verbose (bool): print progress
        """
        self.data_mean = dataset.mean(axis=0, keepdims=True)

        if verbose:
            print("Centering Data at 0 across all features")
        # make data zero-mean
        dataset = dataset - self.data_mean
        # transpose data because we're correlating features, not samples
        dataset = dataset.T

        if verbose:
            print("Performing Eigendecomposition on Covariance Matrix")
        # L eigenvalues, U eigenvectors; eigh handles the symmetric covariance
        # matrix (real output), and we sort in descending eigenvalue order so
        # the leading principal components come first
        L, U = np.linalg.eigh(np.cov(dataset))
        order = np.argsort(L)[::-1]
        self.L, self.U = L[order], U[:, order]

        # all eigenvectors should be unit length
        if verbose:
            print("Norms of first and last eigenvector (should be ~ 1):")
            print(np.matmul(self.U[:, 0].T, self.U[:, 0]))
            print(np.matmul(self.U[:, -1].T, self.U[:, -1]))

        # k-leftmost columns of our eigenvector matrix
        self.U_a = self.U[:, :self.num_components]


# In[3]:


dataset = Dataset(root, downsample_granularity=4)


# In[4]:


dataset.load_data()


# In[5]:


sample = dataset.sample(1600)[0]

pca = PCA(num_components=3)
pca.fit(sample)


# In[6]:


plt.plot(np.cumsum(pca.L) / np.sum(pca.L))
plt.title("Cumulative explained variance vs. number of components")
plt.show()


# In[7]:


import plotly.graph_objects as go

fig = go.Figure(layout=dict(title='PCA Transformed Data'))

for key, val in dataset.label_names.items():
    # select all samples of this class and project them onto the principal components
    sub_data, _ = dataset[dataset.labels == val]
    pca_transformed = pca(sub_data)
    fig.add_trace(go.Scatter3d(
        x=pca_transformed[:, 0],
        y=pca_transformed[:, 1],
        z=pca_transformed[:, 2],
        name=key,
        mode='markers'
    ))

fig.show()
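# As a quick check of eq. $(1.2)$ and the reconstruction error described in the introduction, we can invert the projections of `sample` and report the mean squared reconstruction error for the chosen number of components.

# In[ ]:


projections = pca(sample)
reconstructions = pca.__inverse__(projections)
reconstruction_mse = np.mean(np.sum((reconstructions - sample) ** 2, axis=1))
print(f"Mean squared reconstruction error with {pca.num_components} components: {reconstruction_mse:.2f}")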