#!/usr/bin/env python
# coding: utf-8

# In[2]:

get_ipython().run_line_magic('matplotlib', 'inline')


# # Linear Algebra
# 
# * References: The Matrix Cookbook (Petersen and Pedersen, 2006) or Shilov (1977)

# ## Mathematical Objects of Linear Algebra
# 
# * Scalars
# 
#     - A single number
#     - Written as lower-case variable names in italics: $x$
#     - $s \in \mathbb{R}$
#     - $n \in \mathbb{N}$
# 
# * Vectors
# 
#     - An array of numbers
#     - Written as lower-case names in bold typeface, for example: $\pmb{x}$
#     - The first element of $\pmb{x}$ is $x_{1}$, the second element is $x_{2}$, and so on.
#     - If the length of the vector is $n$ and each element is in $\mathbb{R}$, then the vector lies in the set formed by taking the Cartesian product of $\mathbb{R}$ $n$ times, denoted $\mathbb{R}^n$.
#     - $$\pmb{x} = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}$$
#     - We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
#     - If we need to access $x_{1}$, $x_{3}$, and $x_{6}$, we define the set $S=\{1,3,6\}$ and write $\pmb{x}_{S}$.
#     - $\pmb{x}_{-1}$ is the vector containing all elements of $\pmb{x}$ except for $x_{1}$.
#     - $\pmb{x}_{-S}$ is the vector containing all elements of $\pmb{x}$ except for $x_{1}$, $x_{3}$, and $x_{6}$.
# 
# * Matrices
# 
#     - A 2-D array of numbers
#     - Written as upper-case variable names in bold typeface: $\pmb{A}$
#     - $\pmb{A} \in \mathbb{R}^{m \times n}$ has a height of $m$ and a width of $n$.
#     - $A_{1,1}$ : the upper-left entry of $\pmb{A}$
#     - $A_{i,:}$ : the *i*-th *row* of $\pmb{A}$
#     - $$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}$$
#     - The *transpose* mirrors the matrix across its main diagonal:
#       $$\pmb{A} = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \\ a_{3,1} & a_{3,2} \end{bmatrix} \Rightarrow \pmb{A}^{T} = \begin{bmatrix} a_{1,1} & a_{2,1} & a_{3,1} \\ a_{1,2} & a_{2,2} & a_{3,2} \end{bmatrix}$$
# 
# * Tensors
# 
#     - An array with more than two axes.
#     - $A_{i,j,k}$ : the element of $\mathbf{A}$ at coordinates $(i, j, k)$

# ## Operations
# 
# * Adding
# 
#     - $\pmb{C}=\pmb{A}+\pmb{B} \Rightarrow C_{i,j} = A_{i,j} + B_{i,j}$
#     - $\pmb{D}=a \cdot \pmb{B} + c \Rightarrow D_{i,j}=a \cdot B_{i,j} + c$
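# A minimal sketch of these objects and the element-wise operations above, using NumPy (the choice of NumPy and the particular arrays are illustrative, not part of the notes):

# In[ ]:

import numpy as np

x = np.array([1.0, 2.0, 3.0])            # a vector in R^3
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])               # a 3 x 2 matrix
T = np.zeros((2, 3, 4))                  # a tensor with three axes

print(x[0])           # x_1 (NumPy indexes from 0)
print(A.T)            # transpose: a 2 x 3 matrix
print(A + A)          # element-wise addition, C_ij = A_ij + B_ij
print(2.0 * A + 1.0)  # D = a * B + c, applied element-wise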
# * Multiplying Matrices and Vectors
# 
#     - If $\pmb{A}$ is of shape $m \times n$ and $\pmb{B}$ is of shape $n \times p$, then $\pmb{C}$ is of shape $m \times p$.
#     - $\pmb{C}=\pmb{AB} \Rightarrow C_{i,j}=\sum_{k} A_{i,k} B_{k,j}$
#     - *dot product* : $\pmb{x} \cdot \pmb{y} = \pmb{x}^{T} \pmb{y}$
#     - $\pmb{A}(\pmb{B} + \pmb{C})=\pmb{AB} + \pmb{AC}$
#     - $\pmb{A}(\pmb{BC})=(\pmb{AB})\pmb{C}$
#     - $\pmb{AB} \neq \pmb{BA}$ in general (matrix multiplication is not commutative)
#     - $(\pmb{AB})^{T}=\pmb{B}^{T} \pmb{A}^{T}$
#     - $$s \pmb{A}=\begin{bmatrix} sa_{1,1} & \dots & sa_{1,n} \\ \vdots & \ddots & \vdots \\ sa_{m,1} & \dots & sa_{m,n} \end{bmatrix}$$
#     - $\pmb{Ax}=\pmb{b}$
# 
# $$\begin{align} \pmb{A}_{1,:}\pmb{x}&=b_{1} \\ \pmb{A}_{2,:}\pmb{x}&=b_{2} \\ &\vdots \\ \pmb{A}_{m,:}\pmb{x}&=b_{m} \end{align}$$
# 
# or, more explicitly:
# 
# $$\begin{align} a_{1,1}x_{1} + a_{1,2}x_{2} + \dots + a_{1,n}x_{n}&=b_{1} \\ a_{2,1}x_{1} + a_{2,2}x_{2} + \dots + a_{2,n}x_{n}&=b_{2} \\ &\vdots \\ a_{m,1}x_{1} + a_{m,2}x_{2} + \dots + a_{m,n}x_{n}&=b_{m} \end{align}$$

# ## Identity and Inverse Matrices
# 
# * Identity matrix : $\forall \pmb{x} \in \mathbb{R}^n$, $\pmb{I}_{n} \pmb{x} = \pmb{x}$
# * Inverse matrix : $\pmb{A}^{-1} \pmb{A} = \pmb{I}_{n}$
# * $$\begin{align} \pmb{A} \pmb{x} &= \pmb{b} \\ \pmb{A}^{-1} \pmb{Ax} &= \pmb{A}^{-1} \pmb{b} \\ \pmb{I}_{n} \pmb{x} &= \pmb{A}^{-1} \pmb{b} \\ \pmb{x} &= \pmb{A}^{-1} \pmb{b} \end{align}$$

# ## Linear Dependence, Span, and Rank
# 
# * $$\pmb{Ax} = \sum_{i} x_{i} \pmb{A}_{:,i}$$
# 
# In general, this kind of operation is called a *linear combination*: each vector in a set is multiplied by a scalar coefficient and the results are added,
# 
# $$\sum_{i} c_{i} \pmb{v}^{(i)}$$
# 
# * The *span* of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
# 
# * Determining whether $\pmb{Ax}=\pmb{b}$ has a solution thus amounts to testing whether $\pmb{b}$ is in the span of the columns of $\pmb{A}$. This particular span is known as the *column space* or the *range* of $\pmb{A}$.
# 
# * In order for the system $\pmb{Ax}=\pmb{b}$ to have a solution for all values of $\pmb{b} \in \mathbb{R}^{m}$, we therefore require that the column space of $\pmb{A}$ be all of $\mathbb{R}^{m}$. This implies that $\pmb{A}$ must have at least $m$ columns, i.e. $n \ge m$.
# 
# * Having $n \ge m$ is only a necessary condition for every point to have a solution; it is not a sufficient condition, because some of the columns may be redundant. Consider, for example, a $2 \times 2$ matrix whose two columns are equal to each other.
# 
# * Formally, this kind of redundancy is known as *linear dependence*. A set of vectors is *linearly independent* if no vector in the set is a linear combination of the other vectors.

# ## Norms
# 
# * *$L^{p}$ norm*
# 
# $$||\pmb{x}||_{p} = \biggl( \sum_{i} |x_i|^p \biggr)^{\frac{1}{p}}$$
# 
# for $p \in \mathbb{R}$, $p \ge 1$
# 
# * Norms, including the $L^{p}$ norm, are functions mapping vectors to non-negative values, satisfying the following properties, which make them behave like distances between points:
# 
# $$\begin{align} & f(\pmb{x})=0 \Rightarrow \pmb{x} = \pmb{0} \\ & f(\pmb{x}+\pmb{y}) \le f(\pmb{x}) + f(\pmb{y}) \quad (\text{the triangle inequality}) \\ & \forall \alpha \in \mathbb{R}, f(\alpha \pmb{x}) = |\alpha|f(\pmb{x}) \end{align}$$
# 
# * The $L^{2}$ norm, with $p = 2$, is known as the *Euclidean norm*. It is simply the Euclidean distance from the origin to the point identified by $\pmb{x}$.
# 
# * The $L^{1}$ norm, $||\pmb{x}||_{1}=\sum_{i}|x_{i}|$, is commonly used in machine learning when the difference between zero and non-zero elements is very important.
# 
# * The $L^{\infty}$ norm, also known as the *max norm*:
# 
# $$||\pmb{x}||_{\infty}=\max_{i}|x_{i}|$$
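# A small numerical check of the identities above, again using NumPy as an illustrative (not prescribed) tool; the matrices and vectors below are arbitrary examples:

# In[ ]:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[0.0, 1.0],
              [4.0, 2.0]])
b = np.array([1.0, 5.0])

# C_ij = sum_k A_ik B_kj, and (AB)^T = B^T A^T
C = np.array([[sum(A[i, k] * B[k, j] for k in range(2)) for j in range(2)]
              for i in range(2)])
print(np.allclose(A @ B, C))
print(np.allclose((A @ B).T, B.T @ A.T))

# Solve Ax = b; np.linalg.solve avoids forming A^{-1} explicitly
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))
print(np.allclose(x, np.linalg.inv(A) @ b))   # same result via the explicit inverse

# L1, L2 (Euclidean), and max norms of a vector
v = np.array([3.0, -4.0])
print(np.linalg.norm(v, 1))        # 7.0
print(np.linalg.norm(v, 2))        # 5.0
print(np.linalg.norm(v, np.inf))   # 4.0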
# * Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure *Frobenius norm*:
# 
# $$||\pmb{A}||_{F}=\sqrt{\sum_{i,j} A_{i,j}^2}$$
# 
# * The dot product of two vectors can be rewritten in terms of norms:
# 
# $$\pmb{x}^T \pmb{y} = ||\pmb{x}||_{2} ||\pmb{y}||_{2} \cos{\theta}$$
# 
# where $\theta$ is the angle between $\pmb{x}$ and $\pmb{y}$.

# ## Special Kinds of Matrices and Vectors
# 
# * A diagonal matrix $\pmb{D}$ has $D_{i,j}=0$ for all $i \ne j$. The identity matrix, for example, is diagonal.
# 
# * $diag(\pmb{v})$ : denotes a square diagonal matrix whose diagonal entries are given by the entries of the vector $\pmb{v}$.
# 
# * A symmetric matrix is any matrix that is equal to its own transpose:
# 
# $$\pmb{A}=\pmb{A}^T$$
# 
# * A unit vector is a vector with unit norm:
# 
# $$||\pmb{x}||_{2}=1$$
# 
# * A vector $\pmb{x}$ and a vector $\pmb{y}$ are *orthogonal* to each other if $\pmb{x}^{T} \pmb{y}=0$.
# 
# * If the vectors are not only orthogonal but also have unit norm, we call them *orthonormal*.
# 
# * An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:
# 
# $$\pmb{A}^T\pmb{A} = \pmb{A}\pmb{A}^T = \pmb{I}$$
# 
# This implies that
# 
# $$\pmb{A}^{-1}=\pmb{A}^{T}$$
# 
# so orthogonal matrices are of interest because their inverse is very cheap to compute.

# ## Eigendecomposition
# 
# * *Eigendecomposition* : a kind of matrix decomposition in which we decompose a matrix into a set of eigenvectors and eigenvalues.
# 
# * *Eigenvector* : an eigenvector of a square matrix $\pmb{A}$ is a non-zero vector $\pmb{v}$ such that multiplication by $\pmb{A}$ alters only the scale of $\pmb{v}$:
# 
# $$\pmb{A} \pmb{v} = \lambda \pmb{v}$$
# 
# * The scalar $\lambda$ is known as the *eigenvalue* corresponding to this eigenvector.
# 
# * Suppose $\pmb{A}$ has eigenvectors $\{\pmb{v}^{(1)},\dots,\pmb{v}^{(n)}\}$ with corresponding eigenvalues $\{\lambda_1,\dots,\lambda_{n}\}$. Concatenating the eigenvectors into a matrix $\pmb{V}=[\pmb{v}^{(1)},\dots,\pmb{v}^{(n)}]$ (i.e. one column per eigenvector) and the eigenvalues into a vector $\pmb{\lambda}$, the matrix
# 
# $$\pmb{A}=\pmb{V} diag(\pmb{\lambda}) \pmb{V}^{-1}$$
# 
# has the desired eigenvalues and eigenvectors.
# 
# * Every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:
# 
# $$\pmb{A}=\pmb{Q}\pmb{\Lambda}\pmb{Q}^{T}$$
# 
# where $\pmb{Q}$ is an orthogonal matrix composed of eigenvectors of $\pmb{A}$, and $\pmb{\Lambda}$ is a diagonal matrix, with $\Lambda_{i,i}$ being the eigenvalue corresponding to $\pmb{Q}_{:,i}$.
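# A sketch of the symmetric eigendecomposition in NumPy (np.linalg.eigh is for symmetric/Hermitian matrices; the matrix here is just an arbitrary example):

# In[ ]:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                         # real symmetric matrix

lam, Q = np.linalg.eigh(A)                         # eigenvalues (ascending) and orthonormal eigenvectors
print(lam)                                         # [1. 3.]
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))      # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))             # Q is orthogonal: Q^T Q = I
print(np.allclose(A @ Q[:, 0], lam[0] * Q[:, 0]))  # A v = lambda v for the first eigenvector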
# ## Singular Value Decomposition
# 
# * The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values.
# 
# * Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition.
# 
# * **If a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.**
# 
# $$\pmb{A}=\pmb{U} \pmb{D} \pmb{V}^{T}$$
# 
# Suppose that $\pmb{A}$ is an $m \times n$ matrix. Then $\pmb{U}$ is defined to be an $m \times m$ matrix, $\pmb{D}$ an $m \times n$ matrix, and $\pmb{V}$ an $n \times n$ matrix.
# 
# * The matrices $\pmb{U}$ and $\pmb{V}$ are both defined to be orthogonal matrices. The matrix $\pmb{D}$ is defined to be a diagonal matrix. Note that $\pmb{D}$ is not necessarily square.
# 
# * The elements along the diagonal of $\pmb{D}$ are known as the *singular values* of the matrix $\pmb{A}$. The columns of $\pmb{U}$ are known as the *left-singular vectors*. The columns of $\pmb{V}$ are known as the *right-singular vectors*.
# 
# * We can interpret the singular value decomposition of $\pmb{A}$ in terms of the eigendecomposition of functions of $\pmb{A}$:
# 
#     - The left-singular vectors of $\pmb{A}$ are the eigenvectors of $\pmb{A}\pmb{A}^{T}$.
#     - The right-singular vectors of $\pmb{A}$ are the eigenvectors of $\pmb{A}^{T}\pmb{A}$.
#     - The non-zero singular values of $\pmb{A}$ are the square roots of the eigenvalues of $\pmb{A}^{T}\pmb{A}$.

# ## The Moore-Penrose Pseudoinverse
# 
# * Matrix inversion is not defined for matrices that are not square.
# 
# * Suppose we want to make a left-inverse $\pmb{B}$ of a matrix $\pmb{A}$, so that we can solve a linear equation
# 
# $$\pmb{Ax}=\pmb{y}$$
# 
# by left-multiplying each side to obtain
# 
# $$\pmb{x}=\pmb{By}$$
# 
# * The *Moore-Penrose pseudoinverse* allows us to make some headway in these cases. The pseudoinverse of $\pmb{A}$ is defined as the matrix
# 
# $$\pmb{A}^{+}=\lim_{\alpha \searrow 0}(\pmb{A}^{T} \pmb{A} + \alpha \pmb{I})^{-1} \pmb{A}^{T}$$
# 
# * Practical algorithms for computing the pseudoinverse are not based on this definition, but rather on the formula
# 
# $$\pmb{A}^{+}=\pmb{V} \pmb{D}^{+} \pmb{U}^{T}$$
# 
# where $\pmb{U}$, $\pmb{D}$, and $\pmb{V}$ are the singular value decomposition of $\pmb{A}$, and the pseudoinverse $\pmb{D}^{+}$ of a diagonal matrix $\pmb{D}$ is obtained by taking the reciprocal of its non-zero elements and then transposing the resulting matrix.
# 
# * When $\pmb{A}$ has more columns than rows, solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\pmb{x}=\pmb{A}^{+}\pmb{y}$ with minimal Euclidean norm $||\pmb{x}||_{2}$ among all possible solutions.
# 
# * When $\pmb{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\pmb{x}$ for which $\pmb{Ax}$ is as close as possible to $\pmb{y}$ in terms of the Euclidean norm $||\pmb{Ax}-\pmb{y}||_{2}$.

# ## The Trace Operator
# 
# * $$Tr(\pmb{A})=\sum_{i}A_{i,i}$$
# 
# * The trace operator gives an alternative way of writing the Frobenius norm of a matrix:
# 
# $$||\pmb{A}||_{F} = \sqrt{Tr(\pmb{A} \pmb{A}^{T})}$$
# 
# * $$Tr(\pmb{A})=Tr(\pmb{A}^T)$$
# 
# * The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position:
# 
# $$Tr(\pmb{ABC})=Tr(\pmb{BCA})=Tr(\pmb{CAB})$$
# 
# or more generally,
# 
# $$Tr(\prod^{n}_{i=1}\pmb{F}^{(i)})=Tr(\pmb{F}^{(n)}\prod^{n-1}_{i=1}\pmb{F}^{(i)})$$

# ## Determinant
# 
# * The determinant of a square matrix, denoted $\det(\pmb{A})$, is a function mapping matrices to real scalars.
# 
# * The determinant is equal to the product of all the matrix's eigenvalues.
# 
# * The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space.
# 
# * If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume.
# 
# * If the determinant is 1, then the transformation is volume-preserving.
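# A final NumPy sketch tying these pieces together (the particular matrices are arbitrary examples; np.linalg.pinv computes the Moore-Penrose pseudoinverse via the SVD):

# In[ ]:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # more rows than columns: Ax = y may have no exact solution
y = np.array([1.0, 0.0, 1.0])

# SVD: A = U D V^T, with D the m x n matrix holding the singular values on its diagonal
U, s, Vt = np.linalg.svd(A)
D = np.zeros_like(A)
D[:2, :2] = np.diag(s)
print(np.allclose(U @ D @ Vt, A))

# Squared singular values equal the eigenvalues of A^T A (both sorted in descending order)
print(np.allclose(s**2, np.linalg.eigvalsh(A.T @ A)[::-1]))

# The pseudoinverse solution minimizes ||Ax - y||_2, matching least squares
x = np.linalg.pinv(A) @ y
print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))

# Trace, determinant, and the Frobenius norm written via the trace
B = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(np.trace(B), np.trace(B.T))                                         # Tr(B) = Tr(B^T)
print(np.isclose(np.linalg.det(B), np.prod(np.linalg.eigvals(B))))        # det = product of eigenvalues
print(np.isclose(np.linalg.norm(B, 'fro'), np.sqrt(np.trace(B @ B.T))))   # ||B||_F = sqrt(Tr(B B^T))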