#!/usr/bin/env python
# coding: utf-8

# In[2]:


get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np


# # Linear Algebra
# 
# * References: *The Matrix Cookbook* (Petersen and Pedersen, 2006) or Shilov (1977)

# ## Mathematical Objects of Linear Algebra
# 
# * Scalar
# 
#   - A single number
#   - lower-case variable names in italics: $x$
#   - $s \in \mathbb{R}$
#   - $n \in \mathbb{N}$
# 
# * Vector
# 
#   - An array of numbers
#   - lower-case names written in bold typeface, for example: $\pmb{x}$
#   - The first element of $\pmb{x}$ is $x_{1}$, the second element is $x_{2}$, and so on.
#   - If the length of the vector is $n$ and each element is in $\mathbb{R}$, then the vector lies in the set formed by taking the Cartesian product of $\mathbb{R}$ $n$ times, denoted as $\mathbb{R}^n$.
#   - $$\pmb{x} = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}$$
#   - We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
#   - If we need to access $x_{1}$, $x_{3}$, and $x_{6}$, we define the set $S=\{1,3,6\}$ and write $\pmb{x}_{S}$.
#   - $\pmb{x}_{-1}$ is the vector containing all elements of $\pmb{x}$ except for $x_{1}$.
#   - $\pmb{x}_{-S}$ is the vector containing all of the elements of $\pmb{x}$ except for $x_{1}$, $x_{3}$, and $x_{6}$.
#   
# * Matrices
# 
#   - a 2-D array of numbers
#   - upper-case variable names with bold typeface : $\pmb{A}$
#   - $\pmb{A} \in \mathbb{R}^{m \times n}$ : a matrix with a height of $m$ and a width of $n$.
#   - $A_{1,1}$ : the upper-left entry of $\pmb{A}$; $A_{m,n}$ is the bottom-right entry.
#   - $A_{i,:}$ : the *i*-th *row* of $\pmb{A}$; $A_{:,j}$ is the *j*-th *column*.
#   - $$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}$$
#   - $$\pmb{A} = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \\ a_{3,1} & a_{3,2} \end{bmatrix} \Rightarrow \pmb{A}^{T} = \begin{bmatrix} a_{1,1} & a_{2,1} & a_{3,1} \\ a_{1,2} & a_{2,2} & a_{3,2} \end{bmatrix}$$
#   
# * Tensors
# 
#   - an array with more than two axes.
#   - $A_{i,j,k}$ : the element of $\mathbf{A}$ at coordinates $(i, j, k)$.
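
# These objects map directly onto NumPy arrays; below is a minimal sketch (using `np` as imported above) of building each one and of the indexing and transpose notation.

# In[ ]:


x = np.array([1.0, 2.0, 3.0])        # a vector in R^3
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])           # a 3 x 2 matrix
T = np.zeros((2, 3, 4))              # a tensor with three axes

print(x[0])         # x_1 (NumPy indices start at 0)
print(x[[0, 2]])    # x_S for S = {1, 3}
print(A[0, :])      # the first row A_{1,:}
print(A.T)          # the transpose, a 2 x 3 matrix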

# ## Operations
# 
# * Adding
# 
#   - $\pmb{C}=\pmb{A}+\pmb{B} \Rightarrow C_{i,j} = A_{i,j} + B_{i,j}$
#   - $\pmb{D}=a \cdot \pmb{B} + c \Rightarrow D_{i,j}=a \cdot B_{i,j} + c$
#   
# * Multiplying Matrices and Vectors
# 
#   - If $\pmb{A}$ is of shape $m \times n$ and $\pmb{B}$ is of shape $n \times p$, then $\pmb{C}=\pmb{AB}$ is of shape $m \times p$.
#   - $\pmb{C}=\pmb{AB} \Rightarrow c_{i,j}=\sum_{k} a_{i,k} b_{k,j}$
#   - *dot product* : $\pmb{x} \cdot \pmb{y} = \pmb{x}^{T} \pmb{y}$
#   - $\pmb{A}(\pmb{B} + \pmb{C})=\pmb{AB} + \pmb{AC}$
#   - $\pmb{A}(\pmb{BC})=(\pmb{AB})\pmb{C}$
#   - $\pmb{AB} \neq \pmb{BA}$ in general (matrix multiplication is not commutative)
#   - $(\pmb{AB})^{T}=\pmb{B}^{T} \pmb{A}^{T}$
#   - $$s \pmb{A}=\begin{bmatrix} sa_{1,1} & \dots & sa_{1,n} \\ \vdots & \ddots & \vdots \\ sa_{m,1} & \dots & sa_{m,n} \end{bmatrix}$$
#   - $\pmb{Ax}=\pmb{b}$
#   
#   $$\begin{align} \pmb{A}_{1,:}\pmb{x}&=b_{1} \\ \pmb{A}_{2,:}\pmb{x}&=b_{2} \\ &\;\;\vdots \\ \pmb{A}_{m,:}\pmb{x}&=b_{m} \end{align}$$
#   
#   more explicitly:
#   
#   $$\begin{align} a_{1,1}x_{1} + a_{1,2}x_{2} + \dots + a_{1,n}x_{n} &= b_{1} \\ a_{2,1}x_{1} + a_{2,2}x_{2} + \dots + a_{2,n}x_{n} &= b_{2} \\ &\;\;\vdots \\ a_{m,1}x_{1} + a_{m,2}x_{2} + \dots + a_{m,n}x_{n} &= b_{m} \end{align}$$
# 
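
# A few of these identities checked numerically in NumPy (a sketch; `@` is NumPy's matrix-multiplication operator):

# In[ ]:


A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
C = np.random.randn(3, 4)

print((A @ B).shape)                            # (m x n)(n x p) -> (m x p): (2, 4)
print(np.allclose(A @ (B + C), A @ B + A @ C))  # distributivity
print(np.allclose((A @ B).T, B.T @ A.T))        # (AB)^T = B^T A^T

x = np.random.randn(3)
y = np.random.randn(3)
print(np.isclose(x @ y, x.T @ y))               # dot product: x . y = x^T y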

# ## Identity and Inverse Matrices
# 
# * Identity Matrix : $\forall \pmb{x} \in \mathbb{R}^n$, $\pmb{I}_{n} \pmb{x} = \pmb{x}$
# * Inverse Matrix : $\pmb{A}^{-1} \pmb{A} = \pmb{I}_{n}$
# * $$\begin{align} \pmb{A} \pmb{x} &= \pmb{b} \\ \pmb{A}^{-1} \pmb{Ax} &= \pmb{A}^{-1} \pmb{b} \\ \pmb{I}_{n} \pmb{x} &= \pmb{A}^{-1} \pmb{b} \\ \pmb{x} &= \pmb{A}^{-1} \pmb{b} \end{align}$$
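
# Numerically, forming $\pmb{A}^{-1}$ explicitly is rarely the best route; `np.linalg.solve` solves $\pmb{Ax}=\pmb{b}$ directly and more stably. A minimal sketch of both:

# In[ ]:


A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x_inv = np.linalg.inv(A) @ b       # x = A^{-1} b, via the explicit inverse
x_solve = np.linalg.solve(A, b)    # solves Ax = b without forming A^{-1}
print(np.allclose(x_inv, x_solve))
print(np.allclose(A @ x_solve, b))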

# ## Linear Dependence, Span, and Rank
# 
# * $$\pmb{Ax} = \sum_{i} x_{i} \pmb{A}_{:,i}$$
# 
# More generally, a *linear combination* of a set of vectors $\{\pmb{v}^{(1)},\dots,\pmb{v}^{(n)}\}$ is obtained by multiplying each vector by a scalar coefficient and adding the results:
# 
# * $$\sum_{i} c_{i} \pmb{v}^{(i)}$$
# 
# * The *span* of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
# 
# * Determining whether $\pmb{Ax}=\pmb{b}$ has a solution thus amounts to testing whether $\pmb{b}$ is in the span of the columns of $\pmb{A}$. This particular span is known as the *column space* or the *range* of $\pmb{A}$.
# 
# * In order for the system $\pmb{Ax}=\pmb{b}$ to have a solution for all values of $\pmb{b} \in \mathbb{R}^{m}$, we therefore require that the column space of $\pmb{A}$ be all of $\mathbb{R}^{m}$.
# 
# * Having $n \ge m$ is a necessary condition for the system to have a solution for every value of $\pmb{b}$. It is not a sufficient condition, however, because some of the columns may be redundant: consider a $2 \times 2$ matrix where both columns are equal to each other.
# 
# * Formally, this kind of redundancy is known as *linear dependence*. A set of vectors is *linearly independent* if no vector in the set is a linear combination of the other vectors.
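
# `np.linalg.matrix_rank` makes the redundant-column example above concrete (a sketch):

# In[ ]:


A = np.array([[1.0, 1.0],
              [2.0, 2.0]])          # both columns are equal
print(np.linalg.matrix_rank(A))     # 1, not 2: the columns are linearly dependent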

# ## Norms
# 
# * *$L^{p}$ norm*
# 
# $$||\pmb{x}||_{p} = \biggl( \sum_{i} |x_i|^p \biggr)^{\frac{1}{p}}$$ 
# 
# for $p \in \mathbb{R}$, $p \ge 1$
# 
# * Norms, including the $L^{p}$ norm, are functions mapping vectors to non-negative values, satisfying these properties that make them behave like distances between points:
# 
# $$\begin{align} & f(\pmb{x})=0 \Rightarrow \pmb{x} = \pmb{0} \\ & f(\pmb{x}+\pmb{y}) \le f(\pmb{x}) + f(\pmb{y}), (\text{the triangle inequality}) \\ & \forall \alpha \in \mathbb{R}, f(\alpha \pmb{x}) = |\alpha|f(\pmb{x}) \end{align}$$
# 
# * The $L^{2}$ norm, with $p = 2$, is known as the *Euclidean norm*. It is simply the Euclidean distance from the origin to the point identified by $\pmb{x}$.
# 
# * Also, the $L^{1}$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important.
# 
# * The $L^{\infty}$ norm, also known as the *max norm*:
# 
# $$||\pmb{x}||_{\infty}=\max_{i}|x_{i}|$$
# 
# * Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure *Frobenius norm*:
# 
# $$||\pmb{A}||_{F}=\sqrt{\sum_{i,j} a_{i,j}^2}$$
# 
# * The dot product of two vectors can be rewritten in terms of norms: $$\pmb{x}^T \pmb{y} = ||\pmb{x}||_{2} ||\pmb{y}||_{2} \cos{\theta}$$ where $\theta$ is the angle between $\pmb{x}$ and $\pmb{y}$.
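
# `np.linalg.norm` computes all of the norms above via its `ord` argument (a sketch; `ord=2` is the default for vectors):

# In[ ]:


x = np.array([3.0, -4.0])
print(np.linalg.norm(x, ord=1))        # L^1 norm: 7
print(np.linalg.norm(x))               # L^2 (Euclidean) norm: 5
print(np.linalg.norm(x, ord=np.inf))   # max norm: 4

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.linalg.norm(A, ord='fro'))    # Frobenius norm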

# ## Special Kinds of Matrices and Vectors
# 
# * A matrix $\pmb{D}$ is *diagonal* if $d_{i,j}=0$ for all $i \ne j$. The identity matrix is one example.
# 
# * $diag(\pmb{v})$ denotes a square diagonal matrix whose diagonal entries are given by the entries of the vector $\pmb{v}$.
# 
# * A symmetric matrix is any matrix that is equal to its own transpose:
# 
# $$\pmb{A}=\pmb{A}^T$$
# 
# * A unit vector is a vector with unit norm:
# 
# $$||\pmb{x}||_{2}=1$$
# 
# * A vector $\pmb{x}$ and a vector $\pmb{y}$ are *orthogonal* to each other if $\pmb{x}^{T} \pmb{y}=0$.
# 
# * If the vectors are not only orthogonal but also have unit norm, we call them *orthonormal*.
# 
# * An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:
# 
# $$\pmb{A}^T\pmb{A} = \pmb{A}\pmb{A}^T = \pmb{I}$$
# 
# This implies that
# 
# $$\pmb{A}^{-1}=\pmb{A}^{T}$$
# 
# so orthogonal matrices are of interest because their inverse is very cheap to compute.
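
# A 2-D rotation matrix is a handy concrete orthogonal matrix; a sketch verifying $\pmb{A}^T\pmb{A}=\pmb{I}$ and the cheap inverse:

# In[ ]:


theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation by 45 degrees

print(np.allclose(A.T @ A, np.eye(2)))            # A^T A = I
print(np.allclose(np.linalg.inv(A), A.T))         # A^{-1} = A^T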

# ## Eigendecomposition
# 
# * *Eigendecomposition* : a kind of matrix decomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
# 
# * *Eigenvector* : An eigenvector of a square matrix $\pmb{A}$ is a non-zero vector $\pmb{v}$ such that multiplication by $\pmb{A}$ alters only the scale of $\pmb{v}$:
# 
# $$\pmb{A} \pmb{v} = \lambda \pmb{v}$$
# 
# * The scalar $\lambda$ is known as the *eigenvalue* corresponding to this eigenvector.
# 
# * Suppose a matrix $\pmb{A}$ has eigenvectors $\{\pmb{v}^{(1)},\dots,\pmb{v}^{(n)}\}$ with corresponding eigenvalues $\{\lambda_1,\dots,\lambda_{n}\}$. Concatenating the eigenvectors into a matrix $\pmb{V}=[\pmb{v}^{(1)},\dots,\pmb{v}^{(n)}]$ (i.e. one column per eigenvector), and the eigenvalues into a vector $\pmb{\lambda}$, the matrix
# 
# $$\pmb{A}=\pmb{V} diag(\pmb{\lambda}) \pmb{V}^{-1}$$
# 
# has the desired eigenvalues and eigenvectors.
# 
# * Every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:
# 
# $$\pmb{A}=\pmb{Q}\pmb{\Lambda}\pmb{Q}^{T}$$
# 
# where $\pmb{Q}$ is an orthogonal matrix composed of eigenvectors of $\pmb{A}$, and $\pmb{\Lambda}$ is a diagonal matrix, with $\Lambda_{i,i}$ being the eigenvalue corresponding to $\pmb{Q}_{:,i}$.
# 
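
# A sketch with `np.linalg.eigh`, which handles the real symmetric case and returns real eigenvalues with orthonormal eigenvectors:

# In[ ]:


A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                          # real symmetric

lam, Q = np.linalg.eigh(A)                          # eigenvalues and orthonormal eigenvectors
print(lam)                                          # [1. 3.]
print(np.allclose(A @ Q[:, 0], lam[0] * Q[:, 0]))   # A v = lambda v
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))       # A = Q Lambda Q^T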

# ## Singular Value Decomposition
# 
# * The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values.
# 
# * Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition.
# 
# * **If a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.**
# 
# $$\pmb{A}=\pmb{U} \pmb{D} \pmb{V}^{T}$$
# 
# Suppose that $\pmb{A}$ is an $m \times n$ matrix. Then $\pmb{U}$ is defined to be an $m\times m$ matrix, $\pmb{D}$ to be an $m \times n$ matrix, and $\pmb{V}$ to be an $n \times n$ matrix.
# 
# * The matrices $\pmb{U}$ and $\pmb{V}$ are both defined to be orthogonal matrices. The matrix $\pmb{D}$ is defined to be a diagonal matrix. Note that $\pmb{D}$ is not necessarily square.
# 
# * The elements along the diagonal of $\pmb{D}$ are known as the *singular values* of the matrix $\pmb{A}$. The columns of $\pmb{U}$ are known as the *left-singular vectors*. The columns of $\pmb{V}$ are known as the *right-singular vectors*.
# 
# * We can actually interpret the singular value decomposition of $\pmb{A}$ in terms of the eigendecomposition of functions of $\pmb{A}$. 
# 
# * The left-singular vectors of $\pmb{A}$ are the eigenvectors of $\pmb{A}\pmb{A}^{T}$.
# 
# * The right-singular vectors of $\pmb{A}$ are the eigenvectors of $\pmb{A}^{T}\pmb{A}$.
# 
# * The non-zero singular values of $\pmb{A}$ are the square roots of the eigenvalues of $\pmb{A}^{T}\pmb{A}$. 
# 
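
# A sketch with `np.linalg.svd`, checking the factor shapes and the relation to the eigendecomposition of $\pmb{A}^{T}\pmb{A}$:

# In[ ]:


A = np.random.randn(4, 3)                      # non-square, so no eigendecomposition
U, s, Vt = np.linalg.svd(A)                    # s holds the singular values
print(U.shape, s.shape, Vt.shape)              # (4, 4) (3,) (3, 3)

D = np.zeros((4, 3))
D[:3, :3] = np.diag(s)
print(np.allclose(A, U @ D @ Vt))              # A = U D V^T

eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]    # eigenvalues of A^T A, descending
print(np.allclose(np.sqrt(eigvals), s))        # singular values are their square roots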

# ## The Moore-Penrose Pseudoinverse
# 
# * Matrix inversion is not defined for matrices that are not square.
# 
# * Suppose we want to make a left-inverse $\pmb{B}$ of a matrix $\pmb{A}$, so that we can solve a linear equation
# 
# $$\pmb{Ax}=\pmb{y}$$
# 
# by left-multiplying each side to obtain
# 
# $$\pmb{x}=\pmb{By}$$
# 
# * The *Moore-Penrose Pseudoinverse* allows us to make some headway in these cases. The pseudoinverse of $\pmb{A}$ is defined as the matrix
# 
# $$\pmb{A}^{+}=\lim_{\alpha \searrow 0}(\pmb{A}^{T} \pmb{A} + \alpha \pmb{I})^{-1} \pmb{A}^{T}$$
# 
# * Practical algorithms for computing the pseudoinverse are not based on this definition, but rather on the formula
# 
# $$\pmb{A}^{+}=\pmb{V} \pmb{D}^{+} \pmb{U}^{T}$$
# 
# where $\pmb{U}$, $\pmb{D}$, and $\pmb{V}$ are the singular value decomposition of $\pmb{A}$, and the pseudoinverse $\pmb{D}^{+}$ of a diagonal matrix $\pmb{D}$ is obtained by taking the reciprocal of its non-zero elements and then taking the transpose of the resulting matrix.
# 
# * When $\pmb{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions.
# 
# * Specifically, it provides the solution $\pmb{x}=\pmb{A}^{+}\pmb{y}$ with minimal Euclidean norm $||\pmb{x}||_{2}$ among all possible solutions.
# 
# * When $\pmb{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\pmb{x}$ for which $\pmb{Ax}$ is as close as possible to $\pmb{y}$ in terms of the Euclidean norm $||\pmb{Ax}-\pmb{y}||_{2}$.
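
# `np.linalg.pinv` implements this SVD-based formula; a sketch of both regimes:

# In[ ]:


# More columns than rows: pinv picks the minimum-norm exact solution
A = np.random.randn(2, 4)
y = np.random.randn(2)
x = np.linalg.pinv(A) @ y
print(np.allclose(A @ x, y))       # an exact solution, with minimal ||x||_2

# More rows than columns: pinv gives the least-squares fit
B = np.random.randn(5, 2)
z = np.random.randn(5)
w = np.linalg.pinv(B) @ z          # minimizes ||Bw - z||_2
print(np.linalg.norm(B @ w - z))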

# ## The Trace Operator
# 
# * $$Tr(\pmb{A})=\sum_{i}a_{i,i}$$
# 
# * The trace operator provides an alternative way of writing the Frobenius norm of a matrix:
# 
# $$||\pmb{A}||_{F} = \sqrt{Tr(\pmb{A} \pmb{A}^T)}$$
# 
# * $$Tr(\pmb{A})=Tr(\pmb{A}^T)$$
# 
# * The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position:
# 
# $$Tr(\pmb{ABC})=Tr(\pmb{BCA})=Tr(\pmb{CAB})$$
# 
# or more generally,
# 
# $$Tr(\prod^{n}_{i=1}\pmb{F}^{(i)})=Tr(\pmb{F}^{(n)}\prod^{n-1}_{i=1}\pmb{F}^{(i)})$$
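
# A quick numerical check of the cyclic property and the trace form of the Frobenius norm (a sketch; the factor shapes need only be compatible so that each product is square):

# In[ ]:


A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
C = np.random.randn(4, 2)

traces = [np.trace(A @ B @ C), np.trace(B @ C @ A), np.trace(C @ A @ B)]
print(np.allclose(traces, traces[0]))   # all three agree

M = np.random.randn(3, 3)
print(np.isclose(np.linalg.norm(M, 'fro'), np.sqrt(np.trace(M @ M.T))))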

# ## Determinant
# 
# * The determinant of a square matrix, denoted $det(\pmb{A})$, is a function mapping matrices to real scalars.
# 
# * The determinant is equal to the product of all the matrix's eigenvalues.
# 
# * The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space.
# 
# * If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume.
# 
# * If the determinant is 1, then the transformation is volume-preserving.
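
# A sketch checking that the determinant equals the product of the eigenvalues, and that a matrix with redundant columns has determinant 0:

# In[ ]:


A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(np.isclose(np.linalg.det(A), np.prod(np.linalg.eigvals(A))))

B = np.array([[1.0, 1.0],
              [2.0, 2.0]])          # equal columns: space collapses onto a line
print(np.isclose(np.linalg.det(B), 0.0))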

# In[ ]: