$Mv_i=\lambda_iv_i$
$M^kv_i=M^{k-1}(Mv_i)=\lambda_iM^{k-1}v_i=\lambda_i^kv_i$
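As a quick numerical check of this identity (a minimal NumPy sketch; the $2\times 2$ matrix and the exponent $k=5$ are arbitrary choices for illustration):

```python
import numpy as np

# A small symmetric matrix, so the eigenvectors are real and easy to recover.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(M)
lam, v = eigvals[0], eigvecs[:, 0]

k = 5
# M^k v should equal lam^k v, up to floating-point error.
print(np.allclose(np.linalg.matrix_power(M, k) @ v, lam**k * v))   # True
```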
One possible way to prove the statement is to use the power method¹², which is an iterative algorithm for finding the dominant eigenvalue and eigenvector of a matrix. The idea is to start with a random vector $x_0\in\mathbb{R}^n$ and repeatedly multiply it by the matrix $M$, normalizing after each step. That is, we define $x_{k+1}=\frac{Mx_k}{\|Mx_k\|}$ for $k=0,1,2,\dots$. Under some mild assumptions on $M$, the sequence $\{x_k\}$ converges (up to sign) to a unit eigenvector of $M$ corresponding to the eigenvalue with the largest absolute value. Moreover, the convergence rate is governed by the ratio $|\lambda_2|/|\lambda_1|$ of the second-largest to the largest eigenvalue magnitude: the smaller this ratio, the faster the convergence.
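A minimal sketch of this iteration in NumPy may help make it concrete (the function name, the test matrix, and the fixed iteration count are illustrative choices, not part of any particular library):

```python
import numpy as np

def power_method(M, num_iters=100, seed=0):
    """Repeatedly apply M to a random unit vector, normalizing after each step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        y = M @ x
        x = y / np.linalg.norm(y)
    # Rayleigh quotient: estimate of the dominant eigenvalue for the unit vector x.
    return x, x @ (M @ x)

M = np.array([[4.0, 1.0],
              [2.0, 3.0]])        # eigenvalues 5 and 2
x, lam = power_method(M)
print(lam)                         # ~5.0, the dominant eigenvalue
```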
To formalize this statement, we need to define what it means for a vector to be "very much aligned" with another vector. A natural way to measure the alignment is by using the angle between them. If the angle is close to zero, then the vectors are almost parallel. If the angle is close to $\pi/2$, then the vectors are almost orthogonal. The angle between two unit vectors $u$ and $v$ can be computed by using the dot product: $\cos\theta=u\cdot v$. Therefore, we can say that $x_k$ is very much aligned with $v_1$ if $\cos\theta_k=|x_k\cdot v_1|$ is close to one.
To quantify how close $\cos\theta_k$ is to one, we can use some error tolerance $\epsilon>0$. For example, we can say that $x_k$ is $\epsilon$-aligned with $v_1$ if $\cos\theta_k\geq 1-\epsilon$. Then, we can state the following theorem:
Theorem: Let $M$ be an $n\times n$ diagonalizable matrix with a dominant eigenvalue $\lambda_1$ (that is, $|\lambda_1|>|\lambda_i|$ for all $i>1$) and a corresponding unit eigenvector $v_1$. Let $x_0$ be a random vector in $\mathbb{R}^n$ with unit norm. Let $\epsilon>0$ be an error tolerance. Then, with probability one, there exists a positive integer $K$ such that for all $k\geq K$, the vector $x_k=\frac{M^kx_0}{\|M^kx_0\|}$ is $\epsilon$-aligned with $v_1$. That is, $$|x_k\cdot v_1|\geq 1-\epsilon$$
Proof: The proof of this theorem can be found in many textbooks on numerical analysis or linear algebra¹². Here, we only sketch the main steps. First, since $M$ is diagonalizable, we write $x_0$ as a linear combination of the eigenvectors of $M$: $$x_0=c_1v_1+c_2v_2+\cdots+c_nv_n$$ where $c_1\neq 0$ (this holds with probability one for a random $x_0$). Then, we multiply both sides by $M^k$ and use the fact that $Mv_i=\lambda_iv_i$ for each $i$: $$M^kx_0=c_1\lambda_1^kv_1+c_2\lambda_2^kv_2+\cdots+c_n\lambda_n^kv_n$$ Next, we divide both sides by $\lambda_1^k$: $$\frac{M^kx_0}{\lambda_1^k}=c_1v_1+c_2\left(\frac{\lambda_2}{\lambda_1}\right)^kv_2+\cdots+c_n\left(\frac{\lambda_n}{\lambda_1}\right)^kv_n$$ Since $\lambda_1$ is the dominant eigenvalue, $\left|\frac{\lambda_i}{\lambda_1}\right|<1$ for all $i>1$, so the terms involving $\left(\frac{\lambda_i}{\lambda_1}\right)^k$ decay geometrically as $k$ increases. Therefore, for sufficiently large $k$, we have $$\frac{M^kx_0}{\lambda_1^k}\approx c_1v_1$$ Finally, we divide both sides by their norms and use the fact that $\|v_1\|=1$: $$x_k=\frac{M^kx_0}{\|M^kx_0\|}\approx \frac{c_1v_1}{|c_1|\|v_1\|}= \pm v_1$$ This implies that $$|x_k\cdot v_1|\approx |(\pm v_1)\cdot v_1|= 1$$ Hence, for any given $\epsilon>0$, we can find a positive integer $K$ such that for all $k\geq K$, we have $$|x_k\cdot v_1|\geq 1-\epsilon$$ This completes the proof. $\blacksquare$
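The theorem is easy to check numerically. The following self-contained sketch picks an arbitrary $2\times 2$ matrix and tolerance (both are illustrative choices), recovers the dominant unit eigenvector $v_1$ with a library eigendecomposition, and reports the first $k$ at which $x_k$ is $\epsilon$-aligned with $v_1$:

```python
import numpy as np

rng = np.random.default_rng(1)
M = np.array([[4.0, 1.0],
              [2.0, 3.0]])            # eigenvalues 5 and 2, so |lambda_2 / lambda_1| = 0.4

# Dominant unit eigenvector v_1 from a library eigendecomposition.
eigvals, eigvecs = np.linalg.eig(M)
v1 = eigvecs[:, np.argmax(np.abs(eigvals))]

x0 = rng.standard_normal(2)
x0 /= np.linalg.norm(x0)

eps = 1e-6
for k in range(1, 100):
    Mk_x0 = np.linalg.matrix_power(M, k) @ x0
    xk = Mk_x0 / np.linalg.norm(Mk_x0)
    if abs(xk @ v1) >= 1 - eps:
        print(f"epsilon-aligned with v_1 at k = {k}")
        break
```

Because $|\lambda_2/\lambda_1|=0.4$ here, the misalignment shrinks by roughly that factor per step, so only a handful of iterations are needed.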
The result mentioned, which deals with the alignment of vectors under repeated matrix multiplication, has some relevance to the gradients in Recurrent Neural Networks (RNNs), especially in the context of gradient vanishing and exploding problems.
In RNNs, you have sequences of vectors, and you often perform repeated matrix-vector multiplications (hidden state updates) at each time step. These matrix-vector multiplications are analogous to the repeated matrix powers in the result you mentioned. Here's how the result might relate to gradients in RNNs:
Gradient Vanishing: In some RNN architectures, particularly vanilla RNNs, the gradients can vanish when backpropagating through time (BPTT): as you backpropagate through many time steps, the gradients can become extremely small. The result you mentioned about alignment is related because it shows that when you repeatedly multiply by a matrix whose eigenvalues all have magnitude less than 1, the product's direction still aligns with the dominant eigenvector while its norm shrinks geometrically toward zero. In the context of RNNs, the backpropagated gradient undergoes an analogous chain of multiplications, so it can shrink toward zero, making it hard to learn long-range dependencies and to train the network effectively.
Gradient Exploding: On the flip side, if the matrix you repeatedly multiply by has an eigenvalue with magnitude greater than 1, the repeated products can grow without bound. In the context of RNNs, this corresponds to exploding gradients, which are also a problem because they can cause numerical instability during training.
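A rough numerical illustration of both effects (a minimal sketch: the random matrix below stands in for the Jacobian factors that appear in BPTT, and the scale factors 0.5 and 1.5 are arbitrary choices to push the spectral radius below and above 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 32, 50
W = rng.standard_normal((n, n)) / np.sqrt(n)   # spectral radius roughly 1

g = rng.standard_normal(n)                     # stand-in for a backpropagated gradient
for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    v = g.copy()
    for _ in range(steps):                     # one multiplication per "time step"
        v = (scale * W).T @ v
    print(label, np.linalg.norm(v))            # tiny norm vs. huge norm
```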
Solutions: To address these issues, various RNN architectures and optimization techniques have been proposed. LSTMs and GRUs, for example, have gating mechanisms that help mitigate gradient vanishing to some extent. Additionally, techniques like gradient clipping and using well-conditioned weight matrices can help prevent gradient explosion.
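To make the clipping idea concrete, here is a minimal sketch of clipping by global norm in plain NumPy (the function name and threshold are illustrative; frameworks ship their own versions, e.g. PyTorch's torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]        # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))    # 5.0
```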
Weight Initialization: Weight initialization schemes, such as the Xavier/Glorot initialization, are designed to set the initial weight matrices of RNNs at a scale that keeps activation and gradient variances roughly constant across layers and time steps, which helps mitigate both vanishing and exploding gradients.
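For reference, Glorot/Xavier uniform initialization draws weights from $U(-a,a)$ with $a=\sqrt{6/(\text{fan}_{in}+\text{fan}_{out})}$. A minimal sketch (the function name and the 256-unit hidden size are illustrative):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform initialization: limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_hh = glorot_uniform(256, 256, rng)    # recurrent (hidden-to-hidden) weights
print(W_hh.std())                       # ~ sqrt(2 / (fan_in + fan_out)) ≈ 0.0625
```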
Orthogonal Initialization: In some cases, orthogonal weight initialization, which initializes the recurrent weight matrices as orthogonal matrices (all singular values equal to 1), can help because repeated multiplication by an orthogonal matrix preserves vector norms, mitigating the vanishing (and exploding) gradient problem.
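One common way to build such an initializer is to take the Q factor of a QR decomposition of a Gaussian random matrix (a sketch; the sign correction just makes the draw uniform over orthogonal matrices, and the sizes are illustrative):

```python
import numpy as np

def orthogonal_init(n, rng):
    """Return an n x n orthogonal matrix (all singular values equal 1)."""
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))   # fix column signs so the distribution is uniform
    return Q

rng = np.random.default_rng(0)
W = orthogonal_init(64, rng)
h = rng.standard_normal(64)
# Norm is preserved even after 100 repeated multiplications.
print(np.linalg.norm(h), np.linalg.norm(np.linalg.matrix_power(W, 100) @ h))
```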
In summary, the result about vector alignment under repeated matrix multiplication has implications for understanding the challenges of gradient vanishing and exploding in RNNs. Addressing these challenges is crucial for training RNNs effectively, and various architectural and initialization techniques have been developed to manage these issues.
In addition to gradient clipping, there are several other methods and techniques to cope with gradient explosion in recurrent neural networks (RNNs):
Weight Regularization: Penalizing large weights (e.g., an L2 penalty) keeps the norm of the recurrent weight matrix, and hence the scale of the repeated multiplications, under control.
Batch Normalization: Normalizing intermediate activations stabilizes their scale from step to step, which indirectly keeps gradient magnitudes in a reasonable range (layer normalization is a common alternative in recurrent models).
Gradient Clipping Variations: Besides clipping the global norm, gradients can be clipped element-wise by value or clipped separately per parameter group.
Truncated Backpropagation Through Time (TBPTT): Backpropagating through only a fixed number of recent time steps limits how many matrix products appear in the gradient, bounding how much it can grow (see the sketch after this list).
Gradient Checkpointing: Recomputing intermediate activations during the backward pass instead of storing them reduces memory use, which makes training on long sequences more tractable.
Exploding Gradient Detection: Monitoring gradient norms during training makes it possible to detect spikes and skip or rescale the offending updates.
Gradient Scaling: Rescaling gradients by a fixed or adaptive factor before the optimizer step bounds the update magnitude without changing its direction.
Curriculum Learning: Starting with shorter or easier sequences and gradually increasing their length reduces the number of repeated multiplications early in training.
Alternative Architectures: Gated models such as LSTMs and GRUs replace the plain recurrent multiplication with additive, gated state updates that behave better under backpropagation.
Gradient Clipping on a Per-Layer Basis: Applying a separate clipping threshold to each layer's (or each weight matrix's) gradients gives finer control than a single global threshold.
Learning Rate Scheduling: Lowering the learning rate, or reducing it when gradient norms spike, limits the damage a single exploding update can do.
Gradient Masking: Masking out gradient contributions from irrelevant positions (for example, padded time steps) keeps them from inflating the overall gradient.
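As an illustration of the TBPTT item above, here is a minimal PyTorch sketch on a toy reconstruction task (the model, chunk length, loss, and shapes are all arbitrary assumptions made for illustration, not a reference implementation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 8)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 200, 8)      # (batch, long sequence, features)
target = x.clone()               # toy objective: reconstruct the input
chunk = 20                       # truncation length
h = torch.zeros(1, 4, 16)        # (num_layers, batch, hidden)

for start in range(0, x.size(1), chunk):
    xs = x[:, start:start + chunk]
    ys = target[:, start:start + chunk]
    out, h = rnn(xs, h)
    h = h.detach()               # cut the graph: gradients flow at most `chunk` steps back
    loss = nn.functional.mse_loss(head(out), ys)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)  # optional clipping
    opt.step()
```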
Experimentation and model-specific considerations are often necessary to determine the most effective approach to coping with gradient explosion in RNNs, as the effectiveness of these methods can vary depending on the specific problem and architecture.