Review Random Matrix Theory

- toc: true
- badges: true
- comments: true
- categories: [machine learning, QSAR]

I came across with this paper about denoising activity data using random matrix theory (RMT). I never heard about RMT before and it seemed an interesting approach, so I thought I explore the method by trying myself.

Chemicals are often represented in so called chemical fingerprint. The chemical fingerprint is a vector of bit information where each bit represents a presence or an absence of certain chemical groups. We can assume the presence of a set of certain chemical groups is correlated with binding of ligand. Using the RMT framework, we can removes noise in the fingerprints and aim to obtain the set of fingerprints that is significant.

Let's dive in how it works.

The RMT approaches this by computing correlation matrix and taking eigenvalue vectors of each column. By taking the eigenvalues from the correlation matrix, **the components having insignificant correlation will have small eigenvalues whereas the components having significant correlation will have large eigenvalues.**

Let's suppose there are $N$ fingerprint vector ($\mathbf{f}$) with size $p$. They are arranged in a matrix, $A = [\mathbf{f}_1, \mathbf{f}_2, \cdots, \mathbf{f}_N] \in \mathbb{R}^{N \times p}$. The matrix is normalized by subtract column mean and divide the colunms with standard deviation. The correlation matrix of $N \times N$ is constructed as $C = A^T A / N$.

If the entries in the matrix $A$ is i.i.d, then the eigenvalues of $A$ follows the Marcenko–Pastur (MP) distribution.

$$\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda) ( \lambda - \lambda_-)}}{2 \pi \gamma \lambda}$$where $\lambda_{\pm} = \left({1 \pm \sqrt\gamma}\right)^2$ and $\lambda = p / N$. Let's numerically validate this.

In [7]:

```
#collapse-hide
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
from io import BytesIO
import pandas as pd
import numpy as np
from IPython.display import SVG
# RDKit
import rdkit
from rdkit.Chem import PandasTools
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules
from rdkit.Chem import Draw
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw.MolDrawing import MolDrawing, DrawingOptions #Only needed if modifying defaults
DrawingOptions.bondLineWidth=1.8
IPythonConsole.ipython_useSVG=True
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.warning')
print(rdkit.__version__)
# misc
from tqdm.notebook import tqdm
from math import pi, sqrt
```

2020.03.2

In [8]:

```
N = 500
p = 100
A = np.random.normal(size=(N, p))
```

In [9]:

```
#collapse-hide
mean = np.mean(A, axis=0)
std = np.std(A, axis=0)
A = (A - mean) / std
C = np.dot(A.T, A) / A.shape[0] # correlation matrix
w, v = np.linalg.eig(C) # w = eigenvalues, v = eigenvectors
gamma = A.shape[1] / A.shape[0]
gamma_p = (1 + sqrt(gamma))**2
gamma_m = (1 - sqrt(gamma))**2
rho = lambda x: sqrt(np.clip(gamma_p - x, 0, None) * np.clip(x - gamma_m, 0, None)) / (2*pi*gamma*x)
x = np.arange(0., np.max(np.real(w)), 0.1)
y = np.array([rho(_) for _ in x])
plt.hist(np.real(w), density=True)
plt.plot(x, y)
plt.xlabel('Eigenvalue')
plt.ylabel('Prob.')
plt.show()
```

We can see the eigenvalue distribution from a random matrix (blue bar) follows the MP distribution (yellow curve). Now, if the input matrix contains significant correlation, those eigenvalues correspond to the correlation will deviates from the MP distribution.

Let's take molecular fingerprint data and perform the same analysis. If we select a set of compounds that binds to a particular receptor, and the ligands have similar chemical features, this will be reflected in the molecular fingerprint. The fingerprint vectors will be no longer random, and some eigenvalues will rise above the MP distribution.

The author used data from ChEMBL database, but I'll use data from DUD-E database. DUD-E database contains sets of active compounds as well as inactive compounds per target and widely used in a benchmark for drug activity classification. The inactive compounds are not meant to be "true" inactive, but they are synthetically selected to match physico-chemical property yet having different 3D geometry.

There was a recent publication where 3D deep neural network algorithm for scoring/classification actually learned the features in the dataset, so I thought it would be interesting to actually see this effect.

In [10]:

```
s = Chem.SmilesMolSupplier('files/dude/ampc/actives_final.ism')
mols_active = [m for m in s]
s = Chem.SmilesMolSupplier('files/dude/ampc/decoys_final.ism')
mols_decoy = [m for m in s]
print(len(mols_active), len(mols_decoy))
```

535 35749

In [11]:

```
#collapse-hide
p = 1024
fps_active = [np.array(AllChem.GetMorganFingerprintAsBitVect(m,3,p)) for m in mols_active]
fps_decoy = [np.array(AllChem.GetMorganFingerprintAsBitVect(m,3,p)) for m in mols_decoy]
plt.figure(figsize=(12, 8))
plt.imshow(fps_active, interpolation=None, cmap=plt.cm.gray)
plt.ylabel('Ligand')
plt.xlabel('Fingerprint')
plt.show()
```

In [12]:

```
A = np.array(fps_active)
# remove columns that are all zeros
mean = np.mean(A, axis=0)
std = np.std(A, axis=0)
column_indices = ~((mean == 0) & (std == 0))
A_reduced = A[:, column_indices]
# normalize
mean = np.mean(A_reduced, axis=0)
std = np.std(A_reduced, axis=0)
A_normed = (A_reduced - mean) / std
# correlation matrix
C = np.dot(A_normed.T, A_normed) / A.shape[0] # correlation matrix
w, v = np.linalg.eig(C) # w = eigenvalue, v = eigenvetors
```

We are ready to plot the eigenvalue distribution and MP distribution.

In [13]:

```
#collapse-hide
def MP_distribution(N, p):
"""return MP distribution function based on the shape of input matrix"""
gamma = p / N
gamma_p = (1 + sqrt(gamma))**2
gamma_m = (1 - sqrt(gamma))**2
rho = lambda x: sqrt(np.clip(gamma_p - x, 0, None) * np.clip(x - gamma_m, 0, None)) / (2*pi*gamma*x)
return rho
plt.hist(np.real(w), bins=50, density=True) # plot eigenvalue distribution
N, p = A_normed.shape
rho = MP_distribution(N, p) # MP distribution
w = np.real(w) # real values of eigenvalues
x = np.arange(0., np.max(w), 0.1)
y = np.array([rho(_) for _ in x])
plt.plot(x, y) # plot MP distribution
plt.ylabel('Probability')
plt.xlabel('Eigenvalues')
plt.ylim(0, 0.2)
plt.show()
```

In [14]:

```
def MP_threshold(N, p):
"""return MP threshold based on the shape of input matrix"""
gamma = p / N
gamma_p = (1 + sqrt(gamma))**2
return gamma_p
th = MP_threshold(N, p)
indices = np.argwhere(w > th).flatten()
print(len(indices))
```

43

There are total of 43 eigenvalues above the significance threshold. In other words, the eigenvectors corresponds with these 43 eigenvalues represents the chemical subspace that facilitate the binding of this receptor.

An "ideal" ligands will have fingerprint close to this chemical subspace we just discovered and the ligands lacks these fingerprint features, will lie farther away from this subspace.

Define a subspace consists of $m$ eigenvectors discovered above; $\mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_m)$. For a new ligand with fingerprint vector $\mathbf{u}$, we can compute the projection of this vector onto the subspace $\mathbf{V}$. The projection is defined as

$$\mathbf{u}_p = \sum_i^m (\mathbf{v}_i \cdot \mathbf{u}) \mathbf{v}_i$$The distance between the $\mathbf{u}$ and the projection vector $\mathbf{u}_p$, defined as $ \left\Vert \mathbf{u} - \mathbf{u}_p \right\Vert$, is the measure of similarity between the new ligands and the ligands known to binds to the receptor. If we compute similarity measure using known active compounds and inactive compounds, then we should be able to separate them.

In [15]:

```
V = np.real(np.array([v[i] for i in indices]))
A_active = np.array(fps_active)[:, column_indices]
A_decoy = np.array(fps_decoy)[:, column_indices]
# normalize
u_active = (A_active - mean) / std
u_decoy = (A_decoy - mean) / std
# projection
u_p_active = np.dot(np.einsum('ij,kj->ki', V, u_active), V)
u_p_decoy = np.dot(np.einsum('ij,kj->ki', V, u_decoy), V)
# distance
dist_active = np.linalg.norm(u_p_active - u_active, axis=1)
dist_decoy = np.linalg.norm(u_p_decoy - u_decoy, axis=1)
```

In [16]:

```
#collapse-hide
# active
prob_active, bins_active = np.histogram(dist_active, bins=12, density=True)
x_active = (bins_active[1:] + bins_active[:-1]) / 2
plt.plot(x_active, prob_active)
# decoy
prob_decoy, bins_decoy = np.histogram(dist_decoy, bins=25, density=True)
x_decoy = (bins_decoy[1:] + bins_decoy[:-1]) / 2
plt.plot(x_decoy, prob_decoy)
plt.xlabel('Distance from V')
plt.ylabel('Probability')
plt.show()
```