Review Random Matrix Theory
I came across this paper about denoising activity data using random matrix theory (RMT). I had never heard of RMT before, and it seemed like an interesting approach, so I thought I would explore the method by trying it myself.
Chemicals are often represented as so-called chemical fingerprints. A chemical fingerprint is a bit vector in which each bit represents the presence or absence of a certain chemical group. We can assume that the presence of a particular set of chemical groups is correlated with ligand binding. Using the RMT framework, we can remove noise from the fingerprints and aim to obtain the set of fingerprint features that are significant.
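To make the representation concrete, here's a minimal sketch (my own addition, using aspirin purely as an arbitrary example molecule) that turns a SMILES string into a 1024-bit Morgan fingerprint with RDKit:
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
# aspirin, chosen only as an arbitrary example molecule
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)
bits = np.array(fp)                 # 0/1 vector of length 1024
print(bits.shape, int(bits.sum()))  # vector size and the number of bits set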
Let's dive into how it works.
RMT approaches this by computing the correlation matrix of the fingerprint bits and examining its eigenvalues. Bits with insignificant correlation contribute small eigenvalues, whereas sets of bits with significant correlation produce large eigenvalues.
Let's suppose there are $N$ fingerprint vectors ($\mathbf{f}$), each of size $p$, arranged as the rows of a matrix $A = [\mathbf{f}_1, \mathbf{f}_2, \cdots, \mathbf{f}_N]^T \in \mathbb{R}^{N \times p}$. The matrix is normalized by subtracting each column's mean and dividing each column by its standard deviation. The $p \times p$ correlation matrix is then constructed as $C = A^T A / N$.
If the entries of the matrix $A$ are i.i.d., then the eigenvalues of $C$ follow the Marchenko–Pastur (MP) distribution.
$$\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2 \pi \gamma \lambda}$$

where $\lambda_{\pm} = \left(1 \pm \sqrt\gamma\right)^2$ and $\gamma = p / N$. Let's numerically validate this.
#collapse-hide
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
from io import BytesIO
import pandas as pd
import numpy as np
from IPython.display import SVG
# RDKit
import rdkit
from rdkit.Chem import PandasTools
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules
from rdkit.Chem import Draw
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw.MolDrawing import MolDrawing, DrawingOptions #Only needed if modifying defaults
DrawingOptions.bondLineWidth=1.8
IPythonConsole.ipython_useSVG=True
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.warning')
print(rdkit.__version__)
# misc
from tqdm.notebook import tqdm
from math import pi, sqrt
2020.03.2
N = 500
p = 100
A = np.random.normal(size=(N, p))
Here's a matrix $A$ with randomly drawn values. Subtract the column means and divide by the column standard deviations so that each column has zero mean and unit variance. According to RMT, the eigenvalue distribution of the correlation matrix of $A$ will follow the MP distribution.
#collapse-hide
mean = np.mean(A, axis=0)
std = np.std(A, axis=0)
A = (A - mean) / std
C = np.dot(A.T, A) / A.shape[0] # correlation matrix
w, v = np.linalg.eig(C) # w = eigenvalues, v = eigenvectors
gamma = A.shape[1] / A.shape[0]
gamma_p = (1 + sqrt(gamma))**2
gamma_m = (1 - sqrt(gamma))**2
rho = lambda x: sqrt(np.clip(gamma_p - x, 0, None) * np.clip(x - gamma_m, 0, None)) / (2*pi*gamma*x)
x = np.arange(0.01, np.max(np.real(w)), 0.1)  # start just above zero to avoid division by zero in rho
y = np.array([rho(_) for _ in x])
plt.hist(np.real(w), density=True)
plt.plot(x, y)
plt.xlabel('Eigenvalue')
plt.ylabel('Prob.')
plt.show()
We can see that the eigenvalue distribution of a random matrix (histogram) follows the MP distribution (curve). Now, if the input matrix contains significant correlations, the eigenvalues corresponding to those correlations will deviate from the MP distribution.
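Before moving to real data, here's a quick sketch (my own addition) that plants a correlated block in an otherwise random matrix and checks that its leading eigenvalue escapes the MP edge $\lambda_+$:
# add a shared latent factor to the first 20 columns of a random 500 x 100 matrix
B = np.random.normal(size=(500, 100))
factor = np.random.normal(size=500)
B[:, :20] += 2.0 * factor[:, None]  # these columns are now strongly correlated
B = (B - B.mean(axis=0)) / B.std(axis=0)
C_spiked = np.dot(B.T, B) / B.shape[0]
w_spiked = np.real(np.linalg.eigvals(C_spiked))
lambda_plus = (1 + sqrt(100 / 500))**2
print(f'largest eigenvalue: {w_spiked.max():.2f} vs MP edge: {lambda_plus:.2f}')
The correlated block produces a leading eigenvalue far above $\lambda_+$, while the rest of the spectrum stays inside the MP bulk.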
Let's take molecular fingerprint data and perform the same analysis. If we select a set of compounds that bind to a particular receptor, and the ligands share similar chemical features, this will be reflected in their molecular fingerprints. The fingerprint vectors will no longer be random, and some eigenvalues will rise above the MP distribution.
The authors used data from the ChEMBL database, but I'll use data from the DUD-E database instead. DUD-E contains a set of active compounds and a set of decoy (inactive) compounds for each target and is widely used as a benchmark for drug activity classification. The decoys are not necessarily "true" inactives; they are selected to match the physico-chemical properties of the actives while having dissimilar topology.
There was a recent publication showing that a 3D deep neural network for scoring/classification had actually learned features of the dataset itself, so I thought it would be interesting to see this effect directly.
s = Chem.SmilesMolSupplier('files/dude/ampc/actives_final.ism')
mols_active = [m for m in s if m is not None]  # skip entries that fail to parse
s = Chem.SmilesMolSupplier('files/dude/ampc/decoys_final.ism')
mols_decoy = [m for m in s if m is not None]  # skip entries that fail to parse
print(len(mols_active), len(mols_decoy))
535 35749
I'll use the data for the target AMPC in this post. There are a total of 535 active and 35,749 decoy compounds in the dataset. Let's compute Morgan fingerprints with radius 3 and 1024 bits; a Morgan fingerprint with radius 3 is equivalent to the ECFP6 fingerprint.
#collapse-hide
p = 1024
fps_active = [np.array(AllChem.GetMorganFingerprintAsBitVect(m,3,p)) for m in mols_active]
fps_decoy = [np.array(AllChem.GetMorganFingerprintAsBitVect(m,3,p)) for m in mols_decoy]
plt.figure(figsize=(12, 8))
plt.imshow(fps_active, interpolation='none', cmap=plt.cm.gray, aspect='auto')  # 'none' disables interpolation so individual bits stay crisp
plt.ylabel('Ligand')
plt.xlabel('Fingerprint')
plt.show()
The plot of fingerprints clearly shows that some bits appear more frequently than others. Let's normalize the fingerprint vectors, compute the correlation matrix, and finally take the eigendecomposition of the correlation matrix.
A = np.array(fps_active)
# remove constant columns (bits that are always 0 or always 1); they carry no
# information and their standard deviation of zero would break normalization
std = np.std(A, axis=0)
column_indices = std != 0
A_reduced = A[:, column_indices]
# normalize
mean = np.mean(A_reduced, axis=0)
std = np.std(A_reduced, axis=0)
A_normed = (A_reduced - mean) / std
# correlation matrix
C = np.dot(A_normed.T, A_normed) / A.shape[0] # correlation matrix
w, v = np.linalg.eig(C) # w = eigenvalues, v = eigenvectors
We are now ready to plot the eigenvalue distribution together with the MP distribution.
#collapse-hide
def MP_distribution(N, p):
    """Return the MP distribution density based on the shape of the input matrix."""
    gamma = p / N
    gamma_p = (1 + sqrt(gamma))**2
    gamma_m = (1 - sqrt(gamma))**2
    rho = lambda x: sqrt(np.clip(gamma_p - x, 0, None) * np.clip(x - gamma_m, 0, None)) / (2*pi*gamma*x)
    return rho
plt.hist(np.real(w), bins=50, density=True) # plot eigenvalue distribution
N, p = A_normed.shape
rho = MP_distribution(N, p) # MP distribution
w = np.real(w) # real values of eigenvalues
x = np.arange(0.01, np.max(w), 0.1)  # start just above zero to avoid division by zero in rho
y = np.array([rho(_) for _ in x])
plt.plot(x, y) # plot MP distribution
plt.ylabel('Probability')
plt.xlabel('Eigenvalues')
plt.ylim(0, 0.2)
plt.show()
Interesting! Several eigenvalues deviate from the MP distribution. RMT suggests that the eigenvalues greater than $\left( 1 + \sqrt \gamma \right)^2$ are significant.
def MP_threshold(N, p):
    """Return the MP significance threshold based on the shape of the input matrix."""
    gamma = p / N
    gamma_p = (1 + sqrt(gamma))**2
    return gamma_p
th = MP_threshold(N, p)
indices = np.argwhere(w > th).flatten()
print(len(indices))
43
There are a total of 43 eigenvalues above the significance threshold. In other words, the eigenvectors corresponding to these 43 eigenvalues represent the chemical subspace that facilitates binding to this receptor.
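As a side note (my own addition, not in the original paper), we can peek at what the leading eigenvector encodes by listing the fingerprint bits with the largest weights; these bit indices could in principle be traced back to substructures via the bitInfo argument of GetMorganFingerprintAsBitVect, which I won't do here.
top = np.argmax(w)                          # index of the largest eigenvalue
loading = np.real(v[:, top])                # eigenvectors are the columns of v
orig_bits = np.flatnonzero(column_indices)  # map reduced columns back to original bit indices
top_bits = orig_bits[np.argsort(-np.abs(loading))[:10]]
print('fingerprint bits with the largest weights:', top_bits)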
An "ideal" ligands will have fingerprint close to this chemical subspace we just discovered and the ligands lacks these fingerprint features, will lie farther away from this subspace.
Define a subspace consisting of the $m$ eigenvectors discovered above: $\mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_m)$. For a new ligand with fingerprint vector $\mathbf{u}$, we can compute the projection of this vector onto the subspace $\mathbf{V}$. The projection is defined as
$$\mathbf{u}_p = \sum_{i=1}^m (\mathbf{v}_i \cdot \mathbf{u}) \, \mathbf{v}_i$$

The distance between $\mathbf{u}$ and its projection $\mathbf{u}_p$, defined as $\left\Vert \mathbf{u} - \mathbf{u}_p \right\Vert$, measures how similar a new ligand is to the ligands known to bind the receptor. If we compute this measure for known active compounds and for inactive compounds, we should be able to separate the two groups.
V = np.real(np.array([v[:, i] for i in indices]))  # eigenvectors are the columns of v
A_active = np.array(fps_active)[:, column_indices]
A_decoy = np.array(fps_decoy)[:, column_indices]
# normalize using the mean and std computed from the active set above
u_active = (A_active - mean) / std
u_decoy = (A_decoy - mean) / std
# projection
u_p_active = np.dot(np.einsum('ij,kj->ki', V, u_active), V)
u_p_decoy = np.dot(np.einsum('ij,kj->ki', V, u_decoy), V)
# distance
dist_active = np.linalg.norm(u_p_active - u_active, axis=1)
dist_decoy = np.linalg.norm(u_p_decoy - u_decoy, axis=1)
#collapse-hide
# active
prob_active, bins_active = np.histogram(dist_active, bins=12, density=True)
x_active = (bins_active[1:] + bins_active[:-1]) / 2
plt.plot(x_active, prob_active, label='active')
# decoy
prob_decoy, bins_decoy = np.histogram(dist_decoy, bins=25, density=True)
x_decoy = (bins_decoy[1:] + bins_decoy[:-1]) / 2
plt.plot(x_decoy, prob_decoy, label='decoy')
plt.xlabel('Distance from V')
plt.ylabel('Probability')
plt.legend()
plt.show()
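To put a number on the separation (again my own addition), a rank-based Mann-Whitney estimate of the ROC AUC can be computed from the two distance distributions, treating a smaller distance as a more active-like score:
scores = np.concatenate([dist_active, dist_decoy])
ranks = np.empty(len(scores))
ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)  # rank by ascending distance
n_act, n_dec = len(dist_active), len(dist_decoy)
# AUC = P(distance of a random active < distance of a random decoy); ties ignored
auc = (ranks[n_act:].sum() - n_dec * (n_dec + 1) / 2) / (n_act * n_dec)
print(f'AUC: {auc:.3f}')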