#!/usr/bin/env python
# coding: utf-8

# Pat Walters, the writer of the cheminformatics blog [Practical Cheminformatics](https://practicalcheminformatics.blogspot.com), has a [repository](https://github.com/PatWalters/datafiles) for datasets used in his blog posts. As several datasets are based on ChEMBL, there is a benefit to building reproducible workflows for re-generating them using `chembl-downloader`.
# 
# In this notebook, we'll look at a dataset of the small molecule inhibitors of [5-lipoxygenase activating protein (CHEMBL4550)](https://bioregistry.io/chembl:CHEMBL4550). It's available at as a `*.smi` file, which is a CSV file with SMILES strings in the first column arbitarary, application-specific content. In this case, the remaining two columns are a ChEMBL compound identifier and a [pChEMBL](https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions#what-is-pchembl) value. The data for this example can be found at [https://github.com/PatWalters/datafiles/raw/main/CHEMBL4550.smi](https://github.com/PatWalters/datafiles/raw/main/CHEMBL4550.smi).

# In[1]:


import sys
import time
from collections import defaultdict

import matplotlib.pyplot as plt
import matplotlib_inline
import pandas as pd
import seaborn as sns
from scipy import stats
from rdkit import Chem
from tqdm.auto import tqdm
from sklearn.decomposition import PCA

import chembl_downloader
import chembl_downloader.contrib


# In[2]:


matplotlib_inline.backend_inline.set_matplotlib_formats("svg")


# In[3]:


print(sys.version)


# In[4]:


print(time.asctime())


# ## Loading Walters' File

# In[5]:


walters_df = pd.read_csv(
    "https://github.com/PatWalters/datafiles/raw/main/CHEMBL4550.smi",
    header=None,
    names=["canonical_smiles", "molecule_chembl_id", "pchembl_value"],
).sort_values("molecule_chembl_id")
walters_df


# In[6]:


walters_counts = walters_df.groupby("molecule_chembl_id").count()["pchembl_value"]
walters_duplicates = set(walters_counts[walters_counts > 1].index)

print(f"There are {len(walters_duplicates):,} duplicate compounds in the Walters data file.")


# ## Rebuilding with `chembl-downloader`

# In[7]:


latest_version = chembl_downloader.latest()


# In[8]:


df = chembl_downloader.contrib.get_target_smi_df("CHEMBL4550", aggregate=None).sort_values(
    "molecule_chembl_id"
)
df


# In[9]:


counts = df.groupby("molecule_chembl_id").count()["pchembl_value"]
duplicates = set(counts[counts > 1].index)

print(f"There are {len(duplicates):,} duplicate compounds in the new data file.")


# ## Choosing an Appropriate Aggregation

# Because this dataset covers several assays, there are multiple values for each. The following cell shows the intersecting chemicals between the Walters dataset and the new dataset (where duplicates exist). It compares the arithmetic mean (i.e., the average) as well as the geometric mean to see which one is correct to regenerate the file.

# In[10]:


comparison_df = walters_df[walters_df["molecule_chembl_id"].isin(duplicates)].copy()

comparison_idx = df["molecule_chembl_id"].isin(set(walters_df["molecule_chembl_id"])) & df[
    "molecule_chembl_id"
].isin(duplicates)
comparison_df["new_mean"] = (
    df[comparison_idx].groupby("molecule_chembl_id").mean(numeric_only=True).values
)
comparison_df["new_gmean"] = (
    df[comparison_idx].groupby("molecule_chembl_id").agg(stats.gmean).values.round(3)
)

comparison_df


# Even though they're pretty close, it appears the arithmetic mean was used.

# In[11]:


agg_df = (
    df.groupby(["canonical_smiles", "molecule_chembl_id"])
    .mean(numeric_only=True)["pchembl_value"]
    .reset_index()
)
agg_df


# ## Summary of Value Added
# 
# The following plots a histogram of pChEMBL values of the two datasets The newer version of ChEMBL provides a greater variety of pChEMBL scores as well as a total larger number of compounds.

# In[12]:


walters_df["source"] = f"Walters        ({len(walters_df.index):,} mols.)"
agg_df["source"] = f"ChEMBL {latest_version}  ({len(agg_df.index):,} mols.)"
concat_df = pd.concat([walters_df, agg_df])

sns.histplot(data=concat_df, x="pchembl_value", hue="source")
plt.show()


# ## Chemical Space Added
# 
# Combining topological fingerprints and a dimensionality reduction technique, it can be seen that there's added chemical space covered by newer data (likely from new assays).

# In[13]:


fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(smiles)) for smiles in agg_df.canonical_smiles]
wp_fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(smiles)) for smiles in walters_df.canonical_smiles]


# In[14]:


pca = PCA(2)
reduced_fps = pca.fit_transform(fps)
wp_reduced_fps = pca.transform(wp_fps)


# In[15]:


fig, ax = plt.subplots()

sns.scatterplot(
    x=reduced_fps[:, 0],
    y=reduced_fps[:, 1],
    ax=ax,
    label=f"ChEMBL {latest_version}",
    alpha=0.5,
)
sns.scatterplot(
    x=wp_reduced_fps[:, 0],
    y=wp_reduced_fps[:, 1],
    ax=ax,
    label="Walters",
    alpha=0.5,
)
plt.show()