Quickstart¶

First let import the modules to work with. Some functionalities of MolDrug may not be available on pypi or conda yet. This could be because the new version is not yet release, but in brief it is going to be. You could install directly from the repo if some problem pops up. Just paste the following in a code cell:

! pip install git+https://github.com/ale94mleon/MolDrug.git@main

In [1]:

from rdkit import Chem
from moldrug import utils, fitness
from moldrug.data import receptors, ligands, boxes
import tempfile, os, gzip, shutil, requests
from multiprocessing import cpu_count

We will use some test examples from moldrug.data module.

In [2]:

# Creating a temporal directory
tmp_path = tempfile.TemporaryDirectory()
# Creating receptors files
r_x0161_file = os.path.join(tmp_path.name, 'r_x0161.pdbqt')
with open(r_x0161_file, 'w') as r:
    r.write(receptors.r_x0161)

Now let's download the crem data base. Here we are downloading the smaller one. But consider to visit CReM for more information about how to use and create the data base of fragments.

In [3]:

url = "http://www.qsar4u.com/files/cremdb/replacements02_sc2.db.gz"
r = requests.get(url, allow_redirects=True)
crem_dbgz_path = os.path.join(tmp_path.name,'crem.db.gz')
crem_db_path = os.path.join(tmp_path.name,'crem.db')
open(crem_dbgz_path, 'wb').write(r.content)
with gzip.open(crem_dbgz_path, 'rb') as f_in:
    with open(crem_db_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

Now let's initialize the GA class.

In [4]:

maxiter = 3
popsize = 20
njobs = 2

out = utils.GA(
    seed_mol=Chem.MolFromSmiles(ligands.r_x0161),
    maxiter=maxiter,
    popsize=popsize,
    crem_db_path = crem_db_path,
    pc = 1,
    get_similar = False,
    mutate_crem_kwargs = {
        'radius':3,
        'min_size':1,
        'max_size':8,
        'min_inc':-5,
        'max_inc':3,
        'ncores':cpu_count(),
    },
    costfunc = fitness.Cost,
    costfunc_kwargs = {
        'vina_executable': 'vina',
        'receptor_path': r_x0161_file,
        'boxcenter' : boxes.r_x0161["A"]['boxcenter'] ,
        'boxsize': boxes.r_x0161["A"]['boxsize'],
        'exhaustiveness': 4,
        'ncores': int(cpu_count() / njobs),
        'num_modes': 1,
    },
    save_pop_every_gen = 20,
    deffnm = os.path.join(tmp_path.name, 'pop_test_single_receptor'),
    AddHs=True
)

Then let's call the class

In [5]:

out(njobs = njobs)


Creating the first population with 20 members:

100%|██████████| 20/20 [00:38<00:00,  1.92s/it]

Initial Population: Best individual: Individual(idx = 15, smiles = NS(=O)(=O)c1ccc(-n2c(=O)[nH][nH]c2=O)cc1, cost = 0.8145165194871083)
File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad!

Evaluating generation 1 / 3:

100%|██████████| 20/20 [00:47<00:00,  2.39s/it]

Generation 1: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).

Evaluating generation 2 / 3:

100%|██████████| 20/20 [00:54<00:00,  2.72s/it]

Generation 2: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).

Evaluating generation 3 / 3:

100%|██████████| 20/20 [00:52<00:00,  2.60s/it]

File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad!
Generation 3: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

The simulation finished successfully after 3 generations with a population of 20 individuals. A total number of 80 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0)
Final Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627)
The cost function droped in 0.27968089635473725 units.

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

'__call__'  257648.92 ms

This is a very silly example. In real life, you should increase the population size and the number of generation. In this way we explore in a better way the chemical space. The GA class has some nice methods that give you an idea of the obtained results.

In [6]:

out.to_dataframe().tail()

Out[6]:

	smiles	pdbqt	cost	idx	qed	sa_score	vina_score
75	[H]c1c([H])c(-n2c(C([H])([H])[H])c([H])c([H])c...	MODEL 1\nREMARK VINA RESULT: -6.8 0....	1.000000	75	0.819081	7.446127	-6.8
76	[H]OC([H])([H])n1nnn(-c2c([H])c([H])c(C(=O)OC(...	MODEL 1\nREMARK VINA RESULT: -6.2 0....	0.838417	76	0.644502	6.395654	-6.2
77	[H]c1c([H])c(S(=O)(=O)C([H])([H])C(=O)N(C([H])...	MODEL 1\nREMARK VINA RESULT: -5.9 0....	1.000000	77	0.781446	6.987495	-5.9
78	[H]OC(=O)C([H])(O[H])C(=C([H])[H])c1c([H])c([H...	MODEL 1\nREMARK VINA RESULT: -5.8 0....	1.000000	78	0.759651	7.469266	-5.8
79	[H]c1c([H])c(C([H])([H])C([H])([H])C(=O)N([H])...	MODEL 1\nREMARK VINA RESULT: -6.4 0....	0.844438	79	0.818279	6.774127	-6.4

You could save the entire class for future in a compressed or uncompressed pickle file. Here I just saved the file in the temporal directory, you could change the path to your current working directory.

In [7]:

out.pickle(os.path.join(tmp_path.name, f"NumGens_{out.NumGens}_PopSize_{popsize}"), compress=True)

Could be that after a long simulation we would like to perform a different searching strategy over the last population. This is really simple, we could "initialize" again the out instance with a different set of parameters. Let say that we would like to be close to the last solutions. Then we could use the parameters: min_size=0, max_size=1, min_inc=-1, max_inc=1.This flavor will add, delate mutate heavy atoms or change hydrogens to heavy atoms.

In [8]:

out.maxiter = 5
out.mutate_crem_kwargs = {
        'radius':3,
        'min_size':0,
        'max_size':1,
        'min_inc':-1,
        'max_inc':1,
        'ncores':cpu_count(),
    }

And run again the class

In [9]:

out(njobs = njobs)

File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad!
Evaluating generation 4 / 8:

100%|██████████| 20/20 [00:49<00:00,  2.49s/it]

Generation 4: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).

Evaluating generation 5 / 8:

100%|██████████| 19/19 [00:54<00:00,  2.85s/it]

Generation 5: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).

Evaluating generation 6 / 8:

100%|██████████| 17/17 [00:48<00:00,  2.85s/it]

Generation 6: Best Individual: Individual(idx = 132, smiles = Cc1cc(CCC(N)=O)ccc1N1CCNCC1, cost = 0.6991101572359109).

Evaluating generation 7 / 8:

100%|██████████| 16/16 [00:49<00:00,  3.09s/it]

Generation 7: Best Individual: Individual(idx = 132, smiles = Cc1cc(CCC(N)=O)ccc1N1CCNCC1, cost = 0.6991101572359109).

Evaluating generation 8 / 8:

100%|██████████| 18/18 [00:45<00:00,  2.53s/it]

File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad!
Generation 8: Best Individual: Individual(idx = 164, smiles = N#CNS(=O)(=O)c1ccc(-n2c(=O)[nH][nH]c2=O)cc1Br, cost = 0.6683728823567392).


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

The simulation finished successfully after 8 generations with a population of 20 individuals. A total number of 170 Individuals were seen during the simulation.
Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0)
Final Individual: Individual(idx = 164, smiles = N#CNS(=O)(=O)c1ccc(-n2c(=O)[nH][nH]c2=O)cc1Br, cost = 0.6683728823567392)
The cost function droped in 0.3316271176432608 units.

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

'__call__'  266955.44 ms

As you can see the GA class is updated rather that replaced. This is a perfect feature for rerun with different ideas. As you see the number of generations is also updated as well the number of calls.

In [10]:

print(out.NumGens, out.NumCalls)

8 2

The same idea is for the rest of the methods and atributtes

In [11]:

out.to_dataframe().tail()

Out[11]:

	smiles	pdbqt	cost	idx	qed	sa_score	vina_score
165	[H]c1c(Br)c(OC([H])([H])C(=O)C([H])(Cl)Cl)c(OC...	MODEL 1\nREMARK VINA RESULT: -6.7 0....	0.730751	165	0.678321	6.247825	-6.7
166	[H]Oc1c([H])c(-c2noc(N([H])[H])c2[H])c([H])c(O...	MODEL 1\nREMARK VINA RESULT: -6.5 0....	0.795321	166	0.682974	6.541091	-6.5
167	[H]c1c([H])c(S(=O)(=O)N(Cl)Cl)c([H])c([H])c1-n...	MODEL 1\nREMARK VINA RESULT: -6.4 0....	0.740191	167	0.783261	5.947768	-6.4
168	[H]c1c([H])c(S([H])(=O)=O)c([H])c([H])c1-n1c(=...	MODEL 1\nREMARK VINA RESULT: -6.1 0....	0.893223	168	0.575643	6.600720	-6.1
169	[H]c1c([H])c(C([H])([H])OC(=O)N([H])[H])c([H])...	MODEL 1\nREMARK VINA RESULT: -6.5 0....	0.803898	169	0.641205	6.565248	-6.5

This is the best solution of this simulation (in your case you could get a different one):

In [12]:

Chem.MolFromSmiles(out.pop[0].smiles)

Out[12]:

And here is the input

In [13]:

from rdkit import Chem
Chem.MolFromSmiles("COC(=O)C=1C=CC(=CC1)S(=O)(=O)N")

Out[13]:

Printing some information

In [14]:

import matplotlib.pyplot as plt
from moldrug import utils
import pandas as pd
sascorer = utils.import_sascorer()


fig, ax = plt.subplots(nrows = 6, figsize = (16,20))

ax[0].plot(range(len(out.SawIndividuals)),[Individual.qed for Individual in out.SawIndividuals], 'o')
ax[0].set(title = 'qed')

ax[1].plot(range(len(out.SawIndividuals)),[Individual.sa_score for Individual in out.SawIndividuals], 'o')
ax[1].set(title = 'sa_score')

ax[2].plot(range(len(out.SawIndividuals)),[Individual.vina_score for Individual in out.SawIndividuals], 'o')
ax[2].set(title = 'vina_score')

ax[3].plot(range(len(out.SawIndividuals)),[Individual.cost for Individual in out.SawIndividuals], 'o')
ax[3].set(title = 'cost')

ax[4].plot(range(len(out.avg_cost)),out.avg_cost, 'o')
ax[4].set(title = 'avg_cost')

ax[5].plot(range(len(out.bestcost)),out.bestcost, 'o')
ax[5].set(title = 'bestcost')


print(f"Initial vina score: {out.InitIndividual.vina_score}. Final vina score: {out.pop[0].vina_score}")
print(f"sascorer of the best Individual: {sascorer.calculateScore(out.pop[0].mol)}")
print(f"QED of the best Individual: {fitness.QED.weights_mean(out.pop[0].mol)}")

pd.DataFrame(utils.lipinski_profile(out.pop[0].mol), index=['value']).T

Initial vina score: -5.2. Final vina score: -7.2
sascorer of the best Individual: 5.796739279673451
QED of the best Individual: 0.4940340217312014

Out[14]:

	value
NumHAcceptors	6.00000
NumHDonors	3.00000
wt	360.14900
MLogP	-0.62422
NumRotatableBonds	3.00000
TPSA	140.61000
FractionCSP3	0.00000
HeavyAtomCount	20.00000
NHOHCount	3.00000
NOCount	9.00000
NumAliphaticCarbocycles	0.00000
NumAliphaticHeterocycles	0.00000
NumAliphaticRings	0.00000
NumAromaticCarbocycles	1.00000
NumAromaticHeterocycles	1.00000
NumAromaticRings	2.00000
NumHeteroatoms	11.00000
NumSaturatedCarbocycles	0.00000
NumSaturatedHeterocycles	0.00000
NumSaturatedRings	0.00000
RingCount	2.00000