First let import the modules to work with. Some functionalities of MolDrug may not be available on pypi or conda yet. This could be because the new version is not yet release, but in brief it is going to be. You could install directly from the repo if some problem pops up. Just paste the following in a code cell:
! pip install git+https://github.com/ale94mleon/MolDrug.git@main
from rdkit import Chem
from moldrug import utils, fitness
from moldrug.data import receptors, ligands, boxes
import tempfile, os, gzip, shutil, requests
from multiprocessing import cpu_count
We will use some test examples from moldrug.data
module.
# Creating a temporal directory
tmp_path = tempfile.TemporaryDirectory()
# Creating receptors files
r_x0161_file = os.path.join(tmp_path.name, 'r_x0161.pdbqt')
with open(r_x0161_file, 'w') as r:
r.write(receptors.r_x0161)
Now let's download the crem data base. Here we are downloading the smaller one. But consider to visit CReM for more information about how to use and create the data base of fragments.
url = "http://www.qsar4u.com/files/cremdb/replacements02_sc2.db.gz"
r = requests.get(url, allow_redirects=True)
crem_dbgz_path = os.path.join(tmp_path.name,'crem.db.gz')
crem_db_path = os.path.join(tmp_path.name,'crem.db')
open(crem_dbgz_path, 'wb').write(r.content)
with gzip.open(crem_dbgz_path, 'rb') as f_in:
with open(crem_db_path, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
Now let's initialize the GA
class.
maxiter = 3
popsize = 20
njobs = 2
out = utils.GA(
seed_mol=Chem.MolFromSmiles(ligands.r_x0161),
maxiter=maxiter,
popsize=popsize,
crem_db_path = crem_db_path,
pc = 1,
get_similar = False,
mutate_crem_kwargs = {
'radius':3,
'min_size':1,
'max_size':8,
'min_inc':-5,
'max_inc':3,
'ncores':cpu_count(),
},
costfunc = fitness.Cost,
costfunc_kwargs = {
'vina_executable': 'vina',
'receptor_path': r_x0161_file,
'boxcenter' : boxes.r_x0161["A"]['boxcenter'] ,
'boxsize': boxes.r_x0161["A"]['boxsize'],
'exhaustiveness': 4,
'ncores': int(cpu_count() / njobs),
'num_modes': 1,
},
save_pop_every_gen = 20,
deffnm = os.path.join(tmp_path.name, 'pop_test_single_receptor'),
AddHs=True
)
Then let's call the class
out(njobs = njobs)
Creating the first population with 20 members:
100%|██████████| 20/20 [00:38<00:00, 1.92s/it]
Initial Population: Best individual: Individual(idx = 15, smiles = NS(=O)(=O)c1ccc(-n2c(=O)[nH][nH]c2=O)cc1, cost = 0.8145165194871083) File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad!
Evaluating generation 1 / 3:
100%|██████████| 20/20 [00:47<00:00, 2.39s/it]
Generation 1: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).
Evaluating generation 2 / 3:
100%|██████████| 20/20 [00:54<00:00, 2.72s/it]
Generation 2: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).
Evaluating generation 3 / 3:
100%|██████████| 20/20 [00:52<00:00, 2.60s/it]
File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad! Generation 3: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627). =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ The simulation finished successfully after 3 generations with a population of 20 individuals. A total number of 80 Individuals were seen during the simulation. Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0) Final Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627) The cost function droped in 0.27968089635473725 units. =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ '__call__' 257648.92 ms
This is a very silly example. In real life, you should increase the population size and the number of generation. In this way we explore in a better way the chemical space. The GA
class has some nice methods that give you an idea of the obtained results.
out.to_dataframe().tail()
smiles | pdbqt | cost | idx | qed | sa_score | vina_score | |
---|---|---|---|---|---|---|---|
75 | [H]c1c([H])c(-n2c(C([H])([H])[H])c([H])c([H])c... | MODEL 1\nREMARK VINA RESULT: -6.8 0.... | 1.000000 | 75 | 0.819081 | 7.446127 | -6.8 |
76 | [H]OC([H])([H])n1nnn(-c2c([H])c([H])c(C(=O)OC(... | MODEL 1\nREMARK VINA RESULT: -6.2 0.... | 0.838417 | 76 | 0.644502 | 6.395654 | -6.2 |
77 | [H]c1c([H])c(S(=O)(=O)C([H])([H])C(=O)N(C([H])... | MODEL 1\nREMARK VINA RESULT: -5.9 0.... | 1.000000 | 77 | 0.781446 | 6.987495 | -5.9 |
78 | [H]OC(=O)C([H])(O[H])C(=C([H])[H])c1c([H])c([H... | MODEL 1\nREMARK VINA RESULT: -5.8 0.... | 1.000000 | 78 | 0.759651 | 7.469266 | -5.8 |
79 | [H]c1c([H])c(C([H])([H])C([H])([H])C(=O)N([H])... | MODEL 1\nREMARK VINA RESULT: -6.4 0.... | 0.844438 | 79 | 0.818279 | 6.774127 | -6.4 |
You could save the entire class for future in a compressed or uncompressed pickle file. Here I just saved the file in the temporal directory, you could change the path to your current working directory.
out.pickle(os.path.join(tmp_path.name, f"NumGens_{out.NumGens}_PopSize_{popsize}"), compress=True)
Could be that after a long simulation we would like to perform a different searching strategy over the last population. This is really simple, we could "initialize" again the out
instance with a different set of parameters. Let say that we would like to be close to the last solutions. Then we could use the parameters: min_size=0, max_size=1, min_inc=-1, max_inc=1
.This flavor will add, delate mutate heavy atoms or change hydrogens to heavy atoms.
out.maxiter = 5
out.mutate_crem_kwargs = {
'radius':3,
'min_size':0,
'max_size':1,
'min_inc':-1,
'max_inc':1,
'ncores':cpu_count(),
}
And run again the class
out(njobs = njobs)
File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad! Evaluating generation 4 / 8:
100%|██████████| 20/20 [00:49<00:00, 2.49s/it]
Generation 4: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).
Evaluating generation 5 / 8:
100%|██████████| 19/19 [00:54<00:00, 2.85s/it]
Generation 5: Best Individual: Individual(idx = 30, smiles = NC(=O)CCc1ccc(N2CCNCC2)cc1, cost = 0.7203191036452627).
Evaluating generation 6 / 8:
100%|██████████| 17/17 [00:48<00:00, 2.85s/it]
Generation 6: Best Individual: Individual(idx = 132, smiles = Cc1cc(CCC(N)=O)ccc1N1CCNCC1, cost = 0.6991101572359109).
Evaluating generation 7 / 8:
100%|██████████| 16/16 [00:49<00:00, 3.09s/it]
Generation 7: Best Individual: Individual(idx = 132, smiles = Cc1cc(CCC(N)=O)ccc1N1CCNCC1, cost = 0.6991101572359109).
Evaluating generation 8 / 8:
100%|██████████| 18/18 [00:45<00:00, 2.53s/it]
File /tmp/tmp6q8frrym/pop_test_single_receptor_pop.sdf was createad! Generation 8: Best Individual: Individual(idx = 164, smiles = N#CNS(=O)(=O)c1ccc(-n2c(=O)[nH][nH]c2=O)cc1Br, cost = 0.6683728823567392). =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ The simulation finished successfully after 8 generations with a population of 20 individuals. A total number of 170 Individuals were seen during the simulation. Initial Individual: Individual(idx = 0, smiles = COC(=O)c1ccc(S(N)(=O)=O)cc1, cost = 1.0) Final Individual: Individual(idx = 164, smiles = N#CNS(=O)(=O)c1ccc(-n2c(=O)[nH][nH]c2=O)cc1Br, cost = 0.6683728823567392) The cost function droped in 0.3316271176432608 units. =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ '__call__' 266955.44 ms
As you can see the GA
class is updated rather that replaced. This is a perfect feature for rerun with different ideas. As you see the number of generations is also updated as well the number of calls.
print(out.NumGens, out.NumCalls)
8 2
The same idea is for the rest of the methods and atributtes
out.to_dataframe().tail()
smiles | pdbqt | cost | idx | qed | sa_score | vina_score | |
---|---|---|---|---|---|---|---|
165 | [H]c1c(Br)c(OC([H])([H])C(=O)C([H])(Cl)Cl)c(OC... | MODEL 1\nREMARK VINA RESULT: -6.7 0.... | 0.730751 | 165 | 0.678321 | 6.247825 | -6.7 |
166 | [H]Oc1c([H])c(-c2noc(N([H])[H])c2[H])c([H])c(O... | MODEL 1\nREMARK VINA RESULT: -6.5 0.... | 0.795321 | 166 | 0.682974 | 6.541091 | -6.5 |
167 | [H]c1c([H])c(S(=O)(=O)N(Cl)Cl)c([H])c([H])c1-n... | MODEL 1\nREMARK VINA RESULT: -6.4 0.... | 0.740191 | 167 | 0.783261 | 5.947768 | -6.4 |
168 | [H]c1c([H])c(S([H])(=O)=O)c([H])c([H])c1-n1c(=... | MODEL 1\nREMARK VINA RESULT: -6.1 0.... | 0.893223 | 168 | 0.575643 | 6.600720 | -6.1 |
169 | [H]c1c([H])c(C([H])([H])OC(=O)N([H])[H])c([H])... | MODEL 1\nREMARK VINA RESULT: -6.5 0.... | 0.803898 | 169 | 0.641205 | 6.565248 | -6.5 |
This is the best solution of this simulation (in your case you could get a different one):
Chem.MolFromSmiles(out.pop[0].smiles)
And here is the input
from rdkit import Chem
Chem.MolFromSmiles("COC(=O)C=1C=CC(=CC1)S(=O)(=O)N")
Printing some information
import matplotlib.pyplot as plt
from moldrug import utils
import pandas as pd
sascorer = utils.import_sascorer()
fig, ax = plt.subplots(nrows = 6, figsize = (16,20))
ax[0].plot(range(len(out.SawIndividuals)),[Individual.qed for Individual in out.SawIndividuals], 'o')
ax[0].set(title = 'qed')
ax[1].plot(range(len(out.SawIndividuals)),[Individual.sa_score for Individual in out.SawIndividuals], 'o')
ax[1].set(title = 'sa_score')
ax[2].plot(range(len(out.SawIndividuals)),[Individual.vina_score for Individual in out.SawIndividuals], 'o')
ax[2].set(title = 'vina_score')
ax[3].plot(range(len(out.SawIndividuals)),[Individual.cost for Individual in out.SawIndividuals], 'o')
ax[3].set(title = 'cost')
ax[4].plot(range(len(out.avg_cost)),out.avg_cost, 'o')
ax[4].set(title = 'avg_cost')
ax[5].plot(range(len(out.bestcost)),out.bestcost, 'o')
ax[5].set(title = 'bestcost')
print(f"Initial vina score: {out.InitIndividual.vina_score}. Final vina score: {out.pop[0].vina_score}")
print(f"sascorer of the best Individual: {sascorer.calculateScore(out.pop[0].mol)}")
print(f"QED of the best Individual: {fitness.QED.weights_mean(out.pop[0].mol)}")
pd.DataFrame(utils.lipinski_profile(out.pop[0].mol), index=['value']).T
Initial vina score: -5.2. Final vina score: -7.2 sascorer of the best Individual: 5.796739279673451 QED of the best Individual: 0.4940340217312014
value | |
---|---|
NumHAcceptors | 6.00000 |
NumHDonors | 3.00000 |
wt | 360.14900 |
MLogP | -0.62422 |
NumRotatableBonds | 3.00000 |
TPSA | 140.61000 |
FractionCSP3 | 0.00000 |
HeavyAtomCount | 20.00000 |
NHOHCount | 3.00000 |
NOCount | 9.00000 |
NumAliphaticCarbocycles | 0.00000 |
NumAliphaticHeterocycles | 0.00000 |
NumAliphaticRings | 0.00000 |
NumAromaticCarbocycles | 1.00000 |
NumAromaticHeterocycles | 1.00000 |
NumAromaticRings | 2.00000 |
NumHeteroatoms | 11.00000 |
NumSaturatedCarbocycles | 0.00000 |
NumSaturatedHeterocycles | 0.00000 |
NumSaturatedRings | 0.00000 |
RingCount | 2.00000 |