This notebook serves as a practical guide to common questions users might have. Some functionalities of MolDrug may not be available on PyPi or Conda yet. This could be because the new version is not yet release, but in brief it is going to be. You could install directly from the repo if some problem pops up. Just paste the following in a code cell:
! pip install git+https://github.com/ale94mleon/MolDrug.git@main
Table of content
# Import section
from moldrug.data import ligands, boxes, receptor_pdbqt
from rdkit import Chem
import tempfile, os, requests,inspect, gzip, shutil, yaml, sys
from moldrug import utils
from typing import List
import random
MolDrug it is meant to be for using as a python module. However it has capabilities to work from the command line. Due to the complexity of the input for a normal simulation, yaml structured file are used for the input specification. Let first see the help of moldrug.
! moldrug -h
usage: moldrug [-h] [-v] [-f FITNESS] [-o OUTDIR] yaml_file For information of MolDrug: Docs: https://moldrug.readthedocs.io/en/latest/ Source Code: https://github.com/ale94mleon/moldrug positional arguments: yaml_file The configuration yaml file optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit -f FITNESS, --fitness FITNESS The path to the user-custom fitness module; inside of which the given custom cost function must be implemented. See the docs for how to do it properly. E.g. my/awesome/fitness_module.py.By default will look in the moldrug.fitness module. -o OUTDIR, --outdir OUTDIR The path to where all the files should be written. By default the current working directory will be used (where the command line was invoked).
As it shows, only one positional arguments is needed yaml_file
. Then you could give:
fitness
: A customized fitness python module where your cost function must be implemented.
outdir
: The out directory.
Lets go steep by steep
# Getting the data
tmp_path = tempfile.TemporaryDirectory()
lig = ligands.r_x0161
box = boxes.r_x0161
with open(os.path.join(tmp_path.name, 'x0161.pdbqt'), 'w') as f:
f.write(receptor_pdbqt.r_x0161)
# Getting the CReM data base
url = "http://www.qsar4u.com/files/cremdb/replacements02_sc2.db.gz"
r = requests.get(url, allow_redirects=True)
crem_dbgz_path = os.path.join(tmp_path.name,'crem.db.gz')
crem_db_path = os.path.join(tmp_path.name,'crem.db')
open(crem_dbgz_path, 'wb').write(r.content)
with gzip.open(crem_dbgz_path, 'rb') as f_in:
with open(crem_db_path, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
print(lig)
print(box)
print(tmp_path.name, os.listdir(tmp_path.name))
COC(=O)C=1C=CC(=CC1)S(=O)(=O)N {'A': {'boxcenter': [12.11, 1.84, 23.56], 'boxsize': [22.5, 22.5, 22.5]}} /tmp/tmps9vo4jki ['crem.db.gz', 'x0161.pdbqt', 'crem.db']
Now we have all that we need. The SMILES, the definition of the box, the CReM data base and the receptor. But first let see what are the arguments that the function utils.Local
needs:
inspect.getfullargspec(utils.Local.__init__).args
['self', 'seed_mol', 'crem_db_path', 'costfunc', 'grow_crem_kwargs', 'costfunc_kwargs', 'AddHs']
We will pass all of this parameters using config.yaml file.
Note: Here we will not work neither modify with the desirability parameter of the cost function. We will talk about it in a separate tutorial inside of Advance Topics.
Warning!: Check your correct temporal file and change it accordantly. Of course you could also set a normal directory and safe the results of the simulation.
Here is an example yaml file to work with the class utils.Local
.
main:
type: Local
njobs: 1
pick: 1
AddHs: False
seed_mol: COC(=O)C=1C=CC(=CC1)S(=O)(=O)N
costfunc: Cost
costfunc_kwargs:
vina_executable: vina
receptor_path: /tmp/your_tmp_dir/x0161.pdbqt
boxcenter:
- 12.11
- 1.84
- 23.56
boxsize:
- 22.5
- 22.5
- 22.5
exhaustiveness: 4
ncores: 12
num_modes: 1
crem_db_path: /tmp/your_tmp_dir/crem.db
grow_crem_kwargs:
radius: 3
min_atoms: 1
max_atoms: 2
ncores: 12
The structure of a yaml file is like a python dictionary but more human readable (see this youtube video if you are not familiar).
First we have the directrix:
main:
This is just the name of your main project and could be any word (we will see that moldrug.utils.GA
accept also follow projects). Then we have:
type: Local
njobs: 1
pick: 1
AddHs: False
Look how this section is inside of main
(one indentation level). type
is the name of the class inside moldrug.utils
; for now could be GA
or Local
(case insensitive). In this case we want to use the class Local
. njobs
, pick
and AddHs
are the parameters when the class Local
(or GA
) is call. The rest of the parameters are just the parameters that needs Local
for the initialization. All of them are at the same level of the previous ones. Because costfunc_kwargs
is a dictionary; we add for its value a new indentation level:
costfunc_kwargs:
vina_executable: vina
receptor_path: /tmp/your_tmp_dir/x0161.pdbqt
boxcenter:
- 12.11
- 1.84
- 23.56
boxsize:
- 22.5
- 22.5
- 22.5
exhaustiveness: 4
ncores: 12
num_modes: 1
As you could imagine the keyword boxcenter is a list and this is one of the way to represent list in yaml. Cool right?
The next dictionary is the same that the yaml file and we will save it in the temporal file that we created.
config = {
"main": {
"type": "Local",
"njobs": 1,
"pick": 1,
"seed_mol": lig,
"costfunc": "Cost",
"costfunc_kwargs": {
"vina_executable": "vina",
"receptor_pdbqt_path": os.path.join(tmp_path.name, 'x0161.pdbqt'),
"boxcenter": [
12.11,
1.84,
23.56
],
"boxsize": [
22.5,
22.5,
22.5
],
"exhaustiveness": 4,
"ncores": 12,
"num_modes": 1
},
"crem_db_path": os.path.join(tmp_path.name, 'crem.db'),
"grow_crem_kwargs": {
"radius": 3,
"min_atoms": 1,
"max_atoms": 2,
"ncores": 12
}
}
}
# Save the config as a yaml file
with open(os.path.join(tmp_path.name, 'config.yml'), 'w') as f:
yaml.dump(config, f)
os.listdir(tmp_path.name)
['crem.db.gz', 'x0161.pdbqt', 'config.yml', 'crem.db']
cwd = os.getcwd()
os.chdir(tmp_path.name)
! moldrug config.yml
os.chdir(cwd)
os.listdir(tmp_path.name)
os.chdir(cwd)
You are using moldrug: 2.0.0. The main job is being executed. Calculating cost function... 100%|█████████████████████████████████████████████| 2/2 [00:07<00:00, 3.60s/it] Finished at Thu Aug 25 15:35:22 2022. File local_pop.sdf was createad! The main job finished!.
As you see the simulation run successfully and the following files were generated:
local_result.pbz2
; the compresses version of the Local
class with all the information of the simulation (binary file).
local_pop.sdf
; the mol structures. This file is useful to use inside Pymol.
The following is a simple cost function based on the vina score and very similar to moldrug.fitness.Cost
used by MolDrug. The details on the parameters and how to correctly implement it will be discussed letter on the following section.
def MyAwesomeCost(Individual:utils.Individual, wd:str = '.vina_jobs', vina_executable:str = 'vina', receptor_path:str = None, boxcenter:List[float] = None, boxsize:List[float] =None, exhaustiveness:int = 8, ncores:int = 1, num_modes:int = 1):
# Getting Vina score
cmd = f"{vina_executable} --receptor {receptor_path} --ligand {os.path.join(wd, f'{Individual.idx}.pdbqt')} "\
f"--center_x {boxcenter[0]} --center_y {boxcenter[1]} --center_z {boxcenter[2]} "\
f"--size_x {boxsize[0]} --size_y {boxsize[1]} --size_z {boxsize[2]} "\
f"--out {os.path.join(wd, f'{Individual.idx}_out.pdbqt')} --cpu {ncores} --exhaustiveness {exhaustiveness} --num_modes {num_modes}"
# Creating the ligand pdbqt
with open(os.path.join(wd, f'{Individual.idx}.pdbqt'), 'w') as l:
l.write(Individual.pdbqt)
utils.run(cmd)
# Getting the information
best_energy = utils.VINA_OUT(os.path.join(wd, f'{Individual.idx}_out.pdbqt')).BestEnergy()
# Changing the 3D conformation by the conformation of the binding pose
Individual.pdbqt = ''.join(best_energy.chunk)
# Getting the Scoring function of Vina
Individual.vina_score = best_energy.freeEnergy
# Getting the Scoring function of Vina and assign it to the cost attribute of the Individual
Individual.cost = best_energy.freeEnergy
return Individual
Because this cost function accept the same parameters we can use the same configuration yaml file; but the name of the function is different. So we have to modify that parameter. Let's save the function in a .py file with the corresponded import statements
# Create a new config file
# Save the config as a yaml file
config['main']['costfunc'] = 'MyAwesomeCost'
with open(os.path.join(tmp_path.name, 'config_custom_fitness.yml'), 'w') as f:
yaml.dump(config, f)
os.listdir(tmp_path.name)
['local_result.pbz2', 'crem.db.gz', 'x0161.pdbqt', 'config.yml', 'local_pop.sdf', 'crem.db', 'config_custom_fitness.yml']
str_module = """#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from moldrug import utils
from typing import List
import os
def MyAwesomeCost(Individual:utils.Individual, wd:str = '.vina_jobs', vina_executable:str = 'vina', receptor_pdbqt_path:str = None, boxcenter:List[float] = None, boxsize:List[float] =None, exhaustiveness:int = 8, ncores:int = 1, num_modes:int = 1):
# Getting Vina score
cmd = f"{vina_executable} --receptor {receptor_pdbqt_path} --ligand {os.path.join(wd, f'{Individual.idx}.pdbqt')} "\
f"--center_x {boxcenter[0]} --center_y {boxcenter[1]} --center_z {boxcenter[2]} "\
f"--size_x {boxsize[0]} --size_y {boxsize[1]} --size_z {boxsize[2]} "\
f"--out {os.path.join(wd, f'{Individual.idx}_out.pdbqt')} --cpu {ncores} --exhaustiveness {exhaustiveness} --num_modes {num_modes}"
# Creating the ligand pdbqt
with open(os.path.join(wd, f'{Individual.idx}.pdbqt'), 'w') as l:
l.write(Individual.pdbqt)
utils.run(cmd)
# Getting the information
best_energy = utils.VINA_OUT(os.path.join(wd, f'{Individual.idx}_out.pdbqt')).BestEnergy()
# Changing the 3D conformation by the conformation of the binding pose
Individual.pdbqt = ''.join(best_energy.chunk)
# Getting the Scoring function of Vina
Individual.vina_score = best_energy.freeEnergy
# Getting the Scoring function of Vina and assign it to the cost attribute of the Individual
Individual.cost = best_energy.freeEnergy
return Individual
"""
# Saving the new fitness
with open(os.path.join(tmp_path.name, 'MyCustomFitness.py'), 'w') as f:
f.write(str_module)
os.listdir(tmp_path.name)
['local_result.pbz2', 'crem.db.gz', 'custom_fitness', 'x0161.pdbqt', 'config.yml', 'local_pop.sdf', 'crem.db', 'MyCustomFitness.py', 'config_custom_fitness.yml']
Now we only need to call MolDrug and specify the new fitness module to use. It is recommendable to output in on folder the results; therefore also use the option outdir
of the command line.
cwd = os.getcwd()
os.chdir(tmp_path.name)
! moldrug config_custom_fitness.yml --fitness MyCustomFitness.py --outdir custom_fitness
os.listdir(tmp_path.name)
os.chdir(cwd)
You are using moldrug: 2.0.0. The main job is being executed. Calculating cost function... 100%|█████████████████████████████████████████████| 2/2 [00:05<00:00, 2.79s/it] Finished at Thu Aug 25 15:36:43 2022. File local_pop.sdf was createad! The main job finished!.
print(os.listdir(tmp_path.name))
print(os.listdir(os.path.join(tmp_path.name, 'custom_fitness')))
['local_result.pbz2', 'crem.db.gz', 'custom_fitness', 'x0161.pdbqt', 'config.yml', 'local_pop.sdf', 'crem.db', 'MyCustomFitness.py', 'config_custom_fitness.yml'] ['local_result.pbz2', 'CustomMolDrugFitness.py', 'local_pop.sdf', '__pycache__']
As you can see the simulation run successfully! But be aware of a possible issue! Because the cost function used is not actually inside of MolDrug. If we would like to use on the future the binary file local_result.pbz2
. We have first say to Python where to find the used fitness function. For reproducibility and also peace of mind, MolDrug will generate a new file: CustomMolDrugFitness.py
this is nothing more than a copy of your implemented python fitness module. This module is the one used for the internal calculation of MolDrug. Therefore, we have to say to python where is CustomMolDrugFitness
and append to sys.path
in order to open the binary file. If we do not do that we will get an error. Let see the example.
try:
whole_result = utils.decompress_pickle(os.path.join(tmp_path.name, 'custom_fitness', 'local_result.pbz2'))
except Exception as e:
print(f"This was the problem: {e}. So we have to add to the system path the directory where CustomMolDrugFitness.py is located.")
sys.path.append(os.path.join(tmp_path.name, 'custom_fitness'))
whole_result = utils.decompress_pickle(os.path.join(tmp_path.name, 'custom_fitness', 'local_result.pbz2'))
whole_result
This was the problem: No module named 'CustomMolDrugFitness'. So we have to add to the system path the directory where CustomMolDrugFitness.py is located.
<moldrug.utils.Local at 0x7f0504542a60>
# The cost modify cost function
whole_result.costfunc
<function CustomMolDrugFitness.MyAwesomeCost(Individual: moldrug.utils.Individual, wd: str = '.vina_jobs', vina_executable: str = 'vina', receptor_pdbqt_path: str = None, boxcenter: List[float] = None, boxsize: List[float] = None, exhaustiveness: int = 8, ncores: int = 1, num_modes: int = 1)>
# The original cost function form the first simulation
(utils.decompress_pickle(os.path.join(tmp_path.name, 'local_result.pbz2'))).costfunc
<function moldrug.fitness.Cost(Individual: moldrug.utils.Individual, wd: str = '.vina_jobs', vina_executable: str = 'vina', receptor_pdbqt_path: str = None, boxcenter: List[float] = None, boxsize: List[float] = None, exhaustiveness: int = 8, ncores: int = 1, num_modes: int = 1, constraint: bool = False, constraint_type='score_only', constraint_ref: rdkit.Chem.rdchem.Mol = None, constraint_receptor_pdb_path: str = None, constraint_num_conf: int = 100, constraint_minimum_conf_rms: int = 0.01, desirability: Dict = {'qed': {'w': 1, 'LargerTheBest': {'LowerLimit': 0.1, 'Target': 0.75, 'r': 1}}, 'sa_score': {'w': 1, 'SmallerTheBest': {'Target': 3, 'UpperLimit': 7, 'r': 1}}, 'vina_score': {'w': 1, 'SmallerTheBest': {'Target': -12, 'UpperLimit': -6, 'r': 1}}})>
You could see the difference between the two cost functions and the error printed in the try-except block. Finally let's take a look to the sys.path; check that at the end it is our path with our new fitness module.
sys.path
['/home/ale/GITLAB/moldrug/docs/notebooks', '/home/ale/.vscode/extensions/ms-toolsai.jupyter-2022.7.1102252217/pythonFiles', '/home/ale/.vscode/extensions/ms-toolsai.jupyter-2022.7.1102252217/pythonFiles/lib/python', '/home/ale/anaconda3/envs/moldrug/lib/python39.zip', '/home/ale/anaconda3/envs/moldrug/lib/python3.9', '/home/ale/anaconda3/envs/moldrug/lib/python3.9/lib-dynload', '', '/home/ale/anaconda3/envs/moldrug/lib/python3.9/site-packages', '/tmp/tmps9vo4jki/custom_fitness']
Create your own cost function should be straightforward as soon as we understand how MolDrug works with it. So, let's take a look again to the function defined previously:
def MyAwesomeCost(Individual:utils.Individual, wd:str = '.vina_jobs', vina_executable:str = 'vina', receptor_pdbqt_path:str = None, boxcenter:List[float] = None, boxsize:List[float] =None, exhaustiveness:int = 8, ncores:int = 1, num_modes:int = 1):
# Getting Vina score
cmd = f"{vina_executable} --receptor {receptor_pdbqt_path} --ligand {os.path.join(wd, f'{Individual.idx}.pdbqt')} "\
f"--center_x {boxcenter[0]} --center_y {boxcenter[1]} --center_z {boxcenter[2]} "\
f"--size_x {boxsize[0]} --size_y {boxsize[1]} --size_z {boxsize[2]} "\
f"--out {os.path.join(wd, f'{Individual.idx}_out.pdbqt')} --cpu {ncores} --exhaustiveness {exhaustiveness} --num_modes {num_modes}"
# Creating the ligand pdbqt
with open(os.path.join(wd, f'{Individual.idx}.pdbqt'), 'w') as l:
l.write(Individual.pdbqt)
utils.run(cmd)
# Getting the information
best_energy = utils.VINA_OUT(os.path.join(wd, f'{Individual.idx}_out.pdbqt')).BestEnergy()
# Changing the 3D conformation by the conformation of the binding pose
Individual.pdbqt = ''.join(best_energy.chunk)
# Getting the Scoring function of Vina
Individual.vina_score = best_energy.freeEnergy
# Getting the Scoring function of Vina and assign it to the cost attribute of the Individual
Individual.cost = best_energy.freeEnergy
return Individual
MyAwesomeCost
function it will only:
pdbqt
attribute of Individual
cost
attribute of Individual
(the must important in order to MolDrug works properly)vina_score
attribute to Individual
Basically the idea is give a number to the attribute cost
of the passed Individual
instance. MolDrug works based on the class moldrug.utils.Individual
; this is just a representation of each chemical structure. Therefore all the custom cost function must work based on this class; in oder words, "the chemical structure input" must be coded as an Individual
and it will be the first positional argument to pass. The rest is completely optional. If the function perform I/O operations you can also provided as keyword wd
. By default moldrug.utils.GA
and moldrug.utils.Local
create a temporal directory (in /tmp) even when you specify some values for it. This could be a problem if your function create big files (bigger than the capacity of your /tmp directory) or if you would like to preserve some of this files. In this case just create an alternative keyword (like wd_dont_touch
) and work based on it. The other important idea to have in mind is that moldrug.utils.GA
perform minimization (the best element is the one with lower cost
attribute).
WARNING: MolDrug uses multiprocessing
for the parallelization. This module uses itself pickle
. In our case MolDrug parallelize the cost function; therefore the COST FUNCTION MUST BE PICKLEABLE. To see what can be pickled and unpickled go here.
Common error
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7efd82c72b80>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed")'
The last error could be due to some user defined functions (or classes) that uses your custom cost function that are not a the top level of the module. In other words, you are using user defined functions (or classes) that are outside of your custom_fitness_module.py (in case command line is used) or your main script where the cost function was defined. Solution: put all your code for the cost function in only one .py ((in case command line is used)) file or in your main script.
The following is probably the easiest and dummy cost function possible. But I bring it here to show that there is not limitation on how to create your own cost function.
import multiprocessing as mp
def MyPerfectDummyCost(Individual):
Individual.cost = random.random()
return Individual
ga = utils.GA(
Chem.MolFromSmiles('CCCC(=O)OCCN'),
crem_db_path = os.path.join(tmp_path.name, 'crem.db'),
maxiter = 2,
popsize = 20,
costfunc=MyPerfectDummyCost,
mutate_crem_kwargs={
'ncores': mp.cpu_count()
},
costfunc_kwargs={}
)
ga(20)
Creating the first population with 20 members:
100%|██████████| 20/20 [00:00<00:00, 1976.49it/s]
Initial Population: Best individual: Individual(idx = 10, smiles = Cc1nc2ccccc2c(=O)o1, cost = 0.015217543704288294)
Evaluating generation 1 / 2:
100%|██████████| 20/20 [00:00<00:00, 898.43it/s]
Generation 1: Best Individual: Individual(idx = 10, smiles = Cc1nc2ccccc2c(=O)o1, cost = 0.015217543704288294).
Evaluating generation 2 / 2:
100%|██████████| 20/20 [00:00<00:00, 302.11it/s]
Generation 2: Best Individual: Individual(idx = 10, smiles = Cc1nc2ccccc2c(=O)o1, cost = 0.015217543704288294). =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ The simulation finished successfully after 2 generations with a population of 20 individuals. A total number of 60 Individuals were seen during the simulation. Initial Individual: Individual(idx = 0, smiles = CCCC(=O)OCCN, cost = 0.9086986766236456) Final Individual: Individual(idx = 10, smiles = Cc1nc2ccccc2c(=O)o1, cost = 0.015217543704288294) The cost function droped in 0.8934811329193573 units. =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Total time (2 generations): 30.01 (s). Finished at Thu Aug 25 15:37:59 2022.
moldrug.utils.GA
exports an sdf file that contains the individual of the current generations every save_pop_every_gen
. For that, it uses the information contained in the pdbqt
or pdbqts
attributes (depending if single or multi-receptor is used respectively). Therefore if we use docking we must update the pdbqt attribute with the pose obtained by vina. If we don't do that, we will just get a conformation generated by RDKit that is automatically created when a instance of utils.Individual
is made it.