from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
Chem.SanitizeMol() calls the following in sequence
MolOps::cleanUp()
: clean up a small-set of substructures that are frequently "incorrectly" drawn. Example: N(=O)=O -> [N+](=O)[O-]
mol.updatePropertyCache()
: calculate explicit and implicit valence of all atoms. FAILS when atoms have illegal valence.MolOps::symmetrizeSSSR()
: generates a symmetrized set of smallest rings. FAILS in very rare cases.MolOps::Kekulize()
: generates a Kekule structure for the molecule. FAILS if a Kekule form cannot be found or non-ring bonds are marked as aromatic.MolOps::assignRadicals()
: determines the number of radicals at each atomMolOps::setAromaticity()
: finds aromatic ring systems, marks their atoms as aromatic, marks their bonds as aromatic, and converts their bonds to BondType::AROMATIC
MolOps::setConjugation()
: identifies conjugated bond systemsMolOps::setHybridization()
: assigns hybridization to atomsMolOps::cleanupChirality()
: removes atom-stereochemistry markings from non-sp3 centers.MolOps::adjustHs()
: assigns explicit Hs to aromatic heteroatoms that need them. e.g. solves the C1=CNC=C1 -> c1c[nH]cc1
problemAs of this writing (21 Sept 2012), building 100K drug-like molecules from SMILES takes ~14 seconds on my linux box.
Here's the percentage of total run time broken down by step:
| Step | Time | Percent | Notes |
| toMol() | 1.8 | 13.1 | SMILES parser |
| MolOps::removeHs() | 7.7 | 55.7 | |
| MolOps::sanitizeMol() | 6.8 | 48.8 | Part of the removeHs tally |
| MolOps::assignStereochemistry() | 3.7 | 26.9 | |
And here's a detailed view of the sanitization steps:
| Step | Time | Percent | Notes |
| MolOps::cleanUp() | 0.1 | 1.2 | |
| mol.updatePropertyCache() | 0.4 | 2.8 | |
| MolOps::symmetrizeSSSR() | 2.1 | 15.3 | 2.08 in findSSSR |
| MolOps::Kekulize() | 1.0 | 6.9 | |
| MolOps::assignRadicals() | 0.1 | 0.6 | |
| MolOps::setAromaticity() | 1.2 | 8.7 | |
| MolOps::setConjugation() | 0.8 | 5.5 | |
| MolOps::setHybridization() | 0.3 | 2.0 | |
| MolOps::cleanupChirality() | 0.1 | 0.5 | |
| MolOps::adjustHs() | 0.4 | 3.1 | |
Build molecules once, store pickles, build from them.
This is nice because the pickle contains pretty much everything the RDKit knows about the molecule. It's also quite fast, but it's not inter-operable with other systems. It also takes up a fair amount of disk space.
For the 100K drug-like molecules, creating the molecules from the pickle file takes about 1.3 seconds (compare with the 14 seconds for building the molecule from SMILES). The pickle file itself took 0.9 seconds to create and is 28.3MB (the SMILES file is 3.3MB).
Build molecules once, write canonical SMILES, build from that using limited sanitization.
The idea is to trust the input SMILES to only contain correct molecules with a reasonable aromaticity assignment and no bogus stereochemistry info.
For the 100K drug-like molecules, parsing the SMILES and calling mol.updatePropertyCache()
so that atoms have valences assigned takes 2.1 seconds.
These input molecules have sufficient information to allow many standard operations
Calling MolOps::fastFindRings()
to determine whether or not atoms and bonds are in rings takes 0.4 seconds.
MolOps::fastFindRings()
does a depth-first traversal of the molecule graph to find rings. This is a fast way to find cycles, but unfortunately it doesn't necessarily find the smallest cycles. In order to get information about about ring sizes, etc. a call to MolOps::symmetrizeSSSR()
is required. This takes 2.1 seconds for the 100K molecules.
Start by building a molecule from SMILES and assigning valences:
m = Chem.MolFromSmiles('N#Cc1ccccc1N1C[C@H](C(=O)[O-])CC1=O',sanitize=False)
m.UpdatePropertyCache()
m
Notice that the depiction around the triple bond is wrong. This is because no hybridization information is present.
That's easy to fix:
Chem.SetHybridization(m)
m
tmp=m.GetSubstructMatch(Chem.MolFromSmarts('c1ccccc1'))
m
Doing a substructure search that looks for ring information fails though:
#m.HasSubstructMatch(Chem.MolFromSmarts('[r]'))
But this can be cleared up using FastFindRings()
:
Chem.FastFindRings(m)
m.HasSubstructMatch(Chem.MolFromSmarts('[r]'))
True
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[a;r]'))]
[[2], [3], [4], [5], [6], [7]]
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[A;r]'))]
[[8], [9], [10], [14], [15]]
m
Here's another example of the problems that a lack of complete ring information can cause.
Start with the sanitized molecule:
sm = Chem.MolFromSmiles('O=C(N[C@H]1CCCc2ccccc21)c1n[nH]c2ccccc21')
sm
Now build the same thing with minimal sanitization.
m = Chem.MolFromSmiles('O=C(N[C@H]1CCCc2ccccc21)c1n[nH]c2ccccc21',sanitize=False)
m.UpdatePropertyCache()
Chem.SetHybridization(m)
Chem.FastFindRings(m)
m
That happened because the depiction algorithm uses ring-size information, which is not correct after calling FastFindRings()
Ring membership is fine:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[A;r]'))]
[[3], [4], [5], [6]]
m
But the ring sizes are not:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[r5]'))]
[]
ring counts are also wrong:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[R2]'))]
[[7], [8], [9], [10], [11], [12], [16], [17], [18], [19], [20], [21]]
m
Here's what the query should have generated:
[list(x) for x in sm.GetSubstructMatches(Chem.MolFromSmarts('[R2]'))]
[[7], [12], [16], [21]]
sm
What rings does it find?
m = Chem.MolFromSmiles('O=C(N[C@H]1CCCc2ccccc21)c1n[nH]c2ccccc21',sanitize=False)
m.UpdatePropertyCache()
Chem.FastFindRings(m)
ri = m.GetRingInfo()
ri.AtomRings()
((12, 11, 10, 9, 8, 7, 6, 5, 4, 3), (12, 11, 10, 9, 8, 7), (21, 20, 19, 18, 17, 16, 15, 14, 13), (21, 20, 19, 18, 17, 16))
Draw.MolToImage(m,highlightAtoms=ri.AtomRings()[0])
There's an extra bond highlighted in the above drawing (the fusing bond). This is a limitation of the way atom highlighting is handled in drawings: all bonds between highlighted atoms are highlighted.
We can avoid the problem by highlighting bonds instead:
Draw.MolToImage(m,highlightBonds=ri.BondRings()[0])
Draw.MolToImage(m,highlightBonds=ri.BondRings()[1])
This is an old problem...
Use a classic example: cubane only has five smallest rings.
mb="""
Mrv0541 09231207052D
8 12 0 0 0 0 999 V2000
0.4714 0.2946 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.4420 -1.0607 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.7679 0.3241 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.7973 -1.0902 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.4479 -0.4184 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.1432 -0.4184 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.1137 -1.9800 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.4479 -1.9800 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
3 4 1 0 0 0 0
4 2 1 0 0 0 0
1 5 1 0 0 0 0
5 6 1 0 0 0 0
6 3 1 0 0 0 0
6 7 1 0 0 0 0
7 4 1 0 0 0 0
5 8 1 0 0 0 0
8 7 1 0 0 0 0
8 2 1 0 0 0 0
M END
"""
m = Chem.MolFromMolBlock(mb)
m
m2 = Chem.MolFromMolBlock(mb,sanitize=False)
m2.UpdatePropertyCache()
Chem.GetSSSR(m2)
m2.GetRingInfo().NumRings()
5
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[0])
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[1])
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[2])
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[3])
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[4])
The front face is being missed.
Isn't this all nitpicking though? Does it actually matter?
It does if you want to ask questions about ring membership:
m2.GetSubstructMatches(Chem.MolFromSmarts('[R3]'))
((0,), (1,), (2,), (3,))
m2
The symmetrization algorithm (run by default) solves this problem:
Chem.GetSymmSSSR(m2)
m2.GetRingInfo().NumRings()
6
m2.GetSubstructMatches(Chem.MolFromSmarts('[R3]'))
((0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,))
Chem.CanonSmiles('C[C@H](F)C') # <- two identical atoms
'CC(C)F'
Chem.CanonSmiles('C/C=C') # <- only substituted on one side
'C=CC'
m = Chem.MolFromSmiles('C[C@](F)(Cl)Br')
m
m.GetAtomWithIdx(1).GetProp('_CIPCode')
'S'
m = Chem.MolFromSmiles('C/C=C/C')
m
m.GetBondWithIdx(1).GetStereo()
rdkit.Chem.rdchem.BondStereo.STEREOE
m = Chem.MolFromSmiles(r'C/C=C\C')
m
m.GetBondWithIdx(1).GetStereo()
rdkit.Chem.rdchem.BondStereo.STEREOZ
m = Chem.MolFromSmiles('C[C@H]1CCC[C@@H](C)[C@H]1C')
Draw.MolToImage(m,includeAtomNumbers=True)
m.GetAtomWithIdx(1).GetProp('_CIPCode'),m.GetAtomWithIdx(5).GetProp('_CIPCode'),m.GetAtomWithIdx(7).GetProp('_CIPCode')
('S', 'R', 'S')
m = Chem.MolFromSmiles('C[C@H]1CCC[C@H](C)[C@H]1C')
Draw.MolToImage(m,includeAtomNumbers=True)
m = Chem.MolFromSmiles(r'C[C@H](F)/C([C@H](F)C)=C\Cl')
m
m.GetAtomWithIdx(1).GetProp('_CIPCode'),m.GetAtomWithIdx(4).GetProp('_CIPCode')
('S', 'R')
Chem.MolToSmiles(m,True)
'C[C@@H](F)/C(=C/Cl)[C@H](C)F'
m = Chem.MolFromSmiles(r'C[C@H](F)/C([C@@H](F)C)=C\Cl')
m
m.GetAtomWithIdx(1).GetProp('_CIPCode'),m.GetAtomWithIdx(4).GetProp('_CIPCode')
('S', 'S')
Chem.MolToSmiles(m,True)
'C[C@H](F)C(=CCl)[C@H](C)F'
Chem.CanonSmiles('F[P@@](=O)(OC(C)C)C') #<- Sarin
'CC(C)O[P@](C)(=O)F'
m = Chem.MolFromSmiles('F[P@@](=O)(OC(C)C)C')
m
m.GetAtomWithIdx(1).GetProp('_CIPCode')
'S'
Chem.CanonSmiles('COc1ccc2nc([nH]c2c1)[S@@](=O)Cc1ncc(C)c(OC)c1C') # <- Esomeprazole
'COc1ccc2nc([S@@](=O)Cc3ncc(C)c(OC)c3C)[nH]c2c1'
m = Chem.MolFromSmiles('COc1ccc2nc([nH]c2c1)[S@@](=O)Cc1ncc(C)c(OC)c1C')
m
m.GetAtomWithIdx(11).GetProp('_CIPCode')
'S'