Molecule
¶Every pathway through the OpenFF Toolkit boils down to four steps:
Molecule
Molecule
objects to construct a Topology
ForceField.create_openmm_system(topology)
to create an OpenMM System
(or, in the near future, an OpenFF Interchange
for painless conversion to all sorts of MD formats)So let's take a look at every way there is to construct a molecule! We'll use zwitterionic L-alanine as an example biomolecule with all the tricky bits - a stereocenter, non-zero formal charges, and bonds of different orders.
[^rs]: Note that this stereochemistry must be defined on the graph of the molecule. It's not good enough to just co-ordinates with the correct stereochemistry. But if you have the co-ordinates, you can try getting the stereochemistry automatically with rdkit
or openeye
--- If you dare!
# Imports
from openff.toolkit import Molecule, Topology
# Hide tracebacks for simpler errors
import sys
ipython = get_ipython()
def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
exception_only=False, running_compiled_code=False):
etype, value, tb = sys.exc_info()
value.__cause__ = None # suppress chained exceptions
return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))
ipython.showtraceback = hide_traceback
# Hide NumPy warnings
import warnings
warnings.filterwarnings('ignore', r'The value of the smallest subnormal for ')
SMILES is the classic way to create a Molecule
. SMILES is a widely-used compact textual representation of arbitrary molecules. This lets us specify an exact molecule, including stereochemistry and bond orders, very easily --- though they may not be the most human-readable format.
The Molecule.from_smiles()
method is used to create a Molecule
from a SMILES code.
zw_l_alanine = Molecule.from_smiles("C[C@H]([NH3+])C(=O)[O-]")
zw_l_alanine.visualize()
smiles_explicit_h = Molecule.from_smiles(
"[H][C]([H])([H])[C@@]([H])([C](=[O])[O-])[N+]([H])([H])[H]",
hydrogens_are_explicit=True
)
assert zw_l_alanine.is_isomorphic_with(smiles_explicit_h)
smiles_explicit_h.visualize()
By default, no guarantees are made about the indexing of atoms from a SMILES string. If the indexing is important, a mapped SMILES string may be used. In this case, Hydrogens must be explicit. Note that though mapped SMILES strings must start at index 1, Python lists start at index 0.
mapped_smiles = Molecule.from_mapped_smiles(
"[H:10][C:2]([H:7])([H:8])[C@@:4]([H:9])([C:3](=[O:5])[O-:6])[N+:1]([H:11])([H:12])[H:13]"
)
assert zw_l_alanine.is_isomorphic_with(mapped_smiles)
assert mapped_smiles.atoms[0].atomic_number == 7 # First index is the Nitrogen
assert all([a.atomic_number==1 for a in mapped_smiles.atoms[6:]]) # Final indices are all H
mapped_smiles.visualize()
(smiles_no_stereochemistry)=
The Toolkit won't accept an ambiguous SMILES. This SMILES could be L- or D- alanine; rather than guess, the Toolkit throws an error:
smiles_non_isomeric = Molecule.from_smiles(
"CC([NH3+])C(=O)[O-]"
)
UndefinedStereochemistryError: Unable to make OFFMol from OEMol: OEMol has unspecified stereochemistry. oemol.GetTitle(): Problematic atoms are: Atom atomic num: 6, name: , idx: 1, aromatic: False, chiral: True with bonds: bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 0, aromatic: False, chiral: False bond order: 1, chiral: False to atom atomic num: 7, name: , idx: 2, aromatic: False, chiral: False bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 3, aromatic: False, chiral: False bond order: 1, chiral: False to atom atomic num: 1, name: , idx: 9, aromatic: False, chiral: False
We can downgrade this error to a warning with the allow_undefined_stereo
argument. This will create a molecule with undefined stereochemistry, which might lead to incorrect parametrization or surprising conformer generation. See the FAQ for more details.
smiles_non_isomeric = Molecule.from_smiles(
"CC([NH3+])C(=O)[O-]",
allow_undefined_stereo=True
)
assert not zw_l_alanine.is_isomorphic_with(smiles_non_isomeric)
smiles_non_isomeric.visualize()
Warning (not error because allow_undefined_stereo=True): OEMol has unspecified stereochemistry. oemol.GetTitle(): Problematic atoms are: Atom atomic num: 6, name: , idx: 1, aromatic: False, chiral: True with bonds: bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 0, aromatic: False, chiral: False bond order: 1, chiral: False to atom atomic num: 7, name: , idx: 2, aromatic: False, chiral: False bond order: 1, chiral: False to atom atomic num: 6, name: , idx: 3, aromatic: False, chiral: False bond order: 1, chiral: False to atom atomic num: 1, name: , idx: 9, aromatic: False, chiral: False
You can always construct a Molecule
by building it up from individual atoms and bonds. Other methods are generally easier, but it's a useful fallback for when you need to write your own constructor for an unsupported source format.
The Molecule()
constructor and the add_atom()
and add_bond()
methods are used to construct a Molecule
by hand.
by_hand = Molecule()
by_hand.name = "Zwitterionic l-Alanine"
by_hand.add_atom(
atomic_number = 8, # Atomic number 8 is Oxygen
formal_charge = -1, # Formal negative charge
is_aromatic = False, # Atom is not part of an aromatic system
stereochemistry = None, # Optional argument; "R" or "S" stereochemistry
name = "O-" # Optional argument; descriptive name for the atom
)
by_hand.add_atom(6, 0, False, name="C")
by_hand.add_atom(8, 0, False, name="O")
by_hand.add_atom(6, 0, False, stereochemistry="S", name="CA")
by_hand.add_atom(1, 0, False, name="CAH")
by_hand.add_atom(6, 0, False, name="CB")
by_hand.add_atom(1, 0, False, name="HB1")
by_hand.add_atom(1, 0, False, name="HB2")
by_hand.add_atom(1, 0, False, name="HB3")
by_hand.add_atom(7, +1, False, name="N+")
by_hand.add_atom(1, 0, False, name="HN1")
by_hand.add_atom(1, 0, False, name="HN2")
by_hand.add_atom(1, 0, False, name="HN3")
by_hand.add_bond(
atom1 = 0, # First (zero-indexed) atom specified above ("O-")
atom2 = 1, # Second atom specified above ("C")
bond_order = 1, # Single bond
is_aromatic = False, # Bond is not aromatic
stereochemistry = None, # Optional argument; "E" or "Z" stereochemistry
fractional_bond_order = None # Optional argument; Wiberg (or similar) bond order
)
by_hand.add_bond( 1, 2, 2, False) # C = O
by_hand.add_bond( 1, 3, 1, False) # C - CA
by_hand.add_bond( 3, 4, 1, False) # CA - CAH
by_hand.add_bond( 3, 5, 1, False) # CA - CB
by_hand.add_bond( 5, 6, 1, False) # CB - HB1
by_hand.add_bond( 5, 7, 1, False) # CB - HB2
by_hand.add_bond( 5, 8, 1, False) # CB - HB3
by_hand.add_bond( 3, 9, 1, False) # CB - N+
by_hand.add_bond( 9, 10, 1, False) # N+ - HN1
by_hand.add_bond( 9, 11, 1, False) # N+ - HN2
by_hand.add_bond( 9, 12, 1, False) # N+ - HN3
assert zw_l_alanine.is_isomorphic_with(by_hand)
by_hand.visualize()
Rather than build up the Molecule
one method at a time, the Molecule.from_dict()
method can construct a Molecule
in one shot from a Python dict
that describes the molecule in question. This allows Molecule
objects to be written to and read from disk in any format that can be interpreted as a dict
; this mechanism underlies the from_bson()
, from_json()
, from_messagepack()
, from_pickle()
, from_toml()
, from_xml()
, and from_yaml()
methods.
This format can get very verbose, as it is intended for serialization, so this example uses hydrogen cyanide rather than alanine.
molecule_dict = {
"name": "",
"atoms": [
{
"atomic_number": 1,
"formal_charge": 0,
"is_aromatic": False,
"stereochemistry": None,
"name": "H",
},
{
"atomic_number": 6,
"formal_charge": 0,
"is_aromatic": False,
"stereochemistry": None,
"name": "C",
},
{
"atomic_number": 7,
"formal_charge": 0,
"is_aromatic": False,
"stereochemistry": None,
"name": "N",
},
],
"virtual_sites": [],
"bonds": [
{
"atom1": 0,
"atom2": 1,
"bond_order": 1,
"is_aromatic": False,
"stereochemistry": None,
"fractional_bond_order": None,
},
{
"atom1": 1,
"atom2": 2,
"bond_order": 3,
"is_aromatic": False,
"stereochemistry": None,
"fractional_bond_order": None,
},
],
"properties": {},
"conformers": None,
"partial_charges": None,
"partial_charges_unit": None,
"hierarchy_schemes": {},
}
from_dictionary = Molecule.from_dict(molecule_dict)
from_dictionary.visualize()
We can construct a Molecule
from a file or file-like object with the from_file()
method. We're a bit constrained in what file formats we can accept, because they need to provide all the information needed to construct the molecular graph; not just coordinates, but also elements, formal charges, bond orders, and stereochemistry.
sdf_path = Molecule.from_file("zw_l_alanine.sdf")
assert zw_l_alanine.is_isomorphic_with(sdf_path)
sdf_path.visualize()
from_file()
can also take a file object, rather than a path. Note that the object must be in binary mode!
with open("zw_l_alanine.sdf", mode="rb") as file:
sdf_object = Molecule.from_file(file, file_format="SDF")
assert zw_l_alanine.is_isomorphic_with(sdf_object)
sdf_object.visualize()
The from_polymer_pdb()
method can interpret a protein from a PDB file as a molecule. It can infer the full chemical graph of the canonical amino acids (26 including protonation states) in capped and uncapped forms. It does this according to the RCSB chemical component dictionary --- the information is not explicitly in the PDB. For the moment, it's quite slow, but this is the easiest way to get a full protein into the OpenFF ecosystem:
from openff.toolkit.utils import get_data_file_path
path = get_data_file_path('proteins/T4-protein.pdb')
protein = Molecule.from_polymer_pdb(path)
protein.visualize("nglview")
NGLWidget()
Using PDB files for small molecules is not recommended, even if they have CONECT records, as they do not provide stereoisomeric information or bond orders. The RDKit backend assumes that bond orders are 1, so the toolkit refuses to use it:
from openff.toolkit.utils.toolkits import RDKitToolkitWrapper
pdb = Molecule.from_file("zw_l_alanine.pdb", "pdb", toolkit_registry=RDKitToolkitWrapper())
NotImplementedError: Toolkit The RDKit can not read file zw_l_alanine.pdb (format PDB). Supported formats for this toolkit are ['SDF', 'MOL', 'SMI'].RDKit can however read PDBs with a valid smiles string using the Molecule.from_pdb_and_smiles(file_path, smiles) constructor
OpenEye can infer bond orders and stereochemistry from the structure. This is not recommended, as it can make mistakes that may be difficult to catch. Note also that this requires a license for OpenEye, as this is proprietary software.
If we instead provide a SMILES code, a PDB file can be used to populate the Molecule
object's conformers
attribute and provide atom ordering, as well as check that the SMILES code matches the PDB file. This method is the recommended way to create a Molecule
from a PDB file. The PDB file used here can be found on GitHub
:::{important} Note that the Toolkit doesn't guarantee that the coordinates in the PDB are correctly assigned to atoms. It makes an effort, but you should check that the results are reasonable. :::
pdb_with_smiles = Molecule.from_pdb_and_smiles(
"zw_l_alanine.pdb",
"C[C@H]([NH3+])C(=O)[O-]"
)
assert zw_l_alanine.is_isomorphic_with(pdb_with_smiles)
pdb_with_smiles.visualize()
The Molecule.from_inchi()
method constructs a Molecule
from an IUPAC InChI string. Note that InChI cannot distinguish the zwitterionic form of alanine from the neutral form (see section 13.2 of the InChI Technical FAQ), so the toolkit defaults to the neutral form.
:::{warning}
The OpenFF Toolkit makes no guarantees about the atomic ordering produced by the from_inchi
method. InChI is not intended to be an interchange format.
:::
inchi = Molecule.from_inchi("InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1")
inchi.visualize()
The Molecule.from_iupac()
method constructs a Molecule
from an IUPAC name.
:::{important} This code requires the OpenEye toolkit. :::
iupac = Molecule.from_iupac("(2S)-2-azaniumylpropanoate")
assert zw_l_alanine.is_isomorphic_with(iupac)
iupac.visualize()
Molecule
¶Most Molecule
creation methods don't specify the ordering of atoms in the new Molecule
. The Molecule.remap()
method allows a new ordering to be applied to an existing Molecule
.
See also #mapped-smiles.
:::{warning}
The Molecule.remap()
method is experimental and subject to change.
:::
# Note that this mapping is off-by-one from the mapping taken
# by the remap method, as Python indexing is 0-based but SMILES
# is 1-based
print("Before remapping:", zw_l_alanine.to_smiles(mapped=True))
# Flip the order of the carbonyl carbon and oxygen
remapped = zw_l_alanine.remap({0: 0,1: 1, 2: 2, 3: 4, 4: 3, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12})
# ^--------^
print("After remapping: ", remapped.to_smiles(mapped=True))
# Doesn't affect the identity of the molecule
assert zw_l_alanine.is_isomorphic_with(remapped)
remapped.visualize()
Before remapping: [H:3][C@@:2]([C:5](=[O:6])[O-:7])([C:1]([H:8])([H:9])[H:10])[N+:4]([H:11])([H:12])[H:13] After remapping: [H:3][C@@:2]([C:4](=[O:6])[O-:7])([C:1]([H:8])([H:9])[H:10])[N+:5]([H:11])([H:12])[H:13]
Topology
objects¶The Topology
class represents a biomolecular system; it is analogous to the similarly named objects in GROMACS, MDTraj or OpenMM. Notably, it does not include co-ordinates and may represent multiple copies of a particular molecular species or even more complex mixtures of molecules. Topology
objects are usually built up one species at a time from Molecule
objects.
Molecule
objects can be retrieved from a Topology
via the Topology.molecule()
method by providing the index of the molecule within the topology. For a topology consisting of a single molecule, this is just topology.molecule(0)
.
Constructor methods that are available for Topology
but not Molecule
generally require a Molecule
to be provided via the unique_molecules
keyword argument. The provided Molecule
is used to provide the identity of the molecule, including aromaticity, bond orders, formal charges, and so forth. These methods therefore don't provide a route to the graph of the molecule, but can be useful for reordering atoms to match another software package.
Topology
¶The Topology.from_openmm()
method constructs an OpenFF Topology
from an OpenMM Topology
. The method requires that all the unique molecules in the Topology
are provided as OpenFF Molecule
objects, as the structure of an OpenMM Topology
doesn't include the concept of a molecule. When using this method to create a Molecule
, this limitation means that the method really only offers a pathway to reorder the atoms of a Molecule
to match that of the OpenMM Topology
.
from openmm.app.pdbfile import PDBFile
openmm_topology = PDBFile('zw_l_alanine.pdb').getTopology()
openff_topology = Topology.from_openmm(openmm_topology, unique_molecules=[zw_l_alanine])
from_openmm_topology = openff_topology.molecule(0)
assert zw_l_alanine.is_isomorphic_with(from_openmm_topology)
from_openmm_topology.visualize()
Topology
¶The Topology.from_mdtraj()
method constructs an OpenFF Topology
from an MDTraj Topology
. The method requires that all the unique molecules in the Topology
are provided as OpenFF Molecule
objects to ensure that the graph of the molecule is correct. When using this method to create a Molecule
, this limitation means that the method really only offers a pathway to reorder the atoms of a Molecule
to match that of the MDTraj Topology
.
from mdtraj import load_pdb
mdtraj_topology = load_pdb('zw_l_alanine.pdb').topology
openff_topology = Topology.from_openmm(openmm_topology, unique_molecules=[zw_l_alanine])
from_mdtraj_topology = openff_topology.molecule(0)
assert zw_l_alanine.is_isomorphic_with(from_mdtraj_topology)
from_mdtraj_topology.visualize()
The OpenFF Toolkit calls out to other software to perform low-level tasks like reading SMILES or files. These external software packages are called toolkits, and presently include RDKit and the OpenEye Toolkit. OpenFF Molecule
objects can be created from the equivalent objects in these toolkits.
Mol
¶The Molecule.from_rdkit()
method converts an rdkit.Chem.rdchem.Mol
object to an OpenFF Molecule
.
from rdkit import Chem
rdmol = Chem.MolFromSmiles("C[C@H]([NH3+])C([O-])=O")
print("rdmol is of type", type(rdmol))
from_rdmol = Molecule.from_rdkit(rdmol)
assert zw_l_alanine.is_isomorphic_with(from_rdmol)
from_rdmol.visualize()
rdmol is of type <class 'rdkit.Chem.rdchem.Mol'>
OEMol
¶The Molecule.from_openeye()
method converts an object that inherits from openeye.oechem.OEMolBase
to an OpenFF Molecule
.
from openeye import oechem
oemol = oechem.OEGraphMol()
oechem.OESmilesToMol(oemol, "C[C@H]([NH3+])C([O-])=O")
assert isinstance(oemol, oechem.OEMolBase)
from_oemol = Molecule.from_openeye(oemol)
assert zw_l_alanine.is_isomorphic_with(from_oemol)
from_oemol.visualize()
QCArchive is a repository of quantum chemical calculations on small molecules. The Molecule.from_qcschema()
method creates a Molecule
from a record from the archive. Because the identity of a molecule can change of the course of a QC calculation, the Toolkit accepts records only if they contain a hydrogen-mapped SMILES code.
:::{note} These examples use molecules other than l-Alanine because of their availability in QCArchive :::
The Molecule.from_qcschema()
method can take a molecule record queried from the QCArchive and create a Molecule
from it.
from qcportal import FractalClient
client = FractalClient()
query = client.query_molecules(molecular_formula="C7H12N2O4")
from_qcarchive = Molecule.from_qcschema(query[0])
from_qcarchive.visualize()
Molecule.from_qcschema()
can also take an optimisation record and create the corresponding Molecule
.
optimization_dataset = client.get_collection(
"OptimizationDataset",
"SMIRNOFF Coverage Set 1"
)
dimethoxymethanol_optimization = optimization_dataset.get_entry('coc(o)oc-0')
from_optimisation = Molecule.from_qcschema(dimethoxymethanol_optimization)
from_optimisation.visualize()