PDB files are a common way to represent biopolymer structures, but they don't natively contain all of the chemical information required to construct Molecule
objects. However, PDB files are used widely in the molecular simulation community, so OpenFF has implemented functionality to load them.
The recommended pathway for loading PDB files is Topology.from_pdb
. Additionally, the PDB file must have the following characteristics:
For protein atoms (ATOM
records):
For small molecules, waters, and ions (HETATM
records):
unique_molecules
keyword argument:::{admonition} Information
OpenFF Topology
and Molecule
objects store much more information than is in the PDB files they might be generated from. Don't be surprised if this Python code takes a few more seconds to load a PDB file than other tools.
:::
import sys
from openff.toolkit import Molecule, Topology
ipython = get_ipython() # noqa
def hide_traceback(
exc_tuple=None,
filename=None,
tb_offset=None,
exception_only=False,
running_compiled_code=False,
):
"""Hide tracebacks for simpler errors."""
etype, value, _ = sys.exc_info()
value.__cause__ = None # suppress chained exceptions
return ipython._showtraceback(
etype,
value,
ipython.InteractiveTB.get_exception_only(etype, value),
)
ipython.showtraceback = hide_traceback
# Hide NumPy warnings
import warnings
warnings.filterwarnings(
"ignore",
r"The value of the smallest subnormal for ",
)
6hvi_prepared.pdb
is protein from the Merck free energy perturbation study. It was prepared for simulation using commercial tools by Schrodinger, and doesn't have any bulk water, ions, small molecules, or a box.
protein = Topology.from_pdb("6hvi_prepared.pdb")
protein.visualize()
NGLWidget()
From here, you might use Packmol or PDBFixer to add water or other solvents without leaving Python (or there are tools available for water/solvent packing in AMBER, GROMACS, and many commercial packages). An example of using PDBFixer can be found in our "Toolkit Showcase" example.
This PDB file comes from the AMBER tutorial on making solvated simulations. We have slightly modified the procedure to create a file suitable for loading into the OpenFF Toolkit:
ramp1 = Topology.from_pdb("RAMP1_solv_box_ions.pdb")
ramp1.visualize()
NGLWidget()
The following PDB is taken from the GROMACS protein-ligand tutorial (with manually-added CONECT records) and contains a protein, water, and a small molecule (ligand).
While the chemistry of water, ions, and proteins can be deduced from known templates (this is what happened under the hood in the previous two examples), small molecules are trickier. The number of amino acids and ions is relatively small, but the number of possible small molecules is tremendous so storing them in a database is not feasible. We instead ask the user to fill in the missing information that would be looked up from the CCD or hard-coded heuristics, specifically the bond order and stereochemistry. This information is stored in a Molecule
object(s), which we ask the user to provide via the unique_molecules
argument.
If you try to load a PDB file containing a small molecule, without providing the chemistry of the small molecule, you'll get an error message identifying the atoms that couldn't be loaded:
Topology.from_pdb("gromacs_solv_complex.pdb")
UnassignedChemistryInPDBError: Some bonds or atoms in the input could not be identified. Hint: The following residue names with unassigned atoms were not found in the substructure library. While the OpenFF Toolkit identifies residues by matching chemical substructures rather than by residue name, it currently only supports the 20 'canonical' amino acids. JZ4 Hint: The following residues were assigned names that do not match the residue name in the input, or could not be assigned residue names at all. This may indicate that atoms are missing from the input or some other error. The OpenFF Toolkit requires all atoms, including hydrogens, to be explicit in the input to avoid ambiguities in protonation state or bond order: Input residue X:JZ4#0001 contains atoms matching substructures {'NO MATCH'} Error: The following 22 atoms exist in the input but could not be assigned chemical information from the substructure library: Atom 2614 (C4) in residue X:JZ4#0001 Atom 2615 (C7) in residue X:JZ4#0001 Atom 2616 (C8) in residue X:JZ4#0001 Atom 2617 (C9) in residue X:JZ4#0001 Atom 2618 (C10) in residue X:JZ4#0001 Atom 2619 (C11) in residue X:JZ4#0001 Atom 2620 (C12) in residue X:JZ4#0001 Atom 2621 (C13) in residue X:JZ4#0001 Atom 2622 (C14) in residue X:JZ4#0001 Atom 2623 (OAB) in residue X:JZ4#0001 Atom 2624 (H1) in residue X:JZ4#0001 Atom 2625 (H2) in residue X:JZ4#0001 Atom 2626 (H3) in residue X:JZ4#0001 Atom 2627 (H4) in residue X:JZ4#0001 Atom 2628 (H5) in residue X:JZ4#0001 Atom 2629 (H6) in residue X:JZ4#0001 Atom 2630 (H7) in residue X:JZ4#0001 Atom 2631 (H8) in residue X:JZ4#0001 Atom 2632 (H9) in residue X:JZ4#0001 Atom 2633 (H10) in residue X:JZ4#0001 Atom 2634 (H11) in residue X:JZ4#0001 Atom 2635 (H12) in residue X:JZ4#0001
Any OpenFF Molecule
object with the appropriate atoms and connectivity can be used to identify the small molecule when loading the PDB file. See the Molecule Cookbook for some of the numerous ways to create them! Since the coordinates are specified in the PDB file, it's not necessary that this Molecule
has conformers.
In this case we know the identity of the ligand and can look up a SMILES string, which is all that's needed to load this PDB file! Since there are many molecules in this file, it may take a few tens of seconds to load and visualize - this is expected and is a result of carefully checking and applying chemical information.
jz4 = Molecule.from_smiles("CCCC1=CC=CC=C1O")
complex = Topology.from_pdb("gromacs_solv_complex.pdb", unique_molecules=[jz4])
complex.visualize()
NGLWidget()