Welcome! In this notebook we will overview Biobox's main features, and point to relevant publications where they were used. Biobox is developed by the Degiacomi group (www.degiacomi.org) in Durham University (UK), and is downloadable from Github, https://github.com/Degiacomi-Lab/biobox. Its API is available here: https://Degiacomi-Lab.github.io/biobox/
Let's get started by compiling Biobox provided along with this notebook (this can take a minute). Then, let's import Biobox, numpy, matplotlib and nglview (we will need all these packages later).
import os
os.chdir('biobox')
os.system('python setup.py install')
os.chdir('..')
import numpy as np
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
import nglview as nv
import biobox as bb
Let’s load a protein in a PDB file, containing 20 frames of a molecular dynamics simulation of a small heat-shock protein.
M = bb.Molecule()
M.import_pdb("HSP.pdb")
All the atomic coordinates are stored in M.coordinates
, which is a directly accessible 3D array of size (nb_conformations, nb_atoms, 3)
print(M.coordinates.shape)
The properties of all atoms (i.e. anything that is not coordinates) are stored in the pandas data structure M.data
.
M.data
Let's check how many chains this structure is made of:
print(np.unique(M.data["chain"]))
Well that's odd, small heat shock proteins come as a dimer... As it sometimes happens, the MD engine has replaced the chain assignment with a single symbol, X
. Let's ask Biobox to figure out how to split the protein in its two chains.
M.guess_chain_split()
print(np.unique(M.data["chain"]))
Ok, that's better, two chains have been found and assigned. Now, let's identify only the backbone atoms of chain A. atomselect
accepts as parameters single strings, lists or “*”
as wildcard. After this call, pos
contains the coordinates of all selected atoms, and idx
their indices. Another way to select atoms, is to use the query
method. The following call will yield the same result as the atomselect
above. The query
method follows the pandas query syntax, and allows to be more expressive. Any column stored in M.data
can be addressed.
pos, idx = M.atomselect("A", "*", ["CA","C","N","O"], get_index=True)
pos, idx = M.query('chain == "A" and name == ["CA","C","N","O"]', get_index=True)
Now that we have identified indices of interest, we can save a subset of the initial pdb in a new one, or to create a new Molecule
object containing only them.
M.write_pdb("chainA.pdb", index=idx)
M2 = M.get_subset(idx)
Let's have a look at what we did. We will represent in cartoon the input protein, and as a surface the part of the protein we selected and saved into a new file.
view = nv.show_file("HSP.pdb")
view.add_component("chainA.pdb")
view.clear_representations(component=0)
view.add_representation("cartoon", component=0)
view.clear_representations(component=1)
view.add_representation("surface", opacity=0.2, component=1)
view
multiple conformations may be available in the PDB file. By default, the first one is set as current. It is possible to set as current another one as follows:
M.set_current(2)
pos2, idx2 = M.atomselect("A", "*", ["CA","C","N","O"], get_index=True)
After this new atomselect
call, idx2
will be equal to idx1
(atom selected are still the same), but pos2
will be different from pos
(atoms positions differ between different conformations). Unless otherwise specified, get_subset
selects all the alternative conformations from the atoms of interest. get_subset
can however also be instructed to select a subset of conformations. For instance, the following call will select only the conformations 0, 1 and 2 of atoms of interest.
M2 = M.get_subset(idx, conformations=[0,1,2])
Biobox methods return numpy arrays. This means that you can directly benefit from data analysis tools in all major Python scientific computing packages. For instance, let's start by calculating an RMSD distance matrix of protein conformations stored in our molecule M
.
dist = M.rmsd_distance_matrix(flat=True)
Now, let's run a hierarchical clustering on the multi-PDB we previously loaded by first calculating an all-vs-all RMSD matrix.
# hierarchical clustering
hierarchical_cluster = hierarchy.linkage(dist, method='single')
#get assignment of structures to clusters
flat_clusters = hierarchy.fcluster(hierarchical_cluster, 2.0, criterion='distance')
print("clustering:", flat_clusters)
# plot dendrogram
d = hierarchy.dendrogram(hierarchical_cluster)
plt.xlabel("conformation (#)")
plt.ylabel("RMSD ($\AA$)")
We want to produce several protein tetrahedral assemblies, and compare them to each other. Now, let’s create a Multimer
arranged according to a tetrahedral symmetry. To do so, we have to load information about the tetrahedral scaffold Biobox will exploit to align six monomers. By default this information is stored in the file classes/polyhedron_database.dat
(along with many more symmetries), though the user can import their own database. The other information we need, is what molecule we will be the tetrahedron with. Here, we will select the first conformation of our small heat-shock protein, and we center it at the origin.
M_subunit = M.get_subset(idxs=list(range(len(M))), conformations=[0])
M_subunit.center_to_origin()
P = bb.Multimer()
P.setup_polyhedron('Tetrahedron', M_subunit)
P.generate_polyhedron(45, 180, 0, 0)
P.write_pdb("polyhedron.pdb", rename_chains=True)
Now, P
contains six proteins arranged as a tetrahedron having a radius of 10 Angstrom. Every subunit is rotated with respect of its specific position on the scaffold. Rotation angles are defined with respect of the molecule’s principal axes. Here, we rotate by 180 degrees around the first principal axis, 20 around the second, and 10 around the third. Let's have a look!
view = nv.show_file("polyhedron.pdb")
view
Let’s now build two new polyhedra with different radii and rotation angles:
P.generate_polyhedron(10, 180, 50, 65, add_conformation=True)
P.generate_polyhedron(12, 185, 40, 60, add_conformation=True)
Since we set add_conformation=True
, the atoms arrangement of the new multimers will be appended as new conformations. With add_conformation=False
(default) the previous subunits arrangements gets overwritten. Note that assemblies’ multiple conformations are treated by appending on each subunit its different conformation. Biobox then sets on all subunits the same current position.
Subunits can also be grouped, and different groups can be rotated differently. In the following example, the tetrahedron’s chains A, B, C and D, E, F form different groups that are rotated independently.
P.conn_type = np.array([0, 0, 0, 1, 1, 1])
P.generate_polyhedron(45, np.array([180, 90]), np.array([0, 0]), np.array([0, 0]), add_conformation=True)
Let's visualize our new polyhedron:
P.write_pdb("polyhedron2.pdb", rename_chains=True)
view = nv.show_file("polyhedron2.pdb")
view
Note that when more than one edge type is provided, rotation angles should be in the form of a numpy array having the same length as the amount of different groups in connection (values in conn_type
are used to index the angles arrays).
Polyhedral scaffolds are constituted of vertices connected by edges. By altering the position of the vertices, the scaffolds can be deformed (e.g. useful to model near-symmetries). In Biobox, deformations are treated in terms of deformation vectors, i.e. unit-vectors indicating in which direction a vertex can move. Here, we will allow the first vertex to move radially. We will then build a tetrahedron, where this vertex is displaced from its initial position by its deformation vector, scaled by a constant (here, 20).
P.deform = np.empty(0)
P.add_deformation(1)
P.conn_type = np.array([0, 0, 0, 0, 0, 0])
P.generate_polyhedron(45, 180, 0, 0, deformation=[20], add_conformation=True)
Let's take a look at our latest polyhedron!
P.write_pdb("polyhedron3.pdb", rename_chains=True)
view = nv.show_file("polyhedron3.pdb")
view
Note that add_deformation also accepts user-defined deformation vectors. To see how your scaffold looks like, a pdb file containing the vertices and an associated TCL script for VMD (drawing colored edges, as a function of grouping) can be produced.
P.write_poly_architecture("architecture", scale=10, deformation=[5])
This will generate two files architecture.pdb
and architecture.tcl
. The initial unit-sized scaffold will scaled by 10, and the first vertex moved away radially.
Now, we want to calculate the RMSD between the created multimers’ alpha carbons. With these lines, dist_mat
will contain the RMSD distance matrix between the multimers:
idxs = P.atomselect("*", "*" ,"*", "CA", get_index=True)[1]
dist_mat = P.rmsd_distance_matrix(points_indices=idxs)
print(dist_mat)
Note that, as for the case of atomselect
applied to Molecule
objects, a query
method is also available. The same selection as the command above can be obtained with:
pts, idx = P.query('name == "CA"', get_index=True)
print(len(idx))
To select atoms from some specific units, the following command can be issued:
pts, idx = P.query('unit == ["0", "3", "5"] and name == "CA"', get_index=True)
print(len(idx))
In this example, we will arrange a group of cylinders in a ring. To do so, we have first to create a single collection of points arranged like a cylinder. Unless otherwise specified (using the optional keyword radius), every point composing the Cylinder
instance (and any other convex point cloud) will have a radius of 1.4 Angstrom. To simulate a smooth surface, one can either increase the points radius, or their density. Finally, the resulting cylinder will be rotated by 45 degrees along the x axis.
cylinder_length = 10
cylinder_radius = 40
C = bb.Cylinder(cylinder_length, cylinder_radius, pts_density_u=np.pi/10., pts_density_h=0.5)
C.rotate(45, 0, 0)
We will now create an assembly loading ten copies of our template cylinder, arrange them in a 30 Angstrom-wide circle, and save the resulting structure into a PDB file.
A = bb.Assembly()
A.load(C, 4)
A.make_circular_symmetry(5)
Let's save this point cloud and display it with nglview.
A.write_pdb("assembly.pdb")
view = nv.show_file("assembly.pdb")
view
We can now assess some of the assembly’s characteristics, for instance its height and width. This can be done by extracting all the assembly’s points coordinates in a unique numpy array.
xyz = A.get_all_xyz()
width = np.max(xyz[:, 0]) - np.min(xyz[:, 0])
height = np.max(xyz[:, 2]) - np.min(xyz[:, 2])
An alternative way to measure assembly dimensions, it to profit of methods in Structure
class. Here we collapse the Assembly
units coordinates in a single Structure
instance.
S = A.make_structure()
print(S.get_size())
In case not all the subunits of the assembly are the same, a list of subunits can be loaded. In this case, we will load a Sphere
(and call it S
) as well as two identical cylinders (called C1
and C2
).
sphere_radius = 20
cylinder_radius = 5
cylinder_length = 50
S = bb.Sphere(sphere_radius, n_sphere_point=500)
C = bb.Cylinder(cylinder_radius, cylinder_length, pts_density_u=np.pi/10, pts_density_h=0.5)
A2 = bb.Assembly()
A2.load_list([S, C, C], ["S", "C1", "C2"])
Now, we will arrange the three loaded structures so that the bases of two cylinders are in touch with the sphere, and one cylinder is rotated by 45 degrees with respect to the other.
A2.translate(0, 0, -cylinder_length/2.0-sphere_radius, ["C1", "C2"])
A2.rotate(0.0, 45.0, 180.0, ["C2"])
As you can see, translations (and rotations) can be applied to units subsets. In this case, we kept the sphere fixed, and only translated the cylinders, and then rotated just one of the two cylinders. Let's take a look at the final result.
A2.write_pdb("assembly2.pdb")
view = nv.show_file("assembly2.pdb")
view
Ion Mobility (IM) experiments report on a molecule’s collision cross section (CCS). Here we show how to relate IM data with a electron density 3D reconstruction obtained by Electron Microscopy (EM).
We first import a GroEL density map EMD-1800.mrc
.
D = bb.Density()
D.import_map("EMD-1080.mrc", "mrc")
Depending on which threshold value one selects, the resulting isosurface will have a certain volume and CCS. We now compute the map’s relationship between threshold, volume and CCS with 100 equally spaced threshold values. This might take several minutes, depending on map size (by default, a scan between minimal and maximal map intensity is performed). Obtained values will be returned in a numpy array containining as columns [threshold, volume, CCS]
. This will also be stored in self.properties[‘scan’]
, for future usage.
tvc = D.threshold_vol_ccs(low=0, sampling_points=100, verbose=True)
Let’s predict the density CCS using a fitted mass-based threshold, and compare it the known CCS of 20600 A^2. This requires providing the map’s resolution (here, 5.4 Angstrom) and the mass of GroEL (801 kDa). The procedure interrogates the data previously stored in D.properties[‘scan’]
.
ccs_mass, fitted_mass_thresh = D.predict_ccs_from_mass(5.4, 801)
print("threshold: %s"%fitted_mass_thresh)
print("CCS: %s A2"%ccs_mass)
error = 100 * (np.abs(ccs_mass - 20600)/20600.)
print("error vs IM data: %4.2f percent"%error)
Error should be typically less than 5%. Values greater than 8% indicate that the protein’s conformation is likely different between EM and IM. We can use fitted_mass_thresh
to create a bead model, that can then be saved into a PDB.
D.place_points(fitted_mass_thresh)
D.write_pdb("model_ccs_mass.pdb")
Let's take a look at the bead model we produced!
view = nv.show_file("model_ccs_mass.pdb")
view.add_component("EMD-1080.mrc")
view.clear_representations()
view.add_spacefill(component=0)
view.add_representation('surface', color='grey', isolevel=0.5, opacity=0.5, component=1)
view
Cross-linking experiments report on the distance between the side chain of specific amino-acids. This distance, measured by a cross-linker molecule, is however not a straight line, but a “shortest solvent accessible path”.
To identify in a structure which lysines may be cross-linked, we start loading it and identifying the location of all lysines’ NZ atoms:
idx = M.atomselect("*", "LYS", "NZ", use_resname=True, get_index=True)[1]
print(M.get_data(idx))
To calculate the path distance between all these atoms, we must first define which protein atoms should be used for clash detection. Here, we select all backbone atoms as well as beta carbon ones. Furthermore, atoms buried in the protein core are also added (with densify=True
). This makes the protein core more “dense”, reducing the likelihood that a path will find its way through the protein, instead of around it.
XL = bb.Xlink(M)
XL.set_clashing_atoms(atoms=["CA", "C", "N", "O", "CB"], densify=True)
We then set up the grid used by the path detection algorithms. Here, we use a local search, using a cubic moving grid of 18 Angstrom per side. After this, the distance matrix path detection algorithm can be launched. We will use a lazy Theta* method, with flexible side chains, and path smoothing as postprocessing.
XL.setup_local_search(maxdist=18)
distance_mat, paths = XL.distance_matrix(idx, method="theta", smooth=True, flexible_sidechain=False, get_path=True)
print(distance_mat)
Let's have a look at the distances we measured. The data structure paths
is a list containing the information we need!
#gather all path points and save them into a PDB file
path_points = []
for p in paths:
if len(p[1])>0:
path_points.append(p[1])
path_points = np.vstack(tuple(path_points))
S = bb.Structure(p=path_points)
S.write_pdb("paths.pdb")
# view protein and the cross-linking distances
view = nv.show_file("paths.pdb")
view.add_component("HSP.pdb")
view.clear_representations()
view.add_representation("spacefill", radiusScale=0.2, component=0)
view.add_licorice('LYS', component=1)
view
distance_mat
is the distance matrix between all lysines, sorted according to idx
. It will contain -1 for lysine’s linking atoms too far to be encompassed by the moving grid, and -2 for failed path detection (e.g. because a linking atom is buried).