Notebook

12. Analysing MD data¶

Important note:¶

This part of the tutorial was originally developed with the assumption that the reader had already gone through the Molecular Dynamics (MD) OxCompBio workshop. It makes some assumptions of prior knowledge about molecular dynamics and the trajectories generated from such simulations.

We therefore advise readers to first complete the MD tutorial and then return to this section afterwards.

As previously discussed, there are many things for which python can be used. One such example is the analysis of Molecular Dynamics (MD) simulations.

There exist several MD analysis libraries, including (but not limited to):

In this section, we will look at how we can use the MDAnalysis library to process simulation structures and trajectories. We will also look at how we can interface with the NGLView to visualise our simulations.

This tutorial will aim to put together all you have learned so far into one final exercise.

12.1 The MDAnalysis Universe module¶

To work with structures and trajectories in MDAnalysis, you must first create an MDAnalysis Universe.

This is an object that contains all the necessary information about your system, including atomic coordinates, atom types, box vectors (if available), etc...

A Universe is a 'class' object, which stores all the information above your system, including;

Atomic coordinates (i.e. positions).
Atom names.
Atom types.
Bond information (if available).

.. and so on. Having access to this information is very useful for analysis, such as if you are trying to measure distances or dihedrals.

When we load a trajectory (as we will do later in the section), the Universe also stores information like the current frame number, etc.

First we import the MDAnalysis Universe module, we will also import numpy for later use:

In [ ]:

from MDAnalysis import Universe
import numpy

You then load the structure file into a Universe instance named alanine with:

In [ ]:

alanine = Universe('datafiles/ALA.pdb')

Note 1: We could have also used import MDAnalysis, in which case we would have to use MDAnalysis.Universe in place of Universe. As previously discussed, the from ... import ... syntax allows us to import only a sub-module from a library, rather than the full thing, which can be convenient in various circumstances.

Note 2: On loading the PDB file you will get a warning that element information is absent or missing. This is normal and occurs because the input PDB file we use does not have element information for all the atoms (which is expected in a normal PDB file). Therefore MDAnalysis warns us that there are no elements present for us to use.

You can check the different properties of your universe using MDAnalysis. The general syntax for this is universename.attributename; some properties (attributes) are themselves classes (e.g. alanine.atoms groups all the information about our atoms) and will have their own attributes. For example, if you want to check the names of the atoms of your structure, you can type:

In [ ]:

alanine.atoms.names

The atom "name" is essentially a type of tag that defines the atoms in your system. Quite often this will be used to label similar atoms, for example in the PDB convention alpha carbons are always named "CA". When analysing MD simulations, the naming scheme will vary depending on the force field used (see MD tutorial).

How would you find the values of the Phi and Psi dihedral angles of alanine dipeptide, in degrees or radians?¶

Hint: the Phi dihedral is calculated from atoms CLP, NL, CA, and CRP, and the Psi dihedral is calculated using atoms NR, CRP, CA, and NL.

For this we can make use of the built-in functions associated with the Universe class (or 'methods'). We can use select_atoms(selection) to isolate a group of atoms that match selection; the syntax of the selections strings is generally the same as for VMD (which you may have use in the MD tutorial).

alanine.select_atoms(selection) will give us an 'atom group' class instance. We can then use this instance's method dihedral to turn four atoms into a 'dihedral' instance, and finally using value() on this instance will return the (current) value of that dihedral angle.

In [ ]:

print(type(alanine))
print(type(alanine.select_atoms('name CLP NL CA CRP')))
print(type(alanine.select_atoms('name CLP NL CA CRP').dihedral))

phi = alanine.select_atoms('name CLP NL CA CRP').dihedral.value()
psi = alanine.select_atoms('name NR CRP CA NL').dihedral.value()
print(f'The Phi dihedral is {phi:.2f} in degrees and {numpy.deg2rad(phi):.2f} in radians')
print(f'The Phi dihedral is {psi:.2f} in degrees and {numpy.deg2rad(psi):.2f} in radians')

How would you plot the change in Phi and Psi over a trajectory?¶

In order to do this, let's break the process down into small steps.

You will first need to load the trajectory. The format to do this is:

universe_name = Universe(PDBfile, TRJfile)

In this case we have provided you a trajectory in the datafiles folder, the PDB file is called 'ALA.pdb' and the TRJ file is called 'ALA.xtc'. So we can call them as ./datafiles/ALA.pdb and ./datafiles/ALA.xtc.

In [ ]:

# Exercise 12.1.1: Create a Universe called 'ala_trajectory' using the PDB file 'ALA.pdb' 
# and the trajectory file 'ALA.xtc'
ala_trajectory = Universe('./datafiles/ALA.pdb', './datafiles/ALA.xtc')

Now that you have a Universe with your trajectory, we can access another feature of the Universe: trajectory data.

First we should look at what the current trajectory frame number is.

The frame number is stored under universename.trajectory.frame and the total number of frames is found under universename.trajectory.n_frames.

In [ ]:

# Exercise 12.1.2: Print the current frame number and the total number of frames
print(ala_trajectory.trajectory.frame)
print(ala_trajectory.trajectory.n_frames)

A Universe.trajectory acts kind of like a list storing each frame in a simulation and various information about it, so we can loop through it the same way we did for lists in previous sections. As we go through each frame, the coordinates of each atom, stored in Universe.atoms, will be updated.

We can therefore quite a quick for loop that prints the frame number for each frame in the trajectory.

In [ ]:

# When using MDanalysis, we usually refer to each frame as a 'timestep' or 'ts'
# Here at each loop instance, we allocate the current frame to 'ts'
for ts in ala_trajectory.trajectory:
    # the comma at the end here will stop it printing on a new line every time,
    # so this doesn't take up too much space
    print(ts.frame)

Now, use the same logic as Exercise 11.1 to print the phi and psi dihedral angles at each frame:

In [ ]:

# Exercise 12.1.3: Print the frame number and at least one of the dihedrals.
# First create the phi and psi selections
phi_dihedral = ala_trajectory.select_atoms('name CLP NL CA CRP').dihedral
psi_dihedral = ala_trajectory.select_atoms('name NR CRP CA NL').dihedral

# We then loop over the the trajectory
for ts in ala_trajectory.trajectory:
    phi = phi_dihedral.value()
    psi = psi_dihedral.value()
    print(f"Current frame: {ts.frame}, phi: {phi}, psi: {psi}")

Now that you can access both the frame number and the phi and psi dihedrals, it's time to plot them.

First, we will need to import a plotting library. In this case we import pyplot from matplotlib.

We then create three empty lists (frames, all_phi, and all_psi) to hold our trajectory data.

Then we fill each of those lists with the relevant values.

In [ ]:

# We declare matplotlib inline to make sure it plots properly
%matplotlib inline
# We need pyplot from matplotlib to plot the dihedrals
from matplotlib import pyplot

# Create three empty lists
frames = []
all_phi = []
all_psi = []

# iterate through the trajectory
for frame in ala_trajectory.trajectory:
    # calculate phi and psi
    phi = ala_trajectory.select_atoms('name CLP or name NL or name CA or name CRP').dihedral.value()
    psi = ala_trajectory.select_atoms('name NR or name CRP or name CA or name NL').dihedral.value()
    # append frame number, phi, and psi to the lists
    frames.append(frame.frame)
    all_phi.append(phi)
    all_psi.append(psi)

# We then plot our data
pyplot.plot(frames, all_psi)
pyplot.plot(frames, all_phi)
pyplot.show()

As shown above, you can use the matplotlib.pyplot module to plot x and y values in the following manner:

pyplot.plot(x_value, y_value)

We then use pyplot.show() to make the plot appear.

Note: You can adjust your matplotlib pyplot plot using various different options such as; pyplot.title(), pyplot.xlim(), pyplot.ylim(), pyplot.xlabel(), and pyplot.ylabel().

For example, pyplot.title() will add a title.

Check the help() of these functions for more details!

12.2 Analysing a protein simulation¶

In the MD tutorial you will do some molecular dynamics simulations of a HIV-1 protease protein using gromacs. Let's now look at how you could use MDAnalysis, and an associated python visulisation library NGLview, to analyse an MD trajectory of this system.

Note: depending on whether you have already done the MD tutorial, you may find that you have already done the analysis shown here. You may still find this section useful as we provide a few extra details which explain how the MD trajectory analysis works.

Visualising the trajectory using nglview¶

The nglview library is a python widget for visualising simulation trajectories, achieving a similar task to the VMD program that you may also use in the MD tutorial. One of the interesting advantages of nglview is that it interfaces directly with analysis packages such as MDAnalysis and runs within jupyter notebooks.

Let's see how we can use nglview to visualise an MDAnalysis universe object.

First we need to create a universe (let's call it protein) from the simulation output files "pre_md.pdb" and "md_cent.xtc" (which are present in the datafiles directory, hence will be passed to Universe as ./datafiles/pre_md.pdb and ./datafiles/md_cent.xtc).

Note 1: We have pre-aligned the trajectory to the first frame for you so as to remove any motions related to translation.

Note 2: When loading the trajectory you will get a warning that MDAnalysis needs to reload offsets, this is normal in this scenario.

In [ ]:

# Exercise 12.2.1: Let's load a universe named protein
protein = Universe('./datafiles/pre_md.pdb', './datafiles/md_cent.xtc')

Next let's load nglview and use it's show_mdanalysis function to load the MDAnalysis universe

In [ ]:

import nglview
protein_view = nglview.show_mdanalysis(protein)

By default this pre-sets the nglview to show the protein in the cartoon representation. Let's add a few options to colour the protein by secondary structure, show water oxygens and change the background colour

In [ ]:

# Let's update the cartoon representation to colour the protein by secondary structure
protein_view.update_cartoon(color='sstruc')

# We then add a transparent hyperball representation of the water oxygens 
#(play with the opacity value, see what you get)
protein_view.add_hyperball('SOL and not hydrogen', opacity=0.4)

# Let's change the display a little bit
protein_view.parameters = dict(camera_type='orthographic', clip_dist=0)

# Set the background colour to black
protein_view.background = 'black'

# Call protein_view to visualise the trajectory
protein_view

The nglview output can be controlled in the following way:

play / pause button: play the trajectory
double click window: enter or exit full screen mode
left mouse button: rotate system
middle mouse wheel: zoom in/out
right mouse button: translate system

As you can be seen from the trajectory, the HIV-1 protease structure does indeed move, but by how much? In the next section we will see how we can use MDAnalysis to quantify backbone fluctuations.

Calculating the root-mean-square deviation¶

In order to gain a quantitative description of how the HIV-1 protease moves in our simulation we can calculate the root-mean-square deviation (RMSD) of the protein backbone.

The RMSD gives us an idea of how 'stable' our protein is when compared to our starting, static, structure. The lower the RMSD is the, more stable we can say our protein is.

The RMSD as a function of time, $\rho (t)$, can be defined by the following equation:

\begin{equation} \\ \rho (t) = \sqrt{\frac{1}{N}\sum^N_{i=1}w_i\big(\mathbf{x}_i(t) - \mathbf{x}^{\text{ref}}_i\big)^2} \end{equation}

Luckily MDAnalysis has its own built-in function to calcualte this, we can import it like we did before.

In [ ]:

from MDAnalysis.analysis.rms import RMSD as rmsd

In order to calculate the RMSD for every frame in our trajectory we will need:

A reference structure
A universe object
A selection of atoms

In our case the reference structure will be the HIV-1 protease structure in the first frame.

Our universe object will be the 'protein' object we defined above.

For our selection we will use the backbone atoms.

In [ ]:

ref = Universe('datafiles/pre_md.pdb', 'datafiles/md_cent.xtc')

# Set the ref trajectory to the first frame
ref.trajectory[0]

Due to the way that GROMACS post processes the trajectory file we need to edit it slightly before running our RMSD.

This is done by aligning all frames to the reference structure.

In [ ]:

from MDAnalysis.analysis import align

# Create the MD simulation universe
protein = Universe('datafiles/pre_md.pdb', 'datafiles/md_cent.xtc')
# Call the MDAnalysis align function to align the MD simulation unvierse to the reference (first frame) universe
align_strucs = align.AlignTraj(protein, ref, select="backbone", weights="mass", in_memory=True, verbose=True)

R = align_strucs.run()

You will have noticed that running this function stores it in the variable 'R', we can now access the RMSD values:

In [ ]:

rmsd_data = R.rmsd

We'd like to visualise how the RMSD changes over time and this can be done in the same way you did in Excercise 11.1.5.

Take a look at the 'rmsd_data' variable (it's a numpy array) and try plotting it below.

You will need to access 'rmsd_data' (a numpy array) in order to plot both the time and the RMSD as a line plot.

In [ ]:

# Excercise 12.2.2: Plot the RMSD data for the HIV-1 protease system. 

# Make sure to add appropriate axis titles.
# What does the RMSD tell you about the protein?
# [If you have time] What happens when you calculate the RMSD using more atoms (i.e. not just the backbone)
pyplot.plot(rmsd_data)
pyplot.title("RMSD over time")
pyplot.xlabel("Frame number")
pyplot.ylabel("RMSD (Angstroms)")
pyplot.show()

Calculating the root-mean-square fluctuation¶

To look at how each residue flucuates over it's average postion we can use the closely related measurement of root-mean-square fluctuation (RMSF).

The RMSF for an atom, $\rho_i$, is given by:

\begin{equation} \rho_i = \sqrt{\sum^N_{i=1} \big\langle(\mathbf{x}_i - \langle \mathbf{x}_i \rangle )^2 \big\rangle } \end{equation}

In [ ]:

from MDAnalysis.analysis.rms import RMSF as rmsf

In [ ]:

# Reset the trajectory to the first frame
protein.trajectory[0]

# We will need to select the alpha Carbons only
calphas = protein.select_atoms("name CA")
rmsf_calc = rmsf(calphas, verbose=True).run()

In [ ]:

# Excercise 12.2.3: Plot the RMSF data for the HIV-1 protease system. 
    # Tip, in order to plot the resids you will need to access them through the rmsf_calc object

# Make sure to add appropriate axis titles.
# What parts of the protein have a high RMSF, can you locate these on the protein structure?
# [If you have time] What happens when you calculate the RMSF using more atoms (i.e. not just the backbone)
pyplot.plot(calphas.resids, rmsf_calc.rmsf)
pyplot.title("Per-Residue Alpha Carbon RMSF")
pyplot.xlabel("Residue Number")
pyplot.ylabel("RMSF (Angstrom)")
pyplot.show()