Notebook

Homology Modelling Tutorial¶

Max Epstein, Bjarne Feddersen and Phil Biggin (based on previous work from Phill Stansfeld)

Objectives¶

In this tutorial we will be building an homology model of the human adult muscle nAChR (nicotinic acetylcholine receptor). There are quite a few homology modelling tutorials available online - most of them guide you through a very simple case of a single sequence with a known good alignment to a target with no gaps. Often though we are not so lucky - this tutorial is designed to take you through a more typical scenario, where you might want to first generate an initial multiple sequence alignment to improve the pair-wise alignment that you will use for the structure generation, what to do about missing bits of the template, how to generate multiple chain models, dealing with disufides, loop refinement and finally how to score and select a good model.

In general though its also worth thinking about some of the following points before one starts making models:

In what state do we want to model our protein (i.e open, closed or desensitized)?
Does a structure of a phylogenetically related protein in that particular state exist?
If so, are there multiple structures of this state?
What structure gives the best sequence alignment?
What if you have high sequence alignment but low resolution? Perhaps it's worth considering a different template...
Are there any other reasons why you might not want to select a particular structure? I.e. inaccurate assignemnt of EM density - missing loops or regions.

We will use this jupyter notebook to explore this. If you are following the OxCompBio course, then an introduction to python and juypter notebooks is the first session.

The idea of this tutorial is that you should be able to follow it through at your own pace. If you are of a timetabled course, then there will be some demonstrator help available, but this tutorial has been designed such that you should be able to complete it without help, even if you get stuck.

BEFORE YOU START - Please ensure you installed all the necessary software (modeller, jalview, pymol - via conda install)

Most of software necessary can be very simply installed under the conda framework. Conda is available on windows, linux and mac.
It will be necessary to obtain the (free) license key for modeller (see https://salilab.org/modeller/ for details).
This is the first step if you have not already installed conda (or miniconda) on your laptop - Rather than repeating the instructions here, simply head over to https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html and pick up the relevant instructions for your operating system.
First create and activate an environment (you might have already done this if you did other parts of this course - in which case just activate it)
% conda create --name OxCompBio
% conda activate OxCompBio
Next, install the following packages with conda

% conda config -add channels defaults

% conda config -add channels bioconda

% conda config -add channels conda-forge

% conda install -c salilab modeller <-- Note - you will need to get a (free) license key - do this in advnance of commencing the tutorial!

% conda install jalview <-- if this gives problems - go to installers at http://www.jalview.org/builds/nextrel/Web_Installers/install.htm

% conda install -c anaconda openjdk

% conda install -c schrodinger pymol <-- Note - this is the schrodinger version - a freeware version is available depending on your OS.

% conda install -c binstar unxutils <-- windows users only
windows users might like to install notepad++ which has better manipulation features than the default Notepad (and will be needed for the last section)
Don't forget to edit config.py (under you modeller-9.25/modlib/modeller/ directory) such the the license says what your license key is.

1. Perform a mulitple sequence alignment¶

Make a directory entitled 'nAChR_model' where you can save outputs to. This will be relative to the location of this notebook.

In [ ]:

! mkdir nAChR_model

Get Sequences¶

Go to uniprot.org to search for the sequences of interest.
The gene encoding the muscle nicotinic acetylcholine receptor alpha1 subunit is 'CHRNA', so type that in to the search bar and look for ACHA_HUMAN.
Click on the entry number P02708 and locate the FASTA sequence.
Two isoforms of this subunit exist. Click on the 'Isoform 1' (Isoform 2 is listed first) FASTA file download. This will redirect you to a page with the amino acid sequence. Keep the tab open or copy the tab because we are going to cut and paste from this. Or you can save to a file in your nAChR_model directory and cut and paste from there.
Do this for CHRNB, CHRND and CHRNE sequences as well. These are the other subunits of the adult muscle nAChR.
We'll now do a blast search to look for a suitable template sequence for modelling the human alpha muscle subunit. This will be a sequence for which a 3D structure exists that we can use as a template to base our homology model on.
Go to the Blast wesbite
In the 'choose Search set', change the Database to PDB.
Copy and paste the CHRNA amino acid seqeunce into the FASTA box and click BLAST.
6PV7 is a structure for a recently solved human nAChR with alpha3 and beta4 subunits.
Now locate the respective FASTA sequences for these subunits in uniprot and open each FASTA sequence in a new tab.
Finally, save all the sequences in your nAChR_model directory you created.

Generate the alignment¶

An accurate MSA will have a good balance of related sequences. As we are trying to model a cationic channel- it makes sense to include sequences of mostly cationic channels.
If we include too many anionic channels (lower sequence identity) then the accuracy of our alignment may decrease.
Once you have all your sequences in invividual tabs, go to the MUSCLE multiple sequence alignment website:
Copy and paste in all your FASTA sequences. Remember to include the first line of each respective FASTA sequence that contains the metadata. If you want to include your email address to send the resulting file to you can, although this is not necessary.
When the alignment is ready, right click on the 'Download Alignment File' tab and click 'Save link as'.
Save the resulting .clw file as 'nachr_alignment.clw' to your 'nAChR_model' directory.

Next we will visualize the alignment...

Visualize the alignment¶

Open up Jalview (if you are on windows you will have to click on the desktop icon)

In [ ]:

! jalview # linux/mac - windows users hit the desktop icon

An example alignment should open automatically so just close them after the programme has loaded up.
Drag and drop your .clw file on to the JalView GUI. (alternatively rename nachr_alignment.clw as nachr_alignment.aln and use the file tab and the open file dialogue).
Next, click on the 'Colour' tab and select 'Clustalx'. This will visualise your alignment by amino acid properties.
Scroll through your alignment. If there are any insertions or deletions ('indels') denoted by dashes in regions of high sequence identity then you should scrutinize your alignemnt carefully and if needed, move them at the point of generating your final alignment. Indels commonly show up in regions of low sequence identity where flexible loops exist. These are usually less problematic but should be analysed before model generation.
Jalview can also interface to a secondary structure prediction.
Click on the 'Web Service' tab and then find 'JPred secondary structure prediction'.

You should be able to see red lines where alpha-helices have been predicted and green lines for beta-sheets.

Congratulations- you've completed the MSA part of this tutorial! Its worth at this stage just picking out the alpha subunit and looking at how similar they are.
Before you quit jalview, use to save a new format that is easier to manipulate in Modeller. Go to the tab and save as nachr_alignment.pir

2. Build a model of a single subunit (the human muscle alpha subunit) first¶

For large heteromeric proteins like the nAChR, it may help to just try and build a single subunit in the first instance. Or you may be wanting to model a protein that is comprised of a single chain only anyway.
Homology modelling at its simplest just needs a template and the sequence to be modelled. We could have made an alignment of just the alpha4 template sequence with our muscle alpha sequence. However, if you have other sequences from related proteins, the quality of the alignment is usually much better, which is why did the above multiple sequence alignment.
We could use the whole alignment, but to make things as simple as possible lets just extract the two alpha sequences.
Make a copy of your .pir file and save it with .ali at the end of the file name.

In [ ]:

! cp nAChR_model/nachr_alignment.pir nAChR_model/alpha_pairwise_alignment.ali

Although we copied the file and called it alpha_pairwise_alignment.ali it currently does not have the .ali format.
First remove all subunits that you will NOT include in the homology modelling process (i.e. everything except the alpha template (ACHA3_HUMAN) and our sequence of interest (ACHA_HUMAN)). You can do this with any editor (vim, nano, notebook, nedit, gedit)
You should have a file with the entries for the ACHA3_HUMAN (alpha 3 subunit) and the ACHA_HUMAN (the human muscle alpha subunit) only.
You will also need to replace the line starting with SP under the ACHA3_HUMAN entry with the following string :-

structureX:alpha3.pdb:START:A:END:A::: 0.00: 0.00
Similarly replace the line starting with SP under the ACHA_HUMAN with this:-

sequence:alpha_seq::::::: 0.00: 0.00
These lines contain information about whether the sequence is from the template (Structure) or from the sequence we want to model. : characters separate bits of information (full details can be found here). The key ones for us today to worry about are fields 3,4 5 and 6 for the template sequence - these fields are the starting residue,the chain letter (A in our case), the last residue and the chain letter (again just A for us). Here, at the moment we have place holders (START, END) for the residue mumbers, which we have to replace later. You might think we already know the start and last residue of the template from the uniprot sequence but as we will see, this will often not work directly because the crystal structure has some bits missing.
Take time with this step as it's the most important. Mistakes made at this point carry through to the final model.
If you are worried about this (or want to cheat!) you can look at the format of one we prepared earlier in the backup directory (note that this does not quite match - we will take you through how to work how long the alignment is in the section below - ie how to work out what to put in place of END).
We are going to have edit this file further later.

In [ ]:

! vim alpha_pairwise_alignment.ali # linux/mac

In [ ]:

! notepad alpha_pairwise_alignment.ali # windows

In [ ]:

! cat backup/single_subunit/alpha_pairwise_alignment.ali

Once you are satisfied with your alignment you can focus on generating a 3D model. Go to rcsb and search for 6pv7.
Download a .pdb file of 6pv7 to your current working directory, then open 6pv7.pdb in pymol.
This is the PDB of the alpha3beta4 nicotinic receptor (see Gahrpure et al 2019 - https://doi.org/10.1016/j.neuron.2019.07.030), which we will use as a template (other structures could be used - indeed, if there are potentially different templates, you should consider which one might be best or even using more than one at the same time - for now lets just assume that this our template of choice).

In [ ]:

! pymol 6pv7.pdb

Within pymol, type:

select A, chain A and not HETATM

Then on the right menu click 'A' and then 'copy to object'.
Save the resulting obj01 as alpha3.pdb (careful to select .pdb output)
You also use the followin command to get the sequence to cut and paste

print(cmd.get_fastastr('obj01'))

You can cut and paste that (just the sequence -not the line begining with >) to a file; lets call that alpha3_structure_sequence.fasta. Before you quit pymol take a look at just Obj01 (Chain A of the protein) - you might notice that in fact there is an apparent break in the chain (between the 3rd and 4th transmembrane helix) - this is a problem that we will need to do with and is quite a common problem in homology modelling.

We now need to update the .ali file to contain the correct information for the template you are using.
When proteins are solved experimentally (e.g. using X-ray crytallography or Cryo-electron microscopy), often, certain flexible motifs or loops are removed as these may make structure determination difficult to impossible. As such, the uniprot sequence that you have for the alpha3 nAChR subunit will contain residues that have not been included in the experimentally solved template structure that we have.
We will therefore need to correct this residue mismatch by removing those residues that are not in the alpha3.pdb stucture from the template structure section of our nachr_alignment.ali alignment file.
If you compare nachr_alignment.ali with the alpha3_structure_sequence.fasta, you will see that the structure starts at Serine 32 (this is the start of the mature protein - the uniprot derived sequences contain the whole sequence). The easiest thing at this stage is to delete residues and dashes from both sequences until that serine. Then you can simply replace START in alpha_pairwise_alignment.ali with 1.
Now we have to replace END with the last residue - this is simply the number of residues we have in the structure file. You could of course simple count them, but below is a useful bit command line unix that can do this for you (assuming that alpha3_structure_sequence.fasta contains ONLY the sequence (the answer should be 387). Note that you may have to trim the end of the sequences as well so that they end at the same residue.

In [ ]:

 ! tr -d '\n\r' < alpha3_structure_sequence.fasta |wc -m

If you haven't do so, make those edits to nachr_alignment.ali now using your favourite editor or vim

In [ ]:

! vim alpha_pairwise_alignment.ali # linux/mac

In [ ]:

! notepad alpha_pairwise_alignment.ali # windows

Now there is one final edit to make before we can actually run modeller. Remember this structure is not complete - there is a missing section between TM3 and 4. If you look at the sequence you will see the break between residue 327 to 410. We need to tell Modeller about that break - to do that we just insert a "/" character in both sequence after the R327 and before I410 in the sequence of the template and add the "/" at the exact equivalent position in the sequence to be modelled. Use your editor to make these changes. You can then check at this stage, by comparing to one I made earlier. A "/" is used to indicate a break in the sequence.

In [ ]:

! vim alpha_pairwise_alignment.ali # linux/mac

In [ ]:

! notepad alpha_pairwise_alignment.ali # windows

In [ ]:

! cat backup/single_subunit/alpha_pairwise_alignment.ali

Now we should be ready to run modeller.

In [ ]:

from modeller import *
from modeller.automodel import *    # Load the automodel class

log.verbose()
env = environ()

# directories for input atom files
env.io.atom_files_directory = ['.', './nAChR_model']

a = automodel(env,
              alnfile  = 'alpha_pairwise_alignment.ali',     # alignment filename
              knowns   = 'alpha3',              # codes of the templates
              sequence = 'alpha_seq')              # code of the target - we can call this what we like

a.starting_model= 1                 # index of the first model
a.ending_model  = 1                # index of the last model as this is 1 and the start is 1 - we will just generated 1 model
                                    # (determines how many models to calculate)
a.make()                            # do comparative modeling

You can either view this model directly as below or fire up pymol to check the structure looks OK - visual inspection is super important!

In [ ]:

import nglview
view = nglview.show_structure_file('alpha_seq.B99990001.pdb')

In [ ]:

view

In [ ]:

view.render_image()

In [ ]:

view._display_image()

That concludes this part of the tutorial.

3. Build a model of the full heteromeric receptor¶

Next we will investigate how to make a model of the complete receptor. The instructions will deliberately get less verbose, but you should be able to refer to the part 2 above if you need to.

First lets create a new directory to do this in - lets call that full-model

In [ ]:

! mkdir full-model

Copy your .clw, .ali and 6pv7.pdb files to this directory.
Make a copy of 6pv7.pdb and open this in pymol. Have a look at the structure you will see that there are various other entities in this structure including antibodies that we don't really want in our template model.
In pymol type:

remove not polymer or chain F or chain G or chain H or chain I

into the GUI to remove all non protein residues and 4 antibody chains.
You should also be able to clearly now see the pentameric nature of the a4b3 structure template comprised of five chains A-->E. If you imagine looking down the central axis with the large beta sheet domain nearest you then we will have to model our human muscle nAChR such that the order of the subunits going anticlockwise is alpha-delta-beta-alpha-epsilon.
In the header of 6pv7 you can see that it tells us that chains A and D are the alpha3 subunits, and B,C and E are the beta4 subunits. This matches the positions of where we expect the alpha subunits of our human muscle to be (phew!). The beta4 subunits will act as the template for all non-alpha subunits. Thus the order of the sequences in the alignment file we need to make should indeed be alpha-delta-beta-alpha-epsilon.
Save the resulting structure of just the receptor protein as a new .pdb file (simply file->export molecule then select all as we have removed all the atoms we don't want from this structure). If you want to check this then there is one we made earlier in ./backup/full-model/6pv7_clean.pdb

Modeller will expect one long protein essentially with '/' defining breaks in the chain between TM3 and 4 again but also between the ends of chains. The end of the whole pentamer is marked with an *. Remember its essentially just a pairwise aligment - so just two long lines, but because that is hard to read we make it over several lines.
Make a copy of the original nachr_alignment.ali and call it something sensible. You will then need to edit this into the same format you did for the single subunit.

If you get completely stuck with editing the alignment, then there is one we made earlier in the backup/full-model directory called pentamer-alignment.ali, but try to make work this out on your own as first!.

Modify the Modeller script below to accomodate your naming

In [ ]:

from modeller import *
from modeller.automodel import *    # Load the automodel class

log.verbose()
env = environ()

# directories for input atom files
env.io.atom_files_directory = ['.']

a = automodel(env,
              alnfile  = 'pentamer_alignment.ali',     # alignment filename
              knowns   = '6pv7_clean', # codes of the templates - ie your "cleaned up" pdb above
              sequence = 'muscle_nAChR')              # code of the target

a.starting_model= 1                 # index of the first model
a.ending_model  = 1                # index of the last model
                                    # (determines how many models to calculate)
a.make()                            # do comparative modeling

lets take a look - modify the holding code below as appropriate

In [ ]:

import nglview
view = nglview.show_structure_file('YOUR_TARGET_NAME.B99990001.pdb')

In [ ]:

view

In [ ]:

view._display_image()

4. Modifying models to include features like disufide bonds¶

The muscle nAChR, like many other surface proteins, contains disulfide bonds in its structure. It is of course crucially important to check that these have been correctly formed after you generate your model.
Modeller can easily deal with this problem by effectively using a model of itself and correcting the problematic bonds.
Included in the directory 'disulfide' is a previously built homology model (called muscle_nAChR.pdb) that does NOT contain proper disulfide bonds.
Open the model in pymol and locate the cysteine residues that should be bonded.

select br. resname CYS

into the pymol GUI to help find the correct residues.
As you did in previous sections above, extract the fast sequence of the protein. You can then use these sequence as both the template and the sequence to be modelled - thus the alignment is super-trivial in this case!
Construct a alignment.ali file for this.
Adapt the modeller run below (you can keep the number of models generated to 1)

In [ ]:

# Comparative modeling by the automodel class
from modeller import *			# Load standard Modeller classes
from modeller.automodel import *	# Load the automodel class

# Modeller can automatically build in disulfides if they are aligned with template disu
# Redefine the special_patches routine to include the additional disulfides
# (this routine is empty by default):
log.verbose()
env = environ()
env.io.atom_files_directory = ['.']

class MyModel(automodel):
    def special_patches(self, aln):
        # A disulfide between residues 196 and 197: - for brevity, we only fix one here.
        self.patch(residue_type='DISU', residues=(self.residues['192:A'],
                                                  self.residues['193:A']))

a = MyModel(env, alnfile='self-on-self-align.ali',#alignment filename
              knowns='muscle_nAChR', sequence='disulfide', 	#edit the knowns to reflect the structure of one you have made earlier
              assess_methods=(assess.DOPE,	
                              assess.GA341))
a.starting_model = 1	#index of the first model
a.ending_model = 1	#index of the last model (determines how many models to calculate)
a.make()		#do comparative modeling

5. Refining Loops¶

Due to the uncertainty of how to model flexible loops, you may need to try a few different loop conformations before you obtain something that makes sense.
select your model of the pentameric receptor you worked on previously.
You do not need a .ali file here, just modify the holding modeller script below.
A modeller objective function will be displayed in the pdb files as before.
Open all these structures in pymol to see the different loop conformation outputs. In this case we are refining the loop C region of one of the subunits (subunit A).

In [ ]:

# Loop refinement of an existing model
from modeller import *
from modeller.automodel import *

log.verbose()
env = environ()

# directories for input atom files
env.io.atom_files_directory = ['.']  # needed if you put your coordinate files somewhere

# Create a new class based on 'loopmodel' so that we can redefine
# select_loop_atoms (necessary)
class MyLoop(loopmodel):
    # This routine picks the residues to be refined by loop modeling
    def select_loop_atoms(self):
        # 10 residue insertion 
        return selection(self.residue_range('190:A', '200:A'))  # You will have CHANGE THIS FOR YOUR MODEL -
                                                                # these refer to the muscle_nAChR.pdb file -
                                                                # so change for your template that you call below... 

m = MyLoop(env,
           inimodel='muscle_nAChR.pdb', # initial model of the target - CHANGE TO YOUR TEMPLATE PDB
           sequence='muscle')          # code of the target

m.loop.starting_model= 1           # index of the first loop model 
m.loop.ending_model  = 10          # index of the last loop model
m.loop.md_level = refine.very_fast # loop refinement method; this yields
                                   # models quickly but of low quality;
                                   # use refine.slow for better models

m.make()

6. Scoring your models¶

A typical strategy in obtaining accurate models is to generate multiple homology models and assess their quality based on various scoring functions.
Contained in the example_models directory are 100 previously generated homology models.
You can find the Modeller objective function score on the second line of each .pdb file.
This function is not an absolute measure and only pertains to models generated from the same alignment. Essentailly it is a relative scoring function for the set of models you have just built.
Lets extract the scores for these models...

In [ ]:

! grep OBJECTIVE example_models/*pdb >> example_models/scores.txt

Open this file in vim to sort the data according to the last column.
In linux/man

:%!column -t which will ensure the data is column readable the type:

:%!sort -k6nr which should sort on the last column which is the objective function
In windows Notepad++ can do this for you.
Use this to select the top 10 models based on the Modeller objective function.
Copy these files into a new directory.
Next we will upload these models to the QMEAN server in order to select a single model.
The QMEAN (Qualitative Model Energy ANalysis) scoring function derives a quality estimate for both local and global (per-residue) quality estimates. For more information on the scoring function see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2703985/
Go to the QMEAN server website via:

https://swissmodel.expasy.org/qmean/
Select just 'QMEANBrane' from the three method options available (as we have a membrane protein)
Select the top 10 models you saved in the new directory for analysis by clicking the 'select coordinate file' option and click submit once these have been uploaded. Feel free to add your email if you'd like a copy of the results sent to you.
In the output you should see QMEAN scores for each model as well as a breakdown of the local quality.
By clicking Project Archive download we can obtain a .zip with a breakdown of the results.
For each model you should be able to scroll over the local quality score which will also simultaneously display amino acid residues on a 3D projection of your model on the panel to the right.
Select the model with the highest QMEAN4 value for the next step, keeping the QMEAN tab open.
We will now check the model's backbone torsions by generating a Ramachandran plot.
Go to the following website and upload your structure:

http://molprobity.biochem.duke.edu/index.php
Follow the website instructions to generate a ramachandran analysis.
Are there any violations and if so, why might they be occuring there?
Open up your .pdb file in pymol or VMD to check.
Do these regions also correspond to low local quality scores in QMEAN?

Homology Modelling Tutorial¶

Objectives¶

1. Perform a mulitple sequence alignment¶

Get Sequences¶

Generate the alignment¶

Visualize the alignment¶

2. Build a model of a single subunit (the human muscle alpha subunit) first¶

3. Build a model of the full heteromeric receptor¶

4. Modifying models to include features like disufide bonds¶

5. Refining Loops¶

6. Scoring your models¶

END! Congratulations you are now ready to build your own homology models!¶