Keywords: carbohydrate, glycan, sugar, glucose, mannose, GlycanTreeSet, saccharide, furanose, pyranose, aldose, ketose
In this chapter, we will focus on a special subset of non-peptide oligo- and polymers — carbohydrates.
Modeling carbohydrates (also known as saccharides, glycans, or simply sugars) comes with some special challenges. For one, most saccharide residues contain a ring as part of their backbone. This ring provides potentially new degrees of freedom when sampling. Additionally, carbohydrate structures are often branched, which in Rosetta leads to more complicated FoldTrees.
This chapter includes a quick overview of carbohydrate nomenclature, structure, and basic interactions within Rosetta.
Sugars (saccharides) are defined as hydroxylated aldehydes and ketones. A typical monosaccharide has an equal number of carbon and oxygen atoms. For example, glucose has the molecular formula C6H12O6.
Sugars containing more than three carbons will spontaneously cyclize in aqueous environments to form five- or six-membered hemiacetals and hemiketals. Sugars with five-membered rings are called furanoses; those with six-membered rings are called pyranoses (Fig. 1).
A sugar is classified as an aldose or ketose, depending on whether it has an aldehyde or ketone in its linear form (Fig. 2).
The different sugars have different names, depending on the stereochemistry at each of the carbon atoms in the molecule. For example, glucose has one set of stereochemistries, while mannose has another.
In addition to their full names, many individual saccharide residues have three-letter codes, just like amino acid residues do. Glucose is "Glc" and mannose is "Man".
A glycan tree is made up of many sugar residues, each residue a ring. The 'backbone' of a glycan is the connection between one residue and another. The chemical makeup of each sugar residue in this 'linkage' affects the propensity/energy of each backbone dihedral angle. In addition, sugars can be attached via different carbons of the parent glycan. In this way, both the chemical makeup and the attachment position affect the dihedral propensities. Typically, there are two backbone dihedral angles, but there can be four or more, depending on the connection.
In IUPAC nomenclature, the dihedrals of residue N are defined as the dihedrals between N and N-1 (i.e., the parent linkage). The dihedrals of the ASN (or other glycosylated protein residue) become part of the first glycan residue that is connected to it. This first glycan residue therefore has 4 torsions, while the ASN now has none!
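Each backbone dihedral discussed above is an ordinary four-atom torsion. As a minimal sketch (plain Python, not Rosetta code; the helper name `dihedral` and the sample points are illustrative assumptions), the signed angle can be computed from four coordinates like this:

```python
import math

def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees for four points (hypothetical helper)."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    # Three consecutive bond vectors along the four atoms.
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    # Normals of the two planes that share the central bond b2.
    n1, n2 = cross(b1, b2), cross(b2, b3)
    # atan2 form: stable near 0 and 180 degrees and keeps the sign.
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates: the fourth point is rotated a quarter turn about the central bond.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Which four atoms define φ, ψ, and ω for a given linkage is decided by Rosetta; the sketch only shows the geometry behind a single torsion value.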
If you are creating a movemap for glycan dihedrals, please use the `MoveMapFactory`, as this has the IUPAC nomenclature of glycan residues built in to allow proper DOF sampling of the backbone residues, especially for branching glycan trees. In general, all of our samplers should use residue selectors and use the `MoveMapFactory` to build movemaps internally.
A sugar's side-chains are the constituents of the glycan ring, which are typically an OH group or an acetyl group. These are sampled together at 60 degree angles by default during packing. A higher granularity of rotamers cannot currently be handled in Rosetta, but 60 degrees seems adequate for our purposes.
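To get a feel for what 60-degree sampling implies, the sketch below enumerates every combination of angles for a few rotatable groups. It is plain Python rather than the Rosetta packer, and both the angle range (-180 to 120) and the exhaustive enumeration are assumptions made for illustration.

```python
from itertools import product

def side_chain_rotamers(n_groups, step=60):
    """All angle combinations for n_groups torsions sampled every `step` degrees."""
    angles = range(-180, 180, step)  # -180, -120, -60, 0, 60, 120
    return list(product(angles, repeat=n_groups))

# Three rotatable groups sampled together give 6**3 combinations.
rotamers = side_chain_rotamers(3)
print(len(rotamers))
print(rotamers[0], rotamers[-1])
```

The count grows as 6 to the power of the number of groups, which suggests why a much finer step quickly becomes expensive.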
Within Rosetta, glycan connectivity information is stored in the `GlycanTreeSet`, which is continually updated to reflect any residue changes or additions to the pose.
This info is always available through the function
pose.glycan_tree_set()
Chemical information of each glycan residue can be accessed through the CarbohydrateInfo object, which is stored in each ResidueType object:
pose.residue_type(i).carbohydrate_info()
We will cover both of these classes in the next tutorial.
Residue centric modeling and design of saccharide and glycoconjugate structures. Jason W. Labonte, Jared Adolf-Bryfogle, William R. Schief, Jeffrey J. Gray. Journal of Computational Chemistry, 11/30/2016. https://doi.org/10.1002/jcc.24679
Automatically Fixing Errors in Glycoprotein Structures with Rosetta. Brandon Frenz, Sebastian Rämisch, Andrew J. Borst, Alexandra C. Walls, Jared Adolf-Bryfogle, William R. Schief, David Veesler, Frank DiMaio. Structure, 1/2/2019.
Let's use PyRosetta to compare some common monosaccharide residues and see how they differ. As usual, we start by importing the `pyrosetta` and `rosetta` namespaces.
!pip install pyrosettacolabsetup
import pyrosettacolabsetup; pyrosettacolabsetup.install_pyrosetta()
import pyrosetta; pyrosetta.init()
from pyrosetta import *
from pyrosetta.teaching import *
from pyrosetta.rosetta import *
First, one needs the `-include_sugars` option, which tells Rosetta to load sugars and adds the sugar_bb energy term to the default scorefunction. This score term is like rama for the sugar dihedrals that connect each sugar residue.
init('-include_sugars')
When loading structures from the PDB that include glycans, we use these options. This includes an option to write out the structures in pdb format instead of the (better) Rosetta format. We will be using these options in the next tutorial.
-maintain_links
-auto_detect_glycan_connections
-alternate_3_letter_codes pdb_sugar
-write_glycan_pdb_codes
-load_PDB_components false
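As a small sketch of how the flags listed above could be handed to PyRosetta, the snippet below joins them into one option string. It assumes the flags are passed to `init()` as a single space-separated string, in the same way as `'-include_sugars'` earlier; the `init()` call is left commented out so the snippet runs without PyRosetta installed.

```python
# Flags taken from the list above, plus -include_sugars from the previous step.
glycan_flags = [
    '-include_sugars',
    '-maintain_links',
    '-auto_detect_glycan_connections',
    '-alternate_3_letter_codes pdb_sugar',
    '-write_glycan_pdb_codes',
    '-load_PDB_components false',
]

# One space-separated option string.
options = ' '.join(glycan_flags)
print(options)

# In a session where PyRosetta is available (assumption: same init() as above):
# pyrosetta.init(options)
```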
pm = PyMOLMover()
We will use the function `pose_from_saccharide_sequence()`, which must be imported from the `core.pose` namespace. Unlike with peptide chains, one-letter codes will not suffice when specifying saccharide chains, because there is too much information to convey; we must use at least four letters. The first three letters are the sugar's three-letter code; the fourth letter designates whether the residue is a furanose (`f`) or pyranose (`p`).
from pyrosetta.rosetta.core.pose import pose_from_saccharide_sequence
glucose = pose_from_saccharide_sequence('Glcp')
galactose = pose_from_saccharide_sequence('Galp')
mannose = pose_from_saccharide_sequence('Manp')
Just like with peptides, saccharides come in two enantiomeric forms, labelled l and d. (Note the small-caps, used in print.) These can be loaded into PyRosetta using the prefixes `L-` and `D-`.
L_glucose = pose_from_saccharide_sequence('L-Glcp')
D_glucose = pose_from_saccharide_sequence('D-Glcp')
The carbon that is at a higher oxidation state — that is, the carbon of the hemiacetal/-ketal in the cyclic form or the carbon that is the carbonyl carbon of the aldehyde or ketone in the linear form — is called the anomeric carbon. Because the carbonyl of an aldehyde or ketone is planar, a sugar molecule can cyclize into one of two forms, one in which the resulting hydroxyl group is pointing "up" and another in which the same hydroxyl group is pointing "down". These two anomers are labelled α and β.
alpha_D_glucose = pose_from_saccharide_sequence('a-D-Glcp')
Oligo- and polysaccharides are composed of simple monosaccharide residues connected by acetal and ketal linkages called glycosidic bonds. Any of the monosaccharide's hydroxyl groups can be used to form a linkage to the anomeric carbon of another monosaccharide, leading to both linear and branched molecules.
Rosetta can create both linear and branched oligosaccharides from an IUPAC sequence. (IUPAC is the international organization dedicated to chemical nomenclature.)
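To properly build a linear oligosaccharide, Rosetta must know the following details about each sugar residue being created, in the following order:
- Main-chain connectivity: →2) (`->2)`), →4) (`->4)`), →6) (`->6)`), etc.; default value is `->4)-`
- Anomeric form: α (`a` or `alpha`) or β (`b` or `beta`); default value is `alpha`
- Enantiomeric form: l (`L`) or d (`D`); default value is `D`
- Three-letter code, e.g., `Glc`; required
- Ring form code: `f` (furanose) or `p` (pyranose); required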
Residues must be separated by hyphens. Glycosidic linkages can be specified with full IUPAC notation, e.g., `-(1->4)-` for “-(1→4)-”. (This means that the residue on the left connects from its C1 (anomeric) position to the hydroxyl oxygen at C4 of the residue on the right.) Rosetta will assume `-(1->` for aldoses and `-(2->` for ketoses.
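The defaults described above can be made concrete with a toy parser for a single residue token. This is only a sketch in plain Python and not Rosetta's own parser: the regular expression, the function `parse_residue`, the field names, and the small `KETOSES` table are all assumptions made for illustration.

```python
import re

# Illustrative table of ketose three-letter codes (assumption, not Rosetta's database).
KETOSES = {'Fru', 'Psi', 'Sor', 'Tag'}

RESIDUE = re.compile(
    r'^(?:->(?P<link>\d)\)-)?'             # main-chain connectivity, e.g. ->6)-
    r'(?:(?P<anomer>alpha|beta|a|b)-)?'    # anomeric form
    r'(?:(?P<enantiomer>[LD])-)?'          # enantiomeric form
    r'(?P<code>[A-Z][a-z]{2})'             # three-letter code
    r'(?P<ring>[fp])'                      # ring form code
    r'(?P<mods>\w*)$'                      # optional modification, e.g. NAc
)

def parse_residue(token):
    """Split one residue token into its parts, filling in the stated defaults."""
    match = RESIDUE.match(token)
    if match is None:
        raise ValueError(f'unrecognized residue token: {token!r}')
    anomer = match.group('anomer') or 'alpha'
    anomer = {'a': 'alpha', 'b': 'beta'}.get(anomer, anomer)
    code = match.group('code')
    return {
        'main_chain_position': int(match.group('link') or 4),  # default ->4)
        'anomeric_carbon': 2 if code in KETOSES else 1,         # -(2-> for ketoses
        'anomer': anomer,                                       # default alpha
        'enantiomer': match.group('enantiomer') or 'D',         # default D
        'code': code,
        'ring': 'furanose' if match.group('ring') == 'f' else 'pyranose',
        'modification': match.group('mods'),
    }

for token in ('Glcp', '->6)-Glcp', 'b-D-Galp', 'a-L-Fucp'):
    print(token, parse_residue(token))
```

Printing the parsed fields for `'Glcp'` shows every default at once, which is a quick way to check what a terse sequence string is expected to mean.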
Note that the standard is to write the IUPAC sequence of a saccharide chain in reverse order from how the residues are numbered. Let's create three new oligosaccharides from sequence.
maltotriose = pose_from_saccharide_sequence('a-D-Glcp-' * 3)
lactose = pose_from_saccharide_sequence('b-D-Galp-(1->4)-a-D-Glcp')
isomaltose = pose_from_saccharide_sequence('->6)-Glcp-' * 2)
When you print a `Pose` containing carbohydrate residues, the sugar residues will be listed as `Z` in the sequence.
print("maltotriose\n", maltotriose)
print("\nisomaltose\n", isomaltose)
print("\nlactose\n", lactose)
However, you can have Rosetta print out the sequences for individual chains, using the `chain_sequence()` method. If you do this, Rosetta is smart enough to give you a distinct sequence format for saccharide chains. (You may have noticed that the default file name for a `.pdb` file created from this `Pose` will be the same sequence.)
print(maltotriose.chain_sequence(1))
print(isomaltose.chain_sequence(1))
print(lactose.chain_sequence(1))
Again, the standard is to show the sequence of a saccharide chain in reverse order from how they are numbered.
This is also how phi, psi, and omega are defined: from i+1 to i.
for res in lactose.residues: print(res.seqpos(), res.name())
Notice that for polysaccharides, the upstream residue is called the reducing end, while the downstream residue is called the non-reducing end.
You will also see the terms parent and child being used across Rosetta. Here, for Residue 2, residue 1 is the parent. For Residue 1, Residue 2 is the child. Due to branching, residues can have more than one child/non-reducing-end, but only a single parent residue.
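The reverse numbering and the parent/child relationship can be sketched for a linear chain in plain Python. The function `number_linear_chain` is hypothetical and assumes every linkage is written in full `-(1->4)-` notation with no branches.

```python
import re

def number_linear_chain(iupac_sequence):
    """Number residues from the reducing end (right-most token) upward."""
    # Split on fully written linkages such as -(1->4)-.
    tokens = re.split(r'-\(\d->\d\)-', iupac_sequence)
    chain = {}
    for seqpos, name in enumerate(reversed(tokens), start=1):
        chain[seqpos] = {
            'name': name,
            'parent': seqpos - 1 if seqpos > 1 else None,
            'child': seqpos + 1 if seqpos < len(tokens) else None,
        }
    return chain

lactose_numbering = number_linear_chain('b-D-Galp-(1->4)-a-D-Glcp')
for seqpos, info in lactose_numbering.items():
    print(seqpos, info)
```

For lactose, the glucose written last in the sequence becomes residue 1 (the reducing end), and the galactose becomes residue 2 with residue 1 as its parent.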
Rosetta stores carbohydrate-specific information within `ResidueType`. If you print a residue, this additional information will be displayed.
print(glucose.residue(1))
Most biopolymers have predefined, named torsion angles for their main-chain and side-chain bonds, such as φ, ψ, and ω and the various χs for amino acid residues. The same is true for saccharide residues. The torsion angles of sugars are as follows:
Take special note of how φ, ψ, and ω are defined in the reverse order as the angles of the same names for amino acid residues!
The `chi()` method of `Pose` works with sugar residues in the same way that it works with amino acid residues, where the first argument is the χ subscript and the second is the residue number of the `Pose`.
galactose.chi(1, 1)
galactose.chi(2, 1)
galactose.chi(3, 1)
galactose.chi(4, 1)
galactose.chi(5, 1)
galactose.chi(6, 1)
Likewise, we can use `set_chi()` to change these torsion angles and observe the changes in PyMOL, setting the option to keep history to true.
from pyrosetta.rosetta.protocols.moves import AddPyMOLObserver
observer = AddPyMOLObserver(galactose, True)
pm.apply(galactose)
galactose.set_chi(1, 1, 180)
YOUR-CODE-HERE
The `phi()`, `set_phi()`, `psi()`, `set_psi()`, `omega()`, and `set_omega()` methods of `Pose` also work with sugars. However, since `pose_from_saccharide_sequence()` may create a `Pose` with angles that cause the residues to wrap around onto each other, let's instead reload some poses from `.pdb` files.
maltotriose = pose_from_file('inputs/glycans/maltotriose.pdb')
isomaltose = pose_from_file('inputs/glycans/isomaltose.pdb')
pm.apply(maltotriose)
maltotriose.phi(1)
maltotriose.psi(1)
maltotriose.phi(2)
maltotriose.psi(2)
maltotriose.omega(2)
maltotriose.phi(3)
maltotriose.psi(3)
Notice how φ1 and ψ1 are undefined: the first residue is not connected to anything.
observer = AddPyMOLObserver(maltotriose, True)
for i in (2, 3):
    maltotriose.set_phi(i, 180)
    maltotriose.set_psi(i, 180)
Isomaltose is composed of (1→6) linkages, so in this case omega torsions are defined. Get and set φ2, ψ2, ω2 for isomaltose.
observer = AddPyMOLObserver(isomaltose, True)
YOUR-CODE-HERE
Any cyclic residue also stores its ν angles.
pm.apply(glucose)
Glc1 = glucose.residue(1)
for i in range(1, 6): print(Glc1.nu(i))
However, we generally care more about the ring conformation of a cyclic residue’s rings, in this case, its only ring with index of 1. (The output values here are the ideal angles, not the actual angles, which we viewed above.)
print(Glc1.ring_conformer(1))
The output above warrants a brief explanation. First, what does `4C1` mean? Most of us likely remember learning about chair and boat conformations in Organic Chemistry. Do you recall how there are two distinct chair conformations that can interconvert between each other? The names for these specific conformations are 4C1 and 1C4. The nomenclature is as follows: Superscripts to the left of the capital letter are above the plane of the ring if it is oriented such that its carbon atoms proceed in a clockwise direction when viewed from above. Subscripts to the right of the letter are below the plane of the ring. The letter itself is an abbreviation, where, for example, C indicates a chair conformation and B a boat conformation. In all, there are 38 different ideal ring conformations that any six-membered cycle can take.
`C-P parameters` refers to the Cremer–Pople parameters for this conformation (Cremer D, Pople JA. J Am Chem Soc. 1975;97:1354–1358.). C–P parameters are an alternative coordinate system used to refer to a ring conformation.
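To make the Cremer–Pople coordinates less abstract, here is a sketch that computes them for a six-membered ring from Cartesian coordinates, following the definitions in the 1975 paper. It is plain Python and not Rosetta's implementation; the function name `cremer_pople_6` and the idealized chair coordinates are assumptions for illustration.

```python
import math

def cremer_pople_6(coords):
    """Return (Q, theta, phi), angles in degrees, for six ring atoms in ring order."""
    n = 6
    # Shift the ring so that its geometric center is the origin.
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    r = [[c[k] - center[k] for k in range(3)] for c in coords]
    # Mean-plane normal from the two Cremer-Pople reference vectors.
    r1 = [sum(r[j][k] * math.sin(2 * math.pi * j / n) for j in range(n)) for k in range(3)]
    r2 = [sum(r[j][k] * math.cos(2 * math.pi * j / n) for j in range(n)) for k in range(3)]
    normal = [r1[1] * r2[2] - r1[2] * r2[1],
              r1[2] * r2[0] - r1[0] * r2[2],
              r1[0] * r2[1] - r1[1] * r2[0]]
    length = math.sqrt(sum(x * x for x in normal))
    normal = [x / length for x in normal]
    # Out-of-plane displacement of each ring atom.
    z = [sum(r[j][k] * normal[k] for k in range(3)) for j in range(n)]
    # Puckering amplitudes for a six-membered ring.
    q2_cos = math.sqrt(2 / n) * sum(z[j] * math.cos(4 * math.pi * j / n) for j in range(n))
    q2_sin = -math.sqrt(2 / n) * sum(z[j] * math.sin(4 * math.pi * j / n) for j in range(n))
    q3 = math.sqrt(1 / n) * sum((-1) ** j * z[j] for j in range(n))
    q2 = math.hypot(q2_cos, q2_sin)
    total_q = math.hypot(q2, q3)
    theta = math.degrees(math.atan2(q2, q3))
    # phi is not meaningful when q2 is (numerically) zero, as in a perfect chair.
    phi = math.degrees(math.atan2(q2_sin, q2_cos)) % 360
    return total_q, theta, phi

# Idealized chair: atoms alternate above and below the mean plane (made-up geometry).
chair = [(math.cos(math.radians(60 * j)), math.sin(math.radians(60 * j)), 0.25 * (-1) ** j)
         for j in range(6)]
print(cremer_pople_6(chair))
```

In these coordinates the two chairs sit at the poles (θ of 0° or 180°), while boats and skew-boats lie on the equator (θ of 90°). Which pole corresponds to 4C1 depends on the order in which the ring atoms are listed.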
Finally, a `RingConformer` in Rosetta includes the values of the ν angles. Each conformer has a unique set of angles.
`Pose::set_nu()` does not exist, because it would rip a ring apart. Instead, to change a ring conformation, we need to use the `set_ring_conformation()` method, which takes a `RingConformer` object. Most of the time, you will not need to adjust the ring conformers, but you should be aware of the option.
We can ask a cyclic `ResidueType` for one of its `RingConformerSet`s to give us the `RingConformer` we want. (Each `RingConformerSet` includes the list of possible idealized ring conformers that such a ring can attain as well as information about the most energetically favorable one.) Then, we can set the conformation for our residue through `Pose`. (The arguments for `set_ring_conformation()` are the `Pose`’s sequence position, ring number, and the new conformer, respectively.)
ring_set = Glc1.type().ring_conformer_set(1)
conformer = ring_set.get_ideal_conformer_by_name('1C4')
glucose.set_ring_conformation(1, 1, conformer)
pm.apply(glucose)
Modified sugars can also be created in Rosetta, either from sequence or from file. In the former case, simply use the proper abbreviation for the modification after the “ring form code”. For example, the abbreviation for an N-acetyl group is “NAc”. Note the N-acetyl group in the PyMOL window.
LacNAc = pose_from_saccharide_sequence('b-D-Galp-(1->4)-a-D-GlcpNAc')
pm.apply(LacNAc)
Rosetta can handle branched oligosaccharides as well, but when loading from a sequence, this requires the use of brackets, which is the standard IUPAC notation. For example, here is how one would load Lewis x (Lex), a common branched glyco-epitope, into Rosetta by sequence.
Lex = pose_from_saccharide_sequence('b-D-Galp-(1->4)-[a-L-Fucp-(1->3)]-D-GlcpNAc')
pm.apply(Lex)
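To see what the bracket notation encodes, the sketch below pulls the reducing-end residue and its direct children out of a sequence such as the one for Lex. It is plain Python and not how Rosetta parses sequences; the function `reducing_end_children` is hypothetical and handles only one level of brackets, with every linkage written in full.

```python
import re

BRANCH = re.compile(r'\[([^\[\]]+)\]-')    # one bracketed branch, e.g. [a-L-Fucp-(1->3)]-
LINKAGE = re.compile(r'-\(\d->(\d)\)-?')   # captures the parent position, e.g. 4 in -(1->4)-

def reducing_end_children(iupac_sequence):
    """Return the reducing-end residue and {attachment position: child residue}."""
    branches = BRANCH.findall(iupac_sequence)
    main_chain = BRANCH.sub('', iupac_sequence)
    # Splitting with a capture group interleaves residues and positions.
    parts = LINKAGE.split(main_chain)
    parent = parts[-1]
    children = {}
    if len(parts) >= 3:
        children[int(parts[-2])] = parts[-3]
    for branch in branches:
        branch_parts = LINKAGE.split(branch)  # last item is '' since the linkage ends the branch
        children[int(branch_parts[-2])] = branch_parts[-3]
    return parent, children

parent, children = reducing_end_children('b-D-Galp-(1->4)-[a-L-Fucp-(1->3)]-D-GlcpNAc')
print(parent)
print(children)
```

Both the galactose and the fucose attach to the same GlcNAc, at positions 4 and 3 respectively, which is what makes that residue a branch point.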
One can also load branched carbohydrates from a `.pdb` file. These `.pdb` files must include `LINK` records, which are a standard part of the PDB format. Open the `test/data/carbohydrates/Lex.pdb` file and look near the top to see an example `LINK` record, which looks like this:
LINK O3 Glc A 1 C1 Fuc B 1 1555 1555 1.5
It tells us that there is a covalent linkage between O3 of glucose A1 and C1 of fucose B1 with a bond length of 1.5 Å. (The `1555`s indicate symmetry and are ignored by Rosetta.)
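The fields of the record above can be pulled apart with a few lines of plain Python. This is a sketch only: it splits on whitespace, which fits the line as shown here, whereas real PDB files are fixed-column, and the function name `parse_link_record` and the field names are assumptions.

```python
def parse_link_record(line):
    """Parse a whitespace-separated LINK record into a dictionary."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != 'LINK':
        raise ValueError('expected a LINK record with 12 whitespace-separated fields')
    return {
        'atom1': fields[1], 'residue1': fields[2], 'chain1': fields[3], 'seqpos1': int(fields[4]),
        'atom2': fields[5], 'residue2': fields[6], 'chain2': fields[7], 'seqpos2': int(fields[8]),
        'symmetry': (fields[9], fields[10]),  # ignored by Rosetta, per the text above
        'length': float(fields[11]),          # bond length in angstroms
    }

record = parse_link_record('LINK O3 Glc A 1 C1 Fuc B 1 1555 1555 1.5')
print(record['atom1'], record['residue1'], '->', record['atom2'], record['residue2'])
```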
Note that if the LINK records are not in order, or HETNAM records are not in a Rosetta format, we will fail to load. In the next tutorial we will use auto-detection to do this. For now, we know Lex.pdb will load OK.
Lex = pose_from_file('inputs/glycans/Lex.pdb')
pm.apply(Lex)
You may notice when viewing the structure in PyMOL that the hybridization of the carbonyl of the amido functionality of the N-acetyl group is wrong. This is because of an error in the model deposited in the PDB from which this file was generated. This is, unfortunately, a very common problem with sugar structures found in the PDB. It is always useful to use http://www.glycosciences.de to identify any errors in a solved PDB structure before working with it in Rosetta. The referenced paper, Automatically Fixing Errors in Glycoprotein Structures with Rosetta, can be used as a guide to fixing these.
You may also have noticed that the `inputs/glycans/Lex.pdb` file indicated in its `HETNAM` records that Glc1 was actually an N-acetylglucosamine (GlcNAc) with the indication `2-acetylamino-2-deoxy-`. This is optional and is helpful for human-readability, but Rosetta only needs to know the base `ResidueType` of each sugar residue; the specific `VariantType`s needed (most sugar modifications are treated as `VariantType`s) are determined automatically from the atom names in the `HETATM` records for the residue. Anything after the comma is ignored.
Print the `Pose` to see how the `FoldTree` is defined.
YOUR-CODE-HERE
Note the `CHEMICAL` `Edge` (`-2`). This is Rosetta’s way of indicating a branch backbone connection. Unlike a standard `POLYMER` `Edge` (`-1`), this one tells you which atoms are involved.
Can you see now why φ and ψ are defined the way they are? If they were defined as in AA residues, they would not have unique definitions, since GlcNAc is a branch point. A monosaccharide can have multiple children, but it can never have more than a single parent.
Note that for this oligosaccharide χ3(1) is equivalent to ψ(3) and χ4(1) is equivalent to ψ(2). Make sure that you understand why!
Lex.chi(3, 1), Lex.psi(3)
Lex.chi(4, 1), Lex.psi(2)
For chemically modified sugars, χ angles are redefined at the positions where substitution has occurred. For new χs that have come into existence from the addition of new atoms and bonds, new definitions are added to new indices. For example, for GlcN2Ac residue 1, χC2–N2–C′–Cα′ is accessed through `chi(7, 1)`.
Lex.chi(2, 1)
Lex.set_chi(2, 1, 180)
pm.apply(Lex)
Lex.chi(7, 1)
Lex.set_chi(7, 1, 0)
pm.apply(Lex)
Branching does not have to occur at sugars; a glycan can be attached to the nitrogen of an ASN or the oxygen of a SER or THR. N-linked glycans themselves tend to be branched structures.
We will cover more on linked glycan trees in the next tutorial through the `GlycanTreeSet` object, which is always present in a pose that has carbohydrates.
N_linked = pose_from_file('inputs/glycans/N-linked_14-mer_glycan.pdb')
pm.apply(N_linked)
print(N_linked)
for i in range(4): print(N_linked.chain_sequence(i + 1))
O_linked = pose_from_file('inputs/glycans/O_glycan.pdb')
pm.apply(O_linked)
`set_phi()` and `set_psi()` still work when a glycan is linked to a peptide. (Below, we use `pdb_info()` to help us select the residue that we want. In this case, in the `.pdb` file, the glycan is chain B.)
N_linked.set_phi(N_linked.pdb_info().pdb2pose("B", 1), 180)
pm.apply(N_linked)
Notice that in this case ψ and ω affect the side-chain torsions (χs) of the asparagine residue. This is another case where there are multiple ways of both naming and accessing the same specific torsion angles.
One can also create conjugated glycans from sequences if performed in steps, first creating the peptide portion by loading from a `.pdb` file or from sequence and then using the `glycosylate_pose()` function (which needs to be imported first). For example, to glycosylate an ASA peptide with a single glucose at position 2 of the peptide, we perform the following:
Here, we will glycosylate a simple peptide using the function `glycosylate_pose()`. In the next tutorial, we will use a Mover interface to this function.
peptide = pose_from_sequence('ASA')
pm.apply(peptide)
from pyrosetta.rosetta.core.pose.carbohydrates import glycosylate_pose, glycosylate_pose_by_file
glycosylate_pose(peptide, 2, 'Glcp')
pm.apply(peptide)
Here, we used the main function to glycosylate a pose. In the next tutorial, we will use a Mover interface to do so.
It is also possible to glycosylate a pose with common glycans found in the database. These files end in the `.iupac` extension and are simply IUPAC sequences just as we have been using throughout this chapter.
Here is a list of some common iupacs.
bisected_fucosylated_N-glycan_core.iupac
bisected_N-glycan_core.iupac
common_names.txt
core_1_O-glycan.iupac
core_2_O-glycan.iupac
core_3_O-glycan.iupac
core_4_O-glycan.iupac
core_5_O-glycan.iupac
core_6_O-glycan.iupac
core_7_O-glycan.iupac
core_8_O-glycan.iupac
fucosylated_N-glycan_core.iupac
high-mannose_N-glycan_core.iupac
hybrid_bisected_fucosylated_N-glycan_core.iupac
hybrid_bisected_N-glycan_core.iupac
hybrid_fucosylated_N-glycan_core.iupac
hybrid_N-glycan_core.iupac
man5.iupac
man9.iupac
N-glycan_core.iupac
peptide = pose_from_sequence('ASA'); pm.apply(peptide)
glycosylate_pose_by_file(peptide, 2, 'core_5_O-glycan')
pm.apply(peptide)
You now have a grasp of the basics of RosettaCarbohydrates. Please continue on to the next tutorial for more on glycan residue selection and various movers that can be of use when working with glycans.
Chapter contributors: