Keywords: classify_base_pairs, RNA torsions, RNA score terms, RNA motifs, mutate_position, RNA thread, RNA minimize, RNA_HelixAssembler, RNA fragment assembly, FARFAR protocol, rna_denovo
R. Das et al., "Atomic accuracy in predicting and designing noncanonical RNA structure," Nature Methods 7:4, 291-294 (2010).
A. Watkins et al., "Blind prediction of noncanonical RNA structure at atomic accuracy," Science Advances 4:5 (2018).
In this lab, we will explore common tasks and approaches for working with RNA using Rosetta. We will be focusing on a simple system that includes a helix capped by a tetraloop for this exercise.
!pip install pyrosettacolabsetup
import pyrosettacolabsetup; pyrosettacolabsetup.install_pyrosetta()
import pyrosetta; pyrosetta.init()
from pyrosetta import *
init()
from pyrosetta.rosetta import *
from pyrosetta.rosetta.core.pose.rna import *
from pyrosetta.rosetta.core.pose import *
Let's load in this structure with PyRosetta (make sure that you have the PDB file located in your current directory):
cd google_drive/MyDrive/student-notebooks/ pose = pose_from_pdb("inputs/stem_loop.pdb")
YOUR-CODE-HERE
Let's explore the structure in this PDB file. First, use pose.sequence()
to look at the sequence:
# print out the sequence of the pose
YOUR-CODE-HERE
We can see that the pose seems to contain RNA residues. To check this, let's go through the pose residue by residue, checking if each one is RNA.
for ii in range(pose.size()):
print(pose.residue_type(5).is_RNA())
RNA bases interact with each other via base pairing, either through the Watson-Crick base pairs that make up standard A-form helices or through non-canonical base pairing interactions. We can use the classify_base_pairs
function (this lives in core:pose:rna
which was loaded above) to find and classify all the base pairs in the current pose. Let's take a look.
base_pairs = classify_base_pairs(pose)
for base_pair in base_pairs:
print(base_pair)
We can see that the RNA molecule consists of Watson-Crick base pairs between residues 1-5 and residues 10-14 forming a standard RNA helix. We can also see that residues 6 and 9 form a non-canonical base pair interaction between the sugar and Hoogsteen edges of the respective bases. We can think of this structure as a simple stem-loop, with an idealized A-form helix between residues 1-5 and residues 10-14, and with a tetraloop connecting these chains.
Let's use some of Rosetta's tools for measuring distances and torsions to understand the typical geometry of an idealized A-form helix.
What is the distance between the phosphate atoms of two consecutive residues in one strand of a helix? Check this for a couple pairs of residues.
P1_xyz = pose.residue(1).xyz("P")
P2_xyz = pose.residue(2).xyz("P")
P3_xyz = pose.residue(3).xyz("P")
print((P1_xyz - P2_xyz).norm())
print((P2_xyz - P3_xyz).norm())
RNA nucleotides are quite large compared to amino acids, with many more torsion angles. In the diagram of a nucleotide below, we can see the backbone torsions applicable to RNA: $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$, $\zeta$, and $\chi$.
from IPython.display import Image
Image('./Media/nucleotide_torsions.png',width='500')
We can access the values of these torsions through the pose object. Just like protein torsions can be accessed with functions like pose.phi(resid)
, RNA torsions can be accessed with analogous functions like pose.alpha(resid)
.
Exercise: Below, make a function that prints out all the torsions for a given residue. Then, using that function, check the torsions for three different residues in the RNA helix. How similar are torsion angles for different residues in an idealized helix?
YOUR-CODE-HERE
YOUR-CODE-HERE
Rosetta's energy functions provide a mechanism to score RNA structures, rewarding realistic conformations using a variety of score terms. In this section, we will see how to score RNA poses, and we will use these score terms to better understand our structure.
To score structures with RNA in Rosetta, it is best to use a high-resolution energy function designed to work with RNA, for instance stepwise/rna/rna_res_level_energy4.wts
. In fact, the standard high resolution energy function used in Rosetta does not include the score terms that are quite helpful for modeling RNA. To see this, we will evaluate our RNA pose with the ref2015
score function and stepwise/rna/rna_res_level_energy4.wts
, comparing the resulting score term values.
hires_sf = core.scoring.ScoreFunctionFactory.create_score_function("ref2015");
hires_sf.show(pose)
Note that ref2015
does contain some terms that are used for RNA modeling like VDW and hydrogen bonding score terms. What extra terms are included in the RNA high resolution score function?
rna_hires_sf = core.scoring.ScoreFunctionFactory.create_score_function("stepwise/rna/rna_res_level_energy4.wts");
rna_hires_sf.show(pose)
We can see that some new score terms in the high resolution RNA potential, including rna_torsion
, suiteness_bonus
, rna_sugar_close
, and fa_stack
. We will explore a few of these terms below. To learn about these and other score terms that have been included to more realistically model RNA, check out the papers referenced at the beginning of this notebook.
Analogous to the protein low-resolution potential, an RNA low-resolution potential has been developed to more quickly score RNA structures represented in centroid mode. Lets take a look at the score terms involved.
rna_lowres_sf = core.scoring.ScoreFunctionFactory.create_score_function("rna/denovo/rna_lores_with_rnp_aug.wts");
rna_lowres_sf.show(pose)
We can see that when modeling an RNA molecule using centroid positions for nucleotides, we need to separately include terms for base pairing (rna_base_pair
), base-backbone interactions (rna_base_backbone
), and so on.
Returning to the high resolution RNA score function, let us see if we can decompose the energies further to understand which parts of the structure contribute positively and negatively to its energy. First, we can decompose the energies per residue like below.
rna_hires_sf(pose)
nonzero_scores = pose.energies().residue_total_energies(4).show_nonzero()
print(nonzero_scores)
A lot of the RNA specific energy terms make more sense when we look at pairs of residues. The energy graph object allows you to explore pairwise energies. The function below uses the energy graph to print out all non-zero scores between residues for a particular score term.
def print_nonzero_pairwise_energies(pose, energy_term, sf):
sf(pose)
energy_graph = pose.energies().energy_graph()
for ii in range(1, pose.size() + 1):
for jj in range(ii + 1, pose.size() + 1):
edge = energy_graph.find_energy_edge(ii, jj)
if (edge != None):
emap = edge.fill_energy_map()
resid1 = str(ii) + " " + pose.residue(ii).name1()
resid2 = str(jj) + " " + pose.residue(jj).name1()
resid_pair = resid1 + " " + resid2
score = emap[ energy_term ]
if score != 0:
print("%s: %f" % (resid_pair, score))
Using the function above, we're going to look at the stacking energies in the high resolution potential.
print_nonzero_pairwise_energies(pose, core.scoring.ScoreType.fa_stack, rna_hires_sf)
We can see that the stacking energies are highest for consecutive residues. In the idealized helix, the best stacking energy bonuses are given to stacked purine residues.
Now lets take a look at the torsion energies. Which energies are the highest? Where are these torsions in the structure?
print_nonzero_pairwise_energies(pose, core.scoring.ScoreType.rna_torsion, rna_hires_sf)
RNA structures are often viewed as being composed of small building blocks called RNA motifs. These motifs can be as simple as stacks of base pairs, which we have seen above. Typical motifs also include stereotyped loops, junctions, and tertiary contacts present across many common RNA molecules. Let's take a look to see whether any of these common RNA motifs are present in our simple stem loop structure.
lowres_potential = core.scoring.rna.RNA_LowResolutionPotential( "scoring/rna/rna_base_pair_xy.dat" )
rna_scoring_info = core.scoring.rna.rna_scoring_info_from_pose(pose).rna_filtered_base_base_info()
rna_motifs = core.scoring.rna.get_rna_motifs( pose, lowres_potential, rna_scoring_info)
print(rna_motifs)
We can see that our RNA structure includes many stacked Watson-Crick base pair, making the idealized A-form helix. In addition, the loop connecting the strands of the helix in our structure is a stereotyped "GNRA" tetraloop, taking a loop conformation that is common across many RNA structures in the PDB.
Rosetta allows you to not just explore a given PDB structure, but to manipulate and design structures. In this section, we discuss some basic ways to manipulate RNA structures, and we observe the effects of these manipulations on the structure's energy. For each manipulation, we will make a new copy of the pose to make sure that our changes do not affect the original structure we loaded in.
One basic manipulation we can make to an RNA structure is to change torsion angles for individual residues. Let's try this out on a residue in the A-form helix, and observe the effect on the rna_torsion score. Did the change we made make the score better or worse?
new_pose = Pose()
new_pose.assign(pose)
rna_hires_sf(pose)
torsion_score_before = pose.energies().total_energies()[core.scoring.ScoreType.rna_torsion]
new_pose.set_beta(2, 110)
rna_hires_sf(new_pose)
torsion_score_after = new_pose.energies().total_energies()[core.scoring.ScoreType.rna_torsion]
print("%s: %f" % ("Torsion score before", torsion_score_before))
print("%s: %f" % ("Torsion score after", torsion_score_after))
If you want to replace residues in an RNA molecule with their idealized versions, you can use the RNA_IdealCoord class in Rosetta. Below is an example for using that method to first replace a single residue with its idealized version, and then to replace all residues with their idealized versions across the whole pose.
ideal_pose_one = Pose()
ideal_pose_one.assign(pose)
resid = 2
core.pose.rna.RNA_IdealCoord().apply(ideal_pose_one, resid, core.chemical.rna.PuckerState.ANY_PUCKER, False)
ideal_pose = Pose()
ideal_pose.assign(pose)
core.pose.rna.RNA_IdealCoord().apply(ideal_pose, False)
Exercise: Figure out if the total energy of the pose went up or down after replacing one or all of the residues with their idealized versions. What can explain the difference? What about the total torsion energy only - does that go up or down in the pose with idealized residues compared to the original pose?
YOUR-CODE-HERE
YOUR-CODE-HERE
Another common manipulation for an RNA structure is to mutate the nucleotides to different bases. This is a manipulation that is commonly used while modeling one RNA structure using coordinates from another homologous (but not identical) structure. Below we can see how to mutate one residue of our RNA structure to another one.
mutated_pose = Pose()
mutated_pose.assign(pose)
print(pose.sequence())
rosetta.core.pose.rna.mutate_position(mutated_pose, 1, 'a')
print(mutated_pose.sequence())
Exercise: Make a function that mimics the 'rna_thread' Rosetta application, which takes in a pose and a new sequence and replaces all pose residues with the new sequence's residues. Remember to check that the pose's sequence and the new sequence have the same length.
The pose's current sequence is cauccgaaaggaug
. Use the function you wrote to make a version that has sequence cauccuucgggaug
and one that has sequence aaaaagaaauuuuu
.
YOUR-CODE-HERE
YOUR-CODE-HERE
Exercise: The RNA high resolution potential includes hydrogen bonding terms. CG base pairs have more hydrogen bonds than AU base pairs. Compare the original pose with the pose that has all AU base pairs. What happens to the hydrogen bonding energy in the high resolution potential?
YOUR-CODE-HERE
Exercise: The stacking energies of the GAAA and UUCG tetraloops differ from each other. Which tetraloop provides the most favorable stacking energies overall? Can you figure out which pairs of residues have different stacking energies when the structure has changed (hint: you can base your code here off of the function print_nonzero_pairwise_energies
above)?
YOUR-CODE-HERE
YOUR-CODE-HERE
Many of the same strategies are used when modeling RNA as when modeling proteins. Below, we shall explore some of these procedures specifically applied to RNA molecules to appreciate how they may come together to give a modern structure prediction method.
On a not wholly unrelated tangent, let us first see how we can quickly generate poses of ideal A-form RNA. You can think of this procedure as analogous to the pose_from_seq
function used to generate protein poses from primary sequences. Let's use it now to generate a single-strand RNA pose with A-form torsions and the same sequence as the hairpin we've been examining so far.
assembler = core.import_pose.RNA_HelixAssembler()
assembled_pose = assembler.build_init_pose(pose.sequence(), '')
YOUR-CODE-HERE
Let's get a PyMOLMover
up and running so we can examine our new pose.
pmm = PyMOLMover()
pmm.set_PyMOL_model_name('assembled_pose')
pmm.apply(assembled_pose)
You can also use RNA_HelixAssmebler
to generate poses that comprise two strands that form an ideal A-form helical stack, like residues 1-5 and 10-14 in the hairpin from above.
pmm_helix = PyMOLMover()
pmm_helix.set_PyMOL_model_name('helix_pose')
helix_pose = assembler.build_init_pose('ggg','ccc')
Looking in PyMOL, you may be able to appreciate that, true to its name, the RNA_HelixAssembler
has generated a pose that looks quite helical.
pmm_helix.apply(helix_pose)
Exercise: Examine the torsions in several of the residues of assembled_pose
using the print_torsions
function you wrote earlier. How do they compare to the torsions from the starting stem loop?
Given a library of RNA torsions excised from a published structure, fragment assembly methods will choose an n-mer in the current structure and replace the backbone geometry with the geometry of a corresponding n-mer from the library. Those of you familiar with protein structure prediction methods will recognize this strategy of fragment assembly.
We will implement a rudimentary version of this protocol for RNA below.
For the present exercise, we will use the torsions file inputs/1jj2.torsions
, which comes from the crystal structure of a large ribosomal subunit. This library will be used to initialize a Mover
specifically designed to perform fragment assembly on RNA molecules, RNA_FragmentMover
.
fragset = core.import_pose.libraries.RNA_LibraryManager.get_instance().rna_fragment_library("inputs/1jj2.torsions")
atom_level_domain_map = core.pose.toolbox.AtomLevelDomainMap(assembled_pose)
frag_mover = protocols.rna.denovo.movers.RNA_FragmentMover(fragset, atom_level_domain_map, 1, 0)
Don't worry too much about the other options that RNA_FragmentMover
requires at this point, but remember to include them if using this mover outside of this notebook.
YOUR-CODE-HERE
Let's practice applying this mover to a Pose
. To actually make a fragment assembly move, you can call the random_fragment_insertion
method which requires two arguments:
Pose
There is also an apply()
method that can be called in a similar manner, but it simply calls random_fragment_insertion()
, so the recommendation is to decrease overhead by calling random_fragment_insertion()
where possible.
Let's pratice calling this method below.
practice_pose = Pose()
practice_pose.assign(assembled_pose)
frag_mover.random_fragment_insertion(practice_pose, 3)
pmm.apply(practice_pose)
YOUR-CODE-HERE
Now that we know how to set up a fragment assembly mover in PyRosetta, try the excise below to write a quick folding routine that uses a fragment assembly strategy to try and fold the hairpin sequence.
Exercise: Fill in the function below such that fragment_assembly
Pose
, RNA_FragmentMover
, fragment size to substitute (frag_size
), and number of trials (n_trials
).rna_lowres_sf
energy function from earlier but allows the user to specify a different energy function, if desired* See section 4.1 of these notebooks for a review on Monte Carlo algorithms, if desired.
Then, apply it to our newly assembled pose using the following recipe:
fragment_assembly
using 3 nucleotide fragments for 400 trials.fragment_assembly
using 2 nucleotide fragments for 300 trials.fragment_assembly
using 1 nucleotide fragments for 300 trials.import math
import random
def fragment_assembly(start_pose, frag_mover, frag_size, n_trials, sf=rna_lowres_sf):
curr_pose = Pose()
curr_pose.assign(start_pose)
trial_pose = Pose()
trial_pose.assign(curr_pose)
opt_pose = Pose()
opt_pose.assign(curr_pose)
currE = newE = optE = sf(curr_pose)
YOUR-CODE-HERE
#return curr_pose
frag_pose = fragment_assembly(assembled_pose, frag_mover, 3, 400)
frag_pose = fragment_assembly(frag_pose, frag_mover, 2, 300)
frag_pose = fragment_assembly(frag_pose, frag_mover, 1, 300)
Examine the fragment assembled Pose
in PyMOL. Do you recognize any of the motifs from before?
frag_pmm = PyMOLMover()
frag_pmm.set_PyMOL_model_name('frag_pose')
frag_pmm.apply(frag_pose)
YOUR-CODE-HERE
In principle, the standard MinMover
that has been introduced previously in the context of minimizing purely protein structures can also be used to minimize poses with RNA (as long as the assigned energy function has score terms relevant to RNA and an appropriate MoveMap
is provided).
However, as part of the rna_denovo
protocol, Das and coworkers have developed a subroutine, RNA_Minimize
, that is specifically geared toward handling minimization of poses with RNA, the use of which is detailed below.
We can access the RNA_Minimize
mover from the protocols
namespace. The relevant options object, RNA_MinimizerOptions
, lives in the import_pose.options
namespace. We will set the maximum number of iterations to 1000, using default values for other options.
rna_min_options = core.import_pose.options.RNA_MinimizerOptions()
rna_min_options.set_max_iter(1000)
rna_minmizer = protocols.rna.denovo.movers.RNA_Minimizer(rna_min_options)
YOUR-CODE-HERE
Unlike in the case of using MinMover
, things like an appropriate energy function and MoveMap
are generated by default by the RNA_Minimizer
object. By default, RNA_Minimizer
uses the same high-resolution energy function as above (stepwise/rna/rna_res_level_energy4.wts
).
All that remains is to apply it to the relevant pose.
rna_minimizer.apply(frag_pose)
YOUR-CODE-HERE
Let's see what changes minimization has wrought on our structure:
min_pmm = PyMOLMover()
min_pmm.set_PyMOL_model_name('min_pose')
min_pmm.apply(frag_pose)
YOUR-CODE-HERE
Exercise: Using the functions described in the first part of the notebook, report on the following with respect to our de novo folded sequence:
YOUR-CODE-HERE
Examine our final folded structure and the hairpin from the first part of the tutorial and think about the following questions:
Write a function analogous to the fragment_assembly
function above that
Pose
RNA_Minimizer
Using this new suboutine, craft your own farfar
( Fragment Assembly of RNA with Full Atom Refinement) routine that performs multiple rounds of fragment assembly in a low-resolution potential followed by minimization in a high-resolution energy function.
Try playing around with the various parameters and see how well you can recover the hairpin starting from just the sequence.
Below we will be running a short RNA de novo modeling run for the stem-loop sequence we have been working with thus far, making use of the FARFAR protocol (which you can run with the rna_denovo
command in Rosetta). As discussed above, the FARFAR protocol involves a mixture of fragment assembly moves and full atom minimization moves. We will generate a small set of structures using FARFAR and compare the energy of the resulting structures to those constructed in the previous exercise.
FARFAR builds models for a structure as specified in a FASTA file, making use of any structures of known sub-pieces (for instance, A-form helices for regions known to form stems). In this case, we will provide the helical portion of our structure between residues 1-5 and residues 10-14 as an input to the FARFAR protocol, so that the protocol only has to worry about sampling the loop. In a real modeling scenario, it is often the case that information about the secondary structure of the RNA is known, allowing us to make use of A-form helix rigid bodies to accelerate modeling.
Let's set up the fasta file and input PDB files to use as options for FARFAR; these files should be in the inputs/
folder.
input_pdbs = rosetta.utility.vector1_std_string()
fasta_files = rosetta.utility.vector1_std_string()
input_pdbs.append("inputs/stem.pdb")
fasta_files = rosetta.utility.vector1_std_string()
fasta_files.append("inputs/stem_loop.fasta")
We will set up options for rna_denovo
below, specifying the FASTA file, the input PDBs, the number of structures we would like to generate, and the output file (silent file format).
rna_de_novo_setup = core.import_pose.RNA_DeNovoSetup()
rna_de_novo_setup.set_fasta_files(fasta_files)
rna_de_novo_setup.set_minimize_rna(True)
rna_de_novo_setup.set_input_pdbs(input_pdbs)
rna_de_novo_setup.initialize_from_command_line()
rna_de_novo_options = rna_de_novo_setup.options()
rna_de_novo_options.set_nstruct(10)
rna_de_novo_options.set_silent_file("outputs/stem_loop.out")
rna_de_novo_options.set_vall_torsions_file("./inputs/1jj2.torsions")
Now we will run FARFAR by generating an RNA_DeNovoProtocol
object and running it with apply
on a starting pose. This will take a few minutes to run, and will generate 10 structures to the silent file specified above. As the protocol is running, take a look at the output to understand how it works. Note that for each structure generated, the protocol goes through various rounds of fragment assembly with fragments of size 3, 2, and 1, and then runs the RNA minimizer. This is similar to the protocol you made above!
rna_de_novo_protocol = rosetta.protocols.rna.denovo.RNA_DeNovoProtocol(rna_de_novo_options, rna_de_novo_setup.rna_params())
rna_de_novo_pose = rna_de_novo_setup.pose()
rna_de_novo_protocol.apply(rna_de_novo_pose)
Now that we've generated RNA structures with the FARFAR protocol, let's look at the top scoring structures and compare to those that you generated earlier in this module.
Run the following to get the poses from the silent file that FARFAR wrote to.
poses = poses_from_silent("outputs/stem_loop.out")
Exercise: Get the best scoring pose generated from FARFAR by iterating through the poses above with a loop (for pose in poses
...). Make use of the rna_hires_sf that was generated earlier in this notebook to score these structures. Compare the best score with the the score of the frag_pose
you generated in the previous section. Which is better?
YOUR-CODE-HERE
YOUR-CODE-HERE
The FARFAR protocol in this section can generate structures with an improved Rosetta score in part because it uses an optimized number of fragment assembly moves for each structure, and in large part because we initialized this run with an idealized A-form helix for the stem portion of our structure. Let's take a look to see what the best pose we generated looks like using the PyMOLMover.
farfar_pmm = PyMOLMover()
farfar_pmm.set_PyMOL_model_name('farfar_pose')
farfar_pmm.apply(best_pose)
Exercise: What motifs and features can you see in the best pose for this sequence from our 10 FARFAR models? What features are missing? How might we recover those features?
Chapter contributors: