PyRNA stores and describes an RNA secondary structure using two different data structures:
The pandas Dataframe lists all the base pairs (a.k.a interactions) making the secondary structure. If a pandas Dataframe restricts the exploration of a secondary structure to the level of the base pairs, a pyrna.features.SecondaryStructure gives us access to high-level objects like:
In both data structures, the edges of the base pairs are described with three different characters:
* '(' or ')' for Watson-Crick edges,
* '[' or ']' for Hoogsteen edges,
* '{' or '}' for Sugar edges.
A base pair can also be in a cis ('c') or trans ('t') orientation.
import json #to have a better output for dict describing pyrna.features.RNA objects (interactions, helices, single-strands,..)
from pyrna.parsers import parse_bn
rna = RNA(name = 'my_rna', sequence = 'GGGGGACAACCCC')
bn = '(({(.....))))'
base_pairs = parse_bn(bn)
print base_pairs
orientation edge1 edge2 pos1 pos2 0 c ( ) 4 10 1 c { ) 3 11 2 c ( ) 2 12 3 c ( ) 1 13
Now we transform this list of base-pairs into a pyrna.features.SecondaryStructure object.
from pyrna.parsers import base_pairs_to_secondary_structure
ss = base_pairs_to_secondary_structure(rna, base_pairs)
Now we can browse over the helices. An helix contains a field to store non-canonical base pairs (meaning base pairs different from c() AU, c() GU or c() GC.
for helix in ss.helices:
print json.dumps(helix, indent = 2)
print "\nAny non canonical base pairs in this helix?\n"
for interaction in helix['interactions']:
print json.dumps(interaction, indent = 2)
{ "length": 4, "interactions": [ { "edge1": "{", "edge2": ")", "orientation": "c", "location": [ [ 3, 3 ], [ 11, 11 ] ] } ], "name": "H1", "location": [ [ 1, 4 ], [ 10, 13 ] ] } Any non canonical base pairs in this helix? { "edge1": "{", "edge2": ")", "orientation": "c", "location": [ [ 3, 3 ], [ 11, 11 ] ] }
or over the single-strands...
for single_strand in ss.single_strands:
print json.dumps(single_strand, indent = 2)
{ "name": "SS1", "location": [ 5, 9 ] }
Any high-level object from a pyrna.features.SecondaryStructure is described with a "location". A location is a N x 2 matrix where N is the number of blocks of contiguous residues, each block being described with its start and end positions.
An helix location will be a matrix like [ [1,4] , [10,13] ], meaning that the first strand is between residues 1 and 4, and the second strand is between residues 10 and 13.
A base pair location will be a matrix like [ [3,3] , [11,11] ], meaning a base pair between residues 3 and 11.
From a pyrna.features.SecondaryStructure object, we can easily go back to a list of base pairs stored in a pandas Dataframe.
from pyrna.parsers import secondary_structure_to_base_pairs
print secondary_structure_to_base_pairs(ss)
edge1 edge2 orientation pos1 pos2 0 ( ) c 1 13 1 ( ) c 2 12 2 { ) c 3 11 3 ( ) c 4 10
Now we can create a new secondary structure containing a single tertiary interaction. A tertiary interaction is defined as a base pair which is not contiguous to another base-pair.
rna = RNA(name = 'my_rna', sequence = 'GGGGGACGCAGTAACCCC')
bn = '(({(.(.....)..))))'
base_pairs = parse_bn(bn)
print base_pairs
orientation edge1 edge2 pos1 pos2 0 c ( ) 6 12 1 c ( ) 4 15 2 c { ) 3 16 3 c ( ) 2 17 4 c ( ) 1 18
The single tertiary interaction is stored in a pyrna.features.SecondaryStructure object.
ss = base_pairs_to_secondary_structure(rna, base_pairs)
for tertiary_interaction in ss.tertiary_interactions:
print json.dumps(tertiary_interaction, indent = 2)
{ "edge1": "(", "edge2": ")", "orientation": "c", "location": [ [ 6, 6 ], [ 12, 12 ] ] }
If we want to go back to a list of base pairs in a pandas Dataframe, we need to precise if we want to keep the tertiary interactions...
print secondary_structure_to_base_pairs(ss, keep_tertiaries=True)
edge1 edge2 orientation pos1 pos2 0 ( ) c 1 18 1 ( ) c 2 17 2 { ) c 3 16 3 ( ) c 4 15 4 ( ) c 6 12
...or not
print secondary_structure_to_base_pairs(ss)
edge1 edge2 orientation pos1 pos2 0 ( ) c 1 18 1 ( ) c 2 17 2 { ) c 3 16 3 ( ) c 4 15
PyRNA is able to parse several file formats describing secondary structures.
from pyrna.parsers import parse_vienna
h = open('../data/ft3100_2D_with_bracket_notation.fasta')
vienna_content = h.read()
h.close()
print vienna_content
>ft3100 FANTOM3 TAACAATCTGCTGAAAGGTACCGTCGGAGGGAGCTTTGTTGCCAGCGCCA GAAACGCCGGTTTAACCAGCGCCGAAGTGAGCGCAGTGATTAAAGCCATG CAGTGGCAAATGGATTTCCGCAAACTGAAAAAAGGCGATGAATTTGCGGT .......((((.....(((((((.((...((.(((........))).)). ....)).))))...)))..((((..(((..(((....((((...(((((. ..))))).....))))..)))..))).......))))........)))).
In a Vienna file, several molecules and bracket notations can be stored. Consequently, the function parse_vienna() returns a list of RNA objects and a list of pandas Dataframes.
all_molecules, all_base_pairs = parse_vienna(vienna_content)
for base_pairs in all_base_pairs:
print base_pairs
orientation edge1 edge2 pos1 pos2 0 c ( ) 35 44 1 c ( ) 34 45 2 c ( ) 33 46 3 c ( ) 31 48 4 c ( ) 30 49 5 c ( ) 26 55 6 c ( ) 25 56 7 c ( ) 23 58 8 c ( ) 22 59 9 c ( ) 21 60 10 c ( ) 20 61 11 c ( ) 19 65 12 c ( ) 18 66 13 c ( ) 17 67 14 c ( ) 99 103 15 c ( ) 98 104 16 c ( ) 97 105 17 c ( ) 96 106 18 c ( ) 95 107 19 c ( ) 91 113 20 c ( ) 90 114 21 c ( ) 89 115 22 c ( ) 88 116 23 c ( ) 83 119 24 c ( ) 82 120 25 c ( ) 81 121 26 c ( ) 78 124 27 c ( ) 77 125 28 c ( ) 76 126 29 c ( ) 73 134 30 c ( ) 72 135 31 c ( ) 71 136 32 c ( ) 70 137 33 c ( ) 11 146 34 c ( ) 10 147 35 c ( ) 9 148 36 c ( ) 8 149
PyRNA allows to recover Rfam entries directly from the database website.
from pyrna.db import Rfam
rfam = Rfam(use_website = True)
The function get_entry() returns:
gapped_rnas, organism_names_2_nse, consensus_2d = rfam.get_entry(rfam_id='RF00059')
for gapped_rna in gapped_rnas[:10]:
print "sequence for %s: %s"%(gapped_rna.name, gapped_rna.sequence)
sequence for AE014074.1/464992-464896: UAUUUCACAAAGGAGUGCUU---------------------------------------------------------UG-GCUGAGAUCGCAA--------------------------------------------------UUGCGAAA-UCCUGAGG-ACCUGA-UCUUGUUAGUACAAGCG-UAGGGA--UUGUGACCAAUAAUCAA sequence for AL935263.2/99249-99354: UUUAAACACUAGGGGUGUCCAAAA---------------------------------------------------AUGG-GCUGAGAUGGUGCUGUAA---------------------------------------------GUACCGAU-CCCUUUGA-ACCUG--U-AAGCUCAAACUUGCG-UAGGAA--AGUGUCACAGCUAAUGU sequence for BX248356.1/232868-232999: UUUAUAAAUCACGGGUGCUGGACGGCAUACGUUUGCC-------------------------------------ACAAA-GCUGAGACAGGGCGAGAAGACGUGCACG-----------------------------------UCCCUGAA-CCGUUGA--ACCUGA-UCCGGGUAAUACCGGCGAUAGGAA--GAAUAAUGAACCGAUCG sequence for AK120238.1/2075-2184: CUAUGUUAGGAGGUGGCCUCUUGGCCUGGAUUGUUGUGA-----------------------------------AUUGG-GCUGAGAA------------------------------------------------------------AGU-CCCUUUGA-ACCUGA-ACAGGAUAAUGCCUGCG-AAGGGA-GUGUGCAUUUCUACUUUU sequence for X54035.1/52-153: GAUGACCACAAGGGGAGCAUU-------------------------------------------------------AAA-GCUGAGAGUGAGCGGUUUC--------------------------------------------GUUC-UGA-CCCUUUGA-ACCUG--U-UAGUUAACGCUGGCG-UAGGGA--UGUGGCAAAUGCAAUAG sequence for AE008691.1/1524796-1524694: CUCAAGUGCUAGGGGAGCCAAA-----------------------------------------------------AAUG-GCUGAGAGGGGAUUC------------------------------------------------GUUCCCGA-CCCUUAGA-ACCUGA-CCAGGGUAAUGCCUGCG-AAGGGA--AGCACGUUUUACGAAAU sequence for AB087258.1/15478-15568: CCGACUUAGUCGGGGUGCUGAU-----------------------------------------------------AACA-GCUGAGAUA-----------------------------------------------------------AUA-CCCGUGA--ACCUGA-UACAGUUAAUACUGACG-UAGGAA--ACUAAUGGUCUUACUCC sequence for AE008691.1/2140374-2140272: UCAAAGUGCUAGGGGAGCCAGA-----------------------------------------------------AAUG-GCUGAGAGGGGAUUC------------------------------------------------GUUCCCGA-CCCUUAGA-ACCUGA-CCAGGGUAAUGCCUGCG-AAGGGA--AGCACGUUUUACGGAAU sequence for AL939111.1/162719-162830: CAAGGCACUCGCGGGAGCCCGGACGC------------------------------------------------ACCGG-GCUGAGAGGGAGGCUGGCG--------------------------------------------GCCUCCGA-CCGUACGA-ACCUGA-UCCGGGUCAUGCCGGCG-AAGGGA--GGGGCUGGACGCCCAUG sequence for AE017221.1/302202-302311: GGGGCAGGCUAGGGGUGCCCGAAUG-------------------------------------------------GAAGG-GCUGAGAGCUGGGUUUCU---------------------------------------------CCCAGCAA-CCCUUGGA-ACCUGA-UCCGGGUAAUGCCGGCG-GAGGGA--AGCCUAUGCGGAAGACC
The function consensus2d_to_base_pairs() allows to easily infer the secondary structure for a given RNA sequence
from pyrna.parsers import consensus2d_to_base_pairs, to_bn
for gapped_rna in gapped_rnas[:10]:
rna = RNA(name = gapped_rna.name, sequence = gapped_rna.sequence.replace('-',''))
print rna.name
print rna.sequence
print to_bn(consensus2d_to_base_pairs(gapped_rna, consensus_2d), len(rna))
AE014074.1/464992-464896 UAUUUCACAAAGGAGUGCUUUGGCUGAGAUCGCAAUUGCGAAAUCCUGAGGACCUGAUCUUGUUAGUACAAGCGUAGGGAUUGUGACCAAUAAUCAA .....[[[[[((((..(((()))).....(((((())))))..)))).....((((..((((......))))..))))..]]]]]............ AL935263.2/99249-99354 UUUAAACACUAGGGGUGUCCAAAAAUGGGCUGAGAUGGUGCUGUAAGUACCGAUCCCUUUGAACCUGUAAGCUCAAACUUGCGUAGGAAAGUGUCACAGCUAAUGU .....[[[[[((((..(((((....))))).....((((((.....))))))..)))).....((((.(((......)))...))))..]]]]]............ BX248356.1/232868-232999 UUUAUAAAUCACGGGUGCUGGACGGCAUACGUUUGCCACAAAGCUGAGACAGGGCGAGAAGACGUGCACGUCCCUGAACCGUUGAACCUGAUCCGGGUAAUACCGGCGAUAGGAAGAAUAAUGAACCGAUCG .....[[[[[((((..(((((..................))))).....((((((...............))))))..))))....((((..((((......))))...))))..]]]]]............ AK120238.1/2075-2184 CUAUGUUAGGAGGUGGCCUCUUGGCCUGGAUUGUUGUGAAUUGGGCUGAGAAAGUCCCUUUGAACCUGAACAGGAUAAUGCCUGCGAAGGGAGUGUGCAUUUCUACUUUU .....[[[[[((((..(((((....................))))).....()..)))).....((((..((((......))))..))))...]]]]]............ X54035.1/52-153 GAUGACCACAAGGGGAGCAUUAAAGCUGAGAGUGAGCGGUUUCGUUCUGACCCUUUGAACCUGUUAGUUAACGCUGGCGUAGGGAUGUGGCAAAUGCAAUAG .....[[[[[((((..((((())))).....(.((((......)))))..)))).....((((.(((......)))...))))..]]]]]............ AE008691.1/1524796-1524694 CUCAAGUGCUAGGGGAGCCAAAAAUGGCUGAGAGGGGAUUCGUUCCCGACCCUUAGAACCUGACCAGGGUAAUGCCUGCGAAGGGAAGCACGUUUUACGAAAU .....[[[[[((((..(((((..))))).....((((((..))))))..)))).....((((..((((......))))..))))..]]]]]............ AB087258.1/15478-15568 CCGACUUAGUCGGGGUGCUGAUAACAGCUGAGAUAAUACCCGUGAACCUGAUACAGUUAAUACUGACGUAGGAAACUAAUGGUCUUACUCC .....[[[[[((((..(((((..))))).....(.)..))))....((((..((((......))))..))))..]]]]]............ AE008691.1/2140374-2140272 UCAAAGUGCUAGGGGAGCCAGAAAUGGCUGAGAGGGGAUUCGUUCCCGACCCUUAGAACCUGACCAGGGUAAUGCCUGCGAAGGGAAGCACGUUUUACGGAAU .....[[[[[((((..(((((..))))).....((((((..))))))..)))).....((((..((((......))))..))))..]]]]]............ AL939111.1/162719-162830 CAAGGCACUCGCGGGAGCCCGGACGCACCGGGCUGAGAGGGAGGCUGGCGGCCUCCGACCGUACGAACCUGAUCCGGGUCAUGCCGGCGAAGGGAGGGGCUGGACGCCCAUG .....[[[[[((((..(((((.......))))).....((((((......))))))..)))).....((((..((((......))))..))))..]]]]]............ AE017221.1/302202-302311 GGGGCAGGCUAGGGGUGCCCGAAUGGAAGGGCUGAGAGCUGGGUUUCUCCCAGCAACCCUUGGAACCUGAUCCGGGUAAUGCCGGCGGAGGGAAGCCUAUGCGGAAGACC .....[[[[[((((..(((((......))))).....((((((.....))))))..)))).....((((..((((......))))..))))..]]]]]............
from pyrna.computations import Rnafold
from pyrna.features import RNA
rna = RNA(name = 'my_rna', sequence = 'GGGGTAGGGACGGTAGGGGGACGCAGTGCAGTAACGTACCCGGTAGGGGGTAGGGGGACGCAGTAACCCCGGGGACGCAGTAACCCCACGCAGTAACCCC')
rnafold = Rnafold() #the algorithm is launched locally, using a Docker image
base_pairs = rnafold.fold(rna)
print base_pairs
orientation edge1 edge2 pos1 pos2 0 c ( ) 41 48 1 c ( ) 40 49 2 c ( ) 39 50 3 c ( ) 38 51 4 c ( ) 37 52 5 c ( ) 56 67 6 c ( ) 55 68 7 c ( ) 54 69 8 c ( ) 53 70 9 c ( ) 35 72 10 c ( ) 32 75 11 c ( ) 31 76 12 c ( ) 29 77 13 c ( ) 28 78 14 c ( ) 27 79 15 c ( ) 24 80 16 c ( ) 23 81 17 c ( ) 20 84 18 c ( ) 19 85 19 c ( ) 18 86 20 c ( ) 17 87 21 c ( ) 14 90 22 c ( ) 13 91 23 c ( ) 11 93 24 c ( ) 10 94 25 c ( ) 5 96 26 c ( ) 4 97 27 c ( ) 3 98 28 c ( ) 2 99 29 c ( ) 1 100
We can also compute the 2D coordinates for a plot.
from pyrna.computations import Rnaplot
rnaplot = Rnaplot()
plot = rnaplot.plot(secondary_structure = base_pairs, rna = rna)
print plot
x y 1 51.114 -18.631 2 51.114 -3.631 3 51.114 11.369 4 51.114 26.369 5 51.114 41.369 6 43.659 46.738 7 39.542 55.107 8 39.850 64.568 9 44.640 72.894 10 52.900 78.063 11 54.298 92.998 12 47.780 106.391 13 56.770 119.410 14 58.167 134.345 15 48.494 145.613 16 49.699 160.779 17 61.583 170.848 18 62.981 185.783 19 64.379 200.718 20 65.777 215.652 21 56.104 226.921 22 57.308 242.086 23 69.193 252.156 24 70.590 267.091 25 68.168 271.394 26 68.550 277.290 27 72.568 282.901 28 74.891 297.719 29 77.215 312.538 30 75.104 319.761 .. ... ... 71 121.623 385.454 72 104.173 375.762 73 112.813 362.801 74 109.780 347.892 75 97.371 339.735 76 94.587 324.996 77 92.034 310.215 78 89.710 295.396 79 87.387 280.577 80 85.525 265.693 81 84.128 250.758 82 93.937 238.658 83 92.307 223.532 84 80.711 214.255 85 79.314 199.320 86 77.916 184.385 87 76.518 169.451 88 86.328 157.351 89 84.698 142.225 90 73.102 132.947 91 71.704 118.012 92 78.122 103.551 93 69.233 91.600 94 67.835 76.666 95 78.142 58.016 96 66.114 41.369 97 66.114 26.369 98 66.114 11.369 99 66.114 -3.631 100 66.114 -18.631 [100 rows x 2 columns]
We can also compute a secondary structure by annotating a tertiary one. First, we need to import a 3D structure from the Protein Databank.
from pyrna.db import PDB
from pyrna.parsers import parse_pdb
pdb = PDB()
tertiary_structures = parse_pdb(pdb.get_entry('1HR2'))
With PyRNA, a pyrna.features.TertiaryStructure object is made with a single molecular chain. Consequently, the function parse_pdb returns a list of such objects. We can iterate over these pyrna.features.TertiaryStructure objects to annotate them with the algorithm Rnaview (Rnaview is available for computations from my own server).
secondary_structures = []
for ts in tertiary_structures:
#the function annotate() from Rnaview returns a pyrna.features.SecondaryStructure object and its 3D counterpart as a pyrna.features.TertiaryStructure object
secondary_structure, tertiary_structure = Rnaview().annotate(ts)
secondary_structures.append(secondary_structure)
#the function secondary_structure_to_base_pairs() transform a pyrna.features.SecondaryStructure object into a list of base pairs stored in a pandas Dataframe
print "Molecular chain %s"%secondary_structure.rna.name
print secondary_structure_to_base_pairs(secondary_structure, keep_tertiaries = True)
Molecular chain A edge1 edge2 orientation pos1 pos2 0 ( ) c 4 112 1 ( ) c 5 111 2 ( ) c 6 110 3 ( ) c 7 109 4 ( ) c 8 108 5 ( ) c 9 107 6 ( ) c 10 106 7 { ] t 11 105 8 [ } t 12 104 9 ( ) c 14 103 10 ( ) c 15 102 11 ( ) c 16 101 12 ( ) c 17 100 13 ( ) c 18 99 14 ( ) c 19 98 15 ( ) c 24 94 16 ( ) c 25 93 17 ( ) c 26 92 18 ( ) c 27 91 19 ( ) c 30 89 20 ( ) c 31 88 21 ( ) c 32 87 22 ( ) c 34 80 23 ( ) c 35 79 24 ( ) c 36 78 25 [ } t 37 62 26 [ } t 38 61 27 ( ) c 39 60 28 ( ) c 40 59 29 ( ) c 41 58 .. ... ... ... ... ... 53 ( ] t 21 96 54 ( } t 22 99 55 ( ] c 33 85 56 { ] t 48 51 57 ( ) t 49 145 58 { } t 51 147 59 [ ) t 66 86 60 { ] c 69 70 61 { } t 79 84 62 { } t 82 109 63 { } c 95 98 64 { ] c 114 115 65 ( ] t 121 145 66 { ] c 122 123 67 ( ) c 123 146 68 ! ! ! 21 98 69 ! ! ! 111 157 70 ! ! ! 124 146 71 ! ! ! 9 66 72 ! ! ! 19 21 73 ! ! ! 50 147 74 ! ! ! 61 75 75 ! ! ! 1 115 76 ! ! ! 8 81 77 ! ! ! 24 100 78 ! ! ! 50 121 79 ! ! ! 51 120 80 ! ! ! 52 148 81 ! ! ! 77 78 82 ! ! ! 112 157 [83 rows x 5 columns] Molecular chain B edge1 edge2 orientation pos1 pos2 0 ( ) c 2 115 1 ( ) c 3 114 2 ( ) c 4 113 3 ( ) c 6 112 4 ( ) c 7 111 5 ( ) c 8 110 6 ( ) c 9 109 7 ( ) c 10 108 8 ( ) c 11 107 9 { ] t 12 106 10 [ } t 13 105 11 ( ) c 15 104 12 ( ) c 16 103 13 ( ) c 17 102 14 ( ) c 18 101 15 ( ) c 19 100 16 ( ) c 20 99 17 ( ) c 25 95 18 ( ) c 26 94 19 ( ) c 27 93 20 ( ) c 28 92 21 ( ) c 30 91 22 ( ) c 31 90 23 ( ) c 32 89 24 ( ) c 33 88 25 ( ) c 35 81 26 ( ) c 36 80 27 ( ) c 37 79 28 [ } t 38 63 29 [ } t 39 62 .. ... ... ... ... ... 59 [ ) c 7 158 60 { } c 9 82 61 { ) t 10 67 62 ( } t 23 100 63 ( ] c 34 86 64 { ] t 49 52 65 ( ) t 50 146 66 { } t 52 148 67 [ ) t 67 87 68 { ] c 70 71 69 { } t 80 85 70 { } t 83 110 71 { } c 96 99 72 { ] c 116 117 73 ( ] t 122 146 74 { ] c 123 124 75 ( ) c 124 147 76 ! ! ! 14 105 77 ! ! ! 125 147 78 ! ! ! 10 87 79 ! ! ! 20 22 80 ! ! ! 23 24 81 ! ! ! 51 148 82 ! ! ! 62 76 83 ! ! ! 8 83 84 ! ! ! 25 101 85 ! ! ! 51 122 86 ! ! ! 52 121 87 ! ! ! 53 149 88 ! ! ! 98 99 [89 rows x 5 columns]
Now we can search for all base pairs made with at least one Hoogsteen edge. Since the variable base_pairs stores a pandas Dataframe, we're using the operators available from pandas (http://pandas.pydata.org/pandas-docs/stable/indexing.html)
base_pairs = secondary_structure_to_base_pairs(secondary_structures[0], keep_tertiaries = True)
print base_pairs[(base_pairs['edge1'] == '[') | (base_pairs['edge2'] == ']')]
edge1 edge2 orientation pos1 pos2 7 { ] t 11 105 8 [ } t 12 104 25 [ } t 37 62 26 [ } t 38 61 52 [ ) c 4 156 53 ( ] t 21 96 55 ( ] c 33 85 56 { ] t 48 51 59 [ ) t 66 86 60 { ] c 69 70 64 { ] c 114 115 65 ( ] t 121 145 66 { ] c 122 123
A pyrna.features.SecondaryStructure can find all its junctions (meaning the group of single-strands linking helices).
According to the number of single-strands making a junction, we distinguish:
junctions = []
for ss in secondary_structures:
ss.find_junctions()
junctions += ss.junctions
A junction is described with:
print junctions[0]
{'single_strands': [{'name': 'SS_2', 'location': [13, 13]}], 'description': 'A -', 'location': [[12, 14], [103, 104]]}
Now we can search for all GNRA apical loops...
import re
for junction in junctions:
if re.match('G[AUGC][AG]A',junction['description']):
print "Apical loop with sequence %s"%junction['description']
Apical loop with sequence GAAA Apical loop with sequence GAAA
Or for all three-way junctions...
for junction in junctions:
if len(junction['location']) == 3:
print "Three-way junctions with sequence %s"%junction['description']
Three-way junctions with sequence GUAU - - Three-way junctions with sequence GUAU - -