(Text copied from http://rosalind.info/problems/grph/)
Problem
A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.
A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v
and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v).
For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph Ok in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).
Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O3. You may return edges in any order.
Sample Dataset
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
Sample Output
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
I am expected to make a directed graph of DNA sequences.
Because it is directed, there is a clear order, e.g. the end (suffix) of sequence 1 should match the beginning (prefix) of sequence 2.
From the example, it looks like you have to check: for each sequence, look which other sequences have a beginning that matches the end of the current sequence.
The length of the overlap is defined as parameter k. In the example k = 3. (Is this also true for the actual exercise?)
Now practically, this means the code is to:
Let's see how that translates to code:
from Bio import SeqIO
def look_for_overlaps(sequence_file, k=3, testing=True):
"""
Function that reads a sequence file to look for sequence overlaps in suffixes
and prefixes of sequences. Returns pairs of sequence IDs for which the suffix
of the former matches the prefix of the latter.
(Also see Rosalind.info exercise http://rosalind.info/problems/grph/)
"""
#Start by saving all sequences in a dictionary.
#This work easier than re-reading the file for each sequence.
sequence_dict = {}
for record in SeqIO.parse(sequence_file, "fasta"):
sequence_dict[record.id] = record.seq
for key, value in sequence_dict.items():
#For each item in the dictionary
# the key is the name
sequence_name = key
# and the sequence suffix is the last k characters from
# 'value' = the sequence.
suffix = value[-k:]
if testing:
#enable debugging prints
print("Now looking for overlaps with suffix %s of sequence %s..."
% (suffix, sequence_name))
else:
pass
for key, value in sequence_dict.items():
#For each sequence, read the complete list of sequences
other_name = key
prefix = value[:k]
if testing:
print("Now checking sequence %s with prefix %s..." %
(other_name, prefix))
if other_name == sequence_name:
#No need to check for matches with itself, so pass
pass
else:
#Check for overlap
if prefix == suffix:
#There is a match!
if testing:
print("There seems to be a match: %s (suffix of %s) = %s (prefix of %s)"
% (suffix, sequence_name, prefix, other_name))
else:
print(sequence_name, other_name)
else:
#if there is no overlap, do nothing
pass
return(None)
Now let's run this function with the test dataset:
look_for_overlaps("data/Example_overlap_graphs.fasta")
Now looking for overlaps with suffix AAA of sequence Rosalind_0498... Now checking sequence Rosalind_0498 with prefix AAA... Now checking sequence Rosalind_2391 with prefix AAA... There seems to be a match: AAA (suffix of Rosalind_0498) = AAA (prefix of Rosalind_2391) Now checking sequence Rosalind_2323 with prefix TTT... Now checking sequence Rosalind_0442 with prefix AAA... There seems to be a match: AAA (suffix of Rosalind_0498) = AAA (prefix of Rosalind_0442) Now checking sequence Rosalind_5013 with prefix GGG... Now looking for overlaps with suffix TTT of sequence Rosalind_2391... Now checking sequence Rosalind_0498 with prefix AAA... Now checking sequence Rosalind_2391 with prefix AAA... Now checking sequence Rosalind_2323 with prefix TTT... There seems to be a match: TTT (suffix of Rosalind_2391) = TTT (prefix of Rosalind_2323) Now checking sequence Rosalind_0442 with prefix AAA... Now checking sequence Rosalind_5013 with prefix GGG... Now looking for overlaps with suffix CCC of sequence Rosalind_2323... Now checking sequence Rosalind_0498 with prefix AAA... Now checking sequence Rosalind_2391 with prefix AAA... Now checking sequence Rosalind_2323 with prefix TTT... Now checking sequence Rosalind_0442 with prefix AAA... Now checking sequence Rosalind_5013 with prefix GGG... Now looking for overlaps with suffix CCC of sequence Rosalind_0442... Now checking sequence Rosalind_0498 with prefix AAA... Now checking sequence Rosalind_2391 with prefix AAA... Now checking sequence Rosalind_2323 with prefix TTT... Now checking sequence Rosalind_0442 with prefix AAA... Now checking sequence Rosalind_5013 with prefix GGG... Now looking for overlaps with suffix GGG of sequence Rosalind_5013... Now checking sequence Rosalind_0498 with prefix AAA... Now checking sequence Rosalind_2391 with prefix AAA... Now checking sequence Rosalind_2323 with prefix TTT... Now checking sequence Rosalind_0442 with prefix AAA... Now checking sequence Rosalind_5013 with prefix GGG...
And how does that look when not using debugging prints?
look_for_overlaps("data/Example_overlap_graphs.fasta", k=3, testing=False)
Rosalind_0498 Rosalind_2391 Rosalind_0498 Rosalind_0442 Rosalind_2391 Rosalind_2323
That looks fine! So now the code seems to work. At least with the test dataset... Let's try what happens with the real dataset:
look_for_overlaps("data/rosalind_grph.txt", k=3, testing=False)
Rosalind_1040 Rosalind_5677 Rosalind_1040 Rosalind_4711 Rosalind_6532 Rosalind_5400 Rosalind_1652 Rosalind_5442 Rosalind_6440 Rosalind_5677 Rosalind_6440 Rosalind_4711 Rosalind_5771 Rosalind_8550 Rosalind_9673 Rosalind_7260 Rosalind_9673 Rosalind_0232 Rosalind_9673 Rosalind_3971 Rosalind_1214 Rosalind_8526 Rosalind_5697 Rosalind_3085 Rosalind_5697 Rosalind_2525 Rosalind_8572 Rosalind_1040 Rosalind_8572 Rosalind_1449 Rosalind_9157 Rosalind_9564 Rosalind_9157 Rosalind_4016 Rosalind_3810 Rosalind_9673 Rosalind_3810 Rosalind_4526 Rosalind_8526 Rosalind_1652 Rosalind_8526 Rosalind_5697 Rosalind_4110 Rosalind_9009 Rosalind_2297 Rosalind_1040 Rosalind_2297 Rosalind_1449 Rosalind_9564 Rosalind_3810 Rosalind_9564 Rosalind_8991 Rosalind_7260 Rosalind_9158 Rosalind_7260 Rosalind_6075 Rosalind_7296 Rosalind_7051 Rosalind_7296 Rosalind_7177 Rosalind_2027 Rosalind_8830 Rosalind_2027 Rosalind_4607 Rosalind_2027 Rosalind_2021 Rosalind_3309 Rosalind_1214 Rosalind_3309 Rosalind_2553 Rosalind_1449 Rosalind_4402 Rosalind_2940 Rosalind_8550 Rosalind_5750 Rosalind_5572 Rosalind_7245 Rosalind_1040 Rosalind_7245 Rosalind_1449 Rosalind_4189 Rosalind_5409 Rosalind_3446 Rosalind_7245 Rosalind_1778 Rosalind_3694 Rosalind_8323 Rosalind_3053 Rosalind_8830 Rosalind_5400 Rosalind_2068 Rosalind_1778 Rosalind_2068 Rosalind_3694 Rosalind_0825 Rosalind_3832 Rosalind_0825 Rosalind_9385 Rosalind_6234 Rosalind_9564 Rosalind_6234 Rosalind_4016 Rosalind_5677 Rosalind_7051 Rosalind_5677 Rosalind_7177 Rosalind_4711 Rosalind_8572 Rosalind_4711 Rosalind_8323 Rosalind_4711 Rosalind_0944 Rosalind_7610 Rosalind_1040 Rosalind_7610 Rosalind_1449 Rosalind_7361 Rosalind_4242 Rosalind_4607 Rosalind_0659 Rosalind_4607 Rosalind_6143 Rosalind_9024 Rosalind_2297 Rosalind_9024 Rosalind_7296 Rosalind_9024 Rosalind_0290 Rosalind_1141 Rosalind_5572 Rosalind_0025 Rosalind_4110 Rosalind_0025 Rosalind_6234 Rosalind_0025 Rosalind_4770 Rosalind_0232 Rosalind_5677 Rosalind_0232 Rosalind_4711 Rosalind_3832 Rosalind_6775 Rosalind_3832 Rosalind_7789 Rosalind_9385 Rosalind_3810 Rosalind_9385 Rosalind_8991 Rosalind_0659 Rosalind_4790 Rosalind_6283 Rosalind_1040 Rosalind_6283 Rosalind_1449 Rosalind_6167 Rosalind_2027 Rosalind_6167 Rosalind_8830 Rosalind_6167 Rosalind_4607 Rosalind_6167 Rosalind_2021 Rosalind_6155 Rosalind_2068 Rosalind_6155 Rosalind_2196 Rosalind_0944 Rosalind_5400 Rosalind_4759 Rosalind_3309 Rosalind_4759 Rosalind_8174 Rosalind_4662 Rosalind_4402 Rosalind_8004 Rosalind_9395 Rosalind_8004 Rosalind_5157 Rosalind_2021 Rosalind_9673 Rosalind_2021 Rosalind_4526 Rosalind_7051 Rosalind_3810 Rosalind_7051 Rosalind_8991 Rosalind_4402 Rosalind_3832 Rosalind_4402 Rosalind_9385 Rosalind_5442 Rosalind_8526 Rosalind_6775 Rosalind_9395 Rosalind_6775 Rosalind_5157 Rosalind_2196 Rosalind_7260 Rosalind_2196 Rosalind_0232 Rosalind_2196 Rosalind_3971 Rosalind_9395 Rosalind_5409 Rosalind_3085 Rosalind_0659 Rosalind_3085 Rosalind_6143 Rosalind_4526 Rosalind_8550 Rosalind_6384 Rosalind_4790 Rosalind_4242 Rosalind_0507 Rosalind_4242 Rosalind_9328 Rosalind_2553 Rosalind_0507 Rosalind_2553 Rosalind_9328 Rosalind_5123 Rosalind_7610 Rosalind_9328 Rosalind_5677 Rosalind_9328 Rosalind_4711 Rosalind_0290 Rosalind_0507 Rosalind_0290 Rosalind_9328 Rosalind_2678 Rosalind_1487 Rosalind_8174 Rosalind_5400 Rosalind_8550 Rosalind_3832 Rosalind_8550 Rosalind_9385 Rosalind_5157 Rosalind_3053 Rosalind_2525 Rosalind_4625 Rosalind_1487 Rosalind_5400 Rosalind_7715 Rosalind_4759 Rosalind_7715 Rosalind_8004 Rosalind_7886 Rosalind_1778 Rosalind_7886 Rosalind_3694 Rosalind_8991 Rosalind_8526 Rosalind_6143 Rosalind_4402 Rosalind_7789 Rosalind_9564 Rosalind_7789 Rosalind_4016 Rosalind_5572 Rosalind_1214 Rosalind_5572 Rosalind_2553 Rosalind_0412 Rosalind_9024 Rosalind_0412 Rosalind_0042 Rosalind_4016 Rosalind_7610 Rosalind_5409 Rosalind_6440 Rosalind_3694 Rosalind_8572 Rosalind_3694 Rosalind_8323 Rosalind_3694 Rosalind_0944 Rosalind_3971 Rosalind_5677 Rosalind_3971 Rosalind_4711 Rosalind_4790 Rosalind_3810 Rosalind_4790 Rosalind_8991 Rosalind_0405 Rosalind_4110 Rosalind_0405 Rosalind_6234 Rosalind_0405 Rosalind_4770 Rosalind_5400 Rosalind_0507 Rosalind_5400 Rosalind_9328 Rosalind_6075 Rosalind_2297 Rosalind_6075 Rosalind_7296 Rosalind_6075 Rosalind_0290 Rosalind_1674 Rosalind_3810 Rosalind_1674 Rosalind_8991
I solved the problem. Now let's save this notebook and commit it to git.