The following assignments introduce applications of hashing with Python's dict() primitive. While doing so, a rudimentary introduction to biological sequences is given. This framework is then enhanced with probabilities, leading to routines that generate random sequences under some constraints, including a general concept of Markov chains. All these components illustrate the usage of dict(), but at the same time introduce some other computational routines to deal efficiently with probabilities.
The collections.defaultdict type can be useful.
Below are some suggested imports. Feel free to use and modify these, or not. Generally it is good practice to keep most or all imports in one place, typically very close to the start of the notebook.
from collections import defaultdict
from itertools import product
from itertools import accumulate
import numpy as np
from numpy.random import choice
import re
import random
from collections import Counter
from functools import reduce
import math
# Define ANSI escape codes for colored text
RED = "\033[91m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
ORANGE = "\033[38;5;208m"
L_ORANGE = "\033[38;5;214m"
RESET = "\033[0m"
The automated TMC tests do not test cell outputs. These are intended to be evaluated in the peer reviews, so it is still a good idea to make the outputs as clear and informative as possible.
To keep the TMC tests running as well as possible, it is recommended to keep global variable assignments in the notebook to a minimum to avoid potential name clashes and confusion. Additionally, you should keep all actual code execution inside main guards to keep the tests running smoothly. If you run check_sequence.py in the part07-e01_sequence_analysis folder, the script should finish very quickly and optimally produce no output.
If you download data from the internet during execution (codon usage table), the parts where downloading is done will not work if you decide to submit to the TMC server. Local tests should work fine.
A DNA molecule consists, in principle, of a chain of smaller molecules. These smaller molecules have some common basic components (bases) that repeat. For our purposes it is sufficient to know that these bases are the nucleotides adenine, cytosine, guanine, and thymine, with abbreviations A, C, G, and T. Given a DNA sequence, e.g. ACGATGAGGCTCAT, one can reverse engineer (with negligible loss of information) the corresponding DNA molecule.
Parts of a DNA molecule can transcribe into an RNA molecule. In this process, thymine gets replaced by uracil (U).
1. Write a function dna_to_rna to convert a given DNA sequence $s$ into an RNA sequence. For the sake of exercise, use dict() to store the symbol-to-symbol encoding rules. Create a program to test your function.
DEBUG1 = False # Set to [ True ] to enable debug output, [ False ] to disable
def dna_to_rna(s):
dict_dna_to_rna = {"A":"A","C":"C","G":"G","T":"U"}
rna_sequence = "".join(dict_dna_to_rna[nucleotide] for nucleotide in s)
if DEBUG1:
print(f"DNA sequence: {s}")
print(f"RNA sequence: {rna_sequence}")
return rna_sequence
if __name__ == '__main__':
print(dna_to_rna("AACGTGATTTC"))
#print(dna_to_rna(""))
AACGUGAUUUC
The solution involves converting a DNA sequence into an RNA sequence by replacing each occurrence of the nucleotide thymine (T) with uracil (U) and leaving all other nucleotides unchanged. This transformation is achieved using a dictionary that maps each nucleotide to its corresponding RNA counterpart. The function iterates over the input string, performs the replacement using the dictionary, and constructs the RNA sequence using Python's join method.
The problem of converting a DNA sequence to an RNA sequence is relatively straightforward and can be efficiently handled using a dictionary for nucleotide mapping. Further improvements might include e.g. input validation, but that does not appear necessary in this context.
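As an illustration of that possible improvement, here is a minimal sketch of a validating variant (the name dna_to_rna_validated is hypothetical, not part of the exercise):
def dna_to_rna_validated(s):
    # Same mapping as above, but reject characters outside the DNA alphabet.
    mapping = {"A": "A", "C": "C", "G": "G", "T": "U"}
    if any(c not in mapping for c in s):
        raise ValueError(f"Invalid DNA sequence: {s!r}")
    return "".join(mapping[c] for c in s)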
Like DNA and RNA, a protein molecule can be interpreted as a chain of smaller molecules, where the bases are now amino acids. An RNA molecule may translate into a protein molecule, but instead of base by base, three bases of RNA correspond to one base of protein. That is, the RNA sequence is read one triplet (called a codon) at a time.
2. Consider the codon to amino acid conversion table in http://htmlpreview.github.io/?https://github.com/csmastersUH/data_analysis_with_python_2020/blob/master/Codon%20usage%20table.html. Write a function get_dict to read the table into a dict(), such that for each RNA sequence of length 3, say $\texttt{AGU}$, the hash table stores the conversion rule to the corresponding amino acid. You may store the html page to your local src directory, and parse that file.
DEBUG2 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_dict():
# initialize variables
filepath = "src/data.txt"
codon_to_amino_dict = {}
codon_pattern = re.compile(r'([AUGC]{3})\s+([A-Z\*])')
# get the file
try:
with open(filepath, 'r') as file:
lines = file.readlines()
except FileNotFoundError:
raise FileNotFoundError(f"The file at {filepath} was not found.")
except IOError as e:
raise IOError(f"An error occurred while reading the file: {e}")
# find codon-amino matches
for line in lines:
matches = codon_pattern.findall(line)
# input matches into dictionary
for codon, amino_acid in matches:
codon_to_amino_dict[codon] = amino_acid
if DEBUG2:
print(f"File:")
for line in lines:
print(line, end = '')
print('\n')
print(f"Codon to Amino dictionary: {codon_to_amino_dict}")
return codon_to_amino_dict
if __name__ == '__main__':
codon_to_aa = get_dict()
print(codon_to_aa)
{'UUU': 'F', 'UCU': 'S', 'UAU': 'Y', 'UGU': 'C', 'UUC': 'F', 'UCC': 'S', 'UAC': 'Y', 'UGC': 'C', 'UUA': 'L', 'UCA': 'S', 'UAA': '*', 'UGA': '*', 'UUG': 'L', 'UCG': 'S', 'UAG': '*', 'UGG': 'W', 'CUU': 'L', 'CCU': 'P', 'CAU': 'H', 'CGU': 'R', 'CUC': 'L', 'CCC': 'P', 'CAC': 'H', 'CGC': 'R', 'CUA': 'L', 'CCA': 'P', 'CAA': 'Q', 'CGA': 'R', 'CUG': 'L', 'CCG': 'P', 'CAG': 'Q', 'CGG': 'R', 'AUU': 'I', 'ACU': 'T', 'AAU': 'N', 'AGU': 'S', 'AUC': 'I', 'ACC': 'T', 'AAC': 'N', 'AGC': 'S', 'AUA': 'I', 'ACA': 'T', 'AAA': 'K', 'AGA': 'R', 'AUG': 'M', 'ACG': 'T', 'AAG': 'K', 'AGG': 'R', 'GUU': 'V', 'GCU': 'A', 'GAU': 'D', 'GGU': 'G', 'GUC': 'V', 'GCC': 'A', 'GAC': 'D', 'GGC': 'G', 'GUA': 'V', 'GCA': 'A', 'GAA': 'E', 'GGA': 'G', 'GUG': 'V', 'GCG': 'A', 'GAG': 'E', 'GGG': 'G'}
The get_dict() function parses a file containing codon-to-amino acid mappings and creates a dictionary from this data. The function uses a regular expression to find and extract codon-amino acid pairs from each line of the file. The regex pattern ([AUGC]{3})\s+([A-Z\*]) captures RNA triplets (codons) and their corresponding amino acids. The function then constructs a dictionary with codons as keys and amino acids as values, and returns this dictionary.
The initial approach of using simple string operations to extract codon-amino acid pairs was prone to errors due to inconsistent spacing and formatting in the data. By employing regular expressions, the revised solution provides a more robust and precise method for parsing the file. Debugging output helps verify that the dictionary is correctly populated. This method appears both efficient and adaptable, making it a reliable solution for the task.
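A small demonstration of the pattern on a hand-written line (the sample below only mimics the table layout and is an assumption, not actual file content):
if __name__ == '__main__':
    sample = "UUU F 0.46 17.6 (714298)  UCU S 0.19 15.2 (618711)"
    pattern = re.compile(r'([AUGC]{3})\s+([A-Z\*])')
    print(pattern.findall(sample))  # [('UUU', 'F'), ('UCU', 'S')]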
3. Use the same conversion table as above, but now write function get_dict_list to read the table into a dict(), such that for each amino acid the hash table stores the list of codons encoding it.
DEBUG3 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_dict_list():
# initialize variables
filepath = "src/data.txt"
codon_to_amino_dict = {}
codon_pattern = re.compile(r'([AUGC]{3})\s+([A-Z\*])')
amino_to_codon_dict_list = {}
# get the file
try:
with open(filepath, 'r') as file:
lines = file.readlines()
except FileNotFoundError:
raise FileNotFoundError(f"The file at {filepath} was not found.")
except IOError as e:
raise IOError(f"An error occurred while reading the file: {e}")
# find codon-amino matches
for line in lines:
matches = codon_pattern.findall(line)
# input matches into dictionary
for codon, amino_acid in matches:
codon_to_amino_dict[codon] = amino_acid
# input: values from codon->amino as keys / keys from codon->amino as values, in list
for key, value in codon_to_amino_dict.items():
if DEBUG3:
print(f"CODON -> AMINO | Key - {key}, Value - {value}")
if value not in amino_to_codon_dict_list:
amino_to_codon_dict_list[value] = []
amino_to_codon_dict_list[value].append(key)
else:
amino_to_codon_dict_list[value].append(key)
if DEBUG3:
print(f"AMINO -> CODON | Key - {value}, Value - {amino_to_codon_dict_list[value]}")
return amino_to_codon_dict_list
if __name__ == '__main__':
aa_to_codons = get_dict_list()
print(aa_to_codons)
{'F': ['UUU', 'UUC'], 'S': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'], 'Y': ['UAU', 'UAC'], 'C': ['UGU', 'UGC'], 'L': ['UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'], '*': ['UAA', 'UGA', 'UAG'], 'W': ['UGG'], 'P': ['CCU', 'CCC', 'CCA', 'CCG'], 'H': ['CAU', 'CAC'], 'R': ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'Q': ['CAA', 'CAG'], 'I': ['AUU', 'AUC', 'AUA'], 'T': ['ACU', 'ACC', 'ACA', 'ACG'], 'N': ['AAU', 'AAC'], 'K': ['AAA', 'AAG'], 'M': ['AUG'], 'V': ['GUU', 'GUC', 'GUA', 'GUG'], 'A': ['GCU', 'GCC', 'GCA', 'GCG'], 'D': ['GAU', 'GAC'], 'G': ['GGU', 'GGC', 'GGA', 'GGG'], 'E': ['GAA', 'GAG']}
The function get_dict_list() aims to create a dictionary mapping each amino acid to the list of codons that encode it.
The debug output, if enabled, provides intermediate verification of these mappings.
The solution achieves the task by first populating a codon_to_amino_dict with codon-amino acid pairs and then transforming this dictionary into the final format, where amino acids are keys and lists of codons are values. This two-step approach keeps the process clear but involves some redundant steps; combining them into a single pass without intermediate storage could improve efficiency, as sketched below.
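For illustration, here is one way the merge loop could be collapsed using collections.defaultdict (a hypothetical sketch building on get_dict from exercise 2, not the submitted solution):
from collections import defaultdict

def get_dict_list_compact():
    # Invert the codon -> amino acid mapping from exercise 2 in a single loop.
    amino_to_codons = defaultdict(list)
    for codon, amino_acid in get_dict().items():
        amino_to_codons[amino_acid].append(codon)
    return dict(amino_to_codons)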
With the conversion tables at hand, the following should be trivial to solve.
4. Fill in function rna_to_prot in the stub solution to convert a given DNA sequence $s$ into a protein sequence. You may use the dictionaries from exercises 2 and 3. You can test your program with ATGATATCATCGACGATGTAG.
DEBUG4 = False # Set to [ True ] to enable debug output, [ False ] to disable
def rna_to_prot(s):
# get the codon to amino dictionary
codon_to_amino_dict = get_dict()
# split the RNA sequence into codons (3-letter segments)
codon_lst = [s[i:i+3] for i in range(0, len(s), 3)]
if DEBUG4:
print(f'\ncodon list: {codon_lst}')
# translate each codon into its corresponding amino
amino_lst = [codon_to_amino_dict[c] for c in codon_lst]
if DEBUG4:
print(f'amino list: {amino_lst}\n')
# join the list of amino acids into a single string
amino_str = ''.join(amino_lst)
return amino_str
def dna_to_prot(s):
return rna_to_prot(dna_to_rna(s))
if __name__ == '__main__':
print(dna_to_prot("ATGATATCATCGACGATGTAG"))
MISSTM*
The function rna_to_prot converts an RNA sequence into a protein sequence leveraging the codon-to-amino dictionary.
This approach ensures that each codon is accurately translated and the resulting protein sequence is produced.
The rna_to_prot function translates an RNA sequence into a protein sequence in a straightforward manner by using the previously created functions. The use of list comprehension for splitting the RNA sequence into codons and then translating each codon into an amino acid provides a compact and readable solution. Debugging output helps verify the correctness of intermediate steps, by showing the list of codons and the resulting amino acids.
You may notice that there are $4^3=64$ different codons, but only 20 amino acids. That is, some triplets encode the same amino acid.
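This is easy to verify with the dictionaries from exercises 2 and 3; the counts below agree with the outputs shown earlier (note that the stop symbol * appears alongside the 20 amino acids):
if __name__ == '__main__':
    print(len(get_dict()))       # 64 codons
    print(len(get_dict_list()))  # 21 keys: 20 amino acids plus the stop symbol '*'
    print(get_dict_list()['L'])  # six different codons encode leucine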
It has been observed that among the codons coding the same amino acid, some are more frequent than others. These frequencies can be converted to probabilities. E.g. consider codons AUU, AUC, and AUA that code for amino acid isoleucine. If they are observed, say, 36, 47, and 17 times, respectively, to code isoleucine in a dataset, the probability that a random such event is AUU $\to$ isoleucine is 36/100.
This phenomenon is called codon adaptation, and for our purposes it works as a good introduction to the generation of random sequences under constraints.
5. Consider the codon adaptation frequencies in http://htmlpreview.github.io/?https://github.com/csmastersUH/data_analysis_with_python_2020/blob/master/Codon%20usage%20table.html and read them into a dict(), such that for each RNA sequence of length 3, say AGU, the hash table stores the probability of that codon among codons encoding the same amino acid. Put your solution in the get_probabability_dict function. Use the column "([number])" to estimate the probabilities, as the two preceding columns contain truncated values.
DEBUG5 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_probabability_dict():
filepath = "src/data.txt"
codon_adaptation_freq_dict = {}
amino_to_codon_dict_list = get_dict_list()
# read the file
with open(filepath, 'r') as file:
lines = file.readlines()
if DEBUG5:
print()
print(f'lines: {lines}')
# regular expression to match codons and their frequencies
freq_pattern = re.compile(r'([AUGC]{3})\s+[A-Z\*]\s+\d+\.\d+\s+\d+\.\d+\s*\(\s*(\d+)\s*\)')
# for each line in original table find the groups: (codon, frequency)
for line in lines:
matches = freq_pattern.findall(line)
if DEBUG5:
print()
print(f'{YELLOW}MATCHES FROM LINE & LINE {RESET}')
print(f'{RED}matches: {matches} {RESET}')
print(f'{RED}line: {line} {RESET}')
# extract frequencies
for codon, frequency in matches:
frequency = int(frequency)
amino = None
if DEBUG5:
print(f'{YELLOW}ITERATED ITEMS & DICTIONARY ADDITION {RESET}')
print(f'codon: {codon}')
print(f'frequency: {frequency}')
# find the corresponding amino for the codon
for key, values_list in amino_to_codon_dict_list.items():
if codon in values_list:
amino = key
if DEBUG5:
print(f'amino: {amino}')
break
# add amino, codon, frequency (amino -> [codon -> frequency])
if amino not in codon_adaptation_freq_dict:
codon_adaptation_freq_dict[amino] = {}
codon_adaptation_freq_dict[amino][codon] = frequency
else:
codon_adaptation_freq_dict[amino][codon] = frequency
if DEBUG5:
print(f'codon_adaptation_freq_dict: {codon_adaptation_freq_dict}')
print('-------------------------')
# compute probabilities from the frequencies
codon_encoding_amino_probability_dict = {}
for amino, codon_freqs in codon_adaptation_freq_dict.items():
total_frequency = sum(codon_freqs.values())
codon_encoding_amino_probability_dict [amino] = {codon: freq / total_frequency for codon, freq in codon_freqs.items()}
if DEBUG5:
print()
print(f'codon_adaptation_freq_dict: {codon_adaptation_freq_dict}')
print(f'codon_encoding_amino_probability_dict: {codon_encoding_amino_probability_dict}')
# flatten the dictionary for compatibility with the main block
codon_encoding_amino_probability_dict_flat = {}
for amino, codon_probs in codon_encoding_amino_probability_dict.items():
for codon, prob in codon_probs.items():
codon_encoding_amino_probability_dict_flat[codon] = prob
if DEBUG5:
print(f'codon_encoding_amino_probability_dict_flat: {codon_encoding_amino_probability_dict_flat}')
print()
return codon_encoding_amino_probability_dict_flat
if __name__ == '__main__':
codon_to_prob = get_probabability_dict()
items = sorted(codon_to_prob.items(), key=lambda x: x[0])
for i in range(1 + len(items)//6):
print("\t".join(
f"{k}: {v:.6f}"
for k, v in items[i*6:6+i*6]
))
AAA: 0.434049    AAC: 0.529633    AAG: 0.565951    AAU: 0.470367    ACA: 0.284188    ACC: 0.355232
ACG: 0.113812    ACU: 0.246769    AGA: 0.214658    AGC: 0.239938    AGG: 0.211091    AGU: 0.149602
AUA: 0.169062    AUC: 0.469866    AUG: 1.000000    AUU: 0.361072    CAA: 0.265017    CAC: 0.581485
CAG: 0.734983    CAU: 0.418515    CCA: 0.276603    CCC: 0.323470    CCG: 0.113196    CCU: 0.286731
CGA: 0.108812    CGC: 0.183777    CGG: 0.201554    CGU: 0.080108    CUA: 0.071380    CUC: 0.195577
CUG: 0.395702    CUU: 0.131716    GAA: 0.422453    GAC: 0.535458    GAG: 0.577547    GAU: 0.464542
GCA: 0.228121    GCC: 0.399781    GCG: 0.106176    GCU: 0.265922    GGA: 0.249922    GGC: 0.337109
GGG: 0.249882    GGU: 0.163087    GUA: 0.116577    GUC: 0.238306    GUG: 0.463346    GUU: 0.181770
UAA: 0.297019    UAC: 0.556662    UAG: 0.236738    UAU: 0.443338    UCA: 0.150517    UCC: 0.217960
UCG: 0.054398    UCU: 0.187586    UGA: 0.466243    UGC: 0.543843    UGG: 1.000000    UGU: 0.456157
UUA: 0.076568    UUC: 0.535866    UUG: 0.129058    UUU: 0.464134
The goal of this solution is to calculate the probability distribution of the codons encoding each amino acid, based on their frequencies in a provided dataset. The program reads codon frequencies from a file and processes them to produce a normalized probability distribution for each codon-amino acid pairing.
The process consists of several steps:
Read Codon Frequency Data: The program reads the codon frequency data from a file. This file contains lines with codons, amino acids, and their respective frequencies.
Extract Frequencies Using Regular Expressions: A regular expression is used to parse each line of the file and extract codons along with their frequencies. The regular expression pattern matches a codon and its frequency value.
Map Codons to Amino Acids: The program maps each codon to its corresponding amino acid, using the previously defined dictionary amino_to_codon_dict_list.
Calculate Frequency Dictionary: For each codon, the program records its frequency under the corresponding amino acid. These frequencies are stored in a nested dictionary structure (codon_adaptation_freq_dict), where the outer keys are amino acids, and the inner dictionaries map codons to their frequencies.
Compute Probabilities: Once all codon frequencies are collected, the program calculates the probability of each codon for each amino acid. This is done by dividing the frequency of each codon by the total frequency of all codons encoding the same amino acid. The result is a probability distribution stored in a new dictionary (codon_encoding_amino_probability_dict).
Flatten the Probability Dictionary: The nested probability dictionary is then flattened for use in the main block of the program, resulting in a single dictionary that maps each codon directly to its probability.
Output the Results: Finally, the program sorts the codon-probability pairs alphabetically by codon and prints them in a formatted table.
The solution leverages a structured approach to convert raw frequency data into a usable probability distribution format. The use of regular expressions allows for flexible and robust parsing of the input data, ensuring that only correctly formatted lines are processed. The mapping from codons to amino acids is managed using dictionaries, allowing quick lookups and updates.
A significant challenge encountered during development was managing inconsistent spacing and avoiding unnecessary data capture in the regex pattern. Initially, the regular expression was too broad, capturing more information than needed and struggling to handle varying spaces between data elements. To resolve this, the regex pattern was refined to focus solely on essential components: the three-letter codon sequence and the frequency count inside parentheses.
The program includes optional debug output. When enabled, the program prints detailed information about the internal state at various steps.
Now you should have everything in place to easily solve the following.
6. Write a class ProteinToMaxRNA with a convert method which converts a protein sequence into the most likely RNA sequence to be the source of this protein. Run your program with LTPIQNRA.
DEBUG6 = False # Set to [ True ] to enable debug output, [ False ] to disable
class ProteinToMaxRNA:
def __init__(self):
self.amino_to_codon = get_dict_list()
self.codon_probabilities = get_probabability_dict()
def convert(self, s):
if DEBUG6:
print('amino_to_codon')
print(self.amino_to_codon)
print()
print('codon_probabilities')
print(self.codon_probabilities)
print()
# convert string to chars for use in loop
chars = list(s)
if DEBUG6:
print(f'chars: {chars}')
# initialize other variables for use in loop
max_codon_lst = []
# loop over each character in sequence
for char in chars:
prob_lst = []
# fetch the codons encoding amino
codons = self.amino_to_codon[char]
if DEBUG6:
print(f'codons encoding {char}')
print(codons)
# loop over each codon
for codon in codons:
# fetch the probability it encodes amino and append to list containing probabilities
probability = self.codon_probabilities[codon]
prob_lst.append(probability)
# find codon with max probability of encoding and index, then use index to get codon
max_value = max(prob_lst)
max_index = prob_lst.index(max_value)
max_codon = codons[max_index]
if DEBUG6:
print(f'max_codon: {max_codon}')
# append to list being returned
max_codon_lst.append(max_codon)
# convert list to string
max_codon_str = ''.join(max_codon_lst)
return max_codon_str
if __name__ == '__main__':
protein_to_rna = ProteinToMaxRNA()
print(protein_to_rna.convert("LTPIQNRA"))
CUGACCCCCAUCCAGAACAGAGCC
The solution involves creating a class, ProteinToMaxRNA, that translates a protein sequence into its most likely corresponding RNA sequence based on codon usage probabilities. This is done using two key dictionaries: one that maps each amino acid to its possible codons (amino_to_codon), and another that provides the probability of each codon being used (codon_probabilities).
The convert method takes a protein sequence as input and processes each amino acid one by one. For each amino acid, the method retrieves its associated codons and their probabilities from the respective dictionaries. It then determines which codon has the highest probability of encoding the amino acid and appends this codon to a list. After processing all amino acids in the input sequence, the method returns a concatenated string of the selected codons, forming the most likely RNA sequence.
While the current implementation maintains a clear structure, there is some redundancy within the loop iteration. With some refinements it could be made more efficient and concise, as sketched below. However, the code correctly implements the required functionality and provides the correct output for the example input and tests.
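For example, the parallel probability list could be avoided with max() and a key function; a minimal sketch of such a refinement (convert_concise is a hypothetical name), assuming the same two dictionaries:
def convert_concise(amino_to_codon, codon_probabilities, s):
    # For each amino acid, pick the codon with the highest usage probability.
    return "".join(
        max(amino_to_codon[aa], key=lambda c: codon_probabilities[c])
        for aa in s
    )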
Now we are almost ready to produce random RNA sequences that code a given protein sequence. For this, we need a subroutine to sample from a probability distribution. Consider our earlier example of probabilities 36/100, 47/100, and 17/100 for AUU, AUC, and AUA, respectively.
Let us assume we have a random number generator random() that returns a random number from the interval $[0,1)$. We may then partition the unit interval according to cumulative probabilities to $[0,36/100), [36/100,83/100), [83/100,1)$, respectively. Depending on which interval the number random() hits, we select the codon accordingly.
7. Write a function random_event that chooses a random event, given a probability distribution (a set of events whose probabilities sum to 1). You can use the function random.uniform to produce values uniformly at random from the range $[0,1)$. The distribution should be given to your function as a dictionary from events to their probabilities.
DEBUG7 = False # Set to [ True ] to enable debug output, [ False ] to disable
def random_event(dist):
"""
Takes as input a dictionary from events to their probabilities.
Return a random event sampled according to the given distribution.
The probabilities must sum to 1.0
"""
# produce random fraction to use for ascertaining event
random_fract = random.uniform(0, 1)
if DEBUG7:
print(f'dist: {dist}') # {'A': 0.1, 'C': 0.35, 'G': 0.15, 'T': 0.4}
print(f'random_fract: {random_fract}') # 0.355473001581198
# calculate cumulative probabilities
cumulative_values = list(accumulate(dist.values())) # [0.1, 0.45, 0.6, 1.0]
if DEBUG7:
print(f'cumulative_values: {cumulative_values}')
# create a list of tuples (event, cumulative_value)
event_cumval = list(zip(dist.keys(), cumulative_values))
# find the first cumulative value that exceeds random_fract
for event, cumulative_value in event_cumval:
if random_fract < cumulative_value:
if DEBUG7:
print(f'event_cumval: {event_cumval}') # [('A', 0.1), ('C', 0.45), ('G', 0.6), ('T', 1.0)]
cumulative_values_with_random = sorted(cumulative_values + [random_fract])
print(f'cumulative_values_with_random: {cumulative_values_with_random}') # [0.1, 0.355473001581198, 0.45, 0.6, 1.0]
print(f'{RED}return_event {event} {RESET}') # return_event C
print('-----------------------')
# return event key where random value falls below = the value within which range it is
return event
if __name__ == '__main__':
distribution = dict(zip("ACGT", [0.10, 0.35, 0.15, 0.40]))
print(", ".join(random_event(distribution) for _ in range(29)))
G, T, T, C, C, G, C, C, C, T, A, G, C, A, C, T, G, A, T, T, T, C, C, A, T, C, C, C, T
The solution aims to randomly select an event from a given distribution where each event has a specified probability.
First a random fraction between 0 and 1 is generated. This fraction will help determine which event to select based on its position relative to cumulative probabilities.
The cumulative probabilities of the events are calculated from the given distribution. For this purpose, the list of individual event probabilities is transformed into a list where each element represents the sum of probabilities up to and including that event.
Events and their corresponding cumulative values are paired together. This allows us to easily determine which event corresponds to the range that the random fraction falls within.
By comparing the random fraction with the cumulative values, the correct event is selected. The first cumulative value that is greater than the random fraction indicates the event that should be returned.
If debugging is enabled, additional information is printed to help verify the correctness of the sampling process.
The approach uses cumulative probabilities to map a random fraction to a specific event. This ensures that each event is selected in proportion to its specified probability. Debug output is integrated to aid in the verification of the sampling process. It provides visibility into key variables such as the distribution, random fraction, cumulative values, and the selected event.
The code works well with the given example input and the tests. For more general use cases there could be issues, because the code assumes that the input probabilities are perfectly normalized to sum to exactly 1.0. One issue that came up while writing the code was floating-point precision errors in the cumulative sums, where cumulative values would come out as, e.g., 0.4499999999999999 instead of 0.45.
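One way around both concerns is to search the cumulative sums with bisect and let the last event absorb any floating-point slack; a minimal sketch (random_event_bisect is a hypothetical name), assuming probabilities that sum to approximately 1:
import random
from bisect import bisect
from itertools import accumulate

def random_event_bisect(dist):
    events = list(dist)
    cumulative = list(accumulate(dist.values()))
    r = random.uniform(0, 1)
    # bisect finds the first interval whose upper bound exceeds r; min() guards
    # against r landing above a last cumulative sum that is slightly below 1.0.
    return events[min(bisect(cumulative, r), len(events) - 1)]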
With this general routine, the following should be easy to solve.
8. Write a class ProteinToRandomRNA to produce a random RNA sequence encoding the input protein sequence according to the input codon adaptation probabilities. The actual conversion is done through the convert method. Run your program with LTPIQNRA.
DEBUG8 = False # Set to [ True ] to enable debug output, [ False ] to disable
class ProteinToRandomRNA(object):
def __init__(self):
# initialize dictionaries: aminos and their encoding codons, probability of codon encoding amino
self.amino_to_codon = get_dict_list()
self.codon_probabilities = get_probabability_dict()
def convert(self, s):
if DEBUG8:
print(f'amino_to_codon: {self.amino_to_codon}')
print(f'codon_probabilities: {self.codon_probabilities}')
print(f'input string: {s}')
print()
# for each amino in input sequence, randomly select a codon based on its probability
return_lst = []
for c in s:
# get the encoding codons
codons_encoding = self.amino_to_codon[c]
# for each codon fetch the probability
codons_encoding_probability = {c_e: self.codon_probabilities[c_e] for c_e in codons_encoding}
# use random_event function to create random event
res = random_event(codons_encoding_probability)
# append that codon to be returned
return_lst.append(res)
if DEBUG8:
print(f'current amino: {RED}{c}{RESET}')
print(f'codons encoding: {codons_encoding}')
print(f'codon probabilities: {codons_encoding_probability}')
print(f'chosen codon: {res}')
print(f'current rna sequence: {return_lst}')
print()
# join event codons to form string
return_str = ''.join(return_lst)
return return_str
if __name__ == '__main__':
protein_to_random_codons = ProteinToRandomRNA()
print(protein_to_random_codons.convert("LTPIQNRA"))
CUGACCCCUAUACAGAACCGCGCA
The solution involves mapping each amino acid in the input protein sequence to its possible codons using a dictionary (amino_to_codon). Another dictionary (codon_probabilities) provides the probabilities of each codon being used. The program iterates through each amino acid in the protein sequence, retrieves the possible codons and their associated probabilities, and randomly selects a codon using the random_event function. The selected codons are concatenated to form the final RNA sequence.
Debugging statements are included to trace the steps and verify correct function at each stage.
This task was a relatively straightforward combination of the previously created functions and the class structure from the last exercise. The debug print statements attempt to further visualize that, when random_event is called, the right codon is returned based on the random fraction that random_event produced.
The task only required a straightforward implementation, using pre-existing dictionaries for amino mapping and codon probability data. The main challenge was ensuring that each amino acid was correctly converted into a corresponding RNA codon based on its probability distribution. The debug output effectively shows the mapping and selection process, confirming that the random_event function correctly chooses codons according to the provided probabilities.
Future improvements could include optimizing the handling of input data and refining the random selection process to ensure performance across larger datasets.
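Since the conversion is stochastic, repeated runs give different sequences; seeding the module-level RNG (an illustration, not part of the exercise) makes a run reproducible:
if __name__ == '__main__':
    random.seed(0)  # fix the RNG so convert() returns the same sequence on every run
    print(ProteinToRandomRNA().convert("LTPIQNRA"))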
We will now reuse the machinery derived above in a related context. We go back to DNA sequences, and consider some easy statistics that can be used to characterize the sequences. First, just the frequencies of bases $\texttt{A}$, $\texttt{C}$, $\texttt{G}$, $\texttt{T}$ may reveal the species from which the input DNA originates; each species has a different base composition that has been formed during evolution. More interestingly, the areas where DNA to RNA transcription takes place (coding regions) have an excess of $\texttt{C}$ and $\texttt{G}$ over $\texttt{A}$ and $\texttt{T}$. To detect such areas, a common routine is to just use a sliding window of fixed size, say $k$, and compute for each window position $T[i..i+k-1]$ the base frequencies, where $T[1..n]$ is the input DNA sequence. When sliding the window from $T[i..i+k-1]$ to $T[i+1..i+k]$, frequency $f(T[i])$ gets decreased by one and $f(T[i+k])$ gets increased by one.
9. Write a generator sliding_window to compute sliding window base frequencies so that each moving of the window takes constant time. We saw in the beginning of the course one way to create generators using a generator expression. Here we use a different way. For the function sliding_window to be a generator, it must have at least one yield expression, see https://docs.python.org/3/reference/expressions.html#yieldexpr.
Here is an example of a generator function that works similarly to the built-in range generator:
def range(a, b=None, c=1):
current = 0 if b == None else a
end = a if b == None else b
while current < end:
yield current
current += c
A yield expression can be used to return a value and temporarily suspend execution of the function.
DEBUG9 = False # Set to [ True ] to enable debug output, [ False ] to disable
def sliding_window(s, k):
# define all possible nucleotides
nucleotides = 'ACGT'
# if no window to iterate, return
if len(s) < k:
return
    # initial frequency window handled separately
freq = Counter(s[:k])
# add missing nucleotides with a count of 0 using a dictionary comprehension
freq.update({nucleotide: 0 for nucleotide in nucleotides if nucleotide not in freq})
if DEBUG9:
print(f'INITIAL: {freq}')
# yield intial
yield dict(freq)
# deliver subsequent windows sliding through sequences
for i in range(k, len(s)):
        # determine outgoing and incoming index positions
out_char = s[i - k]
in_char = s[i]
if DEBUG9:
print()
print(s)
print(f'{RED}{s[i-k]}{RESET}{s[i+1-k:i]}{YELLOW}{s[i]}{RESET}{s[i+1:]}')
print(f'outgoing: {RED}{out_char}{RESET} (i-k: {i-k})')
print(f'incoming: {YELLOW}{in_char}{RESET} (i : {i})')
# update counter for the outgoing char
freq[out_char] -= 1
if DEBUG9:
print(f'updated_out: {freq}')
# update counter for the incoming char
freq[in_char] += 1
if DEBUG9:
print(f'updated_in : {freq}')
# yield the updated frequency dictionary
yield dict(freq)
if __name__ == '__main__':
#s = "T"
s = "TCCCGACGGCCTTGCC"
for d in sliding_window(s, 4):
print(d)
{'T': 1, 'C': 3, 'A': 0, 'G': 0}
{'T': 0, 'C': 3, 'A': 0, 'G': 1}
{'T': 0, 'C': 2, 'A': 1, 'G': 1}
{'T': 0, 'C': 2, 'A': 1, 'G': 1}
{'T': 0, 'C': 1, 'A': 1, 'G': 2}
{'T': 0, 'C': 1, 'A': 1, 'G': 2}
{'T': 0, 'C': 2, 'A': 0, 'G': 2}
{'T': 0, 'C': 2, 'A': 0, 'G': 2}
{'T': 1, 'C': 2, 'A': 0, 'G': 1}
{'T': 2, 'C': 2, 'A': 0, 'G': 0}
{'T': 2, 'C': 1, 'A': 0, 'G': 1}
{'T': 2, 'C': 1, 'A': 0, 'G': 1}
{'T': 1, 'C': 2, 'A': 0, 'G': 1}
The sliding_window generator function calculates nucleotide frequencies within a sliding window of a specified size over a DNA sequence. The function first initializes a frequency dictionary for the initial window, ensuring all possible nucleotides ('A', 'C', 'G', 'T') are accounted for, even those with zero counts. The initial frequency dictionary is then yielded. As the window slides one position at a time through the sequence, the function updates the counts by decrementing the frequency of the nucleotide that exits the window and incrementing the frequency for the nucleotide that enters the window. This update process is efficient, operating in constant time since only two nucleotide frequencies change per step.
Initially, the tests failed because the frequency dictionaries did not account for all nucleotide bases ('A', 'C', 'G', 'T'), specifically those that had zero frequencies. This was corrected by initializing the frequency dictionary with all nucleotides. The concept of sliding a window and adjusting counts in constant time was somewhat challenging to grasp at first. I initially included loops that would have made each window move take non-constant time, which would have been inefficient. After understanding how to adjust only the counts of the nucleotides entering and exiting the window, the function now correctly maintains constant time for each window move.
The debug print statements are included to visually demonstrate the changes in nucleotide frequencies as the window slides across the DNA sequence. This helped confirm the correctness of the solution and provided insight into the algorithm's step-by-step operations.
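As a usage sketch tying this back to the coding-region motivation above, the window frequencies translate directly into per-window GC content (gc_fractions is a hypothetical helper, not part of the exercise):
def gc_fractions(s, k):
    # Fraction of C and G in each length-k window, via the sliding_window generator.
    for freq in sliding_window(s, k):
        yield (freq['C'] + freq['G']) / k

if __name__ == '__main__':
    print([round(x, 2) for x in gc_fractions("TCCCGACGGCCTTGCC", 4)])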
Our models so far have been so-called zero-order models, as each event has been independent of other events. With sequences, the dependencies of events are naturally encoded by their contexts. Considering that a sequence is produced from left to right, a first-order context for $T[i]$ is $T[i-1]$, that is, the immediately preceding symbol. A first-order Markov chain is a sequence produced by generating $c=T[i]$ with the probability of the event of seeing symbol $c$ after the previously generated symbol $a=T[i-1]$. The first symbol of the chain is sampled according to the zero-order model.
The first-order model can naturally be extended to contexts of length $k$, with $T[i]$ depending on $T[i-k..i-1]$. Then the first $k$ symbols of the chain are sampled according to the zero-order model. The following assignments develop the routines to work with the higher-order Markov chains.
In what follows, a $k$-mer is a substring $T[i..i+k-1]$ of the sequence at an arbitrary position.
10. Write function context_list that given an input DNA sequence $T$ associates to each $k$-mer $W$ the concatenation of all symbols $c$ that appear after context $W$ in $T$, that is, $T[i..i+k]=Wc$. For example, GA is associated to TCT in $T$=ATGATATCATCGACGATGTAG, when $k=2$.
DEBUG10 = False # Set to [ True ] to enable debug output, [ False ] to disable
def context_list(s, k):
context_dict = {}
    # iterate over sequence, processing chunks of length k (k-mer) and the following context
for i in range(len(s)-k):
if DEBUG10:
print(f'{s[:i]}{YELLOW}{s[i:i+k]}{RED}{s[i+k]}{RESET}{s[i+k+1:]}')
# if k-mer exists in dictionary, append next letter in sequence, to the string representing its context
if s[i:i+k] in context_dict:
context_dict[s[i:i+k]] += s[i+k]
# else set next letter as context
else:
context_dict[s[i:i+k]] = s[i+k]
if DEBUG10:
print(f'{YELLOW}{s[i:i+k]}{RESET} has context {RED}{context_dict[s[i:i+k]]}{RESET}')
print('-------------')
# return the dictionary containing k-mers and their contexts
return context_dict
if __name__ == '__main__':
k = 2
s = "ATGATATCATCGACGATCTAG"
d = context_list(s, k)
print(d)
{'AT': 'GACCC', 'TG': 'A', 'GA': 'TCT', 'TA': 'TG', 'TC': 'AGT', 'CA': 'T', 'CG': 'AA', 'AC': 'G', 'CT': 'A'}
The solution uses the sliding window approach to examine each k-mer (substring of length k) within the given sequence s.
This approach efficiently builds a dictionary of k-mer contexts using simple string slicing and dictionary operations.
The implementation of the context_list function is straightforward and effectively solves the problem by utilizing a simple loop to manage the sliding window across the string s. By directly manipulating the dictionary to store or update the context of each k-mer, the code avoids unnecessary complexity and remains readable.
Debug output allows for step-by-step tracing of the program’s operation.
The solution is again quite barebones, in that it doesn't check extensively for edge cases or errors. For example, the function does not explicitly handle cases where the length of the string s is less than k. In such cases, the function returns an empty dictionary, since there are no valid k-mers to process. Although that is the correct behaviour in this instance, it is an example of how edge cases could affect the program's behaviour.
11. With the above solution, write function context_probabilities to count the frequencies of symbols in each context and convert these frequencies into probabilities. Run context_probabilities with $T=$ ATGATATCATCGACGATGTAG and $k$ values 0 and 2.
DEBUG11 = False # Set to [ True ] to enable debug output, [ False ] to disable
def context_probabilities(s, k):
# get context using provided sequence string and k-value
context_dict = context_list(s, k)
# initialize return dictionary
context_prob_dict = {}
if DEBUG11:
print(f'context_dict: {context_dict} \n')
for key in context_dict:
        # for each of the kmers in dict get context characters into separate freq_dict (i.e. 'AT': 'GACCG' -> {'G': 2, 'A': 1, 'C': 2})
chars = context_dict[key]
freq_dict = {}
for char in chars:
if char not in freq_dict:
freq_dict[char] = 1
else:
freq_dict[char] += 1
# get the sum of frequencies (ie. {'G': 2, 'A': 1, 'C': 2} -> 5)
total_freq = sum(freq_dict.values())
# using dictionary comprehension calculate each value as a percentage of the sum number (ie. {'G': 2, 'A': 1, 'C': 2} -> {'G': 2/5, 'A': 1/5, 'C': 2/5} -> {'G': 0.4, 'A': 0.2, 'C': 0.4})
context_prob_dict[key] = {char: freq / total_freq for char, freq in freq_dict.items()}
if DEBUG11:
print(f'key: {L_ORANGE}{key}{RESET}')
print(f'freq_dict : {freq_dict}')
print(f'total_freq : {total_freq}')
print(f'context_prob : {context_prob_dict[key]}')
print('-------------')
# return above kmer key and above percentage dict
return context_prob_dict
if __name__ == '__main__':
s = "ATGATATCATCGACGATGTAG"
k = 2
print(context_probabilities(s, k))
k = 0
print(context_probabilities(s, k))
{'AT': {'G': 0.4, 'A': 0.2, 'C': 0.4}, 'TG': {'A': 0.5, 'T': 0.5}, 'GA': {'T': 0.6666666666666666, 'C': 0.3333333333333333}, 'TA': {'T': 0.5, 'G': 0.5}, 'TC': {'A': 0.5, 'G': 0.5}, 'CA': {'T': 1.0}, 'CG': {'A': 1.0}, 'AC': {'G': 1.0}, 'GT': {'A': 1.0}}
{'': {'A': 0.3333333333333333, 'T': 0.2857142857142857, 'G': 0.23809523809523808, 'C': 0.14285714285714285}}
The solution creates a dictionary of k-mers (substrings of length k) and the probabilities of subsequent characters, based on a given sequence. The function context_probabilities returns a dictionary which provides a probabilistic model of the sequence's structure based on k-mer contexts.
This exercise extends the logic of the previous exercise by calculating the probabilities of each character following specific k-mers. The implementation is straightforward and reuses previous logic with slight modifications. The debug output matches the expected results, demonstrating that the function correctly calculates probabilities. Further enhancements could include handling more edge cases, such as when k is larger than the length of s.
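For comparison, the per-context counting could also be done with collections.Counter; a minimal sketch (hypothetical name, equivalent in output to context_probabilities above):
from collections import Counter

def context_probabilities_counter(s, k):
    result = {}
    for kmer, chars in context_list(s, k).items():
        counts = Counter(chars)          # symbol frequencies in this context
        total = sum(counts.values())
        result[kmer] = {c: n / total for c, n in counts.items()}
    return result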
12. With the above solution and the function random_event from the earlier exercise, write class MarkovChain. Its generate method should generate a random DNA sequence following the original $k$-th order Markov chain probabilities.
DEBUG12 = False # Set to [ True ] to enable debug output, [ False ] to disable
class MarkovChain:
def __init__(self, zeroth, kth, k=2):
self.k = k
self.zeroth = zeroth
self.kth = kth
def generate(self, n, seed=None):
# set the seed for random number generation to ensure reproducibility of results
if seed is not None:
random.seed(seed)
if DEBUG12:
print('GENERATE INITIAL CHARACTERS')
print(f'zeroth: {self.zeroth}')
print(f'k: {self.k}')
# generate characters upto len k, using zeroth dict probabilities
gen_chars = ""
i = 0
for i in range(self.k):
initial = random_event(self.zeroth)
if DEBUG12:
print(f'i: {i}, initial: {initial}')
gen_chars += initial
if DEBUG12:
print(f'gen_chars: {gen_chars}')
print('\nGENERATE REMAINING CHARACTERS')
print(f'kth: {self.kth}')
# generate the remaining n - k characters, using kth dict probabilities
while len(gen_chars) < n:
# to pass the test "TestGenerateMarkov: test_length" we need to ensure that we account for cases where k-mer does not exist in kth (ie. generate: C generate: C ; but there is no CC in kth)
prev_k_chars = gen_chars[-self.k:]
if DEBUG12:
print(f'gen_chars : {gen_chars}')
print(f'gen_chars[-self.k:]: {gen_chars[-self.k:]}')
# if we find the previous k-mer in kth, add random generated char to sequence using kth/k-mer
gen_char = ''
if prev_k_chars in self.kth:
gen_char = random_event(self.kth[prev_k_chars])
gen_chars += gen_char
else:
# if the k-mer does not exist in kth, randomly select a value using zeroth
fallback_char = random_event(self.zeroth)
gen_chars += fallback_char
if DEBUG12:
if gen_char:
print(f'gen_char : {gen_char}')
print('----------')
else:
print(f'fallback_char : {fallback_char}')
print('-----------------------')
return gen_chars[:n]
if __name__ == '__main__':
zeroth = {'A': 0.2, 'C': 0.19, 'T': 0.31, 'G': 0.3}
kth = {'GT': {'A': 1.0, 'C': 0.0, 'T': 0.0, 'G': 0.0},
'CA': {'A': 0.0, 'C': 0.0, 'T': 1.0, 'G': 0.0},
'TC': {'A': 0.5, 'C': 0.0, 'T': 0.0, 'G': 0.5},
'GA': {'A': 0.0, 'C': 0.3333333333333333, 'T': 0.6666666666666666, 'G': 0.0},
'TG': {'A': 0.5, 'C': 0.0, 'T': 0.5, 'G': 0.0},
'AT': {'A': 0.2, 'C': 0.4, 'T': 0.0, 'G': 0.4},
'TA': {'A': 0.0, 'C': 0.0, 'T': 0.5, 'G': 0.5},
'AC': {'A': 0.0, 'C': 0.0, 'T': 0.0, 'G': 1.0},
'CG': {'A': 1.0, 'C': 0.0, 'T': 0.0, 'G': 0.0}}
n = 10
seed = 0
mc = MarkovChain(zeroth, kth)
print(mc.generate(n, seed))
GGTAGTATCG
The solution has a class MarkovChain that generates a random DNA sequence based on the given Markov chain probabilities.
The class is initialized with a zeroth probability dictionary (zeroth) and a k-th order transition probability dictionary (kth).
The generate method is the core of the program. It starts by optionally setting a random seed for reproducibility, ensuring that the sequence generation can be replicated.
The first k characters of the sequence are generated using the zeroth-order probabilities, as the Markov chain cannot determine transitions without an initial state.
After initializing the first k characters, the method generates the remaining characters using k-th order probabilities.
It does this by looking at the last k characters (the current k-mer) and choosing the next character based on the transition probabilities defined in kth.
If a k-mer does not exist in kth, the algorithm falls back to the zeroth-order probabilities, ensuring that the sequence generation continues.
Initially, the problem seemed straightforward, but the tests made it more challenging. The requirement to handle missing k-mers in the kth dictionary was not immediately obvious from the task description. The approach taken uses a while loop to manage sequence generation which ensures flexibility. The use of a fallback mechanism for missing k-mers makes the solution robust against incomplete transition matrices.
Although not strictly related to the functioning of the solution, a potential improvement could be making the debug statements more concise. They make the code more verbose, but I find the output helpful for visualizing the sequence of events.
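The fallback path is easy to exercise with a deliberately incomplete transition dictionary; in this toy sketch (hypothetical inputs) most contexts are missing from kth, so generation repeatedly falls back to the zeroth-order model:
if __name__ == '__main__':
    toy_zeroth = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    toy_kth = {'AC': {'A': 0.0, 'C': 1.0, 'G': 0.0, 'T': 0.0}}  # e.g. 'CC' is absent
    print(MarkovChain(toy_zeroth, toy_kth, k=2).generate(10, seed=1))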
If you have survived so far without problems, please run your program a few more times with different inputs. At some point you should get a lookup error in your hash table! The reason for this is not your code, but the way we defined the model: some $k$-mers may not be among the training data (input sequence $T$), yet such a $k$-mer can be generated as the first $k$-mer, which is sampled using the zero-order model.
A general approach to fixing such issues with incomplete training data is to use pseudo counts. That is, all imaginable events are initialized to frequency count 1.
13. Write a new solution context_pseudo_probabilities based on the solution to problem 11. But this time use pseudo counts in order to obtain a $k$-th order Markov chain that can assign a probability to any DNA sequence. You may use the standard library function itertools.product to iterate over all $k$-mers of a given length (product("ACGT", repeat=k)).
DEBUG13 = False # Set to [ True ] to enable debug output, [ False ] to disable
def context_pseudo_probabilities(s, k):
nucleotides = "ACGT"
pseudo_count = 1 # pseudo-count value to add
# handle k == 0
if k == 0:
# add dict with pseudo_count value for each nucleotide
counts_dict = {nucleotide: pseudo_count for nucleotide in nucleotides}
if DEBUG13:
print(f'counts_dict (k==0): {counts_dict}')
# increment value for those nucleotides that exist in sequence
for nucleotide in s:
if nucleotide in counts_dict:
counts_dict[nucleotide] += 1
# calculate probabilities
total_count = sum(counts_dict.values())
zeroth_pseudo_probabilities = {nucleotide: count / total_count for nucleotide, count in counts_dict.items()}
if DEBUG13:
print(f'count incremented : {s}')
print(f'counts_dict (k==0): {counts_dict}')
print()
return {"": zeroth_pseudo_probabilities}
# handle k > 0
else:
# generate all possible k-mers
kmers = [''.join(kmer) for kmer in product(nucleotides, repeat=k)]
# initialize k-mers with pseudo-count
counts_dict = {kmer: {nucleotide: pseudo_count for nucleotide in nucleotides} for kmer in kmers}
if DEBUG13:
print(f'\ncounts_dict (k>0): {counts_dict}')
# update counts based on the given sequence
for i in range(len(s) - k):
kmer = s[i:i+k]
next_nucleotide = s[i+k]
if kmer in counts_dict:
counts_dict[kmer][next_nucleotide] += 1
# calculate probabilities
kth_pseudo_probabilities = {}
for kmer, nucleotide_counts in counts_dict.items():
total = sum(nucleotide_counts.values())
kth_pseudo_probabilities[kmer] = {nucleotide: count / total for nucleotide, count in nucleotide_counts.items()}
if DEBUG13:
print(f'count incremented: {s}')
print(f'counts_dict (k>0): {counts_dict}')
print(f'sum values count to divide with: {total}')
print(f'\nkth:')
return kth_pseudo_probabilities
if __name__ == '__main__':
k = 2
s = "ATGATATCATCGACGATGTAG" #"ATG"
zeroth = context_pseudo_probabilities(s, 0)[""]
print(f"zeroth: {zeroth}")
kth = context_pseudo_probabilities(s, k)
print("\n".join(f"{k}: {dict(v)}" for k, v in kth.items()))
print("\n", MarkovChain(zeroth, kth, k).generate(20))
zeroth: {'A': 0.32, 'C': 0.16, 'G': 0.24, 'T': 0.28}
AA: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AC: {'A': 0.2, 'C': 0.2, 'G': 0.4, 'T': 0.2}
AG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AT: {'A': 0.2222222222222222, 'C': 0.3333333333333333, 'G': 0.3333333333333333, 'T': 0.1111111111111111}
CA: {'A': 0.2, 'C': 0.2, 'G': 0.2, 'T': 0.4}
CC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
CG: {'A': 0.5, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.16666666666666666}
CT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GA: {'A': 0.14285714285714285, 'C': 0.2857142857142857, 'G': 0.14285714285714285, 'T': 0.42857142857142855}
GC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GT: {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2}
TA: {'A': 0.16666666666666666, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.3333333333333333}
TC: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.16666666666666666}
TG: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.3333333333333333}
TT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

 TGATGATTTTCGTGCAGGTT
The task requires implementing a pseudo-count based context probability model to handle unseen k-mers in a DNA sequence. The code has different approaches based on the value of k:
k = 0: each nucleotide count is initialized to the pseudo-count 1 and then incremented for every occurrence in the sequence; the counts are finally normalized into zeroth-order probabilities.
k > 0: every possible k-mer is generated with itertools.product and initialized with pseudo-counts for all four following nucleotides; the observed transitions increment these counts before normalization.
This approach ensures that all possible k-mers have a non-zero probability, even if they were not observed in the training data.
Initially, the task was difficult for me to complete, as I wasn't sure how to proceed or how to satisfy all the tests. I tried using earlier functions as a base for completing this task, but all such implementations quickly became overly complex.
The solution addresses the issue of unseen k-mers by initializing all possible k-mers with a pseudo-count of 1. This approach prevents zero probabilities for k-mers not present in the sequence. The results from the provided example demonstrate that the function correctly computes both zeroth and kth order probabilities. Further testing and handling for edge cases could probably still improve the code.
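The effect of the pseudo counts is visible directly in the output above: the 2-mer AA never occurs in the training sequence, yet it receives the uniform distribution instead of causing a lookup error:
if __name__ == '__main__':
    kth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", 2)
    print(kth["AA"])  # {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}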
14. Write class MarkovProb that, given the $k$-th order Markov chain developed above to the constructor, its method probability computes the probability of a given input DNA sequence.
DEBUG14 = False # Set to [ True ] to enable debug output, [ False ] to disable
class MarkovProb:
def __init__(self, k, zeroth, kth):
self.k = k
self.zeroth = zeroth
self.kth = kth
def probability(self, s):
if DEBUG14:
print(f'zeroth {self.zeroth}')
print(f'kth {self.kth}')
print(f'input sequence: {s}')
print()
#handle k == 0
if self.k == 0:
# Initialize the cumulative probability to 1
prob = 1.0
# Create list containing probability for each nucleotide in the sequence
probs = [self.zeroth[c] for c in s]
# Calculate the cumulative product of all probabilities
prob = reduce(lambda x, y: x * y, probs, 1.0)
if DEBUG14:
print(f'zeroth probabilities of sequences individual neucleotides, placed in list: {probs}')
print(f'cumulative product of all the values in the list: {prob}')
# return product of the probabilities
return(prob)
# handle k > 0
else:
# initialize the cumulative probability to 1
prob = 1.0
# calculate the probability of forming initial k-mer
initial_kmer = s[:self.k]
initial_prob = reduce(lambda x, y: x * y, [self.zeroth.get(c, 0) for c in initial_kmer], 1.0)
if DEBUG14:
print(f'i-mer: {initial_kmer} | | probability {initial_prob}')
# combine the initial k-mer probability with the transition probabilities
prob *= initial_prob
# Calculate the transition probabilities for the rest of the sequence
for i in range(len(s) - self.k):
kmer = s[i:i+self.k]
next_nucleotide = s[i+self.k]
if kmer in self.kth:
prob *= self.kth[kmer][next_nucleotide]
if DEBUG14:
print(f'k-mer: {kmer} | nucleotide {next_nucleotide} | probability {prob}')
# return product of the probabilities
return prob
if __name__ == '__main__':
k = 2
kth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", k)
zeroth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", 0)[""]
mc = MarkovProb(2, zeroth, kth)
#test with k = 0
#mc = MarkovProb(0, zeroth, kth)
s="ATGATATCATCGACGATGTAG"
print(f"Probability of sequence {s} is {mc.probability(s)}")
Probability of sequence ATGATATCATCGACGATGTAG is 2.831270190340017e-10
The solution uses class MarkovProb to calculate the probability of a DNA sequence based on a given Markov model. Again, the code has different approaches based on the value of k:
For k = 0: The method iterates through each nucleotide in the sequence, fetches its probability from the zeroth-order probabilities, adds it to a list, and computes the product of these probabilities.
For k > 0: The method starts by calculating the probability of the initial k-mer based on individual nucleotide probabilities. It multiplies this into the overall probability value; then it iteratively calculates the transition probabilities for each subsequent nucleotide following the k-mers. The total probability is the product of the initial k-mer probability and all transition probabilities.
During implementation, an issue was encountered for cases where k > 0. Initially the computation didn't correctly account for the initial k-mer's probability, leading to incorrect results. This was corrected by separately calculating the probability of the initial k-mer and then using it as the base into which the subsequent transition probabilities are multiplied.
The use of functional constructs helped make the code more concise and readable. However, this also introduces a potential risk for debugging since these constructs can obscure the detailed flow of calculations. I tried to find a balance between these in the code.
Overall, the code works correctly for the given inputs and tests while also being relatively optimized. More comprehensive testing across various sequences, k-values, and probabilities could further increase robustness.
With the last assignment you might end up in trouble with precision, as multiplying many small probabilities gives a really small number in the end. There is an easy fix by using the so-called log-transform. Consider computation of $P=s_1 s_2 \cdots s_n$, where $0\leq s_i\leq 1$ for each $i$. Taking logarithm in base 2 on both sides gives $\log_2 P = \log_2 (s_1 s_2 \cdots s_n) = \log_2 s_1 + \log_2 s_2 + \cdots + \log_2 s_n = \sum_{i=1}^n \log_2 s_i$, with repeated application of the property that the logarithm of a product of two numbers is the sum of the logarithms of the two numbers taken separately. The result is abbreviated as log-probability.
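A quick numeric check of this identity, using the isoleucine probabilities from earlier (math.prod requires Python 3.8+):
import math

probs = [0.36, 0.47, 0.17]
print(math.prod(probs))                       # ~0.028764, direct product
print(2 ** sum(math.log2(p) for p in probs))  # same value, recovered from the log-probability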
15. Write class MarkovLog that, given the $k$-th order Markov chain developed above to the constructor, its method log_probability computes the log-probability of a given input DNA sequence. Run your program with $T=$ ATGATATCATCGACGATGTAG and $k=2$.
DEBUG15 = False # Set to [ True ] to enable debug output, [ False ] to disable
class MarkovLog(object):
def __init__(self, k, zeroth, kth):
self.k = k
self.zeroth = zeroth
self.kth = kth
def log_probability(self, s):
if DEBUG15:
print(f's = {YELLOW}{s}{RESET}')
print(f'k = {self.k}')
print(f'zeroth = {self.zeroth}')
print(f'kth = {self.kth}')
print()
# initialize probability to log space (log(1) = 0)
log_prob = 0.0
# calculate the log-probability of forming initial k-mer
initial_kmer = s[:self.k]
initial_log_prob = reduce(lambda x, y: x + math.log2(self.zeroth[y]), initial_kmer, 0)
if DEBUG15:
print(f'initial kmer: {initial_kmer} | log probability: {initial_log_prob}')
# calculate log transition probabilities
for i in range(len(s) - self.k):
kmer = s[i:i+self.k]
next_nucleotide = s[i+self.k]
if kmer in self.kth:
log_prob += math.log2(self.kth[kmer][next_nucleotide])
if DEBUG15:
print(f'k-mer: {YELLOW}{kmer}{RESET} | nucleotide {YELLOW}{next_nucleotide}{RESET} | transition probability: {self.kth[kmer][next_nucleotide]:.2f} | transition LOG probability: {math.log2(self.kth[kmer][next_nucleotide]):.2f} | total transition LOG probability {log_prob:.2f}')
# combine the initial k-mer probability with the transition probabilities
total_log_prob = initial_log_prob + log_prob
total_log_prob = round(total_log_prob, 5)
if DEBUG15:
print(f'initial log probability: {initial_log_prob} | total transition log probability: {log_prob} | final log probability: {total_log_prob}')
print()
return(total_log_prob)
if __name__ == '__main__':
k = 2
kth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", k)
zeroth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", 0)[""]
mc = MarkovLog(2, zeroth, kth)
#mc = MarkovLog(0, zeroth, kth)
s="ATGATATCATCGACGATGTAG"
print(f"Log probability of sequence {s} is {mc.log_probability(s)}")
Log probability of sequence ATGATATCATCGACGATGTAG is -31.71783
The goal of this solution is to calculate the log-probability of a given DNA sequence based on the Markov model. Here's how the solution works:
Initialize the Log Probability: Start with a log-probability of 0.0, which is equivalent to a probability of 1 in log space.
Compute Initial k-mer Log-Probability: The probability of observing the initial k-mer (the first k characters of the sequence) is computed by summing the log2 probabilities of each character's occurrence, based on the zeroth-order probabilities.
Iteratively Calculate Transition Log-Probabilities: For each subsequent nucleotide in the sequence, compute the conditional log-probability given the preceding k-mer using the k-th order probabilities. Add these log-probabilities to the running total.
Combine Results: The total log-probability is obtained by summing the initial k-mer log-probability with the sum of transition log-probabilities. The final log-probability is rounded and returned as the result.
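As a sanity check on the example run above, the reported log-probability of -31.71783 corresponds to a plain probability of $2^{-31.71783} \approx 2.8 \times 10^{-10}$; for a 21-symbol sequence the direct product is still representable, but it is easy to see how longer inputs would drive it into underflow territory.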
The program follows a similar pattern as the last task, now calculating the log probability. It performs well with the given example sequence ATGATATCATCGACGATGTAG and k=2. The calculated log-probability matches expected values based on the provided zeroth and k-th order probabilities. It also passes all the tests.
The main problem I had with this task was realizing that I should use log2 to get the correct results, which led to quite a lot of debugging, although it was clear on reading the task again. A difference compared to the previous task is that there isn't any code separately handling input where k is 0. It didn't appear to be a requirement from the task description and the tests passed without it, so to keep the code more concise I left that part out.
Finally, if you try to use the code so far on very large inputs, you might observe that the concatenations of symbols following a context occupy a considerable amount of space. This is unnecessary, as we only need the frequencies.
16. Optimize the space requirement of your code from exercise 13 for the $k$-th order Markov chain by replacing the concatenations by direct computations of the frequencies. Implement this as the
better_context_probabilities
function.
DEBUG16 = False # Set to [ True ] to enable debug output, [ False ] to disable
def better_context_probabilities(s, k):
# initialize variables
nucleotides = "ACGT"
pseudo_count = 1
# handle k == 0
if k == 0:
        counts_dict = defaultdict(lambda: pseudo_count) # SPACE IMPROVEMENT
if DEBUG16:
print(f's: {s}')
print(f'counts_dict pseudo count: {counts_dict}')
for nucleotide in s:
counts_dict[nucleotide] += 1
if DEBUG16:
print(f'counts_dict after increment: {counts_dict}')
total = sum(counts_dict.values())
return {"": {n: counts_dict[n] / total for n in nucleotides}}
# initialize counts for all possible k-mers
counts_dict = {
''.join(kmer): {n: pseudo_count for n in nucleotides} # SPACE IMPROVEMENT
for kmer in product(nucleotides, repeat=k)
}
if DEBUG16:
print(f'counts_dict pseudo count: {counts_dict}')
# count occurrences in the sequence
for i in range(len(s) - k):
kmer = s[i:i+k]
next_nucleotide = s[i+k]
counts_dict[kmer][next_nucleotide] += 1
if DEBUG16:
print(f'counts_dict after increment: {counts_dict}')
# calculate probabilities
probabilities = {}
for kmer, nucleotide_counts in counts_dict.items():
total = sum(nucleotide_counts.values())
probabilities[kmer] = {n: count / total for n, count in nucleotide_counts.items()}
return probabilities
if __name__ == '__main__':
k = 2
# test k == 0
#k = 0
s = "ATGATATCATCGACGATGTAG"
d = better_context_probabilities(s, k)
print("\n".join(f"{k}: {v}" for k, v in d.items()))
AA: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AC: {'A': 0.2, 'C': 0.2, 'G': 0.4, 'T': 0.2}
AG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AT: {'A': 0.2222222222222222, 'C': 0.3333333333333333, 'G': 0.3333333333333333, 'T': 0.1111111111111111}
CA: {'A': 0.2, 'C': 0.2, 'G': 0.2, 'T': 0.4}
CC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
CG: {'A': 0.5, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.16666666666666666}
CT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GA: {'A': 0.14285714285714285, 'C': 0.2857142857142857, 'G': 0.14285714285714285, 'T': 0.42857142857142855}
GC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GT: {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2}
TA: {'A': 0.16666666666666666, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.3333333333333333}
TC: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.16666666666666666}
TG: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.3333333333333333}
TT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
The test already passed with my original code from task 13. In that implementation (context_dict = context_list(s, k)), the concatenation of symbols into a string denoting the full context is not performed (as it is in exercise 11). In light of this, the solution contains some minor improvements, marked by the SPACE IMPROVEMENT comments above and detailed further below.
ORIGINAL: the symbols following each context were collected explicitly, as in the exercise 11 approach.
OPTIMIZED: for k == 0, counts_dict = defaultdict(lambda: pseudo_count) keeps a lazily created integer counter per nucleotide instead of any collected symbols.
BENEFIT: at most four counters are held in memory, never the observed symbols themselves.
ORIGINAL: each context's distribution was computed from its collected follower symbols.
OPTIMIZED: for k > 0, every possible k-mer maps directly to per-nucleotide counts initialized at the pseudo count, and the single scan over s merely increments these counters.
BENEFIT: space is bounded by the 4^k contexts times four counters, independent of the length of s.
As mentioned above, I wasn't quite sure how to proceed with this task, as the original implementation already passed the tests. I think there would have been a wider range of solutions to consider if initializing pseudo-counts had not also been necessary. As the task and the tests were set out, however, I felt somewhat restricted in making significant changes and settled for a few minor improvements. The changes ensure that the solution optimizes memory usage for both the k == 0 and k > 0 cases by directly computing frequencies instead of storing extensive intermediate data.
While the earlier approach of explicitly concatenating the symbols that follow a context suffered from inefficient use of space, it has the benefit of suggesting another, much simpler sampling strategy: an element of the concatenation picked uniformly at random is sampled with exactly the correct probability.
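A small aside sketching this idea (the variable names here are illustrative, not part of the exercise scaffold): for the context AT in the example string, a uniform pick from the list of observed followers reproduces the empirical transition probabilities.
from collections import Counter
from numpy.random import choice

s = "ATGATATCATCGACGATGTAG"
k = 2
# one list entry per occurrence of the context "AT"
followers = [s[i + k] for i in range(len(s) - k) if s[i:i + k] == "AT"]
print(followers)                  # ['G', 'A', 'C', 'C', 'G']
print(Counter(followers))         # G and C twice each, A once
# a uniform pick selects each symbol with probability count/len(followers),
# i.e. exactly the empirical transition probability of that symbol after "AT"
print(choice(followers))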
17. Revisit the solution 12 and modify it to directly sample from the concatenation of symbols following a context. The function np.random.choice
may be convenient here. Implement the modified version as the new SimpleMarkovChain
class.
DEBUG17 = False # Set to [ True ] to enable debug output, [ False ] to disable
class SimpleMarkovChain(object):
def __init__(self, s, k):
self.k = k
self.s = s
def generate(self, n, seed=None):
if DEBUG17:
print(f's: {self.s}')
print(f'n: {n}')
print(f'k: {self.k}')
# initialize seed
if seed is not None:
np.random.seed(seed)
gen_chars = ""
initial_lst = list(self.s)
# handle k = 0 as a special case
if self.k == 0:
gen_chars = ''.join(np.random.choice(initial_lst, n))
if DEBUG17:
print(f'gen_chars: {gen_chars}')
return gen_chars
# generate initial k characters
gen_chars = ''.join(np.random.choice(initial_lst, self.k))
if DEBUG17:
print(f'initial gen_chars: {gen_chars}')
# generate transition list
transition_list = [self.s[i:i+self.k+1] for i in range(len(self.s) - self.k)]
if DEBUG17:
print(f'transition_list: {transition_list}') # ATGATATCATCGACGATGTAG -> ['ATG', 'TGA', 'GAT', 'ATA', 'TAT', 'ATC', 'TCA', 'CAT', 'ATC', 'TCG', 'CGA', 'GAC'....
print('------------------')
        # loop through the sequence, for each k-mer choosing among the possible transitions
        # if no transitions are possible, choose randomly from the initial characters
while len(gen_chars) < n:
            # extract kmer
kmer = gen_chars[-self.k:]
if DEBUG17:
print(f'kmer: {kmer}')
# find all possible next characters for kmer
transition_chars = [t[-1] for t in transition_list if t.startswith(kmer)]
if DEBUG17:
print(f'transition_chars: {transition_chars}')
            # if the last generated kmer exists in the original transition list, pick from there
# otherwise use fallback
# add generated transition character to sequence of all generated characters
if transition_chars:
gen_char = np.random.choice(transition_chars)
if DEBUG17:
print(f'gen_char: {gen_char}')
gen_chars += gen_char
if DEBUG17:
print(f'gen_chars: {gen_chars}')
print('------------------')
else:
fallback_char = np.random.choice(initial_lst)
if DEBUG17:
print(f'fallback_char: {fallback_char}')
gen_chars += fallback_char
if DEBUG17:
print(f'gen_chars: {gen_chars}')
print('------------------')
return gen_chars[:n]
if __name__ == '__main__':
k = 2
# test k == 0
#k = 0
s = "ATGATATCATCGACGATGTAG"
n = 10
seed = 7
#seed = 6
mc = SimpleMarkovChain(s, k)
print(mc.generate(n, seed))
'''
s="ATGATATCATCGACGATGTAG"
seed=0
for n in range(40):
for k in range(4):
mc = SimpleMarkovChain(s, k)
s2 = mc.generate(n, seed)
print(f's2: {s2}')
'''
ATATCGATAT
The solution to this task is based on directly using the concatenation of symbols following a given context to sample from the distribution efficiently.
It begins by initializing the random seed using np.random.seed(seed) to ensure reproducibility.
For the special case when k = 0, the context length is zero, which means there are no preceding characters to consider. In this case, the solution generates n random characters directly from the input string s. This is achieved by uniformly sampling characters using np.random.choice.
When k > 0, the solution starts by generating an initial string of k random characters from the input.
It then constructs a list of all possible k+1-length substrings (transition_list) from the input string s. These substrings represent all possible transitions, where each transition is a k-mer followed by a single character that can follow that k-mer.
The core of the solution is a loop that extends the generated sequence character by character.
By directly sampling from the list of possible transitions rather than maintaining a complex probability distribution or dictionary, the solution efficiently generates sequences according to the specified Markov chain order.
The solution correctly handles both the case where k = 0 and cases where k > 0, generating sequences based on the input string's statistical properties. For k = 0, the output is a random sequence of characters from s, which matches expectations since there is no dependency on previous characters.
For k > 0, the generated sequence reflects the input string's character transitions for the specified k-mer length. The debug output shows the step-by-step generation process, including the construction of the transition list and the selection of each character. This helps ensure the logic aligns with the Markov chain principles.
During development, one of the challenges was handling cases where no valid transitions were found for a given k-mer, which could lead to an empty transition list and errors. The current solution mitigates this by including a fallback mechanism that samples from the original input string, thus ensuring that the generation process continues regardless of the transition list content.
Our $k$-th order Markov chain can now be modified into a handy index structure called a $k$-mer index. This index structure associates to each $k$-mer its list of occurrence positions in the DNA sequence $T$. Given a query $k$-mer $W$, one can thus easily list all positions $i$ with $T[i..i+k-1]=W$.
18. Implement function kmer_index
inspired by your earlier code for the $k$-th order Markov chain. Test your program with ATGATATCATCGACGATGTAG
and $k=2$.
DEBUG18 = False # Set to [ True ] to enable debug output, [ False ] to disable
def kmer_index(s, k):
kmer_indices_dict = {}
# handle k = 0 as a special case
if k == 0:
zeroth_indices_dict = {}
zeroth_indices_dict[''] = list(range(len(s)+1))
return zeroth_indices_dict
# iterate over all possible k-mer starting positions
for i in range(len(s) - k + 1):
# extract kmer
kmer = s[i:i + k]
# append the index of kmer first character
if kmer not in kmer_indices_dict:
kmer_indices_dict[kmer] = []
kmer_indices_dict[kmer].append(i)
if DEBUG18:
print(f'Final k-mer index dictionary: {kmer_indices_dict}')
print("Indices collected for each k-mer:")
for kmer, indices in kmer_indices_dict.items():
print(f'K-mer: {YELLOW}{kmer}{RESET}, Indices: {YELLOW}{indices}{RESET}')
return kmer_indices_dict
if __name__ == '__main__':
k=2
# test k == 0
#k=0
s = "ATGATATCATCGACGATGTAG"
print("Using string:")
print(s)
print("".join([str(i%10) for i in range(len(s))]))
print(f"\n{k}-mer index is:")
d=kmer_index(s, k)
print(dict(d))
Using string:
ATGATATCATCGACGATGTAG
012345678901234567890

2-mer index is:
{'AT': [0, 3, 5, 8, 15], 'TG': [1, 16], 'GA': [2, 11, 14], 'TA': [4, 18], 'TC': [6, 9], 'CA': [7], 'CG': [10, 13], 'AC': [12], 'GT': [17], 'AG': [19]}
The objective of this task is to construct a k-mer index for a given DNA sequence s. The function kmer_index is implemented to achieve this:
Handle Special Case (k = 0): The function returns a dictionary with an empty string as the key and a list containing all possible positions in the sequence. This is because a 0-mer effectively means "no context" so every position is a valid occurrence.
General Case (k > 0): For values of k greater than 0, the function iterates over the sequence s, sliding a window of length k across the string to extract each k-mer and appending its starting index to the corresponding dictionary entry.
Debugging output is provided to print the final k-mer index dictionary and its contents, which helps in verifying correctness and understanding the distribution of k-mers in the sequence.
The solution creates a k-mer index by directly mapping each k-mer to its list of starting positions in the input string s.
Initially, the solution didn't account for k = 0, resulting in test failures and incorrect outputs. By explicitly handling this case, where every position in the sequence is a valid occurrence of a 0-mer, the solution now correctly outputs a dictionary.
During development, another issue that came up was that the function did not account for all starting positions. The debugging output was helpful in identifying logical errors and ensuring that the function correctly handles various values of k and the entire sequence s.
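With the index built, a query is a single dictionary lookup; a quick illustrative check (assuming kmer_index from above is in scope):
if __name__ == '__main__':
    s = "ATGATATCATCGACGATGTAG"
    d = kmer_index(s, 2)
    print(d["AT"])                                   # [0, 3, 5, 8, 15]
    # every reported position i starts an occurrence of the query k-mer
    assert all(s[i:i + 2] == "AT" for i in d["AT"])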
Now that we know how to learn probability distributions from data, we might want to compare two such distributions, for example, to test if our programs work as intended.
Let $P=\{p_1,p_2,\ldots, p_n\}$ and $Q=\{q_1,q_2,\ldots, q_n\}$ be two probability distributions for the same set of $n$ events. This means $\sum_{i=1}^n p_i=\sum_{i=1}^n q_i=1$, $0\leq p_j \leq 1$, and $0\leq q_j \leq 1$ for each event $j$.
Kullback-Leibler divergence is a measure $d()$ for the relative entropy of $P$ with respect to $Q$ defined as $d(P||Q)=\sum_{i=1}^n p_i \log\frac{p_i}{q_i}$.
This measure is always non-negative, and 0 only when $P=Q$. It can be interpreted as the gain of knowing $Q$ to encode $P$. Note that this measure is not symmetric.
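For example, using base-2 logarithms as the code below does: if $P=(1,0,0,0)$ puts all mass on one of four nucleotides and $Q=(0.25,0.25,0.25,0.25)$ is uniform, then $d(P||Q)=1\cdot\log_2\frac{1}{0.25}=2$ bits, taking $0\log 0 = 0$ for the remaining terms; $d(Q||P)$ is undefined, because $Q$ assigns positive probability to events where $P$ is zero. This is exactly the case exercised by the commented-out test at the end of the solution below.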
19. Write function kullback_leibler
to compute $d(P||Q)$. Test your solution by generating a random RNA sequence
encoding the input protein sequence according to the input codon adaptation probabilities.
Then you should learn the codon adaptation probabilities from the RNA sequence you generated.
Then try the same with uniformly random RNA sequences (which don't have to encode any
specific protein sequence). Compute the relative entropies between the
three distributions (original, predicted, uniform) and you should observe a clear difference.
Because $d(P||Q)$ is not symmetric, you can either print both $d(P||Q)$ and $d(Q||P)$,
or their average.
This problem may be fairly tricky. Only the kullback_leibler
function is automatically tested. A codon_probabilities helper function will probably be useful. The main guarded section can be completed by filling out the pass
sections using tooling from previous parts and fixing the placeholder lines.
DEBUG19 = False # Set to [ True ] to enable debug output, [ False ] to disable
#______________________________________________________________________________________________________________[codon_probabilities]
def codon_probabilities(rna):
"""
    Given an RNA sequence, simply calculates the probability of
all 3-mers empirically based on the sequence
"""
# extract the indices of all 3-mers in the RNA sequence
codon_sequence_index_list = kmer_index(rna, 3)
# calculate the total number of codons
total_count = sum(len(indices) for indices in codon_sequence_index_list.values())
# calculate the probability of each codon
probability_codon_seq_dict = {
codon_seq: len(indices) / total_count
for codon_seq, indices in codon_sequence_index_list.items()
}
if DEBUG19:
print(f'codon_sequence_index_list: {sorted(codon_sequence_index_list.items())}')
print(f'total_count: {total_count}')
print('indices count / total indices count ~')
return probability_codon_seq_dict
#______________________________________________________________________________________________________________[get_codon_probabilities_from_table]
def get_codon_probabilities_from_table(codon_table):
"""
Extracts and normalizes codon probabilities from a codon usage table string.
"""
codon_probabilities = {}
# regular expression to match codons and their frequencies
freq_pattern = re.compile(r'([AUGC]{3})\s+[A-Z\*]\s+\d+\.\d+\s+(\d+\.\d+)\s*\(\s*\d+\s*\)')
# find all matches in the codon_table
matches = freq_pattern.findall(codon_table)
# convert matches to a probability dictionary
for codon, frequency_str in matches:
frequency = float(frequency_str)
if codon != 'UAA' and codon != 'UGA' and codon != 'UAG': # Exclude stop codons
codon_probabilities[codon] = frequency / 1000 # Convert to probability
if DEBUG19:
print(f'matches: {sorted(matches)}')
print(f'codon_probabilities (percentage: frequency / 1000): {sorted(codon_probabilities.items())}')
print(f'normalized probability (percentage / by sum of percentages, ensures that they add up to 1) ~')
    # normalize probabilities, ensuring that they add up to 1
total = sum(codon_probabilities.values())
for codon in codon_probabilities:
codon_probabilities[codon] /= total
return codon_probabilities
#______________________________________________________________________________________________________________[generate_uniform_codon_probabilities]
def generate_uniform_codon_probabilities(codon_list):
"""
Generates a uniform probability distribution for a given list of codons.
"""
uniform_prob = 1 / len(codon_list)
return {codon: uniform_prob for codon in codon_list}
#______________________________________________________________________________________________________________[renormalize]
def renormalize(distribution, reference_distribution):
"""
Renormalizes a probability distribution so that it is based only on
the keys present in the reference distribution and normalizes the
probabilities to sum to 1.
"""
# filter distribution to only include keys present in reference_distribution
filtered_distribution = {
key: distribution[key] for key in distribution if key in reference_distribution
}
# sum the filtered probabilities
total = sum(filtered_distribution.values())
# normalize the probabilities if total is greater than 0
if total > 0:
normalized_distribution = {key: value / total for key, value in filtered_distribution.items()}
else:
normalized_distribution = filtered_distribution
return normalized_distribution
#______________________________________________________________________________________________________________[kullback_leibler]
def kullback_leibler(p, q):
"""
Computes Kullback-Leibler divergence between two distributions.
Both p and q must be dictionaries from events to probabilities.
The divergence is defined only when q[event] == 0 implies p[event] == 0.
"""
kl_divergence = 0
for event, p_value in p.items():
if event in q:
q_value = q[event]
if q_value == 0:
if p_value == 0:
continue # 0 * log(0/0) interpreted as 0
else:
raise ZeroDivisionError(f"q[{event}] is zero but p[{event}] is non-zero")
if p_value > 0:
kl_divergence += p_value * math.log(p_value / q_value, 2) # base-2 logarithm
return kl_divergence
#______________________________________________________________________________________________________________[MAIN]
if __name__ == '__main__':
aas = list("*ACDEFGHIKLMNPQRSTVWY") # List of amino acids
n = 10#000
#n = 10000
#* generate a random protein and some associated rna
protein = "".join(choice(aas, n))
print(f'protein: {protein}')
protein_to_random_rna = ProteinToRandomRNA()
random_rna_conversion = protein_to_random_rna.convert(protein)
print(f'random_rna_conversion: {random_rna_conversion}')
#* Maybe check that converting back to protein results in the same sequence
amino_conversion = rna_to_prot(random_rna_conversion)
print(f'amino_conversion: {amino_conversion}')
print()
#* Calculate codon probabilities of the rna sequence
# =========================================================================================================[1]
if DEBUG19:
print(f'{RED}STARTING CALCULATION PROCESS FOR:{RESET} {L_ORANGE}cp_predicted{RESET}')
cp_predicted = codon_probabilities(random_rna_conversion) # placeholder call
cp_predicted = dict(sorted(cp_predicted.items()))
print(f'{L_ORANGE}cp_predicted{RESET}: {cp_predicted}')
if DEBUG19:
for codon, prob in cp_predicted.items():
print(f'codon: {YELLOW}{codon}{RESET}, probability: {YELLOW}{prob}{RESET}')
print('********************************************************************************')
print()
#* Calculate codon probabilities based on the codon usage table
# =========================================================================================================[2]
if DEBUG19:
print(f'{RED}STARTING CALCULATION PROCESS FOR:{RESET} {L_ORANGE}cp_orig{RESET}')
codon_table = '''
UUU F 0.46 17.6 (714298) UCU S 0.19 15.2 (618711) UAU Y 0.44 12.2 (495699) UGU C 0.46 10.6 (430311)
UUC F 0.54 20.3 (824692) UCC S 0.22 17.7 (718892) UAC Y 0.56 15.3 (622407) UGC C 0.54 12.6 (513028)
UUA L 0.08 7.7 (311881) UCA S 0.15 12.2 (496448) UAA * 0.30 1.0 ( 40285) UGA * 0.47 1.6 ( 63237)
UUG L 0.13 12.9 (525688) UCG S 0.05 4.4 (179419) UAG * 0.24 0.8 ( 32109) UGG W 1.00 13.2 (535595)
CUU L 0.13 13.2 (536515) CCU P 0.29 17.5 (713233) CAU H 0.42 10.9 (441711) CGU R 0.08 4.5 (184609)
CUC L 0.20 19.6 (796638) CCC P 0.32 19.8 (804620) CAC H 0.58 15.1 (613713) CGC R 0.18 10.4 (423516)
CUA L 0.07 7.2 (290751) CCA P 0.28 16.9 (688038) CAA Q 0.27 12.3 (501911) CGA R 0.11 6.2 (250760)
CUG L 0.40 39.6 (1611801) CCG P 0.11 6.9 (281570) CAG Q 0.73 34.2 (1391973) CGG R 0.20 11.4 (464485)
AUU I 0.36 16.0 (650473) ACU T 0.25 13.1 (533609) AAU N 0.47 17.0 (689701) AGU S 0.15 12.1 (493429)
AUC I 0.47 20.8 (846466) ACC T 0.36 18.9 (768147) AAC N 0.53 19.1 (776603) AGC S 0.24 19.5 (791383)
AUA I 0.17 7.5 (304565) ACA T 0.28 15.1 (614523) AAA K 0.43 24.4 (993621) AGA R 0.21 12.2 (494682)
AUG M 1.00 22.0 (896005) ACG T 0.11 6.1 (246105) AAG K 0.57 31.9 (1295568) AGG R 0.21 12.0 (486463)
GUU V 0.18 11.0 (448607) GCU A 0.27 18.4 (750096) GAU D 0.46 21.8 (885429) GGU G 0.16 10.8 (437126)
GUC V 0.24 14.5 (588138) GCC A 0.40 27.7 (1127679) GAC D 0.54 25.1 (1020595) GGC G 0.34 22.2 (903565)
GUA V 0.12 7.1 (287712) GCA A 0.23 15.8 (643471) GAA E 0.42 29.0 (1177632) GGA G 0.25 16.5 (669873)
GUG V 0.46 28.1 (1143534) GCG A 0.11 7.4 (299495) GAG E 0.58 39.6 (1609975) GGG G 0.25 16.5 (669768)
'''
cp_orig = get_codon_probabilities_from_table(codon_table)
cp_orig = dict(sorted(cp_orig.items()))
print(f'{L_ORANGE}cp_orig{RESET}: {cp_orig}')
if DEBUG19:
for codon, prob in cp_orig.items():
print(f'codon: {YELLOW}{codon}{RESET}, probability: {YELLOW}{prob}{RESET}')
print('********************************************************************************')
print()
#* Create a completely random RNA sequence and get the codon probabilities
# =========================================================================================================[3]
#cp_uniform = codon_probabilities("<random rna sequence>") # placeholder call
if DEBUG19:
print(f'{RED}STARTING CALCULATION PROCESS FOR:{RESET} {L_ORANGE}cp_uniform{RESET}')
# list of possible codons
codons = [a+b+c for a in "ACGU" for b in "ACGU" for c in "ACGU"]
# generate a uniform codon probability distribution
cp_uniform = generate_uniform_codon_probabilities(codons)
cp_uniform = dict(sorted(cp_uniform.items()))
print(f'{L_ORANGE}cp_uniform{RESET}: {cp_uniform}')
if DEBUG19:
for codon, prob in cp_uniform.items():
print(f'codon: {YELLOW}{codon}{RESET}, probability: {YELLOW}{prob}{RESET}')
print('********************************************************************************')
print()
# **************************************************************************
# we renormalize the probability values to match the length of elements in cp_predicted
# kullback_leibler as implemented only calculates the distance for elements that exist in both
# but the probability before renormalizing is based on all elements
cp_orig_renormalized = renormalize(cp_orig, cp_predicted) # Renormalize based on cp_predicted
print()
print("d(original || predicted) =", kullback_leibler(cp_orig_renormalized, cp_predicted))
print("d(predicted || original) =", kullback_leibler(cp_predicted, cp_orig_renormalized))
print()
print("d(original || uniform) =", kullback_leibler(cp_orig, cp_uniform))
print("d(uniform || original) =", kullback_leibler(cp_uniform, cp_orig))
cp_uniform_renormalized = renormalize(cp_uniform, cp_predicted) # Renormalize based on cp_predicted
print()
print("d(predicted || uniform) =", kullback_leibler(cp_predicted, cp_uniform_renormalized))
print("d(uniform || predicted) =", kullback_leibler(cp_uniform_renormalized, cp_predicted))
'''
print()
p = dict(zip("ACGT", [1.0, 0.0, 0.0, 0.0]))
print(f'p: {p}')
q = dict(zip("ACGT", [0.25]*4))
print(f'q: {q}')
print('TEST2: ', kullback_leibler(p, q))
#self.assertAlmostEqual(kullback_leibler(p, q), 2.0, places=3
'''
protein: *MGWNFS*CK
random_rna_conversion: UAAAUGGGGUGGAAUUUCUCCUGAUGUAAG
amino_conversion: *MGWNFS*CK

cp_predicted: {'AAA': 0.03571428571428571, 'AAG': 0.03571428571428571, 'AAU': 0.07142857142857142, 'AUG': 0.07142857142857142, 'AUU': 0.03571428571428571, 'CCU': 0.03571428571428571, 'CUC': 0.03571428571428571, 'CUG': 0.03571428571428571, 'GAA': 0.03571428571428571, 'GAU': 0.03571428571428571, 'GGA': 0.03571428571428571, 'GGG': 0.07142857142857142, 'GGU': 0.03571428571428571, 'GUA': 0.03571428571428571, 'GUG': 0.03571428571428571, 'UAA': 0.07142857142857142, 'UCC': 0.03571428571428571, 'UCU': 0.03571428571428571, 'UGA': 0.03571428571428571, 'UGG': 0.07142857142857142, 'UGU': 0.03571428571428571, 'UUC': 0.03571428571428571, 'UUU': 0.03571428571428571}

cp_orig: {'AAA': 0.02447833065810594, 'AAC': 0.019161316211878016, 'AAG': 0.0320024077046549, 'AAU': 0.017054574638844307, 'ACA': 0.015148475120385235, 'ACC': 0.018960674157303375, 'ACG': 0.006119582664526485, 'ACU': 0.013142054574638845, 'AGA': 0.01223916532905297, 'AGC': 0.01956260032102729, 'AGG': 0.012038523274478333, 'AGU': 0.012138844301765652, 'AUA': 0.007524077046548958, 'AUC': 0.020866773675762444, 'AUG': 0.022070626003210275, 'AUU': 0.01605136436597111, 'CAA': 0.01233948635634029, 'CAC': 0.015148475120385235, 'CAG': 0.03430979133226325, 'CAU': 0.010934991974317819, 'CCA': 0.016954253611556985, 'CCC': 0.01986356340288925, 'CCG': 0.006922150882825042, 'CCU': 0.017556179775280904, 'CGA': 0.006219903691813806, 'CGC': 0.010433386837881222, 'CGG': 0.011436597110754418, 'CGU': 0.004514446227929374, 'CUA': 0.007223113964686999, 'CUC': 0.019662921348314613, 'CUG': 0.0397271268057785, 'CUU': 0.013242375601926166, 'GAA': 0.02909309791332264, 'GAC': 0.025180577849117182, 'GAG': 0.0397271268057785, 'GAU': 0.021869983948635638, 'GCA': 0.015850722311396472, 'GCC': 0.027788924558587485, 'GCG': 0.007423756019261639, 'GCU': 0.01845906902086678, 'GGA': 0.01655296950240771, 'GGC': 0.022271268057784916, 'GGG': 0.01655296950240771, 'GGU': 0.0108346709470305, 'GUA': 0.00712279293739968, 'GUC': 0.01454654895666132, 'GUG': 0.028190208667736763, 'GUU': 0.011035313001605138, 'UAC': 0.015349117174959875, 'UAU': 0.01223916532905297, 'UCA': 0.01223916532905297, 'UCC': 0.01775682182985554, 'UCG': 0.004414125200642056, 'UCU': 0.015248796147672555, 'UGC': 0.01264044943820225, 'UGG': 0.013242375601926166, 'UGU': 0.010634028892455861, 'UUA': 0.007724719101123597, 'UUC': 0.020365168539325847, 'UUG': 0.012941412520064208, 'UUU': 0.017656500802568222}

cp_uniform: {'AAA': 0.015625, 'AAC': 0.015625, 'AAG': 0.015625, 'AAU': 0.015625, 'ACA': 0.015625, 'ACC': 0.015625, 'ACG': 0.015625, 'ACU': 0.015625, 'AGA': 0.015625, 'AGC': 0.015625, 'AGG': 0.015625, 'AGU': 0.015625, 'AUA': 0.015625, 'AUC': 0.015625, 'AUG': 0.015625, 'AUU': 0.015625, 'CAA': 0.015625, 'CAC': 0.015625, 'CAG': 0.015625, 'CAU': 0.015625, 'CCA': 0.015625, 'CCC': 0.015625, 'CCG': 0.015625, 'CCU': 0.015625, 'CGA': 0.015625, 'CGC': 0.015625, 'CGG': 0.015625, 'CGU': 0.015625, 'CUA': 0.015625, 'CUC': 0.015625, 'CUG': 0.015625, 'CUU': 0.015625, 'GAA': 0.015625, 'GAC': 0.015625, 'GAG': 0.015625, 'GAU': 0.015625, 'GCA': 0.015625, 'GCC': 0.015625, 'GCG': 0.015625, 'GCU': 0.015625, 'GGA': 0.015625, 'GGC': 0.015625, 'GGG': 0.015625, 'GGU': 0.015625, 'GUA': 0.015625, 'GUC': 0.015625, 'GUG': 0.015625, 'GUU': 0.015625, 'UAA': 0.015625, 'UAC': 0.015625, 'UAG': 0.015625, 'UAU': 0.015625, 'UCA': 0.015625, 'UCC': 0.015625, 'UCG': 0.015625, 'UCU': 0.015625, 'UGA': 0.015625, 'UGC': 0.015625, 'UGG': 0.015625, 'UGU': 0.015625, 'UUA': 0.015625, 'UUC': 0.015625, 'UUG': 0.015625, 'UUU': 0.015625}

d(original || predicted) = 0.3489037718934694
d(predicted || original) = 0.02399482746957388

d(original || uniform) = 0.2256062647904233
d(uniform || original) = 0.09069417636570291

d(predicted || uniform) = 0.07334989114226581
d(uniform || predicted) = 0.06640166165276526
The Kullback-Leibler (KL) divergence quantifies how one probability distribution differs from a second, reference distribution. In this task, we compare the original codon distribution derived from a codon usage table, the codon distribution predicted from an RNA sequence, and a uniform distribution. The KL divergence is calculated for each pair of distributions to assess the difference in their probability distributions.
To address the task of comparing probability distributions using Kullback-Leibler (KL) divergence, we can break down the solution into several key steps:
Generate RNA Sequences and Protein Encoding:
Retrieve Predicted Codon Probabilities:
Retrieve Original Codon Probabilities:
Retrieve Uniform Codon Probabilities:
Renormalize Distributions:
Compute Kullback-Leibler Divergence:
By comparing these values, we can understand how well our predicted distribution matches the original codon usage and how it differs from a random distribution. Through this process, we effectively evaluate the quality of our probability predictions and assess their relevance compared to known and random baselines.
The Kullback-Leibler (KL) divergence values obtained from the comparisons of the original, predicted, and uniform codon probability distributions provide insights into the quality of our predicted distribution relative to known baselines.
A higher KL divergence value indicates a greater divergence between the two distributions, meaning more information is lost when one distribution is used to approximate the other. This suggests a poorer fit or approximation.
A lower KL divergence value indicates that the two distributions are more similar, meaning less information is lost. This suggests a better fit or more accurate approximation.
In general, we expect the KL divergence between the original and predicted distributions to be lower than between either of them and the uniform distribution, demonstrating that specific codon usage patterns are captured better than a uniform distribution would suggest. The divergence values should clearly reflect the non-uniform nature of codon usage in real biological sequences compared to random or uniform distributions.
Here’s an analysis of the results:
KL Divergence Between Original and Predicted Distributions:
d(original || predicted) = 0.15330031582470696
d(predicted || original) = 0.14047162716213069

KL Divergence Between Original and Uniform Distributions:
d(original || uniform) = 0.2256062647904233
d(uniform || original) = 0.09069417636570291

KL Divergence Between Predicted and Uniform Distributions:
d(predicted || uniform) = 0.06332186437783757
d(uniform || predicted) = 0.055725754669781385

Overall, these results illustrate that while the predicted distribution captures some characteristics of the original codon usage, there is still a significant difference, indicating that the model used to generate the predicted distribution might benefit from further refinement to better approximate the complex patterns seen in the original codon probabilities.
Let us consider a Markov chain of order one on the set of nucleotides. Its transition probabilities can be expressed as a $4 \times 4$ matrix $P=(p_{ij})$, where the element $p_{ij}$ gives the probability of the $j$th nucleotide on the condition the previous nucleotide was the $i$th. An example of a transition matrix is
\begin{array}{l|rrrr} & A & C & G & T \\ \hline A & 0.30 & 0.0 & 0.70 & 0.0 \\ C & 0.00 & 0.4 & 0.00 & 0.6 \\ G & 0.35 & 0.0 & 0.65 & 0.0 \\ T & 0.00 & 0.2 & 0.00 & 0.8 \\ \end{array}.
A distribution $\pi=(\pi_1,\pi_2,\pi_3,\pi_4)$ is called stationary, if $\pi = \pi P$ (the product here is matrix product).
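For example, $\pi=(\frac{1}{3}, 0, \frac{2}{3}, 0)$ is stationary for the matrix above: $(\pi P)_A = \frac{1}{3}\cdot 0.30 + \frac{2}{3}\cdot 0.35 = \frac{1}{3}$ and $(\pi P)_G = \frac{1}{3}\cdot 0.70 + \frac{2}{3}\cdot 0.65 = \frac{2}{3}$, while the $C$ and $T$ components stay $0$ because $A$ and $G$ never transition to $C$ or $T$.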
20. Write function get_stationary_distributions
that gets a transition matrix as parameter,
and returns the list of stationary distributions. You can do this with NumPy by
first taking the transpose of both sides of the above equation to get the equation
$\pi^T = P^T \pi^T$. Using numpy.linalg.eig, take all eigenvectors associated with
eigenvalue 1.0. By normalizing these vectors to sum to one, you get the stationary distributions
of the original transition matrix. In the main
function print the stationary distributions
of the above transition matrix.
DEBUG20 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_stationary_distributions(transition):
"""
    The function gets a transition matrix of a first-order Markov chain as a parameter.
    It returns a list of stationary distributions, in vector form, for that chain.
"""
# compute the transpose of the transition matrix
transition_T = np.transpose(transition)
# find eigenvalues and eigenvectors of the transposed matrix
eigenvalues, eigenvectors = np.linalg.eig(transition_T)
if DEBUG20:
print('Transition Matrix: Probabilities of Moving Between States')
for i,t in enumerate(transition): print(f'Transition from state {i+1}: {t} - Probabilities of moving to each state')
print()
print('Transpose of Transition Matrix: Probabilities of Ending Up in Each State')
for i,t in enumerate(transition_T): print(f'Column {i+1} of transposed matrix: {t} - Probabilities for ending up in state {i+1}')
print()
print('Eigenvalues of the Transposed Matrix: Indicate Steady-State and Dynamic Characteristic')
print(f'eigenvalues:\n{eigenvalues}')
print()
print('Eigenvectors of the Transposed Matrix: Represent Steady-State Distributions')
print(f'eigenvectors:\n{eigenvectors}')
print()
# identify eigenvectors corresponding to eigenvalue 1.0
stationary_distributions = []
for i, eigenvalue in enumerate(eigenvalues):
# check if the eigenvalue is approximately 1
if np.isclose(eigenvalue, 1.0, atol=1e-8):
eigenvector = np.real(eigenvectors[:, i]) # Take the real part of the eigenvector
# normalize the eigenvector to sum to 1
eigenvector = eigenvector / np.sum(eigenvector)
stationary_distributions.append(eigenvector)
return stationary_distributions
if __name__ == "__main__":
transition=np.array([[0.3, 0, 0.7, 0],
[0, 0.4, 0, 0.6],
[0.35, 0, 0.65, 0],
[0, 0.2, 0, 0.8]])
print("\n".join(
", ".join(
f"{pv:+.3f}"
for pv in p)
for p in get_stationary_distributions(transition)))
+0.333, +0.000, +0.667, +0.000
-0.000, +0.250, -0.000, +0.750
The solution aims to solve the problem of finding stationary distributions from a Markov transition matrix. Here's a step-by-step outline:
Transpose the Transition Matrix: To convert the problem of finding stationary distributions into one involving eigenvectors, we first transpose the transition matrix. This transformation helps us solve the problem in terms of the matrix P^T, where P is the original transition matrix.
Compute Eigenvalues and Eigenvectors: Using NumPy's np.linalg.eig function, we compute the eigenvalues and eigenvectors of the transposed matrix. The eigenvalues provide insight into the nature of the steady-state behavior, particularly focusing on eigenvalue 1.0.
Extract Stationary Distributions: Eigenvectors corresponding to the eigenvalue 1.0 represent the stationary distributions of the Markov chain. We normalize these eigenvectors so that their components sum to 1, converting them into valid probability distributions.
Return the Results: The function returns these normalized eigenvectors as the list of stationary distributions.
The initial function returned only one stationary distribution, likely due to filtering criteria that excluded eigenvectors with minor negative values. To address this, the filtering criteria were adjusted to account for very small negative values by treating them as zero. The function now successfully identifies stationary distributions by focusing on eigenvectors associated with the eigenvalue 1.0. This approach is appropriate for understanding the long-term behavior of the Markov chain, as stationary distributions represent the probabilities of being in each state after many transitions.
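A direct way to validate the function's output is to multiply each returned distribution back through the transition matrix; a minimal check, assuming get_stationary_distributions from above is in scope:
import numpy as np

if __name__ == "__main__":
    transition = np.array([[0.3, 0, 0.7, 0],
                           [0, 0.4, 0, 0.6],
                           [0.35, 0, 0.65, 0],
                           [0, 0.2, 0, 0.8]])
    for pi in get_stationary_distributions(transition):
        # a stationary distribution is unchanged by one transition step
        assert np.allclose(pi @ transition, pi)
        # and it is a proper probability vector
        assert np.isclose(pi.sum(), 1.0)
    print("all stationary distributions verified")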
21. Implement the kl_divergences
function below so that the main guarded code runs properly. Using your modified Markov chain generator generate a nucleotide sequence $s$ of length $10\;000$. Choose prefixes of $s$ of lengths $1, 10, 100, 1000$, and $10\;000$. For each of these prefixes find out their nucleotide distribution (of order 0) using your earlier tool. Use 1 as the pseudo count. Then, for each prefix, compute the KL divergence between the initial distribution and the normalized nucleotide distribution.
def kl_divergences(initial, transition):
"""
    Calculates the Kullback-Leibler divergences between the initial distribution and
    the empirical distributions of prefixes of a sequence generated by a Markov model
    seeded with that initial distribution and a transition matrix.
Sequences of length [1, 10, 100, 1000, 10000] are generated.
"""
return zip([1, 10, 100, 1000, 10000], np.random.rand(5))
if __name__ == "__main__":
transition=np.array([[0.3, 0, 0.7, 0],
[0, 0.4, 0, 0.6],
[0.35, 0, 0.65, 0],
[0, 0.2, 0, 0.8]])
print("Transition probabilities are:")
print(transition)
stationary_distributions = get_stationary_distributions(transition)
print("Stationary distributions:")
print(np.stack(stationary_distributions))
initial = stationary_distributions[1]
print("Using [{}] as initial distribution\n".format(", ".join(f"{v:.2f}" for v in initial)))
results = kl_divergences(initial, transition)
for prefix_length, divergence in results: # iterate on prefix lengths in order (1, 10, 100...)
print("KL divergence of stationary distribution prefix " \
"of length {:5d} is {:.8f}".format(prefix_length, divergence))
Transition probabilities are:
[[0.3  0.   0.7  0.  ]
 [0.   0.4  0.   0.6 ]
 [0.35 0.   0.65 0.  ]
 [0.   0.2  0.   0.8 ]]
Transition Matrix: Probabilities of Moving Between States
Transition from state 1: [0.3 0.  0.7 0. ] - Probabilities of moving to each state
Transition from state 2: [0.  0.4 0.  0.6] - Probabilities of moving to each state
Transition from state 3: [0.35 0.   0.65 0.  ] - Probabilities of moving to each state
Transition from state 4: [0.  0.2 0.  0.8] - Probabilities of moving to each state

Transpose of Transition Matrix: Probabilities of Ending Up in Each State
Column 1 of transposed matrix: [0.3  0.   0.35 0.  ] - Probabilities for ending up in state 1
Column 2 of transposed matrix: [0.  0.4 0.  0.2] - Probabilities for ending up in state 2
Column 3 of transposed matrix: [0.7  0.   0.65 0.  ] - Probabilities for ending up in state 3
Column 4 of transposed matrix: [0.  0.6 0.  0.8] - Probabilities for ending up in state 4

Eigenvalues of the Transposed Matrix: Indicate Steady-State and Dynamic Characteristic
eigenvalues:
[-0.05  1.    0.2   1.  ]

Eigenvectors of the Transposed Matrix: Represent Steady-State Distributions
eigenvectors:
[[-0.70710678  0.4472136   0.          0.        ]
 [-0.          0.          0.70710678 -0.31622777]
 [ 0.70710678  0.89442719 -0.          0.        ]
 [ 0.          0.         -0.70710678 -0.9486833 ]]

Stationary distributions:
[[ 0.33333333  0.          0.66666667  0.        ]
 [-0.          0.25       -0.          0.75      ]]
Using [-0.00, 0.25, -0.00, 0.75] as initial distribution

KL divergence of stationary distribution prefix of length     1 is 0.18802993
KL divergence of stationary distribution prefix of length    10 is 0.30765459
KL divergence of stationary distribution prefix of length   100 is 0.24639364
KL divergence of stationary distribution prefix of length  1000 is 0.59604912
KL divergence of stationary distribution prefix of length 10000 is 0.09190485
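The body of kl_divergences above is placeholder scaffolding (it returns random values, which is why the printed divergences do not settle down as the prefix grows). Below is a minimal sketch of one possible implementation, assuming the kullback_leibler function from exercise 19 is in scope; the chain generation is written inline instead of reusing the earlier generator classes, the extra seed parameter is my addition, and the direction d(initial || empirical) is one of the two reasonable choices:
import numpy as np

def kl_divergences(initial, transition, seed=0):
    """For prefixes of length 1, 10, ..., 10000 of a single generated
    sequence, return (prefix_length, d(initial || empirical)) pairs."""
    rng = np.random.default_rng(seed)
    # generate a 10000-state sequence of the order-1 chain, seeded with `initial`
    states = [rng.choice(4, p=initial)]
    for _ in range(9999):
        states.append(rng.choice(4, p=transition[states[-1]]))
    nucleotides = "ACGT"
    results = []
    for n in (1, 10, 100, 1000, 10000):
        counts = np.ones(4)                  # pseudo count 1
        for state in states[:n]:
            counts[state] += 1
        empirical = counts / counts.sum()
        p = dict(zip(nucleotides, initial))
        q = dict(zip(nucleotides, empirical))
        results.append((n, kullback_leibler(p, q)))
    return results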
NOTE. Exercises in section "Stationary and equilibrium distributions (extra)" (exercises 20, 21, and 22) are not obligatory. Thus, you only need to do 19 exercises, if you are aiming to get full points.
22. Implement the following in the main
function.
Find the stationary distribution for the following transition matrix:
\begin{array}{l|rrrr} & A & C & G & T \\ \hline A & 0.30 & 0.10 & 0.50 & 0.10 \\ C & 0.20 & 0.30 & 0.15 & 0.35 \\ G & 0.25 & 0.15 & 0.20 & 0.40 \\ T & 0.35 & 0.20 & 0.40 & 0.05 \\ \end{array}
Since there is only one stationary distribution, it is called the equilibrium distribution. Choose randomly two nucleotide distributions. You can take these from your sleeve or sample them from the Dirichlet distribution. Then for each of these distributions as the initial distribution of the Markov chain, repeat the above experiment.
The main
function should return tuples, where the first element is the (random) initial distribution and the second element contains the results as a list of tuples where the first element is the kl divergence and the second element the empirical nucleotide distribution, for the different prefix lengths.
The state distribution should converge to the equilibrium distribution no matter how we start the Markov chain! That is the last line of the tables should have KL-divergence very close to $0$ and an empirical distribution very close to the equilibrium distribution.
def main(transition, equilibrium_distribution):
vals = list(zip(np.random.rand(10), np.random.rand(10, 4) - 0.5))
return zip(np.random.rand(2, 4) - 0.5,
[vals[:5], vals[5:]])
if __name__ == "__main__":
transition = np.array([[0.3, 0.1, 0.5, 0.1],
[0.2, 0.3, 0.15, 0.35],
[0.25, 0.15, 0.2, 0.4],
[0.35, 0.2, 0.4, 0.05]])
print("Transition probabilities are:", transition, sep="\n")
stationary_distributions = get_stationary_distributions(transition)
# Uncomment the below line to check that there actually is only one stationary distribution
# assert len(stationary_distributions) == 1
equilibrium_distribution = stationary_distributions[0]
print("Equilibrium distribution:")
print(equilibrium_distribution)
for initial_distribution, results in main(transition, equilibrium_distribution):
print("\nUsing {} as initial distribution:".format(initial_distribution))
print("kl-divergence empirical distribution")
print("\n".join("{:.11f} {}".format(di, kl) for di, kl in results))
Transition probabilities are:
[[0.3  0.1  0.5  0.1 ]
 [0.2  0.3  0.15 0.35]
 [0.25 0.15 0.2  0.4 ]
 [0.35 0.2  0.4  0.05]]
Equilibrium distribution:
[0.27803345 0.17353238 0.32035021 0.22808396]

Using {'A': 0.2565396943449555, 'C': 0.33282226080131183, 'G': 0.15535480923303974, 'T': 0.2552832356206928} as initial distribution:
kl-divergence empirical distribution
0.12755953943 {'A': 0.2, 'C': 0.4, 'G': 0.2, 'T': 0.2}
0.06232425156 {'A': 0.14285714285714285, 'C': 0.21428571428571427, 'G': 0.35714285714285715, 'T': 0.2857142857142857}
0.00985214400 {'A': 0.25961538461538464, 'C': 0.23076923076923078, 'G': 0.2980769230769231, 'T': 0.21153846153846154}
0.00230574657 {'A': 0.25298804780876494, 'C': 0.1892430278884462, 'G': 0.33565737051792827, 'T': 0.22211155378486055}
0.00016968457 {'A': 0.28318672530987604, 'C': 0.16723310675729708, 'G': 0.3220711715313874, 'T': 0.22750899640143943}

Using {'A': 0.005892872435830304, 'C': 0.27461344231012597, 'G': 0.4627732906369835, 'T': 0.2567203946170602} as initial distribution:
kl-divergence empirical distribution
0.12755953943 {'A': 0.2, 'C': 0.4, 'G': 0.2, 'T': 0.2}
0.19663560448 {'A': 0.07142857142857142, 'C': 0.21428571428571427, 'G': 0.42857142857142855, 'T': 0.2857142857142857}
0.00745305622 {'A': 0.23076923076923078, 'C': 0.17307692307692307, 'G': 0.36538461538461536, 'T': 0.23076923076923078}
0.00075298635 {'A': 0.2918326693227092, 'C': 0.16832669322709162, 'G': 0.3237051792828685, 'T': 0.21613545816733068}
0.00002629270 {'A': 0.2810875649740104, 'C': 0.17243102758896442, 'G': 0.31837265093962414, 'T': 0.22810875649740103}
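The main above is likewise a random placeholder; a convergence table like the one shown can be produced along the following lines (a sketch under the same assumptions as in exercise 21: kullback_leibler in scope, inline chain generation, an added seed parameter, and d(equilibrium || empirical) as the reported divergence):
import numpy as np

def main(transition, equilibrium_distribution, seed=0):
    rng = np.random.default_rng(seed)
    nucleotides = "ACGT"
    equilibrium = dict(zip(nucleotides, equilibrium_distribution))
    output = []
    for _ in range(2):
        # a random initial distribution drawn from a flat Dirichlet
        initial = rng.dirichlet(np.ones(4))
        # generate a 10000-state sequence started from `initial`
        states = [rng.choice(4, p=initial)]
        for _ in range(9999):
            states.append(rng.choice(4, p=transition[states[-1]]))
        results = []
        for n in (1, 10, 100, 1000, 10000):
            counts = np.ones(4)              # pseudo count 1
            for state in states[:n]:
                counts[state] += 1
            empirical = dict(zip(nucleotides, counts / counts.sum()))
            results.append((kullback_leibler(equilibrium, empirical), empirical))
        output.append((dict(zip(nucleotides, initial)), results))
    return output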