The following assignments introduce applications of hashing with Python's dict() primitive. While doing so, a rudimentary introduction to biological sequences is given. This framework is then enhanced with probabilities, leading to routines that generate random sequences under some constraints, including a general concept of Markov chains. All these components illustrate the usage of dict(), but at the same time introduce some other computational routines to deal efficiently with probabilities.
The collections.defaultdict type can be useful.
Below are some suggested imports. Feel free to use and modify these, or not. Generally it is good practice to keep most or all imports in one place, typically very close to the start of the notebook.
from collections import defaultdict
from itertools import product
from itertools import accumulate
import numpy as np
from numpy.random import choice
import re
import random
from collections import Counter
from functools import reduce
import math
# Define ANSI escape codes for colored text
RED = "\033[91m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
ORANGE = "\033[38;5;208m"
L_ORANGE = "\033[38;5;214m"
RESET = "\033[0m"
The automated TMC tests do not test cell outputs. These are intended to be evaluated in the peer reviews, so it is still a good idea to make the outputs as clear and informative as possible.
To keep the TMC tests running as well as possible, it is recommended to keep global variable assignments in the notebook to a minimum to avoid potential name clashes and confusion. Additionally, you should keep all actual code execution inside main guards to keep the tests running smoothly. If you run check_sequence.py in the part07-e01_sequence_analysis folder, the script should finish very quickly and optimally produce no output.
If you download data from the internet during execution (codon usage table), the parts where downloading is done will not work if you decide to submit to the TMC server. Local tests should work fine.
A DNA molecule consists, in principle, of a chain of smaller molecules. These smaller molecules have some common basic components (bases) that repeat. For our purposes it is sufficient to know that these bases are the nucleotides adenine, cytosine, guanine, and thymine, with abbreviations A, C, G, and T. Given a DNA sequence, e.g. ACGATGAGGCTCAT, one can reverse engineer (with negligible loss of information) the corresponding DNA molecule.
Parts of a DNA molecule can transcribe into an RNA molecule. In this process, thymine gets replaced by uracil (U).
1. Write a function dna_to_rna to convert a given DNA sequence $s$ into an RNA sequence. For the sake of exercise, use dict() to store the symbol-to-symbol encoding rules. Create a program to test your function.
DEBUG1 = False # Set to [ True ] to enable debug output, [ False ] to disable
def dna_to_rna(s):
dict_dna_to_rna = {"A":"A","C":"C","G":"G","T":"U"}
rna_sequence = "".join(dict_dna_to_rna[nucleotide] for nucleotide in s)
if DEBUG1:
print(f"DNA sequence: {s}")
print(f"RNA sequence: {rna_sequence}")
return rna_sequence
if __name__ == '__main__':
print(dna_to_rna("AACGTGATTTC"))
#print(dna_to_rna(""))
AACGUGAUUUC
The solution involves converting a DNA sequence into an RNA sequence by replacing each occurrence of the nucleotide thymine (T) with uracil (U) and leaving all other nucleotides unchanged. This transformation is achieved using a dictionary that maps each nucleotide to its corresponding RNA counterpart. The function iterates over the input string, performs the replacement using the dictionary, and constructs the RNA sequence using Python's join method.
The problem of converting a DNA sequence to an RNA sequence is relatively straightforward and can be efficiently handled using a dictionary for nucleotide mapping. Further improvements might include e.g. input validation, but that does not appear necessary in this context.
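As an illustration of that possible improvement, here is a minimal sketch of a validating variant (the name dna_to_rna_validated is hypothetical, not part of the exercise):
def dna_to_rna_validated(s):
    # Same mapping as above, but reject characters outside the DNA alphabet.
    mapping = {"A": "A", "C": "C", "G": "G", "T": "U"}
    if any(c not in mapping for c in s):
        raise ValueError(f"Invalid DNA sequence: {s!r}")
    return "".join(mapping[c] for c in s)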
Like DNA and RNA, a protein molecule can be interpreted as a chain of smaller molecules, where the bases are now amino acids. An RNA molecule may translate into a protein molecule, but instead of base by base, three bases of RNA correspond to one base of protein. That is, the RNA sequence is read one triplet (called a codon) at a time.
2. Consider the codon to amino acid conversion table in http://htmlpreview.github.io/?https://github.com/csmastersUH/data_analysis_with_python_2020/blob/master/Codon%20usage%20table.html. Write a function get_dict to read the table into a dict(), such that for each RNA sequence of length 3, say $\texttt{AGU}$, the hash table stores the conversion rule to the corresponding amino acid. You may store the html page to your local src directory, and parse that file.
DEBUG2 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_dict():
# initialize variables
filepath = "src/data.txt"
codon_to_amino_dict = {}
codon_pattern = re.compile(r'([AUGC]{3})\s+([A-Z\*])')
# get the file
try:
with open(filepath, 'r') as file:
lines = file.readlines()
except FileNotFoundError:
raise FileNotFoundError(f"The file at {filepath} was not found.")
except IOError as e:
raise IOError(f"An error occurred while reading the file: {e}")
# find codon-amino matches
for line in lines:
matches = codon_pattern.findall(line)
# input matches into dictionary
for codon, amino_acid in matches:
codon_to_amino_dict[codon] = amino_acid
if DEBUG2:
print(f"File:")
for line in lines:
print(line, end = '')
print('\n')
print(f"Codon to Amino dictionary: {codon_to_amino_dict}")
return codon_to_amino_dict
if __name__ == '__main__':
codon_to_aa = get_dict()
print(codon_to_aa)
{'UUU': 'F', 'UCU': 'S', 'UAU': 'Y', 'UGU': 'C', 'UUC': 'F', 'UCC': 'S', 'UAC': 'Y', 'UGC': 'C', 'UUA': 'L', 'UCA': 'S', 'UAA': '*', 'UGA': '*', 'UUG': 'L', 'UCG': 'S', 'UAG': '*', 'UGG': 'W', 'CUU': 'L', 'CCU': 'P', 'CAU': 'H', 'CGU': 'R', 'CUC': 'L', 'CCC': 'P', 'CAC': 'H', 'CGC': 'R', 'CUA': 'L', 'CCA': 'P', 'CAA': 'Q', 'CGA': 'R', 'CUG': 'L', 'CCG': 'P', 'CAG': 'Q', 'CGG': 'R', 'AUU': 'I', 'ACU': 'T', 'AAU': 'N', 'AGU': 'S', 'AUC': 'I', 'ACC': 'T', 'AAC': 'N', 'AGC': 'S', 'AUA': 'I', 'ACA': 'T', 'AAA': 'K', 'AGA': 'R', 'AUG': 'M', 'ACG': 'T', 'AAG': 'K', 'AGG': 'R', 'GUU': 'V', 'GCU': 'A', 'GAU': 'D', 'GGU': 'G', 'GUC': 'V', 'GCC': 'A', 'GAC': 'D', 'GGC': 'G', 'GUA': 'V', 'GCA': 'A', 'GAA': 'E', 'GGA': 'G', 'GUG': 'V', 'GCG': 'A', 'GAG': 'E', 'GGG': 'G'}
The get_dict() function parses a file containing codon-to-amino acid mappings and creates a dictionary from this data. The function uses a regular expression to find and extract codon-amino acid pairs from each line of the file. The regex pattern ([AUGC]{3})\s+([A-Z\*]) captures RNA triplets (codons) and their corresponding amino acids. The function then constructs a dictionary with codons as keys and amino acids as values, and returns this dictionary.
The initial approach of using simple string operations to extract codon-amino acid pairs was prone to errors due to inconsistent spacing and formatting in the data. By employing regular expressions, the revised solution provides a more robust and precise method for parsing the file. Debugging output helps verify that the dictionary is correctly populated. This method appears both efficient and adaptable, making it a reliable solution for the task.
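A small demonstration of the pattern on a hand-written line (the sample below only mimics the table layout and is an assumption, not actual file content):
if __name__ == '__main__':
    sample = "UUU F 0.46 17.6 (714298)  UCU S 0.19 15.2 (618711)"
    pattern = re.compile(r'([AUGC]{3})\s+([A-Z\*])')
    print(pattern.findall(sample))  # [('UUU', 'F'), ('UCU', 'S')]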
3. Use the same conversion table as above, but now write function get_dict_list to read the table into a dict(), such that for each amino acid the hash table stores the list of codons encoding it.
DEBUG3 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_dict_list():
# initialize variables
filepath = "src/data.txt"
codon_to_amino_dict = {}
codon_pattern = re.compile(r'([AUGC]{3})\s+([A-Z\*])')
amino_to_codon_dict_list = {}
# get the file
try:
with open(filepath, 'r') as file:
lines = file.readlines()
except FileNotFoundError:
raise FileNotFoundError(f"The file at {filepath} was not found.")
except IOError as e:
raise IOError(f"An error occurred while reading the file: {e}")
# find codon-amino matches
for line in lines:
matches = codon_pattern.findall(line)
# input matches into dictionary
for codon, amino_acid in matches:
codon_to_amino_dict[codon] = amino_acid
# input: values from codon->amino as keys / keys from codon->amino as values, in list
for key, value in codon_to_amino_dict.items():
if DEBUG3:
print(f"CODON -> AMINO | Key - {key}, Value - {value}")
if value not in amino_to_codon_dict_list:
amino_to_codon_dict_list[value] = []
amino_to_codon_dict_list[value].append(key)
else:
amino_to_codon_dict_list[value].append(key)
if DEBUG3:
print(f"AMINO -> CODON | Key - {value}, Value - {amino_to_codon_dict_list[value]}")
return amino_to_codon_dict_list
if __name__ == '__main__':
aa_to_codons = get_dict_list()
print(aa_to_codons)
{'F': ['UUU', 'UUC'], 'S': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'], 'Y': ['UAU', 'UAC'], 'C': ['UGU', 'UGC'], 'L': ['UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'], '*': ['UAA', 'UGA', 'UAG'], 'W': ['UGG'], 'P': ['CCU', 'CCC', 'CCA', 'CCG'], 'H': ['CAU', 'CAC'], 'R': ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'Q': ['CAA', 'CAG'], 'I': ['AUU', 'AUC', 'AUA'], 'T': ['ACU', 'ACC', 'ACA', 'ACG'], 'N': ['AAU', 'AAC'], 'K': ['AAA', 'AAG'], 'M': ['AUG'], 'V': ['GUU', 'GUC', 'GUA', 'GUG'], 'A': ['GCU', 'GCC', 'GCA', 'GCG'], 'D': ['GAU', 'GAC'], 'G': ['GGU', 'GGC', 'GGA', 'GGG'], 'E': ['GAA', 'GAG']}
The function get_dict_list() aims to create a dictionary mapping each amino acid to the list of codons that encode it.
The debug output, if enabled, provides intermediate verification of these mappings.
The solution achieves the task by first populating a codon_to_amino_dict with codon-amino acid pairs and then transforming this dictionary into the final format, where amino acids are keys and lists of codons are values. This two-step approach keeps the process clear but involves some redundant steps; combining them into a single pass without intermediate storage could improve efficiency, as sketched below.
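For illustration, here is one way the merge loop could be collapsed using collections.defaultdict (a hypothetical sketch building on get_dict from exercise 2, not the submitted solution):
from collections import defaultdict

def get_dict_list_compact():
    # Invert the codon -> amino acid mapping from exercise 2 in a single loop.
    amino_to_codons = defaultdict(list)
    for codon, amino_acid in get_dict().items():
        amino_to_codons[amino_acid].append(codon)
    return dict(amino_to_codons)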
With the conversion tables at hand, the following should be trivial to solve.
4. Fill in function rna_to_prot in the stub solution to convert a given DNA sequence $s$ into a protein sequence. You may use the dictionaries from exercises 2 and 3. You can test your program with ATGATATCATCGACGATGTAG.
DEBUG4 = False # Set to [ True ] to enable debug output, [ False ] to disable
def rna_to_prot(s):
# get the codon to amino dictionary
codon_to_amino_dict = get_dict()
# split the RNA sequence into codons (3-letter segments)
codon_lst = [s[i:i+3] for i in range(0, len(s), 3)]
if DEBUG4:
print(f'\ncodon list: {codon_lst}')
# translate each codon into its corresponding amino
amino_lst = [codon_to_amino_dict[c] for c in codon_lst]
if DEBUG4:
print(f'amino list: {amino_lst}\n')
# join the list of amino acids into a single string
amino_str = ''.join(amino_lst)
return amino_str
def dna_to_prot(s):
return rna_to_prot(dna_to_rna(s))
if __name__ == '__main__':
print(dna_to_prot("ATGATATCATCGACGATGTAG"))
MISSTM*
The function rna_to_prot converts an RNA sequence into a protein sequence leveraging the codon-to-amino dictionary.
This approach ensures that each codon is accurately translated and the resulting protein sequence is produced.
The rna_to_prot function translates an RNA sequence into a protein sequence in a straightforward manner by using the previously created functions. The use of list comprehension for splitting the RNA sequence into codons and then translating each codon into an amino acid provides a compact and readable solution. Debugging output helps verify the correctness of intermediate steps, by showing the list of codons and the resulting amino acids.
You may notice that there are $4^3=64$ different codons, but only 20 amino acids. That is, some triplets encode the same amino acid.
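This is easy to verify with the dictionaries from exercises 2 and 3; the counts below agree with the outputs shown earlier (note that the stop symbol * appears alongside the 20 amino acids):
if __name__ == '__main__':
    print(len(get_dict()))       # 64 codons
    print(len(get_dict_list()))  # 21 keys: 20 amino acids plus the stop symbol '*'
    print(get_dict_list()['L'])  # six different codons encode leucine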
It has been observed that among the codons coding the same amino acid, some are more frequent than others. These frequencies can be converted to probabilities. E.g. consider codons AUU, AUC, and AUA that code for amino acid isoleucine. If they are observed, say, 36, 47, and 17 times, respectively, to code isoleucine in a dataset, the probability that a random such event is AUU $\to$ isoleucine is 36/100.
This phenomenon is called codon adaptation, and for our purposes it works as a good introduction to the generation of random sequences under constraints.
5. Consider the codon adaptation frequencies in http://htmlpreview.github.io/?https://github.com/csmastersUH/data_analysis_with_python_2020/blob/master/Codon%20usage%20table.html and read them into a dict(), such that for each RNA sequence of length 3, say AGU, the hash table stores the probability of that codon among codons encoding the same amino acid. Put your solution in the get_probabability_dict function. Use the column "([number])" to estimate the probabilities, as the two preceding columns contain truncated values.
DEBUG5 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_probabability_dict():
filepath = "src/data.txt"
codon_adaptation_freq_dict = {}
amino_to_codon_dict_list = get_dict_list()
# read the file
with open(filepath, 'r') as file:
lines = file.readlines()
if DEBUG5:
print()
print(f'lines: {lines}')
# regular expression to match codons and their frequencies
freq_pattern = re.compile(r'([AUGC]{3})\s+[A-Z\*]\s+\d+\.\d+\s+\d+\.\d+\s*\(\s*(\d+)\s*\)')
# for each line in original table find the groups: (codon, frequency)
for line in lines:
matches = freq_pattern.findall(line)
if DEBUG5:
print()
print(f'{YELLOW}MATCHES FROM LINE & LINE {RESET}')
print(f'{RED}matches: {matches} {RESET}')
print(f'{RED}line: {line} {RESET}')
# extract frequencies
for codon, frequency in matches:
frequency = int(frequency)
amino = None
if DEBUG5:
print(f'{YELLOW}ITERATED ITEMS & DICTIONARY ADDITION {RESET}')
print(f'codon: {codon}')
print(f'frequency: {frequency}')
# find the corresponding amino for the codon
for key, values_list in amino_to_codon_dict_list.items():
if codon in values_list:
amino = key
if DEBUG5:
print(f'amino: {amino}')
break
# add amino, codon, frequency (amino -> [codon -> frequency])
if amino not in codon_adaptation_freq_dict:
codon_adaptation_freq_dict[amino] = {}
codon_adaptation_freq_dict[amino][codon] = frequency
else:
codon_adaptation_freq_dict[amino][codon] = frequency
if DEBUG5:
print(f'codon_adaptation_freq_dict: {codon_adaptation_freq_dict}')
print('-------------------------')
# compute probabilities from the frequencies
codon_encoding_amino_probability_dict = {}
for amino, codon_freqs in codon_adaptation_freq_dict.items():
total_frequency = sum(codon_freqs.values())
codon_encoding_amino_probability_dict [amino] = {codon: freq / total_frequency for codon, freq in codon_freqs.items()}
if DEBUG5:
print()
print(f'codon_adaptation_freq_dict: {codon_adaptation_freq_dict}')
print(f'codon_encoding_amino_probability_dict: {codon_encoding_amino_probability_dict}')
# flatten the dictionary for compatibility with the main block
codon_encoding_amino_probability_dict_flat = {}
for amino, codon_probs in codon_encoding_amino_probability_dict.items():
for codon, prob in codon_probs.items():
codon_encoding_amino_probability_dict_flat[codon] = prob
if DEBUG5:
print(f'codon_encoding_amino_probability_dict_flat: {codon_encoding_amino_probability_dict_flat}')
print()
return codon_encoding_amino_probability_dict_flat
if __name__ == '__main__':
codon_to_prob = get_probabability_dict()
items = sorted(codon_to_prob.items(), key=lambda x: x[0])
for i in range(1 + len(items)//6):
print("\t".join(
f"{k}: {v:.6f}"
for k, v in items[i*6:6+i*6]
))
AAA: 0.434049    AAC: 0.529633    AAG: 0.565951    AAU: 0.470367    ACA: 0.284188    ACC: 0.355232
ACG: 0.113812    ACU: 0.246769    AGA: 0.214658    AGC: 0.239938    AGG: 0.211091    AGU: 0.149602
AUA: 0.169062    AUC: 0.469866    AUG: 1.000000    AUU: 0.361072    CAA: 0.265017    CAC: 0.581485
CAG: 0.734983    CAU: 0.418515    CCA: 0.276603    CCC: 0.323470    CCG: 0.113196    CCU: 0.286731
CGA: 0.108812    CGC: 0.183777    CGG: 0.201554    CGU: 0.080108    CUA: 0.071380    CUC: 0.195577
CUG: 0.395702    CUU: 0.131716    GAA: 0.422453    GAC: 0.535458    GAG: 0.577547    GAU: 0.464542
GCA: 0.228121    GCC: 0.399781    GCG: 0.106176    GCU: 0.265922    GGA: 0.249922    GGC: 0.337109
GGG: 0.249882    GGU: 0.163087    GUA: 0.116577    GUC: 0.238306    GUG: 0.463346    GUU: 0.181770
UAA: 0.297019    UAC: 0.556662    UAG: 0.236738    UAU: 0.443338    UCA: 0.150517    UCC: 0.217960
UCG: 0.054398    UCU: 0.187586    UGA: 0.466243    UGC: 0.543843    UGG: 1.000000    UGU: 0.456157
UUA: 0.076568    UUC: 0.535866    UUG: 0.129058    UUU: 0.464134
The goal of this solution is to calculate the probability distribution of the codons encoding each amino acid, based on their frequencies in a provided dataset. The program reads codon frequencies from a file and processes them to produce a normalized probability distribution for each codon-amino acid pairing.
The process consists of several steps:
Read Codon Frequency Data: The program reads the codon frequency data from a file. This file contains lines with codons, amino acids, and their respective frequencies.
Extract Frequencies Using Regular Expressions: A regular expression is used to parse each line of the file and extract codons along with their frequencies. The regular expression pattern matches a codon and its frequency value.
Map Codons to Amino Acids: The program maps each codon to its corresponding amino acid, using the previously defined dictionary amino_to_codon_dict_list.
Calculate Frequency Dictionary: For each codon, the program records its frequency under the corresponding amino acid. These frequencies are stored in a nested dictionary structure (codon_adaptation_freq_dict), where the outer keys are amino acids, and the inner dictionaries map codons to their frequencies.
Compute Probabilities: Once all codon frequencies are collected, the program calculates the probability of each codon for each amino acid. This is done by dividing the frequency of each codon by the total frequency of all codons encoding the same amino acid. The result is a probability distribution stored in a new dictionary (codon_encoding_amino_probability_dict).
Flatten the Probability Dictionary: The nested probability dictionary is then flattened for use in the main block of the program, resulting in a single dictionary that maps each codon directly to its probability.
Output the Results: Finally, the program sorts the codon-probability pairs alphabetically by codon and prints them in a formatted table.
The solution leverages a structured approach to convert raw frequency data into a usable probability distribution format. The use of regular expressions allows for flexible and robust parsing of the input data, ensuring that only correctly formatted lines are processed. The mapping from codons to amino acids is managed using dictionaries, allowing quick lookups and updates.
A significant challenge encountered during development was managing inconsistent spacing and avoiding unnecessary data capture in the regex pattern. Initially, the regular expression was too broad, capturing more information than needed and struggling to handle varying spaces between data elements. To resolve this, the regex pattern was refined to focus solely on essential components: the three-letter codon sequence and the frequency count inside parentheses.
The program includes optional debug output. When enabled, the program prints detailed information about the internal state at various steps.
Now you should have everything in place to easily solve the following.
6. Write a class ProteinToMaxRNA with a convert method which converts a protein sequence into the most likely RNA sequence to be the source of this protein. Run your program with LTPIQNRA.
DEBUG6 = False # Set to [ True ] to enable debug output, [ False ] to disable
class ProteinToMaxRNA:
def __init__(self):
self.amino_to_codon = get_dict_list()
self.codon_probabilities = get_probabability_dict()
def convert(self, s):
if DEBUG6:
print('amino_to_codon')
print(self.amino_to_codon)
print()
print('codon_probabilities')
print(self.codon_probabilities)
print()
# convert string to chars for use in loop
chars = list(s)
if DEBUG6:
print(f'chars: {chars}')
# initialize other variables for use in loop
max_codon_lst = []
# loop over each character in sequence
for char in chars:
prob_lst = []
# fetch the codons encoding amino
codons = self.amino_to_codon[char]
if DEBUG6:
print(f'codons encoding {char}')
print(codons)
# loop over each codon
for codon in codons:
# fetch the probability it encodes amino and append to list containing probabilities
probability = self.codon_probabilities[codon]
prob_lst.append(probability)
# find codon with max probability of encoding and index, then use index to get codon
max_value = max(prob_lst)
max_index = prob_lst.index(max_value)
max_codon = codons[max_index]
if DEBUG6:
print(f'max_codon: {max_codon}')
# append to list being returned
max_codon_lst.append(max_codon)
# convert list to string
max_codon_str = ''.join(max_codon_lst)
return max_codon_str
if __name__ == '__main__':
protein_to_rna = ProteinToMaxRNA()
print(protein_to_rna.convert("LTPIQNRA"))
CUGACCCCCAUCCAGAACAGAGCC
The solution involves creating a class, ProteinToMaxRNA, that translates a protein sequence into its most likely corresponding RNA sequence based on codon usage probabilities. This is done using two key dictionaries: one that maps each amino acid to its possible codons (amino_to_codon), and another that provides the probability of each codon being used (codon_probabilities).
The convert method takes a protein sequence as input and processes each amino acid one by one. For each amino acid, the method retrieves its associated codons and their probabilities from the respective dictionaries. It then determines which codon has the highest probability of encoding the amino acid and appends this codon to a list. After processing all amino acids in the input sequence, the method returns a concatenated string of the selected codons, forming the most likely RNA sequence.
While the current implementation maintains a clear structure, there is some redundancy within the loop iteration. With some refinements it could be made more efficient and concise, as sketched below. However, the code correctly implements the required functionality and provides the correct output for the example input and tests.
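For example, the parallel probability list could be avoided with max() and a key function; a minimal sketch of such a refinement (convert_concise is a hypothetical name), assuming the same two dictionaries:
def convert_concise(amino_to_codon, codon_probabilities, s):
    # For each amino acid, pick the codon with the highest usage probability.
    return "".join(
        max(amino_to_codon[aa], key=lambda c: codon_probabilities[c])
        for aa in s
    )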
Now we are almost ready to produce random RNA sequences that code a given protein sequence. For this, we need a subroutine to sample from a probability distribution. Consider our earlier example of probabilities 36/100, 47/100, and 17/100 for AUU, AUC, and AUA, respectively.
Let us assume we have a random number generator random() that returns a random number from the interval $[0,1)$. We may then partition the unit interval according to cumulative probabilities to $[0,36/100), [36/100,83/100), [83/100,1)$, respectively. Depending on which interval the number random() hits, we select the codon accordingly.
7. Write a function random_event that chooses a random event, given a probability distribution (a set of events whose probabilities sum to 1). You can use the function random.uniform to produce values uniformly at random from the range $[0,1)$. The distribution should be given to your function as a dictionary from events to their probabilities.
DEBUG7 = False # Set to [ True ] to enable debug output, [ False ] to disable
def random_event(dist):
"""
Takes as input a dictionary from events to their probabilities.
Return a random event sampled according to the given distribution.
The probabilities must sum to 1.0
"""
# produce random fraction to use for ascertaining event
random_fract = random.uniform(0, 1)
if DEBUG7:
print(f'dist: {dist}') # {'A': 0.1, 'C': 0.35, 'G': 0.15, 'T': 0.4}
print(f'random_fract: {random_fract}') # 0.355473001581198
# calculate cumulative probabilities
cumulative_values = list(accumulate(dist.values())) # [0.1, 0.45, 0.6, 1.0]
if DEBUG7:
print(f'cumulative_values: {cumulative_values}')
# create a list of tuples (event, cumulative_value)
event_cumval = list(zip(dist.keys(), cumulative_values))
# find the first cumulative value that exceeds random_fract
for event, cumulative_value in event_cumval:
if random_fract < cumulative_value:
if DEBUG7:
print(f'event_cumval: {event_cumval}') # [('A', 0.1), ('C', 0.45), ('G', 0.6), ('T', 1.0)]
cumulative_values_with_random = sorted(cumulative_values + [random_fract])
print(f'cumulative_values_with_random: {cumulative_values_with_random}') # [0.1, 0.355473001581198, 0.45, 0.6, 1.0]
print(f'{RED}return_event {event} {RESET}') # return_event C
print('-----------------------')
# return event key where random value falls below = the value within which range it is
return event
if __name__ == '__main__':
distribution = dict(zip("ACGT", [0.10, 0.35, 0.15, 0.40]))
print(", ".join(random_event(distribution) for _ in range(29)))
G, T, T, C, C, G, C, C, C, T, A, G, C, A, C, T, G, A, T, T, T, C, C, A, T, C, C, C, T
The solution aims to randomly select an event from a given distribution where each event has a specified probability.
First a random fraction between 0 and 1 is generated. This fraction will help determine which event to select based on its position relative to cumulative probabilities.
The cumulative probabilities of the events are calculated from the given distribution. For this purpose, the list of individual event probabilities is transformed into a list where each element represents the sum of probabilities up to and including that event.
Events and their corresponding cumulative values are paired together. This allows us to easily determine which event corresponds to the range that the random fraction falls within.
By comparing the random fraction with the cumulative values, the correct event is selected. The first cumulative value that is greater than the random fraction indicates the event that should be returned.
If debugging is enabled, additional information is printed to help verify the correctness of the sampling process.
The approach uses cumulative probabilities to map a random fraction to a specific event. This ensures that each event is selected in proportion to its specified probability. Debug output is integrated to aid in the verification of the sampling process. It provides visibility into key variables such as the distribution, random fraction, cumulative values, and the selected event.
The code works well with the given example input and the tests. For more general use cases there could be issues, because the code assumes that the input probabilities are perfectly normalized to sum to exactly 1.0. One issue that came up while writing the code was floating-point precision errors in the cumulative sums, where cumulative values would come out as, e.g., 0.4499999999999999 instead of 0.45.
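One way around both concerns is to search the cumulative sums with bisect and let the last event absorb any floating-point slack; a minimal sketch (random_event_bisect is a hypothetical name), assuming probabilities that sum to approximately 1:
import random
from bisect import bisect
from itertools import accumulate

def random_event_bisect(dist):
    events = list(dist)
    cumulative = list(accumulate(dist.values()))
    r = random.uniform(0, 1)
    # bisect finds the first interval whose upper bound exceeds r; min() guards
    # against r landing above a last cumulative sum that is slightly below 1.0.
    return events[min(bisect(cumulative, r), len(events) - 1)]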
With this general routine, the following should be easy to solve.
8. Write a class ProteinToRandomRNA to produce a random RNA sequence encoding the input protein sequence according to the input codon adaptation probabilities. The actual conversion is done through the convert method. Run your program with LTPIQNRA.
DEBUG8 = False # Set to [ True ] to enable debug output, [ False ] to disable
class ProteinToRandomRNA(object):
def __init__(self):
# initialize dictionaries: aminos and their encoding codons, probability of codon encoding amino
self.amino_to_codon = get_dict_list()
self.codon_probabilities = get_probabability_dict()
def convert(self, s):
if DEBUG8:
print(f'amino_to_codon: {self.amino_to_codon}')
print(f'codon_probabilities: {self.codon_probabilities}')
print(f'input string: {s}')
print()
# for each amino in input sequence, randomly select a codon based on its probability
return_lst = []
for c in s:
# get the encoding codons
codons_encoding = self.amino_to_codon[c]
# for each codon fetch the probability
codons_encoding_probability = {c_e: self.codon_probabilities[c_e] for c_e in codons_encoding}
# use random_event function to create random event
res = random_event(codons_encoding_probability)
# append that codon to be returned
return_lst.append(res)
if DEBUG8:
print(f'current amino: {RED}{c}{RESET}')
print(f'codons encoding: {codons_encoding}')
print(f'codon probabilities: {codons_encoding_probability}')
print(f'chosen codon: {res}')
print(f'current rna sequence: {return_lst}')
print()
# join event codons to form string
return_str = ''.join(return_lst)
return return_str
if __name__ == '__main__':
protein_to_random_codons = ProteinToRandomRNA()
print(protein_to_random_codons.convert("LTPIQNRA"))
CUGACCCCUAUACAGAACCGCGCA
The solution involves mapping each amino acid in the input protein sequence to its possible codons using a dictionary (amino_to_codon). Another dictionary (codon_probabilities) provides the probabilities of each codon being used. The program iterates through each amino acid in the protein sequence, retrieves the possible codons and their associated probabilities, and randomly selects a codon using the random_event function. The selected codons are concatenated to form the final RNA sequence.
Debugging statements are included to trace the steps and verify correct function at each stage.
This task was a relatively straightforward combination of the previously created functions and the class structure from the last exercise. The debug print statements attempt to further visualize that, when random_event is called, the right codon is returned based on the random fraction that random_event produced.
The task only required a straightforward implementation, using pre-existing dictionaries for amino mapping and codon probability data. The main challenge was ensuring that each amino acid was correctly converted into a corresponding RNA codon based on its probability distribution. The debug output effectively shows the mapping and selection process, confirming that the random_event function correctly chooses codons according to the provided probabilities.
Future improvements could include optimizing the handling of input data and refining the random selection process to ensure performance across larger datasets.
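Since the conversion is stochastic, repeated runs give different sequences; seeding the module-level RNG (an illustration, not part of the exercise) makes a run reproducible:
if __name__ == '__main__':
    random.seed(0)  # fix the RNG so convert() returns the same sequence on every run
    print(ProteinToRandomRNA().convert("LTPIQNRA"))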
We will now reuse the machinery derived above in a related context. We go back to DNA sequences, and consider some easy statistics that can be used to characterize the sequences. First, just the frequencies of bases $\texttt{A}$, $\texttt{C}$, $\texttt{G}$, $\texttt{T}$ may reveal the species from which the input DNA originates; each species has a different base composition that has been formed during evolution. More interestingly, the areas where DNA to RNA transcription takes place (coding regions) have an excess of $\texttt{C}$ and $\texttt{G}$ over $\texttt{A}$ and $\texttt{T}$. To detect such areas, a common routine is to just use a sliding window of fixed size, say $k$, and compute for each window position $T[i..i+k-1]$ the base frequencies, where $T[1..n]$ is the input DNA sequence. When sliding the window from $T[i..i+k-1]$ to $T[i+1..i+k]$, frequency $f(T[i])$ gets decreased by one and $f(T[i+k])$ gets increased by one.
9. Write a generator sliding_window to compute sliding window base frequencies so that each moving of the window takes constant time. We saw in the beginning of the course one way to create generators using a generator expression. Here we use a different way. For the function sliding_window to be a generator, it must have at least one yield expression, see https://docs.python.org/3/reference/expressions.html#yieldexpr.
Here is an example of a generator function that works similarly to the built-in range generator:
def range(a, b=None, c=1):
current = 0 if b == None else a
end = a if b == None else b
while current < end:
yield current
current += c
A yield expression can be used to return a value and temporarily suspend execution of the function.
DEBUG9 = False # Set to [ True ] to enable debug output, [ False ] to disable
def sliding_window(s, k):
# define all possible nucleotides
nucleotides = 'ACGT'
# if no window to iterate, return
if len(s) < k:
return
    # initial frequency window handled separately
freq = Counter(s[:k])
# add missing nucleotides with a count of 0 using a dictionary comprehension
freq.update({nucleotide: 0 for nucleotide in nucleotides if nucleotide not in freq})
if DEBUG9:
print(f'INITIAL: {freq}')
# yield intial
yield dict(freq)
# deliver subsequent windows sliding through sequences
for i in range(k, len(s)):
        # determine outgoing and incoming index positions
out_char = s[i - k]
in_char = s[i]
if DEBUG9:
print()
print(s)
print(f'{RED}{s[i-k]}{RESET}{s[i+1-k:i]}{YELLOW}{s[i]}{RESET}{s[i+1:]}')
print(f'outgoing: {RED}{out_char}{RESET} (i-k: {i-k})')
print(f'incoming: {YELLOW}{in_char}{RESET} (i : {i})')
# update counter for the outgoing char
freq[out_char] -= 1
if DEBUG9:
print(f'updated_out: {freq}')
# update counter for the incoming char
freq[in_char] += 1
if DEBUG9:
print(f'updated_in : {freq}')
# yield the updated frequency dictionary
yield dict(freq)
if __name__ == '__main__':
#s = "T"
s = "TCCCGACGGCCTTGCC"
for d in sliding_window(s, 4):
print(d)
{'T': 1, 'C': 3, 'A': 0, 'G': 0}
{'T': 0, 'C': 3, 'A': 0, 'G': 1}
{'T': 0, 'C': 2, 'A': 1, 'G': 1}
{'T': 0, 'C': 2, 'A': 1, 'G': 1}
{'T': 0, 'C': 1, 'A': 1, 'G': 2}
{'T': 0, 'C': 1, 'A': 1, 'G': 2}
{'T': 0, 'C': 2, 'A': 0, 'G': 2}
{'T': 0, 'C': 2, 'A': 0, 'G': 2}
{'T': 1, 'C': 2, 'A': 0, 'G': 1}
{'T': 2, 'C': 2, 'A': 0, 'G': 0}
{'T': 2, 'C': 1, 'A': 0, 'G': 1}
{'T': 2, 'C': 1, 'A': 0, 'G': 1}
{'T': 1, 'C': 2, 'A': 0, 'G': 1}
The sliding_window generator function calculates nucleotide frequencies within a sliding window of a specified size over a DNA sequence. The function first initializes a frequency dictionary for the initial window, ensuring all possible nucleotides ('A', 'C', 'G', 'T') are accounted for, even those with zero counts. The initial frequency dictionary is then yielded. As the window slides one position at a time through the sequence, the function updates the counts by decrementing the frequency of the nucleotide that exits the window and incrementing the frequency for the nucleotide that enters the window. This update process is efficient, operating in constant time since only two nucleotide frequencies change per step.
Initially, the tests failed because the frequency dictionaries did not account for all nucleotide bases ('A', 'C', 'G', 'T'), specifically those that had zero frequencies. This was corrected by initializing the frequency dictionary with all nucleotides. The concept of sliding a window and adjusting counts in constant time was somewhat challenging to grasp at first. I initially included loops that would have made each window move take non-constant time, which would have been inefficient. After understanding how to adjust only the counts of the nucleotides entering and exiting the window, the function now correctly maintains constant time for each window move.
The debug print statements are included to visually demonstrate the changes in nucleotide frequencies as the window slides across the DNA sequence. This helped confirm the correctness of the solution and provided insight into the algorithm's step-by-step operations.
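As a usage sketch tying this back to the coding-region motivation above, the window frequencies translate directly into per-window GC content (gc_fractions is a hypothetical helper, not part of the exercise):
def gc_fractions(s, k):
    # Fraction of C and G in each length-k window, via the sliding_window generator.
    for freq in sliding_window(s, k):
        yield (freq['C'] + freq['G']) / k

if __name__ == '__main__':
    print([round(x, 2) for x in gc_fractions("TCCCGACGGCCTTGCC", 4)])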
Our models so far have been so-called zero-order models, as each event has been independent of other events. With sequences, the dependencies of events are naturally encoded by their contexts. Considering that a sequence is produced from left to right, a first-order context for $T[i]$ is $T[i-1]$, that is, the immediately preceding symbol. A first-order Markov chain is a sequence produced by generating $c=T[i]$ with the probability of the event of seeing symbol $c$ after the previously generated symbol $a=T[i-1]$. The first symbol of the chain is sampled according to the zero-order model.
The first-order model can naturally be extended to contexts of length $k$, with $T[i]$ depending on $T[i-k..i-1]$. Then the first $k$ symbols of the chain are sampled according to the zero-order model. The following assignments develop the routines to work with the higher-order Markov chains.
In what follows, a $k$-mer is a substring $T[i..i+k-1]$ of the sequence at an arbitrary position.
10. Write function context_list that given an input DNA sequence $T$ associates to each $k$-mer $W$ the concatenation of all symbols $c$ that appear after context $W$ in $T$, that is, $T[i..i+k]=Wc$. For example, GA is associated to TCT in $T$=ATGATATCATCGACGATGTAG, when $k=2$.
DEBUG10 = False # Set to [ True ] to enable debug output, [ False ] to disable
def context_list(s, k):
context_dict = {}
    # iterate over sequence, processing chunks of length k (k-mer) and the following context
for i in range(len(s)-k):
if DEBUG10:
print(f'{s[:i]}{YELLOW}{s[i:i+k]}{RED}{s[i+k]}{RESET}{s[i+k+1:]}')
# if k-mer exists in dictionary, append next letter in sequence, to the string representing its context
if s[i:i+k] in context_dict:
context_dict[s[i:i+k]] += s[i+k]
# else set next letter as context
else:
context_dict[s[i:i+k]] = s[i+k]
if DEBUG10:
print(f'{YELLOW}{s[i:i+k]}{RESET} has context {RED}{context_dict[s[i:i+k]]}{RESET}')
print('-------------')
# return the dictionary containing k-mers and their contexts
return context_dict
if __name__ == '__main__':
k = 2
s = "ATGATATCATCGACGATCTAG"
d = context_list(s, k)
print(d)
{'AT': 'GACCC', 'TG': 'A', 'GA': 'TCT', 'TA': 'TG', 'TC': 'AGT', 'CA': 'T', 'CG': 'AA', 'AC': 'G', 'CT': 'A'}
The solution uses the sliding window approach to examine each k-mer (substring of length k) within the given sequence s.
This approach efficiently builds a dictionary of k-mer contexts using simple string slicing and dictionary operations.
The implementation of the context_list function is straightforward and effectively solves the problem by utilizing a simple loop to manage the sliding window across the string s. By directly manipulating the dictionary to store or update the context of each k-mer, the code avoids unnecessary complexity and remains readable.
Debug output allows for step-by-step tracing of the program’s operation.
The solution is again quite barebones, in that it doesn't check extensively for edge cases or errors. For example, the function does not explicitly handle cases where the length of the string s is less than k. In such cases, the function returns an empty dictionary, since there are no valid k-mers to process. Although that is the correct behaviour in this instance, it is an example of how edge cases could affect the program's behaviour.
11. With the above solution, write function context_probabilities to count the frequencies of symbols in each context and convert these frequencies into probabilities. Run context_probabilities with $T=$ ATGATATCATCGACGATGTAG and $k$ values 0 and 2.
DEBUG11 = False # Set to [ True ] to enable debug output, [ False ] to disable
def context_probabilities(s, k):
# get context using provided sequence string and k-value
context_dict = context_list(s, k)
# initialize return dictionary
context_prob_dict = {}
if DEBUG11:
print(f'context_dict: {context_dict} \n')
for key in context_dict:
        # for each of the kmers in dict get context characters into separate freq_dict (i.e. 'AT': 'GACCG' -> {'G': 2, 'A': 1, 'C': 2})
chars = context_dict[key]
freq_dict = {}
for char in chars:
if char not in freq_dict:
freq_dict[char] = 1
else:
freq_dict[char] += 1
# get the sum of frequencies (ie. {'G': 2, 'A': 1, 'C': 2} -> 5)
total_freq = sum(freq_dict.values())
# using dictionary comprehension calculate each value as a percentage of the sum number (ie. {'G': 2, 'A': 1, 'C': 2} -> {'G': 2/5, 'A': 1/5, 'C': 2/5} -> {'G': 0.4, 'A': 0.2, 'C': 0.4})
context_prob_dict[key] = {char: freq / total_freq for char, freq in freq_dict.items()}
if DEBUG11:
print(f'key: {L_ORANGE}{key}{RESET}')
print(f'freq_dict : {freq_dict}')
print(f'total_freq : {total_freq}')
print(f'context_prob : {context_prob_dict[key]}')
print('-------------')
# return above kmer key and above percentage dict
return context_prob_dict
if __name__ == '__main__':
s = "ATGATATCATCGACGATGTAG"
k = 2
print(context_probabilities(s, k))
k = 0
print(context_probabilities(s, k))
{'AT': {'G': 0.4, 'A': 0.2, 'C': 0.4}, 'TG': {'A': 0.5, 'T': 0.5}, 'GA': {'T': 0.6666666666666666, 'C': 0.3333333333333333}, 'TA': {'T': 0.5, 'G': 0.5}, 'TC': {'A': 0.5, 'G': 0.5}, 'CA': {'T': 1.0}, 'CG': {'A': 1.0}, 'AC': {'G': 1.0}, 'GT': {'A': 1.0}}
{'': {'A': 0.3333333333333333, 'T': 0.2857142857142857, 'G': 0.23809523809523808, 'C': 0.14285714285714285}}
The solution creates a dictionary of k-mers (substrings of length k) and the probabilities of subsequent characters, based on a given sequence. The function context_probabilities returns a dictionary which provides a probabilistic model of the sequence's structure based on k-mer contexts.
This exercise extends the logic of the previous exercise by calculating the probabilities of each character following specific k-mers. The implementation is straightforward and reuses previous logic with slight modifications. The debug output matches the expected results, demonstrating that the function correctly calculates probabilities. Further enhancements could include handling more edge cases, such as when k is larger than the length of s.
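For comparison, the per-context counting could also be done with collections.Counter; a minimal sketch (hypothetical name, equivalent in output to context_probabilities above):
from collections import Counter

def context_probabilities_counter(s, k):
    result = {}
    for kmer, chars in context_list(s, k).items():
        counts = Counter(chars)          # symbol frequencies in this context
        total = sum(counts.values())
        result[kmer] = {c: n / total for c, n in counts.items()}
    return result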
12. With the above solution and the function random_event from the earlier exercise, write class MarkovChain. Its generate method should generate a random DNA sequence following the original $k$-th order Markov chain probabilities.
DEBUG12 = False # Set to [ True ] to enable debug output, [ False ] to disable
class MarkovChain:
def __init__(self, zeroth, kth, k=2):
self.k = k
self.zeroth = zeroth
self.kth = kth
def generate(self, n, seed=None):
# set the seed for random number generation to ensure reproducibility of results
if seed is not None:
random.seed(seed)
if DEBUG12:
print('GENERATE INITIAL CHARACTERS')
print(f'zeroth: {self.zeroth}')
print(f'k: {self.k}')
# generate characters upto len k, using zeroth dict probabilities
gen_chars = ""
i = 0
for i in range(self.k):
initial = random_event(self.zeroth)
if DEBUG12:
print(f'i: {i}, initial: {initial}')
gen_chars += initial
if DEBUG12:
print(f'gen_chars: {gen_chars}')
print('\nGENERATE REMAINING CHARACTERS')
print(f'kth: {self.kth}')
# generate the remaining n - k characters, using kth dict probabilities
while len(gen_chars) < n:
# to pass the test "TestGenerateMarkov: test_length" we need to ensure that we account for cases where k-mer does not exist in kth (ie. generate: C generate: C ; but there is no CC in kth)
prev_k_chars = gen_chars[-self.k:]
if DEBUG12:
print(f'gen_chars : {gen_chars}')
print(f'gen_chars[-self.k:]: {gen_chars[-self.k:]}')
# if we find the previous k-mer in kth, add random generated char to sequence using kth/k-mer
gen_char = ''
if prev_k_chars in self.kth:
gen_char = random_event(self.kth[prev_k_chars])
gen_chars += gen_char
else:
# if the k-mer does not exist in kth, randomly select a value using zeroth
fallback_char = random_event(self.zeroth)
gen_chars += fallback_char
if DEBUG12:
if gen_char:
print(f'gen_char : {gen_char}')
print('----------')
else:
print(f'fallback_char : {fallback_char}')
print('-----------------------')
return gen_chars[:n]
if __name__ == '__main__':
zeroth = {'A': 0.2, 'C': 0.19, 'T': 0.31, 'G': 0.3}
kth = {'GT': {'A': 1.0, 'C': 0.0, 'T': 0.0, 'G': 0.0},
'CA': {'A': 0.0, 'C': 0.0, 'T': 1.0, 'G': 0.0},
'TC': {'A': 0.5, 'C': 0.0, 'T': 0.0, 'G': 0.5},
'GA': {'A': 0.0, 'C': 0.3333333333333333, 'T': 0.6666666666666666, 'G': 0.0},
'TG': {'A': 0.5, 'C': 0.0, 'T': 0.5, 'G': 0.0},
'AT': {'A': 0.2, 'C': 0.4, 'T': 0.0, 'G': 0.4},
'TA': {'A': 0.0, 'C': 0.0, 'T': 0.5, 'G': 0.5},
'AC': {'A': 0.0, 'C': 0.0, 'T': 0.0, 'G': 1.0},
'CG': {'A': 1.0, 'C': 0.0, 'T': 0.0, 'G': 0.0}}
n = 10
seed = 0
mc = MarkovChain(zeroth, kth)
print(mc.generate(n, seed))
GGTAGTATCG
The solution has a class MarkovChain that generates a random DNA sequence based on the given Markov chain probabilities.
The class is initialized with a zeroth probability dictionary (zeroth) and a k-th order transition probability dictionary (kth).
The generate method is the core of the program. It starts by optionally setting a random seed for reproducibility, ensuring that the sequence generation can be replicated.
The first k characters of the sequence are generated using the zeroth-order probabilities, as the Markov chain cannot determine transitions without an initial state.
After initializing the first k characters, the method generates the remaining characters using k-th order probabilities.
It does this by looking at the last k characters (the current k-mer) and choosing the next character based on the transition probabilities defined in kth.
If a k-mer does not exist in kth, the algorithm falls back to the zeroth-order probabilities, ensuring that the sequence generation continues.
Initially, the problem seemed straightforward, but the tests made it more challenging. The requirement to handle missing k-mers in the kth dictionary was not immediately obvious from the task description. The approach taken uses a while loop to manage sequence generation which ensures flexibility. The use of a fallback mechanism for missing k-mers makes the solution robust against incomplete transition matrices.
Although not strictly related to the functioning of the solution, a potential improvement could be making the debug statements more concise. They make the code more verbose, but I find the output helpful for visualizing the sequence of events.
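The fallback path is easy to exercise with a deliberately incomplete transition dictionary; in this toy sketch (hypothetical inputs) most contexts are missing from kth, so generation repeatedly falls back to the zeroth-order model:
if __name__ == '__main__':
    toy_zeroth = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    toy_kth = {'AC': {'A': 0.0, 'C': 1.0, 'G': 0.0, 'T': 0.0}}  # e.g. 'CC' is absent
    print(MarkovChain(toy_zeroth, toy_kth, k=2).generate(10, seed=1))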
If you have survived so far without problems, please run your program a few more times with different inputs. At some point you should get a lookup error in your hash table! The reason for this is not your code, but the way we defined the model: some $k$-mers may not be among the training data (input sequence $T$), yet such a $k$-mer can be generated as the first $k$-mer, which is sampled using the zero-order model.
A general approach to fixing such issues with incomplete training data is to use pseudo counts. That is, all imaginable events are initialized to frequency count 1.
13. Write a new solution context_pseudo_probabilities based on the solution to problem 11. But this time use pseudo counts in order to obtain a $k$-th order Markov chain that can assign a probability to any DNA sequence. You may use the standard library function itertools.product to iterate over all $k$-mers of a given length (product("ACGT", repeat=k)).
DEBUG13 = False # Set to [ True ] to enable debug output, [ False ] to disable
def context_pseudo_probabilities(s, k):
nucleotides = "ACGT"
pseudo_count = 1 # pseudo-count value to add
# handle k == 0
if k == 0:
# add dict with pseudo_count value for each nucleotide
counts_dict = {nucleotide: pseudo_count for nucleotide in nucleotides}
if DEBUG13:
print(f'counts_dict (k==0): {counts_dict}')
# increment value for those nucleotides that exist in sequence
for nucleotide in s:
if nucleotide in counts_dict:
counts_dict[nucleotide] += 1
# calculate probabilities
total_count = sum(counts_dict.values())
zeroth_pseudo_probabilities = {nucleotide: count / total_count for nucleotide, count in counts_dict.items()}
if DEBUG13:
print(f'count incremented : {s}')
print(f'counts_dict (k==0): {counts_dict}')
print()
return {"": zeroth_pseudo_probabilities}
# handle k > 0
else:
# generate all possible k-mers
kmers = [''.join(kmer) for kmer in product(nucleotides, repeat=k)]
# initialize k-mers with pseudo-count
counts_dict = {kmer: {nucleotide: pseudo_count for nucleotide in nucleotides} for kmer in kmers}
if DEBUG13:
print(f'\ncounts_dict (k>0): {counts_dict}')
# update counts based on the given sequence
for i in range(len(s) - k):
kmer = s[i:i+k]
next_nucleotide = s[i+k]
if kmer in counts_dict:
counts_dict[kmer][next_nucleotide] += 1
# calculate probabilities
kth_pseudo_probabilities = {}
for kmer, nucleotide_counts in counts_dict.items():
total = sum(nucleotide_counts.values())
kth_pseudo_probabilities[kmer] = {nucleotide: count / total for nucleotide, count in nucleotide_counts.items()}
if DEBUG13:
print(f'count incremented: {s}')
print(f'counts_dict (k>0): {counts_dict}')
print(f'sum values count to divide with: {total}')
print(f'\nkth:')
return kth_pseudo_probabilities
if __name__ == '__main__':
k = 2
s = "ATGATATCATCGACGATGTAG" #"ATG"
zeroth = context_pseudo_probabilities(s, 0)[""]
print(f"zeroth: {zeroth}")
kth = context_pseudo_probabilities(s, k)
print("\n".join(f"{k}: {dict(v)}" for k, v in kth.items()))
print("\n", MarkovChain(zeroth, kth, k).generate(20))
zeroth: {'A': 0.32, 'C': 0.16, 'G': 0.24, 'T': 0.28}
AA: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AC: {'A': 0.2, 'C': 0.2, 'G': 0.4, 'T': 0.2}
AG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AT: {'A': 0.2222222222222222, 'C': 0.3333333333333333, 'G': 0.3333333333333333, 'T': 0.1111111111111111}
CA: {'A': 0.2, 'C': 0.2, 'G': 0.2, 'T': 0.4}
CC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
CG: {'A': 0.5, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.16666666666666666}
CT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GA: {'A': 0.14285714285714285, 'C': 0.2857142857142857, 'G': 0.14285714285714285, 'T': 0.42857142857142855}
GC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GT: {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2}
TA: {'A': 0.16666666666666666, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.3333333333333333}
TC: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.16666666666666666}
TG: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.3333333333333333}
TT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

 TGATGATTTTCGTGCAGGTT
The task requires implementing a pseudo-count based context probability model to handle unseen k-mers in a DNA sequence. The code has different approaches based on the value of k:
k = 0: each nucleotide count is initialized to the pseudo-count 1 and then incremented for every occurrence in the sequence; the counts are finally normalized into zeroth-order probabilities.
k > 0: every possible k-mer is generated with itertools.product and initialized with pseudo-counts for all four following nucleotides; the observed transitions increment these counts before normalization.
This approach ensures that all possible k-mers have a non-zero probability, even if they were not observed in the training data.
Initially, the task was difficult for me to complete, as I wasn't sure how to proceed or how to satisfy all the tests. I tried using earlier functions as a base for completing this task, but all such implementations quickly became overly complex.
The solution addresses the issue of unseen k-mers by initializing all possible k-mers with a pseudo-count of 1. This approach prevents zero probabilities for k-mers not present in the sequence. The results from the provided example demonstrate that the function correctly computes both zeroth and kth order probabilities. Further testing and handling for edge cases could probably still improve the code.
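The effect of the pseudo counts is visible directly in the output above: the 2-mer AA never occurs in the training sequence, yet it receives the uniform distribution instead of causing a lookup error:
if __name__ == '__main__':
    kth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", 2)
    print(kth["AA"])  # {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}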
14. Write class MarkovProb that, given the $k$-th order Markov chain developed above to the constructor, its method probability computes the probability of a given input DNA sequence.
DEBUG14 = False # Set to [ True ] to enable debug output, [ False ] to disable
class MarkovProb:
def __init__(self, k, zeroth, kth):
self.k = k
self.zeroth = zeroth
self.kth = kth
def probability(self, s):
if DEBUG14:
print(f'zeroth {self.zeroth}')
print(f'kth {self.kth}')
print(f'input sequence: {s}')
print()
#handle k == 0
if self.k == 0:
# Initialize the cumulative probability to 1
prob = 1.0
# Create list containing probability for each nucleotide in the sequence
probs = [self.zeroth[c] for c in s]
# Calculate the cumulative product of all probabilities
prob = reduce(lambda x, y: x * y, probs, 1.0)
if DEBUG14:
print(f'zeroth probabilities of sequences individual neucleotides, placed in list: {probs}')
print(f'cumulative product of all the values in the list: {prob}')
# return product of the probabilities
return(prob)
# handle k > 0
else:
# initialize the cumulative probability to 1
prob = 1.0
# calculate the probability of forming initial k-mer
initial_kmer = s[:self.k]
initial_prob = reduce(lambda x, y: x * y, [self.zeroth.get(c, 0) for c in initial_kmer], 1.0)
if DEBUG14:
print(f'i-mer: {initial_kmer} | | probability {initial_prob}')
# combine the initial k-mer probability with the transition probabilities
prob *= initial_prob
# Calculate the transition probabilities for the rest of the sequence
for i in range(len(s) - self.k):
kmer = s[i:i+self.k]
next_nucleotide = s[i+self.k]
if kmer in self.kth:
prob *= self.kth[kmer][next_nucleotide]
if DEBUG14:
print(f'k-mer: {kmer} | nucleotide {next_nucleotide} | probability {prob}')
# return product of the probabilities
return prob
if __name__ == '__main__':
k = 2
kth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", k)
zeroth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", 0)[""]
mc = MarkovProb(2, zeroth, kth)
#test with k = 0
#mc = MarkovProb(0, zeroth, kth)
s="ATGATATCATCGACGATGTAG"
print(f"Probability of sequence {s} is {mc.probability(s)}")
Probability of sequence ATGATATCATCGACGATGTAG is 2.831270190340017e-10
The solution uses class MarkovProb to calculate the probability of a DNA sequence based on a given Markov model. Again, the code has different approaches based on the value of k:
For k = 0: The method iterates through each nucleotide in the sequence, fetches its probability from the zeroth-order probabilities, adds it to a list, and computes the product of these probabilities.
For k > 0: The method starts by calculating the probability of the initial k-mer based on individual nucleotide probabilities. It multiplies this into the overall probability value; then it iteratively calculates the transition probabilities for each subsequent nucleotide following the k-mers. The total probability is the product of the initial k-mer probability and all transition probabilities.
During implementation, an issue was encountered for cases where k > 0. Initially the computation didn't correctly account for the initial k-mer's probability, leading to incorrect results. This was corrected by separately calculating the probability of the initial k-mer and then using it as the base into which the subsequent transition probabilities are multiplied.
The use of functional constructs helped make the code more concise and readable. However, this also introduces a potential risk for debugging since these constructs can obscure the detailed flow of calculations. I tried to find a balance between these in the code.
Overall, the code works correctly for the given inputs and tests while also being relatively optimized. More comprehensive testing across various sequences, k-values, and probabilities could further increase robustness.
With the last assignment you might end up in trouble with precision, as multiplying many small probabilities gives a really small number in the end. There is an easy fix by using the so-called log-transform. Consider computation of $P=s_1 s_2 \cdots s_n$, where $0\leq s_i\leq 1$ for each $i$. Taking logarithm in base 2 on both sides gives $\log_2 P = \log_2 (s_1 s_2 \cdots s_n) = \log_2 s_1 + \log_2 s_2 + \cdots + \log_2 s_n = \sum_{i=1}^n \log_2 s_i$, with repeated application of the property that the logarithm of a product of two numbers is the sum of the logarithms of the two numbers taken separately. The result is abbreviated as log-probability.
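A quick numeric check of this identity, using the isoleucine probabilities from earlier (math.prod requires Python 3.8+):
import math

probs = [0.36, 0.47, 0.17]
print(math.prod(probs))                       # ~0.028764, direct product
print(2 ** sum(math.log2(p) for p in probs))  # same value, recovered from the log-probability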
15. Write class MarkovLog that, given the $k$-th order Markov chain developed above to the constructor, its method log_probability computes the log-probability of a given input DNA sequence. Run your program with $T=$ ATGATATCATCGACGATGTAG and $k=2$.
DEBUG15 = False # Set to [ True ] to enable debug output, [ False ] to disable
class MarkovLog(object):
def __init__(self, k, zeroth, kth):
self.k = k
self.zeroth = zeroth
self.kth = kth
def log_probability(self, s):
if DEBUG15:
print(f's = {YELLOW}{s}{RESET}')
print(f'k = {self.k}')
print(f'zeroth = {self.zeroth}')
print(f'kth = {self.kth}')
print()
# initialize probability to log space (log(1) = 0)
log_prob = 0.0
# calculate the log-probability of forming initial k-mer
initial_kmer = s[:self.k]
initial_log_prob = reduce(lambda x, y: x + math.log2(self.zeroth[y]), initial_kmer, 0)
if DEBUG15:
print(f'initial kmer: {initial_kmer} | log probability: {initial_log_prob}')
# calculate log transition probabilities
for i in range(len(s) - self.k):
kmer = s[i:i+self.k]
next_nucleotide = s[i+self.k]
if kmer in self.kth:
log_prob += math.log2(self.kth[kmer][next_nucleotide])
if DEBUG15:
print(f'k-mer: {YELLOW}{kmer}{RESET} | nucleotide {YELLOW}{next_nucleotide}{RESET} | transition probability: {self.kth[kmer][next_nucleotide]:.2f} | transition LOG probability: {math.log2(self.kth[kmer][next_nucleotide]):.2f} | total transition LOG probability {log_prob:.2f}')
# combine the initial k-mer probability with the transition probabilities
total_log_prob = initial_log_prob + log_prob
total_log_prob = round(total_log_prob, 5)
if DEBUG15:
print(f'initial log probability: {initial_log_prob} | total transition log probability: {log_prob} | final log probability: {total_log_prob}')
print()
return(total_log_prob)
if __name__ == '__main__':
k = 2
kth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", k)
zeroth = context_pseudo_probabilities("ATGATATCATCGACGATGTAG", 0)[""]
mc = MarkovLog(2, zeroth, kth)
#mc = MarkovLog(0, zeroth, kth)
s="ATGATATCATCGACGATGTAG"
print(f"Log probability of sequence {s} is {mc.log_probability(s)}")
Log probability of sequence ATGATATCATCGACGATGTAG is -31.71783
The goal of this solution is to calculate the log-probability of a given DNA sequence based on the Markov model. Here's how the solution works:
Initialize the Log Probability: Start with a log-probability of 0.0, which is equivalent to a probability of 1 in log space.
Compute Initial k-mer Log-Probability: The probability of observing the initial k-mer (the first k characters of the sequence) is computed by summing the log2 probabilities of each character's occurrence, based on the zeroth-order probabilities.
Iteratively Calculate Transition Log-Probabilities: For each subsequent nucleotide in the sequence, compute the conditional log-probability given the preceding k-mer using the k-th order probabilities. Add these log-probabilities to the running total.
Combine Results: The total log-probability is obtained by summing the initial k-mer log-probability with the sum of transition log-probabilities. The final log-probability is rounded and returned as the result.
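As a sanity check on the example run above, the reported log-probability of -31.71783 corresponds to a plain probability of $2^{-31.71783} \approx 2.8 \times 10^{-10}$; for a 21-symbol sequence the direct product is still representable, but it is easy to see how longer inputs would drive it into underflow territory.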
The program follows a similar pattern as the last task, now calculating the log probability. It performs well with the given example sequence ATGATATCATCGACGATGTAG and k=2. The calculated log-probability matches expected values based on the provided zeroth and k-th order probabilities. It also passes all the tests.
The main problem I had with this task was realizing that I should use log2 to get the correct results, which led to quite a lot of debugging, although it was clear on reading the task again. A difference compared to the previous task is that there isn't any code separately handling input where k is 0. It didn't appear to be a requirement from the task description and the tests passed without it, so to keep the code more concise I left that part out.
Finally, if you try to use the code so far on very large inputs, you might observe that the concatenations of symbols following a context occupy a considerable amount of space. This is unnecessary, as we only need the frequencies.
16. Optimize the space requirement of your code from exercise 13 for the $k$-th order Markov chain by replacing the concatenations by direct computations of the frequencies. Implement this as the
better_context_probabilities
function.
DEBUG16 = False # Set to [ True ] to enable debug output, [ False ] to disable
def better_context_probabilities(s, k):
# initialize variables
nucleotides = "ACGT"
pseudo_count = 1
# handle k == 0
if k == 0:
        counts_dict = defaultdict(lambda: pseudo_count) # SPACE IMPROVEMENT
if DEBUG16:
print(f's: {s}')
print(f'counts_dict pseudo count: {counts_dict}')
for nucleotide in s:
counts_dict[nucleotide] += 1
if DEBUG16:
print(f'counts_dict after increment: {counts_dict}')
total = sum(counts_dict.values())
return {"": {n: counts_dict[n] / total for n in nucleotides}}
# initialize counts for all possible k-mers
counts_dict = {
''.join(kmer): {n: pseudo_count for n in nucleotides} # SPACE IMPROVEMENT
for kmer in product(nucleotides, repeat=k)
}
if DEBUG16:
print(f'counts_dict pseudo count: {counts_dict}')
# count occurrences in the sequence
for i in range(len(s) - k):
kmer = s[i:i+k]
next_nucleotide = s[i+k]
counts_dict[kmer][next_nucleotide] += 1
if DEBUG16:
print(f'counts_dict after increment: {counts_dict}')
# calculate probabilities
probabilities = {}
for kmer, nucleotide_counts in counts_dict.items():
total = sum(nucleotide_counts.values())
probabilities[kmer] = {n: count / total for n, count in nucleotide_counts.items()}
return probabilities
if __name__ == '__main__':
k = 2
# test k == 0
#k = 0
s = "ATGATATCATCGACGATGTAG"
d = better_context_probabilities(s, k)
print("\n".join(f"{k}: {v}" for k, v in d.items()))
AA: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AC: {'A': 0.2, 'C': 0.2, 'G': 0.4, 'T': 0.2}
AG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
AT: {'A': 0.2222222222222222, 'C': 0.3333333333333333, 'G': 0.3333333333333333, 'T': 0.1111111111111111}
CA: {'A': 0.2, 'C': 0.2, 'G': 0.2, 'T': 0.4}
CC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
CG: {'A': 0.5, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.16666666666666666}
CT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GA: {'A': 0.14285714285714285, 'C': 0.2857142857142857, 'G': 0.14285714285714285, 'T': 0.42857142857142855}
GC: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GG: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
GT: {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2}
TA: {'A': 0.16666666666666666, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.3333333333333333}
TC: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.3333333333333333, 'T': 0.16666666666666666}
TG: {'A': 0.3333333333333333, 'C': 0.16666666666666666, 'G': 0.16666666666666666, 'T': 0.3333333333333333}
TT: {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
The test already passed with my original code from task 13. In that implementation (context_dict = context_list(s, k)), the concatenation of symbols into a string denoting the full context is not performed (as it is in exercise 11). In light of this, the solution contains some minor improvements, marked by the SPACE IMPROVEMENT comments above and detailed further below.
ORIGINAL: the symbols following each context were collected explicitly, as in the exercise 11 approach.
OPTIMIZED: for k == 0, counts_dict = defaultdict(lambda: pseudo_count) keeps a lazily created integer counter per nucleotide instead of any collected symbols.
BENEFIT: at most four counters are held in memory, never the observed symbols themselves.
ORIGINAL: each context's distribution was computed from its collected follower symbols.
OPTIMIZED: for k > 0, every possible k-mer maps directly to per-nucleotide counts initialized at the pseudo count, and the single scan over s merely increments these counters.
BENEFIT: space is bounded by the 4^k contexts times four counters, independent of the length of s.
As mentioned above, I wasn't quite sure how to proceed with this task, as the original implementation already passed the tests. I think there would have been a wider range of solutions to consider if initializing pseudo-counts had not also been necessary. As the task and the tests were set out, however, I felt somewhat restricted in making significant changes and settled for a few minor improvements. The changes ensure that the solution optimizes memory usage for both the k == 0 and k > 0 cases by directly computing frequencies instead of storing extensive intermediate data.
While the earlier approach of explicitly concatenating the symbols that follow a context suffered from inefficient use of space, it has the benefit of suggesting another, much simpler sampling strategy: an element of the concatenation picked uniformly at random is sampled with exactly the correct probability.
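A small aside sketching this idea (the variable names here are illustrative, not part of the exercise scaffold): for the context AT in the example string, a uniform pick from the list of observed followers reproduces the empirical transition probabilities.
from collections import Counter
from numpy.random import choice

s = "ATGATATCATCGACGATGTAG"
k = 2
# one list entry per occurrence of the context "AT"
followers = [s[i + k] for i in range(len(s) - k) if s[i:i + k] == "AT"]
print(followers)                  # ['G', 'A', 'C', 'C', 'G']
print(Counter(followers))         # G and C twice each, A once
# a uniform pick selects each symbol with probability count/len(followers),
# i.e. exactly the empirical transition probability of that symbol after "AT"
print(choice(followers))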
17. Revisit the solution 12 and modify it to directly sample from the concatenation of symbols following a context. The function np.random.choice
may be convenient here. Implement the modified version as the new SimpleMarkovChain
class.
DEBUG17 = False # Set to [ True ] to enable debug output, [ False ] to disable
class SimpleMarkovChain(object):
def __init__(self, s, k):
self.k = k
self.s = s
def generate(self, n, seed=None):
if DEBUG17:
print(f's: {self.s}')
print(f'n: {n}')
print(f'k: {self.k}')
# initialize seed
if seed is not None:
np.random.seed(seed)
gen_chars = ""
initial_lst = list(self.s)
# handle k = 0 as a special case
if self.k == 0:
gen_chars = ''.join(np.random.choice(initial_lst, n))
if DEBUG17:
print(f'gen_chars: {gen_chars}')
return gen_chars
# generate initial k characters
gen_chars = ''.join(np.random.choice(initial_lst, self.k))
if DEBUG17:
print(f'initial gen_chars: {gen_chars}')
# generate transition list
transition_list = [self.s[i:i+self.k+1] for i in range(len(self.s) - self.k)]
if DEBUG17:
print(f'transition_list: {transition_list}') # ATGATATCATCGACGATGTAG -> ['ATG', 'TGA', 'GAT', 'ATA', 'TAT', 'ATC', 'TCA', 'CAT', 'ATC', 'TCG', 'CGA', 'GAC'....
print('------------------')
        # loop through the sequence, for each k-mer choosing among the possible transitions
        # if no transitions are possible, choose randomly from the initial characters
while len(gen_chars) < n:
            # extract kmer
kmer = gen_chars[-self.k:]
if DEBUG17:
print(f'kmer: {kmer}')
# find all possible next characters for kmer
transition_chars = [t[-1] for t in transition_list if t.startswith(kmer)]
if DEBUG17:
print(f'transition_chars: {transition_chars}')
            # if the last generated kmer exists in the original transition list, pick from there
# otherwise use fallback
# add generated transition character to sequence of all generated characters
if transition_chars:
gen_char = np.random.choice(transition_chars)
if DEBUG17:
print(f'gen_char: {gen_char}')
gen_chars += gen_char
if DEBUG17:
print(f'gen_chars: {gen_chars}')
print('------------------')
else:
fallback_char = np.random.choice(initial_lst)
if DEBUG17:
print(f'fallback_char: {fallback_char}')
gen_chars += fallback_char
if DEBUG17:
print(f'gen_chars: {gen_chars}')
print('------------------')
return gen_chars[:n]
if __name__ == '__main__':
k = 2
# test k == 0
#k = 0
s = "ATGATATCATCGACGATGTAG"
n = 10
seed = 7
#seed = 6
mc = SimpleMarkovChain(s, k)
print(mc.generate(n, seed))
'''
s="ATGATATCATCGACGATGTAG"
seed=0
for n in range(40):
for k in range(4):
mc = SimpleMarkovChain(s, k)
s2 = mc.generate(n, seed)
print(f's2: {s2}')
'''
ATATCGATAT
The solution to this task is based on directly using the concatenation of symbols following a given context to sample from the distribution efficiently.
It begins by initializing the random seed using np.random.seed(seed) to ensure reproducibility.
For the special case when k = 0, the context length is zero, which means there are no preceding characters to consider. In this case, the solution generates n random characters directly from the input string s. This is achieved by uniformly sampling characters using np.random.choice.
When k > 0, the solution starts by generating an initial string of k random characters from the input.
It then constructs a list of all possible k+1-length substrings (transition_list) from the input string s. These substrings represent all possible transitions, where each transition is a k-mer followed by a single character that can follow that k-mer.
The core of the solution is a loop that extends the generated sequence character by character.
By directly sampling from the list of possible transitions rather than maintaining a complex probability distribution or dictionary, the solution efficiently generates sequences according to the specified Markov chain order.
The solution correctly handles both the case where k = 0 and cases where k > 0, generating sequences based on the input string's statistical properties. For k = 0, the output is a random sequence of characters from s, which matches expectations since there is no dependency on previous characters.
For k > 0, the generated sequence reflects the input string's character transitions for the specified k-mer length. The debug output shows the step-by-step generation process, including the construction of the transition list and the selection of each character. This helps ensure the logic aligns with the Markov chain principles.
During development, one of the challenges was handling cases where no valid transitions were found for a given k-mer, which could lead to an empty transition list and errors. The current solution mitigates this by including a fallback mechanism that samples from the original input string, thus ensuring that the generation process continues regardless of the transition list content.
Our $k$-th order Markov chain can now be modified into a handy index structure called a $k$-mer index. This index structure associates to each $k$-mer its list of occurrence positions in the DNA sequence $T$. Given a query $k$-mer $W$, one can thus easily list all positions $i$ with $T[i..i+k-1]=W$.
18. Implement function kmer_index
inspired by your earlier code for the $k$-th order Markov chain. Test your program with ATGATATCATCGACGATGTAG
and $k=2$.
DEBUG18 = False # Set to [ True ] to enable debug output, [ False ] to disable
def kmer_index(s, k):
kmer_indices_dict = {}
# handle k = 0 as a special case
if k == 0:
zeroth_indices_dict = {}
zeroth_indices_dict[''] = list(range(len(s)+1))
return zeroth_indices_dict
# iterate over all possible k-mer starting positions
for i in range(len(s) - k + 1):
# extract kmer
kmer = s[i:i + k]
# append the index of kmer first character
if kmer not in kmer_indices_dict:
kmer_indices_dict[kmer] = []
kmer_indices_dict[kmer].append(i)
if DEBUG18:
print(f'Final k-mer index dictionary: {kmer_indices_dict}')
print("Indices collected for each k-mer:")
for kmer, indices in kmer_indices_dict.items():
print(f'K-mer: {YELLOW}{kmer}{RESET}, Indices: {YELLOW}{indices}{RESET}')
return kmer_indices_dict
if __name__ == '__main__':
k=2
# test k == 0
#k=0
s = "ATGATATCATCGACGATGTAG"
print("Using string:")
print(s)
print("".join([str(i%10) for i in range(len(s))]))
print(f"\n{k}-mer index is:")
d=kmer_index(s, k)
print(dict(d))
Using string:
ATGATATCATCGACGATGTAG
012345678901234567890

2-mer index is:
{'AT': [0, 3, 5, 8, 15], 'TG': [1, 16], 'GA': [2, 11, 14], 'TA': [4, 18], 'TC': [6, 9], 'CA': [7], 'CG': [10, 13], 'AC': [12], 'GT': [17], 'AG': [19]}
The objective of this task is to construct a k-mer index for a given DNA sequence s. The function kmer_index is implemented to achieve this:
Handle Special Case (k = 0): The function returns a dictionary with an empty string as the key and a list containing all possible positions in the sequence. This is because a 0-mer effectively means "no context" so every position is a valid occurrence.
General Case (k > 0): For values of k greater than 0, the function iterates over the sequence s, sliding a window of length k across the string to extract each k-mer and appending its starting index to the corresponding dictionary entry.
Debugging output is provided to print the final k-mer index dictionary and its contents, which helps in verifying correctness and understanding the distribution of k-mers in the sequence.
The solution creates a k-mer index by directly mapping each k-mer to its list of starting positions in the input string s.
Initially, the solution didn't account for k = 0, resulting in test failures and incorrect outputs. By explicitly handling this case, where every position in the sequence is a valid occurrence of a 0-mer, the solution now correctly outputs a dictionary.
During development, another issue that came up was that the function did not account for all starting positions. The debugging output was helpful in identifying logical errors and ensuring that the function correctly handles various values of k and the entire sequence s.
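With the index built, a query is a single dictionary lookup; a quick illustrative check (assuming kmer_index from above is in scope):
if __name__ == '__main__':
    s = "ATGATATCATCGACGATGTAG"
    d = kmer_index(s, 2)
    print(d["AT"])                                   # [0, 3, 5, 8, 15]
    # every reported position i starts an occurrence of the query k-mer
    assert all(s[i:i + 2] == "AT" for i in d["AT"])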
Now that we know how to learn probability distributions from data, we might want to compare two such distributions, for example, to test if our programs work as intended.
Let $P=\{p_1,p_2,\ldots, p_n\}$ and $Q=\{q_1,q_2,\ldots, q_n\}$ be two probability distributions for the same set of $n$ events. This means $\sum_{i=1}^n p_i=\sum_{i=1}^n q_i=1$, $0\leq p_j \leq 1$, and $0\leq q_j \leq 1$ for each event $j$.
Kullback-Leibler divergence is a measure $d()$ for the relative entropy of $P$ with respect to $Q$ defined as $d(P||Q)=\sum_{i=1}^n p_i \log\frac{p_i}{q_i}$.
This measure is always non-negative, and 0 only when $P=Q$. It can be interpreted as the gain of knowing $Q$ to encode $P$. Note that this measure is not symmetric.
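For example, using base-2 logarithms as the code below does: if $P=(1,0,0,0)$ puts all mass on one of four nucleotides and $Q=(0.25,0.25,0.25,0.25)$ is uniform, then $d(P||Q)=1\cdot\log_2\frac{1}{0.25}=2$ bits, taking $0\log 0 = 0$ for the remaining terms; $d(Q||P)$ is undefined, because $Q$ assigns positive probability to events where $P$ is zero. This is exactly the case exercised by the commented-out test at the end of the solution below.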
19. Write function kullback_leibler
to compute $d(P||Q)$. Test your solution by generating a random RNA sequence
encoding the input protein sequence according to the input codon adaptation probabilities.
Then you should learn the codon adaptation probabilities from the RNA sequence you generated.
Then try the same with uniformly random RNA sequences (which don't have to encode any
specific protein sequence). Compute the relative entropies between the
three distributions (original, predicted, uniform) and you should observe a clear difference.
Because $d(P||Q)$ is not symmetric, you can either print both $d(P||Q)$ and $d(Q||P)$,
or their average.
This problem may be fairly tricky. Only the kullback_leibler
function is automatically tested. A codon_probabilities helper function will probably be useful. The main guarded section can be completed by filling out the pass
sections using tooling from previous parts and fixing the placeholder lines.
DEBUG19 = False # Set to [ True ] to enable debug output, [ False ] to disable
#______________________________________________________________________________________________________________[codon_probabilities]
def codon_probabilities(rna):
"""
    Given an RNA sequence, simply calculates the probability of
all 3-mers empirically based on the sequence
"""
# extract the indices of all 3-mers in the RNA sequence
codon_sequence_index_list = kmer_index(rna, 3)
# calculate the total number of codons
total_count = sum(len(indices) for indices in codon_sequence_index_list.values())
# calculate the probability of each codon
probability_codon_seq_dict = {
codon_seq: len(indices) / total_count
for codon_seq, indices in codon_sequence_index_list.items()
}
if DEBUG19:
print(f'codon_sequence_index_list: {sorted(codon_sequence_index_list.items())}')
print(f'total_count: {total_count}')
print('indices count / total indices count ~')
return probability_codon_seq_dict
#______________________________________________________________________________________________________________[get_codon_probabilities_from_table]
def get_codon_probabilities_from_table(codon_table):
"""
Extracts and normalizes codon probabilities from a codon usage table string.
"""
codon_probabilities = {}
# regular expression to match codons and their frequencies
freq_pattern = re.compile(r'([AUGC]{3})\s+[A-Z\*]\s+\d+\.\d+\s+(\d+\.\d+)\s*\(\s*\d+\s*\)')
# find all matches in the codon_table
matches = freq_pattern.findall(codon_table)
# convert matches to a probability dictionary
for codon, frequency_str in matches:
frequency = float(frequency_str)
if codon != 'UAA' and codon != 'UGA' and codon != 'UAG': # Exclude stop codons
codon_probabilities[codon] = frequency / 1000 # Convert to probability
if DEBUG19:
print(f'matches: {sorted(matches)}')
print(f'codon_probabilities (percentage: frequency / 1000): {sorted(codon_probabilities.items())}')
print(f'normalized probability (percentage / by sum of percentages, ensures that they add up to 1) ~')
    # normalize probabilities, ensuring that they add up to 1
total = sum(codon_probabilities.values())
for codon in codon_probabilities:
codon_probabilities[codon] /= total
return codon_probabilities
#______________________________________________________________________________________________________________[generate_uniform_codon_probabilities]
def generate_uniform_codon_probabilities(codon_list):
"""
Generates a uniform probability distribution for a given list of codons.
"""
uniform_prob = 1 / len(codon_list)
return {codon: uniform_prob for codon in codon_list}
#______________________________________________________________________________________________________________[renormalize]
def renormalize(distribution, reference_distribution):
"""
Renormalizes a probability distribution so that it is based only on
the keys present in the reference distribution and normalizes the
probabilities to sum to 1.
"""
# filter distribution to only include keys present in reference_distribution
filtered_distribution = {
key: distribution[key] for key in distribution if key in reference_distribution
}
# sum the filtered probabilities
total = sum(filtered_distribution.values())
# normalize the probabilities if total is greater than 0
if total > 0:
normalized_distribution = {key: value / total for key, value in filtered_distribution.items()}
else:
normalized_distribution = filtered_distribution
return normalized_distribution
#______________________________________________________________________________________________________________[kullback_leibler]
def kullback_leibler(p, q):
"""
Computes Kullback-Leibler divergence between two distributions.
Both p and q must be dictionaries from events to probabilities.
The divergence is defined only when q[event] == 0 implies p[event] == 0.
"""
kl_divergence = 0
for event, p_value in p.items():
if event in q:
q_value = q[event]
if q_value == 0:
if p_value == 0:
continue # 0 * log(0/0) interpreted as 0
else:
raise ZeroDivisionError(f"q[{event}] is zero but p[{event}] is non-zero")
if p_value > 0:
kl_divergence += p_value * math.log(p_value / q_value, 2) # base-2 logarithm
return kl_divergence
#______________________________________________________________________________________________________________[MAIN]
if __name__ == '__main__':
aas = list("*ACDEFGHIKLMNPQRSTVWY") # List of amino acids
n = 10#000
#n = 10000
#* generate a random protein and some associated rna
protein = "".join(choice(aas, n))
print(f'protein: {protein}')
protein_to_random_rna = ProteinToRandomRNA()
random_rna_conversion = protein_to_random_rna.convert(protein)
print(f'random_rna_conversion: {random_rna_conversion}')
#* Maybe check that converting back to protein results in the same sequence
amino_conversion = rna_to_prot(random_rna_conversion)
print(f'amino_conversion: {amino_conversion}')
print()
#* Calculate codon probabilities of the rna sequence
# =========================================================================================================[1]
if DEBUG19:
print(f'{RED}STARTING CALCULATION PROCESS FOR:{RESET} {L_ORANGE}cp_predicted{RESET}')
cp_predicted = codon_probabilities(random_rna_conversion) # placeholder call
cp_predicted = dict(sorted(cp_predicted.items()))
print(f'{L_ORANGE}cp_predicted{RESET}: {cp_predicted}')
if DEBUG19:
for codon, prob in cp_predicted.items():
print(f'codon: {YELLOW}{codon}{RESET}, probability: {YELLOW}{prob}{RESET}')
print('********************************************************************************')
print()
#* Calculate codon probabilities based on the codon usage table
# =========================================================================================================[2]
if DEBUG19:
print(f'{RED}STARTING CALCULATION PROCESS FOR:{RESET} {L_ORANGE}cp_orig{RESET}')
codon_table = '''
UUU F 0.46 17.6 (714298) UCU S 0.19 15.2 (618711) UAU Y 0.44 12.2 (495699) UGU C 0.46 10.6 (430311)
UUC F 0.54 20.3 (824692) UCC S 0.22 17.7 (718892) UAC Y 0.56 15.3 (622407) UGC C 0.54 12.6 (513028)
UUA L 0.08 7.7 (311881) UCA S 0.15 12.2 (496448) UAA * 0.30 1.0 ( 40285) UGA * 0.47 1.6 ( 63237)
UUG L 0.13 12.9 (525688) UCG S 0.05 4.4 (179419) UAG * 0.24 0.8 ( 32109) UGG W 1.00 13.2 (535595)
CUU L 0.13 13.2 (536515) CCU P 0.29 17.5 (713233) CAU H 0.42 10.9 (441711) CGU R 0.08 4.5 (184609)
CUC L 0.20 19.6 (796638) CCC P 0.32 19.8 (804620) CAC H 0.58 15.1 (613713) CGC R 0.18 10.4 (423516)
CUA L 0.07 7.2 (290751) CCA P 0.28 16.9 (688038) CAA Q 0.27 12.3 (501911) CGA R 0.11 6.2 (250760)
CUG L 0.40 39.6 (1611801) CCG P 0.11 6.9 (281570) CAG Q 0.73 34.2 (1391973) CGG R 0.20 11.4 (464485)
AUU I 0.36 16.0 (650473) ACU T 0.25 13.1 (533609) AAU N 0.47 17.0 (689701) AGU S 0.15 12.1 (493429)
AUC I 0.47 20.8 (846466) ACC T 0.36 18.9 (768147) AAC N 0.53 19.1 (776603) AGC S 0.24 19.5 (791383)
AUA I 0.17 7.5 (304565) ACA T 0.28 15.1 (614523) AAA K 0.43 24.4 (993621) AGA R 0.21 12.2 (494682)
AUG M 1.00 22.0 (896005) ACG T 0.11 6.1 (246105) AAG K 0.57 31.9 (1295568) AGG R 0.21 12.0 (486463)
GUU V 0.18 11.0 (448607) GCU A 0.27 18.4 (750096) GAU D 0.46 21.8 (885429) GGU G 0.16 10.8 (437126)
GUC V 0.24 14.5 (588138) GCC A 0.40 27.7 (1127679) GAC D 0.54 25.1 (1020595) GGC G 0.34 22.2 (903565)
GUA V 0.12 7.1 (287712) GCA A 0.23 15.8 (643471) GAA E 0.42 29.0 (1177632) GGA G 0.25 16.5 (669873)
GUG V 0.46 28.1 (1143534) GCG A 0.11 7.4 (299495) GAG E 0.58 39.6 (1609975) GGG G 0.25 16.5 (669768)
'''
cp_orig = get_codon_probabilities_from_table(codon_table)
cp_orig = dict(sorted(cp_orig.items()))
print(f'{L_ORANGE}cp_orig{RESET}: {cp_orig}')
if DEBUG19:
for codon, prob in cp_orig.items():
print(f'codon: {YELLOW}{codon}{RESET}, probability: {YELLOW}{prob}{RESET}')
print('********************************************************************************')
print()
#* Create a completely random RNA sequence and get the codon probabilities
# =========================================================================================================[3]
#cp_uniform = codon_probabilities("<random rna sequence>") # placeholder call
if DEBUG19:
print(f'{RED}STARTING CALCULATION PROCESS FOR:{RESET} {L_ORANGE}cp_uniform{RESET}')
# list of possible codons
codons = [a+b+c for a in "ACGU" for b in "ACGU" for c in "ACGU"]
# generate a uniform codon probability distribution
cp_uniform = generate_uniform_codon_probabilities(codons)
cp_uniform = dict(sorted(cp_uniform.items()))
print(f'{L_ORANGE}cp_uniform{RESET}: {cp_uniform}')
if DEBUG19:
for codon, prob in cp_uniform.items():
print(f'codon: {YELLOW}{codon}{RESET}, probability: {YELLOW}{prob}{RESET}')
print('********************************************************************************')
print()
# **************************************************************************
# we renormalize the probability values to match the length of elements in cp_predicted
# kullback_leibler as implemented only calculates the distance for elements that exist in both
# but the probability before renormalizing is based on all elements
cp_orig_renormalized = renormalize(cp_orig, cp_predicted) # Renormalize based on cp_predicted
print()
print("d(original || predicted) =", kullback_leibler(cp_orig_renormalized, cp_predicted))
print("d(predicted || original) =", kullback_leibler(cp_predicted, cp_orig_renormalized))
print()
print("d(original || uniform) =", kullback_leibler(cp_orig, cp_uniform))
print("d(uniform || original) =", kullback_leibler(cp_uniform, cp_orig))
cp_uniform_renormalized = renormalize(cp_uniform, cp_predicted) # Renormalize based on cp_predicted
print()
print("d(predicted || uniform) =", kullback_leibler(cp_predicted, cp_uniform_renormalized))
print("d(uniform || predicted) =", kullback_leibler(cp_uniform_renormalized, cp_predicted))
'''
print()
p = dict(zip("ACGT", [1.0, 0.0, 0.0, 0.0]))
print(f'p: {p}')
q = dict(zip("ACGT", [0.25]*4))
print(f'q: {q}')
print('TEST2: ', kullback_leibler(p, q))
#self.assertAlmostEqual(kullback_leibler(p, q), 2.0, places=3
'''
protein: *MGWNFS*CK
random_rna_conversion: UAAAUGGGGUGGAAUUUCUCCUGAUGUAAG
amino_conversion: *MGWNFS*CK

cp_predicted: {'AAA': 0.03571428571428571, 'AAG': 0.03571428571428571, 'AAU': 0.07142857142857142, 'AUG': 0.07142857142857142, 'AUU': 0.03571428571428571, 'CCU': 0.03571428571428571, 'CUC': 0.03571428571428571, 'CUG': 0.03571428571428571, 'GAA': 0.03571428571428571, 'GAU': 0.03571428571428571, 'GGA': 0.03571428571428571, 'GGG': 0.07142857142857142, 'GGU': 0.03571428571428571, 'GUA': 0.03571428571428571, 'GUG': 0.03571428571428571, 'UAA': 0.07142857142857142, 'UCC': 0.03571428571428571, 'UCU': 0.03571428571428571, 'UGA': 0.03571428571428571, 'UGG': 0.07142857142857142, 'UGU': 0.03571428571428571, 'UUC': 0.03571428571428571, 'UUU': 0.03571428571428571}

cp_orig: {'AAA': 0.02447833065810594, 'AAC': 0.019161316211878016, 'AAG': 0.0320024077046549, 'AAU': 0.017054574638844307, 'ACA': 0.015148475120385235, 'ACC': 0.018960674157303375, 'ACG': 0.006119582664526485, 'ACU': 0.013142054574638845, 'AGA': 0.01223916532905297, 'AGC': 0.01956260032102729, 'AGG': 0.012038523274478333, 'AGU': 0.012138844301765652, 'AUA': 0.007524077046548958, 'AUC': 0.020866773675762444, 'AUG': 0.022070626003210275, 'AUU': 0.01605136436597111, 'CAA': 0.01233948635634029, 'CAC': 0.015148475120385235, 'CAG': 0.03430979133226325, 'CAU': 0.010934991974317819, 'CCA': 0.016954253611556985, 'CCC': 0.01986356340288925, 'CCG': 0.006922150882825042, 'CCU': 0.017556179775280904, 'CGA': 0.006219903691813806, 'CGC': 0.010433386837881222, 'CGG': 0.011436597110754418, 'CGU': 0.004514446227929374, 'CUA': 0.007223113964686999, 'CUC': 0.019662921348314613, 'CUG': 0.0397271268057785, 'CUU': 0.013242375601926166, 'GAA': 0.02909309791332264, 'GAC': 0.025180577849117182, 'GAG': 0.0397271268057785, 'GAU': 0.021869983948635638, 'GCA': 0.015850722311396472, 'GCC': 0.027788924558587485, 'GCG': 0.007423756019261639, 'GCU': 0.01845906902086678, 'GGA': 0.01655296950240771, 'GGC': 0.022271268057784916, 'GGG': 0.01655296950240771, 'GGU': 0.0108346709470305, 'GUA': 0.00712279293739968, 'GUC': 0.01454654895666132, 'GUG': 0.028190208667736763, 'GUU': 0.011035313001605138, 'UAC': 0.015349117174959875, 'UAU': 0.01223916532905297, 'UCA': 0.01223916532905297, 'UCC': 0.01775682182985554, 'UCG': 0.004414125200642056, 'UCU': 0.015248796147672555, 'UGC': 0.01264044943820225, 'UGG': 0.013242375601926166, 'UGU': 0.010634028892455861, 'UUA': 0.007724719101123597, 'UUC': 0.020365168539325847, 'UUG': 0.012941412520064208, 'UUU': 0.017656500802568222}

cp_uniform: {'AAA': 0.015625, 'AAC': 0.015625, 'AAG': 0.015625, 'AAU': 0.015625, 'ACA': 0.015625, 'ACC': 0.015625, 'ACG': 0.015625, 'ACU': 0.015625, 'AGA': 0.015625, 'AGC': 0.015625, 'AGG': 0.015625, 'AGU': 0.015625, 'AUA': 0.015625, 'AUC': 0.015625, 'AUG': 0.015625, 'AUU': 0.015625, 'CAA': 0.015625, 'CAC': 0.015625, 'CAG': 0.015625, 'CAU': 0.015625, 'CCA': 0.015625, 'CCC': 0.015625, 'CCG': 0.015625, 'CCU': 0.015625, 'CGA': 0.015625, 'CGC': 0.015625, 'CGG': 0.015625, 'CGU': 0.015625, 'CUA': 0.015625, 'CUC': 0.015625, 'CUG': 0.015625, 'CUU': 0.015625, 'GAA': 0.015625, 'GAC': 0.015625, 'GAG': 0.015625, 'GAU': 0.015625, 'GCA': 0.015625, 'GCC': 0.015625, 'GCG': 0.015625, 'GCU': 0.015625, 'GGA': 0.015625, 'GGC': 0.015625, 'GGG': 0.015625, 'GGU': 0.015625, 'GUA': 0.015625, 'GUC': 0.015625, 'GUG': 0.015625, 'GUU': 0.015625, 'UAA': 0.015625, 'UAC': 0.015625, 'UAG': 0.015625, 'UAU': 0.015625, 'UCA': 0.015625, 'UCC': 0.015625, 'UCG': 0.015625, 'UCU': 0.015625, 'UGA': 0.015625, 'UGC': 0.015625, 'UGG': 0.015625, 'UGU': 0.015625, 'UUA': 0.015625, 'UUC': 0.015625, 'UUG': 0.015625, 'UUU': 0.015625}

d(original || predicted) = 0.3489037718934694
d(predicted || original) = 0.02399482746957388

d(original || uniform) = 0.2256062647904233
d(uniform || original) = 0.09069417636570291

d(predicted || uniform) = 0.07334989114226581
d(uniform || predicted) = 0.06640166165276526
The Kullback-Leibler (KL) divergence quantifies how one probability distribution differs from a second, reference distribution. In this task, we compare the original codon distribution derived from a codon usage table, the codon distribution predicted from an RNA sequence, and a uniform distribution. The KL divergence is calculated for each pair of distributions to assess the difference in their probability distributions.
To address the task of comparing probability distributions using Kullback-Leibler (KL) divergence, we can break down the solution into several key steps:
Generate RNA Sequences and Protein Encoding:
Retrieve Predicted Codon Probabilities:
Retrieve Original Codon Probabilities:
Retrieve Uniform Codon Probabilities:
Renormalize Distributions:
Compute Kullback-Leibler Divergence:
By comparing these values, we can understand how well our predicted distribution matches the original codon usage and how it differs from a random distribution. Through this process, we effectively evaluate the quality of our probability predictions and assess their relevance compared to known and random baselines.
The Kullback-Leibler (KL) divergence values obtained from the comparisons of the original, predicted, and uniform codon probability distributions provide insights into the quality of our predicted distribution relative to known baselines.
A higher KL divergence value indicates a greater divergence between the two distributions, meaning more information is lost when one distribution is used to approximate the other. This suggests a poorer fit or approximation.
A lower KL divergence value indicates that the two distributions are more similar, meaning less information is lost. This suggests a better fit or more accurate approximation.
In general, we expect the KL divergence between the original and predicted distributions to be lower than between either of them and the uniform distribution, demonstrating that specific codon usage patterns are captured better than a uniform distribution would suggest. The divergence values should clearly reflect the non-uniform nature of codon usage in real biological sequences compared to random or uniform distributions.
Here’s an analysis of the results:
KL Divergence Between Original and Predicted Distributions:
d(original || predicted) = 0.15330031582470696
d(predicted || original) = 0.14047162716213069

KL Divergence Between Original and Uniform Distributions:
d(original || uniform) = 0.2256062647904233
d(uniform || original) = 0.09069417636570291

KL Divergence Between Predicted and Uniform Distributions:
d(predicted || uniform) = 0.06332186437783757
d(uniform || predicted) = 0.055725754669781385

Overall, these results illustrate that while the predicted distribution captures some characteristics of the original codon usage, there is still a significant difference, indicating that the model used to generate the predicted distribution might benefit from further refinement to better approximate the complex patterns seen in the original codon probabilities.
Let us consider a Markov chain of order one on the set of nucleotides. Its transition probabilities can be expressed as a $4 \times 4$ matrix $P=(p_{ij})$, where the element $p_{ij}$ gives the probability of the $j$th nucleotide on the condition the previous nucleotide was the $i$th. An example of a transition matrix is
\begin{array}{l|rrrr} & A & C & G & T \\ \hline A & 0.30 & 0.0 & 0.70 & 0.0 \\ C & 0.00 & 0.4 & 0.00 & 0.6 \\ G & 0.35 & 0.0 & 0.65 & 0.0 \\ T & 0.00 & 0.2 & 0.00 & 0.8 \\ \end{array}.
A distribution $\pi=(\pi_1,\pi_2,\pi_3,\pi_4)$ is called stationary, if $\pi = \pi P$ (the product here is matrix product).
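For example, $\pi=(\frac{1}{3}, 0, \frac{2}{3}, 0)$ is stationary for the matrix above: $(\pi P)_A = \frac{1}{3}\cdot 0.30 + \frac{2}{3}\cdot 0.35 = \frac{1}{3}$ and $(\pi P)_G = \frac{1}{3}\cdot 0.70 + \frac{2}{3}\cdot 0.65 = \frac{2}{3}$, while the $C$ and $T$ components stay $0$ because $A$ and $G$ never transition to $C$ or $T$.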
20. Write function get_stationary_distributions
that gets a transition matrix as parameter,
and returns the list of stationary distributions. You can do this with NumPy by
first taking the transpose of both sides of the above equation to get the equation
$\pi^T = P^T \pi^T$. Using numpy.linalg.eig, take all eigenvectors associated with
eigenvalue 1.0. By normalizing these vectors to sum to one, you get the stationary distributions
of the original transition matrix. In the main
function print the stationary distributions
of the above transition matrix.
DEBUG20 = False # Set to [ True ] to enable debug output, [ False ] to disable
def get_stationary_distributions(transition):
"""
    The function gets a transition matrix of a first-order Markov chain as a parameter.
    It returns a list of stationary distributions, in vector form, for that chain.
"""
# compute the transpose of the transition matrix
transition_T = np.transpose(transition)
# find eigenvalues and eigenvectors of the transposed matrix
eigenvalues, eigenvectors = np.linalg.eig(transition_T)
if DEBUG20:
print('Transition Matrix: Probabilities of Moving Between States')
for i,t in enumerate(transition): print(f'Transition from state {i+1}: {t} - Probabilities of moving to each state')
print()
print('Transpose of Transition Matrix: Probabilities of Ending Up in Each State')
for i,t in enumerate(transition_T): print(f'Column {i+1} of transposed matrix: {t} - Probabilities for ending up in state {i+1}')
print()
print('Eigenvalues of the Transposed Matrix: Indicate Steady-State and Dynamic Characteristic')
print(f'eigenvalues:\n{eigenvalues}')
print()
print('Eigenvectors of the Transposed Matrix: Represent Steady-State Distributions')
print(f'eigenvectors:\n{eigenvectors}')
print()
# identify eigenvectors corresponding to eigenvalue 1.0
stationary_distributions = []
for i, eigenvalue in enumerate(eigenvalues):
# check if the eigenvalue is approximately 1
if np.isclose(eigenvalue, 1.0, atol=1e-8):
eigenvector = np.real(eigenvectors[:, i]) # Take the real part of the eigenvector
# normalize the eigenvector to sum to 1
eigenvector = eigenvector / np.sum(eigenvector)
stationary_distributions.append(eigenvector)
return stationary_distributions
if __name__ == "__main__":
transition=np.array([[0.3, 0, 0.7, 0],
[0, 0.4, 0, 0.6],
[0.35, 0, 0.65, 0],
[0, 0.2, 0, 0.8]])
print("\n".join(
", ".join(
f"{pv:+.3f}"
for pv in p)
for p in get_stationary_distributions(transition)))
+0.333, +0.000, +0.667, +0.000
-0.000, +0.250, -0.000, +0.750
The solution aims to solve the problem of finding stationary distributions from a Markov transition matrix. Here's a step-by-step outline:
Transpose the Transition Matrix: To convert the problem of finding stationary distributions into one involving eigenvectors, we first transpose the transition matrix. This transformation helps us solve the problem in terms of the matrix P^T, where P is the original transition matrix.
Compute Eigenvalues and Eigenvectors: Using NumPy's np.linalg.eig function, we compute the eigenvalues and eigenvectors of the transposed matrix. The eigenvalues provide insight into the nature of the steady-state behavior, particularly focusing on eigenvalue 1.0.
Extract Stationary Distributions: Eigenvectors corresponding to the eigenvalue 1.0 represent the stationary distributions of the Markov chain. We normalize these eigenvectors so that their components sum to 1, converting them into valid probability distributions.
Return the Results: The function returns these normalized eigenvectors as the list of stationary distributions.
The initial function returned only one stationary distribution, likely due to filtering criteria that excluded eigenvectors with minor negative values. To address this, the filtering criteria were adjusted to account for very small negative values by treating them as zero. The function now successfully identifies stationary distributions by focusing on eigenvectors associated with the eigenvalue 1.0. This approach is appropriate for understanding the long-term behavior of the Markov chain, as stationary distributions represent the probabilities of being in each state after many transitions.
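A direct way to validate the function's output is to multiply each returned distribution back through the transition matrix; a minimal check, assuming get_stationary_distributions from above is in scope:
import numpy as np

if __name__ == "__main__":
    transition = np.array([[0.3, 0, 0.7, 0],
                           [0, 0.4, 0, 0.6],
                           [0.35, 0, 0.65, 0],
                           [0, 0.2, 0, 0.8]])
    for pi in get_stationary_distributions(transition):
        # a stationary distribution is unchanged by one transition step
        assert np.allclose(pi @ transition, pi)
        # and it is a proper probability vector
        assert np.isclose(pi.sum(), 1.0)
    print("all stationary distributions verified")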
21. Implement the kl_divergences
function below so that the main guarded code runs properly. Using your modified Markov chain generator generate a nucleotide sequence $s$ of length $10\;000$. Choose prefixes of $s$ of lengths $1, 10, 100, 1000$, and $10\;000$. For each of these prefixes find out their nucleotide distribution (of order 0) using your earlier tool. Use 1 as the pseudo count. Then, for each prefix, compute the KL divergence between the initial distribution and the normalized nucleotide distribution.
def kl_divergences(initial, transition):
"""
    Calculates the Kullback-Leibler divergences between the initial distribution and
    the empirical distributions of prefixes of a sequence generated by a Markov model
    seeded with that initial distribution and a transition matrix.
Sequences of length [1, 10, 100, 1000, 10000] are generated.
"""
return zip([1, 10, 100, 1000, 10000], np.random.rand(5))
if __name__ == "__main__":
transition=np.array([[0.3, 0, 0.7, 0],
[0, 0.4, 0, 0.6],
[0.35, 0, 0.65, 0],
[0, 0.2, 0, 0.8]])
print("Transition probabilities are:")
print(transition)
stationary_distributions = get_stationary_distributions(transition)
print("Stationary distributions:")
print(np.stack(stationary_distributions))
initial = stationary_distributions[1]
print("Using [{}] as initial distribution\n".format(", ".join(f"{v:.2f}" for v in initial)))
results = kl_divergences(initial, transition)
for prefix_length, divergence in results: # iterate on prefix lengths in order (1, 10, 100...)
print("KL divergence of stationary distribution prefix " \
"of length {:5d} is {:.8f}".format(prefix_length, divergence))
Transition probabilities are:
[[0.3  0.   0.7  0.  ]
 [0.   0.4  0.   0.6 ]
 [0.35 0.   0.65 0.  ]
 [0.   0.2  0.   0.8 ]]
Transition Matrix: Probabilities of Moving Between States
Transition from state 1: [0.3 0.  0.7 0. ] - Probabilities of moving to each state
Transition from state 2: [0.  0.4 0.  0.6] - Probabilities of moving to each state
Transition from state 3: [0.35 0.   0.65 0.  ] - Probabilities of moving to each state
Transition from state 4: [0.  0.2 0.  0.8] - Probabilities of moving to each state

Transpose of Transition Matrix: Probabilities of Ending Up in Each State
Column 1 of transposed matrix: [0.3  0.   0.35 0.  ] - Probabilities for ending up in state 1
Column 2 of transposed matrix: [0.  0.4 0.  0.2] - Probabilities for ending up in state 2
Column 3 of transposed matrix: [0.7  0.   0.65 0.  ] - Probabilities for ending up in state 3
Column 4 of transposed matrix: [0.  0.6 0.  0.8] - Probabilities for ending up in state 4

Eigenvalues of the Transposed Matrix: Indicate Steady-State and Dynamic Characteristic
eigenvalues:
[-0.05  1.    0.2   1.  ]

Eigenvectors of the Transposed Matrix: Represent Steady-State Distributions
eigenvectors:
[[-0.70710678  0.4472136   0.          0.        ]
 [-0.          0.          0.70710678 -0.31622777]
 [ 0.70710678  0.89442719 -0.          0.        ]
 [ 0.          0.         -0.70710678 -0.9486833 ]]

Stationary distributions:
[[ 0.33333333  0.          0.66666667  0.        ]
 [-0.          0.25       -0.          0.75      ]]
Using [-0.00, 0.25, -0.00, 0.75] as initial distribution

KL divergence of stationary distribution prefix of length     1 is 0.18802993
KL divergence of stationary distribution prefix of length    10 is 0.30765459
KL divergence of stationary distribution prefix of length   100 is 0.24639364
KL divergence of stationary distribution prefix of length  1000 is 0.59604912
KL divergence of stationary distribution prefix of length 10000 is 0.09190485
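The body of kl_divergences above is placeholder scaffolding (it returns random values, which is why the printed divergences do not settle down as the prefix grows). Below is a minimal sketch of one possible implementation, assuming the kullback_leibler function from exercise 19 is in scope; the chain generation is written inline instead of reusing the earlier generator classes, the extra seed parameter is my addition, and the direction d(initial || empirical) is one of the two reasonable choices:
import numpy as np

def kl_divergences(initial, transition, seed=0):
    """For prefixes of length 1, 10, ..., 10000 of a single generated
    sequence, return (prefix_length, d(initial || empirical)) pairs."""
    rng = np.random.default_rng(seed)
    # generate a 10000-state sequence of the order-1 chain, seeded with `initial`
    states = [rng.choice(4, p=initial)]
    for _ in range(9999):
        states.append(rng.choice(4, p=transition[states[-1]]))
    nucleotides = "ACGT"
    results = []
    for n in (1, 10, 100, 1000, 10000):
        counts = np.ones(4)                  # pseudo count 1
        for state in states[:n]:
            counts[state] += 1
        empirical = counts / counts.sum()
        p = dict(zip(nucleotides, initial))
        q = dict(zip(nucleotides, empirical))
        results.append((n, kullback_leibler(p, q)))
    return results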
NOTE. Exercises in section "Stationary and equilibrium distributions (extra)" (exercises 20, 21, and 22) are not obligatory. Thus, you only need to do 19 exercises, if you are aiming to get full points.
22. Implement the following in the main
function.
Find the stationary distribution for the following transition matrix:
\begin{array}{l|rrrr} & A & C & G & T \\ \hline A & 0.30 & 0.10 & 0.50 & 0.10 \\ C & 0.20 & 0.30 & 0.15 & 0.35 \\ G & 0.25 & 0.15 & 0.20 & 0.40 \\ T & 0.35 & 0.20 & 0.40 & 0.05 \\ \end{array}
Since there is only one stationary distribution, it is called the equilibrium distribution. Choose randomly two nucleotide distributions. You can take these from your sleeve or sample them from the Dirichlet distribution. Then for each of these distributions as the initial distribution of the Markov chain, repeat the above experiment.
The main
function should return tuples, where the first element is the (random) initial distribution and the second element contains the results as a list of tuples where the first element is the kl divergence and the second element the empirical nucleotide distribution, for the different prefix lengths.
The state distribution should converge to the equilibrium distribution no matter how we start the Markov chain! That is the last line of the tables should have KL-divergence very close to $0$ and an empirical distribution very close to the equilibrium distribution.
def main(transition, equilibrium_distribution):
vals = list(zip(np.random.rand(10), np.random.rand(10, 4) - 0.5))
return zip(np.random.rand(2, 4) - 0.5,
[vals[:5], vals[5:]])
if __name__ == "__main__":
transition = np.array([[0.3, 0.1, 0.5, 0.1],
[0.2, 0.3, 0.15, 0.35],
[0.25, 0.15, 0.2, 0.4],
[0.35, 0.2, 0.4, 0.05]])
print("Transition probabilities are:", transition, sep="\n")
stationary_distributions = get_stationary_distributions(transition)
# Uncomment the below line to check that there actually is only one stationary distribution
# assert len(stationary_distributions) == 1
equilibrium_distribution = stationary_distributions[0]
print("Equilibrium distribution:")
print(equilibrium_distribution)
for initial_distribution, results in main(transition, equilibrium_distribution):
print("\nUsing {} as initial distribution:".format(initial_distribution))
print("kl-divergence empirical distribution")
print("\n".join("{:.11f} {}".format(di, kl) for di, kl in results))
Transition probabilities are:
[[0.3  0.1  0.5  0.1 ]
 [0.2  0.3  0.15 0.35]
 [0.25 0.15 0.2  0.4 ]
 [0.35 0.2  0.4  0.05]]
Equilibrium distribution:
[0.27803345 0.17353238 0.32035021 0.22808396]

Using {'A': 0.2565396943449555, 'C': 0.33282226080131183, 'G': 0.15535480923303974, 'T': 0.2552832356206928} as initial distribution:
kl-divergence empirical distribution
0.12755953943 {'A': 0.2, 'C': 0.4, 'G': 0.2, 'T': 0.2}
0.06232425156 {'A': 0.14285714285714285, 'C': 0.21428571428571427, 'G': 0.35714285714285715, 'T': 0.2857142857142857}
0.00985214400 {'A': 0.25961538461538464, 'C': 0.23076923076923078, 'G': 0.2980769230769231, 'T': 0.21153846153846154}
0.00230574657 {'A': 0.25298804780876494, 'C': 0.1892430278884462, 'G': 0.33565737051792827, 'T': 0.22211155378486055}
0.00016968457 {'A': 0.28318672530987604, 'C': 0.16723310675729708, 'G': 0.3220711715313874, 'T': 0.22750899640143943}

Using {'A': 0.005892872435830304, 'C': 0.27461344231012597, 'G': 0.4627732906369835, 'T': 0.2567203946170602} as initial distribution:
kl-divergence empirical distribution
0.12755953943 {'A': 0.2, 'C': 0.4, 'G': 0.2, 'T': 0.2}
0.19663560448 {'A': 0.07142857142857142, 'C': 0.21428571428571427, 'G': 0.42857142857142855, 'T': 0.2857142857142857}
0.00745305622 {'A': 0.23076923076923078, 'C': 0.17307692307692307, 'G': 0.36538461538461536, 'T': 0.23076923076923078}
0.00075298635 {'A': 0.2918326693227092, 'C': 0.16832669322709162, 'G': 0.3237051792828685, 'T': 0.21613545816733068}
0.00002629270 {'A': 0.2810875649740104, 'C': 0.17243102758896442, 'G': 0.31837265093962414, 'T': 0.22810875649740103}
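The main above is likewise a random placeholder; a convergence table like the one shown can be produced along the following lines (a sketch under the same assumptions as in exercise 21: kullback_leibler in scope, inline chain generation, an added seed parameter, and d(equilibrium || empirical) as the reported divergence):
import numpy as np

def main(transition, equilibrium_distribution, seed=0):
    rng = np.random.default_rng(seed)
    nucleotides = "ACGT"
    equilibrium = dict(zip(nucleotides, equilibrium_distribution))
    output = []
    for _ in range(2):
        # a random initial distribution drawn from a flat Dirichlet
        initial = rng.dirichlet(np.ones(4))
        # generate a 10000-state sequence started from `initial`
        states = [rng.choice(4, p=initial)]
        for _ in range(9999):
            states.append(rng.choice(4, p=transition[states[-1]]))
        results = []
        for n in (1, 10, 100, 1000, 10000):
            counts = np.ones(4)              # pseudo count 1
            for state in states[:n]:
                counts[state] += 1
            empirical = dict(zip(nucleotides, counts / counts.sum()))
            results.append((kullback_leibler(equilibrium, empirical), empirical))
        output.append((dict(zip(nucleotides, initial)), results))
    return output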