Explorations into Genomics¶

Inspired/stolen from work by John Jacobsen: http://eigenhombre.com/2013/11/03/nucleotide-repetition-lengths/

An exercise in laziness and counting.

We represent genomes as strings on an alphabet of four letters, A, C, G, and T. Thus part of a genome might look like the following:

"...AACCGTGTGCGTTTATTAATTATTGCTTTA..."

Each letter corresponds to a particular nucleic acid (TODO verify this) in a very long strand of DNA. A single genome can be quite large and scientists generate them at an alarming rate.

In this section we'll process genomes and perform simple counting analytics on them. We'll start by counting frequencies of the individual nucleic acids (how often does A appear?), move on to counting lengths of repeated nucleic acids (how often do we see long repeated strings like ('AAAAAA'), and finish with a quick markov chain.

Throughout we'll see exercise laziness and become familiar with the following functions from PyToolz.

In [1]:

from toolz.curried import (map, frequencies, pipe, take, concat, drop, count,
                           partitionby, identity, compose, sliding_window, 
                           merge_with, merge)

We've stored some actual genomes in the data folder

In [2]:

from glob import glob

file_pattern = 'data/yeast-genome/chr*.fa'

Snoot - an infinite and uniformly random organism¶

We'll compare our results against the genome of a snoot. A snoot is a magical creature with an infinitely long gene sequence where each letter is uniformly chosen. We can implement a snoot genome with a Python generator.

By comparing our physical genomes against that of the snoot we'll highlight ways in which we deviate from randomness.

In [3]:

import random

def random_genome():
    while True:
        yield random.choice('ACTG')
        
snoot = random_genome()        

list(take(10, snoot))

Out[3]:

['A', 'C', 'A', 'T', 'C', 'T', 'G', 'C', 'C', 'T']

In [4]:

# Set up matplotlib and build a small function to plot histograms of count dictionaries

from pylab import hist, show, xticks, title
%matplotlib inline

def histdict(d, *args, **kwargs):
    """ Plot a histogram given a dictionary of counts 
    
    See Also:
        matplotlib.hist
    """
    if all(isinstance(k, (int, float)) for k in d.keys()):
        hist(d.keys(), *args, weights=d.values(), **kwargs)
    else:
        keydict = dict(enumerate(d.keys()))
        revdict = {v: k for k, v in keydict.items()}
        hist(list(map(revdict.get, d.keys())), *args, weights=d.values(), **kwargs)
        xticks(keydict.keys(), keydict.values())
        
histdict({'dog': 3, 'cat': 5})
show()

In [5]:

counts = frequencies(take(10000, snoot))
histdict(counts)

In [6]:

counts = pipe(file_pattern, glob, map(open), map(drop(1)), 
                            concat, map(str.upper), map(str.strip), 
                            concat, frequencies)
histdict(counts)

Counting long strings¶

How often do very long repetitions of base pair occur? E.g. are sequences like "TTTTTTTTT" very rare or are they commonplace?

In [7]:

# Using partitionby - also known as itertools.groupby
from toolz.curried import partitionby, identity

animals = ['cat', 'dog', 'hen', 'goose', 'moose', 'rat', 'giraffe']
list(partitionby(len, animals))  # Parition animals by their name length

Out[7]:

[['cat', 'dog', 'hen'], ['goose', 'moose'], ['rat'], ['giraffe']]

In [8]:

# We want to partition our gene sequences 'AACCTTTTTGCT' by the letters themselves
# Instead of `len`, our key function is just the trivial identity function
identity('A')

Out[8]:

'A'

In [9]:

# Lets compose partitionby with identity to collect our sequence 
# into groups of repeated elements

partitions = list(partitionby(identity, 'AACCTTTTTGCT'))
partitions

Out[9]:

[['A', 'A'], ['C', 'C'], ['T', 'T', 'T', 'T', 'T'], ['G'], ['C'], ['T']]

In [10]:

# From here we just want to compute the length of each group

list(map(len, partitions))

Out[10]:

[2, 2, 5, 1, 1, 1]

Counting recurrences on the random genome¶

Lets pull this together and count recurring base pairs on our random genome.

In [11]:

pipe(snoot, take(20), partitionby(identity), map(len), list)

Out[11]:

[1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1]

In [12]:

result = pipe(snoot, take(1000000), partitionby(identity), map(len), frequencies)

histdict(result, log=True)
title('Counts of Repeated Base Pairs - Random')
result

Out[12]:

{1: 562222,
 2: 140330,
 3: 35289,
 4: 8843,
 5: 2219,
 6: 577,
 7: 138,
 8: 33,
 9: 8,
 10: 2}

Counting recurrences on the Yeast genome¶

In [13]:

result = pipe(file_pattern, glob, map(open), map(drop(1)), 
                            concat, map(str.upper), map(str.strip), concat,
                            partitionby(identity), map(len), frequencies)

histdict(result, log=True)
title('Counts of Repeated Base Pairs - Yeast')
result

Out[13]:

{1: 6118373,
 2: 1729275,
 3: 488751,
 4: 158660,
 5: 52793,
 6: 17076,
 7: 7079,
 8: 2700,
 9: 1305,
 10: 778,
 11: 496,
 12: 344,
 13: 247,
 14: 135,
 15: 91,
 16: 64,
 17: 52,
 18: 26,
 19: 34,
 20: 23,
 21: 14,
 22: 17,
 23: 13,
 24: 19,
 25: 8,
 26: 10,
 27: 4,
 28: 4,
 29: 3,
 30: 2,
 31: 4,
 32: 1,
 33: 1,
 34: 1,
 35: 2,
 36: 1,
 37: 1,
 42: 1}

Markov Chains¶

Do certain base pairs commonly occur after others? Given that we've just seen "AG" is it more or less likely that we'll see 'G' again? What about 'T'? These questions can be answered with Markov chains.

To do this exercise we'll need to use toolz.sliding_window. Sliding window is a generalization on functions like sliding average or sliding max. Here are some examples demonstrating its use

In [14]:

list(sliding_window(3, range(10)))

Out[14]:

[(0, 1, 2),
 (1, 2, 3),
 (2, 3, 4),
 (3, 4, 5),
 (4, 5, 6),
 (5, 6, 7),
 (6, 7, 8),
 (7, 8, 9)]

In [15]:

list(sliding_window(5, range(10)))

Out[15]:

[(0, 1, 2, 3, 4),
 (1, 2, 3, 4, 5),
 (2, 3, 4, 5, 6),
 (3, 4, 5, 6, 7),
 (4, 5, 6, 7, 8),
 (5, 6, 7, 8, 9)]

In [16]:

list(map(sum, sliding_window(5, range(10))))

Out[16]:

[10, 15, 20, 25, 30, 35]

In [17]:

from numpy import mean
data = [4, 4, 6, 3, 4, 6, 7, 3, 8, 9, 10, 10, 10, 10, 8, 7]

moving_average = compose(map(mean), sliding_window(3))

list(moving_average(data))

Out[17]:

[4.666666666666667,
 4.333333333333333,
 4.333333333333333,
 4.333333333333333,
 5.666666666666667,
 5.333333333333333,
 6.0,
 6.666666666666667,
 9.0,
 9.6666666666666661,
 10.0,
 10.0,
 9.3333333333333339,
 8.3333333333333339]

We'll use sliding window without a function like sum or mean. We'll use it to chop up our sequence of base-pairs into a sequence of tuples of base-pairs, called k-mers.

In [18]:

seq = list(take(20, snoot))
print seq
pipe(seq, sliding_window(3), list)

['C', 'T', 'A', 'C', 'C', 'A', 'G', 'A', 'G', 'G', 'T', 'G', 'C', 'C', 'T', 'C', 'C', 'T', 'A', 'A']

Out[18]:

[('C', 'T', 'A'),
 ('T', 'A', 'C'),
 ('A', 'C', 'C'),
 ('C', 'C', 'A'),
 ('C', 'A', 'G'),
 ('A', 'G', 'A'),
 ('G', 'A', 'G'),
 ('A', 'G', 'G'),
 ('G', 'G', 'T'),
 ('G', 'T', 'G'),
 ('T', 'G', 'C'),
 ('G', 'C', 'C'),
 ('C', 'C', 'T'),
 ('C', 'T', 'C'),
 ('T', 'C', 'C'),
 ('C', 'C', 'T'),
 ('C', 'T', 'A'),
 ('T', 'A', 'A')]

In [19]:

pipe(snoot, take(1000000), sliding_window(3), frequencies)

Out[19]:

{('A', 'A', 'A'): 15769,
 ('A', 'A', 'C'): 15635,
 ('A', 'A', 'G'): 15442,
 ('A', 'A', 'T'): 15915,
 ('A', 'C', 'A'): 15599,
 ('A', 'C', 'C'): 15733,
 ('A', 'C', 'G'): 15629,
 ('A', 'C', 'T'): 15649,
 ('A', 'G', 'A'): 15747,
 ('A', 'G', 'C'): 15731,
 ('A', 'G', 'G'): 15585,
 ('A', 'G', 'T'): 15609,
 ('A', 'T', 'A'): 15923,
 ('A', 'T', 'C'): 15486,
 ('A', 'T', 'G'): 15648,
 ('A', 'T', 'T'): 15609,
 ('C', 'A', 'A'): 15641,
 ('C', 'A', 'C'): 15628,
 ('C', 'A', 'G'): 15758,
 ('C', 'A', 'T'): 15424,
 ('C', 'C', 'A'): 15629,
 ('C', 'C', 'C'): 15548,
 ('C', 'C', 'G'): 15426,
 ('C', 'C', 'T'): 15793,
 ('C', 'G', 'A'): 15466,
 ('C', 'G', 'C'): 15468,
 ('C', 'G', 'G'): 15654,
 ('C', 'G', 'T'): 15599,
 ('C', 'T', 'A'): 15867,
 ('C', 'T', 'C'): 15621,
 ('C', 'T', 'G'): 15367,
 ('C', 'T', 'T'): 15654,
 ('G', 'A', 'A'): 15487,
 ('G', 'A', 'C'): 15518,
 ('G', 'A', 'G'): 15614,
 ('G', 'A', 'T'): 15710,
 ('G', 'C', 'A'): 15634,
 ('G', 'C', 'C'): 15702,
 ('G', 'C', 'G'): 15518,
 ('G', 'C', 'T'): 15571,
 ('G', 'G', 'A'): 15578,
 ('G', 'G', 'C'): 15626,
 ('G', 'G', 'G'): 15692,
 ('G', 'G', 'T'): 15613,
 ('G', 'T', 'A'): 15655,
 ('G', 'T', 'C'): 15478,
 ('G', 'T', 'G'): 15736,
 ('G', 'T', 'T'): 15581,
 ('T', 'A', 'A'): 15864,
 ('T', 'A', 'C'): 15829,
 ('T', 'A', 'G'): 15857,
 ('T', 'A', 'T'): 15617,
 ('T', 'C', 'A'): 15589,
 ('T', 'C', 'C'): 15413,
 ('T', 'C', 'G'): 15614,
 ('T', 'C', 'T'): 15496,
 ('T', 'G', 'A'): 15538,
 ('T', 'G', 'C'): 15600,
 ('T', 'G', 'G'): 15578,
 ('T', 'G', 'T'): 15629,
 ('T', 'T', 'A'): 15722,
 ('T', 'T', 'C'): 15527,
 ('T', 'T', 'G'): 15595,
 ('T', 'T', 'T'): 15565}

In [24]:

def reshape(item):
    ((a, b, c), count) = item  # pattern matching
    return {(a, b): {c: count}}

pipe(snoot, take(1000000), sliding_window(3), frequencies, dict.items, map(reshape), merge_with(merge))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-79160841a478> in <module>()
      3     return {(a, b): {c: count}}
      4 
----> 5 pipe(snoot, take(1000000), sliding_window(3), frequencies, dict.items, map(reshape), merge_with(merge))

/home/mrocklin/workspace/toolz/toolz/functoolz/core.pyc in pipe(data, *functions)
    278     """
    279     for func in functions:
--> 280         data = func(data)
    281     return data

TypeError: 'dict' object is not callable

In [ ]:

pipe(file_pattern, glob, map(open), map(drop(1)), concat, map(str.upper), map(str.strip), concat,
                   sliding_window(3), frequencies, dict.items, map(reshape), merge_with(merge))

In [ ]: