Inspired/stolen from work by John Jacobsen: http://eigenhombre.com/2013/11/03/nucleotide-repetition-lengths/
An exercise in laziness and counting.
We represent genomes as strings on an alphabet of four letters, A, C, G, and T. Thus part of a genome might look like the following:
"...AACCGTGTGCGTTTATTAATTATTGCTTTA..."
Each letter corresponds to a particular nucleic acid (TODO verify this) in a very long strand of DNA. A single genome can be quite large and scientists generate them at an alarming rate.
In this section we'll process genomes and perform simple counting analytics on them. We'll start by counting frequencies of the individual nucleic acids (how often does A
appear?), move on to counting lengths of repeated nucleic acids (how often do we see long repeated strings like ('AAAAAA'), and finish with a quick markov chain.
Throughout we'll see exercise laziness and become familiar with the following functions from PyToolz.
from toolz.curried import (map, frequencies, pipe, take, concat, drop, count,
partitionby, identity, compose, sliding_window,
merge_with, merge)
We've stored some actual genomes in the data folder
from glob import glob
file_pattern = 'data/yeast-genome/chr*.fa'
We'll compare our results against the genome of a snoot. A snoot is a magical creature with an infinitely long gene sequence where each letter is uniformly chosen. We can implement a snoot genome with a Python generator.
By comparing our physical genomes against that of the snoot we'll highlight ways in which we deviate from randomness.
import random
def random_genome():
while True:
yield random.choice('ACTG')
snoot = random_genome()
list(take(10, snoot))
['A', 'C', 'A', 'T', 'C', 'T', 'G', 'C', 'C', 'T']
# Set up matplotlib and build a small function to plot histograms of count dictionaries
from pylab import hist, show, xticks, title
%matplotlib inline
def histdict(d, *args, **kwargs):
""" Plot a histogram given a dictionary of counts
See Also:
matplotlib.hist
"""
if all(isinstance(k, (int, float)) for k in d.keys()):
hist(d.keys(), *args, weights=d.values(), **kwargs)
else:
keydict = dict(enumerate(d.keys()))
revdict = {v: k for k, v in keydict.items()}
hist(list(map(revdict.get, d.keys())), *args, weights=d.values(), **kwargs)
xticks(keydict.keys(), keydict.values())
histdict({'dog': 3, 'cat': 5})
show()
counts = frequencies(take(10000, snoot))
histdict(counts)
counts = pipe(file_pattern, glob, map(open), map(drop(1)),
concat, map(str.upper), map(str.strip),
concat, frequencies)
histdict(counts)
How often do very long repetitions of base pair occur? E.g. are sequences like "TTTTTTTTT" very rare or are they commonplace?
# Using partitionby - also known as itertools.groupby
from toolz.curried import partitionby, identity
animals = ['cat', 'dog', 'hen', 'goose', 'moose', 'rat', 'giraffe']
list(partitionby(len, animals)) # Parition animals by their name length
[['cat', 'dog', 'hen'], ['goose', 'moose'], ['rat'], ['giraffe']]
# We want to partition our gene sequences 'AACCTTTTTGCT' by the letters themselves
# Instead of `len`, our key function is just the trivial identity function
identity('A')
'A'
# Lets compose partitionby with identity to collect our sequence
# into groups of repeated elements
partitions = list(partitionby(identity, 'AACCTTTTTGCT'))
partitions
[['A', 'A'], ['C', 'C'], ['T', 'T', 'T', 'T', 'T'], ['G'], ['C'], ['T']]
# From here we just want to compute the length of each group
list(map(len, partitions))
[2, 2, 5, 1, 1, 1]
Lets pull this together and count recurring base pairs on our random genome.
pipe(snoot, take(20), partitionby(identity), map(len), list)
[1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1]
result = pipe(snoot, take(1000000), partitionby(identity), map(len), frequencies)
histdict(result, log=True)
title('Counts of Repeated Base Pairs - Random')
result
{1: 562222, 2: 140330, 3: 35289, 4: 8843, 5: 2219, 6: 577, 7: 138, 8: 33, 9: 8, 10: 2}
result = pipe(file_pattern, glob, map(open), map(drop(1)),
concat, map(str.upper), map(str.strip), concat,
partitionby(identity), map(len), frequencies)
histdict(result, log=True)
title('Counts of Repeated Base Pairs - Yeast')
result
{1: 6118373, 2: 1729275, 3: 488751, 4: 158660, 5: 52793, 6: 17076, 7: 7079, 8: 2700, 9: 1305, 10: 778, 11: 496, 12: 344, 13: 247, 14: 135, 15: 91, 16: 64, 17: 52, 18: 26, 19: 34, 20: 23, 21: 14, 22: 17, 23: 13, 24: 19, 25: 8, 26: 10, 27: 4, 28: 4, 29: 3, 30: 2, 31: 4, 32: 1, 33: 1, 34: 1, 35: 2, 36: 1, 37: 1, 42: 1}
Do certain base pairs commonly occur after others? Given that we've just seen "AG" is it more or less likely that we'll see 'G' again? What about 'T'? These questions can be answered with Markov chains.
To do this exercise we'll need to use toolz.sliding_window
. Sliding window is a generalization on functions like sliding average or sliding max. Here are some examples demonstrating its use
list(sliding_window(3, range(10)))
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]
list(sliding_window(5, range(10)))
[(0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (2, 3, 4, 5, 6), (3, 4, 5, 6, 7), (4, 5, 6, 7, 8), (5, 6, 7, 8, 9)]
list(map(sum, sliding_window(5, range(10))))
[10, 15, 20, 25, 30, 35]
from numpy import mean
data = [4, 4, 6, 3, 4, 6, 7, 3, 8, 9, 10, 10, 10, 10, 8, 7]
moving_average = compose(map(mean), sliding_window(3))
list(moving_average(data))
[4.666666666666667, 4.333333333333333, 4.333333333333333, 4.333333333333333, 5.666666666666667, 5.333333333333333, 6.0, 6.666666666666667, 9.0, 9.6666666666666661, 10.0, 10.0, 9.3333333333333339, 8.3333333333333339]
We'll use sliding window without a function like sum
or mean
. We'll use it to chop up our sequence of base-pairs into a sequence of tuples of base-pairs, called k-mers.
seq = list(take(20, snoot))
print seq
pipe(seq, sliding_window(3), list)
['C', 'T', 'A', 'C', 'C', 'A', 'G', 'A', 'G', 'G', 'T', 'G', 'C', 'C', 'T', 'C', 'C', 'T', 'A', 'A']
[('C', 'T', 'A'), ('T', 'A', 'C'), ('A', 'C', 'C'), ('C', 'C', 'A'), ('C', 'A', 'G'), ('A', 'G', 'A'), ('G', 'A', 'G'), ('A', 'G', 'G'), ('G', 'G', 'T'), ('G', 'T', 'G'), ('T', 'G', 'C'), ('G', 'C', 'C'), ('C', 'C', 'T'), ('C', 'T', 'C'), ('T', 'C', 'C'), ('C', 'C', 'T'), ('C', 'T', 'A'), ('T', 'A', 'A')]
pipe(snoot, take(1000000), sliding_window(3), frequencies)
{('A', 'A', 'A'): 15769, ('A', 'A', 'C'): 15635, ('A', 'A', 'G'): 15442, ('A', 'A', 'T'): 15915, ('A', 'C', 'A'): 15599, ('A', 'C', 'C'): 15733, ('A', 'C', 'G'): 15629, ('A', 'C', 'T'): 15649, ('A', 'G', 'A'): 15747, ('A', 'G', 'C'): 15731, ('A', 'G', 'G'): 15585, ('A', 'G', 'T'): 15609, ('A', 'T', 'A'): 15923, ('A', 'T', 'C'): 15486, ('A', 'T', 'G'): 15648, ('A', 'T', 'T'): 15609, ('C', 'A', 'A'): 15641, ('C', 'A', 'C'): 15628, ('C', 'A', 'G'): 15758, ('C', 'A', 'T'): 15424, ('C', 'C', 'A'): 15629, ('C', 'C', 'C'): 15548, ('C', 'C', 'G'): 15426, ('C', 'C', 'T'): 15793, ('C', 'G', 'A'): 15466, ('C', 'G', 'C'): 15468, ('C', 'G', 'G'): 15654, ('C', 'G', 'T'): 15599, ('C', 'T', 'A'): 15867, ('C', 'T', 'C'): 15621, ('C', 'T', 'G'): 15367, ('C', 'T', 'T'): 15654, ('G', 'A', 'A'): 15487, ('G', 'A', 'C'): 15518, ('G', 'A', 'G'): 15614, ('G', 'A', 'T'): 15710, ('G', 'C', 'A'): 15634, ('G', 'C', 'C'): 15702, ('G', 'C', 'G'): 15518, ('G', 'C', 'T'): 15571, ('G', 'G', 'A'): 15578, ('G', 'G', 'C'): 15626, ('G', 'G', 'G'): 15692, ('G', 'G', 'T'): 15613, ('G', 'T', 'A'): 15655, ('G', 'T', 'C'): 15478, ('G', 'T', 'G'): 15736, ('G', 'T', 'T'): 15581, ('T', 'A', 'A'): 15864, ('T', 'A', 'C'): 15829, ('T', 'A', 'G'): 15857, ('T', 'A', 'T'): 15617, ('T', 'C', 'A'): 15589, ('T', 'C', 'C'): 15413, ('T', 'C', 'G'): 15614, ('T', 'C', 'T'): 15496, ('T', 'G', 'A'): 15538, ('T', 'G', 'C'): 15600, ('T', 'G', 'G'): 15578, ('T', 'G', 'T'): 15629, ('T', 'T', 'A'): 15722, ('T', 'T', 'C'): 15527, ('T', 'T', 'G'): 15595, ('T', 'T', 'T'): 15565}
def reshape(item):
((a, b, c), count) = item # pattern matching
return {(a, b): {c: count}}
pipe(snoot, take(1000000), sliding_window(3), frequencies, dict.items, map(reshape), merge_with(merge))
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-24-79160841a478> in <module>() 3 return {(a, b): {c: count}} 4 ----> 5 pipe(snoot, take(1000000), sliding_window(3), frequencies, dict.items, map(reshape), merge_with(merge)) /home/mrocklin/workspace/toolz/toolz/functoolz/core.pyc in pipe(data, *functions) 278 """ 279 for func in functions: --> 280 data = func(data) 281 return data TypeError: 'dict' object is not callable
pipe(file_pattern, glob, map(open), map(drop(1)), concat, map(str.upper), map(str.strip), concat,
sliding_window(3), frequencies, dict.items, map(reshape), merge_with(merge))