Pre-requisites:
This walkthrough also makes use of:
To navigate this notebook:
shift+enter
or clicking the "play" button on the toolbar.In [*]:
and will display the execution count when it is done.Kernel
dropdown menu.Restart & Run All
option# Import the packages we will use
import os.path as op
import matplotlib.pyplot as plt
import numpy as np
import pandas
import h5py
import cooler
# The following directive activates inline plotting
%matplotlib inline
filepath = 'data/Rao2014-GM12878-MboI-allreps-filtered.5kb.cool'
# Download a high resoltion COOL file from the interwebs (this will take a few...)
# The ! at the beginning of the line tells IPython to run the line in the shell.
if not op.exists(filepath):
!wget ftp://cooler.csail.mit.edu/coolers/hg19/Rao2014-GM12878-MboI-allreps-filtered.5kb.cool -O {filepath}
h5py
¶The h5py
library (HDF5 for Python) provides an excellent Pythonic interface between HDF5 and native NumPy arrays and dtypes. It allows you to treat an HDF5 file like a dictionary with complete access to the file's contents as well as the ability to manipulate groups and read or write datasets and attributes. There is additionally a low-level API that wraps the libhdf5
C functions directly. See the h5py docs.
h5 = h5py.File(filepath, 'r')
h5
<HDF5 file "Rao2014-GM12878-MboI-allreps-filtered.5kb.cool" (mode r)>
h5.keys()
<KeysViewHDF5 ['bins', 'chroms', 'indexes', 'pixels']>
Files and Groups are dict
-like.
h5['pixels']
<HDF5 group "/pixels" (3 members)>
list(h5['pixels'].keys())
['bin1_id', 'bin2_id', 'count']
h5py
dataset objects are views onto the data on disk
h5['pixels']['bin2_id']
<HDF5 dataset "bin2_id": shape (1543535265,), type "<i8">
Slicing or indexing returns a numpy array in memory.
h5['pixels']['bin2_id'][:10]
array([ 234, 1994, 3258, 4087, 6093, 37359, 49826, 49889, 58451, 60826])
h5['pixels']['count'][:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
h5.close()
The Python cooler
package is just a thin wrapper over h5py
.
See below.
Cooler
class¶Accepts a file path or an open HDF5 file object.
NOTE: Using a filepath allows the Cooler
object to be serialized/pickled since the file is only opened when needed.
c = cooler.Cooler(filepath)
c.info
{'bin-size': 5000, 'bin-type': 'fixed', 'creation-date': '2016-02-25T22:53:09.510744', 'format-url': 'https://github.com/mirnylab/cooler', 'format-version': 2, 'genome-assembly': 'hg19', 'id': None, 'library-version': '0.3.0', 'metadata': {'publication': '', 'QC': {'double-sided': {'total': 3390352656, 'valid': 3147639590, 'filtered-invalid': {'removed-self-circles': 1741768, 'removed-error-pair': 6074295, 'removed-dangling-ends': 234897003}, 'filtered-valid': {'removed-duplicate': 110650005, 'removed-start-near-rsite': '', 'removed-outlier-fragment': 151337031, 'removed-large-small-pair': 657466}}, 'pre-filtering': {'total': 5332721651, 'double-sided': 3390352656, 'unused': 0, 'single-sided': 1942368995}, 'post-filtering': {'total': 2884995088, 'cis': 2085711027, 'trans': 799284061}}, 'enzyme': 'MboI', 'cell-type': 'GM12878', 'species': 'Homo sapiens', 'sex': 'F'}, 'nbins': 619150, 'nchroms': 25, 'nnz': 1543535265}
Tables are accessed via methods.
c.chroms()
<cooler.core.RangeSelector1D at 0x7f97f6f963c8>
The return value is a selector or "view" on a table that accepts column and range queries ("slices").
c.chroms()[1:5]
name | length | |
---|---|---|
1 | chr2 | 243199373 |
2 | chr3 | 198022430 |
3 | chr4 | 191154276 |
4 | chr5 | 180915260 |
# get the whole table
c.chroms()[:]
name | length | |
---|---|---|
0 | chr1 | 249250621 |
1 | chr2 | 243199373 |
2 | chr3 | 198022430 |
3 | chr4 | 191154276 |
4 | chr5 | 180915260 |
5 | chr6 | 171115067 |
6 | chr7 | 159138663 |
7 | chr8 | 146364022 |
8 | chr9 | 141213431 |
9 | chr10 | 135534747 |
10 | chr11 | 135006516 |
11 | chr12 | 133851895 |
12 | chr13 | 115169878 |
13 | chr14 | 107349540 |
14 | chr15 | 102531392 |
15 | chr16 | 90354753 |
16 | chr17 | 81195210 |
17 | chr18 | 78077248 |
18 | chr19 | 59128983 |
19 | chr20 | 63025520 |
20 | chr21 | 48129895 |
21 | chr22 | 51304566 |
22 | chrX | 155270560 |
23 | chrY | 59373566 |
24 | chrM | 16571 |
# more convenient access to chromosomes
c.chromnames
['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']
In the bin table, the weight column contains the matrix balancing weights computed for each genomic bin.
# more convenient access to chromosome lengths
c.chromsizes
name chr1 249250621 chr2 243199373 chr3 198022430 chr4 191154276 chr5 180915260 chr6 171115067 chr7 159138663 chr8 146364022 chr9 141213431 chr10 135534747 chr11 135006516 chr12 133851895 chr13 115169878 chr14 107349540 chr15 102531392 chr16 90354753 chr17 81195210 chr18 78077248 chr19 59128983 chr20 63025520 chr21 48129895 chr22 51304566 chrX 155270560 chrY 59373566 chrM 16571 Name: length, dtype: int64
c.bins()[:10]
chrom | start | end | weight | |
---|---|---|---|---|
0 | chr1 | 0 | 5000 | NaN |
1 | chr1 | 5000 | 10000 | NaN |
2 | chr1 | 10000 | 15000 | NaN |
3 | chr1 | 15000 | 20000 | NaN |
4 | chr1 | 20000 | 25000 | NaN |
5 | chr1 | 25000 | 30000 | NaN |
6 | chr1 | 30000 | 35000 | NaN |
7 | chr1 | 35000 | 40000 | NaN |
8 | chr1 | 40000 | 45000 | NaN |
9 | chr1 | 45000 | 50000 | NaN |
Selecting a list of columns returns a new DataFrame view on that subset of columns
bins = c.bins()[['chrom', 'start', 'end']]
bins
<cooler.core.RangeSelector1D at 0x7f97f6f966a0>
bins[:10]
chrom | start | end | |
---|---|---|---|
0 | chr1 | 0 | 5000 |
1 | chr1 | 5000 | 10000 |
2 | chr1 | 10000 | 15000 |
3 | chr1 | 15000 | 20000 |
4 | chr1 | 20000 | 25000 |
5 | chr1 | 25000 | 30000 |
6 | chr1 | 30000 | 35000 |
7 | chr1 | 35000 | 40000 |
8 | chr1 | 40000 | 45000 |
9 | chr1 | 45000 | 50000 |
Selecting a single column returns a Series view
weights = c.bins()['weight']
weights
<cooler.core.RangeSelector1D at 0x7f97f6f52f60>
weights[500:510]
500 1.021570 501 1.935517 502 0.584862 503 0.895666 504 1.302236 505 0.845886 506 0.992397 507 1.470796 508 1.022486 509 0.952231 Name: weight, dtype: float64
The pixel table contains the non-zero upper triangle entries of the contact map.
c.pixels()[:10]
bin1_id | bin2_id | count | |
---|---|---|---|
0 | 2 | 234 | 1 |
1 | 2 | 1994 | 1 |
2 | 2 | 3258 | 1 |
3 | 2 | 4087 | 1 |
4 | 2 | 6093 | 1 |
5 | 2 | 37359 | 1 |
6 | 2 | 49826 | 1 |
7 | 2 | 49889 | 1 |
8 | 2 | 58451 | 1 |
9 | 2 | 60826 | 1 |
Use the join=True
option if you would like to expand the bin IDs into genomic bin coordinates by joining the output with the bin table.
c.pixels(join=True)[:10]
chrom1 | start1 | end1 | chrom2 | start2 | end2 | count | |
---|---|---|---|---|---|---|---|
0 | chr1 | 10000 | 15000 | chr1 | 1170000 | 1175000 | 1 |
1 | chr1 | 10000 | 15000 | chr1 | 9970000 | 9975000 | 1 |
2 | chr1 | 10000 | 15000 | chr1 | 16290000 | 16295000 | 1 |
3 | chr1 | 10000 | 15000 | chr1 | 20435000 | 20440000 | 1 |
4 | chr1 | 10000 | 15000 | chr1 | 30465000 | 30470000 | 1 |
5 | chr1 | 10000 | 15000 | chr1 | 186795000 | 186800000 | 1 |
6 | chr1 | 10000 | 15000 | chr1 | 249130000 | 249135000 | 1 |
7 | chr1 | 10000 | 15000 | chr2 | 190000 | 195000 | 1 |
8 | chr1 | 10000 | 15000 | chr2 | 43000000 | 43005000 | 1 |
9 | chr1 | 10000 | 15000 | chr2 | 54875000 | 54880000 | 1 |
Pandas lets you readily dump any table selection to tabular text file.
df = c.pixels(join=True)[:100]
# tab-delimited file, don't write the index column or header row
df.to_csv('data/myselection.txt', sep='\t', index=False, header=False)
!head data/myselection.txt
chr1 10000 15000 chr1 1170000 1175000 1 chr1 10000 15000 chr1 9970000 9975000 1 chr1 10000 15000 chr1 16290000 16295000 1 chr1 10000 15000 chr1 20435000 20440000 1 chr1 10000 15000 chr1 30465000 30470000 1 chr1 10000 15000 chr1 186795000 186800000 1 chr1 10000 15000 chr1 249130000 249135000 1 chr1 10000 15000 chr2 190000 195000 1 chr1 10000 15000 chr2 43000000 43005000 1 chr1 10000 15000 chr2 54875000 54880000 1
Another way to annotate the bins in a data frame of pixels is to use cooler.annotate
. It does a left outer join from the bin1_id
and bin2_id
columns onto a data frame indexed by bin ID that describes the bins.
bins = c.bins()[:] # fetch all the bins
pix = c.pixels()[100:110] # select some pixels with unannotated bins
pix
bin1_id | bin2_id | count | |
---|---|---|---|
100 | 2 | 356649 | 1 |
101 | 2 | 356700 | 1 |
102 | 2 | 363068 | 1 |
103 | 2 | 363243 | 1 |
104 | 2 | 363270 | 1 |
105 | 2 | 363293 | 1 |
106 | 2 | 363329 | 1 |
107 | 2 | 363348 | 1 |
108 | 2 | 363373 | 1 |
109 | 2 | 363374 | 1 |
cooler.annotate(pix, bins)
chrom1 | start1 | end1 | weight1 | chrom2 | start2 | end2 | weight2 | bin1_id | bin2_id | count | |
---|---|---|---|---|---|---|---|---|---|---|---|
100 | chr1 | 10000 | 15000 | NaN | chr10 | 102850000 | 102855000 | 1.503567 | 2 | 356649 | 1 |
101 | chr1 | 10000 | 15000 | NaN | chr10 | 103105000 | 103110000 | 0.924921 | 2 | 356700 | 1 |
102 | chr1 | 10000 | 15000 | NaN | chr10 | 134945000 | 134950000 | 1.480836 | 2 | 363068 | 1 |
103 | chr1 | 10000 | 15000 | NaN | chr11 | 285000 | 290000 | 1.455694 | 2 | 363243 | 1 |
104 | chr1 | 10000 | 15000 | NaN | chr11 | 420000 | 425000 | 1.107395 | 2 | 363270 | 1 |
105 | chr1 | 10000 | 15000 | NaN | chr11 | 535000 | 540000 | 0.819330 | 2 | 363293 | 1 |
106 | chr1 | 10000 | 15000 | NaN | chr11 | 715000 | 720000 | 0.911051 | 2 | 363329 | 1 |
107 | chr1 | 10000 | 15000 | NaN | chr11 | 810000 | 815000 | 1.202326 | 2 | 363348 | 1 |
108 | chr1 | 10000 | 15000 | NaN | chr11 | 935000 | 940000 | 0.889250 | 2 | 363373 | 1 |
109 | chr1 | 10000 | 15000 | NaN | chr11 | 940000 | 945000 | 1.017561 | 2 | 363374 | 1 |
cooler.annotate(pix, bins[['weight']], replace=False)
weight1 | weight2 | bin1_id | bin2_id | count | |
---|---|---|---|---|---|
100 | NaN | 1.503567 | 2 | 356649 | 1 |
101 | NaN | 0.924921 | 2 | 356700 | 1 |
102 | NaN | 1.480836 | 2 | 363068 | 1 |
103 | NaN | 1.455694 | 2 | 363243 | 1 |
104 | NaN | 1.107395 | 2 | 363270 | 1 |
105 | NaN | 0.819330 | 2 | 363293 | 1 |
106 | NaN | 0.911051 | 2 | 363329 | 1 |
107 | NaN | 1.202326 | 2 | 363348 | 1 |
108 | NaN | 0.889250 | 2 | 363373 | 1 |
109 | NaN | 1.017561 | 2 | 363374 | 1 |
Finally, the matrix
method provides a 2D-sliceable view on the data. It allows you to query the data on file as a full rectangular contact matrix.
c.matrix()
<cooler.core.RangeSelector2D at 0x7f97f6f56160>
The result of a query is a 2D NumPy array.
arr = c.matrix(balance=False)[1000:1200, 1000:1200]
arr
array([[129, 230, 72, ..., 3, 3, 3], [230, 217, 128, ..., 3, 6, 1], [ 72, 128, 42, ..., 3, 1, 1], ..., [ 3, 3, 3, ..., 199, 357, 165], [ 3, 6, 1, ..., 357, 225, 257], [ 3, 1, 1, ..., 165, 257, 136]])
Use sparse=True
to return scipy.sparse.coo_matrix
objects instead.
mat = c.matrix(balance=False, sparse=True)[1000:1200, 1000:1200]
mat
<200x200 sparse matrix of type '<class 'numpy.int64'>' with 39393 stored elements in COOrdinate format>
It is straightforward to convert to a dense 2D numpy array.
arr = mat.toarray()
arr
array([[129, 230, 72, ..., 3, 3, 3], [230, 217, 128, ..., 3, 6, 1], [ 72, 128, 42, ..., 3, 1, 1], ..., [ 3, 3, 3, ..., 199, 357, 165], [ 3, 6, 1, ..., 357, 225, 257], [ 3, 1, 1, ..., 165, 257, 136]])
Notice that the lower triangle has been automatically filled in.
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
im = ax.matshow(np.log10(arr), cmap='YlOrRd')
fig.colorbar(im)
/home/nezar/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log10 This is separate from the ipykernel package so we can avoid doing imports until
<matplotlib.colorbar.Colorbar at 0x7f97f57539b0>
Notice the light and dark "banded" appearance? That's because you are looking at the unnormalized counts.
We usually normalize or "correct" Hi-C using a technique called matrix balancing. This involves finding a set of weights or biases $b_i$ for each bin $i$ such that
$$ Normalized[i,j] = Observed[i,j] \cdot b[i] \cdot b[j], $$such that the marginals (i.e., row/column sums) of the global contact matrix are flat and equal.
Cooler can store the pre-computed balancing weights in the bin table.
Here's one way to manually apply them to balance your selection.
# get the balancing weights as a numpy array
weights = c.bins()['weight'] # view
bias = weights[1000:1200] # series
bias = bias.values # array
# fetch a sparse matrix
mat = c.matrix(balance=False, sparse=True)[1000:1200, 1000:1200]
# apply the balancing weights
mat.data = bias[mat.row] * bias[mat.col] * mat.data
# convert to dense numpy array
arr = mat.toarray()
As a shortcut, we get the same result by passing balance=True
to the matrix view constructor.
arr2 = c.matrix(balance=True, sparse=True)[1000:1200, 1000:1200].toarray()
np.allclose(arr, arr2, equal_nan=True)
True
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
im = ax.matshow(np.log10(arr), cmap='YlOrRd')
fig.colorbar(im)
/home/nezar/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log10 This is separate from the ipykernel package so we can avoid doing imports until
<matplotlib.colorbar.Colorbar at 0x7f97f4259898>
The bin table, pixel table and matrix views also accept UCSC-style genomic range strings or (chrom, start, end) triples.
c.bins().fetch('chr2:10,000,000-20,000,000')
chrom | start | end | weight | |
---|---|---|---|---|
51851 | chr2 | 10000000 | 10005000 | 2.474065 |
51852 | chr2 | 10005000 | 10010000 | 1.403127 |
51853 | chr2 | 10010000 | 10015000 | 0.883045 |
51854 | chr2 | 10015000 | 10020000 | 0.771122 |
51855 | chr2 | 10020000 | 10025000 | 0.924510 |
... | ... | ... | ... | ... |
53846 | chr2 | 19975000 | 19980000 | 0.575194 |
53847 | chr2 | 19980000 | 19985000 | 1.450563 |
53848 | chr2 | 19985000 | 19990000 | 0.861964 |
53849 | chr2 | 19990000 | 19995000 | 0.786807 |
53850 | chr2 | 19995000 | 20000000 | 0.692477 |
2000 rows × 4 columns
cis = c.matrix(sparse=True).fetch('chr21')
cis.shape
(9626, 9626)
trans = c.matrix(sparse=True).fetch('chr21', 'chr22')
trans.shape
(9626, 10261)