Cooler Python API¶

Pre-requisites:

Basic Python knowledge
Some experience with the NumPy array package (or MATLAB)

This walkthrough also makes use of:

Jupyter (IPython) Notebook. That's what you're using right now!
Pandas, the dataframe package (similar to R data.frames)
h5py, the package to interact with HDF5 files from Python
matplotlib, the MATLAB-inspired plotting package for Python

To navigate this notebook:

Click on a code cell and execute its code by pressing shift+enter or clicking the "play" button on the toolbar.
While the code cell is running the prompt on the left will look like In [*]: and will display the execution count when it is done.
Execution output will be displayed beneath each cell.
To restart the notebook, use the options in the Kernel dropdown menu.
You can also run the entire notebook in one go with the Restart & Run All option

In [1]:

# Import the packages we will use
import os.path as op
import matplotlib.pyplot as plt
import numpy as np
import pandas
import h5py

import cooler

In [2]:

# The following directive activates inline plotting
%matplotlib inline

In [3]:

filepath = 'data/Rao2014-GM12878-MboI-allreps-filtered.5kb.cool'

# Download a high resoltion COOL file from the interwebs (this will take a few...)
# The ! at the beginning of the line tells IPython to run the line in the shell.
if not op.exists(filepath):
    !wget ftp://cooler.csail.mit.edu/coolers/hg19/Rao2014-GM12878-MboI-allreps-filtered.5kb.cool -O {filepath}

Direct access with `h5py`¶

The h5py library (HDF5 for Python) provides an excellent Pythonic interface between HDF5 and native NumPy arrays and dtypes. It allows you to treat an HDF5 file like a dictionary with complete access to the file's contents as well as the ability to manipulate groups and read or write datasets and attributes. There is additionally a low-level API that wraps the libhdf5 C functions directly. See the h5py docs.

In [4]:

h5 = h5py.File(filepath, 'r')

In [5]:

h5

Out[5]:

<HDF5 file "Rao2014-GM12878-MboI-allreps-filtered.5kb.cool" (mode r)>

In [6]:

h5.keys()

Out[6]:

<KeysViewHDF5 ['bins', 'chroms', 'indexes', 'pixels']>

Files and Groups are dict-like.

In [7]:

h5['pixels']

Out[7]:

<HDF5 group "/pixels" (3 members)>

In [8]:

list(h5['pixels'].keys())

Out[8]:

['bin1_id', 'bin2_id', 'count']

h5py dataset objects are views onto the data on disk

In [9]:

h5['pixels']['bin2_id']

Out[9]:

<HDF5 dataset "bin2_id": shape (1543535265,), type "<i8">

Slicing or indexing returns a numpy array in memory.

In [10]:

h5['pixels']['bin2_id'][:10]

Out[10]:

array([  234,  1994,  3258,  4087,  6093, 37359, 49826, 49889, 58451,
       60826])

In [11]:

h5['pixels']['count'][:10]

Out[11]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [12]:

h5.close()

The Python cooler package is just a thin wrapper over h5py.

It lets you access the data tables as Pandas data frames and series.
It also provides a matrix abstraction: letting you query the upper triangle pixel table as if it were a full rectangular sparse matrix via SciPy.

See below.

The `Cooler` class¶

Accepts a file path or an open HDF5 file object.

NOTE: Using a filepath allows the Cooler object to be serialized/pickled since the file is only opened when needed.

In [13]:

c = cooler.Cooler(filepath)

The info dictionary¶

In [14]:

c.info

Out[14]:

{'bin-size': 5000,
 'bin-type': 'fixed',
 'creation-date': '2016-02-25T22:53:09.510744',
 'format-url': 'https://github.com/mirnylab/cooler',
 'format-version': 2,
 'genome-assembly': 'hg19',
 'id': None,
 'library-version': '0.3.0',
 'metadata': {'publication': '',
  'QC': {'double-sided': {'total': 3390352656,
    'valid': 3147639590,
    'filtered-invalid': {'removed-self-circles': 1741768,
     'removed-error-pair': 6074295,
     'removed-dangling-ends': 234897003},
    'filtered-valid': {'removed-duplicate': 110650005,
     'removed-start-near-rsite': '',
     'removed-outlier-fragment': 151337031,
     'removed-large-small-pair': 657466}},
   'pre-filtering': {'total': 5332721651,
    'double-sided': 3390352656,
    'unused': 0,
    'single-sided': 1942368995},
   'post-filtering': {'total': 2884995088,
    'cis': 2085711027,
    'trans': 799284061}},
  'enzyme': 'MboI',
  'cell-type': 'GM12878',
  'species': 'Homo sapiens',
  'sex': 'F'},
 'nbins': 619150,
 'nchroms': 25,
 'nnz': 1543535265}

Table Views¶

Tables are accessed via methods.

In [15]:

c.chroms()

Out[15]:

<cooler.core.RangeSelector1D at 0x7f97f6f963c8>

The return value is a selector or "view" on a table that accepts column and range queries ("slices").

Column selections return a new view.
Range selections return pandas DataFrames or Series.

In [16]:

c.chroms()[1:5]

Out[16]:

	name	length
1	chr2	243199373
2	chr3	198022430
3	chr4	191154276
4	chr5	180915260

In [17]:

# get the whole table
c.chroms()[:]

Out[17]:

	name	length
0	chr1	249250621
1	chr2	243199373
2	chr3	198022430
3	chr4	191154276
4	chr5	180915260
5	chr6	171115067
6	chr7	159138663
7	chr8	146364022
8	chr9	141213431
9	chr10	135534747
10	chr11	135006516
11	chr12	133851895
12	chr13	115169878
13	chr14	107349540
14	chr15	102531392
15	chr16	90354753
16	chr17	81195210
17	chr18	78077248
18	chr19	59128983
19	chr20	63025520
20	chr21	48129895
21	chr22	51304566
22	chrX	155270560
23	chrY	59373566
24	chrM	16571

In [18]:

# more convenient access to chromosomes
c.chromnames

Out[18]:

['chr1',
 'chr2',
 'chr3',
 'chr4',
 'chr5',
 'chr6',
 'chr7',
 'chr8',
 'chr9',
 'chr10',
 'chr11',
 'chr12',
 'chr13',
 'chr14',
 'chr15',
 'chr16',
 'chr17',
 'chr18',
 'chr19',
 'chr20',
 'chr21',
 'chr22',
 'chrX',
 'chrY',
 'chrM']

In the bin table, the weight column contains the matrix balancing weights computed for each genomic bin.

In [19]:

# more convenient access to chromosome lengths
c.chromsizes

Out[19]:

name
chr1     249250621
chr2     243199373
chr3     198022430
chr4     191154276
chr5     180915260
chr6     171115067
chr7     159138663
chr8     146364022
chr9     141213431
chr10    135534747
chr11    135006516
chr12    133851895
chr13    115169878
chr14    107349540
chr15    102531392
chr16     90354753
chr17     81195210
chr18     78077248
chr19     59128983
chr20     63025520
chr21     48129895
chr22     51304566
chrX     155270560
chrY      59373566
chrM         16571
Name: length, dtype: int64

In [20]:

c.bins()[:10]

Out[20]:

	chrom	start	end	weight
0	chr1	0	5000	NaN
1	chr1	5000	10000	NaN
2	chr1	10000	15000	NaN
3	chr1	15000	20000	NaN
4	chr1	20000	25000	NaN
5	chr1	25000	30000	NaN
6	chr1	30000	35000	NaN
7	chr1	35000	40000	NaN
8	chr1	40000	45000	NaN
9	chr1	45000	50000	NaN

Selecting a list of columns returns a new DataFrame view on that subset of columns

In [21]:

bins = c.bins()[['chrom', 'start', 'end']]
bins

Out[21]:

<cooler.core.RangeSelector1D at 0x7f97f6f966a0>

In [22]:

bins[:10]

Out[22]:

	chrom	start	end
0	chr1	0	5000
1	chr1	5000	10000
2	chr1	10000	15000
3	chr1	15000	20000
4	chr1	20000	25000
5	chr1	25000	30000
6	chr1	30000	35000
7	chr1	35000	40000
8	chr1	40000	45000
9	chr1	45000	50000

Selecting a single column returns a Series view

In [23]:

weights = c.bins()['weight']
weights

Out[23]:

<cooler.core.RangeSelector1D at 0x7f97f6f52f60>

In [24]:

weights[500:510]

Out[24]:

500    1.021570
501    1.935517
502    0.584862
503    0.895666
504    1.302236
505    0.845886
506    0.992397
507    1.470796
508    1.022486
509    0.952231
Name: weight, dtype: float64

The pixel table contains the non-zero upper triangle entries of the contact map.

In [25]:

c.pixels()[:10]

Out[25]:

	bin1_id	bin2_id	count
0	2	234	1
1	2	1994	1
2	2	3258	1
3	2	4087	1
4	2	6093	1
5	2	37359	1
6	2	49826	1
7	2	49889	1
8	2	58451	1
9	2	60826	1

Use the join=True option if you would like to expand the bin IDs into genomic bin coordinates by joining the output with the bin table.

In [26]:

c.pixels(join=True)[:10]

Out[26]:

	chrom1	start1	end1	chrom2	start2	end2	count
0	chr1	10000	15000	chr1	1170000	1175000	1
1	chr1	10000	15000	chr1	9970000	9975000	1
2	chr1	10000	15000	chr1	16290000	16295000	1
3	chr1	10000	15000	chr1	20435000	20440000	1
4	chr1	10000	15000	chr1	30465000	30470000	1
5	chr1	10000	15000	chr1	186795000	186800000	1
6	chr1	10000	15000	chr1	249130000	249135000	1
7	chr1	10000	15000	chr2	190000	195000	1
8	chr1	10000	15000	chr2	43000000	43005000	1
9	chr1	10000	15000	chr2	54875000	54880000	1

Pandas lets you readily dump any table selection to tabular text file.

In [27]:

df = c.pixels(join=True)[:100]

# tab-delimited file, don't write the index column or header row
df.to_csv('data/myselection.txt', sep='\t', index=False, header=False)

In [28]:

!head data/myselection.txt

chr1	10000	15000	chr1	1170000	1175000	1
chr1	10000	15000	chr1	9970000	9975000	1
chr1	10000	15000	chr1	16290000	16295000	1
chr1	10000	15000	chr1	20435000	20440000	1
chr1	10000	15000	chr1	30465000	30470000	1
chr1	10000	15000	chr1	186795000	186800000	1
chr1	10000	15000	chr1	249130000	249135000	1
chr1	10000	15000	chr2	190000	195000	1
chr1	10000	15000	chr2	43000000	43005000	1
chr1	10000	15000	chr2	54875000	54880000	1

Bin annotation¶

Another way to annotate the bins in a data frame of pixels is to use cooler.annotate. It does a left outer join from the bin1_id and bin2_id columns onto a data frame indexed by bin ID that describes the bins.

In [29]:

bins = c.bins()[:]  # fetch all the bins

pix = c.pixels()[100:110]  # select some pixels with unannotated bins
pix

Out[29]:

	bin1_id	bin2_id	count
100	2	356649	1
101	2	356700	1
102	2	363068	1
103	2	363243	1
104	2	363270	1
105	2	363293	1
106	2	363329	1
107	2	363348	1
108	2	363373	1
109	2	363374	1

In [30]:

cooler.annotate(pix, bins)

Out[30]:

	chrom1	start1	end1	weight1	chrom2	start2	end2	weight2	bin1_id	bin2_id	count
100	chr1	10000	15000	NaN	chr10	102850000	102855000	1.503567	2	356649	1
101	chr1	10000	15000	NaN	chr10	103105000	103110000	0.924921	2	356700	1
102	chr1	10000	15000	NaN	chr10	134945000	134950000	1.480836	2	363068	1
103	chr1	10000	15000	NaN	chr11	285000	290000	1.455694	2	363243	1
104	chr1	10000	15000	NaN	chr11	420000	425000	1.107395	2	363270	1
105	chr1	10000	15000	NaN	chr11	535000	540000	0.819330	2	363293	1
106	chr1	10000	15000	NaN	chr11	715000	720000	0.911051	2	363329	1
107	chr1	10000	15000	NaN	chr11	810000	815000	1.202326	2	363348	1
108	chr1	10000	15000	NaN	chr11	935000	940000	0.889250	2	363373	1
109	chr1	10000	15000	NaN	chr11	940000	945000	1.017561	2	363374	1

In [31]:

cooler.annotate(pix, bins[['weight']], replace=False)

Out[31]:

	weight1	weight2	bin1_id	bin2_id	count
100	NaN	1.503567	2	356649	1
101	NaN	0.924921	2	356700	1
102	NaN	1.480836	2	363068	1
103	NaN	1.455694	2	363243	1
104	NaN	1.107395	2	363270	1
105	NaN	0.819330	2	363293	1
106	NaN	0.911051	2	363329	1
107	NaN	1.202326	2	363348	1
108	NaN	0.889250	2	363373	1
109	NaN	1.017561	2	363374	1

Enter The Matrix¶

Finally, the matrix method provides a 2D-sliceable view on the data. It allows you to query the data on file as a full rectangular contact matrix.

In [32]:

c.matrix()

Out[32]:

<cooler.core.RangeSelector2D at 0x7f97f6f56160>

The result of a query is a 2D NumPy array.

In [33]:

arr = c.matrix(balance=False)[1000:1200, 1000:1200]
arr

Out[33]:

array([[129, 230,  72, ...,   3,   3,   3],
       [230, 217, 128, ...,   3,   6,   1],
       [ 72, 128,  42, ...,   3,   1,   1],
       ...,
       [  3,   3,   3, ..., 199, 357, 165],
       [  3,   6,   1, ..., 357, 225, 257],
       [  3,   1,   1, ..., 165, 257, 136]])

Use sparse=True to return scipy.sparse.coo_matrix objects instead.

In [34]:

mat = c.matrix(balance=False, sparse=True)[1000:1200, 1000:1200]
mat

Out[34]:

<200x200 sparse matrix of type '<class 'numpy.int64'>'
	with 39393 stored elements in COOrdinate format>

It is straightforward to convert to a dense 2D numpy array.

In [35]:

arr = mat.toarray()
arr

Out[35]:

array([[129, 230,  72, ...,   3,   3,   3],
       [230, 217, 128, ...,   3,   6,   1],
       [ 72, 128,  42, ...,   3,   1,   1],
       ...,
       [  3,   3,   3, ..., 199, 357, 165],
       [  3,   6,   1, ..., 357, 225, 257],
       [  3,   1,   1, ..., 165, 257, 136]])

Notice that the lower triangle has been automatically filled in.

In [36]:

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
im = ax.matshow(np.log10(arr), cmap='YlOrRd')
fig.colorbar(im)

/home/nezar/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log10
  This is separate from the ipykernel package so we can avoid doing imports until

Out[36]:

<matplotlib.colorbar.Colorbar at 0x7f97f57539b0>

Notice the light and dark "banded" appearance? That's because you are looking at the unnormalized counts.

Balancing your selection¶

We usually normalize or "correct" Hi-C using a technique called matrix balancing. This involves finding a set of weights or biases $b_i$ for each bin $i$ such that

$$ Normalized[i,j] = Observed[i,j] \cdot b[i] \cdot b[j], $$

such that the marginals (i.e., row/column sums) of the global contact matrix are flat and equal.

Cooler can store the pre-computed balancing weights in the bin table.

Here's one way to manually apply them to balance your selection.

In [37]:

# get the balancing weights as a numpy array
weights = c.bins()['weight']  # view
bias = weights[1000:1200]     # series
bias = bias.values            # array

# fetch a sparse matrix
mat = c.matrix(balance=False, sparse=True)[1000:1200, 1000:1200]

# apply the balancing weights
mat.data = bias[mat.row] * bias[mat.col] * mat.data

# convert to dense numpy array
arr = mat.toarray()

As a shortcut, we get the same result by passing balance=True to the matrix view constructor.

In [38]:

arr2 = c.matrix(balance=True, sparse=True)[1000:1200, 1000:1200].toarray()
np.allclose(arr, arr2, equal_nan=True)

Out[38]:

True

In [39]:

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
im = ax.matshow(np.log10(arr), cmap='YlOrRd')
fig.colorbar(im)

/home/nezar/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log10
  This is separate from the ipykernel package so we can avoid doing imports until

Out[39]:

<matplotlib.colorbar.Colorbar at 0x7f97f4259898>

Genomic coordinate range selection¶

The bin table, pixel table and matrix views also accept UCSC-style genomic range strings or (chrom, start, end) triples.

In [40]:

c.bins().fetch('chr2:10,000,000-20,000,000')

Out[40]:

	chrom	start	end	weight
51851	chr2	10000000	10005000	2.474065
51852	chr2	10005000	10010000	1.403127
51853	chr2	10010000	10015000	0.883045
51854	chr2	10015000	10020000	0.771122
51855	chr2	10020000	10025000	0.924510
...	...	...	...	...
53846	chr2	19975000	19980000	0.575194
53847	chr2	19980000	19985000	1.450563
53848	chr2	19985000	19990000	0.861964
53849	chr2	19990000	19995000	0.786807
53850	chr2	19995000	20000000	0.692477

2000 rows × 4 columns

In [41]:

cis = c.matrix(sparse=True).fetch('chr21')
cis.shape

Out[41]:

(9626, 9626)

In [42]:

trans = c.matrix(sparse=True).fetch('chr21', 'chr22')
trans.shape

Out[42]:

(9626, 10261)

In [ ]:

Table of Contents