All steps of hicstuff pipeline and downstream operations can be ran using the python api. The matrix can be generated directly from the reads, just like in the command line or by doing each step separately for fine control.
If using minimap2, the genome can be in fasta format. If using bowtie2, it must be indexed using bowtie2-build:
bowtie2-build genome.fasta genome
The input reads can be in fastq format, or if already aligned to the genome in SAM/BAM format. The pipeline also accepts input in the form of pairs file. The input format is specified using the start_stage
argument (fastq/sam/pairs/pairs_idx).
The hicstuff.pipeline submodule allows to run all steps at one, in a way identical to the hicstuff pipeline command. Here we assume the reads have been aligned and converted to a pairs file already by specifying start_stage='pairs'
, but it is also possible to give fastq files or sam files as input by specifying start_stage='fastq'
or start_stage='sam'
.
import hicstuff.pipeline as hpi
# Generating matrix from dummy dataset
hpi.full_pipeline(input1='../../test_data/valid.pairs',
input2=None,
genome='../../test_data/genome/seq',
enzyme="DpnII",
filter_events=True,
pcr_duplicates=True,
start_stage='pairs')
INFO :: ## hicstuff: v1.4.4 log file INFO :: ## date: 2019-07-17 16:48:36 INFO :: ## enzyme: DpnII INFO :: ## input1: ../../test_data/valid.pairs INFO :: ## input2: None INFO :: ## ref: ../../test_data/genome/seq INFO :: --- INFO :: Filtering with thresholds: uncuts=6 loops=5 INFO :: Proportion of inter contacts: 1.97% (intra: 298, inter: 6) INFO :: 9696 pairs discarded: Loops: 31, Uncuts: 9663, Weirds: 2 INFO :: 304 pairs kept (3.04%) INFO :: 0% PCR duplicates have been filtered out (0 / 304 pairs) INFO :: 304 pairs used to build a contact map of 564 bins with 304 nonzero entries. INFO :: Contact map generated after 0h 0m 0s
The temporary pairs files can be used to visualise the probability of contacts over genomic distance (distance law). Functions in the view module can also be used to generate visualisations of the results. The utilities in the hicstuff module are handy to transform the matrix.
%matplotlib notebook
from importlib import reload
import hicstuff.distance_law as hdl
from matplotlib import pyplot as plt
reload(hdl)
xs, ps, names = hdl.get_distance_law('example_data/example_clean.pairs', 'example_data/fragments_list.txt')
# Normalize P(s) to sum to 1 between
ps = hdl.normalize_distance_law(xs, ps)
# Extract first chromosome only
chrom_xs, chrom_ps, chrom_name = xs[0], ps[0], names[0]
# Select interactions between 10 and 500kb
bp_range = (chrom_xs > 10000) & (chrom_xs < 500000)
# Plot log probability of contact versus log genomic distance
plt.loglog(chrom_xs[bp_range], chrom_ps[bp_range])
[<matplotlib.lines.Line2D at 0x7f302008c550>]
Common operations for processing Hi-C matrices are made available in hicstuff. These operations work on scipy sparse matrices.
%matplotlib notebook
from hicstuff import hicstuff as hcs
import hicstuff.view as hcv
import hicstuff.io as hio
from scipy.sparse import csr_matrix
from scipy.ndimage import gaussian_filter
import numpy as np
# Load COO sparse matrix from text file
sparse_mat = hio.load_sparse_matrix('example_data/abs_fragments_weighted.txt')
# Normalize matrix so that each genomic bin is equally covered
norm_mat = hcs.normalize_sparse(sparse_mat)
# Remove outlier bins with very little contacts
trim_mat = hcs.trim_sparse(norm_mat)
# Remove "speckles" (outlier pixel values) to reduce noise
clean_mat = hcs.despeckle_simple(trim_mat)
# Show the matrix
dense_mat = hcv.sparse_to_dense(clean_mat, remove_diag=False)
# Add some blur to the matrix :)
blurred_mat = gaussian_filter(dense_mat, 1.1)
# Plot blurred matrix
hcv.plot_matrix(blurred_mat, vmax=0.05)
WARNING :: /home/varogh/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py:708: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. self[i, j] = values
%matplotlib notebook
from matplotlib import pyplot as plt
# Visualize matrix at each different steps
todense = lambda x: hcv.sparse_to_dense(x, remove_diag=False)
mats = [
('raw', todense(sparse_mat)),
('normalized', todense(norm_mat)),
('trimmed', todense(trim_mat)),
('despeckled', todense(clean_mat)),
('Gaussian blur', blurred_mat),
('shrec', hcs.shortest_path_interpolation(dense_mat, alpha=0.04, strict=True))
]
fig, ax = plt.subplots(3, 2, sharex=True, sharey=True, figsize=(12, 18), dpi=80)
for i, axi in enumerate(ax.flatten()):
axi.set_aspect(1)
axi.set_title(mats[i][0])
axi.imshow(mats[i][1], cmap="Oranges", vmax=np.percentile(mats[i][1], 98.5))
plt.suptitle("Hi-C matrix at different processing steps")