Firstly import all components from coolbox.api
, and check your CoolBox version.
# change working directory
import os
os.chdir("../../")
print(f"Current working directory: {os.path.abspath(os.curdir)}")
import coolbox
from coolbox.api import *
coolbox.__version__
'0.3.0'
Here, we use a small testing dataset for convenient. This dataset contains files in differnet file formats, and they are in same genome range(chr9:4000000-6000000) of a reference genome (hg19).
!pwd
!ls -lh tests/test_data/
/Users/bakezq/Desktop/workspace/CoolBox.nosync total 197152 -rw-r--r-- 1 bakezq staff 787K Dec 6 04:11 bam_chr9_4000000_6000000.bam -rw-r--r-- 1 bakezq staff 5.8K Jan 16 09:32 bam_chr9_4000000_6000000.bam.bai -rw-r--r-- 1 bakezq staff 8.5K Dec 6 04:11 bed_chr9_4000000_6000000.bed -rw-r--r-- 1 bakezq staff 2.2K Jan 16 09:32 bed_chr9_4000000_6000000.bed.bgz -rw-r--r-- 1 bakezq staff 220B Jan 16 09:32 bed_chr9_4000000_6000000.bed.bgz.tbi -rw-r--r-- 1 bakezq staff 18K Dec 6 04:11 bedgraph_chr9_4000000_6000000.bg -rw-r--r-- 1 bakezq staff 264B Dec 6 04:11 bedpe_chr9_4000000_6000000.bedpe -rw-r--r-- 1 bakezq staff 147B Jan 16 09:14 bedpe_chr9_4000000_6000000.bedpe.bgz -rw-r--r-- 1 bakezq staff 160B Jan 16 09:14 bedpe_chr9_4000000_6000000.bedpe.bgz.px2 -rw-r--r-- 1 bakezq staff 31K Dec 6 04:11 bigwig_chr9_4000000_6000000.bw -rw-r--r-- 1 bakezq staff 23M Jan 15 22:08 cool_chr1_89000000_90400000_for_cmp_1.mcool -rw-r--r-- 1 bakezq staff 23M Jan 15 22:08 cool_chr1_89000000_90400000_for_cmp_2.mcool -rw-r--r--@ 1 bakezq staff 26M Dec 6 23:26 cool_chr9_4000000_6000000.mcool -rw-r--r-- 1 bakezq staff 13M Jan 16 06:07 dothic_chr9_4000000_6000000.hic -rw-r--r--@ 1 bakezq staff 9.3M Dec 29 19:24 down100.ctcf.pkl -rw-r--r-- 1 bakezq staff 535K Dec 6 04:11 gtf_chr9_4000000_6000000.gtf -rw-r--r-- 1 bakezq staff 28K Jan 16 09:21 gtf_chr9_4000000_6000000.gtf.bgz -rw-r--r-- 1 bakezq staff 398B Jan 16 09:21 gtf_chr9_4000000_6000000.gtf.bgz.tbi -rw-r--r-- 1 bakezq staff 32K Dec 6 04:11 hg19_ideogram.txt -rw-r--r-- 1 bakezq staff 1.9K Dec 6 04:11 human.hg19.genome -rw-r--r-- 1 bakezq staff 2.1K Dec 6 23:26 make_test_dataset.py -rw-r--r-- 1 bakezq staff 777B Dec 6 04:11 pairs_chr9_4000000_6000000.pairs -rw-r--r-- 1 bakezq staff 298B Jan 16 09:14 pairs_chr9_4000000_6000000.pairs.bgz -rw-r--r-- 1 bakezq staff 258B Jan 16 09:14 pairs_chr9_4000000_6000000.pairs.bgz.px2 -rw-r--r-- 1 bakezq staff 592B Dec 6 04:11 peak_chr9_4000000_6000000.bedpe -rw-r--r-- 1 bakezq staff 206B Jan 16 09:22 peak_chr9_4000000_6000000.bedpe.bgz -rw-r--r-- 1 bakezq staff 222B Jan 16 09:22 peak_chr9_4000000_6000000.bedpe.bgz.px2 -rw-r--r-- 1 bakezq staff 765K Dec 6 04:11 snp_chr9_4000000_6000000.snp -rw-r--r-- 1 bakezq staff 299K Jan 16 09:25 snp_chr9_4000000_6000000.snp.bgz -rw-r--r-- 1 bakezq staff 1.1K Jan 16 09:25 snp_chr9_4000000_6000000.snp.bgz.tbi -rw-r--r-- 1 bakezq staff 432B Dec 6 04:11 tad_chr9_4000000_6000000.bed -rw-r--r-- 1 bakezq staff 153B Jan 16 09:25 tad_chr9_4000000_6000000.bed.bgz -rw-r--r-- 1 bakezq staff 133B Jan 16 09:25 tad_chr9_4000000_6000000.bed.bgz.tbi
# Here we define const values for reference files easily later
DATA_DIR = "tests/test_data"
TEST_RANGE = "chr9:4000000-6000000"
RANGE_MARK = "chr9_4000000_6000000"
In CoolBox ploting system, "Track" is the basic element. If you have used genome browser like UCSC Genome Browser or WashU EpiGenome Browser, you must know what it is.
Basically, "Track" is a image what is related to a piece of continuous region on the reference genome. The most common track is the bigWig track, If you have read some papers about epigenomics you must have seen some figures like this:
bigwig_path = f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw"
frame = XAxis() + BigWig(bigwig_path) # input a file path
frame.plot(TEST_RANGE) # input a genome range
Actually, bigWig is just one kind of track, there are other kinds of tracks in CoolBox, for display other kind of genomic data like long range genome interaction from ChIA-PET and genome contact matrix from Hi-C.
Now, CoolBox support the following track types:
Track Type | Relevant file format | Description |
---|---|---|
Track | None |
Base class for all tracks. |
XAxis | None |
X axis of genome. |
Spacer | None |
For add vertical space between two tracks. |
BAM | .bam | BAM track for visualize the coverage or alignment. |
GTF | .gtf | Track of GTF file, for visualize gene annotation. |
HistBase | None |
Base class for all hist-like tracks. |
BigWig | .bigwig | Track of bigWig file. |
BedGraph | .bedgraph | Track of bedgraph file. |
BAMCov | .bam | BAM Coverage track for visualize reads coverage. |
SNP | .tsv, .vcf | Track for show SNPs Manhattan plot. Input file is a tab-split file, contain SNP's chrom, position, pvalue information. |
Virtual4C | .cool, .mcool, .hic | Virtual 4C track, using Hi-C data to mimic 4C. |
DiScore | .cool, .mcool, .hic | Directionality index of Hi-C matrix for detecting TAD. |
InsuScore | .cool, .mcool, .hic | Insulation score of Hi-C matrix for inferring TAD borders. |
BedBase | None |
Base class for all bed-like(1d intervals) tracks. |
BED | .bed | Track of Bed file, for visualization genome annotation,like refSeq genes and chromatin states or TAD intervals. |
TADCoverage | .bed | Track Coverage for showing TAD(topologically associated domains) upon a HicMat track. |
ArcsBase | None |
Nase class for all bedpe-like(2d contacts/regions) tracks. |
Arcs | .pairs, .bedpe | Show the chromosome interactions get from ChIA-PET or Hi-C loop data. |
BEDPE | .bedpe | Same to Arcs, specific to BEDPE file |
Pairs | .pairs | Same to Arcs, specific to Pairs file |
HiCPeaksCoverage | .bedpe, .pairs | HiCPeaks Coverage track for displaying peaks upon a HicMat track. |
HiCMatBase | None |
Base class for all matrix-like(2d ndarray) tracks. |
HiCMat | .cool, .mcool, .hic | Show the chromosome contact matrix from Hi-C data. |
Cool | .cool, .mcool | Same to HiCMat, specific to cooler's .cool or .mcool format. |
DotHiC | .hic | Same to HiCMat, specific to juicer .hic file format. |
HiCDiff | .cool, .mcool, .hic | Show the difference between two contact matrix. |
Selfish | .cool, .mcool, .hic | Show the difference computed by Selfish algorithm between two contact matrix. |
Visualize RefSeq with CoolBox:
frame = XAxis() + BED(f"{DATA_DIR}/bed_{RANGE_MARK}.bed") + TrackHeight(8)
frame.plot("chr9:4700000-4900000")
[WARNING:plot.py:56 - plot_genes()] *WARNING* Color set to 'bed_rgb', but bed file does not have the rgb field. The color has been set to #1f78b4
GTF track is also for visualize gene annotations:
frame = XAxis() + GTF(f"{DATA_DIR}/gtf_{RANGE_MARK}.gtf") + TrackHeight(5)
frame.plot(TEST_RANGE)
multi-cool(.mcool)
for multiple resolution Hi-C matrix¶Cooler file support multi-resolution interaction matrix storage (normally file name ends with .mcool
), this feature allow us take appropriate resolution matrix data depending on the corresponding genome region size, it let program respond fast when plot the hic matrix.
The multi-resolution cooler file is suggested, you can use cooler zoomify
command to create multi-resolution cooler file from a single resolution cooler file.
frame = XAxis() + HiCMat(f"{DATA_DIR}/cool_{RANGE_MARK}.mcool")
frame.plot(TEST_RANGE)
Default Hi-C Track will be plot in triangular
style, it also can be matrix
or window
style.
Just change the specify the style
parameter, when create Cool instance, like this:
frame = XAxis() + \
HiCMat(f"{DATA_DIR}/cool_{RANGE_MARK}.mcool", style='matrix', color_bar='horizontal')
frame.plot(TEST_RANGE)
frame = XAxis() + HiCMat(f"{DATA_DIR}/cool_{RANGE_MARK}.mcool", style='window', depth_ratio=0.3)
frame.plot("chr9:4500000-5500000")
Cool
show balanced matrix as default, if you want to show the unbalanced matrix, you can set as:
frame = XAxis() + HiCMat(f"{DATA_DIR}/cool_{RANGE_MARK}.mcool", style='window', depth_ratio=0.3, balance=False) + MinValue(1)
frame.plot("chr9:4500000-5500000")
Another kind of important technology like ChIA-PET or HiChIP, it can produce many long-range genome-wide chromatin interactions.
And, some times, Hi-C contact matrix is too informative to understand, we only need some most important interactions from it, We use some tools like HICCUPS call the most significant interactions, or "Peaks" from contact matrix.
In either case, Arcs Track can be used to visulize the data.
Arcs track accept .pairs
or .bedpe
format:
# BEDPE
frame = XAxis() + Arcs(f"{DATA_DIR}/bedpe_{RANGE_MARK}.bedpe", line_width=2)
frame.plot(TEST_RANGE)
# Pairs
frame = XAxis() + Arcs(f"{DATA_DIR}/pairs_{RANGE_MARK}.pairs", line_width=1.5)
frame.plot("chr9:4500000-5000000")
In CoolBox you can compose tracks with "+" operator, as shown above, compose XAxis track and a bigwig track to a frame object:
frame = XAxis() + BigWig("data/K562_RNASeq.bigWig")
Frame is a higher level object, denote a set of relevant tracks. We can use a long "+" expression compose a complex Frame.
cool1 = Cool(f"{DATA_DIR}/cool_{RANGE_MARK}.mcool")
frame = XAxis() + \
cool1 + Title("Hi-C(.cool)") + \
Spacer(0.5) + \
Virtual4C(cool1, "chr9:4986000-4986000") + Title("Virtual4C") + \
Spacer(0.5) + \
BAMCov(f"{DATA_DIR}/bam_{RANGE_MARK}.bam") + Title("BAM Coverage") +\
Spacer(0.5) + \
Arcs(f"{DATA_DIR}/bedpe_{RANGE_MARK}.bedpe") + Inverted() + Title("Arcs(BEDPE)") + \
Spacer(0.1) + \
Arcs(f"{DATA_DIR}/pairs_{RANGE_MARK}.pairs") + Inverted() + Title("Arcs(Pairs)") + \
GTF(f"{DATA_DIR}/gtf_{RANGE_MARK}.gtf", length_ratio_thresh=0.005) + TrackHeight(6) + Title("GTF Annotation") + \
Spacer(0.1) + \
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") + Title("BigWig")
frame.plot(TEST_RANGE)
Maybe you have noticed that, in the complex expression above, there some element witch added with
Tracks is not a Track, for example, the TrackHeight
, Title
and Title
.
These elements is Feature
, it is represent the features of the Track.
For example, we set the color and track height feature of a bigWig track.
frame = XAxis() + \
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") + \
Color("#ce00ce") + TrackHeight(8)
frame.plot(TEST_RANGE)
And we can adjust the min value and max value of the track:
frame = XAxis() + \
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") + \
Color("#ce00ce") + TrackHeight(5) + \
MinValue(0) + MaxValue(50)
frame.plot(TEST_RANGE)
By the way, there are one useful trick, you can use Feature with "with statement
", like:
with Color("#fd9c6b"):
frame1 = XAxis() +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw")
with Color("#66ccff"):
frame2 = BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw")
frame = frame1 + frame2
frame.plot(TEST_RANGE)
As shown above, any tracks created inside the "with statement
" will have the specified feature.
use this trick, we can simplify the complex expression:
Some times we need to draw some graphics above the original figure, for example,
the vertical lines and highlight regions. CoolBox has another kinds of element, the Coverage
.
We can add coverage with track, after added to track, coverage will plot upper the track when track is ploted.
locus = [("chr9", 4500000), ("chr9", 5000000)]
frame = XAxis() + BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") + Vlines(locus, line_width=2)
frame.plot(TEST_RANGE)
Like the Feature
if you want a set of tracks with same coverge, you can use the "with statement
":
with Vlines(locus, line_width=2):
frame = BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw")
frame = XAxis() + frame + XAxis()
frame.plot(TEST_RANGE)
Or, you can also use *
operator do this:
frame = BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw")
frame = frame * Vlines(locus, line_width=2)
frame = XAxis() + frame + XAxis()
frame.plot(TEST_RANGE)
regions= ["chr9:4600000-5000000", "chr9:5750000-5950000"]
highlights = HighLights(regions, color="green", alpha=0.05)
with highlights, Color("#aa5cff"):
frame = BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") +\
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw")
frame = XAxis() + frame + XAxis()
frame.plot(TEST_RANGE)
coolbox.api.Browser
¶When you want to explore the data, you will change the genome region window very frequently.
Under these circumstances, when you want to do the operations like "move right", "move left", "zoom in", "zoom out",
if you use above Frame.plot
API to plot the figure, you must change parameters and run again.
It is troublesome and boring.
In order to solve this problem, CoolBox impletmented a simple GUI with ipywidgets.
You can crate a Browser
instance with a composed frame, and call .show()
method to show the browser.
cool1 = Cool(f"{DATA_DIR}/cool_{RANGE_MARK}.mcool")
frame = XAxis() + \
cool1 + Title("Hi-C(.cool)") + \
Spacer(0.5) + \
Virtual4C(cool1, "chr9:4986000-4986000") + Title("Virtual4C") + \
Spacer(0.5) + \
BAM(f"{DATA_DIR}/bam_{RANGE_MARK}.bam") + Title("BAM Coverage") +\
Spacer(0.5) + \
Arcs(f"{DATA_DIR}/bedpe_{RANGE_MARK}.bedpe") + Inverted() + Title("Arcs(BEDPE)") + \
Spacer(0.1) + \
Arcs(f"{DATA_DIR}/pairs_{RANGE_MARK}.pairs") + Inverted() + Title("Arcs(Pairs)") + \
GTF(f"{DATA_DIR}/gtf_{RANGE_MARK}.gtf", length_ratio_thresh=0.005) + TrackHeight(6) + Title("GTF Annotation") + \
Spacer(0.1) + \
BigWig(f"{DATA_DIR}/bigwig_{RANGE_MARK}.bw") + Title("BigWig")
bsr = Browser(frame)
bsr.goto(TEST_RANGE)
Note: browser is valid only when the jupyter kernel is active
bsr.show()
VBox(children=(VBox(children=(HBox(children=(Dropdown(index=8, options=('chr1', 'chr2', 'chr3', 'chr4', 'chr5'…
.fetch_data
API¶In CoolBox, data and figure is bound together with a single Python object. So, you can fetch precise data of what you see in the figure.
Call the .fetch_data
method of Browser
or Frame
, will return an collection.OrderDict
which store many pandas.Dataframe
object correspond to each tracks in the browser or frame , and the data is only about the current genome region.
bsr.tracks
OrderedDict([('XAxis.21', <coolbox.core.track.pseudo.XAxis at 0x7fc1191fbaf0>), ('Cool.6', <coolbox.core.track.hicmat.cool.Cool at 0x7fc0ffa16790>), ('Spacer.6', <coolbox.core.track.pseudo.Spacer at 0x7fc0ffa16eb0>), ('Virtual4C.2', <coolbox.core.track.hist.hicfeature.Virtual4C at 0x7fc0ffa160a0>), ('Spacer.7', <coolbox.core.track.pseudo.Spacer at 0x7fc0ffa163d0>), ('BAM.1', <coolbox.core.track.bam.BAM at 0x7fc0ffa164c0>), ('Spacer.8', <coolbox.core.track.pseudo.Spacer at 0x7fc0ffa16e50>), ('BEDPE.3', <coolbox.core.track.arcs.bedpe.BEDPE at 0x7fc0ffa162b0>), ('Spacer.9', <coolbox.core.track.pseudo.Spacer at 0x7fc0ffa16070>), ('Pairs.3', <coolbox.core.track.arcs.pairs.Pairs at 0x7fc0ffa166a0>), ('GTF.3', <coolbox.core.track.gtf.GTF at 0x7fc0ffa16e80>), ('Spacer.10', <coolbox.core.track.pseudo.Spacer at 0x7fc0ffa16130>), ('BigWig.20', <coolbox.core.track.hist.bigwig.BigWig at 0x7fc0ffa160d0>)])
current_data = bsr.fetch_data()
list(current_data.keys())
['XAxis.21', 'Cool.6', 'Spacer.6', 'Virtual4C.2', 'Spacer.7', 'BAM.1', 'Spacer.8', 'BEDPE.3', 'Spacer.9', 'Pairs.3', 'GTF.3', 'Spacer.10', 'BigWig.20']
Data of each track related to the current genome range are stored in this dict:
current_cool = current_data['Cool.6']
print(type(current_cool))
current_cool.shape
<class 'numpy.ndarray'>
(401, 401)
current_data['GTF.3'].head(5)
seqname | source | feature | start | end | score | strand | frame | attribute | gene_name | |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr9 | protein_coding | gene | 3824127 | 4348392 | . | - | . | gene_id "ENSG00000107249"; gene_name "GLIS3"; ... | GLIS3 |
1 | chr9 | protein_coding | transcript | 3824127 | 4152183 | . | - | . | gene_id "ENSG00000107249"; transcript_id "ENST... | GLIS3 |
2 | chr9 | protein_coding | transcript | 3827698 | 4299916 | . | - | . | gene_id "ENSG00000107249"; transcript_id "ENST... | GLIS3 |
3 | chr9 | retained_intron | transcript | 3855437 | 4118017 | . | - | . | gene_id "ENSG00000107249"; transcript_id "ENST... | GLIS3 |
4 | chr9 | processed_transcript | transcript | 3932360 | 4081361 | . | - | . | gene_id "ENSG00000107249"; transcript_id "ENST... | GLIS3 |
We can do some statics analyze on it. For example count the distribution of interaction count, etc...
DATA_DIR = f"tests/test_data"
test_interval = "chr9:4000000-6000000"
test_itv = test_interval.replace(':', '_').replace('-', '_')
cool1 = Cool(f"{DATA_DIR}/cool_{test_itv}.mcool", cmap="JuiceBoxLike", style='window', color_bar='vertical')
with TrackHeight(2):
frame = XAxis() + \
cool1 + Title("Hi-C(.cool)") + \
TADCoverage(f"{DATA_DIR}/tad_{test_itv}.bed", border_only=True, alpha=1) + Title("HIC with TADs") + \
Spacer(0.1) + \
BED(f"{DATA_DIR}/tad_{test_itv}.bed", border_only=True, alpha=1) + Title("TADs") + \
DiScore(cool1, window_size=30) + Feature(title="Directionality index") + \
InsuScore(cool1, window_size=30) + Title("Insulation score") + \
Virtual4C(cool1, "chr9:4986000-4986000") + Title("Virtual4C") + \
BAMCov(f"{DATA_DIR}/bam_{test_itv}.bam") + Title("BAM Coverage") +\
Spacer(0.1) + \
Arcs(f"{DATA_DIR}/bedpe_{test_itv}.bedpe", line_width=1.5) + Title("Arcs(BEDPE)") + \
Arcs(f"{DATA_DIR}/pairs_{test_itv}.pairs", line_width=1.5) + Inverted() + Title("Arcs(Pairs)") + \
GTF(f"{DATA_DIR}/gtf_{test_itv}.gtf", length_ratio_thresh=0.005) + TrackHeight(6) + Title("GTF Annotation") + \
Spacer(0.1) + \
BigWig(f"{DATA_DIR}/bigwig_{test_itv}.bw") + Title("BigWig") + \
BedGraph(f"{DATA_DIR}/bedgraph_{test_itv}.bg") + Title("BedGraph") + \
Spacer(0.1) + \
BED(f"{DATA_DIR}/bed_{test_itv}.bed") + Feature(height=10, title="BED Annotation")
frame.properties['width'] = 45
frame.goto(test_interval)
frame.show()
[WARNING:plot.py:56 - plot_genes()] *WARNING* Color set to 'bed_rgb', but bed file does not have the rgb field. The color has been set to #1f78b4 [WARNING:plot.py:56 - plot_genes()] *WARNING* Color set to 'bed_rgb', but bed file does not have the rgb field. The color has been set to #1f78b4