alignparse.ccs.Summaries
¶Tests this class and makes sure it works with and without reports and np
tags giving number of passes.
This Jupyter notebook is designed to be run with nbval
for the testing.
First, import Python modules:
import contextlib
import os
import tempfile
import warnings
import Bio.SeqIO
import pandas as pd
import alignparse.ccs
Hide warnings that clutter output:
warnings.simplefilter('ignore')
Data frame giving the ccs
report file and the CCS FASTQ file for each run:
run_names = ['recA_lib-1', 'recA_lib-2']
ccs_dir = '../notebooks/input_files'
ccs_df = pd.DataFrame(
{'name': run_names,
'report': [f"{ccs_dir}/{name}_report.txt" for name in run_names],
'fastq': [f"{ccs_dir}/{name}_ccs.fastq" for name in run_names]
})
Create an alignparse.ccs.Summaries
object:
summaries = alignparse.ccs.Summaries(ccs_df)
for summary in summaries.summaries:
print(summary.name)
recA_lib-1 recA_lib-2
Confirm ZMW stats exist:
summaries.has_zmw_stats()
True
Get and plot the ZMW stats:
summaries.zmw_stats()
name | status | number | fraction | |
---|---|---|---|---|
0 | recA_lib-1 | Success -- CCS generated | 139 | 0.837349 |
1 | recA_lib-1 | Failed -- Lacking full passes | 19 | 0.114458 |
2 | recA_lib-1 | Failed -- Draft generation error | 3 | 0.018072 |
3 | recA_lib-1 | Failed -- Min coverage violation | 1 | 0.006024 |
4 | recA_lib-1 | Failed -- CCS below minimum RQ | 2 | 0.012048 |
5 | recA_lib-1 | Failed -- Other reason | 2 | 0.012048 |
6 | recA_lib-2 | Success -- CCS generated | 124 | 0.794872 |
7 | recA_lib-2 | Failed -- Lacking full passes | 22 | 0.141026 |
8 | recA_lib-2 | Failed -- Draft generation error | 4 | 0.025641 |
9 | recA_lib-2 | Failed -- Min coverage violation | 2 | 0.012821 |
10 | recA_lib-2 | Failed -- CCS below minimum RQ | 2 | 0.012821 |
11 | recA_lib-2 | Failed -- Other reason | 2 | 0.012821 |
Plot these stats:
p = summaries.plot_zmw_stats()
_ = p.draw()
Now do the same with a Summaries
object with no reports defined:
summaries_no_zmw = alignparse.ccs.Summaries(ccs_df, report_col=None, ncpus=1)
No ZMW stats defined:
summaries_no_zmw.has_zmw_stats()
False
So trying to get stats raises an error:
try:
summaries_no_zmw.plot_zmw_stats()
except ValueError:
print('cannot plot ZMW stats')
cannot plot ZMW stats
Now get information on CCS statistics:
for stat in ['passes', 'length', 'accuracy']:
if summaries.has_stat(stat):
print(summaries.ccs_stats(stat).head(n=5))
print()
else:
print(f"no {stat} stat\n")
name passes 0 recA_lib-1 28 1 recA_lib-1 29 2 recA_lib-1 22 3 recA_lib-1 16 4 recA_lib-1 20 name length 0 recA_lib-1 1325 1 recA_lib-1 1340 2 recA_lib-1 1339 3 recA_lib-1 985 4 recA_lib-1 1196 name accuracy 0 recA_lib-1 0.999998 1 recA_lib-1 0.999976 2 recA_lib-1 0.999686 3 recA_lib-1 0.999986 4 recA_lib-1 0.999592
Plot these stats:
for stat in ['length', 'passes', 'accuracy']:
try:
p = summaries.plot_ccs_stats(stat)
_ = p.draw()
except ValueError:
print(f"Cannot plot {stat}")
Now do the same for a Summaries
defined using FASTQ files without the np
tag giving the number of passes:
with contextlib.ExitStack() as stack:
tempfiles = [stack.enter_context(tempfile.NamedTemporaryFile('wt', suffix='.fastq'))
for _ in range(len(ccs_df))]
ccs_df = ccs_df.assign(fastq_nopasstag=[f.name for f in tempfiles])
for fout, tup in zip(tempfiles, ccs_df.itertuples()):
seqs = []
for iseq, seq in enumerate(Bio.SeqIO.parse(tup.fastq, 'fastq')):
if iseq == 0: # drop np tag from just first
seq.description = seq.description.split()[0]
seqs.append(seq)
Bio.SeqIO.write(seqs, fout, 'fastq')
fout.flush()
summaries_nopasstag = alignparse.ccs.Summaries(ccs_df, fastq_col='fastq_nopasstag')
Now print and plot information without np
tag:
for stat in ['passes', 'length', 'accuracy']:
if summaries_nopasstag.has_stat(stat):
print(summaries_nopasstag.ccs_stats(stat).head(n=5))
print()
else:
print(f"no {stat} stat\n")
no passes stat name length 0 recA_lib-1 1325 1 recA_lib-1 1340 2 recA_lib-1 1339 3 recA_lib-1 985 4 recA_lib-1 1196 name accuracy 0 recA_lib-1 0.999998 1 recA_lib-1 0.999976 2 recA_lib-1 0.999686 3 recA_lib-1 0.999986 4 recA_lib-1 0.999592
for stat in ['length', 'passes', 'accuracy']:
try:
p = summaries_nopasstag.plot_ccs_stats(stat)
_ = p.draw()
except ValueError:
print(f"Cannot plot {stat}")
Cannot plot passes