Targets.align_and_parse
¶This notebook makes sure that running Targets.align_and_parse
returns the same output as running Targets.align
and Targets.parse_alignment
separately.
Import Python modules:
import os
import warnings
import pandas as pd
import alignparse.ccs
import alignparse.minimap2
import alignparse.targets
Suppress warnings that clutter output:
warnings.simplefilter("ignore")
Directory for output:
outdir_base = "_temp_check_Targets_align_and_parse"
outdir_separate = os.path.join(outdir_base, "separate/")
outdir_together = os.path.join(outdir_base, "together/")
os.makedirs(outdir_separate, exist_ok=True)
os.makedirs(outdir_together, exist_ok=True)
Get Target:
recA_targetfile = "../notebooks/input_files/recA_amplicon.gb"
recA_parse_specs_file = "../notebooks/input_files/recA_feature_parse_specs.yaml"
targets = alignparse.targets.Targets(
seqsfile=recA_targetfile, feature_parse_specs=recA_parse_specs_file
)
Get the PacBio runs:
run_names = ["recA_lib-1", "recA_lib-2"]
ccs_dir = "input_files"
pacbio_runs = pd.DataFrame(
{
"name": run_names,
"report": [f"../notebooks/{ccs_dir}/{name}_report.txt" for name in run_names],
"fastq": [f"../notebooks/{ccs_dir}/{name}_ccs.fastq" for name in run_names],
}
)
pacbio_runs
name | report | fastq | |
---|---|---|---|
0 | recA_lib-1 | ../notebooks/input_files/recA_lib-1_report.txt | ../notebooks/input_files/recA_lib-1_ccs.fastq |
1 | recA_lib-2 | ../notebooks/input_files/recA_lib-2_report.txt | ../notebooks/input_files/recA_lib-2_ccs.fastq |
Create a Mapper
:
mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)
First, add the names of the desired alignment files to our data frame:
pacbio_runs = pacbio_runs.assign(
alignments=lambda x: outdir_separate + x["name"] + "_alignments.sam"
)
pacbio_runs
name | report | fastq | alignments | |
---|---|---|---|---|
0 | recA_lib-1 | ../notebooks/input_files/recA_lib-1_report.txt | ../notebooks/input_files/recA_lib-1_ccs.fastq | _temp_check_Targets_align_and_parse/separate/r... |
1 | recA_lib-2 | ../notebooks/input_files/recA_lib-2_report.txt | ../notebooks/input_files/recA_lib-2_ccs.fastq | _temp_check_Targets_align_and_parse/separate/r... |
Now use the mapper to actually align the FASTQ queries to the target:
for tup in pacbio_runs.itertuples(index=False):
targets.align(queryfile=tup.fastq, alignmentfile=tup.alignments, mapper=mapper)
Parse the alignments:
readstats = []
aligned = {targetname: [] for targetname in targets.target_names}
filtered = {targetname: [] for targetname in targets.target_names}
for run in pacbio_runs.itertuples():
run_readstats, run_aligned, run_filtered = targets.parse_alignment(run.alignments)
# when concatenating add the run name to keep track of runs for results
readstats.append(run_readstats.assign(name=run.name))
for targetname in targets.target_names:
aligned[targetname].append(run_aligned[targetname].assign(name=run.name))
filtered[targetname].append(run_filtered[targetname].assign(name=run.name))
# now concatenate the data frames for each run
readstats = pd.concat(readstats, ignore_index=True, sort=False)
for targetname in targets.target_names:
aligned[targetname] = pd.concat(aligned[targetname], ignore_index=True, sort=False)
filtered[targetname] = pd.concat(
filtered[targetname], ignore_index=True, sort=False
)
First lets look at the read stats:
readstats
category | count | name | |
---|---|---|---|
0 | filtered RecA_PacBio_amplicon | 16 | recA_lib-1 |
1 | aligned RecA_PacBio_amplicon | 123 | recA_lib-1 |
2 | unmapped | 0 | recA_lib-1 |
3 | filtered RecA_PacBio_amplicon | 12 | recA_lib-2 |
4 | aligned RecA_PacBio_amplicon | 112 | recA_lib-2 |
5 | unmapped | 0 | recA_lib-2 |
Now look at the information on the filtered reads. This is a bigger data frame, so we just look at the first few lines for the first target (of which there is only one anyway):
filtered[targets.target_names[0]].head()
query_name | filter_reason | name | |
---|---|---|---|
0 | m54228_180801_171631/4194459/ccs | spacer mutation_nt_count | recA_lib-1 |
1 | m54228_180801_171631/4325806/ccs | barcode mutation_nt_count | recA_lib-1 |
2 | m54228_180801_171631/4391313/ccs | termini3 mutation_nt_count | recA_lib-1 |
3 | m54228_180801_171631/4391375/ccs | gene clip3 | recA_lib-1 |
4 | m54228_180801_171631/4391467/ccs | gene clip3 | recA_lib-1 |
Finally, we can look at the information for the validly aligned (not filtered) reads. First just look at the first few entries in the data frame for the first target:
aligned[targets.target_names[0]].head()
query_name | query_clip5 | query_clip3 | gene_mutations | gene_accuracy | barcode_sequence | barcode_accuracy | variant_tag5_sequence | variant_tag3_sequence | name | |
---|---|---|---|---|---|---|---|---|---|---|
0 | m54228_180801_171631/4391577/ccs | 1 | 0 | C100A T102A G658C C659T del840to840 | 0.999455 | AAGATACACTCGAAATCT | 1.0 | G | C | recA_lib-1 |
1 | m54228_180801_171631/4915465/ccs | 0 | 0 | C142G G144T T329A A738G A946T C947A | 1.000000 | AAATATCATCGCGGCCAG | 1.0 | T | T | recA_lib-1 |
2 | m54228_180801_171631/4981392/ccs | 0 | 0 | C142G G144T T329A A738G A946T C947A | 1.000000 | AAATATCATCGCGGCCAG | 1.0 | G | C | recA_lib-1 |
3 | m54228_180801_171631/6029553/ccs | 0 | 0 | T83A G84A A106T T107A G108A ins693G G862T C863... | 0.999940 | CTAATAGTAGTTTTCCAG | 1.0 | G | C | recA_lib-1 |
4 | m54228_180801_171631/6488565/ccs | 0 | 0 | A254C G255T A466T T467G C468T C940G G942A | 0.999967 | TATTTATACCCATGAGTG | 1.0 | A | T | recA_lib-1 |
Use Targets.align_and_parse
:
readstats2, aligned2, filtered2 = targets.align_and_parse(
df=pacbio_runs,
mapper=mapper,
outdir=outdir_together,
name_col="name",
queryfile_col="fastq",
overwrite=True,
ncpus=2,
)
Make sure the data frames are the same:
pd.testing.assert_frame_equal(readstats, readstats2, check_like=True)
pd.testing.assert_frame_equal(
aligned[targets.target_names[0]], aligned2[targets.target_names[0]], check_like=True
)
pd.testing.assert_frame_equal(
filtered[targets.target_names[0]],
filtered2[targets.target_names[0]],
check_like=True,
)