Test `Targets.align_and_parse`¶

This notebook makes sure that running Targets.align_and_parse returns the same output as running Targets.align and Targets.parse_alignment separately.

Setup¶

Import Python modules:

In [1]:

import os
import warnings

import pandas as pd

import alignparse.ccs
import alignparse.minimap2
import alignparse.targets

Suppress warnings that clutter output:

In [2]:

warnings.simplefilter("ignore")

Directory for output:

In [3]:

outdir_base = "_temp_check_Targets_align_and_parse"
outdir_separate = os.path.join(outdir_base, "separate/")
outdir_together = os.path.join(outdir_base, "together/")
os.makedirs(outdir_separate, exist_ok=True)
os.makedirs(outdir_together, exist_ok=True)

Get Target:

In [4]:

recA_targetfile = "../notebooks/input_files/recA_amplicon.gb"
recA_parse_specs_file = "../notebooks/input_files/recA_feature_parse_specs.yaml"
targets = alignparse.targets.Targets(
    seqsfile=recA_targetfile, feature_parse_specs=recA_parse_specs_file
)

Get the PacBio runs:

In [5]:

run_names = ["recA_lib-1", "recA_lib-2"]
ccs_dir = "input_files"

pacbio_runs = pd.DataFrame(
    {
        "name": run_names,
        "report": [f"../notebooks/{ccs_dir}/{name}_report.txt" for name in run_names],
        "fastq": [f"../notebooks/{ccs_dir}/{name}_ccs.fastq" for name in run_names],
    }
)

pacbio_runs

Out[5]:

	name	report	fastq
0	recA_lib-1	../notebooks/input_files/recA_lib-1_report.txt	../notebooks/input_files/recA_lib-1_ccs.fastq
1	recA_lib-2	../notebooks/input_files/recA_lib-2_report.txt	../notebooks/input_files/recA_lib-2_ccs.fastq

Create a Mapper:

In [6]:

mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)

Align and parse separately¶

First, add the names of the desired alignment files to our data frame:

In [7]:

pacbio_runs = pacbio_runs.assign(
    alignments=lambda x: outdir_separate + x["name"] + "_alignments.sam"
)

pacbio_runs

Out[7]:

	name	report	fastq	alignments
0	recA_lib-1	../notebooks/input_files/recA_lib-1_report.txt	../notebooks/input_files/recA_lib-1_ccs.fastq	_temp_check_Targets_align_and_parse/separate/r...
1	recA_lib-2	../notebooks/input_files/recA_lib-2_report.txt	../notebooks/input_files/recA_lib-2_ccs.fastq	_temp_check_Targets_align_and_parse/separate/r...

Now use the mapper to actually align the FASTQ queries to the target:

In [8]:

for tup in pacbio_runs.itertuples(index=False):
    targets.align(queryfile=tup.fastq, alignmentfile=tup.alignments, mapper=mapper)

Parse the alignments:

In [9]:

readstats = []
aligned = {targetname: [] for targetname in targets.target_names}
filtered = {targetname: [] for targetname in targets.target_names}

for run in pacbio_runs.itertuples():
    run_readstats, run_aligned, run_filtered = targets.parse_alignment(run.alignments)

    # when concatenating add the run name to keep track of runs for results
    readstats.append(run_readstats.assign(name=run.name))
    for targetname in targets.target_names:
        aligned[targetname].append(run_aligned[targetname].assign(name=run.name))
        filtered[targetname].append(run_filtered[targetname].assign(name=run.name))

# now concatenate the data frames for each run
readstats = pd.concat(readstats, ignore_index=True, sort=False)
for targetname in targets.target_names:
    aligned[targetname] = pd.concat(aligned[targetname], ignore_index=True, sort=False)
    filtered[targetname] = pd.concat(
        filtered[targetname], ignore_index=True, sort=False
    )

First lets look at the read stats:

In [10]:

readstats

Out[10]:

	category	count	name
0	filtered RecA_PacBio_amplicon	16	recA_lib-1
1	aligned RecA_PacBio_amplicon	123	recA_lib-1
2	unmapped	0	recA_lib-1
3	filtered RecA_PacBio_amplicon	12	recA_lib-2
4	aligned RecA_PacBio_amplicon	112	recA_lib-2
5	unmapped	0	recA_lib-2

Now look at the information on the filtered reads. This is a bigger data frame, so we just look at the first few lines for the first target (of which there is only one anyway):

In [11]:

filtered[targets.target_names[0]].head()

Out[11]:

	query_name	filter_reason	name
0	m54228_180801_171631/4194459/ccs	spacer mutation_nt_count	recA_lib-1
1	m54228_180801_171631/4325806/ccs	barcode mutation_nt_count	recA_lib-1
2	m54228_180801_171631/4391313/ccs	termini3 mutation_nt_count	recA_lib-1
3	m54228_180801_171631/4391375/ccs	gene clip3	recA_lib-1
4	m54228_180801_171631/4391467/ccs	gene clip3	recA_lib-1

Finally, we can look at the information for the validly aligned (not filtered) reads. First just look at the first few entries in the data frame for the first target:

In [12]:

aligned[targets.target_names[0]].head()

Out[12]:

	query_name	query_clip5	gene_mutations	gene_accuracy	barcode_sequence	barcode_accuracy	variant_tag5_sequence	variant_tag3_sequence	name
0	m54228_180801_171631/4391577/ccs	1	C100A T102A G658C C659T del840to840	0.999455	AAGATACACTCGAAATCT	1.0	G	C	recA_lib-1
1	m54228_180801_171631/4915465/ccs	0	C142G G144T T329A A738G A946T C947A	1.000000	AAATATCATCGCGGCCAG	1.0	T	T	recA_lib-1
2	m54228_180801_171631/4981392/ccs	0	C142G G144T T329A A738G A946T C947A	1.000000	AAATATCATCGCGGCCAG	1.0	G	C	recA_lib-1
3	m54228_180801_171631/6029553/ccs	0	T83A G84A A106T T107A G108A ins693G G862T C863...	0.999940	CTAATAGTAGTTTTCCAG	1.0	G	C	recA_lib-1
4	m54228_180801_171631/6488565/ccs	0	A254C G255T A466T T467G C468T C940G G942A	0.999967	TATTTATACCCATGAGTG	1.0	A	T	recA_lib-1

Align and parse together¶

Use Targets.align_and_parse:

In [13]:

readstats2, aligned2, filtered2 = targets.align_and_parse(
    df=pacbio_runs,
    mapper=mapper,
    outdir=outdir_together,
    name_col="name",
    queryfile_col="fastq",
    overwrite=True,
    ncpus=2,
)

Make sure the data frames are the same:

In [14]:

pd.testing.assert_frame_equal(readstats, readstats2, check_like=True)

In [15]:

pd.testing.assert_frame_equal(
    aligned[targets.target_names[0]], aligned2[targets.target_names[0]], check_like=True
)

In [16]:

pd.testing.assert_frame_equal(
    filtered[targets.target_names[0]],
    filtered2[targets.target_names[0]],
    check_like=True,
)

Table of Contents

Test Targets.align_and_parse¶

Setup¶

Align and parse separately¶

Align and parse together¶

Test `Targets.align_and_parse`¶