Example of how to define alignment targets.
Import necessary Python modules:
import tempfile
import Bio.SeqIO
from alignparse.targets import Target, Targets
First we show how to define a single Target
, using an example an amplicon for PacBio sequencing of RecA for a deep mutational scanning experiment.
The amplicon is defined in Genbank Flat File format.
First, let's just look at that file:
recA_targetfile = "../notebooks/input_files/recA_amplicon.gb"
with open(recA_targetfile) as f:
print(f.read())
LOCUS RecA_PacBio_amplicon 1342 bp ds-DNA linear 06-AUG-2018 DEFINITION PacBio amplicon for deep mutational scanning of E. coli RecA. ACCESSION None VERSION SOURCE Danny Lawrence ORGANISM . COMMENT PacBio amplicon for RecA libraries. COMMENT There are single nucleotide tags in the 5' and 3' termini to measure strand exchange. FEATURES Location/Qualifiers termini5 1..147 /label="termini 5' of gene" gene 148..1206 /label="RecA gene" spacer 1207..1285 /label="spacer between gene & barcode" barcode 1286..1303 /label="18 nucleotide barcode" termini3 1304..1342 /label="termini 3' of barcode" variant_tag5 33..33 /label="5' variant tag" variant_tag3 1311..1311 /label="3' variant tag" ORIGIN 1 gcacggcgtc acactttgct atgccatagc atRtttatcc ataagattag cggatcctac 61 ctgacgcttt ttatcgcaac tctctactgt ttctccataa cagaacatat tgactatccg 121 gtattacccg gcatgacagg agtaaaaATG GCTATCGACG AAAACAAACA GAAAGCGTTG 181 GCGGCAGCAC TGGGCCAGAT TGAGAAACAA TTTGGTAAAG GCTCCATCAT GCGCCTGGGT 241 GAAGACCGTT CCATGGATGT GGAAACCATC TCTACCGGTT CGCTTTCACT GGATATCGCG 301 CTTGGGGCAG GTGGTCTGCC GATGGGCCGT ATCGTCGAAA TCTACGGACC GGAATCTTCC 361 GGTAAAACCA CGCTGACGCT GCAGGTGATC GCCGCAGCGC AGCGTGAAGG TAAAACCTGT 421 GCGTTTATCG ATGCTGAACA CGCGCTGGAC CCAATCTACG CACGTAAACT GGGCGTCGAT 481 ATCGACAACC TGCTGTGCTC CCAGCCGGAC ACCGGCGAGC AGGCACTGGA AATCTGTGAC 541 GCCCTGGCGC GTTCTGGCGC AGTAGACGTT ATCGTCGTTG ACTCCGTGGC GGCACTGACG 601 CCGAAAGCGG AAATCGAAGG CGAAATCGGC GACTCTCATA TGGGCCTTGC GGCACGTATG 661 ATGAGCCAGG CGATGCGTAA GCTGGCGGGT AACCTGAAGC AGTCCAACAC GCTGCTGATC 721 TTCATCAACC AGATCCGTAT GAAAATTGGT GTGATGTTCG GCAACCCGGA AACCACTACC 781 GGTGGTAACG CGCTGAAATT CTACGCCTCT GTTCGTCTCG ACATCCGTCG TATCGGCGCG 841 GTGAAAGAGG GCGAAAACGT GGTGGGTAGC GAAACCCGCG TGAAAGTGGT GAAGAACAAA 901 ATCGCTGCGC CGTTTAAACA GGCTGAATTC CAGATCCTCT ACGGCGAAGG TATCAACTTC 961 TACGGCGAAC TGGTTGACCT GGGCGTAAAA GAGAAGCTGA TCGAGAAAGC AGGCGCGTGG 1021 TACAGCTACA AAGGTGAGAA GATCGGTCAG GGTAAAGCGA ATGCGACTGC CTGGCTGAAA 1081 GATAACCCGG AAACCGCGAA AGAGATCGAG AAGAAAGTAC GTGAGTTGCT GCTGAGCAAC 1141 CCGAACTCAA CGCCGGATTT CTCTGTAGAT GATAGCGAAG GCGTAGCAGA AACTAACGAA 1201 GATTTTTAAt cgtcttgttt gatacacaag ggtcgcatct gcggcccttt tgctttttta 1261 agttgtaagg atatgccatt ctagannnnn nnnnnnnnnn nnnagatcgg Yagagcgtcg 1321 tgtagggaaa gagtgtggta cc //
Read the Genbank file for the target into a BioPython SeqRecord:
recA_seqrecord = Bio.SeqIO.read(recA_targetfile, format="genbank")
Create a Target
object:
target = Target(
seqrecord=recA_seqrecord,
req_features=[
"termini5",
"gene",
"spacer",
"barcode",
"termini3",
"variant_tag5",
"variant_tag3",
],
)
We can get specific features out of the Target
object.
Below we look for two features and print the one that exists:
for feature in ["non-existent", "termini5"]:
if target.has_feature(feature):
print(f"Here is feature {feature}:\n{target.get_feature(feature)}")
else:
print(f"target lacks feature {feature}\n")
target lacks feature non-existent Here is feature termini5: Feature(name=termini5, seq=GCACGGCGTCACACTTTGCTATGCCATAGCATRTTTATCCATAAGATTAGCGGATCCTACCTGACGCTTTTTATCGCAACTCTCTACTGTTTCTCCATAACAGAACATATTGACTATCCGGTATTACCCGGCATGACAGGAGTAAAA, start=0, end=147)
We can get a dna_features_viewer GraphicRecord
with Target.image
, and then plot this using its .plot
method, which returns a matplotlib.Axes
instance.
(Note that Target.image
also provides options for setting colors, labels):
image = target.image()
ax, _ = image.plot()
_ = ax.set_title(target.name)
We can read multiple targets into a Targets
object. Below is an example with the two LASV GP constructs - wildtype and codon optimized - from the Josiah strain.
First, let's look at these files:
target_file_names = ["LASV_Josiah_WT", "LASV_Josiah_OPT"]
targetfiles = [
f"../notebooks/input_files/{target_file_name}.gb"
for target_file_name in target_file_names
]
for targetfile in targetfiles:
with open(targetfile) as f:
print(f.read())
LOCUS LASV_Josiah_WT 1730 bp ds-DNA linear 14-JUN-2019 DEFINITION . ACCESSION VERSION SOURCE Kate Crawford ORGANISM . COMMENT PacBio amplicon for LASV Josiah WT sequence FEATURES Location/Qualifiers T2A 85..147 /label="T2A" WPRE 1639..1730 /label="WPRE" ZsGreen 15..84 /label="ZsGreen" termini3 1639..1730 /label="3'Termini" index 9..14 /label="index" leader5 1..8 /label="5' leader" termini5 1..147 /label="5'Termini" variant_tag5 34..34 /variant_1=T /variant_2=C /label="5'VariantTag" variant_tag3 1702..1702 /variant_1=G /variant_2=A /label="3'VariantTag" spacer 1624..1638 /label="3'Spacer" gene 148..1623 /label="LASV_Josiah_WT" ORIGIN 1 GACTGATANN NNNNcagcga cgccaagaac cagYagtggc acctgaccga gcacgccatc 61 gcctccggcT CCGCCTTGCC CGCTGGATCC GGCGAGGGCA GAGGAAGTCT GCTAACATGC 121 GGTGACGTCG AGGAGAATCC TGGCCCAATG GGACAAATAG TGACATTCTT CCAGGAAGTG 181 CCTCATGTAA TAGAAGAGGT GATGAACATT GTTCTCATTG CACTGTCTGT ACTAGCAGTG 241 CTGAAAGGTC TGTACAATTT TGCAACGTGT GGCCTTGTTG GTTTGGTCAC TTTCCTCCTG 301 TTGTGTGGTA GGTCTTGCAC AACCAGTCTT TATAAAGGGG TTTATGAGCT TCAGACTCTG 361 GAACTAAACA TGGAGACACT CAATATGACC ATGCCTCTCT CCTGCACAAA GAACAACAGT 421 CATCATTATA TAATGGTGGG CAATGAGACA GGACTAGAAC TGACCTTGAC CAACACGAGC 481 ATTATTAATC ACAAATTTTG CAATCTGTCT GATGCCCACA AAAAGAACCT CTATGACCAC 541 GCTCTTATGA GCATAATCTC AACTTtccac ttgtccatcc ccaacTTCAA TCAGTATGAG 601 GCAATGAGCT GCGATTTTAA TGGGGGAAAG ATTAGTGTGC AGTACAACCT GAGTCACAGC 661 TATGCTGGGG ATGCAGCCAA CCATTGTGGT ACTGTTGCAA ATGGTGTGTT ACAGACTTTT 721 ATGAGGATGG CTTGGGGTGG GAGCTACATT GCTCTTGACT CAGGCCGTGG CAACTGGGAC 781 TGTATTATGA CTAGTTATCA ATATCTGATA ATCCAAAATA CAACCTGGGA AGATCACTGC 841 CAATTCTCGA GACCATCTCC CATCGGTTAT CTCGGGCTCC TCTCACAAAG GACTAGAGAT 901 ATTTATATTA GTAGAAGATT GCTAGGCACA TTCACATGGA CACTGTCAGA TTCTGAAGGT 961 AAAGACACAC CAGGGGGATA TTGTCTGACC AGGTGGATGC TAATTGAGGC TGAACTAAAA 1021 TGCTTCGGGA ACACAGCTGT GGCAAAATGT AATGAGAAGC ATGATGAgga attttgtgac 1081 atgctgaggc TGTTTGACTT CAACAAACAA GCCATTCAAA GGTTGAAAGC TGAAGCACAA 1141 ATGAGCATTC AGTTGATCAA CAAAGCAGTA AATGCTTTGA TAAATGACCA ACTTATAATG 1201 AAGAACCATC TACGGGACAT CATGGGAATT CCATACTGTA ATTACAGCAA GTATTGGTAC 1261 CTCAACCACA CAACTACTGG GAGAACATCA CTGCCCAAAT GTTGGCTTGT ATCAAATGGT 1321 TCATACTTGA ACGAGACCCA CTTTTCTGAT GATATTGAAC AACAAGCTGA CAATATGATC 1381 ACTGAGATGT TACAGAAGGA GTATATGGAG AGGCAGGGGA AGACACCATT GGGTCTAGTT 1441 GACCTCTTTG TGTTCAGCAC AAGTTTCTAT CTTATTAGCA TCTTCCTTCA CCTAGTCAAA 1501 ATACCAACTC ATAGGCATAT TGTAGGCAAG TCGTGTCCCA AACCTCACAG ATTGAATCAT 1561 ATGGGCATTT GTTCCTGTGG ACTCTACAAA CAGCCTGGTG TGCCTGTGAA ATGGAAGAGA 1621 TGAGCTAGCT AAACGCGTTG ATCCtaatca acctctggat tacaaaattt gtgaaagatt 1681 gactggtatt cttaactatg tRgctccttt tacgctatgt ggatacgctg // LOCUS LASV_Josiah_OPT 1730 bp ds-DNA linear 14-JUN-2019 DEFINITION . ACCESSION VERSION SOURCE Kate Crawford ORGANISM . COMMENT PacBio amplicon for LASV Josiah OPT sequence FEATURES Location/Qualifiers T2A 85..147 /label="T2A" WPRE 1639..1730 /label="WPRE" ZsGreen 15..84 /label="ZsGreen" termini3 1639..1730 /label="3'Termini" index 9..14 /label="index" leader5 1..8 /label="5' leader" termini5 1..147 /label="5'Termini" variant_tag5 34..34 /variant_1=T /variant_2=C /label="5'VariantTag" variant_tag3 1702..1702 /variant_1=G /variant_2=A /label="3'VariantTag" spacer 1624..1638 /label="3'Spacer" gene 148..1623 /label="LASV_Josiah_OPT" ORIGIN 1 GACTGATANN NNNNcagcga cgccaagaac cagYagtggc acctgaccga gcacgccatc 61 gcctccggcT CCGCCTTGCC CGCTGGATCC GGCGAGGGCA GAGGAAGTCT GCTAACATGC 121 GGTGACGTCG AGGAGAATCC TGGCCCAATG GGCCAGATCG TGACCTTCTT CCAAGAAGTG 181 CCTCATGTGA TTGAGGAGGT GATGAATATC GTGCTGATCG CTTTAAGCGT GCTGGCCGTT 241 CTTAAGGGCC TCTATAACTT CGCCACTTGT GGTTTAGTCG GACTGGTGAC ATTTCTGCTG 301 CTGTGTGGCA GATCTTGTAC CACATCTTTA TACAAGGGCG TGTACGAGCT GCAGACTTTA 361 GAACTGAACA TGGAGACTTT AAACATGACC ATGCCTTTAA GCTGTACCAA GAACAATAGC 421 CACCACTACA TCATGGTGGG CAACGAGACC GGTTTAGAAC TGACACTCAC CAACACCAGC 481 ATTATCAACC ATAAGTTCTG CAACCTCTCC GACGCTCACA AGAAGAATTT ATACGACCAC 541 GCTTTAATGA GCATCATCTC CACCTTCCAT CTCTCCATTC CTAATttcaa ccagtacgag 601 gccatgAGCT GCGACTTTAA CGGCGGCAAG ATCTCCGTGC AGTACAATTT ATCCCATAGC 661 TACGCCGGCG ATGCCGCCAA TCACTGCGGA ACCGTGGCCA ACGGCGTGCT GCAGACATTC 721 ATGAGGATGG CTTGGGGCGG CTCCTATATC GCTTTAGACT CCGGCAGAGG AAACTGGGAC 781 TGTATCATGA CCAGCTACCA ATATTTAATC ATTCAGAACA CCACATGGGA GGACCACTGC 841 CAATTCTCTC GTCCCTCTCC TATCGGCTAT CTGGGACTGC TGTCCCAGAG GACCAGAGAC 901 ATCTACATCT CTCGTAGGCT GCTGGGCACA TTCACTTGGA CTTTAAGCGA CAGCGAAGGC 961 AAAGATACTC CCGGTGGCTA CTGTTTAACA AGATGGATGC TGATCGAGGC CGAGCTCAAG 1021 TGCTTCGGAA ATACCGCCGT GGCCAAATGC AACGAGAAAC ACGACGAGGA GTTCTGCGAC 1081 ATGCTGAGGC TCTTCGACTT CAacaagcaa gccattcaga ggcTGAAGGC CGAAGCCCAG 1141 ATGTCCATCC AGCTGATTAA TAAGGCCGTG AATGCCCTCA TTAACGACCA GCTGATCATG 1201 AAGAACCATT TAAGGGACAT CATGGGCATC CCTTATTGCA ACTACAGCAA ATACTGGTAT 1261 TTAAATCATA CCACCACCGG TCGTACATCC TTACCTAAGT GCTGGCTGGT CAGCAATGGC 1321 TCCTATTTAA ACGAGACACA CTTCTCCGAC GACATCGAGC AGCAAGCCGA CAACATGATC 1381 ACCGAAATGC TCCAGAAGGA GTACATGGAG AGGCAAGGTA AGACTCCTCT GGGTTTAGTG 1441 GATTTATTCG TCTTCAGCAC CTCCTTCTAT TTAATCTCCA TCTTTCTTCA TCTGGTGAAG 1501 ATTCCTACCC ACAGACACAT TGTGGGCAAG AGCTGTCCTA AGCCTCATAG ACTGAACCAC 1561 ATGGGCATCT GTAGCTGCGG TTTATATAAA CAGCCCGGTG TTCCCGTTAA GTGGAAGAGG 1621 TGAGCTAGCT AAACGCGTTG ATCCtaatca acctctggat tacaaaattt gtgaaagatt 1681 gactggtatt cttaactatg tRgctccttt tacgctatgt ggatacgctg //
Read the sequences into a Targets
:
lasv_parse_specs_file = "../notebooks/input_files/lasv_feature_parse_specs.yaml"
lasv_targets = Targets(
seqsfile=targetfiles,
feature_parse_specs=lasv_parse_specs_file,
allow_extra_features=True,
allow_clipped_muts_seqs=True,
)
Iterate through lasv_targets to identify features present:
for lasv_target in lasv_targets.targets:
print(f"\ntarget = {lasv_target.name}")
for feature in ["non-existent", "termini5"]:
if lasv_target.has_feature(feature):
print(f"Here is feature {feature}:\n{target.get_feature(feature)}")
else:
print(f"target lacks feature {feature}\n")
target = LASV_Josiah_WT target lacks feature non-existent Here is feature termini5: Feature(name=termini5, seq=GCACGGCGTCACACTTTGCTATGCCATAGCATRTTTATCCATAAGATTAGCGGATCCTACCTGACGCTTTTTATCGCAACTCTCTACTGTTTCTCCATAACAGAACATATTGACTATCCGGTATTACCCGGCATGACAGGAGTAAAA, start=0, end=147) target = LASV_Josiah_OPT target lacks feature non-existent Here is feature termini5: Feature(name=termini5, seq=GCACGGCGTCACACTTTGCTATGCCATAGCATRTTTATCCATAAGATTAGCGGATCCTACCTGACGCTTTTTATCGCAACTCTCTACTGTTTCTCCATAACAGAACATATTGACTATCCGGTATTACCCGGCATGACAGGAGTAAAA, start=0, end=147)
We can plot the Targets
:
_ = lasv_targets.plot(ax_width=10)
We can write them to a file for alignment:
with tempfile.NamedTemporaryFile(mode="w") as f:
lasv_targets.write_fasta(f.name)
f.flush()
fasta_text = open(f.name).read()
print(fasta_text)
>LASV_Josiah_WT GACTGATANNNNNNCAGCGACGCCAAGAACCAGYAGTGGCACCTGACCGAGCACGCCATCGCCTCCGGCTCCGCCTTGCCCGCTGGATCCGGCGAGGGCAGAGGAAGTCTGCTAACATGCGGTGACGTCGAGGAGAATCCTGGCCCAATGGGACAAATAGTGACATTCTTCCAGGAAGTGCCTCATGTAATAGAAGAGGTGATGAACATTGTTCTCATTGCACTGTCTGTACTAGCAGTGCTGAAAGGTCTGTACAATTTTGCAACGTGTGGCCTTGTTGGTTTGGTCACTTTCCTCCTGTTGTGTGGTAGGTCTTGCACAACCAGTCTTTATAAAGGGGTTTATGAGCTTCAGACTCTGGAACTAAACATGGAGACACTCAATATGACCATGCCTCTCTCCTGCACAAAGAACAACAGTCATCATTATATAATGGTGGGCAATGAGACAGGACTAGAACTGACCTTGACCAACACGAGCATTATTAATCACAAATTTTGCAATCTGTCTGATGCCCACAAAAAGAACCTCTATGACCACGCTCTTATGAGCATAATCTCAACTTTCCACTTGTCCATCCCCAACTTCAATCAGTATGAGGCAATGAGCTGCGATTTTAATGGGGGAAAGATTAGTGTGCAGTACAACCTGAGTCACAGCTATGCTGGGGATGCAGCCAACCATTGTGGTACTGTTGCAAATGGTGTGTTACAGACTTTTATGAGGATGGCTTGGGGTGGGAGCTACATTGCTCTTGACTCAGGCCGTGGCAACTGGGACTGTATTATGACTAGTTATCAATATCTGATAATCCAAAATACAACCTGGGAAGATCACTGCCAATTCTCGAGACCATCTCCCATCGGTTATCTCGGGCTCCTCTCACAAAGGACTAGAGATATTTATATTAGTAGAAGATTGCTAGGCACATTCACATGGACACTGTCAGATTCTGAAGGTAAAGACACACCAGGGGGATATTGTCTGACCAGGTGGATGCTAATTGAGGCTGAACTAAAATGCTTCGGGAACACAGCTGTGGCAAAATGTAATGAGAAGCATGATGAGGAATTTTGTGACATGCTGAGGCTGTTTGACTTCAACAAACAAGCCATTCAAAGGTTGAAAGCTGAAGCACAAATGAGCATTCAGTTGATCAACAAAGCAGTAAATGCTTTGATAAATGACCAACTTATAATGAAGAACCATCTACGGGACATCATGGGAATTCCATACTGTAATTACAGCAAGTATTGGTACCTCAACCACACAACTACTGGGAGAACATCACTGCCCAAATGTTGGCTTGTATCAAATGGTTCATACTTGAACGAGACCCACTTTTCTGATGATATTGAACAACAAGCTGACAATATGATCACTGAGATGTTACAGAAGGAGTATATGGAGAGGCAGGGGAAGACACCATTGGGTCTAGTTGACCTCTTTGTGTTCAGCACAAGTTTCTATCTTATTAGCATCTTCCTTCACCTAGTCAAAATACCAACTCATAGGCATATTGTAGGCAAGTCGTGTCCCAAACCTCACAGATTGAATCATATGGGCATTTGTTCCTGTGGACTCTACAAACAGCCTGGTGTGCCTGTGAAATGGAAGAGATGAGCTAGCTAAACGCGTTGATCCTAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTGGTATTCTTAACTATGTRGCTCCTTTTACGCTATGTGGATACGCTG >LASV_Josiah_OPT GACTGATANNNNNNCAGCGACGCCAAGAACCAGYAGTGGCACCTGACCGAGCACGCCATCGCCTCCGGCTCCGCCTTGCCCGCTGGATCCGGCGAGGGCAGAGGAAGTCTGCTAACATGCGGTGACGTCGAGGAGAATCCTGGCCCAATGGGCCAGATCGTGACCTTCTTCCAAGAAGTGCCTCATGTGATTGAGGAGGTGATGAATATCGTGCTGATCGCTTTAAGCGTGCTGGCCGTTCTTAAGGGCCTCTATAACTTCGCCACTTGTGGTTTAGTCGGACTGGTGACATTTCTGCTGCTGTGTGGCAGATCTTGTACCACATCTTTATACAAGGGCGTGTACGAGCTGCAGACTTTAGAACTGAACATGGAGACTTTAAACATGACCATGCCTTTAAGCTGTACCAAGAACAATAGCCACCACTACATCATGGTGGGCAACGAGACCGGTTTAGAACTGACACTCACCAACACCAGCATTATCAACCATAAGTTCTGCAACCTCTCCGACGCTCACAAGAAGAATTTATACGACCACGCTTTAATGAGCATCATCTCCACCTTCCATCTCTCCATTCCTAATTTCAACCAGTACGAGGCCATGAGCTGCGACTTTAACGGCGGCAAGATCTCCGTGCAGTACAATTTATCCCATAGCTACGCCGGCGATGCCGCCAATCACTGCGGAACCGTGGCCAACGGCGTGCTGCAGACATTCATGAGGATGGCTTGGGGCGGCTCCTATATCGCTTTAGACTCCGGCAGAGGAAACTGGGACTGTATCATGACCAGCTACCAATATTTAATCATTCAGAACACCACATGGGAGGACCACTGCCAATTCTCTCGTCCCTCTCCTATCGGCTATCTGGGACTGCTGTCCCAGAGGACCAGAGACATCTACATCTCTCGTAGGCTGCTGGGCACATTCACTTGGACTTTAAGCGACAGCGAAGGCAAAGATACTCCCGGTGGCTACTGTTTAACAAGATGGATGCTGATCGAGGCCGAGCTCAAGTGCTTCGGAAATACCGCCGTGGCCAAATGCAACGAGAAACACGACGAGGAGTTCTGCGACATGCTGAGGCTCTTCGACTTCAACAAGCAAGCCATTCAGAGGCTGAAGGCCGAAGCCCAGATGTCCATCCAGCTGATTAATAAGGCCGTGAATGCCCTCATTAACGACCAGCTGATCATGAAGAACCATTTAAGGGACATCATGGGCATCCCTTATTGCAACTACAGCAAATACTGGTATTTAAATCATACCACCACCGGTCGTACATCCTTACCTAAGTGCTGGCTGGTCAGCAATGGCTCCTATTTAAACGAGACACACTTCTCCGACGACATCGAGCAGCAAGCCGACAACATGATCACCGAAATGCTCCAGAAGGAGTACATGGAGAGGCAAGGTAAGACTCCTCTGGGTTTAGTGGATTTATTCGTCTTCAGCACCTCCTTCTATTTAATCTCCATCTTTCTTCATCTGGTGAAGATTCCTACCCACAGACACATTGTGGGCAAGAGCTGTCCTAAGCCTCATAGACTGAACCACATGGGCATCTGTAGCTGCGGTTTATATAAACAGCCCGGTGTTCCCGTTAAGTGGAAGAGGTGAGCTAGCTAAACGCGTTGATCCTAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTGGTATTCTTAACTATGTRGCTCCTTTTACGCTATGTGGATACGCTG