This Python Jupyter notebook analyzes mutational antigenic profiling of serum against virus carrying the A/Perth/2009 (H3N2) HA.
We first configure the analysis by importing packages and getting information on the samples that we will analyze.
Import modules / packages. In particular, we use:
import os
import math
import collections
import itertools
import warnings
import subprocess
import yaml
import pandas as pd
from IPython.display import display, HTML
import matplotlib.pyplot as plt
from plotnine import *
import dms_tools2
from dms_tools2.ipython_utils import showPDF
from dms_tools2.plot import COLOR_BLIND_PALETTE_GRAY as PALETTE
import dmslogo
Turn on interactive matplotlib plotting:
plt.ion() # turn on interactive plotting
Print versions of Bloom lab software:
print(f"Using dms_tools2 version {dms_tools2.__version__}")
print(f"Using dmslogo version {dmslogo.__version__}")
Using dms_tools2 version 2.4.16 Using dmslogo version 0.2.3
Set data frame display options:
pd.set_option('display.max_colwidth', 200)
Ignore warnings that can clutter notebook:
warnings.simplefilter('ignore')
Read the variables from the config file. These variables specify input / output paths and key parameters for analysis:
configfile = 'config.yaml'
with open(configfile) as f:
config = yaml.safe_load(f)
print(f"Read the following configuration from {configfile}:")
display(HTML(pd.Series(config)
.to_frame('value')
.to_html()
))
Read the following configuration from config.yaml:
value | |
---|---|
serum_info | data/serum_info.yaml |
sample_list | data/sample_list.csv |
refseq | data/Perth09_HA_reference.fa |
renumbering_scheme | data/H3renumbering_scheme.csv |
seq_data_source | SRA_accession |
fastq_dir | results/FASTQ_files |
fastq_dump | fastq-dump |
ascp | /app/aspera-connect/3.7.5/bin/ascp |
asperakey | /app/aspera-connect/3.7.5/etc/asperaweb_id_dsa.openssh |
alignspecs | 1,285,38,40 286,567,33,34 568,852,34,30 853,1137,34,31 1138,1422,36,29 1423,1701,39,44 |
R1trim | 200 |
R2trim | 165 |
ncpus | 16 |
use_existing | yes |
avg_type | median |
natseqs | data/human_H3N2_HA_2007-2018.fasta.gz |
neut_config | data/neut_assays/neut_config.yaml |
figure_config | data/figure_config.yaml |
map_on_struct_template | map_on_struct_template.ipynb |
pdb_id | 4o5n |
site_to_pdb | data/H3_site_to_PDB_4o5n.csv |
countsdir | results/codoncounts |
renumbcountsdir | results/renumbered_codoncounts |
selectiontabledir | results/selection_tables |
diffseldir | results/diffsel |
avgdiffseldir | results/avgdiffsel |
avgdiffsel_sigsites_dir | results/avgdiffsel/sigsites |
avgdiffsel_zoom_dir | results/avgdiffsel/zoomed_plots |
avgdiffsel_reps_dir | results/avgdiffsel/replicates |
avgdiffsel_full_dir | results/avgdiffsel/full_logo_plots |
natseqs_dir | results/natseqs |
neutresultsdir | results/neutralization_assays |
structsdir | results/structs |
figsdir | results/figures |
finalfigsdir | results/figures/final |
notebookdir | results/notebooks |
Read information on the sera that are being mapped.
For each serum sample below, we get:
with open(config['serum_info']) as f:
sera = (pd.DataFrame(yaml.safe_load(f))
.transpose()
.add_prefix('serum_')
.rename_axis('serum')
.reset_index()
)
assert len(sera) == len(sera['serum'].unique()), 'sera not unique'
print(f"Read the following sera information from {config['serum_info']}:")
display(HTML(sera.to_html(index=False)))
Read the following sera information from data/serum_info.yaml:
serum | serum_description | serum_group | serum_name | serum_species | serum_vaccination |
---|---|---|---|---|---|
mock | no-serum control | mock | no-serum | NaN | NaN |
plasmid | plasmid used as control to estimate sequencing error rate | plasmid | plasmid | NaN | NaN |
5A01 | site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley | antibody_region_B | antibody-5A01 | NaN | NaN |
3C04 | site B-targeting monoclonal antibody 3C04 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C04 | NaN | NaN |
3C06 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN |
4C01 | site B-targeting monoclonal antibody 4C01 from Seth Zost and Scott Hensley | antibody_region_B | antibody-4C01 | NaN | NaN |
4F03 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN |
1C04 | lower head-targeting monoclonal antibody 1C04 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-1C04 | NaN | NaN |
f9267neg | Lakdawala lab ferret, before infection | ferret | Pitt-1-preinf | ferret | pre |
f9267d23 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-1-postinf | ferret | post |
f9435neg | Lakdawala lab ferret, before infection | ferret | Pitt-2-preinf | ferret | pre |
f9435d23 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-2-postinf | ferret | post |
f9437neg | Lakdawala lab ferret, before infection | ferret | Pitt-3-preinf | ferret | pre |
f9437d23 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-3-postinf | ferret | post |
WHOCCPerth | ferret infected by Melbourne WHO CC with their Perth/2009 strain | ferret | WHO-Perth2009 | ferret | post |
WHOCCVic | ferret infected by Melbourne WHO CC with their Victoria/2011 strain | ferret | WHO-Victoria2011 | ferret | post |
VIDD1 | collected at Hutch in 2/2010 from person born in 1989 | VIDD_sera | 2010-age-21 | human | NaN |
VIDD2 | collected at Hutch in 1/2009 from person born in 1956 | VIDD_sera | 2009-age-53 | human | NaN |
VIDD3 | second sample collected at Hutch in 3/2009 from person born in 1956 | VIDD_sera | 2009-age-53-plus-2-months | human | NaN |
VIDD4 | collected at Hutch in 11/2008 from person born in 1945 | VIDD_sera | 2009-age-64 | human | NaN |
VIDD5 | collected at Hutch in 6/2009 from person born in 1944 | VIDD_sera | 2009-age-65 | human | NaN |
557v1 | collected before 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-prevacc | human | pre |
557v2 | collected after 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-vacc | human | post |
574v1 | collected before 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-prevacc | human | pre |
574v2 | collected after 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-vacc | human | post |
589v1 | collected before 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-prevacc | human | pre |
589v2 | collected after 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-vacc | human | post |
571v1 | collected before 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-prevacc | human | pre |
571v2 | collected after 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-vacc | human | post |
VIDD5andlow4F03 | collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency | serum_mAb_spike | 2009-age-65-with-low-4F03 | human | NaN |
VIDD5andmid4F03 | collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency | serum_mAb_spike | 2009-age-65-with-mid-4F03 | human | NaN |
VIDD5andhi4F03 | collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency | serum_mAb_spike | 2009-age-65-with-hi-4F03 | human | NaN |
Read information about all of the samples that we have deep sequenced.
For each sample, we have information on the serum to which it corresponds, the virus library, the date of sequencing, the serum dilution, the percent infectivity, and (depending on the value of seq_data_source in the config file) either the Sequence Read Archive (SRA) accession or the location of the R1 files on the Hutch server:
samples = pd.read_csv(config['sample_list'])
# don't need any R1 column if we are using SRA accession
if config['seq_data_source'] == 'SRA_accession':
samples = samples.drop(columns='R1', errors='ignore')
assert len(samples) == len(samples['sample'].unique()), 'non-unique samples'
print(f"Read the following samples from {config['sample_list']}:")
display(HTML(samples.to_html(index=False)))
Read the following samples from data/sample_list.csv:
sample | serum | library | date | serum_dilution | percent_infectivity | SRA_accession |
---|---|---|---|---|---|---|
L2-3C06 | 3C06 | lib2 | 2018-07-20 | 0.55 | 0.0100 | SRR7974483 |
L4-3C06 | 3C06 | lib1 | 2018-09-12 | 0.1 | 0.0688 | SRR7974503 |
L2-3C04 | 3C04 | lib2 | 2018-07-20 | 0.65 | 0.0331 | SRR7974490 |
L4-3C04 | 3C04 | lib1 | 2018-09-12 | 0.1 | 0.0777 | SRR7974499 |
L2-4C01 | 4C01 | lib2 | 2018-07-20 | 0.60 | 0.0185 | SRR7974491 |
L4-4C01 | 4C01 | lib1 | 2018-09-12 | 0.2 | 0.1190 | SRR7974488 |
L2-1C04 | 1C04 | lib2 | 2018-07-20 | 16.0 | 10.3190 | SRR7974492 |
L3-1C04 | 1C04 | lib3 | 2018-09-12 | 18.0 | 7.9040 | SRR7974497 |
Lib2mock-mAb | mock | lib2 | 2018-07-20 | NaN | 100.0000 | SRR7974494 |
Lib3mock-mAb | mock | lib3 | 2018-09-12 | NaN | 100.0000 | SRR7974507 |
Lib4mock-mAb | mock | lib1 | 2018-09-12 | NaN | 100.0000 | SRR7974486 |
WTplasmid-mAb-A | plasmid | lib2 | 2018-07-20 | NaN | NaN | SRR7974495 |
WTplasmid-mAb-B | plasmid | lib3 | 2018-09-12 | NaN | NaN | SRR7974502 |
WTplasmid-mAb-C | plasmid | lib1 | 2018-09-12 | NaN | NaN | SRR7974484 |
L4-589v1 | 589v1 | lib1 | 2018-11-14 | 0.0185 | 13.8200 | SRR8875142 |
L4-571v1 | 571v1 | lib1 | 2018-11-14 | 0.0185 | 19.6200 | SRR8875143 |
L4-571v2 | 571v2 | lib1 | 2018-11-14 | 0.0185 | 5.4800 | SRR8875144 |
L4-574v1 | 574v1 | lib1 | 2018-11-14 | 0.0185 | 9.4800 | SRR8875145 |
L4-WHOCCPerth | WHOCCPerth | lib1 | 2018-11-14 | 0.0185 | 2.3200 | SRR8875138 |
L4-589v2 | 589v2 | lib1 | 2018-11-14 | 0.000875 | 4.5800 | SRR8875139 |
L4-557v1 | 557v1 | lib1 | 2018-11-14 | 0.0075 | 6.9000 | SRR8875140 |
L4-f9267neg | f9267neg | lib1 | 2018-11-14 | 0.00075 | 100.0000 | SRR8875141 |
L4-f9267d23 | f9267d23 | lib1 | 2018-11-14 | 0.00075 | 4.3600 | SRR8875136 |
L4-f9435neg | f9435neg | lib1 | 2018-11-14 | 0.001 | 100.0000 | SRR8875137 |
L4-f9437neg | f9437neg | lib1 | 2018-11-14 | 0.00225 | 100.0000 | SRR8875168 |
L4-f9437d23 | f9437d23 | lib1 | 2018-11-14 | 0.00225 | 8.7700 | SRR8875169 |
Lib4mock-A | mock | lib1 | 2018-11-14 | NaN | 100.0000 | SRR8875170 |
WTplasmid-A | plasmid | lib1 | 2018-11-14 | NaN | NaN | SRR8875171 |
L5-f9267neg | f9267neg | lib2 | 2019-01-16 | 0.00075 | 100.0000 | SRR8875172 |
L5-f9435neg | f9435neg | lib2 | 2019-01-16 | 0.0025 | 100.0000 | SRR8875173 |
L5-f9437neg | f9437neg | lib2 | 2019-01-16 | 0.00625 | 100.0000 | SRR8875174 |
L5-VIDD1 | VIDD1 | lib2 | 2019-01-16 | 0.0045 | 5.9500 | SRR8875175 |
L5-WHOCCPerth | WHOCCPerth | lib2 | 2019-01-16 | 0.0075 | 1.8800 | SRR8875166 |
L5-f9267d23 | f9267d23 | lib2 | 2019-01-16 | 0.000175 | 5.3500 | SRR8875167 |
L5-WHOCCVic | WHOCCVic | lib2 | 2019-01-16 | 0.0025 | 3.9000 | SRR8875191 |
L5-VIDD5 | VIDD5 | lib2 | 2019-01-16 | 0.000875 | 5.2400 | SRR8875190 |
L5-VIDD4 | VIDD4 | lib2 | 2019-01-16 | 0.001675 | 0.8700 | SRR8875193 |
L5-VIDD2 | VIDD2 | lib2 | 2019-01-16 | 0.00375 | 2.0300 | SRR8875192 |
L5-VIDD3 | VIDD3 | lib2 | 2019-01-16 | 0.0045 | 1.7300 | SRR8875187 |
L5-f9435d23 | f9435d23 | lib2 | 2019-01-16 | 0.000625 | 1.8800 | SRR8875186 |
L5-f9437d23 | f9437d23 | lib2 | 2019-01-16 | 0.001375 | 1.7000 | SRR8875189 |
Lib5mock-A | mock | lib2 | 2019-01-16 | NaN | 100.0000 | SRR8875188 |
WTplasmid-B | plasmid | lib2 | 2019-01-16 | NaN | NaN | SRR8875195 |
L4-f9435d23 | f9435d23 | lib1 | 2019-01-16 | 0.0025 | 3.4500 | SRR8875194 |
L4-VIDD4 | VIDD4 | lib1 | 2019-01-16 | 0.004 | 10.7400 | SRR8875112 |
L4-VIDD3 | VIDD3 | lib1 | 2019-01-16 | 0.0075 | 5.9800 | SRR8875113 |
L4-WHOCCVic | WHOCCVic | lib1 | 2019-01-16 | 0.00625 | 1.2600 | SRR8875110 |
L4-VIDD1 | VIDD1 | lib1 | 2019-01-16 | 0.0175 | 4.0800 | SRR8875111 |
L4-VIDD2 | VIDD2 | lib1 | 2019-01-16 | 0.00875 | 3.7400 | SRR8875108 |
Lib4mock-B | mock | lib1 | 2019-01-16 | NaN | 100.0000 | SRR8875109 |
WTplasmid-C | plasmid | lib1 | 2019-01-16 | NaN | NaN | SRR8875106 |
L5-589v1 | 589v1 | lib2 | 2019-03-06 | 0.0175 | 20.3200 | SRR8875107 |
L5-557v2 | 557v2 | lib2 | 2019-03-06 | 0.0005 | 0.8100 | SRR8875114 |
L5-571v2 | 571v2 | lib2 | 2019-03-06 | 0.01 | 1.0900 | SRR8875115 |
L5-574v1 | 574v1 | lib2 | 2019-03-06 | 0.0125 | 7.5700 | SRR8875119 |
L5-574v2 | 574v2 | lib2 | 2019-03-06 | 0.000875 | 1.1100 | SRR8875118 |
L5-4F03-c1 | 4F03 | lib2 | 2019-03-06 | 0.3 | 22.9900 | SRR8875117 |
L5-4F03-c3 | 4F03 | lib2 | 2019-03-06 | 1.5 | 3.7100 | SRR8875116 |
L6-589v2 | 589v2 | lib3 | 2019-03-06 | 0.000375 | 4.0600 | SRR8875123 |
L6-557v1 | 557v1 | lib3 | 2019-03-06 | 0.005 | 1.5700 | SRR8875122 |
L6-557v2 | 557v2 | lib3 | 2019-03-06 | 0.0005 | 1.6400 | SRR8875121 |
L6-VIDD2 | VIDD2 | lib3 | 2019-03-06 | 0.00375 | 6.2200 | SRR8875120 |
L6-VIDD3 | VIDD3 | lib3 | 2019-03-06 | 0.00625 | 1.1200 | SRR8875125 |
L6-VIDD1 | VIDD1 | lib3 | 2019-03-06 | 0.0075 | 3.6800 | SRR8875124 |
L6-f9267d23 | f9267d23 | lib3 | 2019-03-06 | 0.000375 | 1.6900 | SRR8875132 |
L6-f9435d23 | f9435d23 | lib3 | 2019-03-06 | 0.00125 | 3.0400 | SRR8875133 |
L6-WHOCCPerth | WHOCCPerth | lib3 | 2019-03-06 | 0.0075 | 4.1000 | SRR8875134 |
L6-WHOCCVic | WHOCCVic | lib3 | 2019-03-06 | 0.0025 | 3.6200 | SRR8875135 |
Lib5mock-B | mock | lib2 | 2019-03-06 | NaN | 100.0000 | SRR8875128 |
Lib6mock-A | mock | lib3 | 2019-03-06 | NaN | 100.0000 | SRR8875129 |
WTplasmid-D | plasmid | lib2 | 2019-03-06 | NaN | NaN | SRR8875130 |
WTplasmid-E | plasmid | lib3 | 2019-03-06 | NaN | NaN | SRR8875131 |
L5-589v2 | 589v2 | lib2 | 2019-03-06 | 0.000375 | 4.6800 | SRR8875126 |
L5-557v1 | 557v1 | lib2 | 2019-03-06 | 0.005 | 2.3100 | SRR8875127 |
L5-571v1 | 571v1 | lib2 | 2019-03-06 | 0.0185 | 11.8600 | SRR8875149 |
L5-5A01 | 5A01 | lib2 | 2019-03-06 | 0.025 | 1.3600 | SRR8875148 |
L5-4F03-c2 | 4F03 | lib2 | 2019-03-06 | 0.7 | 4.7100 | SRR8875151 |
L5-VIDD5-low4F03 | VIDD5andlow4F03 | lib2 | 2019-03-06 | 0.000875+0.3 | 5.0000 | SRR8875150 |
L5-VIDD5-mid4F03 | VIDD5andmid4F03 | lib2 | 2019-03-06 | 0.000875+0.7 | 0.4600 | SRR8875153 |
L5-VIDD5-hi4F03 | VIDD5andhi4F03 | lib2 | 2019-03-06 | 0.000875+1.5 | 0.0400 | SRR8875152 |
L4-574v2 | 574v2 | lib1 | 2019-03-06 | 0.002 | 1.5400 | SRR8875155 |
L4-557v2 | 557v2 | lib1 | 2019-03-06 | 0.00125 | 1.5300 | SRR8875154 |
L4-5A01 | 5A01 | lib1 | 2019-03-06 | 0.07 | 4.3200 | SRR8875147 |
L4-4F03-c2 | 4F03 | lib1 | 2019-03-06 | 1.0 | 3.4700 | SRR8875146 |
L4-4F03-c3 | 4F03 | lib1 | 2019-03-06 | 2.0 | 0.7800 | SRR8875160 |
Lib4mock-C | mock | lib1 | 2019-03-06 | NaN | 100.0000 | SRR8875161 |
WTplasmid-F | plasmid | lib1 | 2019-03-06 | NaN | NaN | SRR8875158 |
L6-589v1 | 589v1 | lib3 | 2019-03-26 | 0.0175 | 18.4000 | SRR8875159 |
L6-571v1 | 571v1 | lib3 | 2019-03-26 | 0.0185 | 27.7400 | SRR8875164 |
L6-571v2 | 571v2 | lib3 | 2019-03-26 | 0.005 | 7.7900 | SRR8875165 |
L6-574v1 | 574v1 | lib3 | 2019-03-26 | 0.015 | 3.3000 | SRR8875162 |
L6-574v2 | 574v2 | lib3 | 2019-03-26 | 0.00075 | 1.9200 | SRR8875163 |
L6-f9267neg | f9267neg | lib3 | 2019-03-26 | 0.00075 | 100.0000 | SRR8875156 |
L6-f9435neg | f9435neg | lib3 | 2019-03-26 | 0.0025 | 100.0000 | SRR8875157 |
L6-f9437neg | f9437neg | lib3 | 2019-03-26 | 0.00625 | 100.0000 | SRR8875183 |
L6-5A01 | 5A01 | lib3 | 2019-03-26 | 0.025 | 1.0200 | SRR8875182 |
Lib6mock-B | mock | lib3 | 2019-03-26 | NaN | 100.0000 | SRR8875181 |
L4-4F03-c1 | 4F03 | lib1 | 2019-03-26 | 0.3 | 28.2400 | SRR8875180 |
L6-VIDD5 | VIDD5 | lib3 | 2019-03-26 | 0.002 | 5.5800 | SRR8875179 |
L6-VIDD4 | VIDD4 | lib3 | 2019-03-26 | 0.00375 | 0.2600 | SRR8875178 |
L6-f9437d23 | f9437d23 | lib3 | 2019-03-26 | 0.00175 | 5.7800 | SRR8875177 |
L6-4F03-c1 | 4F03 | lib3 | 2019-03-26 | 0.3 | 23.9100 | SRR8875176 |
L6-4F03-c3 | 4F03 | lib3 | 2019-03-26 | 1.4 | 0.8400 | SRR8875185 |
L4-VIDD5 | VIDD5 | lib1 | 2019-03-26 | 0.003375 | 1.7700 | SRR8875184 |
L4-VIDD5-low4F03 | VIDD5andlow4F03 | lib1 | 2019-03-26 | 0.003375+0.3 | 0.0300 | SRR8875196 |
L4-VIDD5-mid4F03 | VIDD5andmid4F03 | lib1 | 2019-03-26 | 0.003375+1 | 0.0037 | SRR8875197 |
L4-VIDD5-hi4F03 | VIDD5andhi4F03 | lib1 | 2019-03-26 | 0.003375+2 | 0.0016 | SRR8875198 |
L6-4F03-c2 | 4F03 | lib3 | 2019-03-26 | 0.75 | 7.1600 | SRR8875199 |
L6-VIDD5-low4F03 | VIDD5andlow4F03 | lib3 | 2019-03-26 | 0.002+0.3 | 0.2040 | SRR8875200 |
L6-VIDD5-mid4F03 | VIDD5andmid4F03 | lib3 | 2019-03-26 | 0.002+0.75 | 0.0077 | SRR8875201 |
L6-VIDD5-hi4F03 | VIDD5andhi4F03 | lib3 | 2019-03-26 | 0.002+1.4 | 0.0032 | SRR8875202 |
Lib4mock-D | mock | lib1 | 2019-03-26 | NaN | 100.0000 | SRR8875203 |
WTplasmid-G | plasmid | lib1 | 2019-03-26 | NaN | NaN | SRR8875204 |
WTplasmid-H | plasmid | lib3 | 2019-03-26 | NaN | NaN | SRR8875205 |
Check that the serum for all samples are in our set of sera:
unknown_sera = set(samples['serum']) - set(sera['serum'])
if unknown_sera:
raise ValueError(f"samples include unknown sera: {unknown_sera}")
else:
print('We have information for all sera used for the samples.')
We have information for all sera used for the samples.
The config file specifies whether we get the data from existing R1 files on the Hutch server, or download the data from the Sequence Read Archive (SRA) using dms_tools2.sra.fastqFromSRA:
if config['seq_data_source'] == 'SRA_accession':
# are we using Aspera for rapid downloads?
if config['ascp'] and config['asperakey']:
aspera = (config['ascp'], config['asperakey'])
else:
aspera = None
# do the downloads
print(f"Downloading FASTQ files to {config['fastq_dir']} (takes a while)...")
os.makedirs(config['fastq_dir'], exist_ok=True)
samples = samples.rename(columns={'sample': 'name',
'SRA_accession': 'run'})
dms_tools2.sra.fastqFromSRA(
samples=samples,
fastq_dump=config['fastq_dump'],
fastqdir=config['fastq_dir'],
aspera=aspera,
ncpus=config['ncpus'],
)
samples = samples.rename(columns={'name': 'sample',
'run': 'SRA_accession'})
print('Completed downloading files.')
elif config['seq_data_source'] != 'R1':
raise ValueError('invalid value of `seq_data_source`')
Downloading FASTQ files to results/FASTQ_files (takes a while)... Completed downloading files.
The samples were sequenced using barcoded subamplicon sequencing to obtain high accuracy. So we need to process these data to determine the counts of each codon mutation in each sample.
First, create the directory used for the results of this part of the analysis:
os.makedirs(config['countsdir'], exist_ok=True)
print(f"Results from counting mutations go to {config['countsdir']}")
Results from counting mutations go to results/codoncounts
dms2_batch_bcsubamp
¶We process the sequencing data by using dms2_batch_bcsubamp to generate "codon counts" files that give the counts of each codon mutation for each sample.
First, write the batch file used as input by this program:
bcsubamp_batchfile = os.path.join(config['countsdir'], 'batch.csv')
(samples
.rename(columns={'sample': 'name'})
[['name', 'R1']]
.to_csv(bcsubamp_batchfile)
)
print(f"Creating batch file {bcsubamp_batchfile}")
Creating batch file results/codoncounts/batch.csv
Now run the program:
cmds = ['dms2_batch_bcsubamp',
'--batchfile', bcsubamp_batchfile,
'--refseq', config['refseq'],
'--alignspecs'] + config['alignspecs'].split() + [
'--R1trim', str(config['R1trim']),
'--R2trim', str(config['R2trim']),
'--outdir', config['countsdir'],
'--fastqdir', config['fastq_dir'],
'--summaryprefix', 'summary',
'--ncpus', str(config['ncpus']),
'--use_existing', config['use_existing'],
]
print(f"Running dms2_batch_bcsubamp with this command:\n{' '.join(cmds)}")
subprocess.check_output(cmds)
print('Completed running dms2_batch_bcsubamp.')
Running dms2_batch_bcsubamp with this command: dms2_batch_bcsubamp --batchfile results/codoncounts/batch.csv --refseq data/Perth09_HA_reference.fa --alignspecs 1,285,38,40 286,567,33,34 568,852,34,30 853,1137,34,31 1138,1422,36,29 1423,1701,39,44 --R1trim 200 --R2trim 165 --outdir results/codoncounts --fastqdir results/FASTQ_files --summaryprefix summary --ncpus 16 --use_existing yes Completed running dms2_batch_bcsubamp.
Confirm that all the expected counts files exist:
assert all(os.path.isfile(f) for f in
config['countsdir'] + '/' + samples['sample'] + '_codoncounts.csv'
), 'missing counts files'
Running dms2_batch_bcsubamp created some summary plots. The prefix on those plots should be as follows:
countsplotprefix = os.path.join(config['countsdir'], 'summary_')
Total sequencing reads per sample:
showPDF(countsplotprefix + 'readstats.pdf')
Distribution of sequencing reads per barcode on subamplicons:
showPDF(countsplotprefix + 'readsperbc.pdf')
Number of barcoded subamplicons that align and have sufficient reads:
showPDF(countsplotprefix + 'bcstats.pdf')
Depth of valid barcoded subamplicons covering each site in the gene:
showPDF(countsplotprefix + 'depth.pdf')
The average mutation frequency for each sample, stratifying by codon mutation type:
showPDF(countsplotprefix + 'codonmuttypes.pdf')
Average mutation frequency per sample, stratifying by number of nucleotide changes per codon mutation:
showPDF(countsplotprefix + 'codonntchanges.pdf')
Per-codon mutation frequencies across all sites in gene for each sample:
showPDF(countsplotprefix + 'mutfreq.pdf')
Sometimes there is oxidative damage which manifests as an enrichment of G
->T
and C
->A
mutations among the single-nucleotide codon mutations.
Check for this by plotting frequencies of different single-nucleotide mutation types:
showPDF(countsplotprefix + 'singlentchanges.pdf')
The above alignments use sequential 1, 2, ... numbering of the codons. This is not the standard numbering scheme used for HA, so we use dms_tools2.utils.renumbSites to renumber to the standard HA numbering scheme:
dms_tools2.utils.renumberSites(
renumbfile=config['renumbering_scheme'],
infiles=list(config['countsdir'] + '/' + samples['sample'] +
'_codoncounts.csv'),
missing='drop',
outdir=config['renumbcountsdir'])
assert all(os.path.isfile(f) for f in
config['renumbcountsdir'] + '/' + samples['sample'] +
'_codoncounts.csv'
), 'missing renumbered counts files'
print(f"Renumbered codon counts are in {config['renumbcountsdir']}")
Renumbered codon counts are in results/renumbered_codoncounts
We will now determine the immune selection on each mutation by comparing its frequency in each serum-selected sample to an appropriate mock-selected control. Specifically, we will quantify the immune selection as the differential selection (diffsel), which is essentially the log enrichment of each mutation relative to wildtype in the immune-selected versus mock sample. See Doud et al (2017) for the paper introducing this metric, and see here for a more detailed description.
In order to quantify the immune selection, we need to compare each selected sample to the appropriate controls. Specifically, for each selection, we define three samples:
We call each combination of three such samples a "selection". For each selection, we also record:
Below we construct the data frame with this information on the selections:
selections = (
# get all immune selected (sel) samples
samples
.query('(serum != "mock") & (serum != "plasmid")')
.rename(columns={'sample': 'sel'})
# add mock sample for that date and library
.merge(samples
.query('serum == "mock"')
[['sample', 'library', 'date']]
.rename(columns={'sample': 'mock'})
)
# add plasmid sample as error control (err) for that date and library
.merge(samples
.query('serum == "plasmid"')
[['sample', 'library', 'date']]
.rename(columns={'sample': 'err'})
)
# add information about sera
.merge(sera, validate='many_to_one')
# add informative names for serum and samples
.assign(
serum_name_formatted=lambda x:
x['serum_species'].map(lambda s: '' if pd.isnull(s) or s == 'human'
else s + '-') + x['serum_name'],
name_formatted=lambda x:
x['library'] + ', ' + x['percent_infectivity'].apply(
dms_tools2.utils.sigFigStr, nsig=2) + '% infectivity',
name=lambda x:
x['library'] + '-' + x['percent_infectivity'].apply(
dms_tools2.utils.sigFigStr, nsig=2)
)
# drop unneeded columns
.drop(['R1', 'R2', 'SRA_accession'], axis='columns', errors='ignore')
# re-order columns a bit so key ones are displayed first
.set_index(['serum_name_formatted', 'name', 'sel', 'mock', 'err'])
.reset_index()
)
# make sure no duplicated serum / names
assert len(selections) == len(selections.groupby(['serum_name_formatted',
'name']))
print(f"Tabulated information for {len(selections)} selections:")
display(HTML(selections.to_html(index=False)))
Tabulated information for 92 selections:
serum_name_formatted | name | sel | mock | err | serum | library | date | serum_dilution | percent_infectivity | serum_description | serum_group | serum_name | serum_species | serum_vaccination | name_formatted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
antibody-3C06 | lib2-0.010 | L2-3C06 | Lib2mock-mAb | WTplasmid-mAb-A | 3C06 | lib2 | 2018-07-20 | 0.55 | 0.0100 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN | lib2, 0.010% infectivity |
antibody-3C06 | lib1-0.069 | L4-3C06 | Lib4mock-mAb | WTplasmid-mAb-C | 3C06 | lib1 | 2018-09-12 | 0.1 | 0.0688 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN | lib1, 0.069% infectivity |
antibody-3C04 | lib2-0.033 | L2-3C04 | Lib2mock-mAb | WTplasmid-mAb-A | 3C04 | lib2 | 2018-07-20 | 0.65 | 0.0331 | site B-targeting monoclonal antibody 3C04 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C04 | NaN | NaN | lib2, 0.033% infectivity |
antibody-3C04 | lib1-0.078 | L4-3C04 | Lib4mock-mAb | WTplasmid-mAb-C | 3C04 | lib1 | 2018-09-12 | 0.1 | 0.0777 | site B-targeting monoclonal antibody 3C04 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C04 | NaN | NaN | lib1, 0.078% infectivity |
antibody-4C01 | lib2-0.018 | L2-4C01 | Lib2mock-mAb | WTplasmid-mAb-A | 4C01 | lib2 | 2018-07-20 | 0.60 | 0.0185 | site B-targeting monoclonal antibody 4C01 from Seth Zost and Scott Hensley | antibody_region_B | antibody-4C01 | NaN | NaN | lib2, 0.018% infectivity |
antibody-4C01 | lib1-0.12 | L4-4C01 | Lib4mock-mAb | WTplasmid-mAb-C | 4C01 | lib1 | 2018-09-12 | 0.2 | 0.1190 | site B-targeting monoclonal antibody 4C01 from Seth Zost and Scott Hensley | antibody_region_B | antibody-4C01 | NaN | NaN | lib1, 0.12% infectivity |
antibody-1C04 | lib2-10 | L2-1C04 | Lib2mock-mAb | WTplasmid-mAb-A | 1C04 | lib2 | 2018-07-20 | 16.0 | 10.3190 | lower head-targeting monoclonal antibody 1C04 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-1C04 | NaN | NaN | lib2, 10% infectivity |
antibody-1C04 | lib3-7.9 | L3-1C04 | Lib3mock-mAb | WTplasmid-mAb-B | 1C04 | lib3 | 2018-09-12 | 18.0 | 7.9040 | lower head-targeting monoclonal antibody 1C04 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-1C04 | NaN | NaN | lib3, 7.9% infectivity |
2015-age-48-prevacc | lib1-14 | L4-589v1 | Lib4mock-A | WTplasmid-A | 589v1 | lib1 | 2018-11-14 | 0.0185 | 13.8200 | collected before 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-prevacc | human | pre | lib1, 14% infectivity |
2015-age-48-prevacc | lib2-20 | L5-589v1 | Lib5mock-B | WTplasmid-D | 589v1 | lib2 | 2019-03-06 | 0.0175 | 20.3200 | collected before 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-prevacc | human | pre | lib2, 20% infectivity |
2015-age-48-prevacc | lib3-18 | L6-589v1 | Lib6mock-B | WTplasmid-H | 589v1 | lib3 | 2019-03-26 | 0.0175 | 18.4000 | collected before 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-prevacc | human | pre | lib3, 18% infectivity |
2015-age-49-prevacc | lib1-20 | L4-571v1 | Lib4mock-A | WTplasmid-A | 571v1 | lib1 | 2018-11-14 | 0.0185 | 19.6200 | collected before 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-prevacc | human | pre | lib1, 20% infectivity |
2015-age-49-prevacc | lib2-12 | L5-571v1 | Lib5mock-B | WTplasmid-D | 571v1 | lib2 | 2019-03-06 | 0.0185 | 11.8600 | collected before 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-prevacc | human | pre | lib2, 12% infectivity |
2015-age-49-prevacc | lib3-28 | L6-571v1 | Lib6mock-B | WTplasmid-H | 571v1 | lib3 | 2019-03-26 | 0.0185 | 27.7400 | collected before 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-prevacc | human | pre | lib3, 28% infectivity |
2015-age-49-vacc | lib1-5.5 | L4-571v2 | Lib4mock-A | WTplasmid-A | 571v2 | lib1 | 2018-11-14 | 0.0185 | 5.4800 | collected after 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-vacc | human | post | lib1, 5.5% infectivity |
2015-age-49-vacc | lib2-1.1 | L5-571v2 | Lib5mock-B | WTplasmid-D | 571v2 | lib2 | 2019-03-06 | 0.01 | 1.0900 | collected after 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-vacc | human | post | lib2, 1.1% infectivity |
2015-age-49-vacc | lib3-7.8 | L6-571v2 | Lib6mock-B | WTplasmid-H | 571v2 | lib3 | 2019-03-26 | 0.005 | 7.7900 | collected after 2015/2016 vaccine from person born in 1966 | Hensley_sera | 2015-age-49-vacc | human | post | lib3, 7.8% infectivity |
2015-age-29-prevacc | lib1-9.5 | L4-574v1 | Lib4mock-A | WTplasmid-A | 574v1 | lib1 | 2018-11-14 | 0.0185 | 9.4800 | collected before 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-prevacc | human | pre | lib1, 9.5% infectivity |
2015-age-29-prevacc | lib2-7.6 | L5-574v1 | Lib5mock-B | WTplasmid-D | 574v1 | lib2 | 2019-03-06 | 0.0125 | 7.5700 | collected before 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-prevacc | human | pre | lib2, 7.6% infectivity |
2015-age-29-prevacc | lib3-3.3 | L6-574v1 | Lib6mock-B | WTplasmid-H | 574v1 | lib3 | 2019-03-26 | 0.015 | 3.3000 | collected before 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-prevacc | human | pre | lib3, 3.3% infectivity |
ferret-WHO-Perth2009 | lib1-2.3 | L4-WHOCCPerth | Lib4mock-A | WTplasmid-A | WHOCCPerth | lib1 | 2018-11-14 | 0.0185 | 2.3200 | ferret infected by Melbourne WHO CC with their Perth/2009 strain | ferret | WHO-Perth2009 | ferret | post | lib1, 2.3% infectivity |
ferret-WHO-Perth2009 | lib2-1.9 | L5-WHOCCPerth | Lib5mock-A | WTplasmid-B | WHOCCPerth | lib2 | 2019-01-16 | 0.0075 | 1.8800 | ferret infected by Melbourne WHO CC with their Perth/2009 strain | ferret | WHO-Perth2009 | ferret | post | lib2, 1.9% infectivity |
ferret-WHO-Perth2009 | lib3-4.1 | L6-WHOCCPerth | Lib6mock-A | WTplasmid-E | WHOCCPerth | lib3 | 2019-03-06 | 0.0075 | 4.1000 | ferret infected by Melbourne WHO CC with their Perth/2009 strain | ferret | WHO-Perth2009 | ferret | post | lib3, 4.1% infectivity |
2015-age-48-vacc | lib1-4.6 | L4-589v2 | Lib4mock-A | WTplasmid-A | 589v2 | lib1 | 2018-11-14 | 0.000875 | 4.5800 | collected after 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-vacc | human | post | lib1, 4.6% infectivity |
2015-age-48-vacc | lib2-4.7 | L5-589v2 | Lib5mock-B | WTplasmid-D | 589v2 | lib2 | 2019-03-06 | 0.000375 | 4.6800 | collected after 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-vacc | human | post | lib2, 4.7% infectivity |
2015-age-48-vacc | lib3-4.1 | L6-589v2 | Lib6mock-A | WTplasmid-E | 589v2 | lib3 | 2019-03-06 | 0.000375 | 4.0600 | collected after 2015/2016 vaccine from person born in 1967 | Hensley_sera | 2015-age-48-vacc | human | post | lib3, 4.1% infectivity |
2015-age-25-prevacc | lib1-6.9 | L4-557v1 | Lib4mock-A | WTplasmid-A | 557v1 | lib1 | 2018-11-14 | 0.0075 | 6.9000 | collected before 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-prevacc | human | pre | lib1, 6.9% infectivity |
2015-age-25-prevacc | lib2-2.3 | L5-557v1 | Lib5mock-B | WTplasmid-D | 557v1 | lib2 | 2019-03-06 | 0.005 | 2.3100 | collected before 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-prevacc | human | pre | lib2, 2.3% infectivity |
2015-age-25-prevacc | lib3-1.6 | L6-557v1 | Lib6mock-A | WTplasmid-E | 557v1 | lib3 | 2019-03-06 | 0.005 | 1.5700 | collected before 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-prevacc | human | pre | lib3, 1.6% infectivity |
ferret-Pitt-1-preinf | lib1-100 | L4-f9267neg | Lib4mock-A | WTplasmid-A | f9267neg | lib1 | 2018-11-14 | 0.00075 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-1-preinf | ferret | pre | lib1, 100% infectivity |
ferret-Pitt-1-preinf | lib2-100 | L5-f9267neg | Lib5mock-A | WTplasmid-B | f9267neg | lib2 | 2019-01-16 | 0.00075 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-1-preinf | ferret | pre | lib2, 100% infectivity |
ferret-Pitt-1-preinf | lib3-100 | L6-f9267neg | Lib6mock-B | WTplasmid-H | f9267neg | lib3 | 2019-03-26 | 0.00075 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-1-preinf | ferret | pre | lib3, 100% infectivity |
ferret-Pitt-1-postinf | lib1-4.4 | L4-f9267d23 | Lib4mock-A | WTplasmid-A | f9267d23 | lib1 | 2018-11-14 | 0.00075 | 4.3600 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-1-postinf | ferret | post | lib1, 4.4% infectivity |
ferret-Pitt-1-postinf | lib2-5.3 | L5-f9267d23 | Lib5mock-A | WTplasmid-B | f9267d23 | lib2 | 2019-01-16 | 0.000175 | 5.3500 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-1-postinf | ferret | post | lib2, 5.3% infectivity |
ferret-Pitt-1-postinf | lib3-1.7 | L6-f9267d23 | Lib6mock-A | WTplasmid-E | f9267d23 | lib3 | 2019-03-06 | 0.000375 | 1.6900 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-1-postinf | ferret | post | lib3, 1.7% infectivity |
ferret-Pitt-2-preinf | lib1-100 | L4-f9435neg | Lib4mock-A | WTplasmid-A | f9435neg | lib1 | 2018-11-14 | 0.001 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-2-preinf | ferret | pre | lib1, 100% infectivity |
ferret-Pitt-2-preinf | lib2-100 | L5-f9435neg | Lib5mock-A | WTplasmid-B | f9435neg | lib2 | 2019-01-16 | 0.0025 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-2-preinf | ferret | pre | lib2, 100% infectivity |
ferret-Pitt-2-preinf | lib3-100 | L6-f9435neg | Lib6mock-B | WTplasmid-H | f9435neg | lib3 | 2019-03-26 | 0.0025 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-2-preinf | ferret | pre | lib3, 100% infectivity |
ferret-Pitt-3-preinf | lib1-100 | L4-f9437neg | Lib4mock-A | WTplasmid-A | f9437neg | lib1 | 2018-11-14 | 0.00225 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-3-preinf | ferret | pre | lib1, 100% infectivity |
ferret-Pitt-3-preinf | lib2-100 | L5-f9437neg | Lib5mock-A | WTplasmid-B | f9437neg | lib2 | 2019-01-16 | 0.00625 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-3-preinf | ferret | pre | lib2, 100% infectivity |
ferret-Pitt-3-preinf | lib3-100 | L6-f9437neg | Lib6mock-B | WTplasmid-H | f9437neg | lib3 | 2019-03-26 | 0.00625 | 100.0000 | Lakdawala lab ferret, before infection | ferret | Pitt-3-preinf | ferret | pre | lib3, 100% infectivity |
ferret-Pitt-3-postinf | lib1-8.8 | L4-f9437d23 | Lib4mock-A | WTplasmid-A | f9437d23 | lib1 | 2018-11-14 | 0.00225 | 8.7700 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-3-postinf | ferret | post | lib1, 8.8% infectivity |
ferret-Pitt-3-postinf | lib2-1.7 | L5-f9437d23 | Lib5mock-A | WTplasmid-B | f9437d23 | lib2 | 2019-01-16 | 0.001375 | 1.7000 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-3-postinf | ferret | post | lib2, 1.7% infectivity |
ferret-Pitt-3-postinf | lib3-5.8 | L6-f9437d23 | Lib6mock-B | WTplasmid-H | f9437d23 | lib3 | 2019-03-26 | 0.00175 | 5.7800 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-3-postinf | ferret | post | lib3, 5.8% infectivity |
2010-age-21 | lib2-6.0 | L5-VIDD1 | Lib5mock-A | WTplasmid-B | VIDD1 | lib2 | 2019-01-16 | 0.0045 | 5.9500 | collected at Hutch in 2/2010 from person born in 1989 | VIDD_sera | 2010-age-21 | human | NaN | lib2, 6.0% infectivity |
2010-age-21 | lib1-4.1 | L4-VIDD1 | Lib4mock-B | WTplasmid-C | VIDD1 | lib1 | 2019-01-16 | 0.0175 | 4.0800 | collected at Hutch in 2/2010 from person born in 1989 | VIDD_sera | 2010-age-21 | human | NaN | lib1, 4.1% infectivity |
2010-age-21 | lib3-3.7 | L6-VIDD1 | Lib6mock-A | WTplasmid-E | VIDD1 | lib3 | 2019-03-06 | 0.0075 | 3.6800 | collected at Hutch in 2/2010 from person born in 1989 | VIDD_sera | 2010-age-21 | human | NaN | lib3, 3.7% infectivity |
ferret-WHO-Victoria2011 | lib2-3.9 | L5-WHOCCVic | Lib5mock-A | WTplasmid-B | WHOCCVic | lib2 | 2019-01-16 | 0.0025 | 3.9000 | ferret infected by Melbourne WHO CC with their Victoria/2011 strain | ferret | WHO-Victoria2011 | ferret | post | lib2, 3.9% infectivity |
ferret-WHO-Victoria2011 | lib1-1.3 | L4-WHOCCVic | Lib4mock-B | WTplasmid-C | WHOCCVic | lib1 | 2019-01-16 | 0.00625 | 1.2600 | ferret infected by Melbourne WHO CC with their Victoria/2011 strain | ferret | WHO-Victoria2011 | ferret | post | lib1, 1.3% infectivity |
ferret-WHO-Victoria2011 | lib3-3.6 | L6-WHOCCVic | Lib6mock-A | WTplasmid-E | WHOCCVic | lib3 | 2019-03-06 | 0.0025 | 3.6200 | ferret infected by Melbourne WHO CC with their Victoria/2011 strain | ferret | WHO-Victoria2011 | ferret | post | lib3, 3.6% infectivity |
2009-age-65 | lib2-5.2 | L5-VIDD5 | Lib5mock-A | WTplasmid-B | VIDD5 | lib2 | 2019-01-16 | 0.000875 | 5.2400 | collected at Hutch in 6/2009 from person born in 1944 | VIDD_sera | 2009-age-65 | human | NaN | lib2, 5.2% infectivity |
2009-age-65 | lib3-5.6 | L6-VIDD5 | Lib6mock-B | WTplasmid-H | VIDD5 | lib3 | 2019-03-26 | 0.002 | 5.5800 | collected at Hutch in 6/2009 from person born in 1944 | VIDD_sera | 2009-age-65 | human | NaN | lib3, 5.6% infectivity |
2009-age-65 | lib1-1.8 | L4-VIDD5 | Lib4mock-D | WTplasmid-G | VIDD5 | lib1 | 2019-03-26 | 0.003375 | 1.7700 | collected at Hutch in 6/2009 from person born in 1944 | VIDD_sera | 2009-age-65 | human | NaN | lib1, 1.8% infectivity |
2009-age-64 | lib2-0.87 | L5-VIDD4 | Lib5mock-A | WTplasmid-B | VIDD4 | lib2 | 2019-01-16 | 0.001675 | 0.8700 | collected at Hutch in 11/2008 from person born in 1945 | VIDD_sera | 2009-age-64 | human | NaN | lib2, 0.87% infectivity |
2009-age-64 | lib1-11 | L4-VIDD4 | Lib4mock-B | WTplasmid-C | VIDD4 | lib1 | 2019-01-16 | 0.004 | 10.7400 | collected at Hutch in 11/2008 from person born in 1945 | VIDD_sera | 2009-age-64 | human | NaN | lib1, 11% infectivity |
2009-age-64 | lib3-0.26 | L6-VIDD4 | Lib6mock-B | WTplasmid-H | VIDD4 | lib3 | 2019-03-26 | 0.00375 | 0.2600 | collected at Hutch in 11/2008 from person born in 1945 | VIDD_sera | 2009-age-64 | human | NaN | lib3, 0.26% infectivity |
2009-age-53 | lib2-2.0 | L5-VIDD2 | Lib5mock-A | WTplasmid-B | VIDD2 | lib2 | 2019-01-16 | 0.00375 | 2.0300 | collected at Hutch in 1/2009 from person born in 1956 | VIDD_sera | 2009-age-53 | human | NaN | lib2, 2.0% infectivity |
2009-age-53 | lib1-3.7 | L4-VIDD2 | Lib4mock-B | WTplasmid-C | VIDD2 | lib1 | 2019-01-16 | 0.00875 | 3.7400 | collected at Hutch in 1/2009 from person born in 1956 | VIDD_sera | 2009-age-53 | human | NaN | lib1, 3.7% infectivity |
2009-age-53 | lib3-6.2 | L6-VIDD2 | Lib6mock-A | WTplasmid-E | VIDD2 | lib3 | 2019-03-06 | 0.00375 | 6.2200 | collected at Hutch in 1/2009 from person born in 1956 | VIDD_sera | 2009-age-53 | human | NaN | lib3, 6.2% infectivity |
2009-age-53-plus-2-months | lib2-1.7 | L5-VIDD3 | Lib5mock-A | WTplasmid-B | VIDD3 | lib2 | 2019-01-16 | 0.0045 | 1.7300 | second sample collected at Hutch in 3/2009 from person born in 1956 | VIDD_sera | 2009-age-53-plus-2-months | human | NaN | lib2, 1.7% infectivity |
2009-age-53-plus-2-months | lib1-6.0 | L4-VIDD3 | Lib4mock-B | WTplasmid-C | VIDD3 | lib1 | 2019-01-16 | 0.0075 | 5.9800 | second sample collected at Hutch in 3/2009 from person born in 1956 | VIDD_sera | 2009-age-53-plus-2-months | human | NaN | lib1, 6.0% infectivity |
2009-age-53-plus-2-months | lib3-1.1 | L6-VIDD3 | Lib6mock-A | WTplasmid-E | VIDD3 | lib3 | 2019-03-06 | 0.00625 | 1.1200 | second sample collected at Hutch in 3/2009 from person born in 1956 | VIDD_sera | 2009-age-53-plus-2-months | human | NaN | lib3, 1.1% infectivity |
ferret-Pitt-2-postinf | lib2-1.9 | L5-f9435d23 | Lib5mock-A | WTplasmid-B | f9435d23 | lib2 | 2019-01-16 | 0.000625 | 1.8800 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-2-postinf | ferret | post | lib2, 1.9% infectivity |
ferret-Pitt-2-postinf | lib1-3.5 | L4-f9435d23 | Lib4mock-B | WTplasmid-C | f9435d23 | lib1 | 2019-01-16 | 0.0025 | 3.4500 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-2-postinf | ferret | post | lib1, 3.5% infectivity |
ferret-Pitt-2-postinf | lib3-3.0 | L6-f9435d23 | Lib6mock-A | WTplasmid-E | f9435d23 | lib3 | 2019-03-06 | 0.00125 | 3.0400 | Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA | ferret | Pitt-2-postinf | ferret | post | lib3, 3.0% infectivity |
2015-age-25-vacc | lib2-0.81 | L5-557v2 | Lib5mock-B | WTplasmid-D | 557v2 | lib2 | 2019-03-06 | 0.0005 | 0.8100 | collected after 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-vacc | human | post | lib2, 0.81% infectivity |
2015-age-25-vacc | lib3-1.6 | L6-557v2 | Lib6mock-A | WTplasmid-E | 557v2 | lib3 | 2019-03-06 | 0.0005 | 1.6400 | collected after 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-vacc | human | post | lib3, 1.6% infectivity |
2015-age-25-vacc | lib1-1.5 | L4-557v2 | Lib4mock-C | WTplasmid-F | 557v2 | lib1 | 2019-03-06 | 0.00125 | 1.5300 | collected after 2015/2016 vaccine from person born in 1990 | Hensley_sera | 2015-age-25-vacc | human | post | lib1, 1.5% infectivity |
2015-age-29-vacc | lib2-1.1 | L5-574v2 | Lib5mock-B | WTplasmid-D | 574v2 | lib2 | 2019-03-06 | 0.000875 | 1.1100 | collected after 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-vacc | human | post | lib2, 1.1% infectivity |
2015-age-29-vacc | lib1-1.5 | L4-574v2 | Lib4mock-C | WTplasmid-F | 574v2 | lib1 | 2019-03-06 | 0.002 | 1.5400 | collected after 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-vacc | human | post | lib1, 1.5% infectivity |
2015-age-29-vacc | lib3-1.9 | L6-574v2 | Lib6mock-B | WTplasmid-H | 574v2 | lib3 | 2019-03-26 | 0.00075 | 1.9200 | collected after 2015/2016 vaccine from person born in 1986 | Hensley_sera | 2015-age-29-vacc | human | post | lib3, 1.9% infectivity |
antibody-4F03 | lib2-23 | L5-4F03-c1 | Lib5mock-B | WTplasmid-D | 4F03 | lib2 | 2019-03-06 | 0.3 | 22.9900 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib2, 23% infectivity |
antibody-4F03 | lib2-3.7 | L5-4F03-c3 | Lib5mock-B | WTplasmid-D | 4F03 | lib2 | 2019-03-06 | 1.5 | 3.7100 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib2, 3.7% infectivity |
antibody-4F03 | lib2-4.7 | L5-4F03-c2 | Lib5mock-B | WTplasmid-D | 4F03 | lib2 | 2019-03-06 | 0.7 | 4.7100 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib2, 4.7% infectivity |
antibody-4F03 | lib1-3.5 | L4-4F03-c2 | Lib4mock-C | WTplasmid-F | 4F03 | lib1 | 2019-03-06 | 1.0 | 3.4700 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib1, 3.5% infectivity |
antibody-4F03 | lib1-0.78 | L4-4F03-c3 | Lib4mock-C | WTplasmid-F | 4F03 | lib1 | 2019-03-06 | 2.0 | 0.7800 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib1, 0.78% infectivity |
antibody-4F03 | lib3-24 | L6-4F03-c1 | Lib6mock-B | WTplasmid-H | 4F03 | lib3 | 2019-03-26 | 0.3 | 23.9100 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib3, 24% infectivity |
antibody-4F03 | lib3-0.84 | L6-4F03-c3 | Lib6mock-B | WTplasmid-H | 4F03 | lib3 | 2019-03-26 | 1.4 | 0.8400 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib3, 0.84% infectivity |
antibody-4F03 | lib3-7.2 | L6-4F03-c2 | Lib6mock-B | WTplasmid-H | 4F03 | lib3 | 2019-03-26 | 0.75 | 7.1600 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib3, 7.2% infectivity |
antibody-4F03 | lib1-28 | L4-4F03-c1 | Lib4mock-D | WTplasmid-G | 4F03 | lib1 | 2019-03-26 | 0.3 | 28.2400 | lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley | antibody_lower_head | antibody-4F03 | NaN | NaN | lib1, 28% infectivity |
antibody-5A01 | lib2-1.4 | L5-5A01 | Lib5mock-B | WTplasmid-D | 5A01 | lib2 | 2019-03-06 | 0.025 | 1.3600 | site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley | antibody_region_B | antibody-5A01 | NaN | NaN | lib2, 1.4% infectivity |
antibody-5A01 | lib1-4.3 | L4-5A01 | Lib4mock-C | WTplasmid-F | 5A01 | lib1 | 2019-03-06 | 0.07 | 4.3200 | site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley | antibody_region_B | antibody-5A01 | NaN | NaN | lib1, 4.3% infectivity |
antibody-5A01 | lib3-1.0 | L6-5A01 | Lib6mock-B | WTplasmid-H | 5A01 | lib3 | 2019-03-26 | 0.025 | 1.0200 | site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley | antibody_region_B | antibody-5A01 | NaN | NaN | lib3, 1.0% infectivity |
2009-age-65-with-low-4F03 | lib2-5.0 | L5-VIDD5-low4F03 | Lib5mock-B | WTplasmid-D | VIDD5andlow4F03 | lib2 | 2019-03-06 | 0.000875+0.3 | 5.0000 | collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency | serum_mAb_spike | 2009-age-65-with-low-4F03 | human | NaN | lib2, 5.0% infectivity |
2009-age-65-with-low-4F03 | lib3-0.20 | L6-VIDD5-low4F03 | Lib6mock-B | WTplasmid-H | VIDD5andlow4F03 | lib3 | 2019-03-26 | 0.002+0.3 | 0.2040 | collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency | serum_mAb_spike | 2009-age-65-with-low-4F03 | human | NaN | lib3, 0.20% infectivity |
2009-age-65-with-low-4F03 | lib1-0.030 | L4-VIDD5-low4F03 | Lib4mock-D | WTplasmid-G | VIDD5andlow4F03 | lib1 | 2019-03-26 | 0.003375+0.3 | 0.0300 | collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency | serum_mAb_spike | 2009-age-65-with-low-4F03 | human | NaN | lib1, 0.030% infectivity |
2009-age-65-with-mid-4F03 | lib2-0.46 | L5-VIDD5-mid4F03 | Lib5mock-B | WTplasmid-D | VIDD5andmid4F03 | lib2 | 2019-03-06 | 0.000875+0.7 | 0.4600 | collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency | serum_mAb_spike | 2009-age-65-with-mid-4F03 | human | NaN | lib2, 0.46% infectivity |
2009-age-65-with-mid-4F03 | lib3-0.0077 | L6-VIDD5-mid4F03 | Lib6mock-B | WTplasmid-H | VIDD5andmid4F03 | lib3 | 2019-03-26 | 0.002+0.75 | 0.0077 | collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency | serum_mAb_spike | 2009-age-65-with-mid-4F03 | human | NaN | lib3, 0.0077% infectivity |
2009-age-65-with-mid-4F03 | lib1-0.0037 | L4-VIDD5-mid4F03 | Lib4mock-D | WTplasmid-G | VIDD5andmid4F03 | lib1 | 2019-03-26 | 0.003375+1 | 0.0037 | collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency | serum_mAb_spike | 2009-age-65-with-mid-4F03 | human | NaN | lib1, 0.0037% infectivity |
2009-age-65-with-hi-4F03 | lib2-0.040 | L5-VIDD5-hi4F03 | Lib5mock-B | WTplasmid-D | VIDD5andhi4F03 | lib2 | 2019-03-06 | 0.000875+1.5 | 0.0400 | collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency | serum_mAb_spike | 2009-age-65-with-hi-4F03 | human | NaN | lib2, 0.040% infectivity |
2009-age-65-with-hi-4F03 | lib3-0.0032 | L6-VIDD5-hi4F03 | Lib6mock-B | WTplasmid-H | VIDD5andhi4F03 | lib3 | 2019-03-26 | 0.002+1.4 | 0.0032 | collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency | serum_mAb_spike | 2009-age-65-with-hi-4F03 | human | NaN | lib3, 0.0032% infectivity |
2009-age-65-with-hi-4F03 | lib1-0.0016 | L4-VIDD5-hi4F03 | Lib4mock-D | WTplasmid-G | VIDD5andhi4F03 | lib1 | 2019-03-26 | 0.003375+2 | 0.0016 | collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency | serum_mAb_spike | 2009-age-65-with-hi-4F03 | human | NaN | lib1, 0.0016% infectivity |
Now we run dms2_batch_diffsel to compute the immune selection.
We then add to our selections
data frame the name of the files holding the computed site (site) and mutation (mut) level selection for each sample.
The next cell does all of this:
outdir = config['diffseldir']
os.makedirs(outdir, exist_ok=True)
# write batch file used by program
batchfile = os.path.join(outdir, 'batch.csv')
(selections
.rename(columns={'serum_name_formatted': 'group'})
.to_csv(batchfile, index=False)
)
cmds = ['dms2_batch_diffsel',
'--summaryprefix', 'summary',
'--batchfile', batchfile,
'--outdir', outdir,
'--indir', config['renumbcountsdir'],
'--use_existing', config['use_existing'],
'--ncpus', str(config['ncpus'])
]
print(f"Computing diffsel using dms2_batch_diffsel with command:\n{' '.join(cmds)}")
subprocess.check_output(cmds)
selfilecols = []
for selfile in ['mutdiffsel', 'sitediffsel']:
selfilecol = selfile + '_file'
selfilecols.append(selfilecol)
selections[selfilecol] = (outdir + '/' + selections['serum_name_formatted']
+ '-' + selections['name'] + '_' +
selfile + '.csv')
missing_files = [f for f in selections[selfilecol]
if not os.path.isfile(f)]
assert not missing_files, f"missing files: {missing_files}"
print(f"Created {len(selections[selfilecol])} {selfile} files, adding to "
f"`selections` data frame in column {selfilecol}")
Computing diffsel using dms2_batch_diffsel with command: dms2_batch_diffsel --summaryprefix summary --batchfile results/diffsel/batch.csv --outdir results/diffsel --indir results/renumbered_codoncounts --use_existing yes --ncpus 16 Created 92 mutdiffsel files, adding to `selections` data frame in column mutdiffsel_file Created 92 sitediffsel files, adding to `selections` data frame in column sitediffsel_file
For further processing, we want to create a dataframe that holds all of the selection information at the site and mutation levels for all samples. We create such a dataframe, sel_df, by reading the files in selections into the data frame using dms_tools2.diffsel.df_read_filecols:
sel_df = (dms_tools2.diffsel.df_read_filecols(selections, selfilecols)
.drop(columns=selfilecols)
)
Now sel_df is a very large data frame, but it has all the information we want to plot. Here are the first few rows:
print(f"sel_df has {len(sel_df)} rows. Here are the first few:")
display(HTML(sel_df.head(n=5).to_html(index=False)))
sel_df has 1041440 rows. Here are the first few:
serum_name_formatted | name | sel | mock | err | serum | library | date | serum_dilution | percent_infectivity | serum_description | serum_group | serum_name | serum_species | serum_vaccination | name_formatted | site | wildtype | mutation | mutdiffsel | abs_diffsel | positive_diffsel | negative_diffsel | max_diffsel | min_diffsel | isite |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
antibody-3C06 | lib2-0.010 | L2-3C06 | Lib2mock-mAb | WTplasmid-mAb-A | 3C06 | lib2 | 2018-07-20 | 0.55 | 0.01 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN | lib2, 0.010% infectivity | 167 | T | A | 12.307415 | 26.366193 | 24.254047 | -2.112146 | 12.307415 | -2.112146 | 182 |
antibody-3C06 | lib2-0.010 | L2-3C06 | Lib2mock-mAb | WTplasmid-mAb-A | 3C06 | lib2 | 2018-07-20 | 0.55 | 0.01 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN | lib2, 0.010% infectivity | 167 | T | Q | 1.227785 | 26.366193 | 24.254047 | -2.112146 | 12.307415 | -2.112146 | 182 |
antibody-3C06 | lib2-0.010 | L2-3C06 | Lib2mock-mAb | WTplasmid-mAb-A | 3C06 | lib2 | 2018-07-20 | 0.55 | 0.01 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN | lib2, 0.010% infectivity | 167 | T | G | 1.189062 | 26.366193 | 24.254047 | -2.112146 | 12.307415 | -2.112146 | 182 |
antibody-3C06 | lib2-0.010 | L2-3C06 | Lib2mock-mAb | WTplasmid-mAb-A | 3C06 | lib2 | 2018-07-20 | 0.55 | 0.01 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN | lib2, 0.010% infectivity | 167 | T | V | 1.085952 | 26.366193 | 24.254047 | -2.112146 | 12.307415 | -2.112146 | 182 |
antibody-3C06 | lib2-0.010 | L2-3C06 | Lib2mock-mAb | WTplasmid-mAb-A | 3C06 | lib2 | 2018-07-20 | 0.55 | 0.01 | site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley | antibody_region_B | antibody-3C06 | NaN | NaN | lib2, 0.010% infectivity | 167 | T | W | 0.742358 | 26.366193 | 24.254047 | -2.112146 | 12.307415 | -2.112146 | 182 |
For some serum, we have multiple samples. These include the samples from multiple library replicates, but sometimes also several samples for the same library replicate at different serum concentrations that yield different percent viral infectivity remaining.
Which of these samples do we want to retain? We probably want just one sample per library (so we don't over-weight some libraries in our results), and we need to select the serum concentration (percent viral infectivity remaining) that is "best" in terms of giving reproducible results between replicates with maximal signal to noise.
We first plot the site-level selection for all samples.
To do this, we look over all sites and use the facet_plot
command of dmslogo to plot the site-level selection for all samples for each serum.
We also compute the correlation between samples for each serum:
for serum_name, serum_sel_df in sel_df.groupby('serum_name_formatted'):
print(f"\n\n******************* {serum_name} *******************")
fig, ax = dmslogo.facet_plot(
serum_sel_df,
x_col='isite',
gridcol_col='name_formatted',
show_col=None,
wspace=0.6,
draw_line_kwargs=dict(
xtick_col='site',
height_col='positive_diffsel',
ylabel='positive_diffsel',
)
)
display(fig)
plt.close(fig)
corr_df = (serum_sel_df
.rename(columns={'name_formatted': 'sample'})
.pivot_table(values='positive_diffsel',
columns='sample',
index=['site'])
.corr()
.round(3)
)
display(HTML(corr_df.to_html()))
******************* 2009-age-53 *******************
sample | lib1, 3.7% infectivity | lib2, 2.0% infectivity | lib3, 6.2% infectivity |
---|---|---|---|
sample | |||
lib1, 3.7% infectivity | 1.000 | 0.439 | 0.396 |
lib2, 2.0% infectivity | 0.439 | 1.000 | 0.380 |
lib3, 6.2% infectivity | 0.396 | 0.380 | 1.000 |
******************* 2009-age-53-plus-2-months *******************
sample | lib1, 6.0% infectivity | lib2, 1.7% infectivity | lib3, 1.1% infectivity |
---|---|---|---|
sample | |||
lib1, 6.0% infectivity | 1.000 | 0.446 | 0.451 |
lib2, 1.7% infectivity | 0.446 | 1.000 | 0.434 |
lib3, 1.1% infectivity | 0.451 | 0.434 | 1.000 |
******************* 2009-age-64 *******************
sample | lib1, 11% infectivity | lib2, 0.87% infectivity | lib3, 0.26% infectivity |
---|---|---|---|
sample | |||
lib1, 11% infectivity | 1.000 | 0.498 | 0.473 |
lib2, 0.87% infectivity | 0.498 | 1.000 | 0.438 |
lib3, 0.26% infectivity | 0.473 | 0.438 | 1.000 |
******************* 2009-age-65 *******************
sample | lib1, 1.8% infectivity | lib2, 5.2% infectivity | lib3, 5.6% infectivity |
---|---|---|---|
sample | |||
lib1, 1.8% infectivity | 1.000 | 0.491 | 0.356 |
lib2, 5.2% infectivity | 0.491 | 1.000 | 0.370 |
lib3, 5.6% infectivity | 0.356 | 0.370 | 1.000 |
******************* 2009-age-65-with-hi-4F03 *******************
sample | lib1, 0.0016% infectivity | lib2, 0.040% infectivity | lib3, 0.0032% infectivity |
---|---|---|---|
sample | |||
lib1, 0.0016% infectivity | 1.000 | 0.327 | 0.346 |
lib2, 0.040% infectivity | 0.327 | 1.000 | 0.632 |
lib3, 0.0032% infectivity | 0.346 | 0.632 | 1.000 |
******************* 2009-age-65-with-low-4F03 *******************
sample | lib1, 0.030% infectivity | lib2, 5.0% infectivity | lib3, 0.20% infectivity |
---|---|---|---|
sample | |||
lib1, 0.030% infectivity | 1.000 | 0.409 | 0.381 |
lib2, 5.0% infectivity | 0.409 | 1.000 | 0.500 |
lib3, 0.20% infectivity | 0.381 | 0.500 | 1.000 |
******************* 2009-age-65-with-mid-4F03 *******************
sample | lib1, 0.0037% infectivity | lib2, 0.46% infectivity | lib3, 0.0077% infectivity |
---|---|---|---|
sample | |||
lib1, 0.0037% infectivity | 1.000 | 0.407 | 0.476 |
lib2, 0.46% infectivity | 0.407 | 1.000 | 0.512 |
lib3, 0.0077% infectivity | 0.476 | 0.512 | 1.000 |
******************* 2010-age-21 *******************
sample | lib1, 4.1% infectivity | lib2, 6.0% infectivity | lib3, 3.7% infectivity |
---|---|---|---|
sample | |||
lib1, 4.1% infectivity | 1.000 | 0.518 | 0.539 |
lib2, 6.0% infectivity | 0.518 | 1.000 | 0.614 |
lib3, 3.7% infectivity | 0.539 | 0.614 | 1.000 |
******************* 2015-age-25-prevacc *******************
sample | lib1, 6.9% infectivity | lib2, 2.3% infectivity | lib3, 1.6% infectivity |
---|---|---|---|
sample | |||
lib1, 6.9% infectivity | 1.000 | 0.753 | 0.763 |
lib2, 2.3% infectivity | 0.753 | 1.000 | 0.757 |
lib3, 1.6% infectivity | 0.763 | 0.757 | 1.000 |
******************* 2015-age-25-vacc *******************
sample | lib1, 1.5% infectivity | lib2, 0.81% infectivity | lib3, 1.6% infectivity |
---|---|---|---|
sample | |||
lib1, 1.5% infectivity | 1.000 | 0.708 | 0.693 |
lib2, 0.81% infectivity | 0.708 | 1.000 | 0.747 |
lib3, 1.6% infectivity | 0.693 | 0.747 | 1.000 |
******************* 2015-age-29-prevacc *******************
sample | lib1, 9.5% infectivity | lib2, 7.6% infectivity | lib3, 3.3% infectivity |
---|---|---|---|
sample | |||
lib1, 9.5% infectivity | 1.000 | 0.356 | 0.385 |
lib2, 7.6% infectivity | 0.356 | 1.000 | 0.468 |
lib3, 3.3% infectivity | 0.385 | 0.468 | 1.000 |
******************* 2015-age-29-vacc *******************
sample | lib1, 1.5% infectivity | lib2, 1.1% infectivity | lib3, 1.9% infectivity |
---|---|---|---|
sample | |||
lib1, 1.5% infectivity | 1.000 | 0.457 | 0.309 |
lib2, 1.1% infectivity | 0.457 | 1.000 | 0.340 |
lib3, 1.9% infectivity | 0.309 | 0.340 | 1.000 |
******************* 2015-age-48-prevacc *******************
sample | lib1, 14% infectivity | lib2, 20% infectivity | lib3, 18% infectivity |
---|---|---|---|
sample | |||
lib1, 14% infectivity | 1.000 | 0.320 | 0.345 |
lib2, 20% infectivity | 0.320 | 1.000 | 0.413 |
lib3, 18% infectivity | 0.345 | 0.413 | 1.000 |
******************* 2015-age-48-vacc *******************
sample | lib1, 4.6% infectivity | lib2, 4.7% infectivity | lib3, 4.1% infectivity |
---|---|---|---|
sample | |||
lib1, 4.6% infectivity | 1.000 | 0.578 | 0.668 |
lib2, 4.7% infectivity | 0.578 | 1.000 | 0.571 |
lib3, 4.1% infectivity | 0.668 | 0.571 | 1.000 |
******************* 2015-age-49-prevacc *******************
sample | lib1, 20% infectivity | lib2, 12% infectivity | lib3, 28% infectivity |
---|---|---|---|
sample | |||
lib1, 20% infectivity | 1.000 | 0.394 | 0.371 |
lib2, 12% infectivity | 0.394 | 1.000 | 0.376 |
lib3, 28% infectivity | 0.371 | 0.376 | 1.000 |
******************* 2015-age-49-vacc *******************
sample | lib1, 5.5% infectivity | lib2, 1.1% infectivity | lib3, 7.8% infectivity |
---|---|---|---|
sample | |||
lib1, 5.5% infectivity | 1.000 | 0.375 | 0.388 |
lib2, 1.1% infectivity | 0.375 | 1.000 | 0.449 |
lib3, 7.8% infectivity | 0.388 | 0.449 | 1.000 |
******************* antibody-1C04 *******************
sample | lib2, 10% infectivity | lib3, 7.9% infectivity |
---|---|---|
sample | ||
lib2, 10% infectivity | 1.000 | 0.732 |
lib3, 7.9% infectivity | 0.732 | 1.000 |
******************* antibody-3C04 *******************
sample | lib1, 0.078% infectivity | lib2, 0.033% infectivity |
---|---|---|
sample | ||
lib1, 0.078% infectivity | 1.000 | 0.647 |
lib2, 0.033% infectivity | 0.647 | 1.000 |
******************* antibody-3C06 *******************
sample | lib1, 0.069% infectivity | lib2, 0.010% infectivity |
---|---|---|
sample | ||
lib1, 0.069% infectivity | 1.000 | 0.576 |
lib2, 0.010% infectivity | 0.576 | 1.000 |
******************* antibody-4C01 *******************
sample | lib1, 0.12% infectivity | lib2, 0.018% infectivity |
---|---|---|
sample | ||
lib1, 0.12% infectivity | 1.000 | 0.822 |
lib2, 0.018% infectivity | 0.822 | 1.000 |
******************* antibody-4F03 *******************
sample | lib1, 0.78% infectivity | lib1, 28% infectivity | lib1, 3.5% infectivity | lib2, 23% infectivity | lib2, 3.7% infectivity | lib2, 4.7% infectivity | lib3, 0.84% infectivity | lib3, 24% infectivity | lib3, 7.2% infectivity |
---|---|---|---|---|---|---|---|---|---|
sample | |||||||||
lib1, 0.78% infectivity | 1.000 | 0.293 | 0.850 | 0.412 | 0.792 | 0.718 | 0.768 | 0.507 | 0.728 |
lib1, 28% infectivity | 0.293 | 1.000 | 0.395 | 0.357 | 0.368 | 0.370 | 0.326 | 0.379 | 0.388 |
lib1, 3.5% infectivity | 0.850 | 0.395 | 1.000 | 0.451 | 0.745 | 0.720 | 0.731 | 0.578 | 0.743 |
lib2, 23% infectivity | 0.412 | 0.357 | 0.451 | 1.000 | 0.576 | 0.630 | 0.459 | 0.533 | 0.506 |
lib2, 3.7% infectivity | 0.792 | 0.368 | 0.745 | 0.576 | 1.000 | 0.819 | 0.761 | 0.576 | 0.744 |
lib2, 4.7% infectivity | 0.718 | 0.370 | 0.720 | 0.630 | 0.819 | 1.000 | 0.786 | 0.688 | 0.806 |
lib3, 0.84% infectivity | 0.768 | 0.326 | 0.731 | 0.459 | 0.761 | 0.786 | 1.000 | 0.687 | 0.846 |
lib3, 24% infectivity | 0.507 | 0.379 | 0.578 | 0.533 | 0.576 | 0.688 | 0.687 | 1.000 | 0.799 |
lib3, 7.2% infectivity | 0.728 | 0.388 | 0.743 | 0.506 | 0.744 | 0.806 | 0.846 | 0.799 | 1.000 |
******************* antibody-5A01 *******************
sample | lib1, 4.3% infectivity | lib2, 1.4% infectivity | lib3, 1.0% infectivity |
---|---|---|---|
sample | |||
lib1, 4.3% infectivity | 1.000 | 0.762 | 0.705 |
lib2, 1.4% infectivity | 0.762 | 1.000 | 0.717 |
lib3, 1.0% infectivity | 0.705 | 0.717 | 1.000 |
******************* ferret-Pitt-1-postinf *******************
sample | lib1, 4.4% infectivity | lib2, 5.3% infectivity | lib3, 1.7% infectivity |
---|---|---|---|
sample | |||
lib1, 4.4% infectivity | 1.000 | 0.517 | 0.58 |
lib2, 5.3% infectivity | 0.517 | 1.000 | 0.58 |
lib3, 1.7% infectivity | 0.580 | 0.580 | 1.00 |
******************* ferret-Pitt-1-preinf *******************
sample | lib1, 100% infectivity | lib2, 100% infectivity | lib3, 100% infectivity |
---|---|---|---|
sample | |||
lib1, 100% infectivity | 1.000 | 0.248 | 0.327 |
lib2, 100% infectivity | 0.248 | 1.000 | 0.231 |
lib3, 100% infectivity | 0.327 | 0.231 | 1.000 |
******************* ferret-Pitt-2-postinf *******************
sample | lib1, 3.5% infectivity | lib2, 1.9% infectivity | lib3, 3.0% infectivity |
---|---|---|---|
sample | |||
lib1, 3.5% infectivity | 1.000 | 0.487 | 0.551 |
lib2, 1.9% infectivity | 0.487 | 1.000 | 0.509 |
lib3, 3.0% infectivity | 0.551 | 0.509 | 1.000 |
******************* ferret-Pitt-2-preinf *******************
sample | lib1, 100% infectivity | lib2, 100% infectivity | lib3, 100% infectivity |
---|---|---|---|
sample | |||
lib1, 100% infectivity | 1.00 | 0.260 | 0.270 |
lib2, 100% infectivity | 0.26 | 1.000 | 0.328 |
lib3, 100% infectivity | 0.27 | 0.328 | 1.000 |
******************* ferret-Pitt-3-postinf *******************
sample | lib1, 8.8% infectivity | lib2, 1.7% infectivity | lib3, 5.8% infectivity |
---|---|---|---|
sample | |||
lib1, 8.8% infectivity | 1.000 | 0.589 | 0.563 |
lib2, 1.7% infectivity | 0.589 | 1.000 | 0.629 |
lib3, 5.8% infectivity | 0.563 | 0.629 | 1.000 |
******************* ferret-Pitt-3-preinf *******************
sample | lib1, 100% infectivity | lib2, 100% infectivity | lib3, 100% infectivity |
---|---|---|---|
sample | |||
lib1, 100% infectivity | 1.000 | 0.286 | 0.346 |
lib2, 100% infectivity | 0.286 | 1.000 | 0.336 |
lib3, 100% infectivity | 0.346 | 0.336 | 1.000 |
******************* ferret-WHO-Perth2009 *******************
sample | lib1, 2.3% infectivity | lib2, 1.9% infectivity | lib3, 4.1% infectivity |
---|---|---|---|
sample | |||
lib1, 2.3% infectivity | 1.000 | 0.502 | 0.534 |
lib2, 1.9% infectivity | 0.502 | 1.000 | 0.449 |
lib3, 4.1% infectivity | 0.534 | 0.449 | 1.000 |
******************* ferret-WHO-Victoria2011 *******************
sample | lib1, 1.3% infectivity | lib2, 3.9% infectivity | lib3, 3.6% infectivity |
---|---|---|---|
sample | |||
lib1, 1.3% infectivity | 1.000 | 0.600 | 0.618 |
lib2, 3.9% infectivity | 0.600 | 1.000 | 0.631 |
lib3, 3.6% infectivity | 0.618 | 0.631 | 1.000 |
From the plots and correlations immediately above, several things are apparent:
Based on the above, we will filter our samples down to those that have closest to the target amount of viral infectivity remaining. Because we qualitatively estimated above that the best results are obtained when between 1% to 5% of infectivity is remaining, we will target 2% infectivity. Then for any library / serum with multiple samples, we will retain the one that is closest (in log space) to that target infectivity. We do this in log space because in linear space, 0% infectivity would be closer to 2% infectivity as would 6% infectivity--even though the latter is clearly preferable if our target is 2%.
target_infectivity = 2
print(f"Choosing samples closest to {target_infectivity:.2f}% infectivity.")
Choosing samples closest to 2.00% infectivity.
Mark the samples to retain by adding a retained column to both the selections and sel_df data frames, and flagging it as True only for the sample for that serum and library that has the smallest log-space distance to the target infectivity:
selections = (
selections
.assign(retained=lambda x: x.assign(dist=lambda y: (y.percent_infectivity /
target_infectivity)
.apply(math.log).apply(abs))
.groupby(['serum_name_formatted', 'library'])
['dist']
.transform(lambda y: y <= y.min())
)
)
print(f"Retaining {len(selections.query('retained'))} of {len(selections)}")
sel_df = sel_df.merge(selections[['serum_name_formatted', 'name', 'retained']],
validate='many_to_one')
Retaining 86 of 92
Plot the samples to retain and their percent infectivity. In the plot below, the dashed horizontal line indicates the target percent infectivity, and the colors of the points indicate which sample we retained for each serum / library:
p = (ggplot(selections.assign(xlabel=lambda x: (x['serum_name_formatted'] +
', ' + x['library'])),
aes('xlabel', 'percent_infectivity', color='retained')) +
geom_point(size=3, alpha=0.7) +
theme(
axis_text_x=element_text(angle=90),
figure_size=(0.25 * len(selections.groupby(['serum_name_formatted',
'library'])), 2.5)
) +
scale_y_log10(name='percent infectivity') +
xlab('') +
scale_color_manual(values=PALETTE) +
geom_hline(yintercept=target_infectivity, linetype='dashed',
alpha=0.7, color=PALETTE[2])
)
_ = p.draw()
Here is a small table listing the samples that we retained for each serum and
os.makedirs(config['selectiontabledir'], exist_ok=True)
for valtype in ['percent_infectivity', 'serum_dilution']:
table = (selections
.assign(serum_name=lambda x:
pd.Categorical(x['serum_name'],
sera['serum_name'],
ordered=True))
.query('retained')
.sort_values('serum_name')
[['serum_group', 'serum_name_formatted', 'library', valtype]]
.pivot_table(index=['serum_group', 'serum_name_formatted'],
columns='library',
values=valtype,
aggfunc=lambda x: ' '.join(str(v) for v in x))
)
print(f"\n************* {valtype} for each replicate *************")
display(HTML(table.to_html()))
table_file = os.path.join(config['selectiontabledir'], f"{valtype}_table.csv")
print(f"Writing to {table_file}")
table.to_csv(table_file)
************* percent_infectivity for each replicate *************
library | lib1 | lib2 | lib3 | |
---|---|---|---|---|
serum_group | serum_name_formatted | |||
Hensley_sera | 2015-age-25-prevacc | 6.9000 | 2.3100 | 1.5700 |
2015-age-25-vacc | 1.5300 | 0.8100 | 1.6400 | |
2015-age-29-prevacc | 9.4800 | 7.5700 | 3.3000 | |
2015-age-29-vacc | 1.5400 | 1.1100 | 1.9200 | |
2015-age-48-prevacc | 13.8200 | 20.3200 | 18.4000 | |
2015-age-48-vacc | 4.5800 | 4.6800 | 4.0600 | |
2015-age-49-prevacc | 19.6200 | 11.8600 | 27.7400 | |
2015-age-49-vacc | 5.4800 | 1.0900 | 7.7900 | |
VIDD_sera | 2009-age-53 | 3.7400 | 2.0300 | 6.2200 |
2009-age-53-plus-2-months | 5.9800 | 1.7300 | 1.1200 | |
2009-age-64 | 10.7400 | 0.8700 | 0.2600 | |
2009-age-65 | 1.7700 | 5.2400 | 5.5800 | |
2010-age-21 | 4.0800 | 5.9500 | 3.6800 | |
antibody_lower_head | antibody-1C04 | NaN | 10.3190 | 7.9040 |
antibody-4F03 | 3.4700 | 3.7100 | 0.8400 | |
antibody_region_B | antibody-3C04 | 0.0777 | 0.0331 | NaN |
antibody-3C06 | 0.0688 | 0.0100 | NaN | |
antibody-4C01 | 0.1190 | 0.0185 | NaN | |
antibody-5A01 | 4.3200 | 1.3600 | 1.0200 | |
ferret | ferret-Pitt-1-postinf | 4.3600 | 5.3500 | 1.6900 |
ferret-Pitt-1-preinf | 100.0000 | 100.0000 | 100.0000 | |
ferret-Pitt-2-postinf | 3.4500 | 1.8800 | 3.0400 | |
ferret-Pitt-2-preinf | 100.0000 | 100.0000 | 100.0000 | |
ferret-Pitt-3-postinf | 8.7700 | 1.7000 | 5.7800 | |
ferret-Pitt-3-preinf | 100.0000 | 100.0000 | 100.0000 | |
ferret-WHO-Perth2009 | 2.3200 | 1.8800 | 4.1000 | |
ferret-WHO-Victoria2011 | 1.2600 | 3.9000 | 3.6200 | |
serum_mAb_spike | 2009-age-65-with-hi-4F03 | 0.0016 | 0.0400 | 0.0032 |
2009-age-65-with-low-4F03 | 0.0300 | 5.0000 | 0.2040 | |
2009-age-65-with-mid-4F03 | 0.0037 | 0.4600 | 0.0077 |
Writing to results/selection_tables/percent_infectivity_table.csv ************* serum_dilution for each replicate *************
library | lib1 | lib2 | lib3 | |
---|---|---|---|---|
serum_group | serum_name_formatted | |||
Hensley_sera | 2015-age-25-prevacc | 0.0075 | 0.005 | 0.005 |
2015-age-25-vacc | 0.00125 | 0.0005 | 0.0005 | |
2015-age-29-prevacc | 0.0185 | 0.0125 | 0.015 | |
2015-age-29-vacc | 0.002 | 0.000875 | 0.00075 | |
2015-age-48-prevacc | 0.0185 | 0.0175 | 0.0175 | |
2015-age-48-vacc | 0.000875 | 0.000375 | 0.000375 | |
2015-age-49-prevacc | 0.0185 | 0.0185 | 0.0185 | |
2015-age-49-vacc | 0.0185 | 0.01 | 0.005 | |
VIDD_sera | 2009-age-53 | 0.00875 | 0.00375 | 0.00375 |
2009-age-53-plus-2-months | 0.0075 | 0.0045 | 0.00625 | |
2009-age-64 | 0.004 | 0.001675 | 0.00375 | |
2009-age-65 | 0.003375 | 0.000875 | 0.002 | |
2010-age-21 | 0.0175 | 0.0045 | 0.0075 | |
antibody_lower_head | antibody-1C04 | NaN | 16.0 | 18.0 |
antibody-4F03 | 1.0 | 1.5 | 1.4 | |
antibody_region_B | antibody-3C04 | 0.1 | 0.65 | NaN |
antibody-3C06 | 0.1 | 0.55 | NaN | |
antibody-4C01 | 0.2 | 0.60 | NaN | |
antibody-5A01 | 0.07 | 0.025 | 0.025 | |
ferret | ferret-Pitt-1-postinf | 0.00075 | 0.000175 | 0.000375 |
ferret-Pitt-1-preinf | 0.00075 | 0.00075 | 0.00075 | |
ferret-Pitt-2-postinf | 0.0025 | 0.000625 | 0.00125 | |
ferret-Pitt-2-preinf | 0.001 | 0.0025 | 0.0025 | |
ferret-Pitt-3-postinf | 0.00225 | 0.001375 | 0.00175 | |
ferret-Pitt-3-preinf | 0.00225 | 0.00625 | 0.00625 | |
ferret-WHO-Perth2009 | 0.0185 | 0.0075 | 0.0075 | |
ferret-WHO-Victoria2011 | 0.00625 | 0.0025 | 0.0025 | |
serum_mAb_spike | 2009-age-65-with-hi-4F03 | 0.003375+2 | 0.000875+1.5 | 0.002+1.4 |
2009-age-65-with-low-4F03 | 0.003375+0.3 | 0.000875+0.3 | 0.002+0.3 | |
2009-age-65-with-mid-4F03 | 0.003375+1 | 0.000875+0.7 | 0.002+0.75 |
Writing to results/selection_tables/serum_dilution_table.csv
We now compute the average selection among the retained samples for each serum.
First, confirm that we have retained just one sample per serum / library:
assert all(len(group) == 1 for _, group in
(selections
.query('retained')
.groupby(['serum_name_formatted', 'library'])
)
)
We can compute the "average" selection using either the mean or the median; which one to use is specified in the config file:
avg_type = config['avg_type']
print(f"Computing across replicate averages as the {avg_type}")
Computing across replicate averages as the median
Do a sanity check and make sure none of our libraries are named the same as the average type:
assert all(avg_type not in selections[col].values for col in ['library', 'name_formatted'])
Now loop over all sera and compute the average selection for all retained samples for that sera. Note that the averages are computed on the mutation-level data, and then the site data are computed from those averaged mutation-level data. The averages (along with the samples used to compute these averages) are then added to a new data frame similar to selections that is called avg_selections:
avgdir = config['avgdiffseldir']
os.makedirs(avgdir, exist_ok=True)
avg_selections = []
for serum_name_formatted, group in (
selections
.query('retained')
[['serum_name_formatted', 'library', 'name_formatted'] +
list(sera.columns) + selfilecols]
.groupby('serum_name_formatted')
):
avg_selections.append(group)
# build row of data frame with average
avg_row = group.iloc[0].to_dict(into=collections.OrderedDict)
avg_row['library'] = avg_type
avg_row['name_formatted'] = avg_type
avg_row['mutdiffsel_file'] = (f"{avgdir}/{serum_name_formatted}-"
f"mutdiffsel-{avg_type}.csv")
(dms_tools2.diffsel.avgMutDiffSel(group['mutdiffsel_file'], avg_type)
.to_csv(avg_row['mutdiffsel_file'], index=False))
avg_row['sitediffsel_file'] = (f"{avgdir}/{serum_name_formatted}-"
f"sitediffsel-{avg_type}.csv")
(dms_tools2.diffsel.mutToSiteDiffSel(pd.read_csv(avg_row['mutdiffsel_file']))
.to_csv(avg_row['sitediffsel_file'], index=False))
avg_row = pd.Series(avg_row).to_frame().transpose()
assert all(avg_row.columns == group.columns)
avg_selections.append(avg_row)
# put avg_selections in data frame, sort to match config['sera']
avg_selections = (pd.concat(avg_selections)
.assign(serum_name=lambda x:
pd.Categorical(x['serum_name'],
sera['serum_name'],
ordered=True))
.sort_values(['serum_name', 'library'])
.assign(serum_name_formatted=lambda x:
pd.Categorical(x['serum_name_formatted'],
x['serum_name_formatted'].unique(),
ordered=True))
.reset_index(drop=True)
)
Now the avg_selections
data frame lists the files giving all the retained library replicates plus the average calculated from them these replicates.
Now we create the data frame avg_sel_df
which actually holds the site- and mutation-level averages for all sera as well as the samples (one replicate per library) that we used to compute these averages:
avg_sel_df = (dms_tools2.diffsel.df_read_filecols(avg_selections, selfilecols)
.drop(columns=selfilecols)
# preserve order of sera as in `avg_selections`
.assign(serum_name_formatted=lambda x:
pd.Categorical(x['serum_name_formatted'],
(avg_selections['serum_name_formatted']
.unique()),
ordered=True))
)
This avg_sel_df
data frame differs from sel_df
only in that it includes the averages as a library type, and only has the retained replicates for each library.
We want to identify the sites that are under "significant" immune selection. The reason is that we can then zoom in on these sites in logo plots.
In dms_tools2.plot.findSigSel function, we have implemented a heuristic way to do this. Essentially, this function fits a gamma distribution to the selection values for each site, and then identifies those that are "outliers" of high selection.
First, we define a cutoff for what constitutes significant:
fdr_cutoff = 0.05
Now we use dms_tools2.plot.findSigSel to get a dataframe (sigsites_df
) that lists the "significant" sites for each serum.
That function finds these sites by fitting a gamma distribution to the data and then finding sites that are far outside the range of the distribution (thereby computing a heuristic P-value).
It has two methods, robust_hist and mle; we use both and take any sites found by either method.
Because the pre-vaccination and pre-infection serum generally have weak signal and the vaccination at most boosts existing specificities in the human samples that we have, we will ignore the pre-vaccination samples when identifying significant sites.
The cell below also saves plots showing the fit gamma distribution (you can inspect these separately if you want to look in more detail):
os.makedirs(config['avgdiffsel_sigsites_dir'], exist_ok=True)
plotfile_template = os.path.join(config['avgdiffsel_sigsites_dir'],
'sigsites_{serum}_{method}.pdf')
print(f"Identifying sites of significant selection at a FDR of {fdr_cutoff}.\n"
f"Plots of distribution fitting saved as {plotfile_template}")
sigsites_df = []
sigsites_cols = ['serum_group', 'serum_name_formatted', 'isite', 'site',
'positive_diffsel']
for serum_name_formatted, group in (
avg_sel_df
.query('serum_vaccination != "pre"')
.query('library == @avg_type')
[sigsites_cols]
.drop_duplicates()
.groupby('serum_name_formatted', observed=True)
):
for method in ['robust_hist', 'mle']:
plotfile = plotfile_template.format(serum=serum_name_formatted,
method=method)
df, _, _ = dms_tools2.plot.findSigSel(
group,
'positive_diffsel',
plotfile,
fdr=fdr_cutoff,
title=serum_name_formatted,
method=method
)
sigsites_df.append(df)
sigsites_df = (pd.concat(sigsites_df, ignore_index=True)
.groupby(sigsites_cols)
.aggregate({'sig': 'sum'})
.reset_index()
.assign(sig=lambda x: x['sig'].astype(bool))
.query('sig')
)
print('Here are the first few rows of sigsites_df:')
display(HTML(sigsites_df.head(n=4).to_html(index=False)))
Identifying sites of significant selection at a FDR of 0.05. Plots of distribution fitting saved as results/avgdiffsel/sigsites/sigsites_{serum}_{method}.pdf Here are the first few rows of sigsites_df:
serum_group | serum_name_formatted | isite | site | positive_diffsel | sig |
---|---|---|---|---|---|
Hensley_sera | 2015-age-25-vacc | 160 | 145 | 14.630931 | True |
Hensley_sera | 2015-age-25-vacc | 174 | 159 | 26.089489 | True |
Hensley_sera | 2015-age-25-vacc | 175 | 160 | 6.151038 | True |
Hensley_sera | 2015-age-25-vacc | 176 | 161 | 5.219629 | True |
Now display lists of the significant sites for each serum (we did not determine significant sites for pre-vaccination samples as described above):
display(HTML(sigsites_df
.sort_values('isite')
.assign(nsites=1)
.groupby('serum_name_formatted')
.aggregate({'site': lambda x: ', '.join(list(x)),
'nsites': 'sum'})
.rename(columns={'site': 'significant sites',
'nsites': 'number of sites'})
.to_html()
))
significant sites | number of sites | |
---|---|---|
serum_name_formatted | ||
antibody-5A01 | 157, 158, 159, 160, 193, 222, 227, 244 | 8 |
antibody-3C04 | 159, 160, 192, 193 | 4 |
antibody-3C06 | 137, 145, 159, 160, 167, 193, 207, 244, 246 | 9 |
antibody-4C01 | 193 | 1 |
antibody-4F03 | 80, 81, 83, 121, 122, 220, 244, 259, (HA2)78 | 9 |
antibody-1C04 | 53, 54, 57, 82, 83, 188, 210, 220, 244, (HA2)61 | 10 |
ferret-Pitt-1-preinf | 0 | |
ferret-Pitt-1-postinf | 189, 193, 222 | 3 |
ferret-Pitt-2-preinf | 0 | |
ferret-Pitt-2-postinf | 142, 144, 189, 193, 222 | 5 |
ferret-Pitt-3-preinf | 0 | |
ferret-Pitt-3-postinf | 189, 193 | 2 |
ferret-WHO-Perth2009 | 50, 189, 193 | 3 |
ferret-WHO-Victoria2011 | 50, 159, 189, 193, 222, 275 | 6 |
2010-age-21 | 144, 159, 193, 222 | 4 |
2009-age-53 | 157, 160 | 2 |
2009-age-53-plus-2-months | 157, 160, 244 | 3 |
2009-age-64 | 159, 222, 244 | 3 |
2009-age-65 | 159, 160, 193 | 3 |
2015-age-25-prevacc | 0 | |
2015-age-25-vacc | 145, 159, 160, 161, 192, 193, 207, 220, 222, 224, 225, 244 | 12 |
2015-age-29-prevacc | 0 | |
2015-age-29-vacc | 144, 145, 159, 160, 222, 227 | 6 |
2015-age-48-prevacc | 0 | |
2015-age-48-vacc | 189 | 1 |
2015-age-49-prevacc | 0 | |
2015-age-49-vacc | 0 | |
2009-age-65-with-low-4F03 | 80, 121, 159, 160, 193, 244 | 6 |
2009-age-65-with-mid-4F03 | 121 | 1 |
2009-age-65-with-hi-4F03 | 0 |
In the analyses below, we may want to plot the sites that are significant within at least one serum sample for each serum group. Therefore, we determine the significant sites within each serum group:
sigsites_by_serumgroup = (
sigsites_df
[['serum_group', 'isite', 'site']]
.drop_duplicates()
.sort_values('isite')
.groupby('serum_group')
.aggregate(lambda x: list(x))
.assign(nsites=lambda x: x['site'].apply(len))
.sort_values('nsites')
)
display(HTML(sigsites_by_serumgroup.to_html()))
isite | site | nsites | |
---|---|---|---|
serum_group | |||
serum_mAb_spike | [95, 136, 174, 175, 208, 259] | [80, 121, 159, 160, 193, 244] | 6 |
VIDD_sera | [159, 172, 174, 175, 208, 237, 259] | [144, 157, 159, 160, 193, 222, 244] | 7 |
ferret | [65, 157, 159, 174, 204, 208, 237, 290] | [50, 142, 144, 159, 189, 193, 222, 275] | 8 |
antibody_region_B | [152, 160, 172, 173, 174, 175, 182, 207, 208, 222, 237, 242, 259, 261] | [137, 145, 157, 158, 159, 160, 167, 192, 193, 207, 222, 227, 244, 246] | 14 |
Hensley_sera | [159, 160, 174, 175, 176, 204, 207, 208, 222, 235, 237, 239, 240, 242, 259] | [144, 145, 159, 160, 161, 189, 192, 193, 207, 220, 222, 224, 225, 227, 244] | 15 |
antibody_lower_head | [68, 69, 72, 95, 96, 97, 98, 136, 137, 203, 225, 235, 259, 274, 405, 422] | [53, 54, 57, 80, 81, 82, 83, 121, 122, 188, 210, 220, 244, 259, (HA2)61, (HA2)78] | 16 |
Now we plot the average (across libraries) selection for each serum.
In the plots, we will zoom in on important sites using logo plots. The reason that we zoom in on just a subset of sites is to keep the logo plots relatively compact.
The sites we zoom in on will be those identified above as being under "significant" selection for all sera in that serum group.
We also may want to "pad" the sites that we zoom in on by zooming in on this many sites before and after each significant site. This is useful if you want to give some context to the zoomed sites. Below, set the amount of padding you want around each significant site; a value of 0 means no padding:
zoom_pad = 0
Now build a dict, zoom_sites that is keyed first by serum_group and then by isite and site, with values being the list of sites to zoom for that serum group:
zoom_sites = {}
for tup in sigsites_by_serumgroup.reset_index().itertuples(index=False):
isites = set(tup.isite)
for isite in list(isites):
for pad in range(-zoom_pad, zoom_pad + 1):
isites.add(isite + pad)
# only keep valid sites
isites = isites.intersection(set(avg_sel_df['isite']))
isites = sorted(isites) # get as sorted list
# get sites corresponding to each isite
sites = (avg_sel_df
[['isite', 'site']]
.drop_duplicates()
.sort_values('isite')
.query('isite in @isites')
['site']
.tolist()
)
zoom_sites[tup.serum_group] = {'isite': isites,
'site': sites,
'nsites': len(sites)}
Here are the sites we will zoom in on for each serum group:
display(HTML(pd.DataFrame.from_dict(zoom_sites, orient='index').to_html()))
isite | site | nsites | |
---|---|---|---|
Hensley_sera | [159, 160, 174, 175, 176, 204, 207, 208, 222, 235, 237, 239, 240, 242, 259] | [144, 145, 159, 160, 161, 189, 192, 193, 207, 220, 222, 224, 225, 227, 244] | 15 |
VIDD_sera | [159, 172, 174, 175, 208, 237, 259] | [144, 157, 159, 160, 193, 222, 244] | 7 |
antibody_lower_head | [68, 69, 72, 95, 96, 97, 98, 136, 137, 203, 225, 235, 259, 274, 405, 422] | [53, 54, 57, 80, 81, 82, 83, 121, 122, 188, 210, 220, 244, 259, (HA2)61, (HA2)78] | 16 |
antibody_region_B | [152, 160, 172, 173, 174, 175, 182, 207, 208, 222, 237, 242, 259, 261] | [137, 145, 157, 158, 159, 160, 167, 192, 193, 207, 222, 227, 244, 246] | 14 |
ferret | [65, 157, 159, 174, 204, 208, 237, 290] | [50, 142, 144, 159, 189, 193, 222, 275] | 8 |
serum_mAb_spike | [95, 136, 174, 175, 208, 259] | [80, 121, 159, 160, 193, 244] | 6 |
Add a column (zoom_site
) to avg_sel_df
that indicates which sites to zoom in on:
avg_sel_df = pd.concat(
[df.assign(zoom_site=lambda x: x['isite'].isin(
zoom_sites[serum_group]['isite']))
for serum_group, df in avg_sel_df.groupby('serum_group')],
ignore_index=True
)
We now have all the information used to display the data in the tidy data frame avg_sel_df
, which has the selection for every mutation for each antibody averaged across this replicates.
We write this data frame to a CSV file, getting rid of some unneeded columns:
avg_sel_df_file = os.path.join(config['avgdiffseldir'],
'avg_sel_tidy.csv')
print(f"Writing average selection information to {avg_sel_df_file}")
(avg_sel_df
.query('library == @avg_type')
.drop(columns=['library', 'name_formatted', 'serum_description',
'serum_name', 'serum_species'])
.to_csv(avg_sel_df_file, index=False, float_format='%.5g')
)
Writing average selection information to results/avgdiffsel/avg_sel_tidy.csv
For each group of sera we make line plots that show the site-level selection and logo plots that zoom in on mutations at the sites of significant selection.
We make these plots using the facet_plot
command of dmslogo.
We want to label the logo plots with site numbers and wildtype residue, so first we create a column that contains this information:
avg_sel_df = avg_sel_df.assign(site_label=lambda x: x['wildtype'] +
x['site'].astype('str'))
We will use axes with shared ylimits across rows for all plots except for the antibody serum group:
share_ylim_across_rows = {serum_group: ('antibody' not in serum_group)
for serum_group in avg_sel_df['serum_group'].unique()}
Now we make the line and logo plots. We also save PDF versions of each plot:
os.makedirs(config['avgdiffsel_zoom_dir'], exist_ok=True)
for serum_group, df in avg_sel_df.groupby('serum_group'):
plotfile = os.path.join(config['avgdiffsel_zoom_dir'],
f"{serum_group}_avg.pdf")
print(f"\n\n{'*' * 72}\nSerum group {serum_group}, saving to {plotfile}\n")
fig, axes = dmslogo.facet_plot(
data=df.query('library == @avg_type'),
x_col='isite',
show_col='zoom_site',
gridrow_col='serum_name_formatted',
share_xlabel=True,
share_ylabel=True,
share_ylim_across_rows=share_ylim_across_rows[serum_group],
wspace=0.6,
draw_line_kwargs=dict(
height_col='positive_diffsel',
xtick_col='site',
ylabel='immune selection',
),
draw_logo_kwargs=dict(
letter_col='mutation',
letter_height_col='mutdiffsel',
xtick_col='site_label',
xlabel='site',
ylabel='immune selection',
clip_negative_heights=True,
),
)
display(fig)
fig.savefig(plotfile)
plt.close(fig)
************************************************************************ Serum group Hensley_sera, saving to results/avgdiffsel/zoomed_plots/Hensley_sera_avg.pdf
************************************************************************ Serum group VIDD_sera, saving to results/avgdiffsel/zoomed_plots/VIDD_sera_avg.pdf
************************************************************************ Serum group antibody_lower_head, saving to results/avgdiffsel/zoomed_plots/antibody_lower_head_avg.pdf
************************************************************************ Serum group antibody_region_B, saving to results/avgdiffsel/zoomed_plots/antibody_region_B_avg.pdf
************************************************************************ Serum group ferret, saving to results/avgdiffsel/zoomed_plots/ferret_avg.pdf
************************************************************************ Serum group serum_mAb_spike, saving to results/avgdiffsel/zoomed_plots/serum_mAb_spike_avg.pdf
Finally, we make whole-gene logo plots for each serum that shows the replicate-average selection for all sites. We make these whole-gene plots using dms2_logoplot. They are too large to be useful to show visually in this notebook, but the cell below gives the name of the PDF holding each logo plot so you can examine them individually if you'd like.
outdir = config['avgdiffsel_full_dir'] # save plots here
os.makedirs(outdir, exist_ok=True)
for tup in (avg_selections
.query('library == @avg_type')
.itertuples(index=False)
):
name = getattr(tup, "serum_name_formatted")
plotfile = os.path.join(outdir, f"{name}_diffsel.pdf")
if os.path.isfile(plotfile) and config['use_existing'] == 'yes':
print(f"{plotfile} already exists.")
continue
datafile = getattr(tup, 'mutdiffsel_file')
maxsitediffsel = (pd.read_csv(getattr(tup, 'sitediffsel_file'))
['positive_diffsel']
.max()
)
barheight = f"{0.5 * maxsitediffsel:.2g}"
barlabel = f"differential selection = {barheight}"
cmds = ['dms2_logoplot',
'--outdir', outdir,
'--diffsel', datafile,
'--name', name,
'--nperline', '95',
'--overlay1', datafile, 'wildtype', 'wildtype',
'--underlay', 'yes',
'--restrictdiffsel', 'positive',
'--scalebar', barheight, barlabel,
'--use_existing', config['use_existing'],
]
print(f"Plotting {name} to {plotfile}")
subprocess.check_output(cmds)
assert os.path.isfile(plotfile)
results/avgdiffsel/full_logo_plots/antibody-5A01_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/antibody-3C04_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/antibody-3C06_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/antibody-4C01_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/antibody-4F03_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/antibody-1C04_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-Pitt-1-preinf_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-Pitt-1-postinf_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-Pitt-2-preinf_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-Pitt-2-postinf_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-Pitt-3-preinf_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-Pitt-3-postinf_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-WHO-Perth2009_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/ferret-WHO-Victoria2011_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2010-age-21_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2009-age-53_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2009-age-53-plus-2-months_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2009-age-64_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2009-age-65_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-25-prevacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-25-vacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-29-prevacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-29-vacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-48-prevacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-48-vacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-49-prevacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2015-age-49-vacc_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2009-age-65-with-low-4F03_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2009-age-65-with-mid-4F03_diffsel.pdf already exists. results/avgdiffsel/full_logo_plots/2009-age-65-with-hi-4F03_diffsel.pdf already exists.
In order to determine how variable the selection is in each group, we compute the beta diversity, using both the Shannon and Simpson indices. We look only at post-vaccination samples, excluding the duplicate 2009 age 53 sample and not examing the spike-in experiment. Here are the results:
for index in ['shannon', 'simpson']:
print(f"\n{index} beta diversity:")
print(avg_sel_df
.query('library == @avg_type')
.query('serum_vaccination != "pre"')
.query('serum_name_formatted != "2009-age-53-plus-2-months"')
.query('serum_group != "serum_mAb_spike"')
.groupby(['serum_group'])
.apply(dms_tools2.diffsel.beta_diversity,
samplecol='serum',
sitecol='site',
valcol='positive_diffsel',
index=index)
)
shannon beta diversity: serum_group Hensley_sera 1.230070 VIDD_sera 1.196409 antibody_lower_head 1.139691 antibody_region_B 1.452823 ferret 1.135527 dtype: float64 simpson beta diversity: serum_group Hensley_sera 1.600488 VIDD_sera 1.375323 antibody_lower_head 1.379706 antibody_region_B 1.594487 ferret 1.121852 dtype: float64
Large beta diversity indicates more variability among the samples in a group. As expected, the ferrets have lower diversity than the human sera. This is especially pronounced when using the Simpson index, which emphasizes the effect of strong-selection sites.
In the above section, we plotted the average for each serum. Here we also show some data for the replicates that went into this average.
Here we make line and zoomed logo plots for each replicate that goes into the average:
os.makedirs(config['avgdiffsel_reps_dir'], exist_ok=True)
for serum_group, df in avg_sel_df.groupby('serum_group'):
plotfile = os.path.join(config['avgdiffsel_reps_dir'],
f"{serum_group}_reps.pdf")
svgplotfile = os.path.splitext(plotfile)[0] + '.svg'
print(f"\n\n{'*' * 72}\n{serum_group}, saving to {plotfile} and {svgplotfile}\n")
fig, axes = dmslogo.facet_plot(
data=df.query('library != @avg_type'),
x_col='isite',
show_col='zoom_site',
gridrow_col='serum_name_formatted',
gridcol_col='library',
share_xlabel=True,
share_ylabel=True,
share_ylim_across_rows=share_ylim_across_rows[serum_group],
wspace=0.6,
draw_line_kwargs=dict(
height_col='positive_diffsel',
xtick_col='site',
ylabel='immune selection',
),
draw_logo_kwargs=dict(
letter_col='mutation',
letter_height_col='mutdiffsel',
xtick_col='site_label',
xlabel='site',
ylabel='immune selection',
clip_negative_heights=True,
),
)
display(fig)
fig.savefig(plotfile)
fig.savefig(svgplotfile)
plt.close(fig)
************************************************************************ Hensley_sera, saving to results/avgdiffsel/replicates/Hensley_sera_reps.pdf and results/avgdiffsel/replicates/Hensley_sera_reps.svg
************************************************************************ VIDD_sera, saving to results/avgdiffsel/replicates/VIDD_sera_reps.pdf and results/avgdiffsel/replicates/VIDD_sera_reps.svg
************************************************************************ antibody_lower_head, saving to results/avgdiffsel/replicates/antibody_lower_head_reps.pdf and results/avgdiffsel/replicates/antibody_lower_head_reps.svg
************************************************************************ antibody_region_B, saving to results/avgdiffsel/replicates/antibody_region_B_reps.pdf and results/avgdiffsel/replicates/antibody_region_B_reps.svg
************************************************************************ ferret, saving to results/avgdiffsel/replicates/ferret_reps.pdf and results/avgdiffsel/replicates/ferret_reps.svg
************************************************************************ serum_mAb_spike, saving to results/avgdiffsel/replicates/serum_mAb_spike_reps.pdf and results/avgdiffsel/replicates/serum_mAb_spike_reps.svg
Now we plot the correlation among the replicates that we retained for each serum. First, we make a tidy data frame with the correlations between all pairs of replicates for the same serum:
corr_df = []
serum_names = avg_sel_df['serum_name_formatted'].unique()
libraries = [lib for lib in avg_sel_df['library'].unique() if lib != avg_type]
for serum_name, serum_sel_df in avg_sel_df.groupby('serum_name_formatted'):
corr_df.append(
serum_sel_df
.query('library in @libraries')
.pivot_table(values='positive_diffsel',
columns='library',
index=['site'])
.corr(method='pearson')
.reset_index()
.melt(id_vars='library',
var_name='lib_B',
value_name='correlation')
.rename(columns={'library': 'lib_A'})
.query('lib_A <= lib_B')
.assign(serum_name=serum_name)
)
corr_df = (pd.concat(corr_df, ignore_index=True)
.assign(serum_name=lambda x: pd.Categorical(x['serum_name'],
serum_names,
ordered=True),
lib_A=lambda x: pd.Categorical(x['lib_A'],
libraries,
ordered=True),
lib_B=lambda x: pd.Categorical(x['lib_B'],
reversed(libraries),
ordered=True),
corr_str=lambda x: x['correlation'].apply('{:.2f}'.format)
)
)
print('Here are the first few lines of the tidy correlation data frame:')
display(HTML(corr_df.head().to_html(index=False)))
Here are the first few lines of the tidy correlation data frame:
lib_A | lib_B | correlation | serum_name | corr_str |
---|---|---|---|---|
lib1 | lib1 | 1.000000 | antibody-5A01 | 1.00 |
lib1 | lib2 | 0.762430 | antibody-5A01 | 0.76 |
lib2 | lib2 | 1.000000 | antibody-5A01 | 1.00 |
lib1 | lib3 | 0.705409 | antibody-5A01 | 0.71 |
lib2 | lib3 | 0.717227 | antibody-5A01 | 0.72 |
Now we use plotnine to plot a correlation matrix for each serum, and then save to a file as well as showing it:
ncol = 5
nsera = len(corr_df['serum_name'].unique())
corr_plot = (
ggplot(corr_df, aes('lib_A', 'lib_B',
fill='correlation', label='corr_str')) +
geom_tile(color='white', size=0.5) +
geom_text() +
facet_wrap('~ serum_name', ncol=ncol) +
theme(figure_size=(2.55 * ncol, 2.55 * math.ceil(nsera / ncol)),
panel_grid_major=element_blank()
) +
scale_fill_continuous(limits=(0, 1)) +
xlab('') +
ylab('')
)
_ = corr_plot.draw()
rep_corr_plot = os.path.join(config['avgdiffsel_reps_dir'], 'rep_corr_matrix.pdf')
print(f"Saving plot to {rep_corr_plot}")
corr_plot.save(rep_corr_plot)
Saving plot to results/avgdiffsel/replicates/rep_corr_matrix.pdf
In the following section, we will generate additional customized paper figures not created above.
These go in the following directory:
print(f"Putting additional figures in {config['figsdir']}")
os.makedirs(config['figsdir'], exist_ok=True)
Putting additional figures in results/figures
We will save each of these figures with the following extensions:
fig_extensions = ['.pdf', '.svg']
Now we are going to make versions of the zoomed logo plots that are slightly different than the ones above. We read in colors for specific mutations from a YAML file; these colors match those used to plot the neutralization curves. We also plot a slightly different set of sites: not just the "significant" ones, but also others of interest (these are sites of strong selection in ferrets).
First, get data frames with sites to zoom on and how to color them, and how to color specific mutations:
with open(config['figure_config']) as f:
fig_config = yaml.safe_load(f)
zoom_sites = []
mutation_colors = []
for figure, fig_d in fig_config['figures'].items():
# process zoom sites
site_color_map = collections.defaultdict(lambda: fig_config['default_logo_color'])
if 'site_colors' in fig_d:
for r, c in fig_d['site_colors'].items():
site_color_map[str(r)] = c
zoom_sites.append(
pd.DataFrame({'figure': figure,
'sera': fig_d['sera'],
'dummy': 0})
.merge(pd.DataFrame({'site': avg_sel_df['site'].unique(),
'dummy': 0}))
.drop(columns='dummy')
.assign(zoom=lambda x: x['site'].isin([str(r) for r in fig_d['sites']]),
color=lambda x: x['site'].map(site_color_map)
)
)
# process colors
for serum in fig_d['sera']:
if ('ignore_serum_virus' in fig_d) and (serum in fig_d['ignore_serum_virus']):
ignore_muts = fig_d['ignore_serum_virus'][serum]
else:
ignore_muts = []
muts = [mut for mut in fig_d['colors'] if (mut not in {'wt', 'syn'}) and
(mut not in ignore_muts)]
mutation_colors.append(pd.DataFrame(
{'figure': figure,
'sera': serum,
'site': [mut[1: -1] for mut in muts],
'mutation': [mut[-1] for mut in muts],
'color': [fig_d['colors'][mut][0] for mut in muts],
}))
zoom_sites = pd.concat(zoom_sites)
mutation_colors = pd.concat(mutation_colors)
Merge the zoom and mutation-color data frames with the data on all of the immune selection values:
colored_zoom_df = (
avg_sel_df
.query('library == @avg_type')
[['serum_name_formatted', 'isite', 'site', 'mutation',
'positive_diffsel', 'mutdiffsel', 'site_label']]
.rename(columns={'serum_name_formatted': 'sera'})
.merge(zoom_sites,
on=['sera', 'site'],
how='left')
.assign(zoom=lambda x: x['zoom'].fillna(False))
.merge(mutation_colors,
on=['figure', 'sera', 'site', 'mutation'],
how='left')
.assign(color=lambda x: x['color_x'].where(x['color_y'].isna(), x['color_y']))
.drop(columns=['color_x', 'color_y'])
)
Make the zoomed logo plots:
for figure, df in colored_zoom_df.groupby('figure'):
if 'sera_names' in fig_config['figures'][figure]:
name_map = fig_config['figures'][figure]['sera_names']
else:
name_map = {s: s.replace('-', ' ') for s in df.sera.unique()}
df = df.assign(sera_names=lambda x: pd.Categorical(x.sera.map(name_map),
x.sera.map(name_map).unique(),
ordered=True))
fig, axes = dmslogo.facet_plot(
data=df,
x_col='isite',
show_col='zoom',
gridrow_col='sera_names',
share_xlabel=True,
share_ylabel=True,
share_ylim_across_rows=share_ylim_across_rows[serum_group],
wspace=0.6,
rmargin=0.7,
draw_line_kwargs=dict(
height_col='positive_diffsel',
xtick_col='site',
ylabel='immune selection',
show_color=PALETTE[-2],
),
draw_logo_kwargs=dict(
letter_col='mutation',
letter_height_col='mutdiffsel',
color_col='color',
xtick_col='site_label',
xlabel='site',
ylabel='immune selection',
clip_negative_heights=True,
),
)
display(fig)
for ext in fig_extensions:
plotfile = os.path.join(config['figsdir'], f"{figure}_logo{ext}")
print(f"Saving figure to {plotfile}")
fig.savefig(plotfile)
plt.close(fig)
Saving figure to results/figures/2009_age_53_samples_logo.pdf Saving figure to results/figures/2009_age_53_samples_logo.svg
Saving figure to results/figures/Hensley_sera_logo.pdf Saving figure to results/figures/Hensley_sera_logo.svg
Saving figure to results/figures/VIDD_sera_logo.pdf Saving figure to results/figures/VIDD_sera_logo.svg
Saving figure to results/figures/antibody_lower_head_logo.pdf Saving figure to results/figures/antibody_lower_head_logo.svg
Saving figure to results/figures/antibody_region_B_logo.pdf Saving figure to results/figures/antibody_region_B_logo.svg
Saving figure to results/figures/antibody_spikein_logo.pdf Saving figure to results/figures/antibody_spikein_logo.svg
Saving figure to results/figures/ferret_logo.pdf Saving figure to results/figures/ferret_logo.svg
Plot replicate-replicate correlations for the sera in each figure:
for figure, fig_d in fig_config['figures'].items():
sera = fig_d['sera']
if len(sera) == 5:
ncol = 5
else:
ncol = 4
fig_corr_df = (corr_df
.query('serum_name in @sera')
.assign(serum_name=lambda x: x['serum_name'].str.replace('-', ' '))
)
fig_corr_plot = (
ggplot(fig_corr_df, aes('lib_A', 'lib_B',
fill='correlation', label='corr_str')) +
geom_tile(color='white', size=0.5) +
geom_text() +
facet_wrap('~ serum_name', ncol=ncol) +
theme(figure_size=(2.55 * min(len(sera), ncol), 2.55 * math.ceil(len(sera) / ncol)),
panel_grid_major=element_blank()
) +
scale_fill_continuous(limits=(0, 1)) +
xlab('') +
ylab('')
)
print(fig_corr_plot)
for ext in fig_extensions:
plotfile = os.path.join(config['figsdir'], f"{figure}_rep_corr{ext}")
print(f"Saving figure to {plotfile}")
fig_corr_plot.save(plotfile)
<ggplot: (8744359022719)> Saving figure to results/figures/VIDD_sera_rep_corr.pdf Saving figure to results/figures/VIDD_sera_rep_corr.svg
<ggplot: (8744358702870)> Saving figure to results/figures/2009_age_53_samples_rep_corr.pdf Saving figure to results/figures/2009_age_53_samples_rep_corr.svg
<ggplot: (8744656139766)> Saving figure to results/figures/Hensley_sera_rep_corr.pdf Saving figure to results/figures/Hensley_sera_rep_corr.svg
<ggplot: (8744399698780)> Saving figure to results/figures/ferret_rep_corr.pdf Saving figure to results/figures/ferret_rep_corr.svg
<ggplot: (8744655353541)> Saving figure to results/figures/antibody_region_B_rep_corr.pdf Saving figure to results/figures/antibody_region_B_rep_corr.svg
<ggplot: (8744359031374)> Saving figure to results/figures/antibody_lower_head_rep_corr.pdf Saving figure to results/figures/antibody_lower_head_rep_corr.svg
<ggplot: (-9223363292486854961)> Saving figure to results/figures/antibody_spikein_rep_corr.pdf Saving figure to results/figures/antibody_spikein_rep_corr.svg
Plot the percent infectivity remaining for the library for each selection.
def format_label(x):
if x == 100:
lab = '100'
else:
lab = '{0:#.2g}'.format(x)
if lab[-1] == '.':
lab = lab[: -1]
return lab + '%'
for figure, fig_d in fig_config['figures'].items():
sera = fig_d['sera']
df = (selections
.query('retained')
.query('serum_name_formatted in @sera')
[['serum_name_formatted', 'library', 'percent_infectivity']]
.assign(serum=lambda x: x['serum_name_formatted'].str.replace('-', ' '),
label=lambda x: x['percent_infectivity'].apply(format_label)
)
)
ymax = max(100, 10**math.ceil(math.log10(df['percent_infectivity'].max())))
ymin = min(0.1, 10**math.floor(math.log10(df['percent_infectivity'].min())))
yextent = math.log10(ymax / ymin)
if df['percent_infectivity'].max() > 50:
nudge_y = -0.1 * yextent
else:
nudge_y = 0.1 * yextent
if len(sera) == 5:
ncol = 5
else:
ncol = 4
plot_percent_infectivity = (
ggplot(df, aes('library', 'percent_infectivity', label='label', color='library')) +
geom_point(size=3) +
geom_text(size=10, nudge_y=nudge_y) +
facet_wrap('~ serum', ncol=ncol) +
scale_y_log10(limits=(ymin, ymax), name='percent infectivity') +
theme(figure_size=(2.55 * min(ncol, len(sera)), 2.55 * math.ceil(len(sera) / ncol))) +
scale_color_manual(values=PALETTE[1: ], guide=False)
)
print(plot_percent_infectivity)
for ext in fig_extensions:
plotfile = os.path.join(config['figsdir'], f"{figure}_percent_infectivity{ext}")
print(f"Saving figure to {plotfile}")
plot_percent_infectivity.save(plotfile)
<ggplot: (-9223363292500806326)> Saving figure to results/figures/VIDD_sera_percent_infectivity.pdf Saving figure to results/figures/VIDD_sera_percent_infectivity.svg
<ggplot: (-9223363292196862478)> Saving figure to results/figures/2009_age_53_samples_percent_infectivity.pdf Saving figure to results/figures/2009_age_53_samples_percent_infectivity.svg
<ggplot: (8744664887837)> Saving figure to results/figures/Hensley_sera_percent_infectivity.pdf Saving figure to results/figures/Hensley_sera_percent_infectivity.svg
<ggplot: (-9223363292196645544)> Saving figure to results/figures/ferret_percent_infectivity.pdf Saving figure to results/figures/ferret_percent_infectivity.svg
<ggplot: (8744665275007)> Saving figure to results/figures/antibody_region_B_percent_infectivity.pdf Saving figure to results/figures/antibody_region_B_percent_infectivity.svg
<ggplot: (8744405357695)> Saving figure to results/figures/antibody_lower_head_percent_infectivity.pdf Saving figure to results/figures/antibody_lower_head_percent_infectivity.svg
<ggplot: (8744367265649)> Saving figure to results/figures/antibody_spikein_percent_infectivity.pdf Saving figure to results/figures/antibody_spikein_percent_infectivity.svg