Analysis of serum mutational antigenic profiling¶

This Python Jupyter notebook analyzes mutational antigenic profiling of serum against virus carrying the A/Perth/2009 (H3N2) HA.

Configure analysis¶

We first configure the analysis by importing packages and getting information on the samples that we will analyze.

Import Python modules / packages¶

Import modules / packages. In particular, we use:

plotnine for ggplot2-like plotting syntax
dmslogo to draw sequence logo plots
dms_tools2 for much of the analysis

In [1]:

import os
import math
import collections
import itertools
import warnings
import subprocess

import yaml
import pandas as pd

from IPython.display import display, HTML

import matplotlib.pyplot as plt

from plotnine import *

import dms_tools2
from dms_tools2.ipython_utils import showPDF
from dms_tools2.plot import COLOR_BLIND_PALETTE_GRAY as PALETTE

import dmslogo

Turn on interactive matplotlib plotting:

In [2]:

plt.ion()  # turn on interactive plotting

Print versions of Bloom lab software:

In [3]:

print(f"Using dms_tools2 version {dms_tools2.__version__}")
print(f"Using dmslogo version {dmslogo.__version__}")

Using dms_tools2 version 2.4.16
Using dmslogo version 0.2.3

Set data frame display options:

In [4]:

pd.set_option('display.max_colwidth', 200)

Ignore warnings that can clutter notebook:

In [5]:

warnings.simplefilter('ignore')

Get config variables¶

Read the variables from the config file. These variables specify input / output paths and key parameters for analysis:

In [6]:

configfile = 'config.yaml'

with open(configfile) as f:
    config = yaml.safe_load(f)

print(f"Read the following configuration from {configfile}:")
display(HTML(pd.Series(config)
             .to_frame('value')
             .to_html()
             ))

Read the following configuration from config.yaml:

	value
serum_info	data/serum_info.yaml
sample_list	data/sample_list.csv
refseq	data/Perth09_HA_reference.fa
renumbering_scheme	data/H3renumbering_scheme.csv
seq_data_source	SRA_accession
fastq_dir	results/FASTQ_files
fastq_dump	fastq-dump
ascp	/app/aspera-connect/3.7.5/bin/ascp
asperakey	/app/aspera-connect/3.7.5/etc/asperaweb_id_dsa.openssh
alignspecs	1,285,38,40 286,567,33,34 568,852,34,30 853,1137,34,31 1138,1422,36,29 1423,1701,39,44
R1trim	200
R2trim	165
ncpus	16
use_existing	yes
avg_type	median
natseqs	data/human_H3N2_HA_2007-2018.fasta.gz
neut_config	data/neut_assays/neut_config.yaml
figure_config	data/figure_config.yaml
map_on_struct_template	map_on_struct_template.ipynb
pdb_id	4o5n
site_to_pdb	data/H3_site_to_PDB_4o5n.csv
countsdir	results/codoncounts
renumbcountsdir	results/renumbered_codoncounts
selectiontabledir	results/selection_tables
diffseldir	results/diffsel
avgdiffseldir	results/avgdiffsel
avgdiffsel_sigsites_dir	results/avgdiffsel/sigsites
avgdiffsel_zoom_dir	results/avgdiffsel/zoomed_plots
avgdiffsel_reps_dir	results/avgdiffsel/replicates
avgdiffsel_full_dir	results/avgdiffsel/full_logo_plots
natseqs_dir	results/natseqs
neutresultsdir	results/neutralization_assays
structsdir	results/structs
figsdir	results/figures
finalfigsdir	results/figures/final
notebookdir	results/notebooks

Get sera information¶

Read information on the sera that are being mapped.

For each serum sample below, we get:

serum: the abbreviation used for that serum in the experiments
serum_description: description of the serum
serum_group: group to which the serum belongs
serum_name: a short informative name
serum_species: species from which the serum is derived
serum_vaccination: any relevant information on vaccination status (some serum are paired pre- and post-vaccination)

In [7]:

with open(config['serum_info']) as f:
    sera = (pd.DataFrame(yaml.safe_load(f))
            .transpose()
            .add_prefix('serum_')
            .rename_axis('serum')
            .reset_index()
            )

assert len(sera) == len(sera['serum'].unique()), 'sera not unique'

print(f"Read the following sera information from {config['serum_info']}:")
display(HTML(sera.to_html(index=False)))

Read the following sera information from data/serum_info.yaml:

serum	serum_description	serum_group	serum_name	serum_species	serum_vaccination
mock	no-serum control	mock	no-serum	NaN	NaN
plasmid	plasmid used as control to estimate sequencing error rate	plasmid	plasmid	NaN	NaN
5A01	site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley	antibody_region_B	antibody-5A01	NaN	NaN
3C04	site B-targeting monoclonal antibody 3C04 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C04	NaN	NaN
3C06	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN
4C01	site B-targeting monoclonal antibody 4C01 from Seth Zost and Scott Hensley	antibody_region_B	antibody-4C01	NaN	NaN
4F03	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN
1C04	lower head-targeting monoclonal antibody 1C04 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-1C04	NaN	NaN
f9267neg	Lakdawala lab ferret, before infection	ferret	Pitt-1-preinf	ferret	pre
f9267d23	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-1-postinf	ferret	post
f9435neg	Lakdawala lab ferret, before infection	ferret	Pitt-2-preinf	ferret	pre
f9435d23	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-2-postinf	ferret	post
f9437neg	Lakdawala lab ferret, before infection	ferret	Pitt-3-preinf	ferret	pre
f9437d23	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-3-postinf	ferret	post
WHOCCPerth	ferret infected by Melbourne WHO CC with their Perth/2009 strain	ferret	WHO-Perth2009	ferret	post
WHOCCVic	ferret infected by Melbourne WHO CC with their Victoria/2011 strain	ferret	WHO-Victoria2011	ferret	post
VIDD1	collected at Hutch in 2/2010 from person born in 1989	VIDD_sera	2010-age-21	human	NaN
VIDD2	collected at Hutch in 1/2009 from person born in 1956	VIDD_sera	2009-age-53	human	NaN
VIDD3	second sample collected at Hutch in 3/2009 from person born in 1956	VIDD_sera	2009-age-53-plus-2-months	human	NaN
VIDD4	collected at Hutch in 11/2008 from person born in 1945	VIDD_sera	2009-age-64	human	NaN
VIDD5	collected at Hutch in 6/2009 from person born in 1944	VIDD_sera	2009-age-65	human	NaN
557v1	collected before 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-prevacc	human	pre
557v2	collected after 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-vacc	human	post
574v1	collected before 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-prevacc	human	pre
574v2	collected after 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-vacc	human	post
589v1	collected before 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-prevacc	human	pre
589v2	collected after 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-vacc	human	post
571v1	collected before 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-prevacc	human	pre
571v2	collected after 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-vacc	human	post
VIDD5andlow4F03	collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency	serum_mAb_spike	2009-age-65-with-low-4F03	human	NaN
VIDD5andmid4F03	collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency	serum_mAb_spike	2009-age-65-with-mid-4F03	human	NaN
VIDD5andhi4F03	collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency	serum_mAb_spike	2009-age-65-with-hi-4F03	human	NaN

Get list of samples¶

Read information about all of the samples that we have deep sequenced.

For each sample, we have information on the serum to which it corresponds, the virus library, the date of sequencing, the serum dilution, the percent infectivity, and (depending on the value of seq_data_source in the config file) either the Sequence Read Archive (SRA) accession or the location of the R1 files on the Hutch server:

In [8]:

samples = pd.read_csv(config['sample_list'])

# don't need any R1 column if we are using SRA accession
if config['seq_data_source'] == 'SRA_accession':
    samples = samples.drop(columns='R1', errors='ignore')

assert len(samples) == len(samples['sample'].unique()), 'non-unique samples'

print(f"Read the following samples from {config['sample_list']}:")
display(HTML(samples.to_html(index=False)))

Read the following samples from data/sample_list.csv:

sample	serum	library	date	serum_dilution	percent_infectivity	SRA_accession
L2-3C06	3C06	lib2	2018-07-20	0.55	0.0100	SRR7974483
L4-3C06	3C06	lib1	2018-09-12	0.1	0.0688	SRR7974503
L2-3C04	3C04	lib2	2018-07-20	0.65	0.0331	SRR7974490
L4-3C04	3C04	lib1	2018-09-12	0.1	0.0777	SRR7974499
L2-4C01	4C01	lib2	2018-07-20	0.60	0.0185	SRR7974491
L4-4C01	4C01	lib1	2018-09-12	0.2	0.1190	SRR7974488
L2-1C04	1C04	lib2	2018-07-20	16.0	10.3190	SRR7974492
L3-1C04	1C04	lib3	2018-09-12	18.0	7.9040	SRR7974497
Lib2mock-mAb	mock	lib2	2018-07-20	NaN	100.0000	SRR7974494
Lib3mock-mAb	mock	lib3	2018-09-12	NaN	100.0000	SRR7974507
Lib4mock-mAb	mock	lib1	2018-09-12	NaN	100.0000	SRR7974486
WTplasmid-mAb-A	plasmid	lib2	2018-07-20	NaN	NaN	SRR7974495
WTplasmid-mAb-B	plasmid	lib3	2018-09-12	NaN	NaN	SRR7974502
WTplasmid-mAb-C	plasmid	lib1	2018-09-12	NaN	NaN	SRR7974484
L4-589v1	589v1	lib1	2018-11-14	0.0185	13.8200	SRR8875142
L4-571v1	571v1	lib1	2018-11-14	0.0185	19.6200	SRR8875143
L4-571v2	571v2	lib1	2018-11-14	0.0185	5.4800	SRR8875144
L4-574v1	574v1	lib1	2018-11-14	0.0185	9.4800	SRR8875145
L4-WHOCCPerth	WHOCCPerth	lib1	2018-11-14	0.0185	2.3200	SRR8875138
L4-589v2	589v2	lib1	2018-11-14	0.000875	4.5800	SRR8875139
L4-557v1	557v1	lib1	2018-11-14	0.0075	6.9000	SRR8875140
L4-f9267neg	f9267neg	lib1	2018-11-14	0.00075	100.0000	SRR8875141
L4-f9267d23	f9267d23	lib1	2018-11-14	0.00075	4.3600	SRR8875136
L4-f9435neg	f9435neg	lib1	2018-11-14	0.001	100.0000	SRR8875137
L4-f9437neg	f9437neg	lib1	2018-11-14	0.00225	100.0000	SRR8875168
L4-f9437d23	f9437d23	lib1	2018-11-14	0.00225	8.7700	SRR8875169
Lib4mock-A	mock	lib1	2018-11-14	NaN	100.0000	SRR8875170
WTplasmid-A	plasmid	lib1	2018-11-14	NaN	NaN	SRR8875171
L5-f9267neg	f9267neg	lib2	2019-01-16	0.00075	100.0000	SRR8875172
L5-f9435neg	f9435neg	lib2	2019-01-16	0.0025	100.0000	SRR8875173
L5-f9437neg	f9437neg	lib2	2019-01-16	0.00625	100.0000	SRR8875174
L5-VIDD1	VIDD1	lib2	2019-01-16	0.0045	5.9500	SRR8875175
L5-WHOCCPerth	WHOCCPerth	lib2	2019-01-16	0.0075	1.8800	SRR8875166
L5-f9267d23	f9267d23	lib2	2019-01-16	0.000175	5.3500	SRR8875167
L5-WHOCCVic	WHOCCVic	lib2	2019-01-16	0.0025	3.9000	SRR8875191
L5-VIDD5	VIDD5	lib2	2019-01-16	0.000875	5.2400	SRR8875190
L5-VIDD4	VIDD4	lib2	2019-01-16	0.001675	0.8700	SRR8875193
L5-VIDD2	VIDD2	lib2	2019-01-16	0.00375	2.0300	SRR8875192
L5-VIDD3	VIDD3	lib2	2019-01-16	0.0045	1.7300	SRR8875187
L5-f9435d23	f9435d23	lib2	2019-01-16	0.000625	1.8800	SRR8875186
L5-f9437d23	f9437d23	lib2	2019-01-16	0.001375	1.7000	SRR8875189
Lib5mock-A	mock	lib2	2019-01-16	NaN	100.0000	SRR8875188
WTplasmid-B	plasmid	lib2	2019-01-16	NaN	NaN	SRR8875195
L4-f9435d23	f9435d23	lib1	2019-01-16	0.0025	3.4500	SRR8875194
L4-VIDD4	VIDD4	lib1	2019-01-16	0.004	10.7400	SRR8875112
L4-VIDD3	VIDD3	lib1	2019-01-16	0.0075	5.9800	SRR8875113
L4-WHOCCVic	WHOCCVic	lib1	2019-01-16	0.00625	1.2600	SRR8875110
L4-VIDD1	VIDD1	lib1	2019-01-16	0.0175	4.0800	SRR8875111
L4-VIDD2	VIDD2	lib1	2019-01-16	0.00875	3.7400	SRR8875108
Lib4mock-B	mock	lib1	2019-01-16	NaN	100.0000	SRR8875109
WTplasmid-C	plasmid	lib1	2019-01-16	NaN	NaN	SRR8875106
L5-589v1	589v1	lib2	2019-03-06	0.0175	20.3200	SRR8875107
L5-557v2	557v2	lib2	2019-03-06	0.0005	0.8100	SRR8875114
L5-571v2	571v2	lib2	2019-03-06	0.01	1.0900	SRR8875115
L5-574v1	574v1	lib2	2019-03-06	0.0125	7.5700	SRR8875119
L5-574v2	574v2	lib2	2019-03-06	0.000875	1.1100	SRR8875118
L5-4F03-c1	4F03	lib2	2019-03-06	0.3	22.9900	SRR8875117
L5-4F03-c3	4F03	lib2	2019-03-06	1.5	3.7100	SRR8875116
L6-589v2	589v2	lib3	2019-03-06	0.000375	4.0600	SRR8875123
L6-557v1	557v1	lib3	2019-03-06	0.005	1.5700	SRR8875122
L6-557v2	557v2	lib3	2019-03-06	0.0005	1.6400	SRR8875121
L6-VIDD2	VIDD2	lib3	2019-03-06	0.00375	6.2200	SRR8875120
L6-VIDD3	VIDD3	lib3	2019-03-06	0.00625	1.1200	SRR8875125
L6-VIDD1	VIDD1	lib3	2019-03-06	0.0075	3.6800	SRR8875124
L6-f9267d23	f9267d23	lib3	2019-03-06	0.000375	1.6900	SRR8875132
L6-f9435d23	f9435d23	lib3	2019-03-06	0.00125	3.0400	SRR8875133
L6-WHOCCPerth	WHOCCPerth	lib3	2019-03-06	0.0075	4.1000	SRR8875134
L6-WHOCCVic	WHOCCVic	lib3	2019-03-06	0.0025	3.6200	SRR8875135
Lib5mock-B	mock	lib2	2019-03-06	NaN	100.0000	SRR8875128
Lib6mock-A	mock	lib3	2019-03-06	NaN	100.0000	SRR8875129
WTplasmid-D	plasmid	lib2	2019-03-06	NaN	NaN	SRR8875130
WTplasmid-E	plasmid	lib3	2019-03-06	NaN	NaN	SRR8875131
L5-589v2	589v2	lib2	2019-03-06	0.000375	4.6800	SRR8875126
L5-557v1	557v1	lib2	2019-03-06	0.005	2.3100	SRR8875127
L5-571v1	571v1	lib2	2019-03-06	0.0185	11.8600	SRR8875149
L5-5A01	5A01	lib2	2019-03-06	0.025	1.3600	SRR8875148
L5-4F03-c2	4F03	lib2	2019-03-06	0.7	4.7100	SRR8875151
L5-VIDD5-low4F03	VIDD5andlow4F03	lib2	2019-03-06	0.000875+0.3	5.0000	SRR8875150
L5-VIDD5-mid4F03	VIDD5andmid4F03	lib2	2019-03-06	0.000875+0.7	0.4600	SRR8875153
L5-VIDD5-hi4F03	VIDD5andhi4F03	lib2	2019-03-06	0.000875+1.5	0.0400	SRR8875152
L4-574v2	574v2	lib1	2019-03-06	0.002	1.5400	SRR8875155
L4-557v2	557v2	lib1	2019-03-06	0.00125	1.5300	SRR8875154
L4-5A01	5A01	lib1	2019-03-06	0.07	4.3200	SRR8875147
L4-4F03-c2	4F03	lib1	2019-03-06	1.0	3.4700	SRR8875146
L4-4F03-c3	4F03	lib1	2019-03-06	2.0	0.7800	SRR8875160
Lib4mock-C	mock	lib1	2019-03-06	NaN	100.0000	SRR8875161
WTplasmid-F	plasmid	lib1	2019-03-06	NaN	NaN	SRR8875158
L6-589v1	589v1	lib3	2019-03-26	0.0175	18.4000	SRR8875159
L6-571v1	571v1	lib3	2019-03-26	0.0185	27.7400	SRR8875164
L6-571v2	571v2	lib3	2019-03-26	0.005	7.7900	SRR8875165
L6-574v1	574v1	lib3	2019-03-26	0.015	3.3000	SRR8875162
L6-574v2	574v2	lib3	2019-03-26	0.00075	1.9200	SRR8875163
L6-f9267neg	f9267neg	lib3	2019-03-26	0.00075	100.0000	SRR8875156
L6-f9435neg	f9435neg	lib3	2019-03-26	0.0025	100.0000	SRR8875157
L6-f9437neg	f9437neg	lib3	2019-03-26	0.00625	100.0000	SRR8875183
L6-5A01	5A01	lib3	2019-03-26	0.025	1.0200	SRR8875182
Lib6mock-B	mock	lib3	2019-03-26	NaN	100.0000	SRR8875181
L4-4F03-c1	4F03	lib1	2019-03-26	0.3	28.2400	SRR8875180
L6-VIDD5	VIDD5	lib3	2019-03-26	0.002	5.5800	SRR8875179
L6-VIDD4	VIDD4	lib3	2019-03-26	0.00375	0.2600	SRR8875178
L6-f9437d23	f9437d23	lib3	2019-03-26	0.00175	5.7800	SRR8875177
L6-4F03-c1	4F03	lib3	2019-03-26	0.3	23.9100	SRR8875176
L6-4F03-c3	4F03	lib3	2019-03-26	1.4	0.8400	SRR8875185
L4-VIDD5	VIDD5	lib1	2019-03-26	0.003375	1.7700	SRR8875184
L4-VIDD5-low4F03	VIDD5andlow4F03	lib1	2019-03-26	0.003375+0.3	0.0300	SRR8875196
L4-VIDD5-mid4F03	VIDD5andmid4F03	lib1	2019-03-26	0.003375+1	0.0037	SRR8875197
L4-VIDD5-hi4F03	VIDD5andhi4F03	lib1	2019-03-26	0.003375+2	0.0016	SRR8875198
L6-4F03-c2	4F03	lib3	2019-03-26	0.75	7.1600	SRR8875199
L6-VIDD5-low4F03	VIDD5andlow4F03	lib3	2019-03-26	0.002+0.3	0.2040	SRR8875200
L6-VIDD5-mid4F03	VIDD5andmid4F03	lib3	2019-03-26	0.002+0.75	0.0077	SRR8875201
L6-VIDD5-hi4F03	VIDD5andhi4F03	lib3	2019-03-26	0.002+1.4	0.0032	SRR8875202
Lib4mock-D	mock	lib1	2019-03-26	NaN	100.0000	SRR8875203
WTplasmid-G	plasmid	lib1	2019-03-26	NaN	NaN	SRR8875204
WTplasmid-H	plasmid	lib3	2019-03-26	NaN	NaN	SRR8875205

Check that the serum for all samples are in our set of sera:

In [9]:

unknown_sera = set(samples['serum']) - set(sera['serum'])
if unknown_sera:
    raise ValueError(f"samples include unknown sera: {unknown_sera}")
else:
    print('We have information for all sera used for the samples.')

We have information for all sera used for the samples.

Download deep sequencing data if needed¶

The config file specifies whether we get the data from existing R1 files on the Hutch server, or download the data from the Sequence Read Archive (SRA) using dms_tools2.sra.fastqFromSRA:

In [10]:

if config['seq_data_source'] == 'SRA_accession':
    
    # are we using Aspera for rapid downloads?
    if config['ascp'] and config['asperakey']:
        aspera = (config['ascp'], config['asperakey'])
    else:
        aspera = None
        
    # do the downloads
    print(f"Downloading FASTQ files to {config['fastq_dir']} (takes a while)...")
    os.makedirs(config['fastq_dir'], exist_ok=True)
    samples = samples.rename(columns={'sample': 'name',
                                      'SRA_accession': 'run'})
    dms_tools2.sra.fastqFromSRA(
            samples=samples,
            fastq_dump=config['fastq_dump'],
            fastqdir=config['fastq_dir'],
            aspera=aspera,
            ncpus=config['ncpus'],
            )
    samples = samples.rename(columns={'name': 'sample',
                                      'run': 'SRA_accession'})
    print('Completed downloading files.')
        
elif config['seq_data_source'] != 'R1':
    raise ValueError('invalid value of `seq_data_source`')

Downloading FASTQ files to results/FASTQ_files (takes a while)...
Completed downloading files.

Align sequencing and count mutations¶

The samples were sequenced using barcoded subamplicon sequencing to obtain high accuracy. So we need to process these data to determine the counts of each codon mutation in each sample.

First, create the directory used for the results of this part of the analysis:

In [11]:

os.makedirs(config['countsdir'], exist_ok=True)

print(f"Results from counting mutations go to {config['countsdir']}")

Results from counting mutations go to results/codoncounts

Run `dms2_batch_bcsubamp`¶

We process the sequencing data by using dms2_batch_bcsubamp to generate "codon counts" files that give the counts of each codon mutation for each sample.

First, write the batch file used as input by this program:

In [12]:

bcsubamp_batchfile = os.path.join(config['countsdir'], 'batch.csv')

(samples
 .rename(columns={'sample': 'name'})
 [['name', 'R1']]
 .to_csv(bcsubamp_batchfile)
 )

print(f"Creating batch file {bcsubamp_batchfile}")

Creating batch file results/codoncounts/batch.csv

Now run the program:

In [13]:

cmds = ['dms2_batch_bcsubamp',
        '--batchfile', bcsubamp_batchfile,
        '--refseq', config['refseq'],
        '--alignspecs'] + config['alignspecs'].split() + [
        '--R1trim', str(config['R1trim']),
        '--R2trim', str(config['R2trim']),
        '--outdir', config['countsdir'],
        '--fastqdir', config['fastq_dir'],
        '--summaryprefix', 'summary',
        '--ncpus', str(config['ncpus']),
        '--use_existing', config['use_existing'],
        ]

print(f"Running dms2_batch_bcsubamp with this command:\n{' '.join(cmds)}")
subprocess.check_output(cmds)
print('Completed running dms2_batch_bcsubamp.')

Running dms2_batch_bcsubamp with this command:
dms2_batch_bcsubamp --batchfile results/codoncounts/batch.csv --refseq data/Perth09_HA_reference.fa --alignspecs 1,285,38,40 286,567,33,34 568,852,34,30 853,1137,34,31 1138,1422,36,29 1423,1701,39,44 --R1trim 200 --R2trim 165 --outdir results/codoncounts --fastqdir results/FASTQ_files --summaryprefix summary --ncpus 16 --use_existing yes
Completed running dms2_batch_bcsubamp.

Confirm that all the expected counts files exist:

In [14]:

assert all(os.path.isfile(f) for f in
           config['countsdir'] + '/' + samples['sample'] + '_codoncounts.csv'
           ), 'missing counts files'

Plot sequencing and mutation counts summaries¶

Running dms2_batch_bcsubamp created some summary plots. The prefix on those plots should be as follows:

In [15]:

countsplotprefix = os.path.join(config['countsdir'], 'summary_')

Reads and barcodes per sample¶

Total sequencing reads per sample:

In [16]:

showPDF(countsplotprefix + 'readstats.pdf')

Distribution of sequencing reads per barcode on subamplicons:

In [17]:

showPDF(countsplotprefix + 'readsperbc.pdf')

Number of barcoded subamplicons that align and have sufficient reads:

In [18]:

showPDF(countsplotprefix + 'bcstats.pdf')

Coverage across gene¶

Depth of valid barcoded subamplicons covering each site in the gene:

In [19]:

showPDF(countsplotprefix + 'depth.pdf')

Mutation frequencies¶

The average mutation frequency for each sample, stratifying by codon mutation type:

In [20]:

showPDF(countsplotprefix + 'codonmuttypes.pdf')

Average mutation frequency per sample, stratifying by number of nucleotide changes per codon mutation:

In [21]:

showPDF(countsplotprefix + 'codonntchanges.pdf')

Per-codon mutation frequencies across all sites in gene for each sample:

In [22]:

showPDF(countsplotprefix + 'mutfreq.pdf')

Check for oxidative damage¶

Sometimes there is oxidative damage which manifests as an enrichment of G->T and C->A mutations among the single-nucleotide codon mutations. Check for this by plotting frequencies of different single-nucleotide mutation types:

In [23]:

showPDF(countsplotprefix + 'singlentchanges.pdf')

Renumber to standard HA numbering¶

The above alignments use sequential 1, 2, ... numbering of the codons. This is not the standard numbering scheme used for HA, so we use dms_tools2.utils.renumbSites to renumber to the standard HA numbering scheme:

In [24]:

dms_tools2.utils.renumberSites(
    renumbfile=config['renumbering_scheme'],
    infiles=list(config['countsdir'] + '/' + samples['sample'] +
                 '_codoncounts.csv'),
    missing='drop',
    outdir=config['renumbcountsdir'])

assert all(os.path.isfile(f) for f in
           config['renumbcountsdir'] + '/' + samples['sample'] +
           '_codoncounts.csv'
           ), 'missing renumbered counts files'

print(f"Renumbered codon counts are in {config['renumbcountsdir']}")

Renumbered codon counts are in results/renumbered_codoncounts

Compute immune selection on mutations¶

We will now determine the immune selection on each mutation by comparing its frequency in each serum-selected sample to an appropriate mock-selected control. Specifically, we will quantify the immune selection as the differential selection (diffsel), which is essentially the log enrichment of each mutation relative to wildtype in the immune-selected versus mock sample. See Doud et al (2017) for the paper introducing this metric, and see here for a more detailed description.

Samples to compare for each selection¶

In order to quantify the immune selection, we need to compare each selected sample to the appropriate controls. Specifically, for each selection, we define three samples:

sel: the immune-selected sample
mock: the appropriate mock-selected control for that date and library
err: the appropriate wildtype plasmid control to estimate sequencing error rates.

We call each combination of three such samples a "selection". For each selection, we also record:

serum_name_formatted: a version of serum_name that also indicates species if non-human
name: a string indicating the library and the percent infectivity remaining. So for instance, lib1-5 indicates library 1 with 5% infectivity remaining.

Below we construct the data frame with this information on the selections:

In [25]:

selections = (

    # get all immune selected (sel) samples
    samples
    .query('(serum != "mock") & (serum != "plasmid")')
    .rename(columns={'sample': 'sel'})

    # add mock sample for that date and library
    .merge(samples
           .query('serum == "mock"')
           [['sample', 'library', 'date']]
           .rename(columns={'sample': 'mock'})
           )

    # add plasmid sample as error control (err) for that date and library
    .merge(samples
           .query('serum == "plasmid"')
           [['sample', 'library', 'date']]
           .rename(columns={'sample': 'err'})
           )

    # add information about sera
    .merge(sera, validate='many_to_one')

    # add informative names for serum and samples
    .assign(
        serum_name_formatted=lambda x:
            x['serum_species'].map(lambda s: '' if pd.isnull(s) or s == 'human'
                                   else s + '-') + x['serum_name'],
        name_formatted=lambda x:
            x['library'] + ', ' + x['percent_infectivity'].apply(
                dms_tools2.utils.sigFigStr, nsig=2) + '% infectivity',
        name=lambda x:
            x['library'] + '-' + x['percent_infectivity'].apply(
                dms_tools2.utils.sigFigStr, nsig=2)
        )

    # drop unneeded columns
    .drop(['R1', 'R2', 'SRA_accession'], axis='columns', errors='ignore')

    # re-order columns a bit so key ones are displayed first
    .set_index(['serum_name_formatted', 'name', 'sel', 'mock', 'err'])
    .reset_index()
    )

# make sure no duplicated serum / names
assert len(selections) == len(selections.groupby(['serum_name_formatted',
                                                  'name']))

print(f"Tabulated information for {len(selections)} selections:")
display(HTML(selections.to_html(index=False)))

Tabulated information for 92 selections:

serum_name_formatted	name	sel	mock	err	serum	library	date	serum_dilution	percent_infectivity	serum_description	serum_group	serum_name	serum_species	serum_vaccination	name_formatted
antibody-3C06	lib2-0.010	L2-3C06	Lib2mock-mAb	WTplasmid-mAb-A	3C06	lib2	2018-07-20	0.55	0.0100	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN	lib2, 0.010% infectivity
antibody-3C06	lib1-0.069	L4-3C06	Lib4mock-mAb	WTplasmid-mAb-C	3C06	lib1	2018-09-12	0.1	0.0688	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN	lib1, 0.069% infectivity
antibody-3C04	lib2-0.033	L2-3C04	Lib2mock-mAb	WTplasmid-mAb-A	3C04	lib2	2018-07-20	0.65	0.0331	site B-targeting monoclonal antibody 3C04 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C04	NaN	NaN	lib2, 0.033% infectivity
antibody-3C04	lib1-0.078	L4-3C04	Lib4mock-mAb	WTplasmid-mAb-C	3C04	lib1	2018-09-12	0.1	0.0777	site B-targeting monoclonal antibody 3C04 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C04	NaN	NaN	lib1, 0.078% infectivity
antibody-4C01	lib2-0.018	L2-4C01	Lib2mock-mAb	WTplasmid-mAb-A	4C01	lib2	2018-07-20	0.60	0.0185	site B-targeting monoclonal antibody 4C01 from Seth Zost and Scott Hensley	antibody_region_B	antibody-4C01	NaN	NaN	lib2, 0.018% infectivity
antibody-4C01	lib1-0.12	L4-4C01	Lib4mock-mAb	WTplasmid-mAb-C	4C01	lib1	2018-09-12	0.2	0.1190	site B-targeting monoclonal antibody 4C01 from Seth Zost and Scott Hensley	antibody_region_B	antibody-4C01	NaN	NaN	lib1, 0.12% infectivity
antibody-1C04	lib2-10	L2-1C04	Lib2mock-mAb	WTplasmid-mAb-A	1C04	lib2	2018-07-20	16.0	10.3190	lower head-targeting monoclonal antibody 1C04 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-1C04	NaN	NaN	lib2, 10% infectivity
antibody-1C04	lib3-7.9	L3-1C04	Lib3mock-mAb	WTplasmid-mAb-B	1C04	lib3	2018-09-12	18.0	7.9040	lower head-targeting monoclonal antibody 1C04 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-1C04	NaN	NaN	lib3, 7.9% infectivity
2015-age-48-prevacc	lib1-14	L4-589v1	Lib4mock-A	WTplasmid-A	589v1	lib1	2018-11-14	0.0185	13.8200	collected before 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-prevacc	human	pre	lib1, 14% infectivity
2015-age-48-prevacc	lib2-20	L5-589v1	Lib5mock-B	WTplasmid-D	589v1	lib2	2019-03-06	0.0175	20.3200	collected before 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-prevacc	human	pre	lib2, 20% infectivity
2015-age-48-prevacc	lib3-18	L6-589v1	Lib6mock-B	WTplasmid-H	589v1	lib3	2019-03-26	0.0175	18.4000	collected before 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-prevacc	human	pre	lib3, 18% infectivity
2015-age-49-prevacc	lib1-20	L4-571v1	Lib4mock-A	WTplasmid-A	571v1	lib1	2018-11-14	0.0185	19.6200	collected before 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-prevacc	human	pre	lib1, 20% infectivity
2015-age-49-prevacc	lib2-12	L5-571v1	Lib5mock-B	WTplasmid-D	571v1	lib2	2019-03-06	0.0185	11.8600	collected before 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-prevacc	human	pre	lib2, 12% infectivity
2015-age-49-prevacc	lib3-28	L6-571v1	Lib6mock-B	WTplasmid-H	571v1	lib3	2019-03-26	0.0185	27.7400	collected before 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-prevacc	human	pre	lib3, 28% infectivity
2015-age-49-vacc	lib1-5.5	L4-571v2	Lib4mock-A	WTplasmid-A	571v2	lib1	2018-11-14	0.0185	5.4800	collected after 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-vacc	human	post	lib1, 5.5% infectivity
2015-age-49-vacc	lib2-1.1	L5-571v2	Lib5mock-B	WTplasmid-D	571v2	lib2	2019-03-06	0.01	1.0900	collected after 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-vacc	human	post	lib2, 1.1% infectivity
2015-age-49-vacc	lib3-7.8	L6-571v2	Lib6mock-B	WTplasmid-H	571v2	lib3	2019-03-26	0.005	7.7900	collected after 2015/2016 vaccine from person born in 1966	Hensley_sera	2015-age-49-vacc	human	post	lib3, 7.8% infectivity
2015-age-29-prevacc	lib1-9.5	L4-574v1	Lib4mock-A	WTplasmid-A	574v1	lib1	2018-11-14	0.0185	9.4800	collected before 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-prevacc	human	pre	lib1, 9.5% infectivity
2015-age-29-prevacc	lib2-7.6	L5-574v1	Lib5mock-B	WTplasmid-D	574v1	lib2	2019-03-06	0.0125	7.5700	collected before 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-prevacc	human	pre	lib2, 7.6% infectivity
2015-age-29-prevacc	lib3-3.3	L6-574v1	Lib6mock-B	WTplasmid-H	574v1	lib3	2019-03-26	0.015	3.3000	collected before 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-prevacc	human	pre	lib3, 3.3% infectivity
ferret-WHO-Perth2009	lib1-2.3	L4-WHOCCPerth	Lib4mock-A	WTplasmid-A	WHOCCPerth	lib1	2018-11-14	0.0185	2.3200	ferret infected by Melbourne WHO CC with their Perth/2009 strain	ferret	WHO-Perth2009	ferret	post	lib1, 2.3% infectivity
ferret-WHO-Perth2009	lib2-1.9	L5-WHOCCPerth	Lib5mock-A	WTplasmid-B	WHOCCPerth	lib2	2019-01-16	0.0075	1.8800	ferret infected by Melbourne WHO CC with their Perth/2009 strain	ferret	WHO-Perth2009	ferret	post	lib2, 1.9% infectivity
ferret-WHO-Perth2009	lib3-4.1	L6-WHOCCPerth	Lib6mock-A	WTplasmid-E	WHOCCPerth	lib3	2019-03-06	0.0075	4.1000	ferret infected by Melbourne WHO CC with their Perth/2009 strain	ferret	WHO-Perth2009	ferret	post	lib3, 4.1% infectivity
2015-age-48-vacc	lib1-4.6	L4-589v2	Lib4mock-A	WTplasmid-A	589v2	lib1	2018-11-14	0.000875	4.5800	collected after 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-vacc	human	post	lib1, 4.6% infectivity
2015-age-48-vacc	lib2-4.7	L5-589v2	Lib5mock-B	WTplasmid-D	589v2	lib2	2019-03-06	0.000375	4.6800	collected after 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-vacc	human	post	lib2, 4.7% infectivity
2015-age-48-vacc	lib3-4.1	L6-589v2	Lib6mock-A	WTplasmid-E	589v2	lib3	2019-03-06	0.000375	4.0600	collected after 2015/2016 vaccine from person born in 1967	Hensley_sera	2015-age-48-vacc	human	post	lib3, 4.1% infectivity
2015-age-25-prevacc	lib1-6.9	L4-557v1	Lib4mock-A	WTplasmid-A	557v1	lib1	2018-11-14	0.0075	6.9000	collected before 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-prevacc	human	pre	lib1, 6.9% infectivity
2015-age-25-prevacc	lib2-2.3	L5-557v1	Lib5mock-B	WTplasmid-D	557v1	lib2	2019-03-06	0.005	2.3100	collected before 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-prevacc	human	pre	lib2, 2.3% infectivity
2015-age-25-prevacc	lib3-1.6	L6-557v1	Lib6mock-A	WTplasmid-E	557v1	lib3	2019-03-06	0.005	1.5700	collected before 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-prevacc	human	pre	lib3, 1.6% infectivity
ferret-Pitt-1-preinf	lib1-100	L4-f9267neg	Lib4mock-A	WTplasmid-A	f9267neg	lib1	2018-11-14	0.00075	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-1-preinf	ferret	pre	lib1, 100% infectivity
ferret-Pitt-1-preinf	lib2-100	L5-f9267neg	Lib5mock-A	WTplasmid-B	f9267neg	lib2	2019-01-16	0.00075	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-1-preinf	ferret	pre	lib2, 100% infectivity
ferret-Pitt-1-preinf	lib3-100	L6-f9267neg	Lib6mock-B	WTplasmid-H	f9267neg	lib3	2019-03-26	0.00075	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-1-preinf	ferret	pre	lib3, 100% infectivity
ferret-Pitt-1-postinf	lib1-4.4	L4-f9267d23	Lib4mock-A	WTplasmid-A	f9267d23	lib1	2018-11-14	0.00075	4.3600	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-1-postinf	ferret	post	lib1, 4.4% infectivity
ferret-Pitt-1-postinf	lib2-5.3	L5-f9267d23	Lib5mock-A	WTplasmid-B	f9267d23	lib2	2019-01-16	0.000175	5.3500	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-1-postinf	ferret	post	lib2, 5.3% infectivity
ferret-Pitt-1-postinf	lib3-1.7	L6-f9267d23	Lib6mock-A	WTplasmid-E	f9267d23	lib3	2019-03-06	0.000375	1.6900	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-1-postinf	ferret	post	lib3, 1.7% infectivity
ferret-Pitt-2-preinf	lib1-100	L4-f9435neg	Lib4mock-A	WTplasmid-A	f9435neg	lib1	2018-11-14	0.001	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-2-preinf	ferret	pre	lib1, 100% infectivity
ferret-Pitt-2-preinf	lib2-100	L5-f9435neg	Lib5mock-A	WTplasmid-B	f9435neg	lib2	2019-01-16	0.0025	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-2-preinf	ferret	pre	lib2, 100% infectivity
ferret-Pitt-2-preinf	lib3-100	L6-f9435neg	Lib6mock-B	WTplasmid-H	f9435neg	lib3	2019-03-26	0.0025	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-2-preinf	ferret	pre	lib3, 100% infectivity
ferret-Pitt-3-preinf	lib1-100	L4-f9437neg	Lib4mock-A	WTplasmid-A	f9437neg	lib1	2018-11-14	0.00225	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-3-preinf	ferret	pre	lib1, 100% infectivity
ferret-Pitt-3-preinf	lib2-100	L5-f9437neg	Lib5mock-A	WTplasmid-B	f9437neg	lib2	2019-01-16	0.00625	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-3-preinf	ferret	pre	lib2, 100% infectivity
ferret-Pitt-3-preinf	lib3-100	L6-f9437neg	Lib6mock-B	WTplasmid-H	f9437neg	lib3	2019-03-26	0.00625	100.0000	Lakdawala lab ferret, before infection	ferret	Pitt-3-preinf	ferret	pre	lib3, 100% infectivity
ferret-Pitt-3-postinf	lib1-8.8	L4-f9437d23	Lib4mock-A	WTplasmid-A	f9437d23	lib1	2018-11-14	0.00225	8.7700	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-3-postinf	ferret	post	lib1, 8.8% infectivity
ferret-Pitt-3-postinf	lib2-1.7	L5-f9437d23	Lib5mock-A	WTplasmid-B	f9437d23	lib2	2019-01-16	0.001375	1.7000	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-3-postinf	ferret	post	lib2, 1.7% infectivity
ferret-Pitt-3-postinf	lib3-5.8	L6-f9437d23	Lib6mock-B	WTplasmid-H	f9437d23	lib3	2019-03-26	0.00175	5.7800	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-3-postinf	ferret	post	lib3, 5.8% infectivity
2010-age-21	lib2-6.0	L5-VIDD1	Lib5mock-A	WTplasmid-B	VIDD1	lib2	2019-01-16	0.0045	5.9500	collected at Hutch in 2/2010 from person born in 1989	VIDD_sera	2010-age-21	human	NaN	lib2, 6.0% infectivity
2010-age-21	lib1-4.1	L4-VIDD1	Lib4mock-B	WTplasmid-C	VIDD1	lib1	2019-01-16	0.0175	4.0800	collected at Hutch in 2/2010 from person born in 1989	VIDD_sera	2010-age-21	human	NaN	lib1, 4.1% infectivity
2010-age-21	lib3-3.7	L6-VIDD1	Lib6mock-A	WTplasmid-E	VIDD1	lib3	2019-03-06	0.0075	3.6800	collected at Hutch in 2/2010 from person born in 1989	VIDD_sera	2010-age-21	human	NaN	lib3, 3.7% infectivity
ferret-WHO-Victoria2011	lib2-3.9	L5-WHOCCVic	Lib5mock-A	WTplasmid-B	WHOCCVic	lib2	2019-01-16	0.0025	3.9000	ferret infected by Melbourne WHO CC with their Victoria/2011 strain	ferret	WHO-Victoria2011	ferret	post	lib2, 3.9% infectivity
ferret-WHO-Victoria2011	lib1-1.3	L4-WHOCCVic	Lib4mock-B	WTplasmid-C	WHOCCVic	lib1	2019-01-16	0.00625	1.2600	ferret infected by Melbourne WHO CC with their Victoria/2011 strain	ferret	WHO-Victoria2011	ferret	post	lib1, 1.3% infectivity
ferret-WHO-Victoria2011	lib3-3.6	L6-WHOCCVic	Lib6mock-A	WTplasmid-E	WHOCCVic	lib3	2019-03-06	0.0025	3.6200	ferret infected by Melbourne WHO CC with their Victoria/2011 strain	ferret	WHO-Victoria2011	ferret	post	lib3, 3.6% infectivity
2009-age-65	lib2-5.2	L5-VIDD5	Lib5mock-A	WTplasmid-B	VIDD5	lib2	2019-01-16	0.000875	5.2400	collected at Hutch in 6/2009 from person born in 1944	VIDD_sera	2009-age-65	human	NaN	lib2, 5.2% infectivity
2009-age-65	lib3-5.6	L6-VIDD5	Lib6mock-B	WTplasmid-H	VIDD5	lib3	2019-03-26	0.002	5.5800	collected at Hutch in 6/2009 from person born in 1944	VIDD_sera	2009-age-65	human	NaN	lib3, 5.6% infectivity
2009-age-65	lib1-1.8	L4-VIDD5	Lib4mock-D	WTplasmid-G	VIDD5	lib1	2019-03-26	0.003375	1.7700	collected at Hutch in 6/2009 from person born in 1944	VIDD_sera	2009-age-65	human	NaN	lib1, 1.8% infectivity
2009-age-64	lib2-0.87	L5-VIDD4	Lib5mock-A	WTplasmid-B	VIDD4	lib2	2019-01-16	0.001675	0.8700	collected at Hutch in 11/2008 from person born in 1945	VIDD_sera	2009-age-64	human	NaN	lib2, 0.87% infectivity
2009-age-64	lib1-11	L4-VIDD4	Lib4mock-B	WTplasmid-C	VIDD4	lib1	2019-01-16	0.004	10.7400	collected at Hutch in 11/2008 from person born in 1945	VIDD_sera	2009-age-64	human	NaN	lib1, 11% infectivity
2009-age-64	lib3-0.26	L6-VIDD4	Lib6mock-B	WTplasmid-H	VIDD4	lib3	2019-03-26	0.00375	0.2600	collected at Hutch in 11/2008 from person born in 1945	VIDD_sera	2009-age-64	human	NaN	lib3, 0.26% infectivity
2009-age-53	lib2-2.0	L5-VIDD2	Lib5mock-A	WTplasmid-B	VIDD2	lib2	2019-01-16	0.00375	2.0300	collected at Hutch in 1/2009 from person born in 1956	VIDD_sera	2009-age-53	human	NaN	lib2, 2.0% infectivity
2009-age-53	lib1-3.7	L4-VIDD2	Lib4mock-B	WTplasmid-C	VIDD2	lib1	2019-01-16	0.00875	3.7400	collected at Hutch in 1/2009 from person born in 1956	VIDD_sera	2009-age-53	human	NaN	lib1, 3.7% infectivity
2009-age-53	lib3-6.2	L6-VIDD2	Lib6mock-A	WTplasmid-E	VIDD2	lib3	2019-03-06	0.00375	6.2200	collected at Hutch in 1/2009 from person born in 1956	VIDD_sera	2009-age-53	human	NaN	lib3, 6.2% infectivity
2009-age-53-plus-2-months	lib2-1.7	L5-VIDD3	Lib5mock-A	WTplasmid-B	VIDD3	lib2	2019-01-16	0.0045	1.7300	second sample collected at Hutch in 3/2009 from person born in 1956	VIDD_sera	2009-age-53-plus-2-months	human	NaN	lib2, 1.7% infectivity
2009-age-53-plus-2-months	lib1-6.0	L4-VIDD3	Lib4mock-B	WTplasmid-C	VIDD3	lib1	2019-01-16	0.0075	5.9800	second sample collected at Hutch in 3/2009 from person born in 1956	VIDD_sera	2009-age-53-plus-2-months	human	NaN	lib1, 6.0% infectivity
2009-age-53-plus-2-months	lib3-1.1	L6-VIDD3	Lib6mock-A	WTplasmid-E	VIDD3	lib3	2019-03-06	0.00625	1.1200	second sample collected at Hutch in 3/2009 from person born in 1956	VIDD_sera	2009-age-53-plus-2-months	human	NaN	lib3, 1.1% infectivity
ferret-Pitt-2-postinf	lib2-1.9	L5-f9435d23	Lib5mock-A	WTplasmid-B	f9435d23	lib2	2019-01-16	0.000625	1.8800	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-2-postinf	ferret	post	lib2, 1.9% infectivity
ferret-Pitt-2-postinf	lib1-3.5	L4-f9435d23	Lib4mock-B	WTplasmid-C	f9435d23	lib1	2019-01-16	0.0025	3.4500	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-2-postinf	ferret	post	lib1, 3.5% infectivity
ferret-Pitt-2-postinf	lib3-3.0	L6-f9435d23	Lib6mock-A	WTplasmid-E	f9435d23	lib3	2019-03-06	0.00125	3.0400	Lakdawala lab ferret, 23 days after infection by Perth/2009 with our HA	ferret	Pitt-2-postinf	ferret	post	lib3, 3.0% infectivity
2015-age-25-vacc	lib2-0.81	L5-557v2	Lib5mock-B	WTplasmid-D	557v2	lib2	2019-03-06	0.0005	0.8100	collected after 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-vacc	human	post	lib2, 0.81% infectivity
2015-age-25-vacc	lib3-1.6	L6-557v2	Lib6mock-A	WTplasmid-E	557v2	lib3	2019-03-06	0.0005	1.6400	collected after 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-vacc	human	post	lib3, 1.6% infectivity
2015-age-25-vacc	lib1-1.5	L4-557v2	Lib4mock-C	WTplasmid-F	557v2	lib1	2019-03-06	0.00125	1.5300	collected after 2015/2016 vaccine from person born in 1990	Hensley_sera	2015-age-25-vacc	human	post	lib1, 1.5% infectivity
2015-age-29-vacc	lib2-1.1	L5-574v2	Lib5mock-B	WTplasmid-D	574v2	lib2	2019-03-06	0.000875	1.1100	collected after 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-vacc	human	post	lib2, 1.1% infectivity
2015-age-29-vacc	lib1-1.5	L4-574v2	Lib4mock-C	WTplasmid-F	574v2	lib1	2019-03-06	0.002	1.5400	collected after 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-vacc	human	post	lib1, 1.5% infectivity
2015-age-29-vacc	lib3-1.9	L6-574v2	Lib6mock-B	WTplasmid-H	574v2	lib3	2019-03-26	0.00075	1.9200	collected after 2015/2016 vaccine from person born in 1986	Hensley_sera	2015-age-29-vacc	human	post	lib3, 1.9% infectivity
antibody-4F03	lib2-23	L5-4F03-c1	Lib5mock-B	WTplasmid-D	4F03	lib2	2019-03-06	0.3	22.9900	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib2, 23% infectivity
antibody-4F03	lib2-3.7	L5-4F03-c3	Lib5mock-B	WTplasmid-D	4F03	lib2	2019-03-06	1.5	3.7100	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib2, 3.7% infectivity
antibody-4F03	lib2-4.7	L5-4F03-c2	Lib5mock-B	WTplasmid-D	4F03	lib2	2019-03-06	0.7	4.7100	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib2, 4.7% infectivity
antibody-4F03	lib1-3.5	L4-4F03-c2	Lib4mock-C	WTplasmid-F	4F03	lib1	2019-03-06	1.0	3.4700	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib1, 3.5% infectivity
antibody-4F03	lib1-0.78	L4-4F03-c3	Lib4mock-C	WTplasmid-F	4F03	lib1	2019-03-06	2.0	0.7800	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib1, 0.78% infectivity
antibody-4F03	lib3-24	L6-4F03-c1	Lib6mock-B	WTplasmid-H	4F03	lib3	2019-03-26	0.3	23.9100	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib3, 24% infectivity
antibody-4F03	lib3-0.84	L6-4F03-c3	Lib6mock-B	WTplasmid-H	4F03	lib3	2019-03-26	1.4	0.8400	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib3, 0.84% infectivity
antibody-4F03	lib3-7.2	L6-4F03-c2	Lib6mock-B	WTplasmid-H	4F03	lib3	2019-03-26	0.75	7.1600	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib3, 7.2% infectivity
antibody-4F03	lib1-28	L4-4F03-c1	Lib4mock-D	WTplasmid-G	4F03	lib1	2019-03-26	0.3	28.2400	lower head-targeting monoclonal antibody 4F03 from Seth Zost and Scott Hensley	antibody_lower_head	antibody-4F03	NaN	NaN	lib1, 28% infectivity
antibody-5A01	lib2-1.4	L5-5A01	Lib5mock-B	WTplasmid-D	5A01	lib2	2019-03-06	0.025	1.3600	site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley	antibody_region_B	antibody-5A01	NaN	NaN	lib2, 1.4% infectivity
antibody-5A01	lib1-4.3	L4-5A01	Lib4mock-C	WTplasmid-F	5A01	lib1	2019-03-06	0.07	4.3200	site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley	antibody_region_B	antibody-5A01	NaN	NaN	lib1, 4.3% infectivity
antibody-5A01	lib3-1.0	L6-5A01	Lib6mock-B	WTplasmid-H	5A01	lib3	2019-03-26	0.025	1.0200	site B-targeting monoclonal antibody 5A01 from Seth Zost and Scott Hensley	antibody_region_B	antibody-5A01	NaN	NaN	lib3, 1.0% infectivity
2009-age-65-with-low-4F03	lib2-5.0	L5-VIDD5-low4F03	Lib5mock-B	WTplasmid-D	VIDD5andlow4F03	lib2	2019-03-06	0.000875+0.3	5.0000	collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency	serum_mAb_spike	2009-age-65-with-low-4F03	human	NaN	lib2, 5.0% infectivity
2009-age-65-with-low-4F03	lib3-0.20	L6-VIDD5-low4F03	Lib6mock-B	WTplasmid-H	VIDD5andlow4F03	lib3	2019-03-26	0.002+0.3	0.2040	collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency	serum_mAb_spike	2009-age-65-with-low-4F03	human	NaN	lib3, 0.20% infectivity
2009-age-65-with-low-4F03	lib1-0.030	L4-VIDD5-low4F03	Lib4mock-D	WTplasmid-G	VIDD5andlow4F03	lib1	2019-03-26	0.003375+0.3	0.0300	collected at Hutch in 6/2009 from person b1944 with 4F03 at low stringency	serum_mAb_spike	2009-age-65-with-low-4F03	human	NaN	lib1, 0.030% infectivity
2009-age-65-with-mid-4F03	lib2-0.46	L5-VIDD5-mid4F03	Lib5mock-B	WTplasmid-D	VIDD5andmid4F03	lib2	2019-03-06	0.000875+0.7	0.4600	collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency	serum_mAb_spike	2009-age-65-with-mid-4F03	human	NaN	lib2, 0.46% infectivity
2009-age-65-with-mid-4F03	lib3-0.0077	L6-VIDD5-mid4F03	Lib6mock-B	WTplasmid-H	VIDD5andmid4F03	lib3	2019-03-26	0.002+0.75	0.0077	collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency	serum_mAb_spike	2009-age-65-with-mid-4F03	human	NaN	lib3, 0.0077% infectivity
2009-age-65-with-mid-4F03	lib1-0.0037	L4-VIDD5-mid4F03	Lib4mock-D	WTplasmid-G	VIDD5andmid4F03	lib1	2019-03-26	0.003375+1	0.0037	collected at Hutch in 6/2009 from person b1944 with 4F03 at medium stringency	serum_mAb_spike	2009-age-65-with-mid-4F03	human	NaN	lib1, 0.0037% infectivity
2009-age-65-with-hi-4F03	lib2-0.040	L5-VIDD5-hi4F03	Lib5mock-B	WTplasmid-D	VIDD5andhi4F03	lib2	2019-03-06	0.000875+1.5	0.0400	collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency	serum_mAb_spike	2009-age-65-with-hi-4F03	human	NaN	lib2, 0.040% infectivity
2009-age-65-with-hi-4F03	lib3-0.0032	L6-VIDD5-hi4F03	Lib6mock-B	WTplasmid-H	VIDD5andhi4F03	lib3	2019-03-26	0.002+1.4	0.0032	collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency	serum_mAb_spike	2009-age-65-with-hi-4F03	human	NaN	lib3, 0.0032% infectivity
2009-age-65-with-hi-4F03	lib1-0.0016	L4-VIDD5-hi4F03	Lib4mock-D	WTplasmid-G	VIDD5andhi4F03	lib1	2019-03-26	0.003375+2	0.0016	collected at Hutch in 6/2009 from person b1944 with 4F03 at high stringency	serum_mAb_spike	2009-age-65-with-hi-4F03	human	NaN	lib1, 0.0016% infectivity

Compute immune selection¶

Now we run dms2_batch_diffsel to compute the immune selection. We then add to our selections data frame the name of the files holding the computed site (site) and mutation (mut) level selection for each sample.

The next cell does all of this:

In [26]:

outdir = config['diffseldir']
os.makedirs(outdir, exist_ok=True)

# write batch file used by program
batchfile = os.path.join(outdir, 'batch.csv')
(selections
 .rename(columns={'serum_name_formatted': 'group'})
 .to_csv(batchfile, index=False)
 )

cmds = ['dms2_batch_diffsel',
        '--summaryprefix', 'summary',
        '--batchfile', batchfile,
        '--outdir', outdir,
        '--indir', config['renumbcountsdir'],
        '--use_existing', config['use_existing'],
        '--ncpus', str(config['ncpus'])
        ]

print(f"Computing diffsel using dms2_batch_diffsel with command:\n{' '.join(cmds)}")
subprocess.check_output(cmds)

selfilecols = []
for selfile in ['mutdiffsel', 'sitediffsel']:
    selfilecol = selfile + '_file'
    selfilecols.append(selfilecol)
    selections[selfilecol] = (outdir + '/' + selections['serum_name_formatted']
                              + '-' + selections['name'] + '_' +
                              selfile + '.csv')
    missing_files = [f for f in selections[selfilecol]
                     if not os.path.isfile(f)]
    assert not missing_files, f"missing files: {missing_files}"
    print(f"Created {len(selections[selfilecol])} {selfile} files, adding to "
          f"`selections` data frame in column {selfilecol}")

Computing diffsel using dms2_batch_diffsel with command:
dms2_batch_diffsel --summaryprefix summary --batchfile results/diffsel/batch.csv --outdir results/diffsel --indir results/renumbered_codoncounts --use_existing yes --ncpus 16
Created 92 mutdiffsel files, adding to `selections` data frame in column mutdiffsel_file
Created 92 sitediffsel files, adding to `selections` data frame in column sitediffsel_file

Get all selection information in one data frame¶

For further processing, we want to create a dataframe that holds all of the selection information at the site and mutation levels for all samples. We create such a dataframe, sel_df, by reading the files in selections into the data frame using dms_tools2.diffsel.df_read_filecols:

In [27]:

sel_df = (dms_tools2.diffsel.df_read_filecols(selections, selfilecols)
          .drop(columns=selfilecols)
          )

Now sel_df is a very large data frame, but it has all the information we want to plot. Here are the first few rows:

In [28]:

print(f"sel_df has {len(sel_df)} rows. Here are the first few:")
display(HTML(sel_df.head(n=5).to_html(index=False)))

sel_df has 1041440 rows. Here are the first few:

serum_name_formatted	name	sel	mock	err	serum	library	date	serum_dilution	percent_infectivity	serum_description	serum_group	serum_name	serum_species	serum_vaccination	name_formatted	site	wildtype	mutation	mutdiffsel	abs_diffsel	positive_diffsel	negative_diffsel	max_diffsel	min_diffsel	isite
antibody-3C06	lib2-0.010	L2-3C06	Lib2mock-mAb	WTplasmid-mAb-A	3C06	lib2	2018-07-20	0.55	0.01	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN	lib2, 0.010% infectivity	167	T	A	12.307415	26.366193	24.254047	-2.112146	12.307415	-2.112146	182
antibody-3C06	lib2-0.010	L2-3C06	Lib2mock-mAb	WTplasmid-mAb-A	3C06	lib2	2018-07-20	0.55	0.01	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN	lib2, 0.010% infectivity	167	T	Q	1.227785	26.366193	24.254047	-2.112146	12.307415	-2.112146	182
antibody-3C06	lib2-0.010	L2-3C06	Lib2mock-mAb	WTplasmid-mAb-A	3C06	lib2	2018-07-20	0.55	0.01	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN	lib2, 0.010% infectivity	167	T	G	1.189062	26.366193	24.254047	-2.112146	12.307415	-2.112146	182
antibody-3C06	lib2-0.010	L2-3C06	Lib2mock-mAb	WTplasmid-mAb-A	3C06	lib2	2018-07-20	0.55	0.01	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN	lib2, 0.010% infectivity	167	T	V	1.085952	26.366193	24.254047	-2.112146	12.307415	-2.112146	182
antibody-3C06	lib2-0.010	L2-3C06	Lib2mock-mAb	WTplasmid-mAb-A	3C06	lib2	2018-07-20	0.55	0.01	site B-targeting monoclonal antibody 3C06 from Seth Zost and Scott Hensley	antibody_region_B	antibody-3C06	NaN	NaN	lib2, 0.010% infectivity	167	T	W	0.742358	26.366193	24.254047	-2.112146	12.307415	-2.112146	182

Analyze and plot immune selection¶

Choose sample to retain for each serum¶

For some serum, we have multiple samples. These include the samples from multiple library replicates, but sometimes also several samples for the same library replicate at different serum concentrations that yield different percent viral infectivity remaining.

Which of these samples do we want to retain? We probably want just one sample per library (so we don't over-weight some libraries in our results), and we need to select the serum concentration (percent viral infectivity remaining) that is "best" in terms of giving reproducible results between replicates with maximal signal to noise.

Plot site-level selection for all samples¶

We first plot the site-level selection for all samples. To do this, we look over all sites and use the facet_plot command of dmslogo to plot the site-level selection for all samples for each serum. We also compute the correlation between samples for each serum:

In [29]:

for serum_name, serum_sel_df in sel_df.groupby('serum_name_formatted'):

    print(f"\n\n******************* {serum_name} *******************")

    fig, ax = dmslogo.facet_plot(
                serum_sel_df,
                x_col='isite',
                gridcol_col='name_formatted',
                show_col=None,
                wspace=0.6,
                draw_line_kwargs=dict(
                        xtick_col='site',
                        height_col='positive_diffsel',
                        ylabel='positive_diffsel',
                        )
                )
    display(fig)
    plt.close(fig)

    corr_df = (serum_sel_df
               .rename(columns={'name_formatted': 'sample'})
               .pivot_table(values='positive_diffsel',
                            columns='sample',
                            index=['site'])
               .corr()
               .round(3)
               )
    display(HTML(corr_df.to_html()))


******************* 2009-age-53 *******************

sample	lib1, 3.7% infectivity	lib2, 2.0% infectivity	lib3, 6.2% infectivity
sample
lib1, 3.7% infectivity	1.000	0.439	0.396
lib2, 2.0% infectivity	0.439	1.000	0.380
lib3, 6.2% infectivity	0.396	0.380	1.000


******************* 2009-age-53-plus-2-months *******************

sample	lib1, 6.0% infectivity	lib2, 1.7% infectivity	lib3, 1.1% infectivity
sample
lib1, 6.0% infectivity	1.000	0.446	0.451
lib2, 1.7% infectivity	0.446	1.000	0.434
lib3, 1.1% infectivity	0.451	0.434	1.000


******************* 2009-age-64 *******************

sample	lib1, 11% infectivity	lib2, 0.87% infectivity	lib3, 0.26% infectivity
sample
lib1, 11% infectivity	1.000	0.498	0.473
lib2, 0.87% infectivity	0.498	1.000	0.438
lib3, 0.26% infectivity	0.473	0.438	1.000


******************* 2009-age-65 *******************

sample	lib1, 1.8% infectivity	lib2, 5.2% infectivity	lib3, 5.6% infectivity
sample
lib1, 1.8% infectivity	1.000	0.491	0.356
lib2, 5.2% infectivity	0.491	1.000	0.370
lib3, 5.6% infectivity	0.356	0.370	1.000


******************* 2009-age-65-with-hi-4F03 *******************

sample	lib1, 0.0016% infectivity	lib2, 0.040% infectivity	lib3, 0.0032% infectivity
sample
lib1, 0.0016% infectivity	1.000	0.327	0.346
lib2, 0.040% infectivity	0.327	1.000	0.632
lib3, 0.0032% infectivity	0.346	0.632	1.000


******************* 2009-age-65-with-low-4F03 *******************

sample	lib1, 0.030% infectivity	lib2, 5.0% infectivity	lib3, 0.20% infectivity
sample
lib1, 0.030% infectivity	1.000	0.409	0.381
lib2, 5.0% infectivity	0.409	1.000	0.500
lib3, 0.20% infectivity	0.381	0.500	1.000


******************* 2009-age-65-with-mid-4F03 *******************

sample	lib1, 0.0037% infectivity	lib2, 0.46% infectivity	lib3, 0.0077% infectivity
sample
lib1, 0.0037% infectivity	1.000	0.407	0.476
lib2, 0.46% infectivity	0.407	1.000	0.512
lib3, 0.0077% infectivity	0.476	0.512	1.000


******************* 2010-age-21 *******************

sample	lib1, 4.1% infectivity	lib2, 6.0% infectivity	lib3, 3.7% infectivity
sample
lib1, 4.1% infectivity	1.000	0.518	0.539
lib2, 6.0% infectivity	0.518	1.000	0.614
lib3, 3.7% infectivity	0.539	0.614	1.000


******************* 2015-age-25-prevacc *******************

sample	lib1, 6.9% infectivity	lib2, 2.3% infectivity	lib3, 1.6% infectivity
sample
lib1, 6.9% infectivity	1.000	0.753	0.763
lib2, 2.3% infectivity	0.753	1.000	0.757
lib3, 1.6% infectivity	0.763	0.757	1.000


******************* 2015-age-25-vacc *******************

sample	lib1, 1.5% infectivity	lib2, 0.81% infectivity	lib3, 1.6% infectivity
sample
lib1, 1.5% infectivity	1.000	0.708	0.693
lib2, 0.81% infectivity	0.708	1.000	0.747
lib3, 1.6% infectivity	0.693	0.747	1.000


******************* 2015-age-29-prevacc *******************

sample	lib1, 9.5% infectivity	lib2, 7.6% infectivity	lib3, 3.3% infectivity
sample
lib1, 9.5% infectivity	1.000	0.356	0.385
lib2, 7.6% infectivity	0.356	1.000	0.468
lib3, 3.3% infectivity	0.385	0.468	1.000


******************* 2015-age-29-vacc *******************

sample	lib1, 1.5% infectivity	lib2, 1.1% infectivity	lib3, 1.9% infectivity
sample
lib1, 1.5% infectivity	1.000	0.457	0.309
lib2, 1.1% infectivity	0.457	1.000	0.340
lib3, 1.9% infectivity	0.309	0.340	1.000


******************* 2015-age-48-prevacc *******************

sample	lib1, 14% infectivity	lib2, 20% infectivity	lib3, 18% infectivity
sample
lib1, 14% infectivity	1.000	0.320	0.345
lib2, 20% infectivity	0.320	1.000	0.413
lib3, 18% infectivity	0.345	0.413	1.000


******************* 2015-age-48-vacc *******************

sample	lib1, 4.6% infectivity	lib2, 4.7% infectivity	lib3, 4.1% infectivity
sample
lib1, 4.6% infectivity	1.000	0.578	0.668
lib2, 4.7% infectivity	0.578	1.000	0.571
lib3, 4.1% infectivity	0.668	0.571	1.000


******************* 2015-age-49-prevacc *******************

sample	lib1, 20% infectivity	lib2, 12% infectivity	lib3, 28% infectivity
sample
lib1, 20% infectivity	1.000	0.394	0.371
lib2, 12% infectivity	0.394	1.000	0.376
lib3, 28% infectivity	0.371	0.376	1.000


******************* 2015-age-49-vacc *******************

sample	lib1, 5.5% infectivity	lib2, 1.1% infectivity	lib3, 7.8% infectivity
sample
lib1, 5.5% infectivity	1.000	0.375	0.388
lib2, 1.1% infectivity	0.375	1.000	0.449
lib3, 7.8% infectivity	0.388	0.449	1.000


******************* antibody-1C04 *******************

sample	lib2, 10% infectivity	lib3, 7.9% infectivity
sample
lib2, 10% infectivity	1.000	0.732
lib3, 7.9% infectivity	0.732	1.000


******************* antibody-3C04 *******************

sample	lib1, 0.078% infectivity	lib2, 0.033% infectivity
sample
lib1, 0.078% infectivity	1.000	0.647
lib2, 0.033% infectivity	0.647	1.000


******************* antibody-3C06 *******************

sample	lib1, 0.069% infectivity	lib2, 0.010% infectivity
sample
lib1, 0.069% infectivity	1.000	0.576
lib2, 0.010% infectivity	0.576	1.000


******************* antibody-4C01 *******************

sample	lib1, 0.12% infectivity	lib2, 0.018% infectivity
sample
lib1, 0.12% infectivity	1.000	0.822
lib2, 0.018% infectivity	0.822	1.000


******************* antibody-4F03 *******************

sample	lib1, 0.78% infectivity	lib1, 28% infectivity	lib1, 3.5% infectivity	lib2, 23% infectivity	lib2, 3.7% infectivity	lib2, 4.7% infectivity	lib3, 0.84% infectivity	lib3, 24% infectivity	lib3, 7.2% infectivity
sample
lib1, 0.78% infectivity	1.000	0.293	0.850	0.412	0.792	0.718	0.768	0.507	0.728
lib1, 28% infectivity	0.293	1.000	0.395	0.357	0.368	0.370	0.326	0.379	0.388
lib1, 3.5% infectivity	0.850	0.395	1.000	0.451	0.745	0.720	0.731	0.578	0.743
lib2, 23% infectivity	0.412	0.357	0.451	1.000	0.576	0.630	0.459	0.533	0.506
lib2, 3.7% infectivity	0.792	0.368	0.745	0.576	1.000	0.819	0.761	0.576	0.744
lib2, 4.7% infectivity	0.718	0.370	0.720	0.630	0.819	1.000	0.786	0.688	0.806
lib3, 0.84% infectivity	0.768	0.326	0.731	0.459	0.761	0.786	1.000	0.687	0.846
lib3, 24% infectivity	0.507	0.379	0.578	0.533	0.576	0.688	0.687	1.000	0.799
lib3, 7.2% infectivity	0.728	0.388	0.743	0.506	0.744	0.806	0.846	0.799	1.000


******************* antibody-5A01 *******************

sample	lib1, 4.3% infectivity	lib2, 1.4% infectivity	lib3, 1.0% infectivity
sample
lib1, 4.3% infectivity	1.000	0.762	0.705
lib2, 1.4% infectivity	0.762	1.000	0.717
lib3, 1.0% infectivity	0.705	0.717	1.000


******************* ferret-Pitt-1-postinf *******************

sample	lib1, 4.4% infectivity	lib2, 5.3% infectivity	lib3, 1.7% infectivity
sample
lib1, 4.4% infectivity	1.000	0.517	0.58
lib2, 5.3% infectivity	0.517	1.000	0.58
lib3, 1.7% infectivity	0.580	0.580	1.00


******************* ferret-Pitt-1-preinf *******************

sample	lib1, 100% infectivity	lib2, 100% infectivity	lib3, 100% infectivity
sample
lib1, 100% infectivity	1.000	0.248	0.327
lib2, 100% infectivity	0.248	1.000	0.231
lib3, 100% infectivity	0.327	0.231	1.000


******************* ferret-Pitt-2-postinf *******************

sample	lib1, 3.5% infectivity	lib2, 1.9% infectivity	lib3, 3.0% infectivity
sample
lib1, 3.5% infectivity	1.000	0.487	0.551
lib2, 1.9% infectivity	0.487	1.000	0.509
lib3, 3.0% infectivity	0.551	0.509	1.000


******************* ferret-Pitt-2-preinf *******************

sample	lib1, 100% infectivity	lib2, 100% infectivity	lib3, 100% infectivity
sample
lib1, 100% infectivity	1.00	0.260	0.270
lib2, 100% infectivity	0.26	1.000	0.328
lib3, 100% infectivity	0.27	0.328	1.000


******************* ferret-Pitt-3-postinf *******************

sample	lib1, 8.8% infectivity	lib2, 1.7% infectivity	lib3, 5.8% infectivity
sample
lib1, 8.8% infectivity	1.000	0.589	0.563
lib2, 1.7% infectivity	0.589	1.000	0.629
lib3, 5.8% infectivity	0.563	0.629	1.000


******************* ferret-Pitt-3-preinf *******************

sample	lib1, 100% infectivity	lib2, 100% infectivity	lib3, 100% infectivity
sample
lib1, 100% infectivity	1.000	0.286	0.346
lib2, 100% infectivity	0.286	1.000	0.336
lib3, 100% infectivity	0.346	0.336	1.000


******************* ferret-WHO-Perth2009 *******************

sample	lib1, 2.3% infectivity	lib2, 1.9% infectivity	lib3, 4.1% infectivity
sample
lib1, 2.3% infectivity	1.000	0.502	0.534
lib2, 1.9% infectivity	0.502	1.000	0.449
lib3, 4.1% infectivity	0.534	0.449	1.000


******************* ferret-WHO-Victoria2011 *******************

sample	lib1, 1.3% infectivity	lib2, 3.9% infectivity	lib3, 3.6% infectivity
sample
lib1, 1.3% infectivity	1.000	0.600	0.618
lib2, 3.9% infectivity	0.600	1.000	0.631
lib3, 3.6% infectivity	0.618	0.631	1.000

Choose samples to retain based on infectivity remaining¶

From the plots and correlations immediately above, several things are apparent:

Different samples with the same virus library tend to be substantially more correlated than samples between different virus libraries. This makes sense, as some of the noise probably comes from library composition differences in terms of variant abundance. This fact implies that we want to keep just one (or the same number) of samples from each library, so that when we take means / medians across samples they are not disproportionately influenced by a single library.
The "best" concentration of serum appears to be one that leaves about about 1% to 5% of the viral infectivity remaining. This is judged by the fact that samples with this percent infectivity remaining tend to look "cleanest" in the line plots of site differential selection, and also tend to correlate best with samples from another library.

Based on the above, we will filter our samples down to those that have closest to the target amount of viral infectivity remaining. Because we qualitatively estimated above that the best results are obtained when between 1% to 5% of infectivity is remaining, we will target 2% infectivity. Then for any library / serum with multiple samples, we will retain the one that is closest (in log space) to that target infectivity. We do this in log space because in linear space, 0% infectivity would be closer to 2% infectivity as would 6% infectivity--even though the latter is clearly preferable if our target is 2%.

In [30]:

target_infectivity = 2
print(f"Choosing samples closest to {target_infectivity:.2f}% infectivity.")

Choosing samples closest to 2.00% infectivity.

Mark the samples to retain by adding a retained column to both the selections and sel_df data frames, and flagging it as True only for the sample for that serum and library that has the smallest log-space distance to the target infectivity:

In [31]:

selections = (
    selections
    .assign(retained=lambda x: x.assign(dist=lambda y: (y.percent_infectivity /
                                                        target_infectivity)
                                        .apply(math.log).apply(abs))
            .groupby(['serum_name_formatted', 'library'])
            ['dist']
            .transform(lambda y: y <= y.min())
            )
    )
print(f"Retaining {len(selections.query('retained'))} of {len(selections)}")

sel_df = sel_df.merge(selections[['serum_name_formatted', 'name', 'retained']],
                      validate='many_to_one')

Retaining 86 of 92

Plot the samples to retain and their percent infectivity. In the plot below, the dashed horizontal line indicates the target percent infectivity, and the colors of the points indicate which sample we retained for each serum / library:

In [32]:

p = (ggplot(selections.assign(xlabel=lambda x: (x['serum_name_formatted'] +
                                                ', ' + x['library'])),
               aes('xlabel', 'percent_infectivity', color='retained')) +
     geom_point(size=3, alpha=0.7) +
     theme(
         axis_text_x=element_text(angle=90),
         figure_size=(0.25 * len(selections.groupby(['serum_name_formatted',
                                                     'library'])), 2.5)
         ) +
     scale_y_log10(name='percent infectivity') +
     xlab('') +
     scale_color_manual(values=PALETTE) +
     geom_hline(yintercept=target_infectivity, linetype='dashed',
                   alpha=0.7, color=PALETTE[2])
     )

_ = p.draw()

Retained samples, percent infectivity, and serum/antibody concentration¶

Here is a small table listing the samples that we retained for each serum and

The percent infectivity remaining
The serum or antibody concentration. For sera, this is the dilution of the serum. For antibodies, this is the antibody concentration in $\mu$g/ml. For antibody+serum mixes, this is the serum dilution followed by the antibody concentration in $\mu$g/ml.

In [33]:

os.makedirs(config['selectiontabledir'], exist_ok=True)

for valtype in ['percent_infectivity', 'serum_dilution']:
    table = (selections
             .assign(serum_name=lambda x:
                     pd.Categorical(x['serum_name'],
                                    sera['serum_name'],
                                    ordered=True))
             .query('retained')
             .sort_values('serum_name')
             [['serum_group', 'serum_name_formatted', 'library', valtype]]
             .pivot_table(index=['serum_group', 'serum_name_formatted'],
                          columns='library',
                          values=valtype,
                          aggfunc=lambda x: ' '.join(str(v) for v in x))
             )
    print(f"\n************* {valtype} for each replicate *************")
    display(HTML(table.to_html()))
    table_file = os.path.join(config['selectiontabledir'], f"{valtype}_table.csv")
    print(f"Writing to {table_file}")
    table.to_csv(table_file)

************* percent_infectivity for each replicate *************

	library	lib1	lib2	lib3
serum_group	serum_name_formatted
Hensley_sera	2015-age-25-prevacc	6.9000	2.3100	1.5700
	2015-age-25-vacc	1.5300	0.8100	1.6400
	2015-age-29-prevacc	9.4800	7.5700	3.3000
	2015-age-29-vacc	1.5400	1.1100	1.9200
	2015-age-48-prevacc	13.8200	20.3200	18.4000
	2015-age-48-vacc	4.5800	4.6800	4.0600
	2015-age-49-prevacc	19.6200	11.8600	27.7400
	2015-age-49-vacc	5.4800	1.0900	7.7900
VIDD_sera	2009-age-53	3.7400	2.0300	6.2200
	2009-age-53-plus-2-months	5.9800	1.7300	1.1200
	2009-age-64	10.7400	0.8700	0.2600
	2009-age-65	1.7700	5.2400	5.5800
	2010-age-21	4.0800	5.9500	3.6800
antibody_lower_head	antibody-1C04	NaN	10.3190	7.9040
antibody_lower_head	antibody-4F03	3.4700	3.7100	0.8400
antibody_region_B	antibody-3C04	0.0777	0.0331	NaN
	antibody-3C06	0.0688	0.0100	NaN
	antibody-4C01	0.1190	0.0185	NaN
	antibody-5A01	4.3200	1.3600	1.0200
ferret	ferret-Pitt-1-postinf	4.3600	5.3500	1.6900
	ferret-Pitt-1-preinf	100.0000	100.0000	100.0000
	ferret-Pitt-2-postinf	3.4500	1.8800	3.0400
	ferret-Pitt-2-preinf	100.0000	100.0000	100.0000
	ferret-Pitt-3-postinf	8.7700	1.7000	5.7800
	ferret-Pitt-3-preinf	100.0000	100.0000	100.0000
	ferret-WHO-Perth2009	2.3200	1.8800	4.1000
	ferret-WHO-Victoria2011	1.2600	3.9000	3.6200
serum_mAb_spike	2009-age-65-with-hi-4F03	0.0016	0.0400	0.0032
	2009-age-65-with-low-4F03	0.0300	5.0000	0.2040
	2009-age-65-with-mid-4F03	0.0037	0.4600	0.0077

Writing to results/selection_tables/percent_infectivity_table.csv

************* serum_dilution for each replicate *************

	library	lib1	lib2	lib3
serum_group	serum_name_formatted
Hensley_sera	2015-age-25-prevacc	0.0075	0.005	0.005
	2015-age-25-vacc	0.00125	0.0005	0.0005
	2015-age-29-prevacc	0.0185	0.0125	0.015
	2015-age-29-vacc	0.002	0.000875	0.00075
	2015-age-48-prevacc	0.0185	0.0175	0.0175
	2015-age-48-vacc	0.000875	0.000375	0.000375
	2015-age-49-prevacc	0.0185	0.0185	0.0185
	2015-age-49-vacc	0.0185	0.01	0.005
VIDD_sera	2009-age-53	0.00875	0.00375	0.00375
	2009-age-53-plus-2-months	0.0075	0.0045	0.00625
	2009-age-64	0.004	0.001675	0.00375
	2009-age-65	0.003375	0.000875	0.002
	2010-age-21	0.0175	0.0045	0.0075
antibody_lower_head	antibody-1C04	NaN	16.0	18.0
antibody_lower_head	antibody-4F03	1.0	1.5	1.4
antibody_region_B	antibody-3C04	0.1	0.65	NaN
	antibody-3C06	0.1	0.55	NaN
	antibody-4C01	0.2	0.60	NaN
	antibody-5A01	0.07	0.025	0.025
ferret	ferret-Pitt-1-postinf	0.00075	0.000175	0.000375
	ferret-Pitt-1-preinf	0.00075	0.00075	0.00075
	ferret-Pitt-2-postinf	0.0025	0.000625	0.00125
	ferret-Pitt-2-preinf	0.001	0.0025	0.0025
	ferret-Pitt-3-postinf	0.00225	0.001375	0.00175
	ferret-Pitt-3-preinf	0.00225	0.00625	0.00625
	ferret-WHO-Perth2009	0.0185	0.0075	0.0075
	ferret-WHO-Victoria2011	0.00625	0.0025	0.0025
serum_mAb_spike	2009-age-65-with-hi-4F03	0.003375+2	0.000875+1.5	0.002+1.4
	2009-age-65-with-low-4F03	0.003375+0.3	0.000875+0.3	0.002+0.3
	2009-age-65-with-mid-4F03	0.003375+1	0.000875+0.7	0.002+0.75

Writing to results/selection_tables/serum_dilution_table.csv

Compute serum average from retained samples¶

We now compute the average selection among the retained samples for each serum.

First, confirm that we have retained just one sample per serum / library:

In [34]:

assert all(len(group) == 1 for _, group in
           (selections
            .query('retained')
            .groupby(['serum_name_formatted', 'library'])
            )
           )

We can compute the "average" selection using either the mean or the median; which one to use is specified in the config file:

In [35]:

avg_type = config['avg_type']
print(f"Computing across replicate averages as the {avg_type}")

Computing across replicate averages as the median

Do a sanity check and make sure none of our libraries are named the same as the average type:

In [36]:

assert all(avg_type not in selections[col].values for col in ['library', 'name_formatted'])

Now loop over all sera and compute the average selection for all retained samples for that sera. Note that the averages are computed on the mutation-level data, and then the site data are computed from those averaged mutation-level data. The averages (along with the samples used to compute these averages) are then added to a new data frame similar to selections that is called avg_selections:

In [37]:

avgdir = config['avgdiffseldir']
os.makedirs(avgdir, exist_ok=True)

avg_selections = []
for serum_name_formatted, group in (
            selections
            .query('retained')
            [['serum_name_formatted', 'library', 'name_formatted'] +
             list(sera.columns) + selfilecols]
            .groupby('serum_name_formatted')
            ):

    avg_selections.append(group)

    # build row of data frame with average
    avg_row = group.iloc[0].to_dict(into=collections.OrderedDict)
    avg_row['library'] = avg_type
    avg_row['name_formatted'] = avg_type

    avg_row['mutdiffsel_file'] = (f"{avgdir}/{serum_name_formatted}-"
                                  f"mutdiffsel-{avg_type}.csv")
    (dms_tools2.diffsel.avgMutDiffSel(group['mutdiffsel_file'], avg_type)
     .to_csv(avg_row['mutdiffsel_file'], index=False))

    avg_row['sitediffsel_file'] = (f"{avgdir}/{serum_name_formatted}-"
                                   f"sitediffsel-{avg_type}.csv")
    (dms_tools2.diffsel.mutToSiteDiffSel(pd.read_csv(avg_row['mutdiffsel_file']))
     .to_csv(avg_row['sitediffsel_file'], index=False))

    avg_row = pd.Series(avg_row).to_frame().transpose()
    assert all(avg_row.columns == group.columns)
    avg_selections.append(avg_row)

# put avg_selections in data frame, sort to match config['sera']
avg_selections = (pd.concat(avg_selections)
                  .assign(serum_name=lambda x:
                          pd.Categorical(x['serum_name'],
                                         sera['serum_name'],
                                         ordered=True))
                  .sort_values(['serum_name', 'library'])
                  .assign(serum_name_formatted=lambda x:
                          pd.Categorical(x['serum_name_formatted'],
                                         x['serum_name_formatted'].unique(),
                                         ordered=True))
                  .reset_index(drop=True)
                  )

Now the avg_selections data frame lists the files giving all the retained library replicates plus the average calculated from them these replicates.

Now we create the data frame avg_sel_df which actually holds the site- and mutation-level averages for all sera as well as the samples (one replicate per library) that we used to compute these averages:

In [38]:

avg_sel_df = (dms_tools2.diffsel.df_read_filecols(avg_selections, selfilecols)
              .drop(columns=selfilecols)
              # preserve order of sera as in `avg_selections`
              .assign(serum_name_formatted=lambda x:
                      pd.Categorical(x['serum_name_formatted'],
                                     (avg_selections['serum_name_formatted']
                                      .unique()),
                                     ordered=True))
              )

This avg_sel_df data frame differs from sel_df only in that it includes the averages as a library type, and only has the retained replicates for each library.

Identify sites of "significant" selection¶

We want to identify the sites that are under "significant" immune selection. The reason is that we can then zoom in on these sites in logo plots.

In dms_tools2.plot.findSigSel function, we have implemented a heuristic way to do this. Essentially, this function fits a gamma distribution to the selection values for each site, and then identifies those that are "outliers" of high selection.

Cutoff for significance¶

First, we define a cutoff for what constitutes significant:

In [39]:

fdr_cutoff = 0.05

Identify significant sites¶

Now we use dms_tools2.plot.findSigSel to get a dataframe (sigsites_df) that lists the "significant" sites for each serum. That function finds these sites by fitting a gamma distribution to the data and then finding sites that are far outside the range of the distribution (thereby computing a heuristic P-value). It has two methods, robust_hist and mle; we use both and take any sites found by either method.

Because the pre-vaccination and pre-infection serum generally have weak signal and the vaccination at most boosts existing specificities in the human samples that we have, we will ignore the pre-vaccination samples when identifying significant sites.

The cell below also saves plots showing the fit gamma distribution (you can inspect these separately if you want to look in more detail):

In [40]:

os.makedirs(config['avgdiffsel_sigsites_dir'], exist_ok=True)

plotfile_template = os.path.join(config['avgdiffsel_sigsites_dir'],
                                 'sigsites_{serum}_{method}.pdf')

print(f"Identifying sites of significant selection at a FDR of {fdr_cutoff}.\n"
      f"Plots of distribution fitting saved as {plotfile_template}")

sigsites_df = []
sigsites_cols = ['serum_group', 'serum_name_formatted', 'isite', 'site',
                 'positive_diffsel']
for serum_name_formatted, group in (
        avg_sel_df
        .query('serum_vaccination != "pre"')
        .query('library == @avg_type')
        [sigsites_cols]
        .drop_duplicates()
        .groupby('serum_name_formatted', observed=True)
        ):
    
    for method in ['robust_hist', 'mle']:
        plotfile = plotfile_template.format(serum=serum_name_formatted,
                                            method=method)
        df, _, _ = dms_tools2.plot.findSigSel(
                group,
                'positive_diffsel',
                plotfile,
                fdr=fdr_cutoff,
                title=serum_name_formatted,
                method=method
                )
        sigsites_df.append(df)

sigsites_df = (pd.concat(sigsites_df, ignore_index=True)
               .groupby(sigsites_cols)
               .aggregate({'sig': 'sum'})
               .reset_index()
               .assign(sig=lambda x: x['sig'].astype(bool))
               .query('sig')
               )

print('Here are the first few rows of sigsites_df:')
display(HTML(sigsites_df.head(n=4).to_html(index=False)))

Identifying sites of significant selection at a FDR of 0.05.
Plots of distribution fitting saved as results/avgdiffsel/sigsites/sigsites_{serum}_{method}.pdf
Here are the first few rows of sigsites_df:

serum_group	serum_name_formatted	isite	site	positive_diffsel	sig
Hensley_sera	2015-age-25-vacc	160	145	14.630931	True
Hensley_sera	2015-age-25-vacc	174	159	26.089489	True
Hensley_sera	2015-age-25-vacc	175	160	6.151038	True
Hensley_sera	2015-age-25-vacc	176	161	5.219629	True

List significant sites for each serum¶

Now display lists of the significant sites for each serum (we did not determine significant sites for pre-vaccination samples as described above):

In [41]:

display(HTML(sigsites_df
             .sort_values('isite')
             .assign(nsites=1)
             .groupby('serum_name_formatted')
             .aggregate({'site': lambda x: ', '.join(list(x)),
                         'nsites': 'sum'})
             .rename(columns={'site': 'significant sites',
                              'nsites': 'number of sites'})
             .to_html()
             ))

	significant sites	number of sites
serum_name_formatted
antibody-5A01	157, 158, 159, 160, 193, 222, 227, 244	8
antibody-3C04	159, 160, 192, 193	4
antibody-3C06	137, 145, 159, 160, 167, 193, 207, 244, 246	9
antibody-4C01	193	1
antibody-4F03	80, 81, 83, 121, 122, 220, 244, 259, (HA2)78	9
antibody-1C04	53, 54, 57, 82, 83, 188, 210, 220, 244, (HA2)61	10
ferret-Pitt-1-preinf		0
ferret-Pitt-1-postinf	189, 193, 222	3
ferret-Pitt-2-preinf		0
ferret-Pitt-2-postinf	142, 144, 189, 193, 222	5
ferret-Pitt-3-preinf		0
ferret-Pitt-3-postinf	189, 193	2
ferret-WHO-Perth2009	50, 189, 193	3
ferret-WHO-Victoria2011	50, 159, 189, 193, 222, 275	6
2010-age-21	144, 159, 193, 222	4
2009-age-53	157, 160	2
2009-age-53-plus-2-months	157, 160, 244	3
2009-age-64	159, 222, 244	3
2009-age-65	159, 160, 193	3
2015-age-25-prevacc		0
2015-age-25-vacc	145, 159, 160, 161, 192, 193, 207, 220, 222, 224, 225, 244	12
2015-age-29-prevacc		0
2015-age-29-vacc	144, 145, 159, 160, 222, 227	6
2015-age-48-prevacc		0
2015-age-48-vacc	189	1
2015-age-49-prevacc		0
2015-age-49-vacc		0
2009-age-65-with-low-4F03	80, 121, 159, 160, 193, 244	6
2009-age-65-with-mid-4F03	121	1
2009-age-65-with-hi-4F03		0

Get significant sites for each serum group¶

In the analyses below, we may want to plot the sites that are significant within at least one serum sample for each serum group. Therefore, we determine the significant sites within each serum group:

In [42]:

sigsites_by_serumgroup = (
    sigsites_df
    [['serum_group', 'isite', 'site']]
    .drop_duplicates()
    .sort_values('isite')
    .groupby('serum_group')
    .aggregate(lambda x: list(x))
    .assign(nsites=lambda x: x['site'].apply(len))
    .sort_values('nsites')
    )

display(HTML(sigsites_by_serumgroup.to_html()))

	isite	site	nsites
serum_group
serum_mAb_spike	[95, 136, 174, 175, 208, 259]	[80, 121, 159, 160, 193, 244]	6
VIDD_sera	[159, 172, 174, 175, 208, 237, 259]	[144, 157, 159, 160, 193, 222, 244]	7
ferret	[65, 157, 159, 174, 204, 208, 237, 290]	[50, 142, 144, 159, 189, 193, 222, 275]	8
antibody_region_B	[152, 160, 172, 173, 174, 175, 182, 207, 208, 222, 237, 242, 259, 261]	[137, 145, 157, 158, 159, 160, 167, 192, 193, 207, 222, 227, 244, 246]	14
Hensley_sera	[159, 160, 174, 175, 176, 204, 207, 208, 222, 235, 237, 239, 240, 242, 259]	[144, 145, 159, 160, 161, 189, 192, 193, 207, 220, 222, 224, 225, 227, 244]	15
antibody_lower_head	[68, 69, 72, 95, 96, 97, 98, 136, 137, 203, 225, 235, 259, 274, 405, 422]	[53, 54, 57, 80, 81, 82, 83, 121, 122, 188, 210, 220, 244, 259, (HA2)61, (HA2)78]	16

Line and logo plots of average for each serum¶

Now we plot the average (across libraries) selection for each serum.

Choose sites to zoom-in on¶

In the plots, we will zoom in on important sites using logo plots. The reason that we zoom in on just a subset of sites is to keep the logo plots relatively compact.

The sites we zoom in on will be those identified above as being under "significant" selection for all sera in that serum group.

We also may want to "pad" the sites that we zoom in on by zooming in on this many sites before and after each significant site. This is useful if you want to give some context to the zoomed sites. Below, set the amount of padding you want around each significant site; a value of 0 means no padding:

In [43]:

zoom_pad = 0

Now build a dict, zoom_sites that is keyed first by serum_group and then by isite and site, with values being the list of sites to zoom for that serum group:

In [44]:

zoom_sites = {}
for tup in sigsites_by_serumgroup.reset_index().itertuples(index=False):
    isites = set(tup.isite)
    for isite in list(isites):
        for pad in range(-zoom_pad, zoom_pad + 1):
            isites.add(isite + pad)
    # only keep valid sites
    isites = isites.intersection(set(avg_sel_df['isite']))
    isites = sorted(isites)  # get as sorted list
    # get sites corresponding to each isite
    sites = (avg_sel_df
             [['isite', 'site']]
             .drop_duplicates()
             .sort_values('isite')
             .query('isite in @isites')
             ['site']
             .tolist()
             )
    zoom_sites[tup.serum_group] = {'isite': isites,
                                   'site': sites,
                                   'nsites': len(sites)}

Here are the sites we will zoom in on for each serum group:

In [45]:

display(HTML(pd.DataFrame.from_dict(zoom_sites, orient='index').to_html()))

	isite	site	nsites
Hensley_sera	[159, 160, 174, 175, 176, 204, 207, 208, 222, 235, 237, 239, 240, 242, 259]	[144, 145, 159, 160, 161, 189, 192, 193, 207, 220, 222, 224, 225, 227, 244]	15
VIDD_sera	[159, 172, 174, 175, 208, 237, 259]	[144, 157, 159, 160, 193, 222, 244]	7
antibody_lower_head	[68, 69, 72, 95, 96, 97, 98, 136, 137, 203, 225, 235, 259, 274, 405, 422]	[53, 54, 57, 80, 81, 82, 83, 121, 122, 188, 210, 220, 244, 259, (HA2)61, (HA2)78]	16
antibody_region_B	[152, 160, 172, 173, 174, 175, 182, 207, 208, 222, 237, 242, 259, 261]	[137, 145, 157, 158, 159, 160, 167, 192, 193, 207, 222, 227, 244, 246]	14
ferret	[65, 157, 159, 174, 204, 208, 237, 290]	[50, 142, 144, 159, 189, 193, 222, 275]	8
serum_mAb_spike	[95, 136, 174, 175, 208, 259]	[80, 121, 159, 160, 193, 244]	6

Add a column (zoom_site) to avg_sel_df that indicates which sites to zoom in on:

In [46]:

avg_sel_df = pd.concat(
        [df.assign(zoom_site=lambda x: x['isite'].isin(
                            zoom_sites[serum_group]['isite']))
         for serum_group, df in avg_sel_df.groupby('serum_group')],
        ignore_index=True
        )

Write tidy data frame with selection data¶

We now have all the information used to display the data in the tidy data frame avg_sel_df, which has the selection for every mutation for each antibody averaged across this replicates. We write this data frame to a CSV file, getting rid of some unneeded columns:

In [47]:

avg_sel_df_file = os.path.join(config['avgdiffseldir'],
                                'avg_sel_tidy.csv')
print(f"Writing average selection information to {avg_sel_df_file}")

(avg_sel_df
 .query('library == @avg_type')
 .drop(columns=['library', 'name_formatted', 'serum_description',
                'serum_name', 'serum_species'])
 .to_csv(avg_sel_df_file, index=False, float_format='%.5g')
 )

Writing average selection information to results/avgdiffsel/avg_sel_tidy.csv

Compact "zoom" plots¶

For each group of sera we make line plots that show the site-level selection and logo plots that zoom in on mutations at the sites of significant selection. We make these plots using the facet_plot command of dmslogo.

We want to label the logo plots with site numbers and wildtype residue, so first we create a column that contains this information:

In [48]:

avg_sel_df = avg_sel_df.assign(site_label=lambda x: x['wildtype'] +
                               x['site'].astype('str'))

We will use axes with shared ylimits across rows for all plots except for the antibody serum group:

In [49]:

share_ylim_across_rows = {serum_group: ('antibody' not in serum_group)
                          for serum_group in avg_sel_df['serum_group'].unique()}

Now we make the line and logo plots. We also save PDF versions of each plot:

In [50]:

os.makedirs(config['avgdiffsel_zoom_dir'], exist_ok=True)

for serum_group, df in avg_sel_df.groupby('serum_group'):

    plotfile = os.path.join(config['avgdiffsel_zoom_dir'],
                            f"{serum_group}_avg.pdf")
    print(f"\n\n{'*' * 72}\nSerum group {serum_group}, saving to {plotfile}\n")

    fig, axes = dmslogo.facet_plot(
            data=df.query('library == @avg_type'),
            x_col='isite',
            show_col='zoom_site',
            gridrow_col='serum_name_formatted',
            share_xlabel=True,
            share_ylabel=True,
            share_ylim_across_rows=share_ylim_across_rows[serum_group],
            wspace=0.6,
            draw_line_kwargs=dict(
                    height_col='positive_diffsel',
                    xtick_col='site',
                    ylabel='immune selection',
                    ),
            draw_logo_kwargs=dict(
                    letter_col='mutation',
                    letter_height_col='mutdiffsel',
                    xtick_col='site_label',
                    xlabel='site',
                    ylabel='immune selection',
                    clip_negative_heights=True,
                    ),
            )
    display(fig)
    fig.savefig(plotfile)
    plt.close(fig)


************************************************************************
Serum group Hensley_sera, saving to results/avgdiffsel/zoomed_plots/Hensley_sera_avg.pdf


************************************************************************
Serum group VIDD_sera, saving to results/avgdiffsel/zoomed_plots/VIDD_sera_avg.pdf


************************************************************************
Serum group antibody_lower_head, saving to results/avgdiffsel/zoomed_plots/antibody_lower_head_avg.pdf


************************************************************************
Serum group antibody_region_B, saving to results/avgdiffsel/zoomed_plots/antibody_region_B_avg.pdf


************************************************************************
Serum group ferret, saving to results/avgdiffsel/zoomed_plots/ferret_avg.pdf


************************************************************************
Serum group serum_mAb_spike, saving to results/avgdiffsel/zoomed_plots/serum_mAb_spike_avg.pdf

Whole-gene logo plots¶

Finally, we make whole-gene logo plots for each serum that shows the replicate-average selection for all sites. We make these whole-gene plots using dms2_logoplot. They are too large to be useful to show visually in this notebook, but the cell below gives the name of the PDF holding each logo plot so you can examine them individually if you'd like.

In [51]:

outdir = config['avgdiffsel_full_dir']  # save plots here
os.makedirs(outdir, exist_ok=True)

for tup in (avg_selections
            .query('library == @avg_type')
            .itertuples(index=False)
            ):
    name = getattr(tup, "serum_name_formatted")
    plotfile = os.path.join(outdir, f"{name}_diffsel.pdf")
    if os.path.isfile(plotfile) and config['use_existing'] == 'yes':
        print(f"{plotfile} already exists.")
        continue
    datafile = getattr(tup, 'mutdiffsel_file')
    maxsitediffsel = (pd.read_csv(getattr(tup, 'sitediffsel_file'))
                      ['positive_diffsel']
                      .max()
                      )
    barheight = f"{0.5 * maxsitediffsel:.2g}"
    barlabel = f"differential selection = {barheight}"
    cmds = ['dms2_logoplot',
            '--outdir', outdir,
            '--diffsel', datafile,
            '--name', name,
            '--nperline', '95',
            '--overlay1', datafile, 'wildtype', 'wildtype',
            '--underlay', 'yes', 
            '--restrictdiffsel', 'positive',
            '--scalebar', barheight, barlabel,
            '--use_existing', config['use_existing'],
            ]
    print(f"Plotting {name} to {plotfile}")
    subprocess.check_output(cmds)
    assert os.path.isfile(plotfile)

results/avgdiffsel/full_logo_plots/antibody-5A01_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/antibody-3C04_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/antibody-3C06_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/antibody-4C01_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/antibody-4F03_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/antibody-1C04_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-Pitt-1-preinf_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-Pitt-1-postinf_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-Pitt-2-preinf_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-Pitt-2-postinf_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-Pitt-3-preinf_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-Pitt-3-postinf_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-WHO-Perth2009_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/ferret-WHO-Victoria2011_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2010-age-21_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2009-age-53_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2009-age-53-plus-2-months_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2009-age-64_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2009-age-65_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-25-prevacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-25-vacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-29-prevacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-29-vacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-48-prevacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-48-vacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-49-prevacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2015-age-49-vacc_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2009-age-65-with-low-4F03_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2009-age-65-with-mid-4F03_diffsel.pdf already exists.
results/avgdiffsel/full_logo_plots/2009-age-65-with-hi-4F03_diffsel.pdf already exists.

Beta diversity for each group¶

In order to determine how variable the selection is in each group, we compute the beta diversity, using both the Shannon and Simpson indices. We look only at post-vaccination samples, excluding the duplicate 2009 age 53 sample and not examing the spike-in experiment. Here are the results:

In [52]:

for index in ['shannon', 'simpson']:
    print(f"\n{index} beta diversity:")
    print(avg_sel_df
          .query('library == @avg_type')
          .query('serum_vaccination != "pre"')
          .query('serum_name_formatted != "2009-age-53-plus-2-months"')
          .query('serum_group != "serum_mAb_spike"')
          .groupby(['serum_group'])
          .apply(dms_tools2.diffsel.beta_diversity,
                 samplecol='serum',
                 sitecol='site',
                 valcol='positive_diffsel',
                 index=index)
          )

shannon beta diversity:
serum_group
Hensley_sera           1.230070
VIDD_sera              1.196409
antibody_lower_head    1.139691
antibody_region_B      1.452823
ferret                 1.135527
dtype: float64

simpson beta diversity:
serum_group
Hensley_sera           1.600488
VIDD_sera              1.375323
antibody_lower_head    1.379706
antibody_region_B      1.594487
ferret                 1.121852
dtype: float64

Large beta diversity indicates more variability among the samples in a group. As expected, the ferrets have lower diversity than the human sera. This is especially pronounced when using the Simpson index, which emphasizes the effect of strong-selection sites.

Plots of each replicate in averages¶

In the above section, we plotted the average for each serum. Here we also show some data for the replicates that went into this average.

Zoom plots showing each replicate¶

Here we make line and zoomed logo plots for each replicate that goes into the average:

In [53]:

os.makedirs(config['avgdiffsel_reps_dir'], exist_ok=True)

for serum_group, df in avg_sel_df.groupby('serum_group'):

    plotfile = os.path.join(config['avgdiffsel_reps_dir'],
                            f"{serum_group}_reps.pdf")
    svgplotfile = os.path.splitext(plotfile)[0] + '.svg'
    print(f"\n\n{'*' * 72}\n{serum_group}, saving to {plotfile} and {svgplotfile}\n")

    fig, axes = dmslogo.facet_plot(
            data=df.query('library != @avg_type'),
            x_col='isite',
            show_col='zoom_site',
            gridrow_col='serum_name_formatted',
            gridcol_col='library',
            share_xlabel=True,
            share_ylabel=True,
            share_ylim_across_rows=share_ylim_across_rows[serum_group],
            wspace=0.6,
            draw_line_kwargs=dict(
                    height_col='positive_diffsel',
                    xtick_col='site',
                    ylabel='immune selection',
                    ),
            draw_logo_kwargs=dict(
                    letter_col='mutation',
                    letter_height_col='mutdiffsel',
                    xtick_col='site_label',
                    xlabel='site',
                    ylabel='immune selection',
                    clip_negative_heights=True,
                    ),
            )
    display(fig)
    fig.savefig(plotfile)
    fig.savefig(svgplotfile)
    plt.close(fig)


************************************************************************
Hensley_sera, saving to results/avgdiffsel/replicates/Hensley_sera_reps.pdf and results/avgdiffsel/replicates/Hensley_sera_reps.svg


************************************************************************
VIDD_sera, saving to results/avgdiffsel/replicates/VIDD_sera_reps.pdf and results/avgdiffsel/replicates/VIDD_sera_reps.svg


************************************************************************
antibody_lower_head, saving to results/avgdiffsel/replicates/antibody_lower_head_reps.pdf and results/avgdiffsel/replicates/antibody_lower_head_reps.svg


************************************************************************
antibody_region_B, saving to results/avgdiffsel/replicates/antibody_region_B_reps.pdf and results/avgdiffsel/replicates/antibody_region_B_reps.svg


************************************************************************
ferret, saving to results/avgdiffsel/replicates/ferret_reps.pdf and results/avgdiffsel/replicates/ferret_reps.svg


************************************************************************
serum_mAb_spike, saving to results/avgdiffsel/replicates/serum_mAb_spike_reps.pdf and results/avgdiffsel/replicates/serum_mAb_spike_reps.svg

Plot replicate-replicate correlations¶

Now we plot the correlation among the replicates that we retained for each serum. First, we make a tidy data frame with the correlations between all pairs of replicates for the same serum:

In [54]:

corr_df = []

serum_names = avg_sel_df['serum_name_formatted'].unique()
libraries = [lib for lib in avg_sel_df['library'].unique() if lib != avg_type]

for serum_name, serum_sel_df in avg_sel_df.groupby('serum_name_formatted'):

    corr_df.append(
               serum_sel_df
               .query('library in @libraries')
               .pivot_table(values='positive_diffsel',
                            columns='library',
                            index=['site'])
               .corr(method='pearson')
               .reset_index()
               .melt(id_vars='library',
                     var_name='lib_B',
                     value_name='correlation')
               .rename(columns={'library': 'lib_A'})
               .query('lib_A <= lib_B')
               .assign(serum_name=serum_name)
               )

corr_df = (pd.concat(corr_df, ignore_index=True)
           .assign(serum_name=lambda x: pd.Categorical(x['serum_name'],
                                                       serum_names,
                                                       ordered=True),
                   lib_A=lambda x: pd.Categorical(x['lib_A'],
                                                  libraries,
                                                  ordered=True),
                   lib_B=lambda x: pd.Categorical(x['lib_B'],
                                                  reversed(libraries),
                                                  ordered=True),
                   corr_str=lambda x: x['correlation'].apply('{:.2f}'.format)
                   )
           )

print('Here are the first few lines of the tidy correlation data frame:')
display(HTML(corr_df.head().to_html(index=False)))

Here are the first few lines of the tidy correlation data frame:

lib_A	lib_B	correlation	serum_name	corr_str
lib1	lib1	1.000000	antibody-5A01	1.00
lib1	lib2	0.762430	antibody-5A01	0.76
lib2	lib2	1.000000	antibody-5A01	1.00
lib1	lib3	0.705409	antibody-5A01	0.71
lib2	lib3	0.717227	antibody-5A01	0.72

Now we use plotnine to plot a correlation matrix for each serum, and then save to a file as well as showing it:

In [55]:

ncol = 5
nsera = len(corr_df['serum_name'].unique())

corr_plot = (
    ggplot(corr_df, aes('lib_A', 'lib_B',
                        fill='correlation', label='corr_str')) +
    geom_tile(color='white', size=0.5) +
    geom_text() +
    facet_wrap('~ serum_name', ncol=ncol) +
    theme(figure_size=(2.55 * ncol, 2.55 * math.ceil(nsera / ncol)),
          panel_grid_major=element_blank()
          ) +
    scale_fill_continuous(limits=(0, 1)) +
    xlab('') +
    ylab('')
    )

_ = corr_plot.draw()

rep_corr_plot = os.path.join(config['avgdiffsel_reps_dir'], 'rep_corr_matrix.pdf')
print(f"Saving plot to {rep_corr_plot}")
corr_plot.save(rep_corr_plot)

Saving plot to results/avgdiffsel/replicates/rep_corr_matrix.pdf

Customized figures for paper¶

In the following section, we will generate additional customized paper figures not created above.

These go in the following directory:

In [56]:

print(f"Putting additional figures in {config['figsdir']}")
os.makedirs(config['figsdir'], exist_ok=True)

Putting additional figures in results/figures

We will save each of these figures with the following extensions:

In [57]:

fig_extensions = ['.pdf', '.svg']

Logo and line plot figures¶

Now we are going to make versions of the zoomed logo plots that are slightly different than the ones above. We read in colors for specific mutations from a YAML file; these colors match those used to plot the neutralization curves. We also plot a slightly different set of sites: not just the "significant" ones, but also others of interest (these are sites of strong selection in ferrets).

First, get data frames with sites to zoom on and how to color them, and how to color specific mutations:

In [58]:

with open(config['figure_config']) as f:
    fig_config = yaml.safe_load(f)

zoom_sites = []
mutation_colors = []
for figure, fig_d in fig_config['figures'].items():
    # process zoom sites
    site_color_map = collections.defaultdict(lambda: fig_config['default_logo_color'])
    if 'site_colors' in fig_d:
        for r, c in fig_d['site_colors'].items():
            site_color_map[str(r)] = c
    zoom_sites.append(
        pd.DataFrame({'figure': figure,
                      'sera': fig_d['sera'],
                      'dummy': 0})
        .merge(pd.DataFrame({'site': avg_sel_df['site'].unique(),
                             'dummy': 0}))
        .drop(columns='dummy')
        .assign(zoom=lambda x: x['site'].isin([str(r) for r in fig_d['sites']]),
                color=lambda x: x['site'].map(site_color_map)
                )
        )
    # process colors
    for serum in fig_d['sera']:
        if ('ignore_serum_virus' in fig_d) and (serum in fig_d['ignore_serum_virus']):
            ignore_muts = fig_d['ignore_serum_virus'][serum]
        else:
            ignore_muts = []
        muts = [mut for mut in fig_d['colors'] if (mut not in {'wt', 'syn'}) and
                                                  (mut not in ignore_muts)]
        mutation_colors.append(pd.DataFrame(
                                {'figure': figure,
                                 'sera': serum,
                                 'site': [mut[1: -1] for mut in muts],
                                 'mutation': [mut[-1] for mut in muts],
                                 'color': [fig_d['colors'][mut][0] for mut in muts],
                                 }))

zoom_sites = pd.concat(zoom_sites)
mutation_colors = pd.concat(mutation_colors)

Merge the zoom and mutation-color data frames with the data on all of the immune selection values:

In [59]:

colored_zoom_df = (
    avg_sel_df
    .query('library == @avg_type')
    [['serum_name_formatted', 'isite', 'site', 'mutation',
      'positive_diffsel', 'mutdiffsel', 'site_label']]
    .rename(columns={'serum_name_formatted': 'sera'})
    .merge(zoom_sites,
           on=['sera', 'site'],
           how='left')
    .assign(zoom=lambda x: x['zoom'].fillna(False))
    .merge(mutation_colors,
           on=['figure', 'sera', 'site', 'mutation'],
           how='left')
    .assign(color=lambda x: x['color_x'].where(x['color_y'].isna(), x['color_y']))
    .drop(columns=['color_x', 'color_y'])
    )

Make the zoomed logo plots:

In [60]:

for figure, df in colored_zoom_df.groupby('figure'):
        
    if 'sera_names' in fig_config['figures'][figure]:
        name_map = fig_config['figures'][figure]['sera_names']
    else:
        name_map = {s: s.replace('-', ' ') for s in df.sera.unique()}
    df = df.assign(sera_names=lambda x: pd.Categorical(x.sera.map(name_map),
                                                       x.sera.map(name_map).unique(),
                                                       ordered=True))
        
    fig, axes = dmslogo.facet_plot(
            data=df,
            x_col='isite',
            show_col='zoom',
            gridrow_col='sera_names',
            share_xlabel=True,
            share_ylabel=True,
            share_ylim_across_rows=share_ylim_across_rows[serum_group],
            wspace=0.6,
            rmargin=0.7,
            draw_line_kwargs=dict(
                    height_col='positive_diffsel',
                    xtick_col='site',
                    ylabel='immune selection',
                    show_color=PALETTE[-2],
                    ),
            draw_logo_kwargs=dict(
                    letter_col='mutation',
                    letter_height_col='mutdiffsel',
                    color_col='color',
                    xtick_col='site_label',
                    xlabel='site',
                    ylabel='immune selection',
                    clip_negative_heights=True,
                    ),
            )
    display(fig)
    for ext in fig_extensions:
        plotfile = os.path.join(config['figsdir'], f"{figure}_logo{ext}")
        print(f"Saving figure to {plotfile}")
        fig.savefig(plotfile)
    plt.close(fig)

Saving figure to results/figures/2009_age_53_samples_logo.pdf
Saving figure to results/figures/2009_age_53_samples_logo.svg

Saving figure to results/figures/Hensley_sera_logo.pdf
Saving figure to results/figures/Hensley_sera_logo.svg

Saving figure to results/figures/VIDD_sera_logo.pdf
Saving figure to results/figures/VIDD_sera_logo.svg

Saving figure to results/figures/antibody_lower_head_logo.pdf
Saving figure to results/figures/antibody_lower_head_logo.svg

Saving figure to results/figures/antibody_region_B_logo.pdf
Saving figure to results/figures/antibody_region_B_logo.svg

Saving figure to results/figures/antibody_spikein_logo.pdf
Saving figure to results/figures/antibody_spikein_logo.svg

Saving figure to results/figures/ferret_logo.pdf
Saving figure to results/figures/ferret_logo.svg

Replicate-to-replicate correlations¶

Plot replicate-replicate correlations for the sera in each figure:

In [61]:

for figure, fig_d in fig_config['figures'].items():

    sera = fig_d['sera']
    
    if len(sera) == 5:
        ncol = 5
    else:
        ncol = 4

    fig_corr_df = (corr_df
                   .query('serum_name in @sera')
                   .assign(serum_name=lambda x: x['serum_name'].str.replace('-', ' '))
                   )
    fig_corr_plot = (
        ggplot(fig_corr_df, aes('lib_A', 'lib_B',
                            fill='correlation', label='corr_str')) +
        geom_tile(color='white', size=0.5) +
        geom_text() +
        facet_wrap('~ serum_name', ncol=ncol) +
        theme(figure_size=(2.55 * min(len(sera), ncol), 2.55 * math.ceil(len(sera) / ncol)),
              panel_grid_major=element_blank()
              ) +
        scale_fill_continuous(limits=(0, 1)) +
        xlab('') +
        ylab('')
        )

    print(fig_corr_plot)
    
    for ext in fig_extensions:
        plotfile = os.path.join(config['figsdir'], f"{figure}_rep_corr{ext}")
        print(f"Saving figure to {plotfile}")
        fig_corr_plot.save(plotfile)

<ggplot: (8744359022719)>
Saving figure to results/figures/VIDD_sera_rep_corr.pdf
Saving figure to results/figures/VIDD_sera_rep_corr.svg

<ggplot: (8744358702870)>
Saving figure to results/figures/2009_age_53_samples_rep_corr.pdf
Saving figure to results/figures/2009_age_53_samples_rep_corr.svg

<ggplot: (8744656139766)>
Saving figure to results/figures/Hensley_sera_rep_corr.pdf
Saving figure to results/figures/Hensley_sera_rep_corr.svg

<ggplot: (8744399698780)>
Saving figure to results/figures/ferret_rep_corr.pdf
Saving figure to results/figures/ferret_rep_corr.svg

<ggplot: (8744655353541)>
Saving figure to results/figures/antibody_region_B_rep_corr.pdf
Saving figure to results/figures/antibody_region_B_rep_corr.svg

<ggplot: (8744359031374)>
Saving figure to results/figures/antibody_lower_head_rep_corr.pdf
Saving figure to results/figures/antibody_lower_head_rep_corr.svg

<ggplot: (-9223363292486854961)>
Saving figure to results/figures/antibody_spikein_rep_corr.pdf
Saving figure to results/figures/antibody_spikein_rep_corr.svg

Percent infectivity for each replicate¶

Plot the percent infectivity remaining for the library for each selection.

In [62]:

def format_label(x):
    if x == 100:
        lab = '100'
    else:
        lab = '{0:#.2g}'.format(x)
    if lab[-1] == '.':
        lab = lab[: -1]
    return lab + '%'
        

for figure, fig_d in fig_config['figures'].items():

    sera = fig_d['sera']

    df = (selections
          .query('retained')
          .query('serum_name_formatted in @sera')
          [['serum_name_formatted', 'library', 'percent_infectivity']]
          .assign(serum=lambda x: x['serum_name_formatted'].str.replace('-', ' '),
                  label=lambda x: x['percent_infectivity'].apply(format_label)
                  )
          )
    ymax = max(100, 10**math.ceil(math.log10(df['percent_infectivity'].max())))
    ymin = min(0.1, 10**math.floor(math.log10(df['percent_infectivity'].min())))
    yextent = math.log10(ymax / ymin)
    if df['percent_infectivity'].max() > 50:
        nudge_y = -0.1 * yextent
    else:
        nudge_y = 0.1 * yextent
    if len(sera) == 5:
        ncol = 5
    else:
        ncol = 4
        
    plot_percent_infectivity = (
        ggplot(df, aes('library', 'percent_infectivity', label='label', color='library')) +
        geom_point(size=3) +
        geom_text(size=10, nudge_y=nudge_y) +
        facet_wrap('~ serum', ncol=ncol) +
        scale_y_log10(limits=(ymin, ymax), name='percent infectivity') +
        theme(figure_size=(2.55 * min(ncol, len(sera)), 2.55 * math.ceil(len(sera) / ncol))) +
        scale_color_manual(values=PALETTE[1: ], guide=False)
        )
    print(plot_percent_infectivity)
    
    for ext in fig_extensions:
        plotfile = os.path.join(config['figsdir'], f"{figure}_percent_infectivity{ext}")
        print(f"Saving figure to {plotfile}")
        plot_percent_infectivity.save(plotfile)

<ggplot: (-9223363292500806326)>
Saving figure to results/figures/VIDD_sera_percent_infectivity.pdf
Saving figure to results/figures/VIDD_sera_percent_infectivity.svg

<ggplot: (-9223363292196862478)>
Saving figure to results/figures/2009_age_53_samples_percent_infectivity.pdf
Saving figure to results/figures/2009_age_53_samples_percent_infectivity.svg

<ggplot: (8744664887837)>
Saving figure to results/figures/Hensley_sera_percent_infectivity.pdf
Saving figure to results/figures/Hensley_sera_percent_infectivity.svg

<ggplot: (-9223363292196645544)>
Saving figure to results/figures/ferret_percent_infectivity.pdf
Saving figure to results/figures/ferret_percent_infectivity.svg

<ggplot: (8744665275007)>
Saving figure to results/figures/antibody_region_B_percent_infectivity.pdf
Saving figure to results/figures/antibody_region_B_percent_infectivity.svg

<ggplot: (8744405357695)>
Saving figure to results/figures/antibody_lower_head_percent_infectivity.pdf
Saving figure to results/figures/antibody_lower_head_percent_infectivity.svg

<ggplot: (8744367265649)>
Saving figure to results/figures/antibody_spikein_percent_infectivity.pdf
Saving figure to results/figures/antibody_spikein_percent_infectivity.svg

In [ ]:

Table of Contents