We combine some cell metadata with TCR data, and export the results.
import pandas as pd # Pandas for data analysis.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt # For basic plotting.
import seaborn as sns # For pretty visualization in Seaborn. See https://seaborn.pydata.org/
from IPython.display import display # Pretty display of data frames.
from sklearn import base
from sklearn.feature_selection import chi2, f_classif
import scanpy as sc
sc.settings.verbosity = 1 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')
# Put plots inline rather than in a pop-up.
%matplotlib inline
scanpy==1.7.1 anndata==0.7.6 umap==0.5.1 numpy==1.20.1 scipy==1.6.1 pandas==1.2.3 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
def hrule(repchar = '=', length=80):
'''
A quick function to print a horizontal line.
'''
if len(repchar) == 1:
print(repchar*length)
First we load the TCR data for all three experiments.
annot_df = {}
experiments = ['exp1', 'exp2', 'exp3']
for exp in experiments:
annot_df[exp] = pd.read_csv('Raw/Final_TCR_xls_AllExps_10x_CJAug_{}.csv'.format(exp), index_col=0)
print('{} TCR records loaded for {}'.format(len(annot_df[exp]), exp))
22435 TCR records loaded for exp1 15139 TCR records loaded for exp2 28104 TCR records loaded for exp3
meta_all = pd.read_csv('Processed/Allcells_metadata.csv', index_col=0)
meta_all.head()
clone_ID | hash_ID | nCount_CH | nCount_RNA | nFeature_CH | nFeature_RNA | orig_ident | percent_mt | clone | well | experiment | CD_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CellID | ||||||||||||
exp1_AAACCTGAGACAGGCT-1 | exp1_2 | exp1_Sample2 | 1028.0 | 6876.0 | 6 | 2669 | exp1 | 3.955788 | 2 | Sample2 | exp1 | CD4 |
exp1_AAACCTGAGCCCAATT-1 | exp1_non | exp1_Sample1 | 2166.0 | 6246.0 | 6 | 2368 | exp1 | 2.161383 | non | Sample1 | exp1 | unknown |
exp1_AAACCTGAGGACAGAA-1 | exp1_non | exp1_Sample6 | 813.0 | 4615.0 | 6 | 1718 | exp1 | 2.665222 | non | Sample6 | exp1 | unknown |
exp1_AAACCTGCAACAACCT-1 | exp1_5 | exp1_Sample5 | 1092.0 | 11565.0 | 6 | 3613 | exp1 | 2.075227 | 5 | Sample5 | exp1 | CD4 |
exp1_AAACCTGCAGACACTT-1 | exp1_7 | exp1_Sample4 | 847.0 | 4440.0 | 6 | 2005 | exp1 | 0.990991 | 7 | Sample4 | exp1 | CD4 |
meta_df = {}
for exp in experiments:
meta_df[exp] = meta_all[meta_all.orig_ident == exp]
print('Metadata loaded for {} cells in {}'.format(len(meta_df[exp]), exp))
Metadata loaded for 13493 cells in exp1 Metadata loaded for 11143 cells in exp2 Metadata loaded for 22277 cells in exp3
These numbers of cells would be half of the number of TCR records, if each cell had one TCRA and one TCRB record. But that's not the case, and some more cleanup of data is needed.
First, we remove cells from the metadata with repeated barcodes.
for exp in experiments:
cells_meta = pd.Series([cn.split('_')[1] for cn in meta_df[exp].index])
vc = cells_meta.value_counts()
good_barcodes = vc[vc == 1].index # Omit barcodes that appear more than once.
print('{} rows in cell metadata. {} with unique barcodes'.format(len(cells_meta), len(good_barcodes)))
meta_df[exp].index = cells_meta
meta_df[exp] = meta_df[exp].loc[good_barcodes]
print('Only unique barcodes remain in {}.'.format(exp))
13493 rows in cell metadata. 13493 with unique barcodes Only unique barcodes remain in exp1. 11143 rows in cell metadata. 11143 with unique barcodes Only unique barcodes remain in exp2. 22277 rows in cell metadata. 22277 with unique barcodes Only unique barcodes remain in exp3.
This is good. No repeated barcodes within an experiment. There might be some across different experiments.
meta_df[exp].head()
clone_ID | hash_ID | nCount_CH | nCount_RNA | nFeature_CH | nFeature_RNA | orig_ident | percent_mt | clone | well | experiment | CD_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CACAGGCCATCGGAAG-1 | exp3_non | exp3_Sample2 | 3172.0 | 4911.0 | 6 | 2377 | exp3 | 1.323559 | non | Sample2 | exp3 | unknown |
ATCGAGTCACGAAAGC-1 | exp3_4 | exp3_Sample1 | 901.0 | 8349.0 | 5 | 3155 | exp3 | 1.593005 | 4 | Sample1 | exp3 | CD4 |
CTAGTGAAGGGTATCG-1 | exp3_8 | exp3_Sample4 | 308.0 | 2927.0 | 6 | 1496 | exp3 | 1.947386 | 8 | Sample4 | exp3 | CD4 |
TGACTAGTCATCGCTC-1 | exp3_1 | exp3_Sample5 | 308.0 | 1591.0 | 6 | 1100 | exp3 | 0.817096 | 1 | Sample5 | exp3 | CD4 |
CGTCACTTCATCATTC-1 | exp3_non | exp3_Sample6 | 436.0 | 2766.0 | 6 | 1509 | exp3 | 0.795372 | non | Sample6 | exp3 | unknown |
for exp in experiments:
a_df = annot_df[exp]
for TR_chain in ['TRA', 'TRB']:
print('Processing {}, {} sequences'.format(exp, TR_chain))
TR_df = a_df[a_df.chain == TR_chain]
TR_agg = TR_df.groupby(level=0).agg(list)['cdr3'].apply(pd.Series) # Trick to aggregate with multiple TRA seqs.
#TR_agg.columns = [TR_chain + '_seq1', TR_chain + '_seq2']
for j in TR_agg.columns:
meta_df[exp][TR_chain+'_seq'+str(j+1)] = TR_agg[j]
print('{} cells with at least {} {} sequences'.format(len(TR_agg[j].dropna()), j+1, TR_chain))
hrule()
Processing exp1, TRA sequences 10658 cells with at least 1 TRA sequences 542 cells with at least 2 TRA sequences ================================================================================ Processing exp1, TRB sequences 11235 cells with at least 1 TRB sequences ================================================================================ Processing exp2, TRA sequences 7175 cells with at least 1 TRA sequences 184 cells with at least 2 TRA sequences ================================================================================ Processing exp2, TRB sequences 7780 cells with at least 1 TRB sequences ================================================================================ Processing exp3, TRA sequences 12903 cells with at least 1 TRA sequences 912 cells with at least 2 TRA sequences ================================================================================ Processing exp3, TRB sequences 14289 cells with at least 1 TRB sequences ================================================================================
meta_df[exp]
clone_ID | hash_ID | nCount_CH | nCount_RNA | nFeature_CH | nFeature_RNA | orig_ident | percent_mt | clone | well | experiment | CD_type | TRA_seq1 | TRA_seq2 | TRB_seq1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CACAGGCCATCGGAAG-1 | exp3_non | exp3_Sample2 | 3172.0 | 4911.0 | 6 | 2377 | exp3 | 1.323559 | non | Sample2 | exp3 | unknown | NaN | NaN | NaN |
ATCGAGTCACGAAAGC-1 | exp3_4 | exp3_Sample1 | 901.0 | 8349.0 | 5 | 3155 | exp3 | 1.593005 | 4 | Sample1 | exp3 | CD4 | CAVSDRGAGGFKTIF | CAEASLLSGTYKYIF | CASSPRDRATGELFF |
CTAGTGAAGGGTATCG-1 | exp3_8 | exp3_Sample4 | 308.0 | 2927.0 | 6 | 1496 | exp3 | 1.947386 | 8 | Sample4 | exp3 | CD4 | CATDQAGTALIF | NaN | CASSLVGVGADQPQHF |
TGACTAGTCATCGCTC-1 | exp3_1 | exp3_Sample5 | 308.0 | 1591.0 | 6 | 1100 | exp3 | 0.817096 | 1 | Sample5 | exp3 | CD4 | CALSESLNNNARLMF | NaN | CASSEGGKSGIVYEQYF |
CGTCACTTCATCATTC-1 | exp3_non | exp3_Sample6 | 436.0 | 2766.0 | 6 | 1509 | exp3 | 0.795372 | non | Sample6 | exp3 | unknown | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ACGATGTCAGCTGTTA-1 | exp3_3 | exp3_Sample4 | 249.0 | 2709.0 | 6 | 1566 | exp3 | 1.291990 | 3 | Sample4 | exp3 | CD4 | CAMSSGGSNYKLTF | NaN | CSASSGIQPQHF |
TGAGGGAAGTGGGTTG-1 | exp3_8 | exp3_Sample3 | 723.0 | 3927.0 | 6 | 1724 | exp3 | 0.713012 | 8 | Sample3 | exp3 | CD4 | CATDQAGTALIF | NaN | CASSLVGVGADQPQHF |
GAATGAATCTGTCAAG-1 | exp3_non | exp3_Sample2 | 1554.0 | 1569.0 | 5 | 1133 | exp3 | 5.544933 | non | Sample2 | exp3 | unknown | NaN | NaN | NaN |
GTAACGTTCACAAACC-1 | exp3_non | exp3_Sample1 | 585.0 | 873.0 | 6 | 663 | exp3 | 8.820160 | non | Sample1 | exp3 | unknown | NaN | NaN | NaN |
CGAATGTTCAGTTGAC-1 | exp3_12 | exp3_Sample2 | 2406.0 | 5952.0 | 6 | 2688 | exp3 | 1.108871 | 12 | Sample2 | exp3 | CD4 | CILRGYTGNQFYF | NaN | CASSPNRDVGSGYTF |
22277 rows × 15 columns
for exp in experiments:
print('Checking {}'.format(exp))
m_df = meta_df[exp]
no_clone = m_df.clone_ID == exp+'_non'
no_TCRA = m_df.TRA_seq1.isna()
no_TCRB = m_df.TRB_seq1.isna()
print('{} cells with no clone, having TCRA or TCRB'.format((no_clone & ((~no_TCRA) | (~no_TCRB)) ).sum()))
print('{} cells with clone, but no TCRA and no TCRB'.format( ((~no_clone) & no_TCRA & no_TCRB).sum() ))
hrule()
Checking exp1 3 cells with no clone, having TCRA or TCRB 122 cells with clone, but no TCRA and no TCRB ================================================================================ Checking exp2 0 cells with no clone, having TCRA or TCRB 182 cells with clone, but no TCRA and no TCRB ================================================================================ Checking exp3 1 cells with no clone, having TCRA or TCRB 106 cells with clone, but no TCRA and no TCRB ================================================================================
That's not perfect... but not so bad I think!
for exp in experiments:
meta_df[exp].to_csv('Processed/metadata_withTR_{}.csv'.format(exp))