10X Metadata¶

We combine some cell metadata with TCR data, and export the results.

Loading packages and data¶

In [1]:

import pandas as pd  # Pandas for data analysis.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt  # For basic plotting.
import seaborn as sns # For pretty visualization in Seaborn.  See https://seaborn.pydata.org/
from IPython.display import display # Pretty display of data frames.

from sklearn import base
from sklearn.feature_selection import chi2, f_classif

import scanpy as sc
sc.settings.verbosity = 1             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

# Put plots inline rather than in a pop-up.
%matplotlib inline

scanpy==1.7.1 anndata==0.7.6 umap==0.5.1 numpy==1.20.1 scipy==1.6.1 pandas==1.2.3 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3

In [2]:

def hrule(repchar = '=', length=80):
    '''
    A quick function to print a horizontal line.
    '''
    if len(repchar) == 1:
        print(repchar*length)

First we load the TCR data for all three experiments.

In [3]:

annot_df = {}
experiments = ['exp1', 'exp2', 'exp3']
for exp in experiments:
    annot_df[exp] = pd.read_csv('Raw/Final_TCR_xls_AllExps_10x_CJAug_{}.csv'.format(exp), index_col=0)
    print('{} TCR records loaded for {}'.format(len(annot_df[exp]), exp))

22435 TCR records loaded for exp1
15139 TCR records loaded for exp2
28104 TCR records loaded for exp3

In [4]:

meta_all = pd.read_csv('Processed/Allcells_metadata.csv', index_col=0)
meta_all.head()

Out[4]:

	clone_ID	hash_ID	nCount_CH	nCount_RNA	nFeature_CH	nFeature_RNA	orig_ident	percent_mt	clone	well	experiment	CD_type
CellID
exp1_AAACCTGAGACAGGCT-1	exp1_2	exp1_Sample2	1028.0	6876.0	6	2669	exp1	3.955788	2	Sample2	exp1	CD4
exp1_AAACCTGAGCCCAATT-1	exp1_non	exp1_Sample1	2166.0	6246.0	6	2368	exp1	2.161383	non	Sample1	exp1	unknown
exp1_AAACCTGAGGACAGAA-1	exp1_non	exp1_Sample6	813.0	4615.0	6	1718	exp1	2.665222	non	Sample6	exp1	unknown
exp1_AAACCTGCAACAACCT-1	exp1_5	exp1_Sample5	1092.0	11565.0	6	3613	exp1	2.075227	5	Sample5	exp1	CD4
exp1_AAACCTGCAGACACTT-1	exp1_7	exp1_Sample4	847.0	4440.0	6	2005	exp1	0.990991	7	Sample4	exp1	CD4

In [5]:

meta_df = {}
for exp in experiments:
    meta_df[exp] = meta_all[meta_all.orig_ident == exp]
    print('Metadata loaded for {} cells in {}'.format(len(meta_df[exp]), exp))

Metadata loaded for 13493 cells in exp1
Metadata loaded for 11143 cells in exp2
Metadata loaded for 22277 cells in exp3

These numbers of cells would be half of the number of TCR records, if each cell had one TCRA and one TCRB record. But that's not the case, and some more cleanup of data is needed.

Cleaning up data¶

First, we remove cells from the metadata with repeated barcodes.

In [6]:

for exp in experiments:
    cells_meta = pd.Series([cn.split('_')[1] for cn in meta_df[exp].index])
    vc = cells_meta.value_counts()
    good_barcodes = vc[vc == 1].index # Omit barcodes that appear more than once.
    print('{} rows in cell metadata.  {} with unique barcodes'.format(len(cells_meta), len(good_barcodes)))
    meta_df[exp].index = cells_meta
    meta_df[exp] = meta_df[exp].loc[good_barcodes]
    print('Only unique barcodes remain in {}.'.format(exp))

13493 rows in cell metadata.  13493 with unique barcodes
Only unique barcodes remain in exp1.
11143 rows in cell metadata.  11143 with unique barcodes
Only unique barcodes remain in exp2.
22277 rows in cell metadata.  22277 with unique barcodes
Only unique barcodes remain in exp3.

This is good. No repeated barcodes within an experiment. There might be some across different experiments.

In [7]:

meta_df[exp].head()

Out[7]:

	clone_ID	hash_ID	nCount_CH	nCount_RNA	nFeature_CH	nFeature_RNA	orig_ident	percent_mt	clone	well	experiment	CD_type
CACAGGCCATCGGAAG-1	exp3_non	exp3_Sample2	3172.0	4911.0	6	2377	exp3	1.323559	non	Sample2	exp3	unknown
ATCGAGTCACGAAAGC-1	exp3_4	exp3_Sample1	901.0	8349.0	5	3155	exp3	1.593005	4	Sample1	exp3	CD4
CTAGTGAAGGGTATCG-1	exp3_8	exp3_Sample4	308.0	2927.0	6	1496	exp3	1.947386	8	Sample4	exp3	CD4
TGACTAGTCATCGCTC-1	exp3_1	exp3_Sample5	308.0	1591.0	6	1100	exp3	0.817096	1	Sample5	exp3	CD4
CGTCACTTCATCATTC-1	exp3_non	exp3_Sample6	436.0	2766.0	6	1509	exp3	0.795372	non	Sample6	exp3	unknown

Extract TCRA and TCRB data¶

In [8]:

for exp in experiments:
    a_df = annot_df[exp]
    for TR_chain in ['TRA', 'TRB']:
        print('Processing {}, {} sequences'.format(exp, TR_chain))
        TR_df = a_df[a_df.chain == TR_chain]
        TR_agg = TR_df.groupby(level=0).agg(list)['cdr3'].apply(pd.Series) # Trick to aggregate with multiple TRA seqs.
        #TR_agg.columns = [TR_chain + '_seq1', TR_chain + '_seq2']
        for j in TR_agg.columns:
            meta_df[exp][TR_chain+'_seq'+str(j+1)] = TR_agg[j]
            print('{} cells with at least {} {} sequences'.format(len(TR_agg[j].dropna()), j+1, TR_chain))
        hrule()
        
        

Processing exp1, TRA sequences
10658 cells with at least 1 TRA sequences
542 cells with at least 2 TRA sequences
================================================================================
Processing exp1, TRB sequences
11235 cells with at least 1 TRB sequences
================================================================================
Processing exp2, TRA sequences
7175 cells with at least 1 TRA sequences
184 cells with at least 2 TRA sequences
================================================================================
Processing exp2, TRB sequences
7780 cells with at least 1 TRB sequences
================================================================================
Processing exp3, TRA sequences
12903 cells with at least 1 TRA sequences
912 cells with at least 2 TRA sequences
================================================================================
Processing exp3, TRB sequences
14289 cells with at least 1 TRB sequences
================================================================================

In [9]:

meta_df[exp]

Out[9]:

	clone_ID	hash_ID	nCount_CH	nCount_RNA	nFeature_CH	nFeature_RNA	orig_ident	percent_mt	clone	well	experiment	CD_type	TRA_seq1	TRA_seq2	TRB_seq1
CACAGGCCATCGGAAG-1	exp3_non	exp3_Sample2	3172.0	4911.0	6	2377	exp3	1.323559	non	Sample2	exp3	unknown	NaN	NaN	NaN
ATCGAGTCACGAAAGC-1	exp3_4	exp3_Sample1	901.0	8349.0	5	3155	exp3	1.593005	4	Sample1	exp3	CD4	CAVSDRGAGGFKTIF	CAEASLLSGTYKYIF	CASSPRDRATGELFF
CTAGTGAAGGGTATCG-1	exp3_8	exp3_Sample4	308.0	2927.0	6	1496	exp3	1.947386	8	Sample4	exp3	CD4	CATDQAGTALIF	NaN	CASSLVGVGADQPQHF
TGACTAGTCATCGCTC-1	exp3_1	exp3_Sample5	308.0	1591.0	6	1100	exp3	0.817096	1	Sample5	exp3	CD4	CALSESLNNNARLMF	NaN	CASSEGGKSGIVYEQYF
CGTCACTTCATCATTC-1	exp3_non	exp3_Sample6	436.0	2766.0	6	1509	exp3	0.795372	non	Sample6	exp3	unknown	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ACGATGTCAGCTGTTA-1	exp3_3	exp3_Sample4	249.0	2709.0	6	1566	exp3	1.291990	3	Sample4	exp3	CD4	CAMSSGGSNYKLTF	NaN	CSASSGIQPQHF
TGAGGGAAGTGGGTTG-1	exp3_8	exp3_Sample3	723.0	3927.0	6	1724	exp3	0.713012	8	Sample3	exp3	CD4	CATDQAGTALIF	NaN	CASSLVGVGADQPQHF
GAATGAATCTGTCAAG-1	exp3_non	exp3_Sample2	1554.0	1569.0	5	1133	exp3	5.544933	non	Sample2	exp3	unknown	NaN	NaN	NaN
GTAACGTTCACAAACC-1	exp3_non	exp3_Sample1	585.0	873.0	6	663	exp3	8.820160	non	Sample1	exp3	unknown	NaN	NaN	NaN
CGAATGTTCAGTTGAC-1	exp3_12	exp3_Sample2	2406.0	5952.0	6	2688	exp3	1.108871	12	Sample2	exp3	CD4	CILRGYTGNQFYF	NaN	CASSPNRDVGSGYTF

22277 rows × 15 columns

In [10]:

for exp in experiments:
    print('Checking {}'.format(exp))
    m_df = meta_df[exp]
    no_clone = m_df.clone_ID == exp+'_non'
    no_TCRA = m_df.TRA_seq1.isna()
    no_TCRB = m_df.TRB_seq1.isna()
    print('{} cells with no clone, having TCRA or TCRB'.format((no_clone & ((~no_TCRA) | (~no_TCRB)) ).sum()))
    print('{} cells with clone, but no TCRA and no TCRB'.format( ((~no_clone) & no_TCRA & no_TCRB).sum() ))
    hrule()

Checking exp1
3 cells with no clone, having TCRA or TCRB
122 cells with clone, but no TCRA and no TCRB
================================================================================
Checking exp2
0 cells with no clone, having TCRA or TCRB
182 cells with clone, but no TCRA and no TCRB
================================================================================
Checking exp3
1 cells with no clone, having TCRA or TCRB
106 cells with clone, but no TCRA and no TCRB
================================================================================

That's not perfect... but not so bad I think!

Saving data!¶

In [11]:

for exp in experiments:
    meta_df[exp].to_csv('Processed/metadata_withTR_{}.csv'.format(exp))

In [ ]: