Notebook

In this notebook, I prepare human antibody structures from SAbDab (the Structural Antibody Database) for multimodal pre-training.

Goals

Download human antibody structures with resolution 2.5Å or better.
Use proteinflow to filter sequences for quality, cluster sequences, and split into train/valid/test.

Setup

In [ ]:

# Import necessary libraries
from pathlib import Path
import os

In [ ]:

!pip install proteinflow &> /dev/null
!apt-get install -qq -y mmseqs2 &> /dev/null

In [ ]:

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

path = Path("/content/gdrive/")
path_data = Path("/content/gdrive/MyDrive/data")

Mounted at /content/gdrive

In [ ]:

import pandas as pd

from slugify import slugify

In [ ]:

#!proteinflow generate --help

SAbDab

Download human antibody structures with a resolution of 2.5 Å or better. The resulting set contains structures resolved by either X-ray crystallography or cryo-electron microscopy.

In [ ]:

# SAbDab search results: species Homo sapiens, resolution cutoff 2.5 Å
sabdab_summary_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/'
sabdab_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/'
fname = slugify(sabdab_summary_url.split('/')[-2], lowercase=False)
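To see what the tag looks like, the sketch below reproduces the string handling with the standard library only. It assumes, judging from the summary file name in the download output (20240520-0899946_summary.tsv), that slugify turns the underscore in the search ID into a hyphen. The helper names `search_id_from_url` and `tag_from_search_id` are hypothetical and are not part of SAbDab or python-slugify.

```python
def search_id_from_url(url: str) -> str:
    """Return the last non-empty path segment of a SAbDab search URL."""
    return url.rstrip("/").split("/")[-1]


def tag_from_search_id(search_id: str) -> str:
    """Approximate the slugified tag: underscores become hyphens (assumption)."""
    return search_id.replace("_", "-")


url = "https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/"
tag = tag_from_search_id(search_id_from_url(url))
print(tag)
```

The resulting tag is reused for the summary file name, the archive name, and the ProteinFlow dataset tag.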

In [ ]:

# The SAbDab search URL is session-specific: run a new search and update the URL before every download
!wget {sabdab_summary_url} -O {path_data}/{fname}_summary.tsv

--2024-05-20 18:11:19--  https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/
Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)... 163.1.32.59
Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)|163.1.32.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1050365 (1.0M) [text/tab-separated-values]
Saving to: ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’

/content/gdrive/MyD 100%[===================>]   1.00M  2.20MB/s    in 0.5s    

2024-05-20 18:11:21 (2.20 MB/s) - ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’ saved [1050365/1050365]

In [ ]:

# The SAbDab archive URL is session-specific: run a new search and update the URL before every download
!wget {sabdab_url} -O {path_data}/{fname}.zip

--2024-05-20 18:01:35--  https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/
Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)... 163.1.32.59
Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)|163.1.32.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4002463609 (3.7G) [application/zip]
Saving to: ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’

/content/gdrive/MyD 100%[===================>]   3.73G  30.5MB/s    in 3m 23s  

2024-05-20 18:05:24 (18.8 MB/s) - ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’ saved [4002463609/4002463609]

In [ ]:

!ls {path_data}

20240520-0899946_summary.tsv  20240520-0899946.zip
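Before processing the archive, it can help to confirm that the summary file respects the resolution cutoff. The sketch below is a minimal check using only the standard library. The column names `pdb`, `resolution`, and `method` are assumptions about the summary layout, and the sample rows are made up for illustration.

```python
import csv
import io


def load_summary(handle):
    """Read a tab-separated summary into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def resolution_outliers(rows, max_resolution=2.5):
    """Return PDB ids whose resolution is missing, non-numeric, or above the cutoff."""
    bad = []
    for row in rows:
        try:
            ok = float(row["resolution"]) <= max_resolution
        except (KeyError, TypeError, ValueError):
            ok = False
        if not ok:
            bad.append(row.get("pdb", "?"))
    return bad


# Tiny made-up sample standing in for the downloaded file.
sample = (
    "pdb\tHchain\tLchain\tresolution\tmethod\n"
    "1abc\tH\tL\t1.9\tX-RAY DIFFRACTION\n"
    "2def\tH\tL\t3.1\tELECTRON MICROSCOPY\n"
    "3ghi\tH\tL\tNOT\tSOLUTION NMR\n"
)
rows = load_summary(io.StringIO(sample))
outliers = resolution_outliers(rows)
print(len(rows), outliers)
```

For the real file, open `path_data / f"{fname}_summary.tsv"` and pass the handle to `load_summary`. `pd.read_csv(..., sep="\t")` would load the same table into a DataFrame.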

ProteinFlow

Filter

Discard biounits with sequences shorter than 30 residues, since they are very small and quite flexible.
Keep redundant structures (--not_remove_redundancies), since antibodies with identical amino acid sequences can differ slightly in structure.
Keep proteins with <30% missing residues at the ends and <10% missing residues in the middle.
Discard biounits that contain unnatural amino acids.
Discard biounits that contain unexpected atoms.
Discard biounits with discrepancies between the FASTA and PDB sequences.
Discard biounits that contain chains with more than 10,000 amino acids in total.
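
The length and missing-residue thresholds above can be sketched as a small check on a per-residue mask. This is only an illustration of how I read the rules, not ProteinFlow's implementation. In particular, computing the middle fraction after trimming the missing ends follows the wording of the `--missing_middle_thr` help text, and the function names are hypothetical.

```python
def missing_fractions(mask):
    """Return (ends_fraction, middle_fraction) for a residue mask.

    mask: iterable of booleans, True where the residue has coordinates.
    """
    mask = list(mask)
    n = len(mask)
    if n == 0:
        raise ValueError("empty chain")
    if not any(mask):
        return 1.0, 0.0  # every residue is missing
    first = mask.index(True)
    last = n - 1 - mask[::-1].index(True)
    ends_missing = first + (n - 1 - last)
    middle = mask[first:last + 1]  # the chain with missing ends trimmed off
    return ends_missing / n, middle.count(False) / len(middle)


def passes_filters(mask, min_length=30, missing_ends_thr=0.3, missing_middle_thr=0.1):
    """Apply the length and missing-residue thresholds to one chain."""
    mask = list(mask)
    ends, middle = missing_fractions(mask)
    return sum(mask) >= min_length and ends < missing_ends_thr and middle < missing_middle_thr


good = [False] * 2 + [True] * 36 + [False] * 2    # 10% missing at the ends
gappy = [True] * 10 + [False] * 5 + [True] * 25   # 12.5% missing in the middle
short = [True] * 20                               # fewer than 30 residues
print(passes_filters(good), passes_filters(gappy), passes_filters(short))
```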

Cluster

SAbDab sequences are clustered separately for each of the six complementarity-determining regions (CDRs): H1, H2, H3, L1, L2, and L3. The CDRs are defined by Chothia numbering and clustered with MMseqs2. The minimum sequence identity for MMseqs2 clustering is set to 90%.
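
To make the 90% identity threshold concrete, here is a toy greedy clustering over CDR strings. It is not MMseqs2: it compares only sequences of equal length, position by position, whereas MMseqs2 uses alignments. The sequences and function names are made up for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions; 0.0 if the lengths differ (toy rule)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: dict, min_seq_id: float = 0.9) -> dict:
    """Assign each sequence to the first representative it matches at >= min_seq_id."""
    clusters = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seq, seqs[rep]) >= min_seq_id:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # no match: this sequence starts a new cluster
    return clusters


# Hypothetical H3 loops keyed by made-up chain ids.
h3 = {
    "1abc-H": "ARDYYGSSYW",
    "2def-H": "ARDYYGSSYF",  # 9/10 identical to 1abc-H
    "3ghi-H": "ARGGGGSSYW",  # 7/10 identical to 1abc-H
}
h3_clusters = greedy_cluster(h3)
print(h3_clusters)
```

In the actual pipeline this grouping is produced once per CDR, which is why the log reports six separate MMseqs2 runs.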

Split

The resulting CDR clusters are split into train, valid, and test sets at a ∼80:10:10 ratio in a way that ensures that every PDB file appears in only one subset.
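
One way to picture the constraint that a PDB file never spans two subsets is to merge clusters that share a file into connected groups, then assign whole groups to a split. The sketch below does this with a small union-find and a greedy fill. It is an illustration under my own assumptions, not ProteinFlow's splitting code, and all names and ids are hypothetical.

```python
import random
from collections import defaultdict


def connected_groups(clusters):
    """Merge clusters that share a PDB id into connected groups (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in clusters.values():
        for m in members:
            find(m)
        for m in members[1:]:
            parent[find(m)] = find(members[0])
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


def split_groups(groups, valid=0.1, test=0.1, seed=0):
    """Greedily fill valid and test up to their target sizes; the rest is train."""
    groups = sorted(groups, key=lambda g: sorted(g))  # deterministic starting order
    random.Random(seed).shuffle(groups)
    total = sum(len(g) for g in groups)
    splits = {"train": set(), "valid": set(), "test": set()}
    for g in groups:
        if len(splits["valid"]) + len(g) <= valid * total:
            splits["valid"] |= g
        elif len(splits["test"]) + len(g) <= test * total:
            splits["test"] |= g
        else:
            splits["train"] |= g
    return splits


# Made-up CDR clusters: 2def links the first two clusters into one group.
cdr_clusters = {"H3_a": ["1abc", "2def"], "L1_a": ["2def", "3ghi"], "H3_b": ["4jkl"]}
groups = connected_groups(cdr_clusters)
splits = split_groups(groups, valid=0.25, test=0.25)
print(groups, splits)
```

A greedy fill like this can miss the target ratios when groups are large, which is presumably why the real tool exposes a tolerance on the split ratio.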

In [ ]:

!proteinflow generate --sabdab \
--sabdab_data_path {path_data}/{fname}.zip --tag {fname} \
--resolution_thr 2.5 --not_remove_redundancies \
--min_seq_id 0.9 \
--local_datasets_folder {path_data} \
--valid_split 0.1 --test_split 0.1 \
--split_tolerance 0.05

Log file: /content/gdrive/MyDrive/data/proteinflow_20240520-0899946/log.txt 

Moving files...
Unzipping /content/gdrive/MyDrive/data/20240520-0899946.zip...
100% 5071/5071 [01:55<00:00, 43.75it/s]
Filtering...
100% 1287/1287 [00:15<00:00, 84.22it/s] 
Downloading fasta files...
100% 1287/1287 [00:14<00:00, 90.75it/s]
Filter and process...
100% 2219/2219 [24:05<00:00,  1.54it/s]
<<< Too many missing values in total: 150
<<< Too many missing values in the middle: 120
<<< Incorrect alignment: 34
<<< Too many missing values in the ends: 22
<<< FASTA file not found: 10
<<< Some chains in the PDB do not appear in the fasta file: 8
<<< Unnatural amino acids found: 7
<<< PDB / mmCIF file is too large: 2
Total exceptions: 353
Checking excluded chains similarity...
100% 1869/1869 [00:37<00:00, 49.42it/s] 
Clustering with MMSeqs2 for CDR L1...
100% 1868/1868 [00:08<00:00, 232.49it/s]
100% 1121/1121 [00:00<00:00, 200666.42it/s]
Clustering with MMSeqs2 for CDR L2...
100% 1868/1868 [00:07<00:00, 243.93it/s]
100% 1121/1121 [00:00<00:00, 236283.97it/s]
Clustering with MMSeqs2 for CDR L3...
100% 1868/1868 [00:08<00:00, 210.17it/s]
100% 1121/1121 [00:00<00:00, 125495.51it/s]
Clustering with MMSeqs2 for CDR H1...
100% 1868/1868 [00:07<00:00, 236.49it/s]
100% 1121/1121 [00:00<00:00, 228520.77it/s]
Clustering with MMSeqs2 for CDR H2...
100% 1868/1868 [00:08<00:00, 226.31it/s]
100% 1121/1121 [00:00<00:00, 247476.96it/s]
Clustering with MMSeqs2 for CDR H3...
100% 1868/1868 [00:08<00:00, 212.87it/s]
100% 1121/1121 [00:00<00:00, 205858.79it/s]
/usr/local/lib/python3.10/dist-packages/networkx/convert_matrix.py:687: DeprecationWarning: from_numpy_matrix is deprecated and will be removed in NetworkX 3.0.
Use from_numpy_array instead, e.g. from_numpy_array(A, **kwargs)
  warnings.warn(

Split size:
    Train 79.89%
    Valid 10.04%
    Test 10.07%

Moving files in the train set...
100% 1571/1571 [00:06<00:00, 226.53it/s]
Moving files in the validation set...
100% 147/147 [00:00<00:00, 234.08it/s]
Moving files in the test set...
100% 150/150 [00:01<00:00, 148.44it/s]

In [ ]:

!ls /content/gdrive/MyDrive/data/proteinflow_{fname}/

log.txt  splits_dict  test  train  valid
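
As a quick sanity check on the generated folders, the sketch below counts the files in each split directory. The folder names come from the listing above. The helper name `count_split_files` and the demo file names are made up, and the demo runs on a temporary directory rather than the real dataset.

```python
import tempfile
from pathlib import Path


def count_split_files(dataset_dir):
    """Count regular files in the train, valid and test folders, if present."""
    dataset_dir = Path(dataset_dir)
    counts = {}
    for split in ("train", "valid", "test"):
        folder = dataset_dir / split
        if folder.is_dir():
            counts[split] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[split] = 0
    return counts


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for split, n in (("train", 3), ("valid", 1), ("test", 1)):
        (Path(tmp) / split).mkdir()
        for i in range(n):
            (Path(tmp) / split / f"example_{i}.pickle").touch()
    demo_counts = count_split_files(tmp)
print(demo_counts)
```

On the real data, calling `count_split_files(path_data / f"proteinflow_{fname}")` should give counts comparable to the "Moving files" progress bars in the log.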

In [ ]:

!proteinflow generate --help

Usage: proteinflow generate [OPTIONS]

  Generate a new ProteinFlow dataset

Options:
  --max_chains INTEGER            The maximum number of chains per biounit
  --random_seed INTEGER           The random seed to use for splitting
  --require_ligand                Use this flag to require that the PDB files
                                  contain a ligand
  --foldseek                      Whether to use FoldSeek to cluster the
                                  dataset
  --tanimoto_clustering           Whether to use Tanimoto Clustering instead
                                  of MMSeqs2. Only works if load_ligands is
                                  set to True
  --exclude_chains_without_ligands
                                  Exclude chains without ligands from the
                                  generated dataset
  --load_ligands                  Whether or not to load ligands found in the
                                  pdbs example: data['A']['ligand'][0]['X']
  --exclude_based_on_cdr [L1|L2|L3|H1|H2|H3]
                                  if given and exclude_clusters is true + the
                                  dataset is SAbDab, exclude files based on
                                  only the given CDR clusters
  --exclude_clusters              Exclude clusters that contain chains similar
                                  to chains to exclude
  --exclude_threshold FLOAT       Exclude chains with sequence identity to
                                  exclude_chains above this threshold
  --exclude_chains_file TEXT      Exclude specific chains from the dataset
                                  (path to a file containing the sequences to
                                  exclude, one sequence per line)
  -e, --exclude_chains TEXT       Exclude specific chains from the dataset
                                  ({pdb_id}-{chain_id}, e.g. -e 1a2b-A)
  --require_antigen               Use this flag to require that the SAbDab
                                  files contain an antigen
  --sabdab_data_path TEXT         Path to a zip file or a directory containing
                                  SAbDab files (only used if `sabdab` is
                                  `True`)
  --sabdab                        Use this flag to generate a dataset from
                                  SAbDab files instead of PDB
  --min_seq_id FLOAT              Minimum sequence identity for mmseqs
                                  clustering
  --load_live                     Load the files that are not in the latest
                                  PDB snapshot from the PDB FTP server
                                  (disregarded if pdb_snapshot is not none)
  --pdb_snapshot TEXT             The pdb snapshot folder to load
  --valid_split FLOAT             The fraction of chains to put in the
                                  validation set (default 5%)
  --test_split FLOAT              The fraction of chains to put in the test
                                  set (default 5%)
  --split_tolerance FLOAT         The tolerance on the split ratio (default
                                  20%)
  --force                         When `True`, rewrite the files if they
                                  already exist
  --n INTEGER                     The number of files to process (for
                                  debugging purposes)
  --skip_splitting                Use this flag to skip splitting the data
  --redundancy_thr FLOAT          The threshold upon which sequences are
                                  considered as one and the same (default:
                                  90%)
  --not_remove_redundancies       Unless this flag is used, removes biounits
                                  that are doubles of others sequence wise
  --not_filter_methods            Unless this flag is used, only files
                                  obtained with X-ray or EM will be processed
  --missing_middle_thr FLOAT      The maximum fraction of missing residues in
                                  the middle (after missing ends are
                                  disregarded)
  --missing_ends_thr FLOAT        The maximum fraction of missing residues at
                                  the ends
  --resolution_thr FLOAT          The maximum resolution
  --max_length INTEGER            The maximum number of residues per chain
                                  (set None for no threshold)
  --min_length INTEGER            The minimum number of non-missing residues
                                  per chain
  --local_datasets_folder TEXT    The folder where proteinflow datasets,
                                  temporary files and logs will be stored
  --pdb_id_list_path TEXT         List of pdb ids to download and process
  --tag TEXT                      The name of the dataset
  --help                          Show this message and exit.