In this notebook, I prepare human antibody structures from SAbDab (the Structural Antibody Database) for multimodal pre-training.
Goals
# Import necessary libraries
from pathlib import Path
import os
!pip install proteinflow &> /dev/null
!apt-get install -qq -y mmseqs2 &> /dev/null
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
path = Path("/content/gdrive/")
path_data = Path("/content/gdrive/MyDrive/data")
Mounted at /content/gdrive
import pandas as pd
from slugify import slugify
#!proteinflow generate --help
Download human antibody structures with a resolution of 2.5 Å or better. The resulting structures were all resolved by either X-ray crystallography or cryo-electron microscopy.
# Species: Homo sapiens; resolution: 2.5 Å or better
sabdab_summary_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/'
sabdab_url = 'https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/'
fname = slugify(sabdab_summary_url.split('/')[-2], lowercase=False)
# The search URL must be regenerated for every new download; update sabdab_summary_url above
!wget {sabdab_summary_url} -O {path_data}/{fname}_summary.tsv
--2024-05-20 18:11:19-- https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/20240520_0899946/
Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)... 163.1.32.59
Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)|163.1.32.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1050365 (1.0M) [text/tab-separated-values]
Saving to: ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’
/content/gdrive/MyD 100%[===================>] 1.00M 2.20MB/s in 0.5s
2024-05-20 18:11:21 (2.20 MB/s) - ‘/content/gdrive/MyDrive/data/20240520-0899946_summary.tsv’ saved [1050365/1050365]
# The archive URL must be regenerated for every new download; update sabdab_url above
!wget {sabdab_url} -O {path_data}/{fname}.zip
--2024-05-20 18:01:35-- https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/archive/20240520_0899946/
Resolving opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)... 163.1.32.59
Connecting to opig.stats.ox.ac.uk (opig.stats.ox.ac.uk)|163.1.32.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4002463609 (3.7G) [application/zip]
Saving to: ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’
/content/gdrive/MyD 100%[===================>] 3.73G 30.5MB/s in 3m 23s
2024-05-20 18:05:24 (18.8 MB/s) - ‘/content/gdrive/MyDrive/data/20240520_0899946.zip’ saved [4002463609/4002463609]
!ls {path_data}
20240520-0899946_summary.tsv 20240520-0899946.zip
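As a sanity check on the downloaded summary, the resolution cutoff can be applied by hand before handing the archive to ProteinFlow. The sketch below is only an illustration: it assumes the summary TSV has `pdb` and `resolution` columns, the helper name `filter_by_resolution` is hypothetical, and the sample rows are made up rather than taken from SAbDab.

```python
import csv
import io

def filter_by_resolution(tsv_text, max_resolution=2.5):
    """Return the set of PDB IDs whose resolution is at or below the cutoff.

    Assumes (not verified here) that the table has 'pdb' and 'resolution' columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = set()
    for row in reader:
        try:
            resolution = float(row["resolution"])
        except (KeyError, ValueError, TypeError):
            # Missing or non-numeric resolution: skip the row.
            continue
        if resolution <= max_resolution:
            kept.add(row["pdb"])
    return kept

# Tiny made-up table in the assumed layout; these are not real SAbDab entries.
sample_tsv = (
    "pdb\tHchain\tLchain\tresolution\tmethod\n"
    "aaaa\tH\tL\t1.9\tX-RAY DIFFRACTION\n"
    "bbbb\tH\tL\t3.2\tELECTRON MICROSCOPY\n"
    "cccc\tA\tB\t2.5\tX-RAY DIFFRACTION\n"
    "dddd\tH\tL\tNOT\tSOLUTION NMR\n"
)
print(sorted(filter_by_resolution(sample_tsv)))
```

On the real file, the text could be read with `(path_data / f"{fname}_summary.tsv").read_text()`, provided the column names match the assumption above.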
Filter
Cluster
SAbDab sequences are clustered with MMseqs2 across all six complementarity-determining regions (CDRs): H1, H2, H3, L1, L2, and L3, as defined by Chothia numbering. The minimum sequence identity for MMseqs2 clustering is set to 90%.
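To make the 90% identity threshold concrete, here is a toy greedy clustering over short CDR-like strings. It is deliberately simplified and is not how MMseqs2 works internally: identity is computed position by position without alignment, and the function names and sequences are invented for illustration.

```python
def sequence_identity(a, b):
    """Fraction of matching positions relative to the longer sequence (no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(sequences, min_seq_id=0.9):
    """Put each sequence in the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first member is the representative
    for seq in sequences:
        for cluster in clusters:
            if sequence_identity(seq, cluster[0]) >= min_seq_id:
                cluster.append(seq)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters.append([seq])
    return clusters

# Made-up 10-residue sequences; not taken from SAbDab.
cdrs = ["ARDYYGSSYW", "ARDYYGSSYF", "ARGGWELLDY", "ARDYYGSAYF"]
for cluster in greedy_cluster(cdrs):
    print(cluster)
```

With a threshold of 0.9, the first two sequences (one mismatch in ten positions) share a cluster, while the fourth (two mismatches against the first) starts its own.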
Split
The resulting CDR clusters are split into train, validation, and test sets at a ratio of approximately 80:10:10, in a way that ensures every PDB file appears in only one subset.
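The idea of a cluster-level split can be sketched as follows. This is a minimal illustration, not ProteinFlow's actual splitting procedure: it assumes each PDB ID belongs to exactly one cluster, and the function name `split_clusters` and the toy data are hypothetical.

```python
import random

def split_clusters(clusters, valid_split=0.1, test_split=0.1, seed=0):
    """Assign whole clusters to train/valid/test so that no cluster is divided.

    `clusters` maps a cluster ID to the list of PDB IDs it contains.
    """
    total = sum(len(members) for members in clusters.values())
    targets = {"valid": valid_split * total, "test": test_split * total}
    splits = {"train": [], "valid": [], "test": []}

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    for cid in cluster_ids:
        members = clusters[cid]
        # Fill valid, then test, while the cluster still fits under the target;
        # every remaining cluster goes to train.
        for name in ("valid", "test"):
            if len(splits[name]) + len(members) <= targets[name]:
                splits[name].extend(members)
                break
        else:
            splits["train"].extend(members)
    return splits

# 20 made-up clusters of 5 entries each (100 entries in total).
toy_clusters = {f"c{i}": [f"pdb{i}_{j}" for j in range(5)] for i in range(20)}
splits = split_clusters(toy_clusters)
print({name: len(ids) for name, ids in splits.items()})
```

Because clusters are moved as units, the achieved ratio can only approximate the requested one, which is presumably why the command exposes a `--split_tolerance` option.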
!proteinflow generate --sabdab \
--sabdab_data_path {path_data}/{fname}.zip --tag {fname} \
--resolution_thr 2.5 --not_remove_redundancies \
--min_seq_id 0.9 \
--local_datasets_folder {path_data} \
--valid_split 0.1 --test_split 0.1 \
--split_tolerance 0.05
Log file: /content/gdrive/MyDrive/data/proteinflow_20240520-0899946/log.txt
Moving files...
Unzipping /content/gdrive/MyDrive/data/20240520-0899946.zip...
100% 5071/5071 [01:55<00:00, 43.75it/s]
Filtering...
100% 1287/1287 [00:15<00:00, 84.22it/s]
Downloading fasta files...
100% 1287/1287 [00:14<00:00, 90.75it/s]
Filter and process...
100% 2219/2219 [24:05<00:00, 1.54it/s]
<<< Too many missing values in total: 150
<<< Too many missing values in the middle: 120
<<< Incorrect alignment: 34
<<< Too many missing values in the ends: 22
<<< FASTA file not found: 10
<<< Some chains in the PDB do not appear in the fasta file: 8
<<< Unnatural amino acids found: 7
<<< PDB / mmCIF file is too large: 2
Total exceptions: 353
Checking excluded chains similarity...
100% 1869/1869 [00:37<00:00, 49.42it/s]
Clustering with MMSeqs2 for CDR L1...
100% 1868/1868 [00:08<00:00, 232.49it/s]
100% 1121/1121 [00:00<00:00, 200666.42it/s]
Clustering with MMSeqs2 for CDR L2...
100% 1868/1868 [00:07<00:00, 243.93it/s]
100% 1121/1121 [00:00<00:00, 236283.97it/s]
Clustering with MMSeqs2 for CDR L3...
100% 1868/1868 [00:08<00:00, 210.17it/s]
100% 1121/1121 [00:00<00:00, 125495.51it/s]
Clustering with MMSeqs2 for CDR H1...
100% 1868/1868 [00:07<00:00, 236.49it/s]
100% 1121/1121 [00:00<00:00, 228520.77it/s]
Clustering with MMSeqs2 for CDR H2...
100% 1868/1868 [00:08<00:00, 226.31it/s]
100% 1121/1121 [00:00<00:00, 247476.96it/s]
Clustering with MMSeqs2 for CDR H3...
100% 1868/1868 [00:08<00:00, 212.87it/s]
100% 1121/1121 [00:00<00:00, 205858.79it/s]
/usr/local/lib/python3.10/dist-packages/networkx/convert_matrix.py:687: DeprecationWarning: from_numpy_matrix is deprecated and will be removed in NetworkX 3.0. Use from_numpy_array instead, e.g. from_numpy_array(A, **kwargs)
warnings.warn(
Split size: Train 79.89% Valid 10.04% Test 10.07%
Moving files in the train set...
100% 1571/1571 [00:06<00:00, 226.53it/s]
Moving files in the validation set...
100% 147/147 [00:00<00:00, 234.08it/s]
Moving files in the test set...
100% 150/150 [00:01<00:00, 148.44it/s]
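The per-reason exception counts printed above can be tallied with a short parser, which is handy for checking that they add up to the reported total. This is a sketch that works on the printed text only; the helper name `parse_exception_counts` is hypothetical, and the layout of `log.txt` itself is not assumed to match.

```python
import re

def parse_exception_counts(log_text):
    """Extract {reason: count} from '<<< reason: count' entries in the printed output."""
    pattern = re.compile(r"<<< (.+?): (\d+)")
    return {reason: int(count) for reason, count in pattern.findall(log_text)}

# Excerpt copied from the output of the generate command above.
log_excerpt = (
    "<<< Too many missing values in total: 150 "
    "<<< Too many missing values in the middle: 120 "
    "<<< Incorrect alignment: 34 "
    "<<< Too many missing values in the ends: 22 "
    "<<< FASTA file not found: 10 "
    "<<< Some chains in the PDB do not appear in the fasta file: 8 "
    "<<< Unnatural amino acids found: 7 "
    "<<< PDB / mmCIF file is too large: 2 "
    "Total exceptions: 353"
)
counts = parse_exception_counts(log_excerpt)
print(len(counts), sum(counts.values()))
```

The eight categories sum to 353, matching the "Total exceptions" line, with missing residues accounting for most of the rejected entries.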
!ls /content/gdrive/MyDrive/data/proteinflow_{fname}/
log.txt splits_dict test train valid
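A quick way to inspect the generated dataset is to count the files in each split folder and convert the counts to fractions. The sketch below relies only on the `train`, `valid`, and `test` folder names listed above; the helper names are hypothetical, and the example uses the file counts printed by the "Moving files" steps rather than reading the drive.

```python
from pathlib import Path

def count_split_files(dataset_dir, splits=("train", "valid", "test")):
    """Count regular files in each split subfolder (0 if a folder is missing)."""
    dataset_dir = Path(dataset_dir)
    counts = {}
    for name in splits:
        folder = dataset_dir / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[name] = 0
    return counts

def split_fractions(counts):
    """Convert per-split counts to fractions of the total."""
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}

# File counts reported by the "Moving files" steps of the generate command.
logged = {"train": 1571, "valid": 147, "test": 150}
print({name: round(frac, 3) for name, frac in split_fractions(logged).items()})
```

On the mounted drive, `count_split_files(path_data / f"proteinflow_{fname}")` would give the same kind of dictionary. Note that by file count the split is roughly 84%, 8%, and 8%, which differs from the reported 79.89%, 10.04%, and 10.07%; the help text describes the split options as fractions of chains, so the two figures are presumably computed on different units.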
!proteinflow generate --help
Usage: proteinflow generate [OPTIONS]

  Generate a new ProteinFlow dataset

Options:
  --max_chains INTEGER  The maximum number of chains per biounit
  --random_seed INTEGER  The random seed to use for splitting
  --require_ligand  Use this flag to require that the PDB files contain a ligand
  --foldseek  Whether to use FoldSeek to cluster the dataset
  --tanimoto_clustering  Whether to use Tanimoto Clustering instead of MMSeqs2. Only works if load_ligands is set to True
  --exclude_chains_without_ligands  Exclude chains without ligands from the generated dataset
  --load_ligands  Whether or not to load ligands found in the pdbs example: data['A']['ligand'][0]['X']
  --exclude_based_on_cdr [L1|L2|L3|H1|H2|H3]  if given and exclude_clusters is true + the dataset is SAbDab, exclude files based on only the given CDR clusters
  --exclude_clusters  Exclude clusters that contain chains similar to chains to exclude
  --exclude_threshold FLOAT  Exclude chains with sequence identity to exclude_chains above this threshold
  --exclude_chains_file TEXT  Exclude specific chains from the dataset (path to a file containing the sequences to exclude, one sequence per line)
  -e, --exclude_chains TEXT  Exclude specific chains from the dataset ({pdb_id}-{chain_id}, e.g. -e 1a2b-A)
  --require_antigen  Use this flag to require that the SAbDab files contain an antigen
  --sabdab_data_path TEXT  Path to a zip file or a directory containing SAbDab files (only used if `sabdab` is `True`)
  --sabdab  Use this flag to generate a dataset from SAbDab files instead of PDB
  --min_seq_id FLOAT  Minimum sequence identity for mmseqs clustering
  --load_live  Load the files that are not in the latest PDB snapshot from the PDB FTP server (disregarded if pdb_snapshot is not none)
  --pdb_snapshot TEXT  The pdb snapshot folder to load
  --valid_split FLOAT  The fraction of chains to put in the validation set (default 5%)
  --test_split FLOAT  The fraction of chains to put in the test set (default 5%)
  --split_tolerance FLOAT  The tolerance on the split ratio (default 20%)
  --force  When `True`, rewrite the files if they already exist
  --n INTEGER  The number of files to process (for debugging purposes)
  --skip_splitting  Use this flag to skip splitting the data
  --redundancy_thr FLOAT  The threshold upon which sequences are considered as one and the same (default: 90%)
  --not_remove_redundancies  Unless this flag is used, removes biounits that are doubles of others sequence wise
  --not_filter_methods  Unless this flag is used, only files obtained with X-ray or EM will be processed
  --missing_middle_thr FLOAT  The maximum fraction of missing residues in the middle (after missing ends are disregarded)
  --missing_ends_thr FLOAT  The maximum fraction of missing residues at the ends
  --resolution_thr FLOAT  The maximum resolution
  --max_length INTEGER  The maximum number of residues per chain (set None for no threshold)
  --min_length INTEGER  The minimum number of non-missing residues per chain
  --local_datasets_folder TEXT  The folder where proteinflow datasets, temporary files and logs will be stored
  --pdb_id_list_path TEXT  List of pdb ids to download and process
  --tag TEXT  The name of the dataset
  --help  Show this message and exit.