If you're opening this Notebook on colab, you will probably need to install 🤗 Transformers as well as some other libraries. Uncomment the following cell and run it.
#! pip install transformers evaluate datasets requests pandas scikit-learn
If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.
To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.
First you have to get an access token from the Hugging Face website (sign up here if you haven't already!), then execute the following cell and paste in your token when prompted:
from huggingface_hub import notebook_login
notebook_login()
Then you need to install Git-LFS. Uncomment the following instructions:
# !apt install git-lfs
We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.
from transformers.utils import send_example_telemetry
send_example_telemetry("protein_language_modeling_notebook", framework="tensorflow")
In this notebook, we're going to do some transfer learning to fine-tune some large, pre-trained protein language models on tasks of interest. If that sentence feels a bit intimidating to you, don't panic - there's a blog post that explains the concepts here in much more detail.
The specific model we're going to use is ESM-2, which is the state-of-the-art protein language model at the time of writing (November 2022). The citation for this model is Lin et al, 2022.
There are several ESM-2 checkpoints with differing model sizes. Larger models will generally have better accuracy, but they require more GPU memory and will take much longer to train. The available ESM-2 checkpoints (at time of writing) are:
Checkpoint name | Num layers | Num parameters |
---|---|---|
esm2_t48_15B_UR50D | 48 | 15B |
esm2_t36_3B_UR50D | 36 | 3B |
esm2_t33_650M_UR50D | 33 | 650M |
esm2_t30_150M_UR50D | 30 | 150M |
esm2_t12_35M_UR50D | 12 | 35M |
esm2_t6_8M_UR50D | 6 | 8M |
Note that the larger checkpoints may be very difficult to train without a large cloud GPU like an A100 or H100, and the largest 15B parameter checkpoint will probably be impossible to train on any single GPU! Also, note that memory usage for attention during training will scale as O(batch_size * num_layers * seq_len^2), so larger models on long sequences will use quite a lot of memory! We will use the esm2_t12_35M_UR50D checkpoint for this notebook, which should train on any Colab instance or modern GPU.
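To get a rough feel for that scaling, here's a quick back-of-the-envelope sketch. It only counts the attention maps themselves and assumes float32 storage, and the head count used here for the 35M checkpoint is an assumption, so treat the result as an order-of-magnitude estimate rather than a real memory profile.
def rough_attention_memory_gb(batch_size, num_layers, num_heads, seq_len, bytes_per_value=4):
    # Each sample stores one (seq_len x seq_len) attention map per head, per layer,
    # so the number of stored values scales as batch_size * num_layers * seq_len^2
    n_values = batch_size * num_layers * num_heads * seq_len ** 2
    return n_values * bytes_per_value / 1e9

# Illustrative numbers: a batch of 8 sequences of length 500 with a 12-layer model
print(rough_attention_memory_gb(batch_size=8, num_layers=12, num_heads=20, seq_len=500))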
model_checkpoint = "facebook/esm2_t12_35M_UR50D"
One of the most common tasks you can perform with a language model is sequence classification. In sequence classification, we classify an entire protein into a category, from a list of two or more possibilities. There's no limit on the number of categories you can use, or the specific problem you choose, as long as it's something the model could in theory infer from the raw protein sequence. To keep things simple for this example, though, let's try classifying proteins by their cellular localization - given their sequence, can we predict if they're going to be found in the cytosol (the fluid inside the cell) or embedded in the cell membrane?
In this section, we're going to gather some training data from UniProt. Our goal is to create a pair of lists: sequences and labels. sequences will be a list of protein sequences, which will just be strings like "MNKL...", where each letter represents a single amino acid in the complete protein. labels will be a list of the category for each sequence. The categories will just be integers, with 0 representing the first category, 1 representing the second and so on. In other words, if sequences[i] is a protein sequence then labels[i] should be its corresponding category. These will form the training data we're going to use to teach the model the task we want it to do.
If you're adapting this notebook for your own use, this will probably be the main section you want to change! You can do whatever you want here, as long as you create those two lists by the end of it - the sketch just below shows one hypothetical way to do that from your own data. If you want to follow along with this example, though, we'll need to import requests and set up our query to UniProt.
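Here's that hypothetical sketch for adapting the notebook to your own data: it assumes a local CSV file with made-up columns called "sequence" and "location", which are not part of this tutorial's dataset, and simply maps each category name to an integer.
# Hypothetical alternative to the UniProt query below - the file path and column names are placeholders
import pandas as pd

my_df = pd.read_csv("my_protein_data.csv")
category_to_int = {name: i for i, name in enumerate(sorted(my_df["location"].unique()))}
sequences = my_df["sequence"].tolist()
labels = [category_to_int[location] for location in my_df["location"]]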
import requests
query_url ="https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Csequence%2Ccc_subcellular_location&format=tsv&query=%28%28organism_id%3A9606%29%20AND%20%28reviewed%3Atrue%29%20AND%20%28length%3A%5B80%20TO%20500%5D%29%29"
This query URL might seem mysterious, but it isn't! To get it, we searched for (organism_id:9606) AND (reviewed:true) AND (length:[80 TO 500]) on UniProt to get a list of reasonably-sized human proteins, then selected 'Download' and set the format to TSV and the columns to Sequence and Subcellular location [CC], since those contain the data we care about for this task. Once that's done, selecting Generate URL for API gives you a URL you can pass to Requests. Alternatively, if you're not on Colab you can just download the data through the web interface and open the file locally.
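As an aside, you don't have to paste the long encoded URL at all - you can let requests assemble it from its components. The sketch below should produce a request equivalent to the cell that follows, since it contains the same query, fields and format parameters.
# Equivalent request built from readable components; requests handles the URL encoding
params = {
    "query": "((organism_id:9606) AND (reviewed:true) AND (length:[80 TO 500]))",
    "fields": "accession,sequence,cc_subcellular_location",
    "format": "tsv",
    "compressed": "true",
}
equivalent_request = requests.get("https://rest.uniprot.org/uniprotkb/stream", params=params)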
uniprot_request = requests.get(query_url)
To get this data into Pandas, we use a BytesIO object, which Pandas will treat like a file. If you downloaded the data as a file you can skip this bit and just pass the filepath directly to read_csv.
from io import BytesIO
import pandas
bio = BytesIO(uniprot_request.content)
df = pandas.read_csv(bio, compression='gzip', sep='\t')
df
Entry | Sequence | Subcellular location [CC] | |
---|---|---|---|
0 | A0A0K2S4Q6 | MTQRAGAAMLPSALLLLCVPGCLTVSGPSTVMGAVGESLSVQCRYE... | SUBCELLULAR LOCATION: [Isoform 1]: Membrane {E... |
1 | A0A5B9 | DLKNVFPPKVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWV... | SUBCELLULAR LOCATION: Cell membrane {ECO:00003... |
2 | A0AVI4 | MDSPEVTFTLAYLVFAVCFVFTPNEFHAAGLTVQNLLSGWLGSEDA... | SUBCELLULAR LOCATION: Endoplasmic reticulum me... |
3 | A0JLT2 | MENFTALFGAQADPPPPPTALGFGPGKPPPPPPPPAGGGPGTAPPP... | SUBCELLULAR LOCATION: Nucleus {ECO:0000305}. |
4 | A0M8Q6 | GQPKAAPSVTLFPPSSEELQANKATLVCLVSDFNPGAVTVAWKADG... | SUBCELLULAR LOCATION: Secreted {ECO:0000303|Pu... |
... | ... | ... | ... |
11977 | Q9NZ38 | MAFPGQSDTKMQWPEVPALPLLSSLCMAMVRKSSALGKEVGRRSEG... | NaN |
11978 | Q9UFV3 | MAETYRRSRQHEQLPGQRHMDLLTGYSKLIQSRLKLLLHLGSQPPV... | NaN |
11979 | Q9Y6C7 | MAHHSLNTFYIWHNNVLHTHLVFFLPHLLNQPFSRGSFLIWLLLCW... | NaN |
11980 | X6R8D5 | MGRKEHESPSQPHMCGWEDSQKPSVPSHGPKTPSCKGVKAPHSSRP... | NaN |
11981 | X6R8R1 | MGVVLSPHPAPSRREPLAPLAPGTRPGWSPAVSGSSRSALRPSTAG... | NaN |
11982 rows × 3 columns
Nice! Now we have some proteins and their subcellular locations. Let's start filtering this down. First, let's ditch the rows without subcellular location information.
df = df.dropna() # Drop proteins with missing subcellular location values
Now we'll make one dataframe of proteins whose subcellular localization column contains Cytosol or Cytoplasm, and a second of proteins that mention Membrane or Cell membrane. To avoid overlap between the two, each dataframe only keeps proteins that don't also match the other search term.
cytosolic = df['Subcellular location [CC]'].str.contains("Cytosol") | df['Subcellular location [CC]'].str.contains("Cytoplasm")
membrane = df['Subcellular location [CC]'].str.contains("Membrane") | df['Subcellular location [CC]'].str.contains("Cell membrane")
cytosolic_df = df[cytosolic & ~membrane]
cytosolic_df
Entry | Sequence | Subcellular location [CC] | |
---|---|---|---|
10 | A1E959 | MKIIILLGFLGATLSAPLIPQRLMSASNSNELLLNLNNGQLLPLQL... | SUBCELLULAR LOCATION: Secreted {ECO:0000250|Un... |
15 | A1XBS5 | MMRRTLENRNAQTKQLQTAVSNVEKHFGELCQIFAAYVRKTARLRD... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... |
19 | A2RU49 | MSSGNYQQSEALSKPTFSEEQASALVESVFGLKVSKVRPLPSYDDQ... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000305}. |
21 | A2RUH7 | MEAATAPEVAAGSKLKVKEASPADAEPPQASPGQGAGSPTPQLLPP... | SUBCELLULAR LOCATION: Cytoplasm, myofibril, sa... |
22 | A4D126 | MEAGPPGSARPAEPGPCLSGQRGADHTASASLQSVAGTEPGRHPQA... | SUBCELLULAR LOCATION: Cytoplasm, cytosol {ECO:... |
... | ... | ... | ... |
11555 | Q96L03 | MATLARLQARSSTVGNQYYFRNSVVDPFRKKENDAAVKIQSWFRGC... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}. |
11597 | Q9BYD9 | MNHCQLPVVIDNGSGMIKAGVAGCREPQFIYPNIIGRAKGQSRAAQ... | SUBCELLULAR LOCATION: Cytoplasm, cytoskeleton ... |
11639 | Q9NPB0 | MEQRLAEFRAARKRAGLAAQPPAASQGAQTPGEKAEAAATLKAAPG... | SUBCELLULAR LOCATION: Cytoplasmic vesicle memb... |
11652 | Q9NUJ7 | MGGQVSASNSFSRLHCRNANEDWMSALCPRLWDVPLHHLSIPGSHD... | SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|P... |
11662 | Q9P2W6 | MGRTWCGMWRRRRPGRRSAVPRWPHLSSQSGVEPPDRWTGTPGWPS... | SUBCELLULAR LOCATION: Cytoplasm. |
2495 rows × 3 columns
membrane_df = df[membrane & ~cytosolic]
membrane_df
Entry | Sequence | Subcellular location [CC] | |
---|---|---|---|
0 | A0A0K2S4Q6 | MTQRAGAAMLPSALLLLCVPGCLTVSGPSTVMGAVGESLSVQCRYE... | SUBCELLULAR LOCATION: [Isoform 1]: Membrane {E... |
1 | A0A5B9 | DLKNVFPPKVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWV... | SUBCELLULAR LOCATION: Cell membrane {ECO:00003... |
4 | A0M8Q6 | GQPKAAPSVTLFPPSSEELQANKATLVCLVSDFNPGAVTVAWKADG... | SUBCELLULAR LOCATION: Secreted {ECO:0000303|Pu... |
18 | A2RU14 | MAGTVLGVGAGVFILALLWVAVLLLCVLLSRASGAARFSVIFLFFG... | SUBCELLULAR LOCATION: Membrane {ECO:0000305}; ... |
35 | A5X5Y0 | MEGSWFHRKRFSFYLLLGFLLQGRGVTFTINCSGFGQHGADPTALN... | SUBCELLULAR LOCATION: Cell membrane {ECO:00002... |
... | ... | ... | ... |
11843 | Q6UWF5 | MQIQNNLFFCCYTVMSAIFKWLLLYSLPALCFLLGTQESESFHSKA... | SUBCELLULAR LOCATION: Membrane {ECO:0000305}; ... |
11917 | Q8N8V8 | MLLKVRRASLKPPATPHQGAFRAGNVIGQLIYLLTWSLFTAWLRPP... | SUBCELLULAR LOCATION: Membrane {ECO:0000305}; ... |
11958 | Q96N68 | MQGQGALKESHIHLPTEQPEASLVLQGQLAESSALGPKGALRPQAQ... | SUBCELLULAR LOCATION: Membrane {ECO:0000305}; ... |
11965 | Q9H0A3 | MMNNTDFLMLNNPWNKLCLVSMDFCFPLDFVSNLFWIFASKFIIVT... | SUBCELLULAR LOCATION: Membrane {ECO:0000255}; ... |
11968 | Q9H354 | MNKHNLRLVQLASELILIEIIPKLFLSQVTTISHIKREKIPPNHRK... | SUBCELLULAR LOCATION: Membrane {ECO:0000305}; ... |
2579 rows × 3 columns
We're almost done! Now, let's make a list of sequences from each df and generate the associated labels. We'll use 0 as the label for cytosolic proteins and 1 as the label for membrane proteins.
cytosolic_sequences = cytosolic_df["Sequence"].tolist()
cytosolic_labels = [0 for protein in cytosolic_sequences]
membrane_sequences = membrane_df["Sequence"].tolist()
membrane_labels = [1 for protein in membrane_sequences]
Now we can concatenate these lists together to get the sequences and labels lists that will form our final training data. Don't worry - they'll get shuffled during training!
sequences = cytosolic_sequences + membrane_sequences
labels = cytosolic_labels + membrane_labels
# Quick check to make sure we got it right
len(sequences) == len(labels)
True
Phew!
Since the data we're loading isn't prepared for us as a machine learning dataset, we'll have to split the data into train and test sets ourselves! We can use sklearn's function for that:
from sklearn.model_selection import train_test_split
train_sequences, test_sequences, train_labels, test_labels = train_test_split(sequences, labels, test_size=0.25, shuffle=True)
All inputs to neural nets must be numerical. The process of converting strings into numerical indices suitable for a neural net is called tokenization. For natural language this can be quite complex, as usually the network's vocabulary will not contain every possible word, which means the tokenizer must handle splitting rarer words into pieces, as well as all the complexities of capitalization and unicode characters and so on.
With proteins, however, things are very easy. In protein language models, each amino acid is converted to a single token. Every model on transformers comes with an associated tokenizer that handles tokenization for it, and protein language models are no different. Let's get our tokenizer!
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
2022-11-15 18:12:47.607429: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2022-11-15 18:12:47.713337: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2022-11-15 18:12:48.149628: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory 2022-11-15 18:12:48.149671: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory 2022-11-15 18:12:48.149676: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Let's try a single sequence to see what the outputs from our tokenizer look like:
tokenizer(train_sequences[0])
{'input_ids': [0, 20, 5, 6, 10, 6, 18, 8, 22, 6, 14, 6, 21, 4, 17, 9, 13, 17, 5, 10, 18, 4, 4, 4, 5, 5, 4, 12, 7, 4, 19, 4, 4, 6, 6, 5, 5, 7, 18, 8, 5, 4, 9, 4, 5, 21, 9, 10, 16, 5, 15, 16, 10, 22, 9, 9, 10, 4, 5, 17, 18, 8, 10, 6, 21, 17, 4, 8, 10, 13, 9, 4, 10, 6, 18, 4, 10, 21, 19, 9, 9, 5, 11, 10, 5, 6, 12, 10, 7, 13, 17, 7, 10, 14, 10, 22, 13, 18, 11, 6, 5, 18, 19, 18, 7, 6, 11, 7, 7, 8, 11, 12, 6, 18, 6, 20, 11, 11, 14, 5, 11, 7, 6, 6, 15, 12, 18, 4, 12, 18, 19, 6, 4, 7, 6, 23, 8, 8, 11, 12, 4, 18, 18, 17, 4, 18, 4, 9, 10, 4, 12, 11, 12, 12, 5, 19, 12, 20, 15, 8, 23, 21, 16, 10, 16, 4, 10, 10, 10, 6, 5, 4, 14, 16, 9, 8, 4, 15, 13, 5, 6, 16, 23, 9, 7, 13, 8, 4, 5, 6, 22, 15, 14, 8, 7, 19, 19, 7, 20, 4, 12, 4, 23, 11, 5, 8, 12, 4, 12, 8, 23, 23, 5, 8, 5, 20, 19, 11, 14, 12, 9, 6, 22, 8, 19, 18, 13, 8, 4, 19, 18, 23, 18, 7, 5, 18, 8, 11, 12, 6, 18, 6, 13, 4, 7, 8, 8, 16, 17, 5, 21, 19, 9, 8, 16, 6, 4, 19, 10, 18, 5, 17, 18, 7, 18, 12, 4, 20, 6, 7, 23, 23, 12, 19, 8, 4, 18, 17, 7, 12, 8, 12, 4, 12, 15, 16, 8, 4, 17, 22, 12, 4, 10, 15, 20, 13, 8, 6, 23, 23, 14, 16, 23, 16, 10, 6, 4, 4, 10, 8, 10, 10, 17, 7, 7, 20, 14, 6, 8, 7, 10, 17, 10, 23, 17, 12, 8, 12, 9, 11, 13, 6, 7, 5, 9, 8, 13, 11, 13, 6, 10, 10, 4, 8, 6, 9, 20, 12, 8, 20, 15, 13, 4, 4, 5, 5, 17, 15, 5, 8, 4, 5, 12, 4, 16, 15, 16, 4, 8, 9, 20, 5, 17, 6, 23, 14, 21, 16, 11, 8, 11, 4, 5, 10, 13, 17, 9, 18, 8, 6, 6, 7, 6, 5, 18, 5, 12, 20, 17, 17, 10, 4, 5, 9, 11, 8, 6, 13, 10, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This looks good! We can see that our sequence has been converted into input_ids, which is the tokenized sequence, and an attention_mask. The attention mask handles the case when we have sequences of variable length - in those cases, the shorter sequences are padded with blank "padding" tokens, and the attention mask is padded with 0s to indicate that those tokens should be ignored by the model.
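To see the attention mask in action, you can tokenize two sequences of different lengths together with padding enabled. This quick illustration isn't needed for the rest of the notebook - it just shows the shorter sequence being padded and its extra positions masked out with 0s.
# Two toy sequences of different lengths: the second gets padding tokens,
# and its attention mask contains 0s at the padded positions
batch = tokenizer(["MKTAYIAKQR", "MKT"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])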
So now, let's tokenize our whole dataset. Note that we don't need to do anything with the labels, as they're already in the format we need.
train_tokenized = tokenizer(train_sequences)
test_tokenized = tokenizer(test_sequences)
Now we want to turn this data into a dataset that Keras can load samples from. We can use the HuggingFace Dataset class for this, which has convenience functions to wrap itself with a tf.data.Dataset, although there are a number of different approaches that you can take at this stage.
from datasets import Dataset
train_dataset = Dataset.from_dict(train_tokenized)
test_dataset = Dataset.from_dict(test_tokenized)
train_dataset
Dataset({ features: ['input_ids', 'attention_mask'], num_rows: 3805 })
This looks good, but we're missing our labels! Let's add those on as an extra column to the datasets.
train_dataset = train_dataset.add_column("labels", train_labels)
test_dataset = test_dataset.add_column("labels", test_labels)
train_dataset
Dataset({ features: ['input_ids', 'attention_mask', 'labels'], num_rows: 3805 })
Looks good! We're ready for training.
Next, we want to load our model. Make sure to use exactly the same model as you used when loading the tokenizer, or your model might not understand the tokenization scheme you're using!
from transformers import TFAutoModelForSequenceClassification
num_labels = max(train_labels + test_labels) + 1 # Add 1 since 0 can be a label
print("Num labels:", num_labels)
model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
Num labels: 2
2022-11-15 18:13:01.430665: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:01.437052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:01.437302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:01.437840: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2022-11-15 18:13:01.441035: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:01.441276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:01.441481: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:08.787846: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:08.788109: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:08.788318: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2022-11-15 18:13:08.788485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21763 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:21:00.0, compute capability: 8.6 2022-11-15 18:13:19.921968: I tensorflow/stream_executor/cuda/cuda_blas.cc:1614] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once. Some layers from the model checkpoint at facebook/esm2_t12_35M_UR50D were not used when initializing TFEsmForSequenceClassification: ['lm_head'] - This IS expected if you are initializing TFEsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing TFEsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). 
Some layers of TFEsmForSequenceClassification were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['classifier'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
These warnings are telling us that the model is discarding some weights that it used for language modelling (the lm_head) and adding some weights for sequence classification (the classifier). This is exactly what we expect when we want to fine-tune a language model on a sequence classification task!
Next, let's prepare our tf.data.Dataset. This Dataset will stream samples from our HuggingFace Dataset in a way that Keras natively understands - once we've created it, we can pass it straight to model.fit()!
tf_train_set = model.prepare_tf_dataset(
train_dataset,
batch_size=8,
shuffle=True,
tokenizer=tokenizer
)
tf_test_set = model.prepare_tf_dataset(
test_dataset,
batch_size=8,
shuffle=False,
tokenizer=tokenizer
)
You might wonder why we pass along the tokenizer when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The tokenizer has a pad() method that will do all of this right for us, and prepare_tf_dataset will use it.
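If you're curious, you can call pad() yourself on a couple of the samples we just tokenized to see roughly what prepare_tf_dataset does with them behind the scenes - this is only an illustration and isn't needed for training.
# Pad two samples of different lengths to a common length
samples = [
    {"input_ids": train_tokenized["input_ids"][0], "attention_mask": train_tokenized["attention_mask"][0]},
    {"input_ids": train_tokenized["input_ids"][1], "attention_mask": train_tokenized["attention_mask"][1]},
]
padded = tokenizer.pad(samples, return_tensors="np")
print(padded["input_ids"].shape)  # Both sequences are now padded to the length of the longer one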
Now all we need to do is compile our model. We use the AdamWeightDecay optimizer, which usually performs a little better than the base Adam optimizer.
from transformers import AdamWeightDecay
model.compile(optimizer=AdamWeightDecay(2e-5), metrics=["accuracy"])
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
And now we can fit our model!
model.fit(tf_train_set, validation_data=tf_test_set, epochs=3)
Epoch 1/3 475/475 [==============================] - 98s 180ms/step - loss: 0.2361 - accuracy: 0.9250 - val_loss: 0.1592 - val_accuracy: 0.9504 Epoch 2/3 475/475 [==============================] - 84s 176ms/step - loss: 0.1393 - accuracy: 0.9534 - val_loss: 0.1941 - val_accuracy: 0.9417 Epoch 3/3 475/475 [==============================] - 83s 174ms/step - loss: 0.0987 - accuracy: 0.9647 - val_loss: 0.1547 - val_accuracy: 0.9504
<keras.callbacks.History at 0x7f60a011e590>
Nice! After three epochs we have a validation accuracy of ~95%. Note that we didn't do a lot of work to filter the training data or tune hyperparameters for this experiment, and also that we used one of the smallest ESM-2 models. With a larger starting model and more effort to ensure that the training data categories were cleanly separable, accuracy could almost certainly go a lot higher!
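If you'd like to sanity-check the model on a single sequence before sharing it, a quick sketch like the following should work with the model and tokenizer we already have in memory. Here we simply use the first test sequence as an example input.
import tensorflow as tf

# Tokenize one held-out sequence and convert the model's logits into class probabilities
inputs = tokenizer(test_sequences[0], return_tensors="np")
logits = model(**inputs).logits
probabilities = tf.nn.softmax(logits, axis=-1).numpy()[0]
print("P(cytosol):", probabilities[0], "P(membrane):", probabilities[1])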
Now that we're done, let's see how we can upload our model to the HuggingFace Hub. This step is optional, but will allow us to easily share it with other researchers. If you encounter any errors here, make sure you ran the login cell at the top of the notebook!
First, let's set a couple of properties on our model's config. This is optional, but will ensure the model knows the names of its categories, rather than just outputting "0" or "1".
model.config.label2id = {"cytosol": 0, "membrane": 1}
model.config.id2label = {val: key for key, val in model.config.label2id.items()}
Now we can push it to the hub as simply as...
model_name = model_checkpoint.split('/')[-1]
finetuned_model_name = f"{model_name}-finetuned-cytosol-membrane-classification"
model.push_to_hub(finetuned_model_name)
tokenizer.push_to_hub(finetuned_model_name)
CommitInfo(commit_url='https://huggingface.co/Rocketknight1/esm2_t12_35M_UR50D-finetuned-cytosol-membrane-classification/commit/72448afb641a8460bb94d0efc9c61f0d37e4e123', commit_message='Upload tokenizer', commit_description='', oid='72448afb641a8460bb94d0efc9c61f0d37e4e123', pr_url=None, pr_revision=None, pr_num=None)
If you used the code above, you can now share this model with all your friends, family or favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", so for instance:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("your-username/my-awesome-model")
Another common language model task is token classification. In this task, instead of classifying the whole sequence into a single category, we categorize each token (amino acid, in this case!) into one or more categories. This kind of model could be useful for predicting properties of individual residues, such as their secondary structure - which is exactly the task we'll tackle in this section.
In this section, we're going to gather some training data from UniProt. As in the sequence classification example, we aim to create two lists: sequences and labels. Unlike in that example, however, the labels are more than just single integers. Instead, the label for each sample will be one integer per token in the input. This should make sense - when we do token classification, different tokens in the input may have different categories!
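Concretely, a single labelled sample will look something like this toy example - the sequence and label values here are made up purely to show the format, using the 0/1/2 secondary structure scheme we'll define below.
# One integer label per amino acid: 0 = no structure, 1 = helix, 2 = strand
toy_sequence = "MKTAYIAK"
toy_labels = [0, 0, 1, 1, 1, 2, 2, 0]
assert len(toy_labels) == len(toy_sequence)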
To demonstrate token classification, we're going to go back to UniProt and get some data on protein secondary structures. As above, this will probably be the main section you want to change when adapting this code to your own problems.
import requests
query_url ="https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Csequence%2Cft_strand%2Cft_helix&format=tsv&query=%28%28organism_id%3A9606%29%20AND%20%28reviewed%3Atrue%29%20AND%20%28length%3A%5B80%20TO%20500%5D%29%29"
This time, our UniProt search was (organism_id:9606) AND (reviewed:true) AND (length:[80 TO 500]), just as it was in the first example, but instead of Subcellular location [CC] we take the Helix and Beta strand columns, as they contain the secondary structure information we want.
uniprot_request = requests.get(query_url)
To get this data into Pandas, we use a BytesIO object, which Pandas will treat like a file. If you downloaded the data as a file you can skip this bit and just pass the filepath directly to read_csv.
from io import BytesIO
import pandas
bio = BytesIO(uniprot_request.content)
df = pandas.read_csv(bio, compression='gzip', sep='\t')
df
Entry | Sequence | Beta strand | Helix | |
---|---|---|---|---|
0 | A0A0K2S4Q6 | MTQRAGAAMLPSALLLLCVPGCLTVSGPSTVMGAVGESLSVQCRYE... | NaN | NaN |
1 | A0A5B9 | DLKNVFPPKVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWV... | STRAND 9..14; /evidence="ECO:0007829|PDB:4UDT"... | HELIX 2..4; /evidence="ECO:0007829|PDB:4UDT"; ... |
2 | A0AVI4 | MDSPEVTFTLAYLVFAVCFVFTPNEFHAAGLTVQNLLSGWLGSEDA... | NaN | NaN |
3 | A0JLT2 | MENFTALFGAQADPPPPPTALGFGPGKPPPPPPPPAGGGPGTAPPP... | STRAND 79..81; /evidence="ECO:0007829|PDB:7EMF" | HELIX 83..86; /evidence="ECO:0007829|PDB:7EMF"... |
4 | A0M8Q6 | GQPKAAPSVTLFPPSSEELQANKATLVCLVSDFNPGAVTVAWKADG... | NaN | NaN |
... | ... | ... | ... | ... |
11977 | Q9NZ38 | MAFPGQSDTKMQWPEVPALPLLSSLCMAMVRKSSALGKEVGRRSEG... | NaN | NaN |
11978 | Q9UFV3 | MAETYRRSRQHEQLPGQRHMDLLTGYSKLIQSRLKLLLHLGSQPPV... | NaN | NaN |
11979 | Q9Y6C7 | MAHHSLNTFYIWHNNVLHTHLVFFLPHLLNQPFSRGSFLIWLLLCW... | NaN | NaN |
11980 | X6R8D5 | MGRKEHESPSQPHMCGWEDSQKPSVPSHGPKTPSCKGVKAPHSSRP... | NaN | NaN |
11981 | X6R8R1 | MGVVLSPHPAPSRREPLAPLAPGTRPGWSPAVSGSSRSALRPSTAG... | NaN | NaN |
11982 rows × 4 columns
Since not all proteins have this structural information, we discard proteins that have no annotated beta strands or alpha helices.
no_structure_rows = df["Beta strand"].isna() & df["Helix"].isna()
df = df[~no_structure_rows]
df
Entry | Sequence | Beta strand | Helix | |
---|---|---|---|---|
1 | A0A5B9 | DLKNVFPPKVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWV... | STRAND 9..14; /evidence="ECO:0007829|PDB:4UDT"... | HELIX 2..4; /evidence="ECO:0007829|PDB:4UDT"; ... |
3 | A0JLT2 | MENFTALFGAQADPPPPPTALGFGPGKPPPPPPPPAGGGPGTAPPP... | STRAND 79..81; /evidence="ECO:0007829|PDB:7EMF" | HELIX 83..86; /evidence="ECO:0007829|PDB:7EMF"... |
14 | A1L3X0 | MAFSDLTSRTVHLYDNWIKDADPRVEDWLLMSSPLPQTILLGFYVY... | STRAND 97..99; /evidence="ECO:0007829|PDB:6Y7F" | HELIX 17..20; /evidence="ECO:0007829|PDB:6Y7F"... |
16 | A1Z1Q3 | MYPSNKKKKVWREEKERLLKMTLEERRKEYLRDYIPLNSILSWKEE... | STRAND 71..77; /evidence="ECO:0007829|PDB:4IQY... | HELIX 11..19; /evidence="ECO:0007829|PDB:4IQY"... |
20 | A2RUC4 | MAGQHLPVPRLEGVSREQFMQHLYPQRKPLVLEGIDLGPCTSKWTV... | STRAND 10..13; /evidence="ECO:0007829|PDB:3AL5... | HELIX 16..22; /evidence="ECO:0007829|PDB:3AL5"... |
... | ... | ... | ... | ... |
11551 | Q96I45 | MVNLGLSRVDDAVAAKHPGLGEYAACQSHAFMKGVFTFVTGTGMAF... | STRAND 3..5; /evidence="ECO:0007829|PDB:2LOR";... | HELIX 6..16; /evidence="ECO:0007829|PDB:2LOR";... |
11614 | Q9H0W7 | MPTNCAAAGCATTYNKHINISFHRFPLDPKRRKEWVRLVRRKNFVP... | STRAND 7..9; /evidence="ECO:0007829|PDB:2D8R";... | HELIX 29..38; /evidence="ECO:0007829|PDB:2D8R" |
11659 | Q9P1F3 | MNVDHEVNLLVEEIHRLGSKNADGKLSVKFGVLFRDDKCANLFEAL... | STRAND 24..29; /evidence="ECO:0007829|PDB:2L2O... | HELIX 3..17; /evidence="ECO:0007829|PDB:2L2O";... |
11661 | Q9P298 | MSANRRWWVPPDDEDCVSEKLLRKTRESPLVPIGLGGCLVVAAYRI... | STRAND 11..14; /evidence="ECO:0007829|PDB:2LON... | HELIX 18..24; /evidence="ECO:0007829|PDB:2LON"... |
11668 | Q9UIY3 | MSASVKESLQLQLLEMEMLFSMFPNQGEVKLEDVNALTNIKRYLEG... | STRAND 28..32; /evidence="ECO:0007829|PDB:2DAW... | HELIX 5..22; /evidence="ECO:0007829|PDB:2DAW";... |
3911 rows × 4 columns
Well, this works, but that data still isn't in a clean format that we can use to build our labels. Let's take a look at one sample to see what exactly we're dealing with:
df.iloc[0]["Helix"]
'HELIX 2..4; /evidence="ECO:0007829|PDB:4UDT"; HELIX 17..23; /evidence="ECO:0007829|PDB:4UDT"; HELIX 83..86; /evidence="ECO:0007829|PDB:4UDT"'
We'll need to use a regex to pull out each segment that's marked as being a STRAND or HELIX. What we're asking for is a list of everywhere we see the word STRAND or HELIX followed by two numbers separated by two dots. In each case where this pattern is found, we tell the regex to extract the two numbers as a tuple for us.
import re
strand_re = r"STRAND\s(\d+)\.\.(\d+)\;"
helix_re = r"HELIX\s(\d+)\.\.(\d+)\;"
re.findall(helix_re, df.iloc[0]["Helix"])
[('2', '4'), ('17', '23'), ('83', '86')]
Looks good! We can use this to build our training data. Recall that the labels need to be a list or array of integers that's the same length as the input sequence. We're going to use 0 to indicate residues without any annotated structure, 1 for residues in an alpha helix, and 2 for residues in a beta strand. To build that, we'll start with an array of all 0s, and then fill in values based on the positions that our regex pulls out of the UniProt results.
We'll use NumPy arrays rather than lists here, since these allow slice assignment, which will be a lot simpler than editing a list of integers. Note also that UniProt annotates residues starting from 1 (unlike Python, which starts from 0), and region annotations are inclusive (so 1..3 means residues 1, 2 and 3). To turn these into Python slices, we subtract 1 from the start of each annotation, but not the end.
import numpy as np
def build_labels(sequence, strands, helices):
# Start with all 0s
labels = np.zeros(len(sequence), dtype=np.int64)
if isinstance(helices, float): # Indicates missing (NaN)
found_helices = []
else:
found_helices = re.findall(helix_re, helices)
for helix_start, helix_end in found_helices:
helix_start = int(helix_start) - 1
helix_end = int(helix_end)
assert helix_end <= len(sequence)
labels[helix_start: helix_end] = 1 # Helix category
if isinstance(strands, float): # Indicates missing (NaN)
found_strands = []
else:
found_strands = re.findall(strand_re, strands)
for strand_start, strand_end in found_strands:
strand_start = int(strand_start) - 1
strand_end = int(strand_end)
assert strand_end <= len(sequence)
labels[strand_start: strand_end] = 2 # Strand category
return labels
Now that we've defined a helper function, let's build our lists of sequences and labels:
sequences = []
labels = []
for row_idx, row in df.iterrows():
row_labels = build_labels(row["Sequence"], row["Beta strand"], row["Helix"])
sequences.append(row["Sequence"])
labels.append(row_labels)
Nice! Now we'll split and tokenize the data, and then create datasets - I'll go through this quite quickly here, since it's identical to how we did it in the sequence classification example above.
from sklearn.model_selection import train_test_split
train_sequences, test_sequences, train_labels, test_labels = train_test_split(sequences, labels, test_size=0.25, shuffle=True)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
train_tokenized = tokenizer(train_sequences)
test_tokenized = tokenizer(test_sequences)
from datasets import Dataset
train_dataset = Dataset.from_dict(train_tokenized)
test_dataset = Dataset.from_dict(test_tokenized)
train_dataset = train_dataset.add_column("labels", train_labels)
test_dataset = test_dataset.add_column("labels", test_labels)
The key difference here with the above example is that we use TFAutoModelForTokenClassification instead of TFAutoModelForSequenceClassification. We will also need a data_collator this time, as we're in the slightly more complex case where both inputs and labels must be padded in each batch.
from transformers import TFAutoModelForTokenClassification
num_labels = 3
model = TFAutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
Some layers from the model checkpoint at facebook/esm2_t12_35M_UR50D were not used when initializing TFEsmForTokenClassification: ['lm_head'] - This IS expected if you are initializing TFEsmForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing TFEsmForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some layers of TFEsmForTokenClassification were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['dropout_73', 'classifier'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="np")
Now we create our tf.data.Dataset objects as before. Remember to pass the data collator, though! Note that when you pass a data collator, there's no need to pass your tokenizer, as the data collator is handling padding for us.
tf_train_set = model.prepare_tf_dataset(
train_dataset,
batch_size=8,
shuffle=True,
collate_fn=data_collator
)
tf_test_set = model.prepare_tf_dataset(
test_dataset,
batch_size=8,
shuffle=False,
collate_fn=data_collator
)
/home/matt/PycharmProjects/transformers/src/transformers/tokenization_utils_base.py:715: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. tensor = as_tensor(value)
Our metrics are a bit more complex than in the sequence classification task, as we need to ignore padding tokens (those where the label is -100). This means we'll need our own metric function where we only compute accuracy on non-padding tokens.
from transformers import AdamWeightDecay
import tensorflow as tf
def masked_accuracy(y_true, y_pred):
predictions = tf.math.argmax(y_pred, axis=-1) # Highest logit corresponds to predicted category
numerator = tf.math.count_nonzero((predictions == tf.cast(y_true, predictions.dtype)) & (y_true != -100), dtype=tf.float32)
denominator = tf.math.count_nonzero(y_true != -100, dtype=tf.float32)
return numerator / denominator
model.compile(optimizer=AdamWeightDecay(2e-5), metrics=[masked_accuracy])
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
And now we're ready to train our model!
model.fit(tf_train_set, validation_data=tf_test_set, epochs=3)
Epoch 1/3 366/366 [==============================] - 78s 184ms/step - loss: 0.5809 - masked_accuracy: 0.7502 - val_loss: 0.4764 - val_masked_accuracy: 0.8023 Epoch 2/3 366/366 [==============================] - 65s 177ms/step - loss: 0.4534 - masked_accuracy: 0.8132 - val_loss: 0.4564 - val_masked_accuracy: 0.8115 Epoch 3/3 366/366 [==============================] - 64s 176ms/step - loss: 0.4108 - masked_accuracy: 0.8325 - val_loss: 0.4586 - val_masked_accuracy: 0.8119
<keras.callbacks.History at 0x7f60a011e320>
This definitely seems harder than the first task, but we still attain a very respectable accuracy. Remember that to keep this demo lightweight, we used one of the smallest ESM models, focused on human proteins only and didn't put a lot of work into making sure we only included completely-annotated proteins in our training set. With a bigger model and a cleaner, broader training set, accuracy on this task could definitely go a lot higher!
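As a final sanity check, here's a sketch of running the trained model on one held-out sequence and reading off a per-residue prediction. The tokenizer adds special tokens at the start and end of the sequence, so we strip the first and last positions to line the predictions up with the amino acids.
import tensorflow as tf

# Predict a secondary structure class for every residue in one test sequence
inputs = tokenizer(test_sequences[0], return_tensors="np")
logits = model(**inputs).logits
predictions = tf.math.argmax(logits, axis=-1).numpy()[0]
per_residue = predictions[1:-1]  # Drop the special tokens at either end
print(list(zip(test_sequences[0][:10], per_residue[:10])))  # First 10 residues and their predicted classes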
Now, let's push this model to the hub as we did before, while also setting the category labels appropriately.
model.config.label2id = {"unstructured": 0, "helix": 1, "strand": 2}
model.config.id2label = {val: key for key, val in model.config.label2id.items()}
model_name = model_checkpoint.split('/')[-1]
finetuned_model_name = f"{model_name}-finetuned-secondary-structure-classification"
model.push_to_hub(finetuned_model_name)
tokenizer.push_to_hub(finetuned_model_name)
CommitInfo(commit_url='https://huggingface.co/Rocketknight1/esm2_t12_35M_UR50D-finetuned-secondary-structure-classification/commit/42093032cc6f061e1ef23fdf96ad80e5dce1a75a', commit_message='Upload tokenizer', commit_description='', oid='42093032cc6f061e1ef23fdf96ad80e5dce1a75a', pr_url=None, pr_revision=None, pr_num=None)
If you used the code above, you can now share this model with all your friends, family or favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", so for instance:
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("your-username/my-awesome-model")