Sometimes, the task you want to solve aligns well with pre-existing tasks. If that is the case, you're in luck: you can leverage work done by others. In this notebook, we'll see how to use some useful libraries to fetch pretrained models and obtain predictions.
Transformers are a class of models, just like CNNs. Their distinguishing features are:
No assumptions about the structure of the input representation: whereas CNNs assume a 1D or 2D grid with spatial coherence, transformers operate on sets.
The use of self-attention to correlate information from different elements of the set (see the sketch below).
A more detailed description of transformers is available here.
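To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product self-attention (single head, no residuals or layer norm; the weight matrices w_q, w_k, w_v are hypothetical placeholders, not part of any library API):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (set_size, d_model) -- note there is no positional ordering here
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise similarities between elements
    weights = F.softmax(scores, dim=-1)      # each element attends to all the others
    return weights @ v

d = 8
x = torch.randn(5, d)                        # a "set" of 5 elements
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
out.shape  # torch.Size([5, 8])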
Today, the field of Natural Language Processing is heavily driven by transformer architectures. The following picture shows the leaderboard of the GLUE benchmark, a well-known benchmark for analysing the natural language understanding capabilities of ML models.
The top of the ranking consists mainly of transformer architectures. These models have seen a real explosion in the last few years:
Hugging Face is an open-source provider of natural language processing (NLP) technologies. They created the transformers library, which contains powerful abstractions to train/fine-tune/test transformer models. This is especially useful given the explosion of transformer architectures available.
We need to install the transformers library:
!pip install transformers
The documentation for the transformers library is available here.
Hugging Face provides a nice UI to search for models: https://huggingface.co/models
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
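Optionally, we can put the model in evaluation mode (this disables dropout) and, if a GPU is available, move it there for faster inference:

model.eval()
# model.to('cuda')  # uncomment if a GPU is available; the inputs must be moved there too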
sentences = ['I love transformers', 'The cat sat on the hat']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
encoded_input  # a dict with input_ids and attention_mask tensors
For more info on the tokenizer, see the doc.
tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])  # inspect the subword tokens of the first sentence
model(**encoded_input)  # forward pass; the first output holds the per-token hidden states
# Mean Pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    b, s, e = token_embeddings.shape
    sentence_lengths = attention_mask.sum(dim=1).view(-1, 1)  # number of real (non-padding) tokens per sentence
    token_embeddings = token_embeddings * attention_mask.view(-1, s, 1)  # zero out padded tokens
    return token_embeddings.sum(dim=1) / (sentence_lengths.float() + 1e-9)
def get_embeddings(sentences, tokenizer, model):
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings (no gradients needed at inference time)
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return sentence_embeddings
sentences = ['This is an example sentence', 'This is a longer example sentence']
get_embeddings(sentences, tokenizer, model)
Recall that the cosine similarity of two vectors is $\frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$. If $v_1$ and $v_2$ are very similar (the angle between them is small), then their cosine similarity is close to $1$.
def cosine_sim(v1, v2):
    # the small epsilon in the denominator avoids division by zero
    sim = (v1 * v2).sum() / (v1.norm() * v2.norm() + 1e-8)
    return sim
sentences = ['The cat ate the fish',
'No feline would say no to tuna',
'Fish populations are threatened by the fishing industry',
'John bought an electric bike']
embs = get_embeddings(sentences, tokenizer, model)
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[1]}': {cosine_sim(embs[0], embs[1]):.2f}")
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[2]}': {cosine_sim(embs[0], embs[2]):.2f}")
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[3]}': {cosine_sim(embs[0], embs[3]):.2f}\n")
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
from transformers import pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "John loves Lausanne for its proximity to the lake and the great workshops at EPFL."
ner_results = nlp(example)
ner_results
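By default, the pipeline returns one prediction per subword token. Depending on your transformers version, you can ask it to merge subwords into whole entities; in recent versions this is done with the aggregation_strategy argument (older versions use grouped_entities=True instead):

# merge subword predictions into whole entities (argument name depends on the library version)
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp_grouped(example)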
Here is the link to the model.
# TODO find the code to load the model and generate some text
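A possible sketch using the text-generation pipeline ("gpt2" is an assumed model name here, since the model behind the link above is not named in the text; swap in the model you want):

from transformers import pipeline

# "gpt2" is an assumption; replace it with the model from the link above
generator = pipeline("text-generation", model="gpt2")
generator("Transformers are", max_length=30, num_return_sequences=2, do_sample=True)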
You can modify the parameters controlling the text generation; check them out here.