Sometimes, the task you want to solve aligns well with pre-existing tasks. If that is the case, you're in luck: you can leverage work done by others. In this notebook, we'll see how to use some useful libraries to fetch pretrained models and obtain predictions.
Transformers are a class of models, just like CNNs. Their distinguishing features are:
No assumptions about the structure of the input representation: whereas CNNs assume a 1D or 2D grid with spatial coherence, transformers operate on sets.
The use of self-attention to correlate information from different elements of the set (see the sketch below).
A more detailed description of transformers is available here.
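To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product self-attention (single head, no residuals or layer norm; the weight matrices w_q, w_k, w_v are hypothetical placeholders, not part of any library API):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (set_size, d_model) -- note there is no positional ordering here
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise similarities between elements
    weights = F.softmax(scores, dim=-1)      # each element attends to all the others
    return weights @ v

d = 8
x = torch.randn(5, d)                        # a "set" of 5 elements
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
out.shape  # torch.Size([5, 8])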
Today, the field of Natural Language Processing is heavily driven by transformer architectures. The following picture shows the leaderboard of the GLUE benchmark, a well-known benchmark for analysing the natural language understanding capabilities of ML models.
The top of the ranking consists mainly of transformer architectures. These models have seen a real explosion in the last few years:
Hugging Face is an open-source provider of natural language processing (NLP) technologies. They created the transformers library, which contains powerful abstractions to train/fine-tune/test transformer models. This is especially useful given the explosion of transformer architectures available.
We need to install the transformers library:
!pip install transformers
The documentation for the transformers library is available here.
Hugging Face provides a nice UI to search for models: https://huggingface.co/models
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
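Optionally, we can put the model in evaluation mode (this disables dropout) and, if a GPU is available, move it there for faster inference:

model.eval()
# model.to('cuda')  # uncomment if a GPU is available; the inputs must be moved there too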
sentences = ['I love transformers', 'The cat sat on the hat']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
encoded_input  # a dict with input_ids and attention_mask tensors
For more info on the tokenizer, see the doc.
tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])  # inspect the subword tokens of the first sentence
model(**encoded_input)  # forward pass; the first output holds the per-token hidden states
# Mean Pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    b, s, e = token_embeddings.shape
    sentence_lengths = attention_mask.sum(dim=1).view(-1, 1)  # number of real (non-padding) tokens per sentence
    token_embeddings = token_embeddings * attention_mask.view(-1, s, 1)  # zero out padded tokens
    return token_embeddings.sum(dim=1) / (sentence_lengths.float() + 1e-9)
def get_embeddings(sentences, tokenizer, model):
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings (no gradients needed at inference time)
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return sentence_embeddings
sentences = ['This is an example sentence', 'This is a longer example sentence']
get_embeddings(sentences, tokenizer, model)
Recall that the cosine similarity of two vectors is $\frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$. If $v_1$ and $v_2$ are very similar (the angle between them is small), then their cosine similarity is close to $1$.
def cosine_sim(v1, v2):
    # the small epsilon in the denominator avoids division by zero
    sim = (v1 * v2).sum() / (v1.norm() * v2.norm() + 1e-8)
    return sim
sentences = ['The cat ate the fish',
'No feline would say no to tuna',
'Fish populations are threatened by the fishing industry',
'John bought an electric bike']
embs = get_embeddings(sentences, tokenizer, model)
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[1]}': {cosine_sim(embs[0], embs[1]):.2f}")
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[2]}': {cosine_sim(embs[0], embs[2]):.2f}")
print(f"Cosine-similarity between '{sentences[0]}' and '{sentences[3]}': {cosine_sim(embs[0], embs[3]):.2f}\n")
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
from transformers import pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "John loves Lausanne for its proximity to the lake and the great workshops at EPFL."
ner_results = nlp(example)
ner_results
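By default, the pipeline returns one prediction per subword token. Depending on your transformers version, you can ask it to merge subwords into whole entities; in recent versions this is done with the aggregation_strategy argument (older versions use grouped_entities=True instead):

# merge subword predictions into whole entities (argument name depends on the library version)
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp_grouped(example)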
Here is the link to the model.
# TODO find the code to load the model and generate some text
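A possible sketch using the text-generation pipeline ("gpt2" is an assumed model name here, since the model behind the link above is not named in the text; swap in the model you want):

from transformers import pipeline

# "gpt2" is an assumption; replace it with the model from the link above
generator = pipeline("text-generation", model="gpt2")
generator("Transformers are", max_length=30, num_return_sequences=2, do_sample=True)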
You can modify the parameters controlling the text generation; check them out here.