Quick summary:
Soon: datasets streaming for huge datasets and 100+ datasets
import logging
logging.basicConfig(level=logging.INFO)
# Let's import the library
import nlp
INFO:nlp.utils.file_utils:PyTorch version 1.4.0 available.
Currently available 54 datasets (not tested yet for most of them):
# Downloading and loading a dataset is a one-liner
dataset = nlp.load('squad', split='validation[:10%]')
INFO:nlp.load:Dataset script /Users/thomwolf/.cache/huggingface/datasets/ee43d2be6898ebb9c2afefda4455306911d308bcf924d21c975796832cc7c114.e7d8881147e5da61c98918c61832c7f1c88b33b51a082c464e70e119bb24983d already found in datasets directory at /Users/thomwolf/Documents/GitHub/datasets/src/nlp/datasets/686d79c021d7dcd78da4d67fe01fbe30dfecabcd4bd02d06aa9d51edab713144/squad.py, returning it. Use `force_reload=True` to override. INFO:nlp.builder:No config specified, defaulting to first: squad/plain_text INFO:nlp.builder:Overwrite dataset info from restored data version. INFO:nlp.info:Loading Dataset info from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0 INFO:nlp.builder:Reusing dataset squad (/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0) INFO:nlp.builder:Constructing Dataset for split validation[:10%], from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
This call to nlp.load() performs the following steps under the hood:
Download the SQuAD Python processing script from our S3 and import it into the library, if it's not already stored in the library. You can find the SQuAD processing script here for instance.
Processing scripts are small Python scripts that define the info and format of the dataset, and contain the URL of the original SQuAD JSON files as well as the code to load examples from those files.
Run the SQuAD Python processing script, which will:
Download the SQuAD dataset from the original URL (see the script) if it's not already downloaded and cached.
Process all of SQuAD and cache it on the drive as a structured Arrow table for each standard split.
Arrow tables are arbitrarily long tables, typed with types that can be mapped to numpy/pandas/Python standard types, and they can store nested objects. They can be accessed directly from the drive, loaded in RAM or even streamed over the web.
Return a dataset built from the splits requested by the user (default: all). In the above example we create a dataset with the first 10% of the validation split.
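The download-once, process-once, reuse-afterwards pattern described in these steps can be sketched with the standard library alone. This is a minimal illustration and not the library's implementation: `load_with_cache`, `fake_fetch`, `to_columns`, the example URL and the JSON cache format are all hypothetical stand-ins (the library caches Arrow tables, not JSON).

```python
import hashlib
import json
import tempfile
from pathlib import Path


def load_with_cache(url, fetch, process, cache_dir):
    """Fetch and process a resource once, then reuse the cached result."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cache file after a hash of the URL (hypothetical naming scheme).
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        # Cache hit: skip both the download and the processing.
        return json.loads(cache_file.read_text(encoding="utf-8")), True
    raw = fetch(url)        # download the raw data
    table = process(raw)    # convert it to a structured, column-oriented form
    cache_file.write_text(json.dumps(table), encoding="utf-8")
    return table, False


# Stand-ins for the real download and processing code.
calls = []


def fake_fetch(url):
    calls.append(url)
    return [{"q": "What is Arrow?"}, {"q": "What is SQuAD?"}]


def to_columns(records):
    return {"q": [record["q"] for record in records]}


with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.com/data.json"  # placeholder, never actually requested
    first, hit1 = load_with_cache(url, fake_fetch, to_columns, tmp)
    second, hit2 = load_with_cache(url, fake_fetch, to_columns, tmp)

print(hit1, hit2, len(calls))  # False True 1
```

The second call returns the cached table without calling `fake_fetch` again, which is the behaviour the "Reusing dataset squad" log line above reports for the real library.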
# General information on the dataset is provided in the `.info` property
print(dataset.info)
DatasetInfo( name='squad', version=1.0.0, description='Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. ', homepage='https://rajpurkar.github.io/SQuAD-explorer/', features=struct<id: string, title: string, context: string, question: string, answers: struct<text: list<item: string>, answer_start: list<item: int32>>>, total_num_examples=98169, splits={ 'train': 87599, 'validation': 10570, }, supervised_keys=None, citation="""@article{2016arXiv160605250R, author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev}, Konstantin and {Liang}, Percy}, title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}", journal = {arXiv e-prints}, year = 2016, eid = {arXiv:1606.05250}, pages = {arXiv:1606.05250}, archivePrefix = {arXiv}, eprint = {1606.05250}, }""", license=None, )
The returned Dataset object is a memory-mapped dataset that behaves similarly to a normal map-style dataset. It is backed by an Apache Arrow table, which allows many interesting features.
print(dataset)
Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 1057)
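To give a feel for what "memory mapped" means here, the sketch below uses Python's standard `mmap` module on a file of fixed-size records. It is only an analogy under simplified assumptions: the record layout (`RECORD`) and the helper `get_row` are hypothetical, and Arrow's on-disk format is considerably richer than a flat array of integers.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<i")  # one little-endian 32-bit integer per row

# Write 1000 fixed-size records to a file on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for value in range(1000):
        f.write(RECORD.pack(value * 2))
    path = f.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:

        def get_row(i):
            # Only the bytes of the requested row are sliced out of the mapping;
            # the file is not read into a Python object as a whole.
            start = i * RECORD.size
            return RECORD.unpack(mm[start:start + RECORD.size])[0]

        num_rows = len(mm) // RECORD.size
        row_500 = get_row(500)

os.remove(path)
print(num_rows, row_500)  # 1000 1000
```

Random access by row index works without loading the whole file, which is why a memory-mapped dataset can report a large size on the drive while allocating very little RAM.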
You can query its length and get items or slices as you normally would with a Python sequence.
from pprint import pprint
print(f"Dataset len(dataset): {len(dataset)}")
print("First item:")
pprint(dataset[0])
print("Slice of the first two items:")
pprint(dataset[:2])
Dataset len(dataset): 1057 First item: {'answers': {'answer_start': [177, 177, 177], 'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}, 'context': 'Super Bowl 50 was an American football game to determine the ' 'champion of the National Football League (NFL) for the 2015 ' 'season. The American Football Conference (AFC) champion Denver ' 'Broncos defeated the National Football Conference (NFC) champion ' 'Carolina Panthers 24–10 to earn their third Super Bowl title. The ' "game was played on February 7, 2016, at Levi's Stadium in the San " 'Francisco Bay Area at Santa Clara, California. As this was the ' '50th Super Bowl, the league emphasized the "golden anniversary" ' 'with various gold-themed initiatives, as well as temporarily ' 'suspending the tradition of naming each Super Bowl game with ' 'Roman numerals (under which the game would have been known as ' '"Super Bowl L"), so that the logo could prominently feature the ' 'Arabic numerals 50.', 'id': '56be4db0acb8001400a502ec', 'question': 'Which NFL team represented the AFC at Super Bowl 50?', 'title': 'Super_Bowl_50'} Slice of the first two items: {'answers': [{'answer_start': [177, 177, 177], 'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}, {'answer_start': [249, 249, 249], 'text': ['Carolina Panthers', 'Carolina Panthers', 'Carolina Panthers']}], 'context': ['Super Bowl 50 was an American football game to determine the ' 'champion of the National Football League (NFL) for the 2015 ' 'season. The American Football Conference (AFC) champion Denver ' 'Broncos defeated the National Football Conference (NFC) champion ' 'Carolina Panthers 24–10 to earn their third Super Bowl title. ' "The game was played on February 7, 2016, at Levi's Stadium in " 'the San Francisco Bay Area at Santa Clara, California. 
As this ' 'was the 50th Super Bowl, the league emphasized the "golden ' 'anniversary" with various gold-themed initiatives, as well as ' 'temporarily suspending the tradition of naming each Super Bowl ' 'game with Roman numerals (under which the game would have been ' 'known as "Super Bowl L"), so that the logo could prominently ' 'feature the Arabic numerals 50.', 'Super Bowl 50 was an American football game to determine the ' 'champion of the National Football League (NFL) for the 2015 ' 'season. The American Football Conference (AFC) champion Denver ' 'Broncos defeated the National Football Conference (NFC) champion ' 'Carolina Panthers 24–10 to earn their third Super Bowl title. ' "The game was played on February 7, 2016, at Levi's Stadium in " 'the San Francisco Bay Area at Santa Clara, California. As this ' 'was the 50th Super Bowl, the league emphasized the "golden ' 'anniversary" with various gold-themed initiatives, as well as ' 'temporarily suspending the tradition of naming each Super Bowl ' 'game with Roman numerals (under which the game would have been ' 'known as "Super Bowl L"), so that the logo could prominently ' 'feature the Arabic numerals 50.'], 'id': ['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed'], 'question': ['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?'], 'title': ['Super_Bowl_50', 'Super_Bowl_50']}
You can get a full column of the dataset by indexing with its name as a string:
print(dataset['question'][:10])
['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?', 'Where did Super Bowl 50 take place?', 'Which NFL team won Super Bowl 50?', 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'What was the theme of Super Bowl 50?', 'What day was the game played on?', 'What is the AFC short for?', 'What was the theme of Super Bowl 50?', 'What does AFC stand for?']
Items are returned as a dict of elements.
Slices are returned as a dict of lists of elements.
Columns are returned as a list.
You can thus apply slice, index and column indexing in any order and get identical results:
print(dataset[0]['question'] == dataset['question'][0])
print(dataset[10:20]['context'] == dataset['context'][10:20])
True True
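These three indexing modes follow naturally from a column-oriented layout. The toy class below is a hypothetical illustration (`ToyTable` is not part of the library) that stores each column as a plain list and reproduces the same return shapes.

```python
class ToyTable:
    """A tiny column-oriented table (illustration only, not the library's class)."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name -> the full column as a list
            return self.columns[key]
        # Integer -> dict of elements; slice -> dict of lists.
        # Indexing each column's list with the same key handles both cases.
        return {name: values[key] for name, values in self.columns.items()}


table = ToyTable({
    "question": ["Who?", "Where?", "When?"],
    "title": ["A", "A", "B"],
})

print(table[0])           # {'question': 'Who?', 'title': 'A'}
print(table[:2])          # {'question': ['Who?', 'Where?'], 'title': ['A', 'A']}
print(table["question"])  # ['Who?', 'Where?', 'When?']

# Row-then-column and column-then-row give the same element.
print(table[0]["question"] == table["question"][0])  # True
```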
# The underlying table is typed (int/float/strings/lists/dict) and structured
print(dataset.column_names)
print(dataset.schema)
['id', 'title', 'context', 'question', 'answers'] id: string title: string context: string question: string answers: struct<text: list<item: string>, answer_start: list<item: int32>> child 0, text: list<item: string> child 0, item: string child 1, answer_start: list<item: int32> child 0, item: int32
# Datasets also have a bunch of properties you can access
print("The number of bytes allocated on the drive is ", dataset.nbytes)
print("For comparison, here is the number of bytes allocated in memory which can be")
print("accessed with `nlp.total_allocated_bytes()`: ", nlp.total_allocated_bytes())
print("The number of rows", dataset.num_rows)
print("The number of columns", dataset.num_columns)
print("The shape (rows, columns)", dataset.shape)
The number of bytes allocated on the drive is 10472672 For comparison, here is the number of bytes allocated in memory which can be accessed with `nlp.total_allocated_bytes()`: 0 The number of rows 1057 The number of columns 5 The shape (rows, columns) (1057, 5)
# We can list the unique elements in a column. This is done by the backend (so fast!)
print(dataset.unique('title'))
['Super_Bowl_50', 'Warsaw']
# This will drop the column 'id'
dataset.drop('id') # Remove column 'id'
print(dataset.column_names)
['title', 'context', 'question', 'answers']
# This will flatten the nested columns in 'answers'
dataset.flatten()
print(dataset.column_names)
['title', 'context', 'question', 'answers.text', 'answers.answer_start']
# We can also "dictionary encode" a column if many of its elements are identical
# This will reduce its size by storing each distinct element (e.g. string) only once
# It only affects the internal storage (no difference from the user's point of view)
dataset.dictionary_encode_column('title')
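The idea behind dictionary encoding can be shown in a few lines of plain Python. This is a conceptual sketch, not how Arrow implements it: the helper names `dictionary_encode` and `dictionary_decode` are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) such that dictionary[indices[i]] == values[i]."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per original element
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from its encoded form."""
    return [dictionary[i] for i in indices]


titles = ["Super_Bowl_50"] * 3 + ["Warsaw"] * 2
dictionary, indices = dictionary_encode(titles)
print(dictionary)  # ['Super_Bowl_50', 'Warsaw']
print(indices)     # [0, 0, 0, 1, 1]
print(dictionary_decode(dictionary, indices) == titles)  # True
```

Each repeated string is stored once and referenced by an integer, which is why the encoding pays off on columns such as `title` where only a few distinct values occur.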
You can check the current cache files backing the dataset with the .cache_files property
dataset.cache_files
({'filename': '/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/squad-validation.arrow', 'skip': 0, 'take': 1057},)
You can clean up the cache files in the current dataset directory with .cleanup_cache_files().
Make sure that no other process is using these cache files when running this command.
dataset.cleanup_cache_files() # Returns the number of removed cache files
INFO:nlp.arrow_dataset:Listing files in /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0 INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-2b0c4368cd1b9d9ab7dd158754adb501.arrow INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-fef84cefe794447d6dc0b28596974c80.arrow INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b9d042be98ac7ed20cb12b2e9d65d208.arrow INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-0d81cced63f868bf1a233bffb4c94b85.arrow INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-fdd554f8e6ee8230941052eceac92e0f.arrow INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-2d5d9f6d0f564bbd27c91aee95cfc0dc.arrow INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-79ea07cbbe2ddf3afe1d0c6ac0269cc3.arrow
7
dataset.map
There is a powerful method, .map(), that you can use to apply a function to each example, independently or in batches.
# `.map()` takes a callable accepting a dict as argument
# (same dict as returned by dataset[i])
# and iterates over the dataset, calling the function with each example.
# Let's print the length of each `context` string in our subset of the dataset
# (10% of the validation i.e. 1057 examples)
dataset.map(lambda example: print(len(example['context']), end=','))
1057it [00:00, 10624.60it/s]
775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,179,179,179,179,179,179,179,179,179,179,179,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,704,704,704,704,704,704,704,704,704,704,704,704,704,704,353,353,353,353,353,353,353,353,353,353,353,353,353,353,353,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,306,306,306,306,306,306,306,306,306,306,306,306,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,496,496,496,496,496,496,496,496,496,496,496,496,496,496,496,260,260,260,260,260,260,260,260,260,874,874,874,874,874,874,874,874,874,874,874,874,874,874,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,536,536,536,536,536,536,536,536,536,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,495,495,495,495,495,495,495
,495,495,495,495,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,441,441,441,441,441,441,441,441,441,441,441,357,357,357,357,357,357,357,357,357,296,296,296,296,296,296,296,296,296,296,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,804,804,804,804,804,804,804,804,804,804,804,397,397,397,397,397,397,397,397,397,397,397,397,397,397,360,360,360,360,360,360,360,973,973,973,973,973,973,973,973,973,973,973,973,973,973,263,263,263,263,263,263,263,263,263,263,263,568,568,568,568,568,568,568,568,568,568,568,264,264,264,264,264,264,264,264,264,264,264,264,264,264,264,892,892,892,892,892,892,892,892,892,892,892,206,206,206,206,206,489,489,489,489,489,489,489,489,489,489,489,489,489,181,181,181,181,181,181,181,181,181,181,181,181,531,531,531,531,531,531,531,531,531,531,531,531,664,664,664,664,664,664,664,664,664,664,664,664,664,664,672,672,672,672,672,672,672,672,672,672,672,672,672,672,858,858,858,858,858,858,858,858,858,858,858,858,634,634,634,634,634,634,634,634,634,634,634,634,634,634,891,891,891,891,891,891,891,891,891,891,891,891,891,488,488,488,488,488,488,488,488,488,488,488,488,942,942,942,942,942,942,942,942,942,942,942,942,942,942,942,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,522,522,522,522,522,1643,1643,1643,1643,1643,628,628,628,628,628,758,758,758,758,758,883,883,883,883,883,559,559,559,559,559,603,603,603,603,631,631,631,631,631,626,626,626,626,626,541,541,541,541,541,795,795,795,795,795,591,591,591,591,591,568,568,568,568,568,536,536,536,536,536,575,575,575,575,575,571,571,571,571,571,641,641,641,641,641,665,665,665,665,665,1088,1088,1088,1088,1088,1619,1619,1619,1619,1619,939,939,939,939,939,865,865,865,865,865,711,711,711,711,711,831,831,831,831,831,501,501,501,501,501,676,676,676,676,676,854,854,854,854,854,784,784,784,784,784,641,641,641,641,641,544,544,544,544,544,918,918,918,918,918,763,763,763,763,763,906
,906,906,906,906,632,632,632,632,632,869,869,869,869,869,1044,1044,1044,1044,1044,760,760,760,760,760,715,715,715,715,715,838,838,838,838,838,881,881,881,881,881,940,940,940,940,940,618,618,618,618,618,1205,1205,1205,534,534,534,534,534,757,757,757,757,757,1239,1239,1239,1239,1239,609,609,609,609,609,798,798,798,798,798,613,613,613,613,613,613,613,613,613,613,
Dataset(schema: {'title': 'string', 'context': 'string', 'question': 'string', 'answers.text': 'list<item: string>', 'answers.answer_start': 'list<item: int32>'}, num_rows: 1057)
This is basically the same as doing
for example in dataset:
    function(example)
The above example had no effect on the dataset because the function supplied to .map() didn't return a dict or an abc.Mapping that could be used to update the examples in the dataset. In that case, .map() just returns the same dataset (self).
Now let's see how to use a function that can modify the dataset.
The main interest of .map() is to update and modify the content of the table.
To use .map() to update elements in the table, you should provide a function with the following signature: function(example: dict) -> dict.
# Let's add a prefix 'My cute title: ' to each of our titles
def add_prefix_to_title(example):
    example['title'] = 'My cute title: ' + example['title']
    return example
dataset = dataset.map(add_prefix_to_title)
print(dataset.unique('title'))
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-ed8b1249a765df5c159965379e685e44.arrow 1057it [00:00, 21208.28it/s] INFO:nlp.arrow_writer:Done writing 1057 examples in 906626 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-ed8b1249a765df5c159965379e685e44.arrow.
['My cute title: Super_Bowl_50', 'My cute title: Warsaw']
This call to .map() computes and returns the updated table. It also stores the updated table in a cache file indexed by the current state and the mapped function. A subsequent call to .map() (even in another Python session) will reuse the cached file instead of recomputing the operation (this caching may not work in Jupyter notebooks yet).
The returned updated dataset is (again) directly memory mapped from drive and not allocated in RAM.
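One way to picture a cache "indexed by the current state and the mapped function" is a file name derived from a hash of both. The sketch below is purely an assumption for illustration: the text does not say how the library computes its cache keys, and `cache_key`, the state strings and the use of compiled bytecode are all hypothetical choices.

```python
import hashlib


def cache_key(previous_state, function):
    """Derive a cache file name from a dataset state string and a function.

    Hypothetical scheme: hash the state together with the function's compiled
    bytecode and constants, so the same function applied to the same state
    always maps to the same file name.
    """
    hasher = hashlib.md5()
    hasher.update(previous_state.encode("utf-8"))
    hasher.update(function.__code__.co_code)
    hasher.update(repr(function.__code__.co_consts).encode("utf-8"))
    return f"cache-{hasher.hexdigest()}.arrow"


def add_prefix(example):
    return {"title": "My cute title: " + example["title"]}


def add_suffix(example):
    return {"title": example["title"] + " (validation)"}


key_a = cache_key("squad-validation", add_prefix)
key_b = cache_key("squad-validation", add_prefix)
key_c = cache_key("squad-validation", add_suffix)
key_d = cache_key("squad-train", add_prefix)

print(key_a == key_b)  # True: same state and function, so the cache can be reused
print(key_a == key_c)  # False: a different function needs a new cache file
print(key_a == key_d)  # False: a different starting state needs a new cache file
```

Under a scheme like this, chained `.map()` calls each get their own file because the state after one call becomes the input state of the next.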
Your function should accept an input with the format of an item of the dataset, as in function(dataset[0]), and return a Python dict.
The columns and types of the output can differ from those of the input dict. In that case the new keys are added as additional columns in the dataset.
Each example is updated with the output dictionary: example.update(function(example)).
# Since the input example is updated with our function output,
# we can actually just return the updated 'title' field
dataset = dataset.map(lambda example: {'title': 'My cutest title: ' + example['title']})
print(dataset.unique('title'))
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-17091682b8ed78e55221838b2595bbd5.arrow 1057it [00:00, 24103.23it/s] INFO:nlp.arrow_writer:Done writing 1057 examples in 924595 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-17091682b8ed78e55221838b2595bbd5.arrow.
['My cutest title: My cute title: Super_Bowl_50', 'My cutest title: My cute title: Warsaw']
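The update semantics described above (ignore non-mapping return values, otherwise merge the returned dict into the example) can be sketched over a plain list of dicts. `toy_map` is a hypothetical stand-in that ignores caching and Arrow storage entirely.

```python
from collections.abc import Mapping


def toy_map(examples, function):
    """Apply `function` to each example and merge any returned mapping into it."""
    updated = []
    for example in examples:
        example = dict(example)          # copy so the input rows are left untouched
        output = function(example)
        if isinstance(output, Mapping):  # None and other return values are ignored
            example.update(output)
        updated.append(example)
    return updated


rows = [{"title": "Warsaw", "question": "Where?"}]

# Returning None leaves the rows as they were.
unchanged = toy_map(rows, lambda ex: None)
# Returning an existing key overwrites that field.
renamed = toy_map(rows, lambda ex: {"title": "My cute title: " + ex["title"]})
# Returning a new key adds a column.
extended = toy_map(rows, lambda ex: {"num_chars": len(ex["question"])})

print(unchanged)  # [{'title': 'Warsaw', 'question': 'Where?'}]
print(renamed)    # [{'title': 'My cute title: Warsaw', 'question': 'Where?'}]
print(extended)   # [{'title': 'Warsaw', 'question': 'Where?', 'num_chars': 6}]
```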
You can also remove columns when running map with the remove_columns=List[str] argument.
# This will select the 'title' input to send to our function (as only field in the input)
# and replace it with the output of the method as a 'new_title' field
dataset = dataset.map(lambda example: {'new_title': 'Wouhahh: ' + example['title']},
remove_columns=['title'])
print(dataset.column_names)
print(dataset.unique('new_title'))
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-9741cad18be490ab827b103119d5c732.arrow 1057it [00:00, 25135.67it/s] INFO:nlp.arrow_writer:Done writing 1057 examples in 934108 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-9741cad18be490ab827b103119d5c732.arrow.
['context', 'question', 'answers.text', 'answers.answer_start', 'new_title'] ['Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'Wouhahh: My cutest title: My cute title: Warsaw']
With with_indices=True, dataset indices (from 0 to len(dataset) - 1) will be supplied to the function, which must thus have the following signature: function(example: dict, index: int) -> dict
# This will add the index in the dataset to the 'question' field
dataset = dataset.map(lambda example, idx: {'question': f'{idx}: ' + example['question']},
with_indices=True)
print('\n'.join(dataset['question'][:5]))
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-93827205b7769e301be275a794040d51.arrow 1057it [00:00, 24952.75it/s] INFO:nlp.arrow_writer:Done writing 1057 examples in 939340 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-93827205b7769e301be275a794040d51.arrow.
0: Which NFL team represented the AFC at Super Bowl 50? 1: Which NFL team represented the NFC at Super Bowl 50? 2: Where did Super Bowl 50 take place? 3: Which NFL team won Super Bowl 50? 4: What color was used to emphasize the 50th anniversary of the Super Bowl?
.map() can also work with batches of examples (slices of the dataset).
This is particularly interesting if you have a function that can handle batches of inputs, like the tokenizers from HuggingFace's tokenizers library.
To work on batched inputs, set batched=True when calling .map() and supply a function with the following signature: function(examples: Dict[List]) -> Dict[List] or, if you use indices, function(examples: Dict[List], indices: List[int]) -> Dict[List].
Your function should accept an input with the format of a slice of the dataset, e.g. function(dataset[:10]).
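The batched calling convention can be illustrated on a plain dict of lists. This is a simplified sketch under stated assumptions: `toy_batched_map` and `count_words` are hypothetical names, and the tiny `batch_size` is chosen only to make the slicing visible.

```python
def toy_batched_map(columns, function, batch_size=2):
    """Call `function` on successive slices (dicts of lists) and concatenate the outputs."""
    num_rows = len(next(iter(columns.values())))
    result = {}
    for start in range(0, num_rows, batch_size):
        # Each batch has the same shape as a slice of the dataset: a dict of lists.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        batch.update(function(batch))
        for name, values in batch.items():
            result.setdefault(name, []).extend(values)
    return result


def count_words(batch):
    # The function receives lists and must return lists of the same length.
    return {"num_words": [len(text.split()) for text in batch["context"]]}


columns = {"context": ["a b c", "d e", "f"]}
out = toy_batched_map(columns, count_words)
print(out)  # {'context': ['a b c', 'd e', 'f'], 'num_words': [3, 2, 1]}
```

A function written this way does one call per batch instead of one call per example, which is where batch-capable tokenizers gain their speed.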
# Let's import a fast tokenizer that can work on batched inputs
# (the 'Fast' tokenizers in HuggingFace)
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
INFO:transformers.file_utils:PyTorch version 1.4.0 available. INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /Users/thomwolf/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
# Now let's batch tokenize our dataset 'context'
dataset = dataset.map(lambda example: tokenizer.batch_encode_plus(example['context']),
batched=True)
print("dataset[0]", dataset[0])
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-09ed6375515654521b963025766295d1.arrow 100%|██████████| 2/2 [00:00<00:00, 18.20it/s] INFO:nlp.arrow_writer:Done writing 1057 examples in 4811564 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-09ed6375515654521b963025766295d1.arrow.
dataset[0] {'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': '0: Which NFL team represented the AFC at Super Bowl 50?', 'answers.text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answers.answer_start': [177, 177, 177], 'new_title': 'Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'input_ids': [101, 3198, 5308, 1851, 1108, 1126, 1237, 1709, 1342, 1106, 4959, 1103, 3628, 1104, 1103, 1305, 2289, 1453, 113, 4279, 114, 1111, 1103, 1410, 1265, 119, 1109, 1237, 2289, 3047, 113, 10402, 114, 3628, 7068, 14722, 2378, 1103, 1305, 2289, 3047, 113, 24743, 114, 3628, 2938, 13598, 1572, 782, 1275, 1106, 7379, 1147, 1503, 3198, 5308, 1641, 119, 1109, 1342, 1108, 1307, 1113, 1428, 128, 117, 1446, 117, 1120, 12388, 112, 188, 3339, 1107, 1103, 1727, 2948, 2410, 3894, 1120, 3364, 10200, 117, 1756, 119, 1249, 1142, 1108, 1103, 13163, 3198, 5308, 117, 1103, 2074, 13463, 1103, 107, 5404, 5453, 107, 1114, 1672, 2284, 118, 12005, 11751, 117, 1112, 1218, 1112, 7818, 28117, 20080, 16264, 1103, 3904, 1104, 10505, 1296, 3198, 5308, 1342, 1114, 2264, 183, 15447, 16179, 113, 1223, 1134, 1103, 1342, 1156, 1138, 1151, 1227, 1112, 107, 3198, 5308, 149, 107, 114, 117, 1177, 1115, 1103, 7998, 1180, 15199, 2672, 1103, 4944, 183, 15447, 
16179, 1851, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# We have added additional columns.
# We could instead have dropped the original columns by passing their names to `remove_columns`
print(dataset.column_names)
['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask']
# Let's show a more complex processing step: the full preparation of the SQuAD dataset
# for training a model from Transformers
def convert_to_features(batch):
    # Tokenize contexts and questions (as pairs of inputs)
    # keep offset mappings for evaluation
    input_pairs = list(zip(batch['context'], batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs,
                                            pad_to_max_length=True,
                                            return_offsets_mapping=True)
    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, (text, start) in enumerate(zip(batch['answers.text'], batch['answers.answer_start'])):
        first_char = start[0]
        last_char = first_char + len(text[0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
    return encodings
dataset = dataset.map(convert_to_features, batched=True)
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b01e56f9216a5c4e04189ae568585041.arrow 100%|██████████| 2/2 [00:00<00:00, 6.16it/s] INFO:nlp.arrow_writer:Done writing 1057 examples in 21999734 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b01e56f9216a5c4e04189ae568585041.arrow.
# Our dataset now contains the labels for the start and end positions,
# as well as the offsets for converting tokens back
# into spans of the original string for evaluation
print("column_names", dataset.column_names)
print("start_positions", dataset[:5]['start_positions'])
column_names ['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'] start_positions [34, 45, 80, 34, 98]
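The label computation above relies on mapping a character position in the context to the index of the token that covers it. The sketch below shows that mapping with a plain whitespace tokenizer. It is an assumption-laden simplification: `whitespace_offsets` and this standalone `char_to_token` are hypothetical helpers, whereas the real tokenizer uses WordPiece sub-tokens and special tokens, so the actual indices differ.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    return [(match.start(), match.end()) for match in re.finditer(r"\S+", text)]


def char_to_token(offsets, char_index):
    """Index of the token whose span contains `char_index`, or None if there is none."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


context = "The champion Denver Broncos defeated the Carolina Panthers"
answer = "Denver Broncos"

# Same arithmetic as in convert_to_features: first and last character of the answer.
first_char = context.index(answer)
last_char = first_char + len(answer) - 1

offsets = whitespace_offsets(context)
start_token = char_to_token(offsets, first_char)
end_token = char_to_token(offsets, last_char)

print(first_char, last_char)   # 13 26
print(start_token, end_token)  # 2 3
```

Characters that fall on whitespace belong to no token and map to `None` in this sketch, which is one reason the last character of the answer is used rather than the position just past it.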
Now that we have all our tokenized inputs, we would like to use this dataset in a torch.utils.data.DataLoader or a tf.data.Dataset.
To be able to do this we need to tweak two things:
format the indexing (__getitem__) to return numpy/torch/tensorflow tensors instead of Python objects, and
format the indexing (__getitem__) to return only the subset of the columns that we need for our model inputs.
We don't want the columns id or title as inputs to train our model, but we may still want to keep them in the dataset, for instance for the evaluation of the model.
This is handled by the .set_format(type: Union[None, str], columns: Union[None, str, List[str]]) method, where:
type defines the return type of our dataset's __getitem__ method and is one of [None, 'numpy', 'torch', 'tensorflow'] (None means return Python objects), and
columns defines the columns returned by __getitem__ and takes the name of a column in the dataset or a list of columns to return (None means return all columns).
columns_to_return = ['input_ids', 'token_type_ids', 'attention_mask',
                     'start_positions', 'end_positions']
dataset.set_format(type='torch',
columns=columns_to_return)
# Our dataset indexing output is now ready for being used in a pytorch dataloader
print('\n'.join([' '.join((n, str(type(t)), str(t.shape))) for n, t in dataset[:10].items()]))
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch and filter ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'] columns (when key is int or slice).
input_ids <class 'torch.Tensor'> torch.Size([10, 451]) token_type_ids <class 'torch.Tensor'> torch.Size([10, 451]) attention_mask <class 'torch.Tensor'> torch.Size([10, 451]) start_positions <class 'torch.Tensor'> torch.Size([10]) end_positions <class 'torch.Tensor'> torch.Size([10])
# Note that the columns are not removed from the dataset,
# just not returned when calling __getitem__
print(dataset.column_names)
['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']
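The behaviour shown here, where formatting changes what `__getitem__` returns without touching the stored columns, can be mimicked with a small class. `FormattedTable` and its `convert` parameter are hypothetical: the sketch takes an arbitrary conversion callable (here `tuple`, standing in for a tensor constructor) instead of the library's `type` strings.

```python
class FormattedTable:
    """Toy table whose __getitem__ output can be restricted and converted."""

    def __init__(self, columns):
        self._columns = columns        # dict: column name -> list of values
        self._format_columns = None    # None means "return every column"
        self._convert = None           # None means "return plain Python objects"

    def set_format(self, convert=None, columns=None):
        # Only the output of __getitem__ is affected; the stored data is untouched.
        self._convert = convert
        self._format_columns = columns

    @property
    def column_names(self):
        return list(self._columns)

    def __getitem__(self, key):
        names = self._format_columns or self.column_names
        out = {name: self._columns[name][key] for name in names}
        if self._convert is not None:
            out = {name: self._convert(value) for name, value in out.items()}
        return out


table = FormattedTable({"title": ["A", "B"], "input_ids": [[1, 2], [3, 4]]})

table.set_format(convert=tuple, columns=["input_ids"])
print(table[0])            # {'input_ids': (1, 2)}
print(table.column_names)  # ['title', 'input_ids']

# Calling set_format() with no arguments restores the default output.
table.set_format()
print(table[0])            # {'title': 'A', 'input_ids': [1, 2]}
```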
# We can remove the formatting with `.reset_format()`
# or, identically, a call to `.set_format()` with no arguments
dataset.reset_format()
print('\n'.join([' '.join((n, str(type(t)))) for n, t in dataset[:10].items()]))
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects and filter no columns (when key is int or slice).
context <class 'list'> question <class 'list'> answers.text <class 'list'> answers.answer_start <class 'list'> new_title <class 'list'> input_ids <class 'list'> token_type_ids <class 'list'> attention_mask <class 'list'> offset_mapping <class 'list'> start_positions <class 'list'> end_positions <class 'list'>
# The current format can be checked with `.format`,
# which is a dict of the type and the formatted columns
dataset.format
{'type': 'python', 'columns': ['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']}
Let's wrap this all up with the full code to load and prepare SQuAD for training a PyTorch model.
import nlp
import torch
from transformers import BertTokenizerFast
# Load our training dataset and tokenizer
dataset = nlp.load('squad')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
# Tokenize our training dataset
def convert_to_features(example_batch):
    # Tokenize contexts and questions (as pairs of inputs)
    input_pairs = list(zip(example_batch['context'], example_batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True)
    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, answer in enumerate(example_batch['answers']):
        first_char = answer['answer_start'][0]
        last_char = first_char + len(answer['text'][0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))
    encodings.update({'start_positions': start_positions,
                      'end_positions': end_positions})
    return encodings
dataset['train'] = dataset['train'].map(convert_to_features, batched=True)
# Format our outputs to train a pytorch model
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
dataset['train'].set_format(type='torch', columns=columns)
# Instantiate a PyTorch Dataloader around our dataset
dataloader = torch.utils.data.DataLoader(dataset['train'], batch_size=8)
INFO:nlp.load:Dataset script /Users/thomwolf/.cache/huggingface/datasets/ee43d2be6898ebb9c2afefda4455306911d308bcf924d21c975796832cc7c114.e7d8881147e5da61c98918c61832c7f1c88b33b51a082c464e70e119bb24983d already found in datasets directory at /Users/thomwolf/Documents/GitHub/datasets/src/nlp/datasets/686d79c021d7dcd78da4d67fe01fbe30dfecabcd4bd02d06aa9d51edab713144/squad.py, returning it. Use `force_reload=True` to override.
INFO:nlp.builder:No config specified, defaulting to first: squad/plain_text
INFO:nlp.builder:Overwrite dataset info from restored data version.
INFO:nlp.info:Loading Dataset info from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:nlp.builder:Reusing dataset squad (/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0)
INFO:nlp.builder:Constructing Dataset for split None, from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /Users/thomwolf/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7008a010d9ded38b9f1f7e5bfe57c19a.arrow
100%|██████████| 88/88 [00:15<00:00, 5.83it/s]
INFO:nlp.arrow_writer:Done writing 87599 examples in 1114822607 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7008a010d9ded38b9f1f7e5bfe57c19a.arrow.
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch and filter ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'] columns (when key is int or slice).
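The label computation in `convert_to_features` turns the character span of each answer into a pair of token indices, and it does so on a whole batch at once (a dict of lists in, a dict of lists out). The sketch below reproduces that logic with the standard library only. It assumes a toy whitespace tokenizer in place of `BertTokenizerFast`; the helpers `whitespace_offsets`, `char_to_token` and `add_answer_positions`, as well as the column names `answer_text` and `answer_start`, are hypothetical and chosen for illustration.

```python
def whitespace_offsets(text):
    """Toy tokenizer: split on whitespace and record each token's (start, end) character span."""
    offsets, position = [], 0
    for token in text.split():
        start = text.index(token, position)
        end = start + len(token)
        offsets.append((start, end))
        position = end
    return offsets


def char_to_token(offsets, char_index):
    """Return the index of the token whose span covers char_index, or None if no token does."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


def add_answer_positions(batch):
    """Batched function: receives a dict of lists and returns the new columns as lists."""
    start_positions, end_positions = [], []
    for context, text, first_char in zip(batch['context'], batch['answer_text'], batch['answer_start']):
        offsets = whitespace_offsets(context)
        # The last character of the answer is inclusive, hence the "- 1"
        last_char = first_char + len(text) - 1
        start_positions.append(char_to_token(offsets, first_char))
        end_positions.append(char_to_token(offsets, last_char))
    return {'start_positions': start_positions, 'end_positions': end_positions}


batch = {
    'context': ['Tim lives in New York today'],
    'answer_text': ['New York'],
    'answer_start': [13],
}
positions = add_answer_positions(batch)
print(positions)
```

With a subword tokenizer the offsets are finer-grained, but the lookup is the same idea: find the token whose character span contains the first and the last character of the answer.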
# Let's load a pretrained Bert model and a simple optimizer
from transformers import BertForQuestionAnswering
model = BertForQuestionAnswering.from_pretrained('bert-base-cased')
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /Users/thomwolf/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.3d5adf10d3445c36ce131f4c6416aa62e9b58e1af56b97664773f4858a46286e
INFO:transformers.configuration_utils:Model config BertConfig { "_num_labels": 2, "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "bad_words_ids": null, "bos_token_id": null, "decoder_start_token_id": null, "do_sample": false, "early_stopping": false, "eos_token_id": null, "finetuning_task": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-12, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 512, "min_length": 0, "model_type": "bert", "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beams": 1, "num_hidden_layers": 12, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_past": true, "pad_token_id": 0, "prefix": null, "pruned_heads": {}, "repetition_penalty": 1.0, "task_specific_params": null, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "torchscript": false, "type_vocab_size": 2, "use_bfloat16": false, "vocab_size": 28996 }
INFO:transformers.modeling_utils:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /Users/thomwolf/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
INFO:transformers.modeling_utils:Weights of BertForQuestionAnswering not initialized from pretrained model: ['qa_outputs.weight', 'qa_outputs.bias']
INFO:transformers.modeling_utils:Weights from pretrained model not used in BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
# Now let's train our model
model.train()
for i, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    model.zero_grad()
    print(f'Step {i} - loss: {loss:.3}')
    if i > 3:
        break
Step 0 - loss: 6.26
KeyboardInterrupt (the cell was interrupted manually during the forward pass, at `outputs = model(**batch)`; traceback omitted)