
HuggingFace Datasets library demo¶

Quick summary:

50+ NLP datasets + super easy to add new ones (like Transformers models)
Simple and fast API to download and pre-process the datasets
Super easy to tokenize and process them in an efficient way
All datasets are memory-mapped on drive (no RAM limitation)
Smart caching on drive: process once, reuse every time

Soon: dataset streaming for huge datasets, and 100+ datasets

In [1]:

import logging
logging.basicConfig(level=logging.INFO)

In [2]:

# Let's import the library
import nlp

INFO:nlp.utils.file_utils:PyTorch version 1.4.0 available.

54 datasets are currently available (most of them are not tested yet):

aeslc
amazon_us_reviews
big_patent
billsum
blimp
c4
cfq
civil_comments
cnn_dailymail
cos_e
definite_pronoun_resolution
eraser_multi_rc
esnli
flores
forest_fires
gap
german_credit_numeric
gigaword
glue
higgs
imdb
iris
librispeech_lm
lm1b
math_dataset
movie_rationales
multi_news
multi_nli
multi_nli_mismatch
natural_questions
newsroom
opinosis
para_crawl
qa4mre
reddit_tifu
rock_you
scan
scicite
scientific_papers
snli
squad
super_glue
ted_hrlr
ted_multi
tiny_shakespeare
titanic
trivia_qa
wiki40b
wikihow
wikipedia
wmt
xnli
xsum
yelp_polarity

An example with SQuAD¶

In [3]:

# Downloading and loading a dataset is a one-liner

dataset = nlp.load('squad', split='validation[:10%]')

INFO:nlp.load:Dataset script /Users/thomwolf/.cache/huggingface/datasets/ee43d2be6898ebb9c2afefda4455306911d308bcf924d21c975796832cc7c114.e7d8881147e5da61c98918c61832c7f1c88b33b51a082c464e70e119bb24983d already found in datasets directory at /Users/thomwolf/Documents/GitHub/datasets/src/nlp/datasets/686d79c021d7dcd78da4d67fe01fbe30dfecabcd4bd02d06aa9d51edab713144/squad.py, returning it. Use `force_reload=True` to override.
INFO:nlp.builder:No config specified, defaulting to first: squad/plain_text
INFO:nlp.builder:Overwrite dataset info from restored data version.
INFO:nlp.info:Loading Dataset info from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:nlp.builder:Reusing dataset squad (/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0)
INFO:nlp.builder:Constructing Dataset for split validation[:10%], from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0

This call to nlp.load() performs the following steps under the hood:

Download the SQuAD Python processing script from our S3 and import it into the library, if it is not already stored there. You can find the SQuAD processing script here for instance.

Processing scripts are small Python scripts that define the info and format of the dataset, and contain the URL of the original SQuAD JSON files and the code to load examples from them.
Run the SQuAD Python processing script, which will:
- Download the SQuAD dataset from the original URL (see the script) if it is not already downloaded and cached.
- Process and cache all of SQuAD in a structured Arrow table for each standard split, stored on the drive.
  
  Arrow tables are arbitrarily long tables, typed with types that can be mapped to numpy/pandas/Python standard types, and they can store nested objects. They can be accessed directly from drive, loaded in RAM or even streamed over the web.
Return a dataset built from the splits requested by the user (default: all). In the above example we create a dataset with the first 10% of the validation split.
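
To make the split syntax concrete, here is a minimal sketch of how a percent slice such as `validation[:10%]` could translate into a row count. The helper `rows_for_percent_slice` is hypothetical and not part of the library, and the rounding rule (integer floor) is an assumption; only the split sizes come from the SQuAD dataset info.

```python
import re

def rows_for_percent_slice(spec: str, split_sizes: dict) -> int:
    """Return the row count selected by a spec like 'validation[:10%]'.

    Hypothetical helper: supports only the '[:N%]' form and floors the result.
    """
    match = re.fullmatch(r"(\w+)\[:(\d+)%\]", spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    name, percent = match.group(1), int(match.group(2))
    return split_sizes[name] * percent // 100

# Split sizes as reported for SQuAD (train / validation).
squad_sizes = {"train": 87599, "validation": 10570}
n_rows = rows_for_percent_slice("validation[:10%]", squad_sizes)
print(n_rows)
```

Under these assumptions the result agrees with the 1057 rows reported for the loaded dataset.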

In [4]:

# General information on the dataset is provided in the `.info` property
print(dataset.info)

DatasetInfo(
        name='squad',
        version=1.0.0,
        description='Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
',
        homepage='https://rajpurkar.github.io/SQuAD-explorer/',
        features=struct<id: string, title: string, context: string, question: string, answers: struct<text: list<item: string>, answer_start: list<item: int32>>>,
        total_num_examples=98169,
        splits={
        'train': 87599,
        'validation': 10570,
    },
        supervised_keys=None,
        citation="""@article{2016arXiv160605250R,
                 author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                                     Konstantin and {Liang}, Percy},
                    title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
                journal = {arXiv e-prints},
                     year = 2016,
                        eid = {arXiv:1606.05250},
                    pages = {arXiv:1606.05250},
    archivePrefix = {arXiv},
                 eprint = {1606.05250},
    }""",
        license=None,
)

Inspecting the dataset: elements, slices and columns¶

The returned Dataset object is a memory-mapped dataset that behaves similarly to a normal map-style dataset. It is backed by an Apache Arrow table, which enables many interesting features.
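
As a rough illustration of what "memory-mapped" means, the sketch below reads one record out of a file on disk without loading the whole file into RAM. It uses Python's standard `mmap` module as a stand-in; the library itself relies on Apache Arrow, and the file layout here (fixed-width int32 records) is purely an assumption for the example.

```python
import mmap
import os
import struct
import tempfile

# Write 1000 fixed-width records (one little-endian int32 each) to a file.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    for value in range(1000):
        f.write(struct.pack("<i", value * 2))

def read_record(mapped: mmap.mmap, index: int) -> int:
    """Decode record `index` straight from the mapping (4 bytes per record)."""
    return struct.unpack_from("<i", mapped, index * 4)[0]

with open(path, "rb") as f:
    # The OS pages data in on demand; nothing is copied until we touch it.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        num_records = len(mapped) // 4
        record_500 = read_record(mapped, 500)

print(num_records, record_500)
```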

In [5]:

print(dataset)

Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 1057)

You can query its length and get items or slices as you normally would with a Python sequence.

In [6]:

from pprint import pprint

print(f"Dataset len(dataset): {len(dataset)}")
print("First item:")
pprint(dataset[0])
print("Slice of the first two items:")
pprint(dataset[:2])

Dataset len(dataset): 1057
First item:
{'answers': {'answer_start': [177, 177, 177],
             'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the '
            'champion of the National Football League (NFL) for the 2015 '
            'season. The American Football Conference (AFC) champion Denver '
            'Broncos defeated the National Football Conference (NFC) champion '
            'Carolina Panthers 24–10 to earn their third Super Bowl title. The '
            "game was played on February 7, 2016, at Levi's Stadium in the San "
            'Francisco Bay Area at Santa Clara, California. As this was the '
            '50th Super Bowl, the league emphasized the "golden anniversary" '
            'with various gold-themed initiatives, as well as temporarily '
            'suspending the tradition of naming each Super Bowl game with '
            'Roman numerals (under which the game would have been known as '
            '"Super Bowl L"), so that the logo could prominently feature the '
            'Arabic numerals 50.',
 'id': '56be4db0acb8001400a502ec',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'title': 'Super_Bowl_50'}
Slice of the first two items:
{'answers': [{'answer_start': [177, 177, 177],
              'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
             {'answer_start': [249, 249, 249],
              'text': ['Carolina Panthers',
                       'Carolina Panthers',
                       'Carolina Panthers']}],
 'context': ['Super Bowl 50 was an American football game to determine the '
             'champion of the National Football League (NFL) for the 2015 '
             'season. The American Football Conference (AFC) champion Denver '
             'Broncos defeated the National Football Conference (NFC) champion '
             'Carolina Panthers 24–10 to earn their third Super Bowl title. '
             "The game was played on February 7, 2016, at Levi's Stadium in "
             'the San Francisco Bay Area at Santa Clara, California. As this '
             'was the 50th Super Bowl, the league emphasized the "golden '
             'anniversary" with various gold-themed initiatives, as well as '
             'temporarily suspending the tradition of naming each Super Bowl '
             'game with Roman numerals (under which the game would have been '
             'known as "Super Bowl L"), so that the logo could prominently '
             'feature the Arabic numerals 50.',
             'Super Bowl 50 was an American football game to determine the '
             'champion of the National Football League (NFL) for the 2015 '
             'season. The American Football Conference (AFC) champion Denver '
             'Broncos defeated the National Football Conference (NFC) champion '
             'Carolina Panthers 24–10 to earn their third Super Bowl title. '
             "The game was played on February 7, 2016, at Levi's Stadium in "
             'the San Francisco Bay Area at Santa Clara, California. As this '
             'was the 50th Super Bowl, the league emphasized the "golden '
             'anniversary" with various gold-themed initiatives, as well as '
             'temporarily suspending the tradition of naming each Super Bowl '
             'game with Roman numerals (under which the game would have been '
             'known as "Super Bowl L"), so that the logo could prominently '
             'feature the Arabic numerals 50.'],
 'id': ['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed'],
 'question': ['Which NFL team represented the AFC at Super Bowl 50?',
              'Which NFL team represented the NFC at Super Bowl 50?'],
 'title': ['Super_Bowl_50', 'Super_Bowl_50']}

You can get a full column of the dataset by indexing with its name as a string:

In [7]:

print(dataset['question'][:10])

['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?', 'Where did Super Bowl 50 take place?', 'Which NFL team won Super Bowl 50?', 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'What was the theme of Super Bowl 50?', 'What day was the game played on?', 'What is the AFC short for?', 'What was the theme of Super Bowl 50?', 'What does AFC stand for?']

Items are returned as a dict of elements.

Slices are returned as a dict of lists of elements.

Columns are returned as a list.

You can therefore apply row indexing (index or slice) and column indexing in either order with identical results:

In [8]:

print(dataset[0]['question'] == dataset['question'][0])
print(dataset[10:20]['context'] == dataset['context'][10:20])

True
True
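
The indexing behaviour shown above can be mimicked with a toy column-oriented table. This is only a sketch in plain Python: `ColumnTable` is a hypothetical class, not the library's Dataset, and it stores columns as ordinary lists rather than Arrow arrays.

```python
class ColumnTable:
    """Toy column-oriented table: a dict mapping column names to equal-length lists."""

    def __init__(self, columns: dict):
        self.columns = columns

    def __len__(self) -> int:
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name: return the full list of values.
            return self.columns[key]
        # int: dict of elements; slice: dict of lists.
        return {name: values[key] for name, values in self.columns.items()}

table = ColumnTable({
    "question": ["Q0", "Q1", "Q2", "Q3"],
    "title": ["A", "A", "B", "B"],
})

print(table[0])
print(table[1:3])
print(table["question"])
```

Because every access ends up indexing the same underlying lists, `table[0]["question"]` and `table["question"][0]` necessarily agree.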

In [9]:

# The underlying table is typed (int/float/strings/lists/dict) and structured 
print(dataset.column_names)
print(dataset.schema)

['id', 'title', 'context', 'question', 'answers']
id: string
title: string
context: string
question: string
answers: struct<text: list<item: string>, answer_start: list<item: int32>>
  child 0, text: list<item: string>
      child 0, item: string
  child 1, answer_start: list<item: int32>
      child 0, item: int32

Additional misc properties¶

In [10]:

# Datasets also have a bunch of properties you can access
print("The number of bytes allocated on the drive is ", dataset.nbytes)
print("For comparison, here is the number of bytes allocated in memory which can be")
print("accessed with `nlp.total_allocated_bytes()`: ", nlp.total_allocated_bytes())
print("The number of rows", dataset.num_rows)
print("The number of columns", dataset.num_columns)
print("The shape (rows, columns)", dataset.shape)

The number of bytes allocated on the drive is  10472672
For comparison, here is the number of bytes allocated in memory which can be
accessed with `nlp.total_allocated_bytes()`:  0
The number of rows 1057
The number of columns 5
The shape (rows, columns) (1057, 5)

Additional misc methods¶

In [11]:

# We can list the unique elements in a column. This is done by the backend (so fast!)
print(dataset.unique('title'))

['Super_Bowl_50', 'Warsaw']

In [12]:

# This will drop the column 'id'
dataset.drop('id')
print(dataset.column_names)

['title', 'context', 'question', 'answers']

In [13]:

# This will flatten the nested columns in 'answers'
dataset.flatten()
print(dataset.column_names)

['title', 'context', 'question', 'answers.text', 'answers.answer_start']
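
The effect of flattening on column names can be sketched in plain Python. The function `flatten_columns` below is a hypothetical illustration working on dicts of lists; it handles a single level of nesting and is not how the library operates on Arrow tables.

```python
def flatten_columns(columns: dict) -> dict:
    """Replace each dict-valued column by one column per sub-field ('parent.child')."""
    flat = {}
    for name, values in columns.items():
        if values and all(isinstance(v, dict) for v in values):
            for sub in values[0]:
                flat[f"{name}.{sub}"] = [v[sub] for v in values]
        else:
            flat[name] = values
    return flat

nested = {
    "title": ["Super_Bowl_50"],
    "answers": [{"text": ["Denver Broncos"], "answer_start": [177]}],
}
flat = flatten_columns(nested)
print(list(flat))
```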

In [14]:

# We can also "dictionary encode" a column if many of its elements are identical
# This will reduce its size by storing each distinct element (e.g. string) only once
# It only affects the internal storage (no difference from a user point of view)
dataset.dictionary_encode_column('title')
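
For readers unfamiliar with dictionary encoding, here is a minimal sketch of the idea in plain Python: each distinct value is stored once, and the column becomes a list of small integer indices. The function name `dictionary_encode` is hypothetical, and the real encoding is performed by the Arrow backend.

```python
def dictionary_encode(values: list):
    """Return (dictionary, indices) such that dictionary[indices[i]] == values[i]."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

titles = ["Super_Bowl_50"] * 3 + ["Warsaw"] * 2
dictionary, indices = dictionary_encode(titles)
decoded = [dictionary[i] for i in indices]   # decoding restores the original column
print(dictionary, indices)
```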

Cache files¶

You can check the current cache files backing the dataset with the .cache_files property.

In [15]:

dataset.cache_files

Out[15]:

({'filename': '/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/squad-validation.arrow',
  'skip': 0,
  'take': 1057},)

You can clean up the cache files in the current dataset directory with the .cleanup_cache_files() method.

Be careful that no other process is using these cache files when running this command.

In [16]:

dataset.cleanup_cache_files()  # Returns the number of removed cache files

INFO:nlp.arrow_dataset:Listing files in /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-2b0c4368cd1b9d9ab7dd158754adb501.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-fef84cefe794447d6dc0b28596974c80.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b9d042be98ac7ed20cb12b2e9d65d208.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-0d81cced63f868bf1a233bffb4c94b85.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-fdd554f8e6ee8230941052eceac92e0f.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-2d5d9f6d0f564bbd27c91aee95cfc0dc.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-79ea07cbbe2ddf3afe1d0c6ac0269cc3.arrow

Out[16]:

Modifying the dataset with `dataset.map`¶

There is a powerful method, .map(), that you can use to apply a function to each example, independently or in batches.

In [17]:

# `.map()` takes a callable accepting a dict as argument
# (same dict as returned by dataset[i])
# and iterates over the dataset, calling the function with each example.

# Let's print the length of each `context` string in our subset of the dataset
# (10% of the validation i.e. 1057 examples)

dataset.map(lambda example: print(len(example['context']), end=','))

1057it [00:00, 10624.60it/s]

775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,179,179,179,179,179,179,179,179,179,179,179,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,704,704,704,704,704,704,704,704,704,704,704,704,704,704,353,353,353,353,353,353,353,353,353,353,353,353,353,353,353,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,306,306,306,306,306,306,306,306,306,306,306,306,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,496,496,496,496,496,496,496,496,496,496,496,496,496,496,496,260,260,260,260,260,260,260,260,260,874,874,874,874,874,874,874,874,874,874,874,874,874,874,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,536,536,536,536,536,536,536,536,536,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,495,495,495,495,495,495,495
,495,495,495,495,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,441,441,441,441,441,441,441,441,441,441,441,357,357,357,357,357,357,357,357,357,296,296,296,296,296,296,296,296,296,296,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,804,804,804,804,804,804,804,804,804,804,804,397,397,397,397,397,397,397,397,397,397,397,397,397,397,360,360,360,360,360,360,360,973,973,973,973,973,973,973,973,973,973,973,973,973,973,263,263,263,263,263,263,263,263,263,263,263,568,568,568,568,568,568,568,568,568,568,568,264,264,264,264,264,264,264,264,264,264,264,264,264,264,264,892,892,892,892,892,892,892,892,892,892,892,206,206,206,206,206,489,489,489,489,489,489,489,489,489,489,489,489,489,181,181,181,181,181,181,181,181,181,181,181,181,531,531,531,531,531,531,531,531,531,531,531,531,664,664,664,664,664,664,664,664,664,664,664,664,664,664,672,672,672,672,672,672,672,672,672,672,672,672,672,672,858,858,858,858,858,858,858,858,858,858,858,858,634,634,634,634,634,634,634,634,634,634,634,634,634,634,891,891,891,891,891,891,891,891,891,891,891,891,891,488,488,488,488,488,488,488,488,488,488,488,488,942,942,942,942,942,942,942,942,942,942,942,942,942,942,942,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,522,522,522,522,522,1643,1643,1643,1643,1643,628,628,628,628,628,758,758,758,758,758,883,883,883,883,883,559,559,559,559,559,603,603,603,603,631,631,631,631,631,626,626,626,626,626,541,541,541,541,541,795,795,795,795,795,591,591,591,591,591,568,568,568,568,568,536,536,536,536,536,575,575,575,575,575,571,571,571,571,571,641,641,641,641,641,665,665,665,665,665,1088,1088,1088,1088,1088,1619,1619,1619,1619,1619,939,939,939,939,939,865,865,865,865,865,711,711,711,711,711,831,831,831,831,831,501,501,501,501,501,676,676,676,676,676,854,854,854,854,854,784,784,784,784,784,641,641,641,641,641,544,544,544,544,544,918,918,918,918,918,763,763,763,763,763,906
,906,906,906,906,632,632,632,632,632,869,869,869,869,869,1044,1044,1044,1044,1044,760,760,760,760,760,715,715,715,715,715,838,838,838,838,838,881,881,881,881,881,940,940,940,940,940,618,618,618,618,618,1205,1205,1205,534,534,534,534,534,757,757,757,757,757,1239,1239,1239,1239,1239,609,609,609,609,609,798,798,798,798,798,613,613,613,613,613,613,613,613,613,613,

Out[17]:

Dataset(schema: {'title': 'string', 'context': 'string', 'question': 'string', 'answers.text': 'list<item: string>', 'answers.answer_start': 'list<item: int32>'}, num_rows: 1057)

This is basically the same as doing

for example in dataset:
    function(example)

The above example had no effect on the dataset because the function supplied to .map() didn't return a dict or an abc.Mapping that could be used to update the examples in the dataset. In that case, .map() just returns the same dataset (self).

Now let's see how to use a function that can modify the dataset.

Modifying the dataset example by example¶

The main interest of .map() is to update and modify the content of the table.

To use .map() to update elements in the table you should provide a function with the following signature: function(example: dict) -> dict.

In [18]:

# Let's add a prefix 'My cute title: ' to each of our titles

def add_prefix_to_title(example):
    example['title'] = 'My cute title: ' + example['title']
    return example

dataset = dataset.map(add_prefix_to_title)

print(dataset.unique('title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-ed8b1249a765df5c159965379e685e44.arrow
1057it [00:00, 21208.28it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 906626 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-ed8b1249a765df5c159965379e685e44.arrow.

['My cute title: Super_Bowl_50', 'My cute title: Warsaw']

This call to .map() computes and returns the updated table. It also stores the updated table in a cache file indexed by the current state and the mapped function. A subsequent call to .map() (even in another Python session) will reuse the cached file instead of recomputing the operation (this caching may not work in Jupyter notebooks yet).

The returned updated dataset is (again) directly memory mapped from drive and not allocated in RAM.
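
The caching idea can be sketched as follows. Everything here is an assumption made for illustration: the way the key is computed (hashing a state string with the function's bytecode and constants), the JSON file format, and the helper names `fingerprint` and `cached_map`. The library stores Arrow files and may derive its cache key differently.

```python
import hashlib
import json
import os
import tempfile

def fingerprint(state: str, fn) -> str:
    """Hash the previous state together with the function's bytecode and constants."""
    code = fn.__code__
    payload = state.encode() + code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

def cached_map(rows: list, fn, state: str, cache_dir: str):
    """Apply `fn` to every row, or reload the result if this (state, fn) pair was seen."""
    key = fingerprint(state, fn)
    path = os.path.join(cache_dir, f"cache-{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), key
    result = [{**row, **fn(row)} for row in rows]
    with open(path, "w") as f:
        json.dump(result, f)
    return result, key

calls = []

def add_title_length(row):
    calls.append(row["title"])          # record each real invocation
    return {"title_length": len(row["title"])}

rows = [{"title": "Super_Bowl_50"}, {"title": "Warsaw"}]
cache_dir = tempfile.mkdtemp()
first, key = cached_map(rows, add_title_length, "initial", cache_dir)
second, _ = cached_map(rows, add_title_length, "initial", cache_dir)
print(len(calls), first == second)
```

The second call finds the cache file and never invokes the function, so `calls` only records the two rows processed the first time.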

Your function should accept an input with the format of an item of the dataset, i.e. function(dataset[0]), and return a Python dict.

The columns and types of the output can differ from those of the input dict. In this case the new keys will be added as additional columns in the dataset.

Each example is updated with the output dictionary: example.update(function(example)).
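
The update semantics described above can be sketched in plain Python on a list of dicts. `map_examples` is a hypothetical helper, not the library's implementation; the optional index argument mirrors the `with_indices` option of .map().

```python
def map_examples(rows: list, fn, with_indices: bool = False) -> list:
    """Return new rows where each example is updated with the dict returned by `fn`."""
    output = []
    for index, example in enumerate(rows):
        result = fn(example, index) if with_indices else fn(example)
        updated = dict(example)            # copy so the input rows stay untouched
        if isinstance(result, dict):
            updated.update(result)         # existing keys are overwritten, new keys are added
        output.append(updated)
    return output

rows = [{"title": "Warsaw", "question": "Q0"}]
unchanged = map_examples(rows, lambda ex: None)
retitled = map_examples(rows, lambda ex: {"title": "My cute title: " + ex["title"]})
extended = map_examples(rows, lambda ex, i: {"position": i}, with_indices=True)
print(retitled, extended)
```

A function that returns nothing leaves the rows as they were, one that returns an existing key overwrites it, and one that returns a new key adds a column.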

In [19]:

# Since the input example is updated with our function output,
# we can actually just return the updated 'title' field
dataset = dataset.map(lambda example: {'title': 'My cutest title: ' + example['title']})

print(dataset.unique('title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-17091682b8ed78e55221838b2595bbd5.arrow
1057it [00:00, 24103.23it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 924595 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-17091682b8ed78e55221838b2595bbd5.arrow.

['My cutest title: My cute title: Super_Bowl_50', 'My cutest title: My cute title: Warsaw']

Removing columns¶

You can also remove columns when running map with the remove_columns=List[str] argument.

In [20]:

# This will compute a 'new_title' field from the 'title' field of each example
# and remove the original 'title' column from the dataset
dataset = dataset.map(lambda example: {'new_title': 'Wouhahh: ' + example['title']},
                     remove_columns=['title'])

print(dataset.column_names)
print(dataset.unique('new_title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-9741cad18be490ab827b103119d5c732.arrow
1057it [00:00, 25135.67it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 934108 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-9741cad18be490ab827b103119d5c732.arrow.

['context', 'question', 'answers.text', 'answers.answer_start', 'new_title']
['Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'Wouhahh: My cutest title: My cute title: Warsaw']

Using examples indices¶

With with_indices=True, the dataset indices (from 0 to len(dataset) - 1) will be supplied to the function, which must therefore have the following signature: function(example: dict, index: int) -> dict

In [21]:

# This will add the index in the dataset to the 'question' field
dataset = dataset.map(lambda example, idx: {'question': f'{idx}: ' + example['question']},
                      with_indices=True)

print('\n'.join(dataset['question'][:5]))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-93827205b7769e301be275a794040d51.arrow
1057it [00:00, 24952.75it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 939340 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-93827205b7769e301be275a794040d51.arrow.

0: Which NFL team represented the AFC at Super Bowl 50?
1: Which NFL team represented the NFC at Super Bowl 50?
2: Where did Super Bowl 50 take place?
3: Which NFL team won Super Bowl 50?
4: What color was used to emphasize the 50th anniversary of the Super Bowl?

Modifying the dataset with batched updates¶

.map() can also work with batches of examples (slices of the dataset).

This is particularly interesting if you have a function that can handle batches of inputs, like the tokenizers from the HuggingFace tokenizers library.

To work on batched inputs, set batched=True when calling .map() and supply a function with the following signature: function(examples: Dict[List]) -> Dict[List] or, if you use indices, function(examples: Dict[List], indices: List[int]) -> Dict[List].

Your function should accept an input with the format of a slice of the dataset, e.g. function(dataset[:10]).
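
Here is a minimal sketch of what batched mapping amounts to, written in plain Python on a dict of lists. The helper `map_batched` and its `batch_size` default are assumptions made for illustration and do not reproduce the library's internals.

```python
def map_batched(columns: dict, fn, batch_size: int = 1000) -> dict:
    """Call `fn` on successive slices (dict of lists) and merge the returned columns."""
    num_rows = len(next(iter(columns.values())))
    output = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        updated = dict(batch)
        updated.update(fn(batch))          # returned lists replace or extend the batch columns
        for name, values in updated.items():
            output.setdefault(name, []).extend(values)
    return output

def count_words(batch: dict) -> dict:
    # Receives a dict of lists and returns a dict of lists of the same length.
    return {"num_words": [len(text.split()) for text in batch["context"]]}

columns = {"context": ["Super Bowl 50 was a game", "Warsaw is a city", "One more"]}
result = map_batched(columns, count_words, batch_size=2)
print(result["num_words"])
```

With `batch_size=2`, the function is called twice (on two rows, then on one), and the outputs are concatenated back into full-length columns.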

In [22]:

# Let's import a fast tokenizer that can work on batched inputs
# (the 'Fast' tokenizers in HuggingFace)
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

INFO:transformers.file_utils:PyTorch version 1.4.0 available.
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /Users/thomwolf/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1

In [23]:

# Now let's batch tokenize our dataset 'context'
dataset = dataset.map(lambda example: tokenizer.batch_encode_plus(example['context']),
                      batched=True)

print("dataset[0]", dataset[0])

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-09ed6375515654521b963025766295d1.arrow
100%|██████████| 2/2 [00:00<00:00, 18.20it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 4811564 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-09ed6375515654521b963025766295d1.arrow.

dataset[0] {'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': '0: Which NFL team represented the AFC at Super Bowl 50?', 'answers.text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answers.answer_start': [177, 177, 177], 'new_title': 'Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'input_ids': [101, 3198, 5308, 1851, 1108, 1126, 1237, 1709, 1342, 1106, 4959, 1103, 3628, 1104, 1103, 1305, 2289, 1453, 113, 4279, 114, 1111, 1103, 1410, 1265, 119, 1109, 1237, 2289, 3047, 113, 10402, 114, 3628, 7068, 14722, 2378, 1103, 1305, 2289, 3047, 113, 24743, 114, 3628, 2938, 13598, 1572, 782, 1275, 1106, 7379, 1147, 1503, 3198, 5308, 1641, 119, 1109, 1342, 1108, 1307, 1113, 1428, 128, 117, 1446, 117, 1120, 12388, 112, 188, 3339, 1107, 1103, 1727, 2948, 2410, 3894, 1120, 3364, 10200, 117, 1756, 119, 1249, 1142, 1108, 1103, 13163, 3198, 5308, 117, 1103, 2074, 13463, 1103, 107, 5404, 5453, 107, 1114, 1672, 2284, 118, 12005, 11751, 117, 1112, 1218, 1112, 7818, 28117, 20080, 16264, 1103, 3904, 1104, 10505, 1296, 3198, 5308, 1342, 1114, 2264, 183, 15447, 16179, 113, 1223, 1134, 1103, 1342, 1156, 1138, 1151, 1227, 1112, 107, 3198, 5308, 149, 107, 114, 117, 1177, 1115, 1103, 7998, 1180, 15199, 2672, 1103, 4944, 183, 15447, 
16179, 1851, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [24]:

# We have added additional columns.
# We could also have dropped the original columns by listing them in `remove_columns`
print(dataset.column_names)

['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask']

In [25]:

# Let's show a more complex processing step: the full preparation of the SQuAD dataset
# for training a model from Transformers
def convert_to_features(batch):
    # Tokenize contexts and questions (as pairs of inputs)
    # keep offset mappings for evaluation
    input_pairs = list(zip(batch['context'], batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs,
                                            pad_to_max_length=True,
                                            return_offsets_mapping=True)

    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, (text, start) in enumerate(zip(batch['answers.text'], batch['answers.answer_start'])):
        first_char = start[0]
        last_char = first_char + len(text[0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
    return encodings

dataset = dataset.map(convert_to_features, batched=True)

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b01e56f9216a5c4e04189ae568585041.arrow
100%|██████████| 2/2 [00:00<00:00,  6.16it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 21999734 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b01e56f9216a5c4e04189ae568585041.arrow.

In [26]:

# Our dataset now comprises the labels for the start and end positions
# as well as the offsets for converting tokens back
# into spans of the original string for evaluation
print("column_names", dataset.column_names)
print("start_positions", dataset[:5]['start_positions'])

column_names ['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']
start_positions [34, 45, 80, 34, 98]
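
The label computation in `convert_to_features` relies on mapping a character position in the context to the index of the token that covers it. The sketch below shows that idea with a whitespace tokenizer and explicit offsets; the helpers `tokenize_with_offsets` and `char_to_token` are hypothetical stand-ins written for this example, whereas the notebook uses the method provided by the fast tokenizer's encodings.

```python
import re

def tokenize_with_offsets(text: str) -> list:
    """Whitespace tokenizer returning (start, end) character offsets for each token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_to_token(offsets: list, char_index: int):
    """Return the index of the token whose span contains `char_index`, or None."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None

context = "The champion Denver Broncos defeated the Carolina Panthers"
answer = "Denver Broncos"
first_char = context.index(answer)
last_char = first_char + len(answer) - 1   # inclusive index of the last answer character

offsets = tokenize_with_offsets(context)
start_position = char_to_token(offsets, first_char)
end_position = char_to_token(offsets, last_char)
print(start_position, end_position)
```

A character that falls between tokens (a space, in this toy tokenizer) maps to no token, which is why the sketch returns None in that case.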

Formatting outputs for numpy/torch/tensorflow¶

Now that we have all our tokenized inputs, we would like to use this dataset in a torch.utils.data.DataLoader or a tf.data.Dataset.

To be able to do this we need to tweak two things:

format the indexing (__getitem__) to return numpy/torch/tensorflow tensors, instead of python objects, and
format the indexing (__getitem__) to return only the subset of the columns that we need for our model inputs.

We don't want the columns id or title as input sto train our model, but we could still want to keep them in the dataset, for instance for the evaluation of the model.

This is handled by the .set_format(type: Union[None, str], columns: Union[None, str, List[str]]) method, where:

type defines the return type of our dataset's __getitem__ method and is one of [None, 'numpy', 'torch', 'tensorflow'] (None means return python objects), and
columns defines the columns returned by __getitem__ and takes the name of a column in the dataset or a list of columns to return (None means return all columns).
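
The idea behind the columns argument can be sketched with a small toy class. This is not the library's implementation: `ToyDataset` is a hypothetical stand-in that only mimics the column filtering (it ignores the type conversion to numpy/torch/tensorflow) to show that the format changes what __getitem__ returns without touching the stored data.

```python
# Toy stand-in, not the library's implementation: columns are stored once,
# and the "format" only changes what __getitem__ hands back.
class ToyDataset:
    def __init__(self, columns):
        self._data = columns          # dict: column name -> list of values
        self._format_columns = None   # None means "return all columns"

    @property
    def column_names(self):
        return list(self._data)

    def set_format(self, columns=None):
        if isinstance(columns, str):  # a single column name is also accepted
            columns = [columns]
        self._format_columns = columns

    def reset_format(self):
        # Same effect as calling set_format() with no arguments
        self.set_format()

    def __getitem__(self, key):
        if self._format_columns is None:
            names = self.column_names
        else:
            names = self._format_columns
        return {name: self._data[name][key] for name in names}

toy = ToyDataset({
    'id': ['a', 'b'],
    'input_ids': [[101, 7, 102], [101, 9, 102]],
    'start_positions': [1, 1],
})

toy.set_format(columns=['input_ids', 'start_positions'])
filtered = toy[0]         # only the selected columns come back
kept = toy.column_names   # the 'id' column is still stored

toy.reset_format()
full = toy[0]             # all columns are returned again

print(filtered)
print(kept)
print(full)
```

Keeping the filter in the view rather than dropping columns is what lets the unused columns stay available later, for instance at evaluation time.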

In [27]:

columns_to_return = ['input_ids', 'token_type_ids', 'attention_mask',
                     'start_positions', 'end_positions']

dataset.set_format(type='torch',
                   columns=columns_to_return)

# Our dataset indexing output is now ready to be used in a PyTorch DataLoader
print('\n'.join([' '.join((n, str(type(t)), str(t.shape))) for n, t in dataset[:10].items()]))

INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch and filter ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'] columns  (when key is int or slice).

input_ids <class 'torch.Tensor'> torch.Size([10, 451])
token_type_ids <class 'torch.Tensor'> torch.Size([10, 451])
attention_mask <class 'torch.Tensor'> torch.Size([10, 451])
start_positions <class 'torch.Tensor'> torch.Size([10])
end_positions <class 'torch.Tensor'> torch.Size([10])

In [28]:

# Note that the columns are not removed from the dataset,
# just not returned when calling __getitem__
print(dataset.column_names)

['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']

In [29]:

# We can remove the formatting with `.reset_format()`
# or, identically, a call to `.set_format()` with no arguments
dataset.reset_format()

print('\n'.join([' '.join((n, str(type(t)))) for n, t in dataset[:10].items()]))

INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects and filter no columns  (when key is int or slice).

context <class 'list'>
question <class 'list'>
answers.text <class 'list'>
answers.answer_start <class 'list'>
new_title <class 'list'>
input_ids <class 'list'>
token_type_ids <class 'list'>
attention_mask <class 'list'>
offset_mapping <class 'list'>
start_positions <class 'list'>
end_positions <class 'list'>

In [30]:

# The current format can be checked with `.format`,
# which is a dict of the type and formatting
dataset.format

Out[30]:

{'type': 'python',
 'columns': ['context',
  'question',
  'answers.text',
  'answers.answer_start',
  'new_title',
  'input_ids',
  'token_type_ids',
  'attention_mask',
  'offset_mapping',
  'start_positions',
  'end_positions']}

Wrapping this all up¶

Let's wrap this all up with the full code to load and prepare SQuAD for training a PyTorch model.

In [31]:

import nlp
import torch
from transformers import BertTokenizerFast

# Load our training dataset and tokenizer
dataset = nlp.load('squad')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# Tokenize our training dataset
def convert_to_features(example_batch):
    # Tokenize contexts and questions (as pairs of inputs)
    input_pairs = list(zip(example_batch['context'], example_batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True)

    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, answer in enumerate(example_batch['answers']):
        first_char = answer['answer_start'][0]
        last_char = first_char + len(answer['text'][0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))

    encodings.update({'start_positions': start_positions,
                      'end_positions': end_positions})
    return encodings

dataset['train'] = dataset['train'].map(convert_to_features, batched=True)

# Format our outputs to train a PyTorch model
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
dataset['train'].set_format(type='torch', columns=columns)

# Instantiate a PyTorch DataLoader around our dataset
dataloader = torch.utils.data.DataLoader(dataset['train'], batch_size=8)

INFO:nlp.load:Dataset script /Users/thomwolf/.cache/huggingface/datasets/ee43d2be6898ebb9c2afefda4455306911d308bcf924d21c975796832cc7c114.e7d8881147e5da61c98918c61832c7f1c88b33b51a082c464e70e119bb24983d already found in datasets directory at /Users/thomwolf/Documents/GitHub/datasets/src/nlp/datasets/686d79c021d7dcd78da4d67fe01fbe30dfecabcd4bd02d06aa9d51edab713144/squad.py, returning it. Use `force_reload=True` to override.
INFO:nlp.builder:No config specified, defaulting to first: squad/plain_text
INFO:nlp.builder:Overwrite dataset info from restored data version.
INFO:nlp.info:Loading Dataset info from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:nlp.builder:Reusing dataset squad (/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0)
INFO:nlp.builder:Constructing Dataset for split None, from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /Users/thomwolf/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7008a010d9ded38b9f1f7e5bfe57c19a.arrow
100%|██████████| 88/88 [00:15<00:00,  5.83it/s]
INFO:nlp.arrow_writer:Done writing 87599 examples in 1114822607 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7008a010d9ded38b9f1f7e5bfe57c19a.arrow.
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch and filter ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'] columns  (when key is int or slice).

In [32]:

# Let's load a pretrained BERT model and a simple optimizer
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-cased')
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /Users/thomwolf/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.3d5adf10d3445c36ce131f4c6416aa62e9b58e1af56b97664773f4858a46286e
INFO:transformers.configuration_utils:Model config BertConfig {
  "_num_labels": 2,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bad_words_ids": null,
  "bos_token_id": null,
  "decoder_start_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "min_length": 0,
  "model_type": "bert",
  "no_repeat_ngram_size": 0,
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 0,
  "prefix": null,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "task_specific_params": null,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 28996
}

INFO:transformers.modeling_utils:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /Users/thomwolf/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
INFO:transformers.modeling_utils:Weights of BertForQuestionAnswering not initialized from pretrained model: ['qa_outputs.weight', 'qa_outputs.bias']
INFO:transformers.modeling_utils:Weights from pretrained model not used in BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']

In [33]:

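# Now let's train our model for a few steps

model.train()
for i, batch in enumerate(dataloader):
    outputs = model(**batch)  # batch keys match the model's keyword arguments
    loss = outputs[0]  # the loss comes first when start/end positions are given
    loss.backward()
    optimizer.step()
    model.zero_grad()  # clear the gradients before the next step
    print(f'Step {i} - loss: {loss:.3}')
    if i > 3:
        break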

Step 0 - loss: 6.26

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-33-fd6d56e12581> in <module>
      3 model.train()
      4 for i, batch in enumerate(dataloader):
----> 5     outputs = model(**batch)
      6     loss = outputs[0]
      7     loss.backward()

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/Documents/GitHub/transformers/src/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, start_positions, end_positions)
   1478             position_ids=position_ids,
   1479             head_mask=head_mask,
-> 1480             inputs_embeds=inputs_embeds,
   1481         )
   1482 

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/Documents/GitHub/transformers/src/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask)
    788             head_mask=head_mask,
    789             encoder_hidden_states=encoder_hidden_states,
--> 790             encoder_attention_mask=encoder_extended_attention_mask,
    791         )
    792         sequence_output = encoder_outputs[0]

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/Documents/GitHub/transformers/src/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
    405 
    406             layer_outputs = layer_module(
--> 407                 hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
    408             )
    409             hidden_states = layer_outputs[0]

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/Documents/GitHub/transformers/src/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
    366         encoder_attention_mask=None,
    367     ):
--> 368         self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
    369         attention_output = self_attention_outputs[0]
    370         outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/Documents/GitHub/transformers/src/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
    312     ):
    313         self_outputs = self.self(
--> 314             hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask
    315         )
    316         attention_output = self.output(self_outputs[0], hidden_states)

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/Documents/GitHub/transformers/src/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
    239 
    240         # Normalize the attention scores to probabilities.
--> 241         attention_probs = nn.Softmax(dim=-1)(attention_scores)
    242 
    243         # This is actually dropping out entire tokens to attend to, which might

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/modules/activation.py in forward(self, input)
   1016 
   1017     def forward(self, input):
-> 1018         return F.softmax(input, self.dim, _stacklevel=5)
   1019 
   1020     def extra_repr(self):

~/miniconda2/envs/datasets/lib/python3.7/site-packages/torch/nn/functional.py in softmax(input, dim, _stacklevel, dtype)
   1229         dim = _get_softmax_dim('softmax', input.dim(), _stacklevel)
   1230     if dtype is None:
-> 1231         ret = input.softmax(dim)
   1232     else:
   1233         ret = input.softmax(dim, dtype=dtype)

KeyboardInterrupt:
