from fastai.gen_doc.nbdoc import *
from fastai.text import *
from fastai import *
This module contains the TextDataset class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in text.transform. It also contains all the functions to quickly get a TextDataBunch ready.
You should get your data in one of the supported formats to make the most of the fastai library and use one of the factory methods of one of the TextDataBunch classes. If you are assembling the data for a language model, you should always define your labels as 0 to respect those formats. The first time you create a DataBunch with one of those functions, your data will be preprocessed automatically and saved, so that the next call is almost instantaneous.

Below are the classes that help assemble the raw data into a DataBunch suitable for NLP.
show_doc(TextLMDataBunch, title_level=3, doc_string=False)
class TextLMDataBunch [source]

TextLMDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: TextDataBunch
show_doc(TextClasDataBunch, title_level=3, doc_string=False)
class TextClasDataBunch [source]

TextClasDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: TextDataBunch
Create a DataBunch suitable for a text classifier: all the texts are grouped by length (with a bit of randomness for the training set) then padded.
show_doc(TextDataBunch, title_level=3, doc_string=False)
class TextDataBunch [source]

TextDataBunch(train_dl:DataLoader, valid_dl:DataLoader, test_dl:Optional[DataLoader]=None, device:device=None, tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate') :: DataBunch
Create a DataBunch with the raw texts. This is only going to work if they all have the same length.
All those classes have the following factory methods.
show_doc(TextDataBunch.from_folder, doc_string=False)
This function will create a DataBunch from texts placed in path, in train, valid and maybe test folders. Text files in the train and valid folders should be placed in subdirectories according to their classes (always the same one for a language model), and the ones for the test folder should all be placed there directly. tokenizer will be used to parse those texts into tokens. The shuffle flag will optionally shuffle the texts found.
You can pass a specific vocab for the numericalization step (for instance if you are building a classifier from a language model you fine-tuned). The kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections on LM data and classifier data).
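For instance, a minimal sketch of such a call (the folder layout and parameter values are purely illustrative, assuming path points at a folder with train and valid subfolders organized by class):

# Illustrative only: assumes path/train/<class>/*.txt and path/valid/<class>/*.txt
# max_vocab/min_freq go to the TextDataset, bs/bptt to the batching
data_lm = TextLMDataBunch.from_folder(path, max_vocab=30000, min_freq=2, bs=32, bptt=70)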
show_doc(TextDataBunch.from_csv, doc_string=False)
This function will create a DataBunch from texts placed in path, in train.csv, valid.csv and maybe test.csv files. These csv files should have no header or index, and the label(s) should be the first column(s) (be sure to adjust the parameter n_labels if you have more than one). tokenizer will be used to parse those texts into tokens.
You can pass a specific vocab for the numericalization step (for instance if you are building a classifier from a language model you fine-tuned). The kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections on LM data and classifier data).
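As an illustration, a hedged sketch of a call with some of those parameters (the values are arbitrary):

# Illustrative only: train.csv/valid.csv in path, label in the first column, text after
data_clas = TextClasDataBunch.from_csv(path, n_labels=1, max_vocab=30000, bs=32)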
show_doc(TextDataBunch.from_tokens, doc_string=False)
This function will create a DataBunch from already tokenized texts placed in path, in files named f{train}{tok_suff}.npy, f{train}{lbl_suff}.npy, f{valid}{tok_suff}.npy, f{valid}{lbl_suff}.npy and maybe f{test}{tok_suff}.npy. If no label file exists, labels will default to all zeros. tok_suff and lbl_suff are '_tok' and '_lbl' respectively.
You can pass a specific vocab for the numericalization step (for instance if you are building a classifier from a language model you fine-tuned). The kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels, tok_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections on LM data and classifier data).
show_doc(TextDataBunch.from_id_files, doc_string=False)
This function will create a DataBunch from already numericalized texts placed in path, in files named f{train}{id_suff}.npy, f{train}{lbl_suff}.npy, f{valid}{id_suff}.npy, f{valid}{lbl_suff}.npy and maybe f{test}{id_suff}.npy. If no label file exists, labels will default to all zeros. id_suff and lbl_suff are '_ids' and '_lbl' respectively. The itos file should contain the correspondence from ids to words.
The kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels, tok_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections on LM data and classifier data).
show_doc(TextDataBunch.from_ids, doc_string=False)
This function will create a DataBunch in path from texts already processed into trn_ids, trn_lbls, val_ids, val_lbls and maybe tst_ids. You can specify the corresponding classes if applicable. You must specify vocab_size so that the RNNLearner class can later infer the corresponding sizes in the model it will create. kwargs will be passed to the class initialization.
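A minimal sketch of its use on toy data (the keyword names simply mirror the argument names above; everything here is illustrative):

# Illustrative only: tiny numericalized texts (lists of ids) and their labels
trn_ids, trn_lbls = [[2, 5, 8, 3], [2, 9, 4]], [0, 1]
val_ids, val_lbls = [[2, 7, 6]], [1]
data = TextClasDataBunch.from_ids(path, trn_ids=trn_ids, trn_lbls=trn_lbls,
                                  val_ids=val_ids, val_lbls=val_lbls,
                                  vocab_size=10, classes=['neg', 'pos'])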
Untar the IMDB sample dataset if not already done:
path = untar_data(URLs.IMDB_SAMPLE)
path
PosixPath('/home/ubuntu/.fastai/data/imdb_sample')
Since it comes in the form of csv files, we will use the corresponding from_csv method. Here is an overview of what your file should look like:
pd.read_csv(path/'train.csv', header=None).head()
 | 0 | 1 |
---|---|---|
0 | 0 | Un-bleeping-believable! Meg Ryan doesn't even ... |
1 | 1 | This is a extremely well-made film. The acting... |
2 | 0 | Every once in a long while a movie will come a... |
3 | 1 | Name just says it all. I watched this movie wi... |
4 | 0 | This movie succeeds at being one of the most u... |
And here is a simple way of creating your DataBunch for language modelling or classification.
data_lm = TextLMDataBunch.from_csv(Path(path))
data_clas = TextClasDataBunch.from_csv(Path(path))
Behind the scenes, the previous functions will create a training, validation and maybe test TextDataset, which is the class responsible for collecting and preprocessing the data.
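Those datasets are exposed on the resulting DataBunch, which is how the vocabulary is accessed further down this page:

# The preprocessed TextDataset objects and their vocabulary
data_lm.train_ds, data_lm.valid_ds
data_lm.train_ds.vocab.itos[:10]   # first entries of the id-to-token mapping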
show_doc(TextDataset, doc_string=False)
class TextDataset [source]

TextDataset(path:PathOrStr, tokenizer:Tokenizer=None, vocab:Vocab=None, max_vocab:int=60000, chunksize:int=10000, name:str='train', df=None, min_freq:int=2, n_labels:int=1, txt_cols=None, label_cols=None, create_mtd:TextMtd=<TextMtd.DF: 1>, classes:ArgStar=None, clear_cache:bool=False) :: BaseTextDataset
This class shouldn't be initialized directly, as it relies on internal files being put in a 'tmp' folder of path. tokenizer and vocab will be used to tokenize and numericalize the texts (if needed). max_vocab and min_freq are passed at the creation of the vocabulary (if needed). chunksize is the size of the chunks preprocessed when loading the data from csv or folders. name is the name of the set that will be used to name the temporary files. n_labels is the number of labels if creating the data from a csv file. classes is the correspondence between labels and classes. create_mtd is an internal flag that tells the TextDataset how it was created. It can be:

- CSV if it was created from texts or csv
- TOK if it was created from tokens (which means the TextDataset will always skip the tokenization)
- IDS if it was created from ids (which means the TextDataset will always skip the tokenization and the numericalization)

Instead of using the TextDataset init method, one of the following factory methods should be used:
show_doc(TextDataset.from_folder, doc_string=False)
Creates a TextDataset named name by scanning the subfolders in folder and using tokenizer. If classes are passed, only the subfolders named accordingly are checked. If shuffle is True, the data will be shuffled. Any additional kwargs are passed to the init method of TextDataset.
show_doc(TextDataset.from_one_folder, doc_string=False)
Creates a TextDataset named name by scanning the text files in folder and using tokenizer. All files are labelled classes[0], so this is typically used for the test set. If shuffle is True, the data will be shuffled. Any additional kwargs are passed to the init method of TextDataset.
show_doc(TextDataset.from_df)
show_doc(TextDataset.from_tokens, doc_string=False)
from_tokens [source]

from_tokens(folder:PathOrStr, name:str='train', tok_suff:str='_tok', lbl_suff:str='_lbl', kwargs) → TextDataset
Creates a TextDataset named name from tokens and labels saved in f{name}{tok_suff}.npy and f{name}{lbl_suff}.npy respectively. Any additional kwargs are passed to the init method of TextDataset.
show_doc(TextDataset.from_ids, doc_string=False)
from_ids [source]

from_ids(folder:PathOrStr, name:str='train', id_suff:str='_ids', lbl_suff:str='_lbl', itos:str='itos.pkl', kwargs) → TextDataset
Creates a TextDataset named name from ids, labels and the dictionary saved in f{name}{id_suff}.npy, f{name}{lbl_suff}.npy and itos respectively. Any additional kwargs are passed to the init method of TextDataset.
The internal preprocessing is done by the two following methods:
show_doc(TextDataset.tokenize)
show_doc(TextDataset.numericalize)
Internally, the TextDataset will create a 'tmp' folder in which it will copy or save the following files:

- name.csv (if created from folders or csv)
- name_tok.npy and name_lbl.npy (created by TextDataset.tokenize from the previous step, or copied if created from tokens)
- name_ids.npy, name_lbl.npy and itos (created by TextDataset.numericalize from the previous step, or copied if created from ids)

Then, when you invoke the TextDataset again, it will look for those temporary files and check their consistency before using them, in order to avoid redoing the tokenization or the numericalization. If you feel those files have been corrupted in any way, the following method will clear them from the 'tmp' subfolder:
show_doc(TextDataset.clear)
show_doc(TextDataset.check_ids)
show_doc(TextDataset.check_toks)
show_doc(TextDataset.general_check)
general_check [source]

general_check(pre_files:Collection[PathOrStr], post_files:Collection[PathOrStr])

Check that the post_files exist and were modified after all the pre_files.
show_doc(BaseTextDataset)
class BaseTextDataset [source]

BaseTextDataset(ids:Collection[Collection[int]], labels:Collection[Union[int,float]], vocab_size:int, classes:ArgStar=None)
To directly create a text dataset from ids and labels.
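A toy construction under the signature shown above (the data is made up for illustration):

# Two numericalized texts (lists of ids) with their labels
ds = BaseTextDataset(ids=[[2, 5, 8], [2, 9, 4, 7]], labels=[0, 1],
                     vocab_size=10, classes=['neg', 'pos'])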
A language model is trained to guess the next word in a flow of words. We don't feed it the different texts separately but concatenate them all together in one big array. To create the batches, we split this array into bs chunks of contiguous texts. Note that in all NLP tasks, we use the PyTorch convention of the sequence length being the first dimension (and the batch size the second one), so we transpose that array to read the chunks of texts down the columns. Here is an example of a batch from our IMDB sample dataset.
path = untar_data(URLs.IMDB_SAMPLE)
data = TextLMDataBunch.from_csv(path)
x,y = next(iter(data.train_dl))
example = x[:20,:10].cpu()
texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])
texts
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | xxfld | protagonist | xxunk | into | occasionally | start | humor | his | the | xxunk |
1 | 1 | is | for | this | xxunk | planning | is | revenge | box | in |
2 | un | xxunk | a | film | in | and | the | . | office | my |
3 | - | her | massive | , | other | not | biggest | still | , | xxunk |
4 | xxunk | early | series | although | versions | filming | problem | alive | xxunk | . |
5 | - | life | of | having | of | until | with | , | b. | first |
6 | believable | as | gags | the | the | everything | the | it | demille | , |
7 | ! | a | built | main | story | has | film | looks | stopped | the |
8 | meg | butcher | upon | character | . | come | . | like | doing | xxunk |
9 | ryan | . | gags | a | wells | down | sure | carradine | films | scene |
10 | does | weird | , | drunk | ' | on | , | tries | about | between |
11 | n't | stuff | but | and | description | a | making | to | non | the |
12 | even | . | stops | a | of | storyboard | fun | shoot | - | women |
13 | look | then | short | heroine | the | . | of | her | american | at |
14 | her | there | ( | addict | martians | you | mentally | and | history | the |
15 | usual | 's | for | did | | certainly | ill | misses | . | xxunk |
16 | xxunk | the | all | n't | a | have | people | , | his | xxunk |
17 | lovable | core | the | come | giant | the | is | but | films | -- |
18 | self | premise | xxunk | as | head | ability | pretty | it | for | undertext |
19 | in | of | ) | an | xxunk | and | low | does | the | : |
Then, as suggested in this article from Stephen Merity et al., we don't use a fixed bptt through the different batches but slightly change it from batch to batch.
iter_dl = iter(data.train_dl)
for _ in range(5):
    x,y = next(iter_dl)
    print(x.size())
torch.Size([81, 64])
torch.Size([66, 64])
torch.Size([27, 64])
torch.Size([69, 64])
torch.Size([67, 64])
This is all done internally when we use TextLMDataBunch, by creating a DataLoader using the following class:
show_doc(LanguageModelLoader, doc_string=False)
class LanguageModelLoader [source]

LanguageModelLoader(dataset:TextDataset, bs:int=64, bptt:int=70, backwards:bool=False)
Takes the texts from dataset and concatenates them all, then creates a big array with bs columns (transposed from the data source so that we read the texts down the columns). It spits out batches with a sequence length approximately equal to bptt, but changing at every batch. If backwards is True, the original text is reversed.
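The way the sequence length varies from batch to batch is in the spirit of the jittered bptt described in the article mentioned above; a rough sketch of that idea (not the library source) is:

import numpy as np
bptt = 70
# keep bptt most of the time, occasionally halve it, then add a little gaussian noise
seq_len = bptt if np.random.random() < 0.95 else bptt // 2
seq_len = max(5, int(np.random.normal(seq_len, 5)))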
show_doc(LanguageModelLoader.batchify, doc_string=False)
batchify [source]

batchify(data:ndarray) → LongTensor
Called at initialization to create the big array of text ids from the data array.
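Conceptually, the operation looks like this sketch with dummy ids (illustrative only):

import numpy as np
ids = np.arange(100)                  # stand-in for all the concatenated token ids
bs = 8
n = (len(ids) // bs) * bs             # drop the remainder so the array divides evenly
batched = ids[:n].reshape(bs, -1).T   # shape (seq_len, bs): texts are read down the columns
batched.shape                         # (12, 8)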
show_doc(LanguageModelLoader.get_batch)
get_batch [source]

get_batch(i:int, seq_len:int) → Tuple[LongTensor, LongTensor]
Create a batch at i of a given seq_len.
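A language model batch is a slice of that big array together with the same slice shifted by one token as the targets; schematically (a sketch, not the library source):

# batched is the (seq_len, bs) array built by batchify, i the current position
def get_batch_sketch(batched, i, seq_len):
    seq_len = min(seq_len, len(batched) - 1 - i)   # don't run past the end of the array
    return batched[i:i+seq_len], batched[i+1:i+1+seq_len]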
When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don't all have the same length, we can't easily collate them together in batches. To help with this we use two different techniques:

- padding: the texts in a batch are padded with the PAD token to get all the ones we picked to the same size;
- sorting by length: to avoid wasting computation (on too many PAD tokens), we regroup the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.

Here is an example of a batch with padding (the padding index is 1, and the padding is applied before the sentences start).
path = untar_data(URLs.IMDB_SAMPLE)
data = TextClasDataBunch.from_csv(path)
iter_dl = iter(data.train_dl)
_ = next(iter_dl)
x,y = next(iter_dl)
x[:20,-10:]
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')
This is all done internally when we use TextClasDataBunch, by using the following classes:
show_doc(SortSampler, doc_string=False)
A PyTorch Sampler to batchify the data_source by order of the length of the texts. Used for the validation and (if applicable) the test set.
show_doc(SortishSampler, doc_string=False)
A PyTorch Sampler to batchify the data_source with batches of size bs, by order of the length of the texts with a bit of randomness. Used for the training set.
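The core idea behind both samplers can be sketched as follows (a simplification: the real SortishSampler shuffles in chunks before sorting so that the training batches vary between epochs):

# tokenized_texts is a hypothetical list of numericalized texts
lengths = [len(t) for t in tokenized_texts]
sort_order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)  # longest first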
show_doc(pad_collate, doc_string=False)
pad_collate [source]

pad_collate(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True) → Tuple[LongTensor, LongTensor]
Function used by the PyTorch DataLoader to collate the samples in batches while adding padding with pad_idx. If pad_first is True, padding is applied at the beginning (before the sentence starts), otherwise it is applied at the end.
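A simplified version of that collation logic might look like the sketch below (illustrative only; the exact shape conventions of the real function may differ):

import torch
def pad_collate_sketch(samples, pad_idx=1, pad_first=True):
    # samples is a list of (ids, label) pairs with ids of varying lengths
    max_len = max(len(ids) for ids, _ in samples)
    res = torch.full((len(samples), max_len), pad_idx, dtype=torch.long)
    for i, (ids, _) in enumerate(samples):
        if pad_first: res[i, -len(ids):] = torch.tensor(ids)
        else:         res[i, :len(ids)]  = torch.tensor(ids)
    return res, torch.tensor([lbl for _, lbl in samples])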
show_doc(TextMtd, alt_doc_string='`TextDataset` enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids)')
Enum = [DF, TOK, IDS]

TextDataset enum to keep track of what data needs to be processed (dataframe, csv, tokens, ids).
show_doc(read_classes)
read_classes [source]

read_classes(fname)
show_doc(TextLMDataBunch.create)
create [source]

create(datasets:Collection[TextDataset], path:PathOrStr, kwargs) → DataBunch

Create a TextDataBunch in path from the datasets for language modelling.
show_doc(TextClasDataBunch.create)
create [source]

create(datasets:Collection[TextDataset], path:PathOrStr, bs=64, pad_idx=1, pad_first=True, kwargs) → DataBunch

Function that transforms the datasets into a DataBunch for classification.
show_doc(TextDataBunch.from_df)
from_df [source]

from_df(path:PathOrStr, train_df:Union[DataFrame, TextFileReader], valid_df:Union[DataFrame, TextFileReader], test_df:Union[DataFrame, TextFileReader, NoneType]=None, tokenizer:Tokenizer=None, vocab:Vocab=None, kwargs) → DataBunch

Create a TextDataBunch from DataFrames.
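A minimal sketch of its use, assuming the DataFrames follow the same convention as the csv files (label(s) in the first column(s), then the text):

import pandas as pd
# Illustrative toy DataFrames
train_df = pd.DataFrame({'label': [0, 1], 'text': ['bad movie', 'great movie']})
valid_df = pd.DataFrame({'label': [1], 'text': ['a classic']})
data = TextClasDataBunch.from_df(path, train_df, valid_df)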