from fastai.gen_doc.nbdoc import *
from fastai.text import *
This module contains the TextDataset
class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in text.transform
. It also contains all the functions to quickly get a TextDataBunch
ready.
You should get your data in one of the following formats to make the most of the fastai library and use one of the factory methods of one of the TextDataBunch
classes:
If you are assembling the data for a language model, you should define your labels as always 0 to respect those formats. The first time you create a DataBunch with one of those functions, your data will be preprocessed automatically. You can save it, so that the next time you need it, loading is almost instantaneous.
Below are the classes that help assembling the raw data in a DataBunch
suitable for NLP.
show_doc(TextLMDataBunch, title_level=3)
class TextLMDataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False) :: TextDataBunch
Create a TextDataBunch
suitable for training a language model.
All the texts in the datasets
are concatenated and the labels are ignored. Instead, the target is the next word in the sentence.
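The idea of concatenating the texts and predicting the next token can be illustrated with a minimal sketch. It uses plain Python lists rather than the library's own classes, and the function name `lm_inputs_and_targets` and the token ids are made up for illustration.

```python
from typing import List, Tuple

def lm_inputs_and_targets(texts: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate numericalized texts and shift by one token to get targets."""
    # Labels play no role here: only the token stream matters.
    stream = [tok for text in texts for tok in text]
    inputs = stream[:-1]   # every token except the last
    targets = stream[1:]   # the token that follows each input
    return inputs, targets

# Three tiny "texts", already converted to token ids (made-up values)
texts = [[2, 10, 11], [2, 12], [2, 13, 14, 15]]
inputs, targets = lm_inputs_and_targets(texts)
print(inputs)   # [2, 10, 11, 2, 12, 2, 13, 14]
print(targets)  # [10, 11, 2, 12, 2, 13, 14, 15]
```

Note that the boundary between two texts is treated like any other position: the target of the last token of one text is the first token of the next.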
show_doc(TextLMDataBunch.create)
create(train_ds, valid_ds, test_ds=None, path:PathOrStr='.', no_check:bool=False, bs=64, val_bs:int=None, num_workers:int=0, device:device=None, collate_fn:Callable='data_collate', dl_tfms:Optional[Collection[Callable]]=None, bptt:int=70, backwards:bool=False, **dl_kwargs) → DataBunch
Create a TextDataBunch in path from the datasets for language modelling. Passes **dl_kwargs on to DataLoader().
show_doc(TextClasDataBunch, title_level=3)
class TextClasDataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False) :: TextDataBunch
Create a TextDataBunch
suitable for training an RNN classifier.
show_doc(TextClasDataBunch.create)
create(train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs:int=32, val_bs:int=None, pad_idx=1, pad_first=True, device:device=None, no_check:bool=False, backwards:bool=False, dl_tfms:Optional[Collection[Callable]]=None, **dl_kwargs) → DataBunch
Function that transforms the datasets into a DataBunch for classification. Passes **dl_kwargs on to DataLoader().
All the texts are grouped by length (with a bit of randomness for the training set), then padded so that the samples in a batch all have the same length.
show_doc(TextDataBunch, title_level=3)
class TextDataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False) :: DataBunch
General class to get a DataBunch
for NLP. Subclassed by TextLMDataBunch
and TextClasDataBunch
.
jekyll_warn("This class can only work directly if all the texts have the same length.")
All those classes have the following factory methods.
show_doc(TextDataBunch.from_folder)
from_folder(path:PathOrStr, train:str='train', valid:str='valid', test:Optional[str]=None, classes:ArgStar=None, tokenizer:Tokenizer=None, vocab:Vocab=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, **kwargs)
Create a TextDataBunch
from text files in folders.
The folders are scanned in path, which should contain train, valid and optionally test folders. Text files in the train and valid folders should be placed in subdirectories according to their classes (not applicable for a language model). tokenizer will be used to parse those texts into tokens.
You can pass a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). kwargs will be split between the TextDataset function and the class initialization; you can specify there parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
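To make the expected folder layout concrete, here is a small sketch that writes a toy corpus to a temporary directory and scans it back. It only illustrates the train/valid/class-subfolder convention described above; the helper names `make_toy_corpus` and `scan_split` and the sample texts are hypothetical, and this is not how the library itself reads the files.

```python
import tempfile
from pathlib import Path

def make_toy_corpus(root: Path) -> None:
    """Write a few .txt files under root/<split>/<class>/ (toy content)."""
    samples = {
        ("train", "positive"): ["A lovely film.", "Great acting."],
        ("train", "negative"): ["Dull and slow."],
        ("valid", "positive"): ["Charming."],
        ("valid", "negative"): ["A waste of time."],
    }
    for (split, label), texts in samples.items():
        folder = root / split / label
        folder.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")

def scan_split(root: Path, split: str):
    """Return sorted (class_name, text) pairs found under root/split."""
    files = sorted((root / split).glob("*/*.txt"))
    # The class of a file is the name of the subdirectory it sits in.
    return [(f.parent.name, f.read_text(encoding="utf-8")) for f in files]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_toy_corpus(root)
    train_items = scan_split(root, "train")
    valid_items = scan_split(root, "valid")

print(len(train_items), len(valid_items))  # 3 2
```

For a language model the class subfolders carry no information, since the labels are ignored.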
show_doc(TextDataBunch.from_csv)
from_csv(path:PathOrStr, csv_name, valid_pct:float=0.2, test:Optional[str]=None, tokenizer:Tokenizer=None, vocab:Vocab=None, classes:StrList=None, delimiter:str=None, header='infer', text_cols:IntsOrStrs=1, label_cols:IntsOrStrs=0, label_delim:str=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, **kwargs) → DataBunch
Create a TextDataBunch
from texts in csv files. kwargs
are passed to the dataloader creation.
This method will look for csv_name
, and optionally a test
csv file, in path
. These will be opened with header
, using delimiter
. You can specify which are the text_cols
and label_cols
; by default a single label column is assumed to come before a single text column. If your csv has no header, you must specify these as indices. If you're training a language model and don't have labels, you must specify the text_cols
. If there are several text_cols
, the texts will be concatenated together with an optional field token. If there are several label_cols
, the labels will be assumed to be one-hot encoded and classes
will default to label_cols
(you can ignore that argument for a language model). label_delim
can be used to specify the separator between multiple labels in a column.
You can pass a tokenizer to be used to parse the texts into tokens and/or a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). Otherwise you can specify parameters such as max_vocab, min_freq, chunksize for the Tokenizer and Numericalizer (processors). Other parameters (e.g. bs, val_bs and num_workers) will be passed to LabelLists.databunch() (see the LM data and classifier data sections for more info).
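The default column convention (label in column 0, text in column 1), a random validation split of size valid_pct, and a label delimiter can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code path: the helper names, the seed and the toy CSV are invented, and the real method may split and parse differently.

```python
import csv
import io
import random

CSV_TEXT = """label,text
negative,A dull film.
positive,Great fun.
negative,Not for me.
positive,Loved it.
positive,A classic.
"""

def read_labeled_csv(stream, text_col=1, label_col=0, has_header=True):
    """Return (labels, texts) using column indices, label first by default."""
    rows = list(csv.reader(stream))
    if has_header:
        rows = rows[1:]  # drop the header row
    labels = [row[label_col] for row in rows]
    texts = [row[text_col] for row in rows]
    return labels, texts

def random_split(n_items, valid_pct=0.2, seed=0):
    """Randomly pick about valid_pct of the indices for validation."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]  # train indices, valid indices

def split_labels(cell, label_delim=" "):
    """Split a multi-label cell such as 'comedy drama' into its labels."""
    return cell.split(label_delim)

labels, texts = read_labeled_csv(io.StringIO(CSV_TEXT))
train_idx, valid_idx = random_split(len(texts), valid_pct=0.2)
print(len(train_idx), len(valid_idx))  # 4 1
print(split_labels("comedy drama"))    # ['comedy', 'drama']
```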
show_doc(TextDataBunch.from_df)
from_df(path:PathOrStr, train_df:DataFrame, valid_df:DataFrame, test_df:OptDataFrame=None, tokenizer:Tokenizer=None, vocab:Vocab=None, classes:StrList=None, text_cols:IntsOrStrs=1, label_cols:IntsOrStrs=0, label_delim:str=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, **kwargs) → DataBunch
Create a TextDataBunch
from DataFrames. kwargs
are passed to the dataloader creation.
This method will use train_df
, valid_df
and optionally test_df
to build the TextDataBunch
in path
. You can specify text_cols
and label_cols
; by default a single label column comes before a single text column. If you're training a language model and don't have labels, you must specify the text_cols
. If there are several text_cols
, the texts will be concatenated together with an optional field token. If there are several label_cols
, the labels will be assumed to be one-hot encoded and classes
will default to label_cols
(you can ignore that argument for a language model).
You can pass a tokenizer to be used to parse the texts into tokens and/or a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). Otherwise you can specify parameters such as max_vocab, min_freq, chunksize for the default Tokenizer and Numericalizer (processors). Other parameters (e.g. bs, val_bs and num_workers) will be passed to LabelLists.databunch() (see the LM data and classifier data sections for more info).
show_doc(TextDataBunch.from_tokens)
from_tokens(path:PathOrStr, trn_tok:Tokens, trn_lbls:Collection[Union[int, float]], val_tok:Tokens, val_lbls:Collection[Union[int, float]], vocab:Vocab=None, tst_tok:Tokens=None, classes:ArgStar=None, max_vocab:int=60000, min_freq:int=3, **kwargs) → DataBunch
Create a TextDataBunch
from tokens and labels. kwargs
are passed to the dataloader creation.
This function will create a DataBunch
from trn_tok
, trn_lbls
, val_tok
, val_lbls
and maybe tst_tok
.
You can pass a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). kwargs will be split between the TextDataset function and the class initialization; you can specify there parameters such as max_vocab, chunksize, min_freq, n_labels, tok_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
show_doc(TextDataBunch.from_ids)
from_ids(path:PathOrStr, vocab:Vocab, train_ids:Collection[Collection[int]], valid_ids:Collection[Collection[int]], test_ids:Collection[Collection[int]]=None, train_lbls:Collection[Union[int, float]]=None, valid_lbls:Collection[Union[int, float]]=None, classes:ArgStar=None, processor:PreProcessor=None, **kwargs) → DataBunch
Create a TextDataBunch
from ids, labels and a vocab
. kwargs
are passed to the dataloader creation.
Texts are already preprocessed into train_ids
, train_lbls
, valid_ids
, valid_lbls
and maybe test_ids
. You can specify the corresponding classes
if applicable. You must specify a path
and the vocab
so that the RNNLearner
class can later infer the corresponding sizes in the model it will create. kwargs will be passed to the class initialization.
To avoid losing time preprocessing the text data more than once, you should save and load your TextDataBunch
using DataBunch.save
and load_data
.
show_doc(TextDataBunch.load)
load(path:PathOrStr, cache_name:PathOrStr='tmp', processor:PreProcessor=None, **kwargs)
Load a TextDataBunch
from path/cache_name
. kwargs
are passed to the dataloader creation.
jekyll_warn("This method should only be used to load back `TextDataBunch` saved in v1.0.43 or before, it is now deprecated.")
Untar the IMDB sample dataset if not already done:
path = untar_data(URLs.IMDB_SAMPLE)
path
PosixPath('/home/ubuntu/.fastai/data/imdb_sample')
Since it comes in the form of csv files, we will use the corresponding text_data method. Here is an overview of what your file should look like:
pd.read_csv(path/'texts.csv').head()
| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
| 1 | positive | This is a extremely well-made film. The acting... | False |
| 2 | negative | Every once in a long while a movie will come a... | False |
| 3 | positive | Name just says it all. I watched this movie wi... | False |
| 4 | negative | This movie succeeds at being one of the most u... | False |
And here is a simple way of creating your DataBunch
for language modelling or classification.
data_lm = TextLMDataBunch.from_csv(Path(path), 'texts.csv')
data_clas = TextClasDataBunch.from_csv(Path(path), 'texts.csv')
Behind the scenes, the previous functions will create a training, validation and maybe test TextList
that will be tokenized and numericalized (if needed) using PreProcessor
.
show_doc(Text, title_level=3)
class Text(ids, text) :: ItemBase
Basic item for text
data in numericalized ids
.
show_doc(TextList, title_level=3)
vocab
contains the correspondence between ids and tokens, pad_idx
is the id used for padding. You can pass a custom processor
in the kwargs
to change the defaults for tokenization or numericalization. It should have the following form:
tokenizer = Tokenizer(SpacyTokenizer, 'en')
processor = [TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(max_vocab=30000)]
To use sentencepiece instead of spaCy (this requires installing sentencepiece separately), you would pass
processor = SPProcessor()
See below for all the arguments those tokenizers can take.
show_doc(TextList.label_for_lm)
label_for_lm(**kwargs)
A special labelling method for language models.
show_doc(TextList.from_folder)
from_folder(path:PathOrStr='.', extensions:StrList={'.txt'}, vocab:Vocab=None, processor:PreProcessor=None, **kwargs) → TextList
Get the list of files in path
that have a text suffix. recurse
determines if we search subfolders.
show_doc(TextList.show_xys)
show_xys(xs, ys, max_len:int=70)
Show the xs
(inputs) and ys
(targets). max_len
is the maximum number of tokens displayed.
show_doc(TextList.show_xyzs)
show_xyzs(xs, ys, zs, max_len:int=70)
Show xs
(inputs), ys
(targets) and zs
(predictions). max_len
is the maximum number of tokens displayed.
show_doc(OpenFileProcessor, title_level=3)
class OpenFileProcessor(ds:Collection[T_co]=None) :: PreProcessor
PreProcessor that opens the filenames and reads the texts.
show_doc(open_text)
open_text(fn:PathOrStr, enc='utf-8')
Read the text in fn
.
show_doc(TokenizeProcessor, title_level=3)
class TokenizeProcessor(ds:ItemList=None, tokenizer:Tokenizer=None, chunksize:int=10000, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False) :: PreProcessor
PreProcessor
that tokenizes the texts in ds
.
tokenizer is applied to the texts in chunks of chunksize. If mark_fields=True, field tokens are added between the parts of each text (the parts come from reading the texts from several columns of a dataframe). Depending on include_bos and include_eos, BOS and EOS will be automatically added at the beginning or the end of each text. See more about tokenizers in the transform documentation.
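The effect of mark_fields, include_bos and include_eos on a text read from several columns can be sketched as follows. The token strings `xxbos`, `xxeos` and `xxfld`, the numbering of fields from 1, and the function name `join_fields` are assumptions made for this illustration; check the transform documentation for the tokens the library actually uses.

```python
from typing import Sequence

# Special token strings: assumed values for this sketch only.
BOS, EOS, FLD = "xxbos", "xxeos", "xxfld"

def join_fields(fields: Sequence[str], mark_fields: bool = False,
                include_bos: bool = True, include_eos: bool = False) -> str:
    """Join the text columns of one row, optionally adding special tokens."""
    parts = []
    if include_bos:
        parts.append(BOS)              # marks the beginning of the text
    for i, field in enumerate(fields, start=1):
        if mark_fields:
            parts.append(f"{FLD} {i}")  # marks where column i starts
        parts.append(field)
    if include_eos:
        parts.append(EOS)              # marks the end of the text
    return " ".join(parts)

row = ["Great film", "I would watch it again"]
print(join_fields(row))
# xxbos Great film I would watch it again
print(join_fields(row, mark_fields=True, include_eos=True))
# xxbos xxfld 1 Great film xxfld 2 I would watch it again xxeos
```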
show_doc(NumericalizeProcessor, title_level=3)
class NumericalizeProcessor(ds:ItemList=None, vocab:Vocab=None, max_vocab:int=60000, min_freq:int=3) :: PreProcessor
PreProcessor
that numericalizes the tokens in ds
.
Uses vocab for this (if not None), otherwise creates one with max_vocab and min_freq from tokens.
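How max_vocab and min_freq could shape a vocabulary is shown in the sketch below. It is a simplified stand-in, not the library's Vocab class: the special tokens `xxunk` and `xxpad`, their positions, and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

UNK, PAD = "xxunk", "xxpad"  # assumed special tokens for this sketch

def build_vocab(tokenized: List[List[str]], max_vocab: int = 60000,
                min_freq: int = 3) -> List[str]:
    """Keep at most max_vocab tokens that appear at least min_freq times."""
    counts = Counter(tok for text in tokenized for tok in text)
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    return [UNK, PAD] + kept

def numericalize(tokens: List[str], stoi: Dict[str, int]) -> List[int]:
    """Map tokens to ids, sending unseen or rare tokens to the UNK id."""
    unk_id = stoi[UNK]
    return [stoi.get(tok, unk_id) for tok in tokens]

tokenized = [["the", "film", "was", "good"],
             ["the", "film", "was", "bad"],
             ["the", "plot"]]
itos = build_vocab(tokenized, max_vocab=10, min_freq=2)
stoi = {tok: i for i, tok in enumerate(itos)}
ids = numericalize(["the", "film", "was", "awful"], stoi)
print(itos)  # ['xxunk', 'xxpad', 'the', 'film', 'was']
print(ids)   # [2, 3, 4, 0]
```

Passing an existing vocabulary instead of building one matters when reusing a fine-tuned language model, since the ids must match the ones the model was trained with.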
show_doc(SPProcessor, title_level=3)
class SPProcessor(ds:ItemList=None, pre_rules:ListRules=None, post_rules:ListRules=None, vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en', char_coverage=None, tmp_dir='tmp', mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, sp_model=None, sp_vocab=None) :: PreProcessor
PreProcessor that tokenizes and numericalizes with sentencepiece.
pre_rules
and post_rules
default to defaults.text_pre_rules
and defaults.text_post_rules
respectively, vocab_sz
defaults to the minimum between max_vocab_sz
and one quarter of the number of words in the training texts (rounded to the nearest multiple of 8). model_type
is passed to sentencepiece, so can be unigram
(default), bpe
, char
, or word
. Other sentencepiece parameters are lang, max_sentence_len and char_coverage (which defaults to 1.0 for European languages and 0.99 for others).
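The default vocab_sz rule stated above can be written as a few lines of arithmetic. This sketch follows the wording of the text literally; how the library counts words (all words or unique words) and how it breaks rounding ties are not stated here, so both are assumptions, as is the function name `default_vocab_sz`.

```python
def default_vocab_sz(n_words: int, max_vocab_sz: int = 30000) -> int:
    """min(max_vocab_sz, a quarter of n_words rounded to a multiple of 8)."""
    quarter = n_words // 4
    # Nearest multiple of 8; exact ties round up (an assumption of this sketch).
    rounded = (quarter + 4) // 8 * 8
    return min(max_vocab_sz, rounded)

print(default_vocab_sz(1000))       # 248
print(default_vocab_sz(1_000_000))  # 30000
```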
mark_fields=True will add field tokens between the text columns (if the texts are in several columns of a dataframe) and, depending on include_bos and include_eos, BOS and EOS will be automatically added at the beginning or the end of each text. The sentencepiece model used for tokenization will be saved in path/tmp_dir, where path is given by the data this processor is applied to.
If you already have a trained tokenizer, you can pass along the model and vocab files with sp_model and sp_vocab.
A language model is trained to guess what the next word is inside a flow of words. We don't feed it the different texts separately but concatenate them all together in a big array. To create the batches, we split this array into bs chunks of continuous texts. Note that, unlike the usual NLP convention of putting sequence length first, fastai makes batch size the first dimension and sequence length the second. In the example here, each row is one chunk of text.
path = untar_data(URLs.IMDB_SAMPLE)
data = TextLMDataBunch.from_csv(path, 'texts.csv')
x,y = next(iter(data.train_dl))
example = x[:15,:15].cpu()
texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])
texts
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | crew | that | he | can | trust | to | help | him | pull | it | off | and | get | his | xxunk | None | None |
| 1 | want | a | good | family | movie | , | this | might | do | . | xxmaj | it | is | clean | . | None | None |
| 2 | director | of | many | bad | xxunk | ) | tries | to | cover | the | info | up | , | but | goo | None | None |
| 3 | film | , | and | the | xxunk | xxunk | of | the | villain | , | humorous | or | not | , | are | None | None |
| 4 | cole | in | the | beginning | are | meant | to | draw | comparisons | which | leave | the | audience | xxunk | . | None | None |
| 5 | witness | xxmaj | brian | dealing | with | his | situation | through | first | , | primitive | means | , | and | then | None | None |
| 6 | film | , | or | not | . | \n | \n | xxmaj | this | film | . | xxmaj | film | ? | xxmaj | this | |
| 7 | xxunk | sitting | through | this | bomb | . | xxmaj | the | crew | member | who | was | in | charge | of | None | None |
| 8 | this | film | is | viewed | as | non | xxup | xxunk | but | there | is | a | speech | by | xxmaj | None | None |
| 9 | mention | the | pace | of | the | movie | . | xxmaj | to | my | mind | , | this | new | version | None | None |
| 10 | of | yours | ! | ' | \n | \n | xxmaj | director | xxmaj | xxunk | xxmaj | xxunk | , | who | is | xxunk | |
| 11 | pair | , | xxmaj | harry | xxmaj | michell | as | xxmaj | harry | , | xxmaj | rosie | xxmaj | michell | as | None | None |
| 12 | cares | who | lives | and | who | dies | , | i | 'll | be | shocked | . | xxmaj | the | same | None | None |
| 13 | is | incredibly | stupid | , | with | a | detective | trying | to | track | down | a | suspected | serial | killer | None | None |
| 14 | independent | film | was | one | of | the | best | films | at | the | tall | grass | film | festival | that | None | None |
jekyll_warn("If you are used to another convention, beware! fastai always uses batch as a first dimension, even in NLP.")
This is all done internally when we use TextLMDataBunch
, by wrapping the dataset in the following pre-loader before calling a DataLoader
.
show_doc(LanguageModelPreLoader)
class LanguageModelPreLoader(dataset:LabelList, lengths:Collection[int]=None, bs:int=32, bptt:int=70, backwards:bool=False, shuffle:bool=False) :: Callback
Transforms the tokens in dataset
to a stream of contiguous batches for language modelling.
LanguageModelPreLoader is an internal class used for training a language model. It takes the sentences passed as a jagged array of numericalised sentences in dataset and returns contiguous batches to the pytorch dataloader with batch size bs and a sequence length bptt.
lengths can be provided for the jagged training data; otherwise lengths is calculated internally
backwards=True will reverse the sentences
shuffle=True will shuffle the order of the sentences at the start of each epoch, except the first
The following description is useful for understanding the implementation of LanguageModelPreLoader:
idx: instance of CircularIndex that indexes items while taking the following into account: 1) shuffle, 2) direction of indexing, 3) wrapping around to the head (reading forward) or tail (reading backwards) of the ragged array as needed in order to fill the last batch(es)
ro: index of the first rag of each row in the batch to be extracted. Returned as the index of the next rag to be extracted
ri: reading forward: index of the first token to be extracted in the current rag (ro). Reading backwards: one position after the last token to be extracted in the rag
overlap: overlap between batches is 1, because we only predict the next token
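The batch layout described above (bs contiguous rows, windows of bptt tokens, targets shifted by one) can be sketched on plain lists. This is a deliberately simplified illustration and not the LanguageModelPreLoader implementation: it has no shuffling, no backwards mode and no wrap-around, and the function name `lm_batches` is hypothetical.

```python
from typing import Iterator, List, Tuple

Batch = Tuple[List[List[int]], List[List[int]]]

def lm_batches(texts: List[List[int]], bs: int, bptt: int) -> Iterator[Batch]:
    """Yield (x, y) batches of shape bs x (up to bptt), batch dimension first."""
    stream = [tok for text in texts for tok in text]
    # Tokens per row; one token is held back so every input has a target.
    n = (len(stream) - 1) // bs
    rows_x = [stream[r * n:(r + 1) * n] for r in range(bs)]
    rows_y = [stream[r * n + 1:(r + 1) * n + 1] for r in range(bs)]  # shifted by 1
    for start in range(0, n, bptt):
        x = [row[start:start + bptt] for row in rows_x]
        y = [row[start:start + bptt] for row in rows_y]
        yield x, y

# Two toy "texts" whose concatenation is the ids 0..12
texts = [list(range(0, 7)), list(range(7, 13))]
batches = list(lm_batches(texts, bs=2, bptt=3))
print(batches[0])  # ([[0, 1, 2], [6, 7, 8]], [[1, 2, 3], [7, 8, 9]])
print(batches[1])  # ([[3, 4, 5], [9, 10, 11]], [[4, 5, 6], [10, 11, 12]])
```

Each row continues where it stopped in the previous batch, which is what lets a recurrent model carry its hidden state from one batch to the next.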
When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don't all have the same length, we can't easily collate them together in batches. To help with this we use two different techniques:
padding: each text is padded with the PAD token to get all the ones we picked to the same size
sorting the texts (ish): to avoid batching a very long text with a very short one (which would then have a lot of PAD tokens), we regroup the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.
Here is an example of a batch with padding (the padding index is 1, and the padding is applied before the sentences start).
path = untar_data(URLs.IMDB_SAMPLE)
data = TextClasDataBunch.from_csv(path, 'texts.csv')
iter_dl = iter(data.train_dl)
_ = next(iter_dl)
x,y = next(iter_dl)
x[-10:,:20]
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')
This is all done internally when we use TextClasDataBunch
, by using the following classes:
show_doc(SortSampler)
This pytorch Sampler
is used for the validation and (if applicable) the test set.
show_doc(SortishSampler)
This pytorch Sampler
is generally used for the training set.
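The difference between sorting and "sortish" sampling can be illustrated with index lists. The sketch below is an assumption-laden stand-in for the two samplers, not their actual code: the longest-first order, the chunk size of four batches, and the function names are all choices made for this example.

```python
import random
from typing import List, Optional

def sort_sampler(lengths: List[int]) -> List[int]:
    """Indices ordered from the longest text to the shortest."""
    return sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

def sortish_sampler(lengths: List[int], bs: int,
                    seed: Optional[int] = None) -> List[int]:
    """Shuffle all indices, then sort by length inside chunks of a few batches."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)      # randomness: chunk membership changes per call
    chunk = bs * 4         # hypothetical chunk size of four batches
    out: List[int] = []
    for start in range(0, len(idxs), chunk):
        piece = idxs[start:start + chunk]
        out.extend(sorted(piece, key=lambda i: lengths[i], reverse=True))
    return out

lengths = [5, 120, 33, 8, 64, 250, 17, 90, 41, 12]
order = sort_sampler(lengths)
print([lengths[i] for i in order])  # [250, 120, 90, 64, 41, 33, 17, 12, 8, 5]
mixed = sortish_sampler(lengths, bs=1, seed=0)
print([lengths[i] for i in mixed])  # sorted within each chunk of 4 only
```

Texts of similar length still end up near each other in the sortish order, so padding stays low, while the composition of the batches varies between epochs.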
show_doc(pad_collate)
pad_collate(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True, backwards:bool=False) → Tuple[LongTensor, LongTensor]
Function that collects samples and adds padding. Flips token order if needed.
This will collate the samples
in batches while adding padding with pad_idx
. If pad_first=True
, padding is applied at the beginning (before the sentence starts) otherwise it's applied at the end.
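The effect of pad_idx and pad_first can be seen on plain lists of token ids. This is a minimal sketch of the padding step only, with a hypothetical function name `pad_batch`; it leaves out the labels, the tensor conversion and the backwards option that the real function handles.

```python
from typing import List

def pad_batch(samples: List[List[int]], pad_idx: int = 1,
              pad_first: bool = True) -> List[List[int]]:
    """Pad every sample with pad_idx up to the length of the longest one."""
    max_len = max(len(s) for s in samples)
    batch = []
    for s in samples:
        padding = [pad_idx] * (max_len - len(s))
        # pad_first=True puts the padding before the sentence starts
        batch.append(padding + list(s) if pad_first else list(s) + padding)
    return batch

samples = [[2, 45, 9], [2, 7], [2, 30, 31, 32, 5]]
print(pad_batch(samples))
# [[1, 1, 2, 45, 9], [1, 1, 1, 2, 7], [2, 30, 31, 32, 5]]
print(pad_batch(samples, pad_first=False))
# [[2, 45, 9, 1, 1], [2, 7, 1, 1, 1], [2, 30, 31, 32, 5]]
```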
show_doc(TextList.new)
new(items:Iterator[T_co], processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList
Create a new ItemList
from items
, keeping the same attributes.
show_doc(TextList.get)
get(i)
Subclass if you want to customize how to create item i
from self.items
.
show_doc(TokenizeProcessor.process_one)
process_one(item)
show_doc(TokenizeProcessor.process)
process(ds)
show_doc(OpenFileProcessor.process_one)
process_one(item)
show_doc(NumericalizeProcessor.process)
process(ds)
show_doc(NumericalizeProcessor.process_one)
process_one(item)
show_doc(TextList.reconstruct)
reconstruct(t:Tensor)
Reconstruct one of the underlying items from its data t.
show_doc(LanguageModelPreLoader.on_epoch_begin)
on_epoch_begin(**kwargs)
At the beginning of each epoch.
show_doc(LanguageModelPreLoader.on_epoch_end)
on_epoch_end(**kwargs)
Called at the end of an epoch.
show_doc(LMLabelList)
class LMLabelList(items:Iterator[T_co], **kwargs) :: EmptyLabelList
Basic ItemList
for dummy labels.
show_doc(LanguageModelPreLoader.allocate_buffers)
allocate_buffers()
Create the ragged array that will be filled when we ask for items.
show_doc(LanguageModelPreLoader.CircularIndex.shuffle)
shuffle()
show_doc(LanguageModelPreLoader.fill_row)
fill_row(forward, items, idx, row, ro, ri, overlap, lengths)
Fill the row with tokens from the ragged array. Note: overlap != 1 has not been implemented.