from fastai.gen_doc.nbdoc import *
from fastai.vision import *
from fastai import *
np.random.seed(42)
The data block API lets you customize how to create a DataBunch by isolating the underlying parts of that process in separate blocks, mainly:

- Where are the inputs?
- How to split them between training and validation sets?
- How to label them?
- What transforms to apply?
- How to add a test set?
- How to wrap everything in a DataBunch?

For each of those questions, you can have multiple possible blocks: your inputs might be in a folder, a csv file or a dataframe. You may want to split them randomly, by certain indexes, or depending on the folder they are in. Your labels might be in that csv file or dataframe, but they may also come from folder names or a specific function of the input. You may or may not have data augmentation to deal with. Or a test set. Finally, you have to set the arguments to put the data together in a DataBunch (batch size, collate function...)
The data block API is called as such because you can mix and match each one of those blocks with the others, allowing you total flexibility to create your customized DataBunch for training. The factory methods of the various DataBunch subclasses are great for beginners, but you can't always make your data fit in the tracks they require.
As usual, we'll begin with end-to-end examples, then switch to the details of each of those parts.
Let's begin with our traditional MNIST example.
path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
path.ls()
[PosixPath('/home/jhoward/.fastai/data/mnist_tiny/history.csv'), PosixPath('/home/jhoward/.fastai/data/mnist_tiny/valid'), PosixPath('/home/jhoward/.fastai/data/mnist_tiny/models'), PosixPath('/home/jhoward/.fastai/data/mnist_tiny/train'), PosixPath('/home/jhoward/.fastai/data/mnist_tiny/test'), PosixPath('/home/jhoward/.fastai/data/mnist_tiny/labels.csv')]
(path/'train').ls()
[PosixPath('/home/jhoward/.fastai/data/mnist_tiny/train/3'), PosixPath('/home/jhoward/.fastai/data/mnist_tiny/train/7')]
In vision.data, we create an easy DataBunch suitable for classification by simply typing:
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=24)
This is aimed at data laid out in ImageNet style, with a train and a valid directory, each containing one subdirectory per class, where all the pictures are. There is also a test set containing unlabelled pictures. With the data block API, we can group everything together like this:
data = (ImageItemList.from_folder(path) #Where to find the data? -> in path and its subfolders
.split_by_folder() #How to split in train/valid? -> use the folders
.label_from_folder() #How to label? -> depending on the folder of the filenames
.add_test_folder() #Optionally add a test set (here default name is test)
.transform(tfms, size=64) #Data augmentation? -> use tfms with a size of 64
.databunch()) #Finally? -> use the defaults for conversion to ImageDataBunch
data.show_batch(3, figsize=(6,6), hide_axis=False)
data.train_ds[0], data.test_ds.classes
((Image (3, 64, 64), Category 7), ['7', '3'])
Let's look at another example from vision.data with the planet dataset. This time, it's a multi-label classification problem with the labels in a csv file and no given split between valid and train data, so we use a random split. The factory method is:
planet = untar_data(URLs.PLANET_TINY)
planet_tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
data = ImageDataBunch.from_csv(planet, folder='train', size=128, suffix='.jpg', sep = ' ', ds_tfms=planet_tfms)
With the data block API, we can rewrite this like so:
data = (ImageItemList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')
#Where to find the data? -> in planet 'train' folder
.random_split_by_pct()
#How to split in train/valid? -> randomly with the default 20% in valid
.label_from_df(sep=' ')
#How to label? -> use the csv file
.transform(planet_tfms, size=128)
#Data augmentation? -> use tfms with a size of 128
.databunch())
#Finally -> use the defaults for conversion to databunch
data.show_batch(rows=2, figsize=(9,7))
The data block API also allows you to get your data together for problems for which there is no direct ImageDataBunch factory method. For a segmentation task, for instance, we can use it to quickly get a DataBunch. Let's take the example of the camvid dataset: the images are in an 'images' folder and their corresponding masks are in a 'labels' folder.
camvid = untar_data(URLs.CAMVID_TINY)
path_lbl = camvid/'labels'
path_img = camvid/'images'
We have a file that gives us the names of the classes (what each code inside the masks corresponds to: a pedestrian, a tree, a road...)
codes = np.loadtxt(camvid/'codes.txt', dtype=str); codes
array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole', 'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock', 'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone', 'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype='<U17')
And we define the following function that infers the mask filename from the image filename.
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'
Then we can easily define a DataBunch using the data block API. Here we need to use tfm_y=True in the transform call, because we need the same transforms to be applied to the target mask as were applied to the image.
data = (SegmentationItemList.from_folder(path_img)
.random_split_by_pct()
.label_from_func(get_y_fn, classes=codes)
.transform(get_transforms(), tfm_y=True, size=128)
.databunch())
data.show_batch(rows=2, figsize=(7,5))
Another example, for object detection. We use our tiny sample of the COCO dataset here. There is a helper function in the library that reads the annotation file and returns the list of image names along with the list of labelled bboxes associated with each one. We convert it to a dictionary that maps image names to their bboxes, and then write the function that will give us the target for each image filename.
coco = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o:img2bbox[o.name]
The following code is very similar to what we saw before. The only new addition is the use of a special function to collate the samples in batches. This comes from the fact that our images may have varying numbers of bounding boxes, so we need to pad them to the largest number of bounding boxes.
data = (ObjectItemList.from_folder(coco)
#Where are the images? -> in coco
.random_split_by_pct()
#How to split in train/valid? -> randomly with the default 20% in valid
.label_from_func(get_y_func)
#How to find the labels? -> use get_y_func
.transform(get_transforms(), tfm_y=True)
#Data augmentation? -> Standard transforms with tfm_y=True
.databunch(bs=16, collate_fn=bb_pad_collate))
#Finally we convert to a DataBunch and we use bb_pad_collate
data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))
But vision isn't the only application where the data block API works; it can also be used for text or tabular data. With our sample of the IMDB dataset (labelled texts in a csv file), here is how to get the data together for a language model.
from fastai.text import *
imdb = untar_data(URLs.IMDB_SAMPLE)
data_lm = (TextList.from_csv(imdb, 'texts.csv', cols='text')
#Where are the inputs? Column 'text' of this csv
.random_split_by_pct()
#How to split it? Randomly with the default 20%
.label_for_lm()
#Label it for a language model
.databunch())
data_lm.show_batch()
idx | text |
---|---|
0 | xxfld 1 old jane 's mannered tale seems very popular these days . i have lost count of the number of versions going around . probably the reason is that her " xxunk " are our " xxunk " even at this late date . this tv mini - series gives it a mannered telling suitable to the novel . xxunk , xxunk emma is a pretty " modern " girl when you think about it , even though the xxunk of jane austen 's world may seem a xxunk artificial to us today . |
1 | country - road music score from xxunk jones , amazing performances in two principal roles from robert blake and scott wilson and first time in a movie a sad comment about xxunk punishment at the last moments before their deaths . jones , hall and brooks ( as director and as writer for adapted screenplay ) are academy award xxunk . gripping , superbly directed and frightening , one of the best films of this decade xxfld 1 there were a lot of truly great horror movies produced in the seventies - but this film |
2 | sister xxunk , who pretty much steals the show . with absolutely beautiful xxunk , she sings several songs throughout the film , though i actually would have liked to have seen them feature her even more in this . the plot in this film is a bit silly , but nevertheless , i found the film to be entertaining and fun . xxfld 1 there 's something compelling and strangely believable about this episode . from the very beginning , an atmosphere of tension is created by the knowledge that a certain planet is |
3 | " xxunk " plot , this one has a xxunk mess of a story , with too many dull characters xxunk each other in the back so many times the potential for any sympathy or xxunk is xxunk . gone is the effective xxunk between the lead characters ; azumi and her xxunk are often reduced to a bunch of xxunk teenagers xxunk in a forest . xxunk is non existent ; if anyone watching actually cares who lives and who dies , i 'll be shocked . the same xxunk to the villains here |
4 | gary cooper as wild bill xxunk , with jean arthur as xxunk jane . james xxunk was buffalo bill , john xxunk ( not a villain as usual ) was general george a. xxunk , and anthony quinn was one of the indians who fought at little big xxunk . the villains were led by charles xxunk ( xxunk arms to the indians ) and porter hall as jack xxunk ( who killed wild bill xxunk ) . \n\n basically the film takes up the history of the u.s . after the civil war . |
5 | xxunk . where it all comes xxunk is in the script , which did n't do any better when it was called missing in action and starred xxunk xxunk . what little semblance of logic there was in the original is now gone , as the filmmakers decide to paint a big s on rambo 's massive chest . \n\n the film picks up a little while after the end of first blood . the film , that is - the novel did n't allow for the possibility of sequels . in this mediocre follow |
6 | the xxunk and the xxunk ' ( xxunk ) , ' 28 days later ' ( 2002 ) and its sequel , as well as many , many , others too numerous to mention . \n\n this one is not really a zombie film . judging this movie on its own terms , it 's more of a semi - gothic romance . as such it ranks a little below some of universal 's bottom billed b horror movies of the late 30s and early xxunk . so i 'll give it a 5 . |
7 | of xxup the xxup demon ) \n\n * spoiler * \n\n this was a drive - in feature , co - billed with xxup the xxup xxunk xxup vampire . a spanish - italian co - production where a series of women in a village are being murdered around the same time a local count named yanos xxunk is seen on xxunk , riding off with his ' man - eating ' dog behind him . \n\n the xxunk already suspect he is the one behind it all and want his castle burned down . |
8 | the visual than in the message . \n\n thus , you will find some funny scenes ( the first xxunk of the town , a " xxunk " xxunk xxunk ) and the casting is xxunk , with special mentions to " doc " , who xxunk in a " xxunk fly " character , and to xxunk , who seems open to xxunk - xxunk . \n\n ice on the cake : the main title is xxunk by danny xxunk , and like every other great xxunk , you recognize his " voice " |
9 | xxunk only very slightly by a little inept gore , a gratuitous rape scene , and loads of nudity . \n\n gorgeous blonde xxunk xxunk plays movie star laura xxunk who is abducted by a gang of ruthless xxunk and taken to a remote xxunk island inhabited by a savage xxunk who worship the ' devil god ' that xxunk in the jungle ( a big , naked , xxunk - xxunk native who likes to eat the hearts of xxunk female sacrifices ) . \n\n employed by laura 's agent to deliver a $ |
For a classification problem, we just have to change the way labelling is done. Here we use the column 'label' of our csv.
data_clas = (TextList.from_csv(imdb, 'texts.csv', cols='text')
.split_from_df(col='is_valid')
.label_from_df(cols='label')
.databunch())
data_clas.show_batch()
text | label |
---|---|
xxfld 1 raising victor vargas : a review \n\n you know , raising victor vargas is like sticking your hands into a big , xxunk bowl of xxunk . it 's warm and gooey , but you 're not sure if it feels right . try as i might , | negative |
xxfld 1 xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words | positive |
xxfld 1 now that che(2008 ) has finished its relatively short australian cinema run ( extremely limited xxunk screen in xxunk , after xxunk ) , i can xxunk join both xxunk of " at the movies " in taking steven soderbergh to task . \n\n it 's usually satisfying | negative |
xxfld 1 many neglect that this is n't just a classic due to the fact that it 's the first 3d game , or even the first xxunk - up . it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly | positive |
xxfld 1 i really wanted to love this show . i truly , honestly did . \n\n for the first time , gay viewers get their own version of the " the bachelor " . with the help of his obligatory " hag " xxunk , james , a good | negative |
from fastai.tabular import *
Lastly, for tabular data, we just have to pass the names of our categorical and continuous variables as an extra argument. We also add the PreProcessors that are going to be applied to our data once the splitting and labelling is done.
adult = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(adult/'adult.csv')
dep_var = '>=50k'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['education-num', 'hours-per-week', 'age', 'capital-loss', 'fnlwgt', 'capital-gain']
procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, path=adult, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(valid_idx=range(800,1000))
.label_from_df(cols=dep_var)
.databunch())
data.show_batch()
workclass | education | marital-status | occupation | relationship | race | sex | native-country | education-num_na | education-num | hours-per-week | age | capital-loss | fnlwgt | capital-gain | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Self-emp-inc | HS-grad | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States | False | -0.4224 | 2.3943 | 0.5434 | -0.2164 | -0.6319 | -0.1459 | 0 |
Local-gov | Masters | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | False | 1.5334 | -0.0356 | 1.5695 | -0.2164 | -0.4559 | -0.1459 | 1 |
Private | 10th | Separated | Other-service | Own-child | Black | Female | United-States | False | -1.5958 | -2.3036 | 0.0303 | -0.2164 | 1.2202 | -0.1459 | 0 |
Private | Bachelors | Never-married | Exec-managerial | Not-in-family | White | Male | United-States | False | 1.1422 | -0.0356 | -1.0692 | -0.2164 | -0.4714 | -0.1459 | 0 |
Private | Bachelors | Never-married | Machine-op-inspct | Not-in-family | White | Female | United-States | False | 1.1422 | -0.1166 | -0.3362 | -0.2164 | 2.3204 | -0.1459 | 0 |
Local-gov | Assoc-voc | Never-married | Other-service | Unmarried | Black | Female | United-States | False | 0.3599 | -0.6836 | -0.1163 | -0.2164 | 0.3157 | -0.1459 | 0 |
Federal-gov | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | Male | United-States | False | -0.4224 | -0.0356 | 0.3968 | -0.2164 | -0.0533 | -0.1459 | 1 |
Private | 11th | Married-civ-spouse | Craft-repair | Husband | White | Male | United-States | False | -1.2046 | -0.0356 | 0.8365 | -0.2164 | 0.7282 | -0.1459 | 0 |
The basic class to get your inputs into is the following one. It's also the same class that will contain all of your labels (hence the name ItemList).
show_doc(ItemList, title_level=3, doc_string=False)
class ItemList [source]
ItemList(items:Iterator, create_func:Callable=None, path:PathOrStr='.', label_cls:Callable=None, xtra:Any=None, processor:PreProcessor=None, **kwargs)
This class regroups the inputs for our model in items and saves a path attribute which is where it will look for any files (image files, csv file with labels...). create_func is applied to items to get the final output. label_cls will be called to create the labels from the result of the label function, xtra contains additional information (usually an underlying dataframe), and processor is to be applied to the inputs after the splitting and labelling.
It has multiple subclasses depending on the type of data you're handling. Here is a quick list:

- CategoryList for labels in classification
- MultiCategoryList for labels in a multi classification problem
- FloatList for float labels in a regression problem
- ImageItemList for data that are images
- SegmentationItemList like ImageItemList but will default labels to SegmentationLabelList
- SegmentationLabelList for segmentation masks
- ObjectItemList like ImageItemList but will default labels to ObjectLabelList
- ObjectLabelList for object detection
- PointsItemList for points (of the type ImagePoints)
- TextList for text data
- TextFilesList for text data stored in files
- TabularList for tabular data

Once you have selected the class that is suitable, you can instantiate it with one of the following factory methods.
show_doc(ItemList.from_folder)
from_folder [source]
from_folder(path:PathOrStr, extensions:StrList=None, recurse=True, **kwargs) → ItemList

Get the list of files in path that have a suffix in extensions. recurse determines if we search subfolders.
show_doc(ItemList.from_df)
show_doc(ItemList.from_csv)
The factory method may have grabbed too many items. For instance, if you were searching sub folders with the from_folder method, you may have gotten files you don't want. To remove those, you can use one of the following methods.
show_doc(ItemList.filter_by_func)
filter_by_func [source]
filter_by_func(func:Callable) → ItemList

Only keeps elements for which func returns True.
show_doc(ItemList.filter_by_folder)
filter_by_folder [source]
filter_by_folder(include=None, exclude=None)

Only keep filenames in include folders or reject the ones in exclude.
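For instance, reusing the MNIST path from above, a minimal sketch combining both filters (the extension and the excluded folder name are just illustrative):

il = (ImageItemList.from_folder(path)
        .filter_by_func(lambda fname: fname.suffix == '.png') #keep only png files
        .filter_by_folder(exclude=['models']))                #drop anything in a 'models' folder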
First check if you can't easily customize one of the existing subclasses by changing:

- create_func (example: opening images with your custom function and not open_image)
- processor (see step 4)
- label_cls for the label creation

If this isn't the case and you really need to write your own class, here is what you should code:
class MyCustomItemList(ItemList):
    #If you need custom arguments you will have to overwrite __init__ and new like this.
    def __init__(self, items:Iterator, my_args, **kwargs):
        super().__init__(items, **kwargs)
        #store my args, initialize what is needed
        self.my_args = my_args
    def new(self, items:Iterator, **kwargs)->'MyCustomItemList':
        #Retrieve your custom args stored and send them to new like this
        return super().new(items=items, my_args=self.my_args, **kwargs)
    def get(self, i):
        #This is how to get your data stored at index i
        o = super().get(i)
        return o #return what you need from o
You can add custom splitting or labelling methods if you need them.
show_doc(ItemList.predict)
predict [source]
predict(res)

Called at the end of Learn.predict; override for optional post-processing.
This step is normally straightforward: you just have to pick one of the following functions depending on what you need.
show_doc(ItemList.random_split_by_pct)
random_split_by_pct [source]
random_split_by_pct(valid_pct:float=0.2, seed:int=None) → ItemLists

Split the items randomly by putting valid_pct in the validation set. Sets the seed in numpy if passed.
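For instance, to make the random split reproducible, you can fix the seed (a quick sketch reusing the MNIST path from above):

sd = ImageItemList.from_folder(path).random_split_by_pct(valid_pct=0.1, seed=42) #10% validation, fixed numpy seed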
show_doc(ItemList.split_by_files)
split_by_files [source]
split_by_files(valid_names:ItemList) → ItemLists

Split the data by using the names in valid_names for validation.
show_doc(ItemList.split_by_fname_file)
split_by_fname_file [source]
split_by_fname_file(fname:PathOrStr, path:PathOrStr=None) → ItemLists

Split the data by using the file names in fname for the validation set. path will override self.path.
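For example, assuming a hypothetical file valid.txt that lists one validation filename per line:

sd = ImageItemList.from_folder(path).split_by_fname_file('valid.txt') #'valid.txt' is a hypothetical file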
show_doc(ItemList.split_by_folder)
split_by_folder [source]
split_by_folder(train:str='train', valid:str='valid') → ItemLists

Split the data depending on the folder (train or valid) in which the filenames are.
jekyll_note("This method looks at the folder immediately after `self.path` for `valid` and `train`.")
show_doc(ItemList.split_by_idx)
split_by_idx [source]
split_by_idx(valid_idx:Collection[int]) → ItemLists

Split the data according to the indexes in valid_idx.
show_doc(ItemList.split_by_idxs)
split_by_idxs [source]
split_by_idxs(train_idx, valid_idx)

Split the data between train_idx and valid_idx.
show_doc(ItemList.split_by_list)
show_doc(ItemList.split_by_valid_func)
split_by_valid_func [source]
split_by_valid_func(func:Callable) → ItemLists

Split the data by the result of func (which returns True for the validation set).
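For instance, if your validation files followed a hypothetical naming convention such as a 'val_' prefix, you could write:

sd = ImageItemList.from_folder(path).split_by_valid_func(lambda o: o.name.startswith('val_'))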
show_doc(ItemList.split_from_df)
split_from_df [source]
split_from_df(col:Union[int, Collection[int], str, StrList]=2)

Split the data from the col in the dataframe in self.xtra.
jekyll_warn("This method assumes the data has been created from a csv file or a dataframe.")
To label your inputs, use one of the following functions. Note that even if it's not in the documented arguments, you can always pass a label_cls that will be used to create those labels (the default is the one from your input ItemList, and if there is none, it will go to CategoryList, MultiCategoryList or FloatList depending on the type of the labels).
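For example, to force float labels for a regression problem instead of the default categories, you can pass label_cls=FloatList. A sketch, assuming a hypothetical csv with a numeric column:

data = (ImageItemList.from_csv(path, 'labels.csv') #hypothetical csv with a numeric label column
        .random_split_by_pct()
        .label_from_df(cols=1, label_cls=FloatList) #float labels instead of categories
        .databunch())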
show_doc(ItemList.label_from_list)
label_from_list [source]
label_from_list(labels:Iterator, **kwargs) → LabelList

Label self.items with labels using label_cls.
show_doc(ItemList.label_from_df)
label_from_df [source]
label_from_df(cols:Union[int, Collection[int], str, StrList]=1, **kwargs)

Label self.items from the values in cols in self.xtra.
jekyll_warn("This method assumes the data has been created from a csv file or a dataframe.")
show_doc(ItemList.label_const)
show_doc(ItemList.label_from_folder)
label_from_folder [source]
label_from_folder(**kwargs) → LabelList

Give a label to each filename depending on its folder.
jekyll_note("This method looks at the last subfolder in the path to determine the classes.")
show_doc(ItemList.label_from_func)
label_from_func [source]
label_from_func(func:Callable, **kwargs) → LabelList

Apply func to every input to get its label.
show_doc(ItemList.label_from_re)
label_from_re [source]
label_from_re(pat:str, full_path:bool=False, **kwargs) → LabelList

Apply the re in pat to determine the label of every filename. If full_path, search in the full name.
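For instance, with hypothetical filenames like 'great_pyrenees_12.jpg', the first capture group of the pattern becomes the label:

data = (ImageItemList.from_folder(path_pets) #path_pets is a hypothetical folder of such images
        .random_split_by_pct()
        .label_from_re(r'^(.+)_\d+\.jpg$'))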
show_doc(CategoryList, title_level=3)
class CategoryList [source]
CategoryList(items:Iterator, classes:Collection=None, processor:PreProcessor=None, **kwargs) :: CategoryListBase

ItemList suitable for storing labels in items belonging to classes. If None is passed, classes will be determined from the unique labels. processor will default to CategoryProcessor.
show_doc(MultiCategoryList, title_level=3)
class MultiCategoryList [source]
MultiCategoryList(items:Iterator, classes:Collection=None, processor:PreProcessor=None, sep:str=None, **kwargs) :: CategoryListBase

ItemList suitable for storing lists of labels in items belonging to classes. If None is passed, classes will be determined from the unique labels. sep is used to split the content of items into a list of labels.
show_doc(FloatList, title_level=3)
ItemList suitable for storing the floats in items for regression. Will add a log if this flag is True.
This isn't shown here in the API, but if you passed a processor (or a list of them) in your initial ItemList during step 1, it will be applied here. A processor is a transformation that is applied to all the inputs once and for all, with a state computed on the training set that is then applied without modification on the validation set (and maybe the test set). For instance, it can be processing texts to tokenize then numericalize them: in that case we want the validation set to be numericalized with exactly the same vocabulary as the training set.

Another example is in tabular data, where we fill missing values with (for instance) the median computed on the training set. That statistic is stored in the inner state of the PreProcessor and applied on the validation set.
This is the generic class for all processors.
show_doc(PreProcessor, title_level=3)
class PreProcessor [source]
PreProcessor()
show_doc(PreProcessor.process_one)
process_one [source]
process_one(item)

Process one item. This method needs to be written in any subclass.
show_doc(PreProcessor.process)
process [source]
process(ds:Collection)

Process a dataset. This defaults to applying process_one on every item of ds.
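To make this concrete, here is a minimal sketch (not a fastai built-in) of a custom processor: it computes its state, a median, the first time process is called (i.e. on the training set), then reuses it unchanged on the validation set:

class FillMedianProcessor(PreProcessor):
    #Replaces nan values by the median computed once on the training set (sketch only)
    def process(self, ds):
        if not hasattr(self, 'median'): #state computed on the first call, i.e. the training set
            self.median = np.nanmedian(np.array(ds.items, dtype=np.float64))
        super().process(ds) #applies process_one to every item of ds
    def process_one(self, item):
        return self.median if np.isnan(item) else item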
show_doc(CategoryProcessor, title_level=3)
class CategoryProcessor [source]
CategoryProcessor(classes:Collection=None) :: PreProcessor

PreProcessor that will convert labels to codes using classes (if passed) in a single classification problem.
show_doc(MultiCategoryProcessor, title_level=3)
class MultiCategoryProcessor [source]
MultiCategoryProcessor(classes:Collection=None) :: CategoryProcessor

PreProcessor that will convert labels to codes using classes (if passed) in a multi-classification problem.
Transforms differ from processors in the sense that they are applied on the fly when we grab one item, and they may change each time we ask for the same item in the case of random transforms.
show_doc(LabelLists.transform)
transform [source]
transform(tfms:Optional[Tuple[Union[Callable,Collection[Callable]],Union[Callable,Collection[Callable]]]]=(None, None), **kwargs)

Set tfms to be applied to the train and validation set.
This is primarily for the vision application. The kwargs are the ones expected by the type of transforms you pass. tfm_y is among them, and if set to True, the transforms will be applied to input and target.
To add a test set, you can use one of the two following methods.
show_doc(LabelLists.add_test)
add_test [source]
add_test(items:Iterator, label:Any=None)

Add a test set containing items from items and an arbitrary label.
jekyll_note("Here `items` can be an `ItemList` or a collection.")
show_doc(LabelLists.add_test_folder)
add_test_folder [source]
add_test_folder(test_folder:str='test', label:Any=None)

Add a test set containing items from folder test_folder and an arbitrary label.
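Reusing the MNIST example from the beginning, the two methods are equivalent here (a quick sketch):

ll = ImageItemList.from_folder(path).split_by_folder().label_from_folder()
ll.add_test_folder() #items from path/'test', the default folder name
#or, passing the items directly:
#ll.add_test(ImageItemList.from_folder(path/'test'))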
This last step is usually pretty straightforward. You just have to include all the arguments we pass to DataBunch.create (bs, num_workers, collate_fn). The class called to create a DataBunch is set in the _bunch attribute of the inputs of the training set if you need to modify it. Normally, the various subclasses we showed before handle that for you.
show_doc(LabelLists.databunch)
databunch [source]
databunch(path:PathOrStr=None, **kwargs) → ImageDataBunch

Create a DataBunch from self; path will override self.path, and kwargs are passed to DataBunch.create.
show_doc(LabelList, title_level=3, doc_string=False)
The basic dataset in fastai. Inputs are in x, targets in y. Optionally apply tfms to x, and also to y if tfm_y is True.
show_doc(LabelList.from_lists)
show_doc(ItemLists, doc_string=False, title_level=3)
Data in path split between several streams of inputs: train, valid and maybe test.
show_doc(ItemLists.label_from_lists)
label_from_lists [source]
label_from_lists(train_labels:Iterator, valid_labels:Iterator, label_cls:Callable=None, **kwargs) → LabelList

Use the labels in train_labels and valid_labels to label the data. label_cls will overwrite the default.
show_doc(LabelLists, title_level=3, doc_string=False)
show_doc(get_files)
get_files [source]
get_files(c:PathOrStr, extensions:StrList=None, recurse:bool=False) → FilePathList

Return list of files in c that have a suffix in extensions. recurse determines if we search subfolders.
show_doc(ItemList.get)
get [source]
get(i) → Any
show_doc(CategoryList.new)
new [source]
new(items, classes=None, **kwargs)
show_doc(ItemList.label_cls)
label_cls [source]
label_cls(labels, label_cls:Callable=None, sep:str=None, **kwargs)
show_doc(LabelLists.get_processors)
get_processors [source]
get_processors()
show_doc(LabelList.from_lists)
show_doc(LabelList.set_item)
set_item [source]
set_item(item)
show_doc(LabelList.new)
new [source]
new(x, y, **kwargs) → LabelList
show_doc(CategoryList.get)
get [source]
get(i)
show_doc(LabelList.predict)
predict [source]
predict(res)
show_doc(ItemList.new)
new [source]
new(items:Iterator, create_func:Callable=None, processor:PreProcessor=None, **kwargs) → ItemList
show_doc(LabelList.clear_item)
clear_item [source]
clear_item()
show_doc(ItemList.process_one)
process_one [source]
process_one(item, processor=None)
show_doc(ItemList.process)
process [source]
process(processor=None)
show_doc(LabelLists.process)
process [source]
process()
show_doc(CategoryList.predict)
predict [source]
predict(res)

Called at the end of Learn.predict; override for optional post-processing.
show_doc(ItemLists.transform)
transform [source]
transform(tfms:Optional[Tuple[Union[Callable,Collection[Callable]],Union[Callable,Collection[Callable]]]]=(None, None), **kwargs)

Set tfms to be applied to the train and validation set.
show_doc(LabelList.process)
show_doc(LabelList.transform)
transform [source]
transform(tfms:Union[Callable,Collection[Callable]], tfm_y:bool=None, **kwargs)

Set the tfms and tfm_y value to be applied to the inputs and targets.
show_doc(MultiCategoryProcessor.process_one)
process_one [source]
process_one(item)
show_doc(FloatList.get)
get [source]
get(i)
show_doc(CategoryProcessor.process_one)
process_one [source]
process_one(item)
show_doc(CategoryProcessor.create_classes)
create_classes [source]
create_classes(classes)
show_doc(CategoryProcessor.process)
process [source]
process(ds)
show_doc(MultiCategoryList.get)
get [source]
get(i)
show_doc(FloatList.new)
new [source]
new(items, **kwargs)
show_doc(MultiCategoryProcessor.generate_classes)
generate_classes [source]
generate_classes(items)
show_doc(CategoryProcessor.generate_classes)
generate_classes [source]
generate_classes(items)