The data block API

In [ ]:

from fastai.gen_doc.nbdoc import *
from fastai.basics import *
np.random.seed(42)

The data block API lets you customize the creation of a DataBunch by isolating the underlying parts of that process in separate blocks, mainly:

Where are the inputs and how to create them?
How to split the data into training and validation sets?
How to label the inputs?
What transforms to apply?
How to add a test set?
How to wrap in dataloaders and create the DataBunch?

Each of these may be addressed with a specific block designed for your unique setup. Your inputs might be in a folder, a csv file, or a dataframe. You may want to split them randomly, by certain indices or depending on the folder they are in. You can have your labels in your csv file or your dataframe, but they may also come from folders or a specific function of the input. You may choose to add data augmentation or not. A test set is optional too. Finally you have to set the arguments to put the data together in a DataBunch (batch size, collate function...).

The data block API gets its name from the fact that you can mix and match each one of those blocks with the others, giving you full flexibility to create your customized DataBunch for training, validation and testing. The factory methods of the various DataBunch classes are great for beginners but you can't always make your data fit in the tracks they require.

As usual, we'll begin with end-to-end examples, then switch to the details of each of those parts.

Examples of use

Let's begin with our traditional MNIST example.

In [ ]:

from fastai.vision import *

In [ ]:

path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
path.ls()

Out[ ]:

[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/export.pkl'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/cleaned.csv'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid')]

In [ ]:

(path/'train').ls()

Out[ ]:

[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/3'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/7')]

In vision.data, we can create a DataBunch suitable for image classification by simply typing:

In [ ]:

data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=64)

This is a shortcut method which is aimed at data that is in folders following an ImageNet style, with the train and valid directories, each containing one subdirectory per class, where all the labelled pictures are. There is also a test directory containing unlabelled pictures.

Here is the same code, but this time using the data block API, which can work with any style of dataset. All the stages, which will be explained below, can be grouped together like this:

In [ ]:

data = (ImageList.from_folder(path) #Where to find the data? -> in path and its subfolders
        .split_by_folder()              #How to split in train/valid? -> use the folders
        .label_from_folder()            #How to label? -> depending on the folder of the filenames
        .add_test_folder()              #Optionally add a test set (here default name is test)
        .transform(tfms, size=64)       #Data augmentation? -> use tfms with a size of 64
        .databunch())                   #Finally? -> use the defaults for conversion to ImageDataBunch

Now we can look at the created DataBunch:

In [ ]:

data.show_batch(3, figsize=(6,6), hide_axis=False)

Let's look at another example from vision.data with the planet dataset. This time, it's a multi-label classification problem with the labels in a csv file and no given split between valid and train data, so we use a random split. The factory method is:

In [ ]:

planet = untar_data(URLs.PLANET_TINY)
planet_tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)

In [ ]:

data = ImageDataBunch.from_csv(planet, folder='train', size=128, suffix='.jpg', label_delim = ' ', ds_tfms=planet_tfms)

With the data block API we can rewrite this as follows:

In [ ]:

data = (ImageList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')
        #Where to find the data? -> in planet 'train' folder
        .split_by_rand_pct()
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_df(label_delim=' ')
        #How to label? -> use the csv file
        .transform(planet_tfms, size=128)
        #Data augmentation? -> use tfms with a size of 128
        .databunch())                          
        #Finally -> use the defaults for conversion to databunch

In [ ]:

data.show_batch(rows=2, figsize=(9,7))

The data block API also allows you to get your data together in problems for which there is no direct ImageDataBunch factory method. For a segmentation task, for instance, we can use it to quickly get a DataBunch. Let's take the example of the camvid dataset. The images are in an 'images' folder and their corresponding masks are in a 'labels' folder.

In [ ]:

camvid = untar_data(URLs.CAMVID_TINY)
path_lbl = camvid/'labels'
path_img = camvid/'images'

We have a file that gives us the names of the classes (what each code inside the masks corresponds to: a pedestrian, a tree, a road...)

In [ ]:

codes = np.loadtxt(camvid/'codes.txt', dtype=str); codes

Out[ ]:

array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole',
       'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',
       'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',
       'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype='<U17')

And we define the following function that infers the mask filename from the image filename.

In [ ]:

get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'
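To make the naming convention concrete, here is a minimal standalone sketch of the same mapping using only pathlib. The folder and file names below are hypothetical placeholders chosen for illustration; they are not guaranteed to exist in the dataset.

```python
from pathlib import Path

# Hypothetical label folder, standing in for `path_lbl` above.
labels_dir = Path("camvid_tiny") / "labels"

def mask_path_for(image_path: Path) -> Path:
    """Return the mask path for an image: same stem plus '_P', same suffix."""
    return labels_dir / f"{image_path.stem}_P{image_path.suffix}"

# Hypothetical image file name, used only for illustration.
example = mask_path_for(Path("camvid_tiny/images/0001TP_006750.png"))
print(example.name)  # 0001TP_006750_P.png
```

The function only builds a path; it does not check that the mask file exists.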

Then we can easily define a DataBunch using the data block API. Here we need to use tfm_y=True in the transform call because we need the same transforms to be applied to the target mask as were applied to the image.

In [ ]:

data = (SegmentationItemList.from_folder(path_img)
        .split_by_rand_pct()
        .label_from_func(get_y_fn, classes=codes)
        .transform(get_transforms(), tfm_y=True, size=128)
        .databunch())

In [ ]:

data.show_batch(rows=2, figsize=(7,5))

Another example, this time for object detection. We use our tiny sample of the COCO dataset here. There is a helper function in the library that reads the annotation file and returns the list of image names along with the list of labelled bboxes associated with each of them. We convert it to a dictionary that maps image names to their bboxes and then write the function that will give us the target for each image filename.

In [ ]:

coco = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o:img2bbox[o.name]

The following code is very similar to what we saw before. The only new addition is the use of a special function to collate the samples in batches. This comes from the fact that our images may have multiple bounding boxes, so we need to pad them to the largest number of bounding boxes.

In [ ]:

data = (ObjectItemList.from_folder(coco)
        #Where are the images? -> in coco
        .split_by_rand_pct()                          
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_func(get_y_func)
        #How to find the labels? -> use get_y_func
        .transform(get_transforms(), tfm_y=True)
        #Data augmentation? -> Standard transforms with tfm_y=True
        .databunch(bs=16, collate_fn=bb_pad_collate))   
        #Finally we convert to a DataBunch and we use bb_pad_collate
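The padding idea behind the collate function can be pictured with a small plain-Python sketch. This is not the library's `bb_pad_collate`, which works on tensors; the helper name `pad_bboxes`, the zero padding value and the sample boxes are assumptions made for illustration.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]

def pad_bboxes(samples: List[List[BBox]], pad_value: float = 0.0) -> List[List[BBox]]:
    """Pad every list of boxes to the length of the longest list in the batch."""
    max_len = max(len(boxes) for boxes in samples)
    pad_box = (pad_value,) * 4
    return [list(boxes) + [pad_box] * (max_len - len(boxes)) for boxes in samples]

# Hypothetical batch: the first image has one box, the second has two.
batch = [
    [(0.1, 0.1, 0.5, 0.5)],
    [(0.0, 0.2, 0.3, 0.9), (0.4, 0.4, 0.8, 0.8)],
]
padded = pad_bboxes(batch)
print([len(boxes) for boxes in padded])  # [2, 2]
```

Once every sample has the same number of boxes, the batch can be stacked into a single rectangular array.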

In [ ]:

data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))

But vision isn't the only application where the data block API works. It can also be used for text and tabular data. With our sample of the IMDB dataset (labelled texts in a csv file), here is how to get the data together for a language model.

In [ ]:

from fastai.text import *

In [ ]:

imdb = untar_data(URLs.IMDB_SAMPLE)

In [ ]:

data_lm = (TextList.from_csv(imdb, 'texts.csv', cols='text')
           #Where are the inputs? Column 'text' of this csv
                   .split_by_rand_pct()
           #How to split it? Randomly with the default 20%
                   .label_for_lm()
           #Label it for a language model
                   .databunch())

In [ ]:

data_lm.show_batch()

idx	text
0	! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk ! xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is
1	us into the hearts of these two xxunk , and it is indeed a grand xxunk for the audience as well as the two principals . xxmaj the imagery throughout is impressive , especially the final scenes in xxmaj xxunk . xxmaj it xxunk for me once again how much different the world can be , but also at the same time , how similar . xxmaj the same was
2	acting xxunk this episode , with a touching performance by xxmaj xxunk xxmaj xxunk as a woman xxunk to the xxmaj ice xxmaj age , and xxmaj ian xxmaj wolfe as the xxunk xxmaj librarian . xxmaj somewhat reminiscent of the classic episode xxmaj city xxmaj on xxmaj the xxmaj edge of xxmaj forever , this time travel story is a rich and compelling finale to the series , which
3	it seems positively silly . i have no sympathy for people who have neglected to read one of the xxunk works in xxmaj english literature , so let 's get right to the chase . xxmaj the aliens are destroyed through catching an xxmaj earth disease , against which they have no xxunk . xxmaj if that 's a spoiler , so be it ; after a book and 3
4	.. and of course xxmaj andrew xxmaj davis directed it ... xxmaj xxunk xxmaj xxunk gives a great performance for his first film ... the storyline is very cool and interesting ... there 's humor , heart and intensity ... it is very similar to the book .. i find this film to be not the least bit boring ... i absolutely loved it ... and i encourage anyone to

For a classification problem, we just have to change the way labelling is done. Here we use the csv column label.

In [ ]:

data_clas = (TextList.from_csv(imdb, 'texts.csv', cols='text')
                   .split_from_df(col='is_valid')
                   .label_from_df(cols='label')
                   .databunch())

In [ ]:

data_clas.show_batch()

text	target
xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj	negative
xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with	positive
xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " xxmaj at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject ,	negative
xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' " xxmaj la xxmaj xxunk , "	positive
xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics	positive

Lastly, for tabular data, we just have to pass the names of our categorical and continuous variables as extra arguments. We also add some PreProcessors that are going to be applied to our data once the splitting and labelling are done.

In [ ]:

from fastai.tabular import *

In [ ]:

adult = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(adult/'adult.csv')
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['education-num', 'hours-per-week', 'age', 'capital-loss', 'fnlwgt', 'capital-gain']
procs = [FillMissing, Categorify, Normalize]

In [ ]:

data = (TabularList.from_df(df, path=adult, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(valid_idx=range(800,1000))
                           .label_from_df(cols=dep_var)
                           .databunch())

In [ ]:

data.show_batch()

workclass	education	marital-status	occupation	relationship	race	sex	native-country	education-num_na	education-num	hours-per-week	age	capital-loss	fnlwgt	capital-gain	target
Private	HS-grad	Never-married	Craft-repair	Unmarried	Asian-Pac-Islander	Male	Vietnam	False	-0.4224	-0.0356	-0.6294	-0.2164	0.7476	-0.1459	<50k
Private	9th	Married-civ-spouse	Farming-fishing	Wife	White	Female	United-States	False	-1.9869	0.1264	-0.5561	-0.2164	1.9847	-0.1459	<50k
Private	Some-college	Married-civ-spouse	Transport-moving	Husband	White	Male	United-States	False	-0.0312	-0.0356	0.3968	-0.2164	0.1973	-0.1459	<50k
Self-emp-not-inc	Bachelors	Married-civ-spouse	Prof-specialty	Husband	White	Male	United-States	False	1.1422	-0.0356	1.7894	-0.2164	-0.6119	-0.1459	>=50k
?	HS-grad	Never-married	?	Own-child	Other	Female	United-States	False	-0.4224	-0.0356	-1.5090	-0.2164	1.8018	-0.1459	<50k

Step 1: Provide inputs

The basic class to get your inputs into is the following one. It's also the same class that will contain all of your labels (hence the name ItemList).

In [ ]:

show_doc(ItemList, title_level=3)

`class` `ItemList`[source][test]

ItemList(items:Iterator[T_co], path:PathOrStr='.', label_cls:Callable=None, inner_df:Any=None, processor:Union[PreProcessor, Collection[PreProcessor]]=None, x:ItemList=None, ignore_empty:bool=False)

Tests found for ItemList:

Related tests:

pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]
pytest -sv tests/test_data_block.py::test_category [source]
pytest -sv tests/test_data_block.py::test_splitdata_datasets [source]
pytest -sv tests/test_data_block.py::test_split_subsets [source]
pytest -sv tests/test_data_block.py::test_regression [source]
pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_multi_category [source]

To run tests please refer to this guide.

A collection of items with __len__ and __getitem__ with ndarray indexing semantics.

This class regroups the inputs for our model in items and saves a path attribute which is where it will look for any files (image files, csv file with labels...). label_cls will be called to create the labels from the result of the label function, inner_df contains additional information (usually an underlying dataframe) and processor is to be applied to the inputs after the splitting and labelling.

It has multiple subclasses depending on the type of data you're handling. Here is a quick list:

CategoryList for labels in classification
MultiCategoryList for labels in a multi classification problem
FloatList for float labels in a regression problem
ImageList for data that are images
SegmentationItemList like ImageList but will default labels to SegmentationLabelList
SegmentationLabelList for segmentation masks
ObjectItemList like ImageList but will default labels to ObjectLabelList
ObjectLabelList for object detection
PointsItemList for points (of the type ImagePoints)
ImageImageList for image to image tasks
TextList for text data
TextList for text data stored in files
TabularList for tabular data
CollabList for collaborative filtering

Once you have selected the class that is suitable, you can instantiate it with one of the following factory methods

In [ ]:

show_doc(ItemList.from_folder)

`from_folder`[source][test]

from_folder(path:PathOrStr, extensions:StrList=None, recurse:bool=True, include:OptStrList=None, processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

Tests found for from_folder:

Related tests:

pytest -sv tests/test_data_block.py::test_wrong_order [source]

To run tests please refer to this guide.

Create an ItemList in path from the filenames that have a suffix in extensions. recurse determines if we search subfolders.

In [ ]:

show_doc(ItemList.from_df)

`from_df`[source][test]

from_df(df:DataFrame, path:PathOrStr='.', cols:IntsOrStrs=0, processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

Tests found for from_df:

Related tests:

pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]
pytest -sv tests/test_data_block.py::test_category [source]
pytest -sv tests/test_data_block.py::test_regression [source]
pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_multi_category [source]

To run tests please refer to this guide.

Create an ItemList in path from the inputs in the cols of df.

In [ ]:

show_doc(ItemList.from_csv)

`from_csv`[source][test]

from_csv(path:PathOrStr, csv_name:str, cols:IntsOrStrs=0, delimiter:str=None, header:str='infer', processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

No tests found for from_csv. To contribute a test please refer to this guide and this discussion.

Create an ItemList in path from the inputs in the cols of path/csv_name

Optional step: filter your data

The factory method may have grabbed too many items. For instance, if you were searching sub folders with the from_folder method, you may have gotten files you don't want. To remove those, you can use one of the following methods.

In [ ]:

show_doc(ItemList.filter_by_func)

`filter_by_func`[source][test]

filter_by_func(func:Callable) → ItemList

No tests found for filter_by_func. To contribute a test please refer to this guide and this discussion.

Only keep elements for which func returns True.

In [ ]:

show_doc(ItemList.filter_by_folder)

`filter_by_folder`[source][test]

filter_by_folder(include=None, exclude=None)

No tests found for filter_by_folder. To contribute a test please refer to this guide and this discussion.

Only keep filenames in include folder or reject the ones in exclude.
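As a rough illustration of this kind of filtering, the sketch below keeps or rejects paths based on the first folder below a base directory. It is a plain-Python approximation, not the library code: the helper name, the choice of comparing the first folder under the base path, and the example paths are all assumptions.

```python
from pathlib import Path
from typing import Iterable, List, Optional

def filter_by_top_folder(paths: Iterable[Path], base: Path,
                         include: Optional[List[str]] = None,
                         exclude: Optional[List[str]] = None) -> List[Path]:
    """Keep paths whose first folder under `base` is in `include` and not in `exclude`."""
    kept = []
    for p in paths:
        top = p.relative_to(base).parts[0]  # first folder below the base path
        if include and top not in include:
            continue
        if exclude and top in exclude:
            continue
        kept.append(p)
    return kept

base = Path("data")
# Hypothetical file listing.
files = [base / "train" / "3" / "a.png",
         base / "valid" / "7" / "b.png",
         base / "models" / "tmp.png"]
kept_inc = filter_by_top_folder(files, base, include=["train", "valid"])
kept_exc = filter_by_top_folder(files, base, exclude=["models"])
print(len(kept_inc), len(kept_exc))  # 2 2
```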

In [ ]:

show_doc(ItemList.filter_by_rand)

`filter_by_rand`[source][test]

filter_by_rand(p:float, seed:int=None)

No tests found for filter_by_rand. To contribute a test please refer to this guide and this discussion.

Keep random sample of items with probability p and an optional seed.

In [ ]:

show_doc(ItemList.to_text)

`to_text`[source][test]

to_text(fn:str)

No tests found for to_text. To contribute a test please refer to this guide and this discussion.

Save self.items to fn in self.path.

In [ ]:

show_doc(ItemList.use_partial_data)

`use_partial_data`[source][test]

use_partial_data(sample_pct:float=0.01, seed:int=None) → ItemList

No tests found for use_partial_data. To contribute a test please refer to this guide and this discussion.

Use only a sample of sample_pct of the full dataset and an optional seed.

Writing your own `ItemList`

First check whether you can easily customize one of the existing subclasses by:

subclassing an existing one and replacing the get method (or the open method if you're dealing with images)
applying a custom processor (see step 4)
changing the default label_cls for the label creation
adding a default PreProcessor with the _processor class variable

If this isn't the case and you really need to write your own class, there is a full tutorial that explains how to proceed.

In [ ]:

show_doc(ItemList.analyze_pred)

`analyze_pred`[source][test]

analyze_pred(pred:Tensor)

No tests found for analyze_pred. To contribute a test please refer to this guide and this discussion.

Called on pred before reconstruct for additional preprocessing.

In [ ]:

show_doc(ItemList.get)

`get`[source][test]

get(i) → Any

No tests found for get. To contribute a test please refer to this guide and this discussion.

Subclass if you want to customize how to create item i from self.items.

In [ ]:

show_doc(ItemList.new)

`new`[source][test]

new(items:Iterator[T_co], processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

No tests found for new. To contribute a test please refer to this guide and this discussion.

Create a new ItemList from items, keeping the same attributes.

You'll normally never need to subclass this; just don't forget to add to self.copy_new, in __init__, the names of the arguments that need to be copied each time new is called.

In [ ]:

show_doc(ItemList.reconstruct)

`reconstruct`[source][test]

reconstruct(t:Tensor, x:Tensor=None)

No tests found for reconstruct. To contribute a test please refer to this guide and this discussion.

Reconstruct one of the underlying items from its data t.

Step 2: Split the data between the training and the validation set

This step is normally straightforward: you just have to pick one of the following functions depending on what you need.

In [ ]:

show_doc(ItemList.split_none)

`split_none`[source][test]

split_none()

No tests found for split_none. To contribute a test please refer to this guide and this discussion.

Don't split the data and create an empty validation set.

In [ ]:

show_doc(ItemList.split_by_rand_pct)

`split_by_rand_pct`[source][test]

split_by_rand_pct(valid_pct:float=0.2, seed:int=None) → ItemLists

Tests found for split_by_rand_pct:

pytest -sv tests/test_data_block.py::test_splitdata_datasets [source]

Related tests:

pytest -sv tests/test_data_block.py::test_regression [source]

To run tests please refer to this guide.

Split the items randomly by putting valid_pct in the validation set, optional seed can be passed.
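The behaviour described above can be sketched in a few lines of standard-library Python. This is an illustration of a seeded random split, not the library's implementation; the function name and the file names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

def split_by_rand_pct_sketch(items: Sequence[str], valid_pct: float = 0.2,
                             seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Shuffle the indices, then send the first `valid_pct` of them to validation."""
    rng = random.Random(seed)        # a local generator keeps the split reproducible
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(items))
    valid_idx = set(idxs[:cut])
    train = [x for i, x in enumerate(items) if i not in valid_idx]
    valid = [x for i, x in enumerate(items) if i in valid_idx]
    return train, valid

items = [f"img_{i}.png" for i in range(10)]
train, valid = split_by_rand_pct_sketch(items, valid_pct=0.2, seed=42)
print(len(train), len(valid))  # 8 2
```

Passing the same seed again yields the same split, which is useful when you want to compare runs on identical validation data.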

In [ ]:

show_doc(ItemList.split_subsets)

`split_subsets`[source][test]

split_subsets(train_size:float, valid_size:float, seed=None) → ItemLists

Tests found for split_subsets:

pytest -sv tests/test_data_block.py::test_split_subsets [source]

To run tests please refer to this guide.

Split the items into train set with size train_size * n and valid set with size valid_size * n.

This function is handy if you want to work with subsets of specific sizes, e.g., you want to use 20% of the data for the validation dataset, but you only want to train on a small subset of the rest of the data: split_subsets(train_size=0.08, valid_size=0.2).
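A plain-Python sketch of this kind of subset split, using the same sizes as the example above, might look as follows. It is an approximation for illustration only: the function name is hypothetical, and taking the training items from the start of a shuffled index list and the validation items from its end is an assumption about how the two subsets are kept disjoint.

```python
import random
from typing import List, Optional, Sequence, Tuple

def split_subsets_sketch(items: Sequence[int], train_size: float, valid_size: float,
                         seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Take train_size * n items for training and valid_size * n for validation."""
    assert 0 < train_size < 1 and 0 < valid_size < 1
    assert train_size + valid_size <= 1
    n = len(items)
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)
    train_cut, valid_cut = int(train_size * n), int(valid_size * n)
    train = [items[i] for i in idxs[:train_cut]]      # front of the shuffled list
    valid = [items[i] for i in idxs[n - valid_cut:]]  # back of the shuffled list
    return train, valid

items = list(range(100))
train, valid = split_subsets_sketch(items, train_size=0.08, valid_size=0.2, seed=0)
print(len(train), len(valid))  # 8 20
```

The remaining 72 items are simply left out of both sets.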

In [ ]:

show_doc(ItemList.split_by_files)

`split_by_files`[source][test]

split_by_files(valid_names:ItemList) → ItemLists

No tests found for split_by_files. To contribute a test please refer to this guide and this discussion.

Split the data by using the names in valid_names for validation.

In [ ]:

show_doc(ItemList.split_by_fname_file)

`split_by_fname_file`[source][test]

split_by_fname_file(fname:PathOrStr, path:PathOrStr=None) → ItemLists

No tests found for split_by_fname_file. To contribute a test please refer to this guide and this discussion.

Split the data by using the names in fname for the validation set. path will override self.path.

In [ ]:

show_doc(ItemList.split_by_folder)

`split_by_folder`[source][test]

split_by_folder(train:str='train', valid:str='valid') → ItemLists

Tests found for split_by_folder:

Related tests:

pytest -sv tests/test_data_block.py::test_wrong_order [source]

To run tests please refer to this guide.

Split the data depending on the folder (train or valid) in which the filenames are.

In [ ]:

jekyll_note("This method looks at the folder immediately after `self.path` for `valid` and `train`.")

Note: This method looks at the folder immediately after `self.path` for `valid` and `train`.
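To illustrate what "the folder immediately after the path" means, here is a small pathlib sketch. It is not the library code; the function name and the example files are hypothetical, and items that sit in neither folder (a test folder, for example) are assumed to be left out of both sets.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

def split_by_folder_sketch(paths: Iterable[Path], base: Path,
                           train: str = "train", valid: str = "valid"
                           ) -> Tuple[List[Path], List[Path]]:
    """Assign each path to train or valid from the first folder below `base`."""
    train_items, valid_items = [], []
    for p in paths:
        top = p.relative_to(base).parts[0]
        if top == train:
            train_items.append(p)
        elif top == valid:
            valid_items.append(p)
    return train_items, valid_items

base = Path("mnist_tiny")
# Hypothetical file listing.
files = [base / "train" / "3" / "1.png",
         base / "valid" / "7" / "2.png",
         base / "test" / "3.png"]
train_items, valid_items = split_by_folder_sketch(files, base)
print(len(train_items), len(valid_items))  # 1 1
```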

In [ ]:

show_doc(ItemList.split_by_idx)

`split_by_idx`[source][test]

split_by_idx(valid_idx:Collection[int]) → ItemLists

Tests found for split_by_idx:

Related tests:

pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]
pytest -sv tests/test_data_block.py::test_category [source]
pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_multi_category [source]

To run tests please refer to this guide.

Split the data according to the indexes in valid_idx.

In [ ]:

show_doc(ItemList.split_by_idxs)

`split_by_idxs`[source][test]

split_by_idxs(train_idx, valid_idx)

No tests found for split_by_idxs. To contribute a test please refer to this guide and this discussion.

Split the data between train_idx and valid_idx.

In [ ]:

show_doc(ItemList.split_by_list)

`split_by_list`[source][test]

split_by_list(train, valid)

No tests found for split_by_list. To contribute a test please refer to this guide and this discussion.

Split the data between train and valid.

In [ ]:

show_doc(ItemList.split_by_valid_func)

`split_by_valid_func`[source][test]

split_by_valid_func(func:Callable) → ItemLists

No tests found for split_by_valid_func. To contribute a test please refer to this guide and this discussion.

Split the data by result of func (which returns True for validation set).

In [ ]:

show_doc(ItemList.split_from_df)

`split_from_df`[source][test]

split_from_df(col:IntsOrStrs=2)

No tests found for split_from_df. To contribute a test please refer to this guide and this discussion.

Split the data from the col in the dataframe in self.inner_df.

In [ ]:

jekyll_warn("This method assumes the data has been created from a csv file or a dataframe.")

Warning: This method assumes the data has been created from a csv file or a dataframe.
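The text classification example earlier used a column named is_valid for this purpose. The sketch below shows the idea with the standard csv module, assuming the column holds a True/False flag per row; the csv content is made up for illustration.

```python
import csv
import io

# Hypothetical csv content with a boolean flag column marking validation rows.
csv_text = """text,label,is_valid
great film,positive,False
dull plot,negative,True
loved it,positive,False
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Rows flagged True go to validation, all the others go to training.
valid_idx = [i for i, row in enumerate(rows) if row["is_valid"] == "True"]
train_idx = [i for i in range(len(rows)) if i not in valid_idx]
print(train_idx, valid_idx)  # [0, 2] [1]
```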

Step 3: Label the inputs

To label your inputs, use one of the following functions. Note that even if it's not in the documented arguments, you can always pass a label_cls that will be used to create those labels (the default is the one from your input ItemList, and if there is none, it will go to CategoryList, MultiCategoryList or FloatList depending on the type of the labels). This is implemented in the following function:

In [ ]:

show_doc(ItemList.get_label_cls)

`get_label_cls`[source][test]

get_label_cls(labels, label_cls:Callable=None, label_delim:str=None, **kwargs)

No tests found for get_label_cls. To contribute a test please refer to this guide and this discussion.

Return label_cls or guess one from the first element of labels.
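The guessing step can be approximated with a short type check on the first label. The sketch below returns class names as strings so that it runs without the library; the exact order of the checks and the ItemList fallback are assumptions, not a transcription of the library code.

```python
from collections.abc import Collection
from typing import Any, Optional, Sequence

def guess_label_kind(labels: Sequence[Any], label_delim: Optional[str] = None) -> str:
    """Guess a label class name from the first label (illustrative approximation)."""
    if label_delim is not None:
        return "MultiCategoryList"      # delimited strings hold several tags
    first = labels[0]
    if isinstance(first, float):
        return "FloatList"              # regression targets
    if isinstance(first, (int, str)):
        return "CategoryList"           # one class per item
    if isinstance(first, Collection):
        return "MultiCategoryList"      # a list of tags per item
    return "ItemList"                   # assumed fallback

print(guess_label_kind(["3", "7"]))                       # CategoryList
print(guess_label_kind([0.5, 1.2]))                       # FloatList
print(guess_label_kind([["clear", "road"]]))              # MultiCategoryList
print(guess_label_kind(["clear road"], label_delim=" "))  # MultiCategoryList
```

Note that strings are tested before generic collections, since a str is itself a collection of characters.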

The first example in these docs created labels as follows:

In [ ]:

path = untar_data(URLs.MNIST_TINY)
ll = ImageList.from_folder(path).split_by_folder().label_from_folder().train

If you want to save the data necessary to recreate your LabelList (not including saving the actual image/text/etc files), you can use to_df or to_csv:

ll.to_csv('tmp.csv')

Or just grab a pd.DataFrame directly:

In [ ]:

ll.to_df().head()

Out[ ]:

	x	y
0	train/3/9932.png	3
1	train/3/7189.png	3
2	train/3/8498.png	3
3	train/3/8888.png	3
4	train/3/9004.png	3

In [ ]:

show_doc(ItemList.label_empty)

`label_empty`[source][test]

label_empty(**kwargs)

No tests found for label_empty. To contribute a test please refer to this guide and this discussion.

Label every item with an EmptyLabel.

In [ ]:

show_doc(ItemList.label_from_df)

`label_from_df`[source][test]

label_from_df(cols:IntsOrStrs=1, label_cls:Callable=None, **kwargs)

Tests found for label_from_df:

Related tests:

pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]
pytest -sv tests/test_data_block.py::test_category [source]
pytest -sv tests/test_data_block.py::test_regression [source]
pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_multi_category [source]

To run tests please refer to this guide.

Label self.items from the values in cols in self.inner_df.

In [ ]:

jekyll_warn("This method only works with data objects created with either `from_csv` or `from_df` methods.")

Warning: This method only works with data objects created with either `from_csv` or `from_df` methods.

In [ ]:

show_doc(ItemList.label_const)

`label_const`[source][test]

label_const(const:Any=0, label_cls:Callable=None, **kwargs) → LabelList

Tests found for label_const:

Related tests:

pytest -sv tests/test_data_block.py::test_splitdata_datasets [source]
pytest -sv tests/test_data_block.py::test_split_subsets [source]

To run tests please refer to this guide.

Label every item with const.

In [ ]:

show_doc(ItemList.label_from_folder)

`label_from_folder`[source][test]

label_from_folder(label_cls:Callable=None, **kwargs) → LabelList

Tests found for label_from_folder:

pytest -sv tests/test_text_data.py::test_from_folder [source]
pytest -sv tests/test_text_data.py::test_filter_classes [source]

Related tests:

pytest -sv tests/test_data_block.py::test_wrong_order [source]

To run tests please refer to this guide.

Give a label to each filename depending on its folder.

In [ ]:

jekyll_note("This method looks at the last subfolder in the path to determine the classes.")

Note: This method looks at the last subfolder in the path to determine the classes.
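In pathlib terms, "the last subfolder in the path" is the name of the file's parent directory. A minimal sketch, with a hypothetical helper name and an illustrative second file name:

```python
from pathlib import Path

def label_from_parent_folder(path: Path) -> str:
    """Use the name of the folder that directly contains the file as its label."""
    return path.parent.name

# The second file name is made up for illustration.
files = [Path("train/3/9932.png"), Path("train/7/7189.png")]
labels = [label_from_parent_folder(f) for f in files]
print(labels)  # ['3', '7']
```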

In [ ]:

show_doc(ItemList.label_from_func)

`label_from_func`[source][test]

label_from_func(func:Callable, label_cls:Callable=None, **kwargs) → LabelList

No tests found for label_from_func. To contribute a test please refer to this guide and this discussion.

Apply func to every input to get its label.

In [ ]:

show_doc(ItemList.label_from_re)

`label_from_re`[source][test]

label_from_re(pat:str, full_path:bool=False, label_cls:Callable=None, **kwargs) → LabelList

No tests found for label_from_re. To contribute a test please refer to this guide and this discussion.

Apply the re in pat to determine the label of every filename. If full_path, search in the full name.
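As an illustration of regex-based labelling, the sketch below extracts a label from each filename with the standard re module. It assumes the label is the first capture group of the pattern; the helper name, the file names and the pattern are hypothetical.

```python
import re
from typing import List, Sequence

def label_from_re_sketch(names: Sequence[str], pat: str) -> List[str]:
    """Return the first capture group of `pat` for every name."""
    regex = re.compile(pat)
    labels = []
    for name in names:
        match = regex.search(name)
        if match is None:
            raise ValueError(f"pattern {pat!r} did not match {name!r}")
        labels.append(match.group(1))
    return labels

# Hypothetical file names of the form <label>_<number>.jpg
names = ["great_pyrenees_12.jpg", "beagle_3.jpg"]
labels = label_from_re_sketch(names, r"^(.*)_\d+\.jpg$")
print(labels)  # ['great_pyrenees', 'beagle']
```

Raising on a failed match makes a wrong pattern visible early instead of silently producing missing labels.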

In [ ]:

show_doc(CategoryList, title_level=3)

`class` `CategoryList`[source][test]

CategoryList(items:Iterator[T_co], classes:Collection[T_co]=None, label_delim:str=None, **kwargs) :: CategoryListBase

Basic ItemList for single classification labels.

ItemList suitable for storing labels in items belonging to classes. If None are passed, classes will be determined by the unique different labels. processor will default to CategoryProcessor.

In [ ]:

show_doc(MultiCategoryList, title_level=3)

`class` `MultiCategoryList`[source][test]

MultiCategoryList(items:Iterator[T_co], classes:Collection[T_co]=None, label_delim:str=None, one_hot:bool=False, **kwargs) :: CategoryListBase

No tests found for MultiCategoryList. To contribute a test please refer to this guide and this discussion.

Basic ItemList for multi-classification labels.

It will store a list of labels in items belonging to classes. If None are passed, classes will be determined by the unique different labels. label_delim is used to split the content of items into a list of tags.

If one_hot=True, the items contain the labels one-hot encoded. In this case, it is mandatory to pass a list of classes (as we can't use the different labels).
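The following standalone sketch illustrates the two ideas above: splitting raw label strings with a delimiter, and one-hot encoding the resulting tags over a list of classes. The tag strings and the helper `one_hot` are hypothetical, and this is not the library's implementation.

```python
label_delim = ' '  # assumed delimiter between tags
raw_labels = ['haze primary', 'clear primary water', 'clear']

# Split each raw string into a list of tags.
tags = [s.split(label_delim) for s in raw_labels]

# With no classes passed, fall back to the sorted unique tags.
classes = sorted({t for ts in tags for t in ts})

def one_hot(ts, classes):
    "Encode a list of tags as a 0/1 vector over `classes`."
    return [1 if c in ts else 0 for c in classes]

encoded = [one_hot(ts, classes) for ts in tags]
print(classes)
print(encoded)
```

Once the labels are stored in the encoded form, the names in `classes` can no longer be inferred from them, which is why the list has to be supplied explicitly in that case.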

In [ ]:

show_doc(FloatList, title_level=3)

`class` `FloatList`[source][test]

FloatList(items:Iterator[T_co], log:bool=False, classes:Collection[T_co]=None, **kwargs) :: ItemList

No tests found for FloatList. To contribute a test please refer to this guide and this discussion.

ItemList suitable for storing the floats in items for regression. If log is True, the logarithm of each value is taken.

In [ ]:

show_doc(EmptyLabelList, title_level=3)

`class` `EmptyLabelList`[source][test]

EmptyLabelList(items:Iterator[T_co], path:PathOrStr='.', label_cls:Callable=None, inner_df:Any=None, processor:Union[PreProcessor, Collection[PreProcessor]]=None, x:ItemList=None, ignore_empty:bool=False) :: ItemList

No tests found for EmptyLabelList. To contribute a test please refer to this guide and this discussion.

Basic ItemList for dummy labels.

Invisible step: preprocessing¶

This isn't seen here in the API, but if you passed a processor (or a list of them) in your initial ItemList during step 1, it will be applied here. If you didn't pass any processor, a list of them might still be created depending on what is in the _processor variable of your class of items (this can be a list of PreProcessor classes).

A processor is a transformation that is applied to all the inputs once at initialization, with a state computed on the training set that is then applied without modification on the validation set (and maybe the test set). For instance, it can be processing texts to tokenize then numericalize them. In that case we want the validation set to be numericalized with exactly the same vocabulary as the training set.

Another example is in tabular data, where we fill missing values with (for instance) the median computed on the training set. That statistic is stored in the inner state of the PreProcessor and applied on the validation set.
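As a rough illustration of a state computed on the training set and reused unchanged on the validation set, here is a standalone sketch of a median-filling processor. The class `FillMedian` and its interface are hypothetical and much simpler than a real PreProcessor.

```python
from statistics import median

class FillMedian:
    "Hypothetical processor: replace missing values (None) with a stored median."
    def __init__(self):
        self.fill = None  # state, computed on the first dataset processed

    def process(self, values):
        if self.fill is None:
            # First call (the training set): compute and store the statistic.
            self.fill = median(v for v in values if v is not None)
        # Every call: apply the stored statistic without modifying it.
        return [self.fill if v is None else v for v in values]

proc = FillMedian()
train = proc.process([1.0, None, 3.0, 10.0])  # stores the training median
valid = proc.process([None, 100.0])           # reuses it on the validation set
print(train, valid)
```

Note that the missing validation value is filled with the training median, not with a statistic of the validation set itself.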

This is the generic class for all processors.

In [ ]:

show_doc(PreProcessor, title_level=3)

`class` `PreProcessor`[source][test]

PreProcessor(ds:Collection[T_co]=None)

No tests found for PreProcessor. To contribute a test please refer to this guide and this discussion.

Basic class for a processor that will be applied to items at the end of the data block API.

In [ ]:

show_doc(PreProcessor.process_one)

`process_one`[source][test]

process_one(item:Any)

Tests found for process_one:

Related tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

Process one item. This method needs to be written in any subclass.

In [ ]:

show_doc(PreProcessor.process)

`process`[source][test]

process(ds:Collection[T_co])

Tests found for process:

Direct tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

Process a dataset. By default, this applies process_one to every item of ds.

In [ ]:

show_doc(CategoryProcessor, title_level=3)

`class` `CategoryProcessor`[source][test]

CategoryProcessor(ds:ItemList) :: PreProcessor

No tests found for CategoryProcessor. To contribute a test please refer to this guide and this discussion.

PreProcessor that creates classes from ds.items and handles the mapping.

In [ ]:

show_doc(CategoryProcessor.generate_classes)

`generate_classes`[source][test]

generate_classes(items)

No tests found for generate_classes. To contribute a test please refer to this guide and this discussion.

Generate classes from items by taking the sorted unique values.
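A standalone sketch of this step might look as follows. The helper name `sketch_generate_classes`, the mapping name `c2i` and the label values are assumptions made for illustration.

```python
def sketch_generate_classes(items):
    "Sketch: classes are the sorted unique values found in `items`."
    return sorted(set(items))

train_labels = ['7', '3', '7', '3', '3']  # illustrative labels
classes = sketch_generate_classes(train_labels)

# Mapping from class to index, used to encode the labels as integers.
c2i = {c: i for i, c in enumerate(classes)}
encoded = [c2i[l] for l in train_labels]
print(classes, encoded)
```

Sorting makes the class order independent of the order in which the labels happen to appear in the data.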

In [ ]:

show_doc(MultiCategoryProcessor, title_level=3)

`class` `MultiCategoryProcessor`[source][test]

MultiCategoryProcessor(ds:ItemList, one_hot:bool=False) :: CategoryProcessor

No tests found for MultiCategoryProcessor. To contribute a test please refer to this guide and this discussion.

PreProcessor that creates classes from ds.items and handles the mapping.

In [ ]:

show_doc(MultiCategoryProcessor.generate_classes)

`generate_classes`[source][test]

generate_classes(items)

No tests found for generate_classes. To contribute a test please refer to this guide and this discussion.

Generate classes from items by taking the sorted unique values.

Optional steps¶

Add transforms¶

Transforms differ from processors in that they are applied on the fly each time we grab an item. With random transforms, the result may also change each time we ask for the same item.
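The difference can be sketched with a small standalone example in which the transform runs inside the item lookup, leaving the stored items untouched. The class `TransformedList` and the transform `random_flip` are hypothetical stand-ins, not fastai classes.

```python
import random

class TransformedList:
    "Hypothetical dataset that applies transforms each time an item is fetched."
    def __init__(self, items, tfms):
        self.items, self.tfms = items, tfms

    def __getitem__(self, i):
        x = self.items[i]
        for tfm in self.tfms:
            x = tfm(x)  # applied on the fly, at access time
        return x

def random_flip(x):
    "Reverse the sequence with probability 0.5."
    return x[::-1] if random.random() < 0.5 else x

ds = TransformedList([[1, 2, 3]], [random_flip])
samples = [ds[0] for _ in range(20)]  # the same item, fetched repeatedly
print(samples[:5])
```

Each fetch returns either the original or the flipped sequence, while `ds.items` keeps the original data.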

In [ ]:

show_doc(LabelLists.transform)

`transform`[source][test]

transform(tfms:Optional[Tuple[Union[Callable, Collection[Callable]], Union[Callable, Collection[Callable]]]]=(None, None), **kwargs)

No tests found for transform. To contribute a test please refer to this guide and this discussion.

Set tfms to be applied to the xs of the train and validation set.

This is primarily for the vision application. The kwargs arguments are the ones expected by the type of transforms you pass. tfm_y is among them and, if set to True, the transforms will be applied to both input and target.

For examples see: vision.transforms.

Add a test set¶

To add a test set, you can use one of the two following methods.

In [ ]:

show_doc(LabelLists.add_test)

`add_test`[source][test]

add_test(items:Iterator[T_co], label:Any=None)

No tests found for add_test. To contribute a test please refer to this guide and this discussion.

Add test set containing items with an arbitrary label.

In [ ]:

jekyll_note("Here `items` can be an `ItemList` or a collection.")

Note: Here `items` can be an `ItemList` or a collection.

In [ ]:

show_doc(LabelLists.add_test_folder)

`add_test_folder`[source][test]

add_test_folder(test_folder:str='test', label:Any=None)

No tests found for add_test_folder. To contribute a test please refer to this guide and this discussion.

Add test set containing items from test_folder and an arbitrary label.

In [ ]:

jekyll_warn("In fastai the test set is unlabeled! No labels will be collected even if they are available.")

Warning: In fastai the test set is unlabeled! So no labels will be collected even if they are available.

Instead, either the passed label argument or an empty label will be used for all entries of this dataset (this is required by the internal pipeline of fastai).

In the fastai framework test datasets have no labels - this is the unknown data to be predicted. If you want to validate your model on a test dataset with labels, you probably need to use it as a validation set, as in:

data_test = (ImageList.from_folder(path)
        .split_by_folder(train='train', valid='test')
        .label_from_folder()
        ...)

Another approach is to train with a normal validation set and then, once training is over, validate on the labeled test set by treating it as a validation set:

tfms = []
path = Path('data').resolve()
data = (ImageList.from_folder(path)
        .split_by_rand_pct()
        .label_from_folder()
        .transform(tfms)
        .databunch()
        .normalize())
learn = cnn_learner(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)

# Now replace the validation dataset with the test dataset as a new validation dataset.
# Everything is exactly the same, except `split_by_rand_pct` is replaced with `split_by_folder`
# (or, if you were already using the latter, simply switch to valid='test').
data_test = (ImageList.from_folder(path)
        .split_by_folder(train='train', valid='test')
        .label_from_folder()
        .transform(tfms)
        .databunch()
        .normalize())
learn.validate(data_test.valid_dl)

Of course, your data block can be totally different; this is just an example.

Step 4: convert to a `DataBunch`¶

This last step is usually pretty straightforward. You just have to include all the arguments we pass to DataBunch.create (bs, num_workers, collate_fn). The class called to create a DataBunch is set in the _bunch attribute of the inputs of the training set if you need to modify it. Normally, the various subclasses we showed before handle that for you.

In [ ]:

show_doc(LabelLists.databunch)

`databunch`[source][test]

databunch(path:PathOrStr=None, bs:int=64, val_bs:int=None, num_workers:int=8, dl_tfms:Optional[Collection[Callable]]=None, device:device=None, collate_fn:Callable='data_collate', no_check:bool=False, **kwargs) → DataBunch

Tests found for databunch:

pytest -sv tests/test_vision_data.py::test_vision_datasets [source]

Related tests:

pytest -sv tests/test_data_block.py::test_regression [source]

To run tests please refer to this guide.

Create a DataBunch from self. path will override self.path, and kwargs are passed to DataBunch.create.
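To give a feel for what the batch size and the collate function control, here is a rough standalone sketch of batching in plain Python. The helpers `batches` and `simple_collate` are hypothetical; the real dataloaders are far more involved, so this only illustrates the roles of the two arguments.

```python
def batches(dataset, bs, collate_fn):
    "Yield collated batches of at most `bs` samples."
    for i in range(0, len(dataset), bs):
        yield collate_fn(dataset[i:i + bs])

def simple_collate(samples):
    "Hypothetical collate function: turn a list of (x, y) pairs into (xs, ys)."
    xs, ys = zip(*samples)
    return list(xs), list(ys)

# Illustrative dataset of (input, label) pairs.
dataset = [(i, i % 2) for i in range(5)]
out = list(batches(dataset, bs=2, collate_fn=simple_collate))
print(out)
```

The last batch is smaller because five samples do not divide evenly into batches of two.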

Inner classes¶

In [ ]:

show_doc(LabelList, title_level=3)

`class` `LabelList`[source][test]

LabelList(x:ItemList, y:ItemList, tfms:Union[Callable, Collection[Callable]]=None, tfm_y:bool=False, **kwargs) :: Dataset

No tests found for LabelList. To contribute a test please refer to this guide and this discussion.

A list of inputs x and labels y with optional tfms.

Optionally apply tfms to y if tfm_y is True.

In [ ]:

show_doc(LabelList.export)

`export`[source][test]

export(fn:PathOrStr, **kwargs)

No tests found for export. To contribute a test please refer to this guide and this discussion.

Export the minimal state and save it in fn to load an empty version for inference.

In [ ]:

show_doc(LabelList.transform_y)

`transform_y`[source][test]

transform_y(tfms:Union[Callable, Collection[Callable]]=None, **kwargs)

No tests found for transform_y. To contribute a test please refer to this guide and this discussion.

Set tfms to be applied to the targets only.

In [ ]:

show_doc(LabelList.get_state)

`get_state`[source][test]

get_state(**kwargs)

No tests found for get_state. To contribute a test please refer to this guide and this discussion.

Return the minimal state for export.

In [ ]:

show_doc(LabelList.load_empty)

`load_empty`[source][test]

load_empty(path:PathOrStr, fn:PathOrStr)

No tests found for load_empty. To contribute a test please refer to this guide and this discussion.

Load the state in fn to create an empty LabelList for inference.

In [ ]:

show_doc(LabelList.load_state)

`load_state`[source][test]

load_state(path:PathOrStr, state:dict) → LabelList

No tests found for load_state. To contribute a test please refer to this guide and this discussion.

Create a LabelList from state.

In [ ]:

show_doc(LabelList.process)

`process`[source][test]

process(xp:PreProcessor=None, yp:PreProcessor=None, name:str=None)

Tests found for process:

Direct tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

Launch the processing on self.x and self.y with xp and yp.

In [ ]:

show_doc(LabelList.set_item)

`set_item`[source][test]

set_item(item)

No tests found for set_item. To contribute a test please refer to this guide and this discussion.

For inference, will briefly replace the dataset with one that only contains item.

In [ ]:

show_doc(LabelList.to_df)

`to_df`[source][test]

to_df()

No tests found for to_df. To contribute a test please refer to this guide and this discussion.

Create pd.DataFrame containing items from self.x and self.y.

In [ ]:

show_doc(LabelList.to_csv)

`to_csv`[source][test]

to_csv(dest:str)

No tests found for to_csv. To contribute a test please refer to this guide and this discussion.

Save self.to_df() to a CSV file in self.path/dest.

In [ ]:

show_doc(LabelList.transform)

`transform`[source][test]

transform(tfms:Union[Callable, Collection[Callable]], tfm_y:bool=None, **kwargs)

No tests found for transform. To contribute a test please refer to this guide and this discussion.

Set the tfms and tfm_y value to be applied to the inputs and targets.

In [ ]:

show_doc(ItemLists, title_level=3)

`class` `ItemLists`[source][test]

ItemLists(path:PathOrStr, train:ItemList, valid:ItemList)

No tests found for ItemLists. To contribute a test please refer to this guide and this discussion.

An ItemList for each of train and valid (optional test).

In [ ]:

show_doc(ItemLists.label_from_lists)

`label_from_lists`[source][test]

label_from_lists(train_labels:Iterator[T_co], valid_labels:Iterator[T_co], label_cls:Callable=None, **kwargs) → LabelList

No tests found for label_from_lists. To contribute a test please refer to this guide and this discussion.

Use the labels in train_labels and valid_labels to label the data. label_cls will overwrite the default.

In [ ]:

show_doc(ItemLists.transform)

`transform`[source][test]

transform(tfms:Optional[Tuple[Union[Callable, Collection[Callable]], Union[Callable, Collection[Callable]]]]=(None, None), **kwargs)

No tests found for transform. To contribute a test please refer to this guide and this discussion.

Set tfms to be applied to the xs of the train and validation set.

In [ ]:

show_doc(ItemLists.transform_y)

`transform_y`[source][test]

transform_y(tfms:Optional[Tuple[Union[Callable, Collection[Callable]], Union[Callable, Collection[Callable]]]]=(None, None), **kwargs)

No tests found for transform_y. To contribute a test please refer to this guide and this discussion.

Set tfms to be applied to the ys of the train and validation set.

In [ ]:

show_doc(LabelLists, title_level=3)

`class` `LabelLists`[source][test]

LabelLists(path:PathOrStr, train:ItemList, valid:ItemList) :: ItemLists

No tests found for LabelLists. To contribute a test please refer to this guide and this discussion.

A LabelList for each of train and valid (optional test).

In [ ]:

show_doc(LabelLists.get_processors)

`get_processors`[source][test]

get_processors()

No tests found for get_processors. To contribute a test please refer to this guide and this discussion.

Read the default class processors if none have been set.

In [ ]:

show_doc(LabelLists.load_empty)

`load_empty`[source][test]

load_empty(path:PathOrStr, fn:PathOrStr='export.pkl')

No tests found for load_empty. To contribute a test please refer to this guide and this discussion.

Create a LabelLists with empty sets from the serialized file in path/fn.

In [ ]:

show_doc(LabelLists.load_state)

`load_state`[source][test]

load_state(path:PathOrStr, state:dict)

No tests found for load_state. To contribute a test please refer to this guide and this discussion.

Create a LabelLists with empty sets from the serialized state.

In [ ]:

show_doc(LabelLists.process)

`process`[source][test]

process()

Tests found for process:

Direct tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

Process the inner datasets.

Helper functions¶

In [ ]:

show_doc(get_files)

`get_files`[source][test]

get_files(path:PathOrStr, extensions:StrList=None, recurse:bool=False, include:OptStrList=None) → FilePathList

No tests found for get_files. To contribute a test please refer to this guide and this discussion.

Return list of files in path that have a suffix in extensions; optionally recurse.
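The behaviour described above can be approximated with the standard library alone. The helper `list_files` below is a hypothetical sketch, not the fastai implementation, and it leaves out the `include` argument.

```python
import tempfile
from pathlib import Path

def list_files(path, extensions=None, recurse=False):
    "Sketch: files under `path` whose suffix is in `extensions`; optionally recurse."
    path = Path(path)
    exts = None if extensions is None else {e.lower() for e in extensions}
    candidates = path.rglob('*') if recurse else path.iterdir()
    return sorted(p for p in candidates
                  if p.is_file() and (exts is None or p.suffix.lower() in exts))

# Build a small throwaway folder to try it on.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root/'a.png').touch()
    (root/'notes.txt').touch()
    (root/'sub').mkdir()
    (root/'sub'/'b.png').touch()
    flat = [p.name for p in list_files(root, ['.png'])]
    deep = [p.name for p in list_files(root, ['.png'], recurse=True)]

print(flat, deep)
```

Without recursion only the top-level image is found; with `recurse=True` the file in the subfolder is returned as well.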

Undocumented Methods - Methods moved below this line will intentionally be hidden¶

In [ ]:

show_doc(CategoryList.new)

`new`[source][test]

new(items:Iterator[T_co], processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

No tests found for new. To contribute a test please refer to this guide and this discussion.

Create a new ItemList from items, keeping the same attributes.

In [ ]:

show_doc(LabelList.new)

`new`[source][test]

new(x, y, **kwargs) → LabelList

No tests found for new. To contribute a test please refer to this guide and this discussion.

In [ ]:

show_doc(CategoryList.get)

`get`[source][test]

get(i)

Subclass if you want to customize how to create item i from self.items.

In [ ]:

show_doc(LabelList.predict)

`predict`[source][test]

predict(res)

No tests found for predict. To contribute a test please refer to this guide and this discussion.

Delegates predict call on res to self.y.

In [ ]:

show_doc(ItemList.new)

`new`[source][test]

new(items:Iterator[T_co], processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

No tests found for new. To contribute a test please refer to this guide and this discussion.

Create a new ItemList from items, keeping the same attributes.

In [ ]:

show_doc(ItemList.process_one)

`process_one`[source][test]

process_one(item:ItemBase, processor:Union[PreProcessor, Collection[PreProcessor]]=None)

Tests found for process_one:

Related tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

Apply processor or self.processor to item.

In [ ]:

show_doc(ItemList.process)

`process`[source][test]

process(processor:Union[PreProcessor, Collection[PreProcessor]]=None)

Tests found for process:

Direct tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

Apply processor or self.processor to self.

In [ ]:

show_doc(MultiCategoryProcessor.process_one)

`process_one`[source][test]

process_one(item)

Tests found for process_one:

Related tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

In [ ]:

show_doc(FloatList.get)

`get`[source][test]

get(i)

No tests found for get. To contribute a test please refer to this guide and this discussion.

Subclass if you want to customize how to create item i from self.items.

In [ ]:

show_doc(CategoryProcessor.process_one)

`process_one`[source][test]

process_one(item)

Tests found for process_one:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

In [ ]:

show_doc(CategoryProcessor.create_classes)

`create_classes`[source][test]

create_classes(classes)

No tests found for create_classes. To contribute a test please refer to this guide and this discussion.

In [ ]:

show_doc(CategoryProcessor.process)

`process`[source][test]

process(ds)

Tests found for process:

Direct tests:

pytest -sv tests/test_data_block.py::test_category_processor_existing_class [source]
pytest -sv tests/test_data_block.py::test_category_processor_non_existing_class [source]

To run tests please refer to this guide.

In [ ]:

show_doc(MultiCategoryList.get)

`get`[source][test]

get(i)

No tests found for get. To contribute a test please refer to this guide and this discussion.

Subclass if you want to customize how to create item i from self.items.

In [ ]:

show_doc(FloatList.new)

`new`[source][test]

new(items:Iterator[T_co], processor:Union[PreProcessor, Collection[PreProcessor]]=None, **kwargs) → ItemList

No tests found for new. To contribute a test please refer to this guide and this discussion.

Create a new ItemList from items, keeping the same attributes.

In [ ]:

show_doc(FloatList.reconstruct)

`reconstruct`[source][test]

reconstruct(t)

No tests found for reconstruct. To contribute a test please refer to this guide and this discussion.

Reconstruct one of the underlying items from its data t.

In [ ]:

show_doc(MultiCategoryList.analyze_pred)

`analyze_pred`[source][test]

analyze_pred(pred, thresh:float=0.5)

No tests found for analyze_pred. To contribute a test please refer to this guide and this discussion.

Called on pred before reconstruct for additional preprocessing.
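For multi-label predictions, a plausible reading of `thresh` is that every class whose score reaches the threshold is kept. The standalone sketch below assumes exactly that (a `>=` comparison on per-class scores); the helper `threshold_pred`, the class names and the scores are made up for illustration.

```python
def threshold_pred(pred, thresh=0.5):
    "Sketch: mark every class whose score reaches `thresh`."
    return [1.0 if p >= thresh else 0.0 for p in pred]

classes = ['clear', 'haze', 'primary', 'water']  # illustrative class names
pred = [0.91, 0.08, 0.5, 0.3]                    # made-up per-class scores
active = threshold_pred(pred)
predicted = [c for c, a in zip(classes, active) if a]
print(active, predicted)
```

Raising `thresh` trades recall for precision, since fewer classes pass the cut.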

In [ ]:

show_doc(MultiCategoryList.reconstruct)

`reconstruct`[source][test]

reconstruct(t)

No tests found for reconstruct. To contribute a test please refer to this guide and this discussion.

Reconstruct one of the underlying items from its data t.

In [ ]:

show_doc(CategoryList.reconstruct)

`reconstruct`[source][test]

reconstruct(t)

Reconstruct one of the underlying items from its data t.

In [ ]:

show_doc(CategoryList.analyze_pred)

`analyze_pred`[source][test]

analyze_pred(pred, thresh:float=0.5)

Called on pred before reconstruct for additional preprocessing.

In [ ]:

show_doc(EmptyLabelList.reconstruct)

`reconstruct`[source][test]

reconstruct(t:Tensor, x:Tensor=None)

No tests found for reconstruct. To contribute a test please refer to this guide and this discussion.

Reconstruct one of the underlying items from its data t.

In [ ]:

show_doc(EmptyLabelList.get)

`get`[source][test]

get(i)

No tests found for get. To contribute a test please refer to this guide and this discussion.

Subclass if you want to customize how to create item i from self.items.

In [ ]:

show_doc(LabelList.databunch)

`databunch`[source][test]

databunch(**kwargs)

Tests found for databunch:

Related tests:

pytest -sv tests/test_data_block.py::test_regression [source]

To run tests please refer to this guide.

Throws a clear error message when the data wasn't split.

New Methods - Please document or move to the undocumented section¶

In [ ]:

show_doc(ItemList.add)

`add`[source][test]

add(items:ItemList)

No tests found for add. To contribute a test please refer to this guide and this discussion.
