from fastai.gen_doc.nbdoc import *
from fastai.vision import *
from fastai import *
The data block API lets you customize how to create a DataBunch
by isolating the underlying parts of that process in separate blocks, mainly:
- where the inputs are and how to list them
- how to label them
- how to split the data into training and validation sets
- what type of Dataset to create
- which transforms to apply
- how to wrap everything up in a DataBunch
This is a bit longer than using the factory methods but is way more flexible. As usual, we'll begin with end-to-end examples, then switch to the details of each of those parts.
In vision.data, we create an easy DataBunch suitable for classification by simply typing:
path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=24)
This is aimed at data that is in folders following an ImageNet style, with a train and valid directory, each containing one subdirectory per class, where all the pictures are. With the data block API, the same thing is achieved like this:
path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
data = (ImageFileList.from_folder(path) #Where to find the data? -> in path and its subfolders
.label_from_folder() #How to label? -> depending on the folder of the filenames
.split_by_folder() #How to split in train/valid? -> use the folders
.add_test_folder() #Optionally add a test set
.datasets(ImageClassificationDataset) #How to convert to datasets? -> use ImageClassificationDataset
.transform(tfms, size=224) #Data augmentation? -> use tfms with a size of 224
.databunch()) #Finally? -> use the defaults for conversion to ImageDataBunch
data.test_ds[0]
(Image (3, 224, 224), 0)
data.show_batch(rows=3, figsize=(5,5))
data.valid_ds.classes
['3', '7']
Let's look at another example from vision.data with the planet dataset. This time, it's a multi-label classification problem with the labels in a csv file and no given split between valid and train data, so we use a random split. The factory method is:
planet = untar_data(URLs.PLANET_TINY)
planet_tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
data = ImageDataBunch.from_csv(planet, folder='train', size=128, suffix='.jpg', sep=' ', ds_tfms=planet_tfms)
With the data block API, we can rewrite this like so:
data = (ImageFileList.from_folder(planet)
#Where to find the data? -> in planet and its subfolders
.label_from_csv('labels.csv', sep=' ', folder='train', suffix='.jpg')
#How to label? -> use the csv file labels.csv in path,
#add .jpg to the names and take them in the folder train
.random_split_by_pct()
#How to split in train/valid? -> randomly with the default 20% in valid
.datasets(ImageMultiDataset)
#How to convert to datasets? -> use ImageMultiDataset
.transform(planet_tfms, size=128)
#Data augmentation? -> use tfms with a size of 128
.databunch())
#Finally? -> use the defaults for conversion to databunch
data.show_batch(rows=3, figsize=(10,8), ds_type=DatasetType.Valid)
This new API also allows us to use dataset types for which there is no direct ImageDataBunch factory method. For a segmentation task, for instance, we can use it to quickly get a DataBunch. Let's take the example of the camvid dataset: the images are in an 'images' folder and their corresponding masks are in a 'labels' folder.
camvid = untar_data(URLs.CAMVID_TINY)
path_lbl = camvid/'labels'
path_img = camvid/'images'
We have a file that gives us the names of the classes (what each code inside the masks corresponds to: a pedestrian, a tree, a road...)
codes = np.loadtxt(camvid/'codes.txt', dtype=str); codes
array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole', 'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock', 'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone', 'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype='<U17')
And we define the following function that infers the mask filename from the image filename.
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'
Then we can easily define a DataBunch using the data block API. Here we need to use tfm_y=True in the transform call because we need the same transforms to be applied to the target mask as were applied to the image.
data = (ImageFileList.from_folder(path_img) #Where are the input files? -> in path_img
.label_from_func(get_y_fn) #How to label? -> use get_y_fn
.random_split_by_pct() #How to split between train and valid? -> randomly
.datasets(SegmentationDataset, classes=codes) #How to create a dataset? -> use SegmentationDataset
.transform(get_transforms(), size=96, tfm_y=True) #Data aug -> Use standard tfms with tfm_y=True
.databunch(bs=64)) #Lastly, convert to a databunch.
data.show_batch(rows=2, figsize=(5,5))
One last example for object detection. We use our tiny sample of the COCO dataset here. There is a helper function in the library that reads the annotation file and returns the list of image names along with the list of labelled bboxes associated with each one. We convert it to a dictionary that maps image names to their bboxes, and then write the function that will give us the target for each image filename.
coco = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = {img:bb for img, bb in zip(images, lbl_bbox)}
get_y_func = lambda o:img2bbox[o.name]
The following code is very similar to what we saw before. The only new addition is the use of a special function to collate the samples in batches. This comes from the fact that our images may have different numbers of bounding boxes, so we need to pad each sample to the largest number of bounding boxes in the batch.
data = (ImageFileList.from_folder(coco)
#Where are the images? -> in coco
.label_from_func(get_y_func)
#How to find the labels? -> use get_y_func
.random_split_by_pct()
#How to split in train/valid? -> randomly with the default 20% in valid
.datasets(ObjectDetectDataset)
#How to create datasets? -> with ObjectDetectDataset
.transform(get_transforms(), tfm_y=True)
#Data augmentation? -> Standard transforms with tfm_y=True
.databunch(bs=16, collate_fn=bb_pad_collate))
#Finally we convert to a DataBunch and we use bb_pad_collate
data.show_batch(rows=3, ds_type=DatasetType.Valid, figsize=(8,7))
The inputs we want to feed our model are regrouped in the following class, which contains methods to get the corresponding labels.
show_doc(InputList, title_level=3, doc_string=False)
class InputList [source]
InputList(items:Iterator, path:PathOrStr='.') :: PathItemList
This class regroups the inputs for our model in items and saves a path attribute which is where it will look for any files (image files, csv file with labels...).
show_doc(InputList.from_folder)
from_folder [source]
from_folder(path:PathOrStr='.', extensions:StrList=None, recurse=True) → InputList
Get the list of files in path that have a suffix in extensions. recurse determines if we search subfolders.
Note that InputList is subclassed in vision by ImageFileList, which changes the default of extensions to image file extensions (which is why we used ImageFileList in our previous examples).
All of the following are methods of InputList. Note that some of them are primarily intended for inputs that are filenames and might not work in general situations.
show_doc(InputList.label_from_csv)
label_from_csv [source]
label_from_csv(csv_fname, header:Union[int, str, NoneType]='infer', fn_col:int=0, label_col:int=1, sep:str=None, folder:PathOrStr='.', suffix:str=None) → LabelList
Look in self.path/csv_fname for a csv loaded with an optional header containing the filenames in fn_col to get the corresponding label in label_col.
If a folder is specified, filenames are taken in self.path/folder. If a suffix is given, it is added to the filenames. If sep is specified, the values in label_col are split accordingly. This method is intended for inputs that are filenames.
jekyll_note("This method will only keep the filenames that are both present in the csv file and in `self.items`.")
show_doc(InputList.label_from_df)
label_from_df [source]
label_from_df(df, fn_col:int=0, label_col:int=1, sep:str=None, folder:PathOrStr='.', suffix:str=None) → LabelList
Look in df for the filenames in fn_col to get the corresponding label in label_col.
jekyll_note("This method will only keep the filenames that are both present in the dataframe and in `self.items`.")
show_doc(InputList.label_from_folder)
label_from_folder [source]
label_from_folder(classes:StrList=None) → LabelList
Give a label to each filename depending on its folder. If classes are specified, only keep those.
jekyll_note("This method looks at the last subfolder in the path to determine the classes.")
show_doc(InputList.label_from_func)
label_from_func [source]
label_from_func(func:Callable) → LabelList
Apply func to every input to get its label.
This method is primarily intended for inputs that are filenames, but could work in other settings.
show_doc(InputList.label_from_re)
label_from_re [source]
label_from_re(pat:str, full_path:bool=False) → LabelList
Apply the regular expression in pat to determine the label of every filename. If full_path, search in the full path rather than just the file name.
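For example, on a hypothetical dataset where the class is encoded in the filename (cat_1.jpg, dog_12.jpg, ...), a capturing group pulls the label out of the name:
pat = r'^(.*)_\d+\.jpg$'  # hypothetical pattern: the label is everything before the last underscore
lls = ImageFileList.from_folder(path).label_from_re(pat)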
show_doc(LabelList, title_level=3, doc_string=False)
class LabelList [source]
LabelList(items:Iterator, path:PathOrStr='.', parent:InputList=None) :: PathItemList
A list of labelled inputs in items (expected to be tuples of input, label) with a path attribute. This class contains methods to create a SplitData.
show_doc(LabelList.random_split_by_pct)
random_split_by_pct [source]
random_split_by_pct(valid_pct:float=0.2) → SplitData
Split the items randomly by putting valid_pct in the validation set.
show_doc(LabelList.split_by_files)
show_doc(LabelList.split_by_fname_file)
split_by_fname_file [source]
split_by_fname_file(fname:PathOrStr, path:PathOrStr=None) → SplitData
Split the data by using the file names in fname for the validation set. path will override self.path. This method won't work if your inputs aren't filenames.
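For example, assuming a hypothetical valid.txt in self.path that lists one validation filename per line:
sd = (ImageFileList.from_folder(path)
      .label_from_folder()
      .split_by_fname_file('valid.txt'))  # files listed in valid.txt go to the validation set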
show_doc(LabelList.split_by_folder)
split_by_folder [source]
split_by_folder(train:str='train', valid:str='valid') → SplitData
Split the data depending on the folder (train or valid) in which the filenames are. This method won't work if your inputs aren't filenames.
jekyll_note("This method looks at the folder immediately after `self.path` for `valid` and `train`.")
show_doc(LabelList.split_by_idx)
split_by_idx [source]
split_by_idx(valid_idx:Collection[int]) → SplitData
Split the data according to the indexes in valid_idx.
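This is handy for reproducible splits or cross-validation. As a sketch (the indexing scheme is just an illustration), holding out every fifth item:
lls = ImageFileList.from_folder(path).label_from_folder()
sd = lls.split_by_idx(list(range(0, len(lls), 5)))  # every fifth item goes to the validation set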
show_doc(SplitData, title_level=3)
show_doc(SplitData.add_test)
show_doc(SplitData.add_test_folder)
split_data_add_test_folder [source]
split_data_add_test_folder(test_folder:str='test', label:Any=None)
Add test set containing items from folder test_folder and an arbitrary label.
To create the datasets from SplitData we have the following class method.
show_doc(SplitData.datasets)
datasets [source]
datasets(dataset_cls:type, **kwargs) → SplitDatasets
Create datasets from the underlying data using dataset_cls and passing along the kwargs.
show_doc(SplitDatasets, title_level=3)
This class can be constructed directly from one of the following factory methods.
show_doc(SplitDatasets.from_single)
show_doc(SplitDatasets.single_from_c)
single_from_c [source]
single_from_c(path:PathOrStr, c:int) → SplitDatasets
Factory method that passes a DatasetBase on c to from_single.
show_doc(SplitDatasets.single_from_classes)
single_from_classes [source]
single_from_classes(path:PathOrStr, classes:StrList) → SplitDatasets
Factory method that passes a SingleClassificationDataset on classes to from_single.
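These factory methods are mostly useful at inference time, when we want datasets that only carry the classes of a trained model rather than any actual data. A minimal sketch with the MNIST classes from the first example:
sds = SplitDatasets.single_from_classes(path, ['3', '7'])  # datasets that just know the class names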
Then we can build the DataLoader around our Dataset like this.
show_doc(SplitDatasets.dataloaders)
dataloaders [source]
dataloaders(**kwargs) → Collection[DataLoader]
Create dataloaders with the inner datasets, passing along the kwargs.
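A minimal sketch, assuming sds is a SplitDatasets without a test set and that the kwargs are forwarded to the underlying PyTorch DataLoader:
train_dl, valid_dl = sds.dataloaders(batch_size=64, num_workers=4)  # one DataLoader per inner dataset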
The methods img_transform and img_databunch used earlier are documented in vision.data.
show_doc(ItemList, title_level=3)
class ItemList [source]
ItemList(items:Iterator)
A collection of items with __len__ and __getitem__ with ndarray indexing semantics.
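Concretely, ndarray indexing semantics means an integer returns a single item while a slice or a list of indices returns several, as in this sketch:
il = ItemList(['a', 'b', 'c', 'd'])
il[0]       # a single item: 'a'
il[1:3]     # a slice of items
il[[0, 2]]  # fancy indexing with a list of indices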