Tanishq Abraham
Here, I import from fastai.vision.all, which will import all the necessary modules. I also use the fastai-provided set_seed functionality to ensure reproducibility.
from fastai.vision.all import *
set_seed(42,reproducible=True)
Here, I define functions to help me generate the noisy labels.
get_labels goes through all the files and produces a list of labels based on each file's parent folder (parent_label).
generate_noisy_labels takes in the list of labels (produced by get_labels), the list of classes (unique_labels), and the desired fraction of noisy labels (pct_noise, e.g. 0.01 for 1%), and produces new labels with the desired noise level.
get_imagenette_relative_path gives file paths relative to the Imagenette dataset directory.
def get_labels(files):
    # the label of each image is the name of its parent folder
    labels = []
    for file in files: labels.append(parent_label(file))
    return labels

def generate_noisy_labels(labels,unique_labels,pct_noise):
    noisy_labels = labels.copy() #copy labels list, this will be the new list with noisy labels
    num_labels = len(labels) #number of labels
    num_classes = len(unique_labels) #number of unique labels
    noisy_idxs = [] #this is the list of indices where the labels will be switched
    indices = np.random.permutation(num_labels) #randomly permute the indices
    for i, idx in enumerate(indices):
        if i < pct_noise * num_labels: # only change the first pct_noise fraction of the permuted labels
            noisy_idxs.append(idx) #append to noisy_idxs
            before_label = noisy_labels[idx]
            while noisy_labels[idx] == before_label: #ensure that the new label isn't the same
                new_label = unique_labels[np.random.randint(num_classes)] #randomly select a new label
                noisy_labels[idx] = new_label #assign new label
    return noisy_labels, noisy_idxs

def get_imagenette_relative_path(files):
    # keep only the last three path components, e.g. train/<class>/<file name>
    _files = []
    for i in range(len(files)): _files.append(os.path.join(*str(files[i]).split('/')[-3:]))
    return _files
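To make the procedure concrete, here is a minimal standard-library sketch of the same idea on a toy label list. It is not the code used to build the dataset: the name flip_labels and the toy classes are hypothetical, it uses Python's random module instead of NumPy, and it draws the replacement directly from the other classes rather than resampling until the label changes.

```python
import math
import random

def flip_labels(labels, classes, pct_noise, seed=0):
    """Return (noisy_labels, noisy_idxs) with ceil(pct_noise * n) labels changed."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    noisy = list(labels)
    n_flip = math.ceil(pct_noise * len(labels))
    noisy_idxs = rng.sample(range(len(labels)), n_flip)  # distinct positions to corrupt
    for idx in noisy_idxs:
        others = [c for c in classes if c != noisy[idx]]  # every class except the true one
        noisy[idx] = rng.choice(others)
    return noisy, noisy_idxs

# toy data: three hypothetical classes, 20 labels each
classes = ['cat', 'dog', 'fish']
labels = [c for c in classes for _ in range(20)]
noisy, idxs = flip_labels(labels, classes, 0.25)

# the positions that actually changed should be exactly the sampled ones
changed = [i for i, (a, b) in enumerate(zip(labels, noisy)) if a != b]
print(len(changed), sorted(changed) == sorted(idxs))  # prints: 15 True
```

The count ceil(pct_noise * n) mirrors the i < pct_noise * num_labels condition in generate_noisy_labels, which also rounds up to a whole number of labels.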
Here, I download the Imagenette data, get a list of the file paths for the training data, and get the labels and their unique values (the 10 classes).
source = untar_data(URLs.IMAGENETTE_320)
train_files = get_image_files(source/'train')
labels = get_labels(train_files)
unique_labels = list(set(labels))
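One detail worth keeping in mind for reproducibility: Python randomizes string hashes per process unless PYTHONHASHSEED is fixed, so the order of list(set(labels)) can differ between runs even with a fixed random seed, and generate_noisy_labels indexes into that list. Below is a small sketch of a stable alternative using plain sorted and a few toy labels. This is a suggestion, not what produced the outputs shown in this post.

```python
# toy stand-in for the folder-name labels; the values are only examples
labels = ['n03000684', 'n02979186', 'n03000684', 'n03394916', 'n02979186']

# set() removes duplicates but gives no stable order for strings across runs;
# sorting yields the same class order every time
unique_labels = sorted(set(labels))
print(unique_labels)
# ['n02979186', 'n03000684', 'n03394916']
```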
I now use the previously defined generate_noisy_labels to generate labels with 1%, 5%, 25%, and 50% noise. I can briefly check that the percentage of noisy labels indeed matches, and print an example label that has been changed (the noisy label followed by the original label).
noisy_labels_1, noisy_idxs_1 = generate_noisy_labels(labels, unique_labels, 0.01)
print(f'percentage noise: {100*len(noisy_idxs_1)/len(noisy_labels_1)}%')
percentage noise: 1.0032738409546942%
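The printed value is slightly above 1% because the loop changes labels while i < pct_noise * num_labels, which rounds the count up to a whole number of labels. Here is a quick check of that arithmetic, assuming the training set has 9,469 images (a figure inferred from the printed percentage, not stated above):

```python
import math

num_labels = 9469  # assumed size of the Imagenette training set
pct_noise = 0.01

# i runs over 0, 1, 2, ... and labels stop changing once i >= pct_noise * num_labels,
# so the number of changed labels is the ceiling of that product
num_noisy = math.ceil(pct_noise * num_labels)
print(num_noisy)                     # 95
print(100 * num_noisy / num_labels)  # roughly 1.00327
```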
example_idx = np.random.randint(len(noisy_idxs_1))
print(noisy_labels_1[noisy_idxs_1[example_idx]], labels[noisy_idxs_1[example_idx]])
n03394916 n03000684
noisy_labels_5, noisy_idxs_5 = generate_noisy_labels(labels, unique_labels, 0.05)
noisy_labels_25, noisy_idxs_25 = generate_noisy_labels(labels, unique_labels, 0.25)
noisy_labels_50, noisy_idxs_50 = generate_noisy_labels(labels, unique_labels, 0.50)
Now, I will make a pandas DataFrame of these labels. In the DataFrame, I include the file paths, the noisy labels at the different noise levels, and a column to indicate which images are in the validation dataset. Note that the validation dataset does not have noisy labels, so every label column holds the true labels there. For this reason, I make separate DataFrames for the train files and the validation files, then concatenate them.
_files = get_imagenette_relative_path(train_files)
train_df = pd.DataFrame({'path': _files,
                         'noisy_labels_1': noisy_labels_1,
                         'noisy_labels_5': noisy_labels_5,
                         'noisy_labels_25': noisy_labels_25,
                         'noisy_labels_50': noisy_labels_50,
                         'is_valid': [False]*len(_files)
                         })
val_files = get_image_files(source/'val')
labels = get_labels(val_files)
_files = get_imagenette_relative_path(val_files)
val_df = pd.DataFrame({'path': _files,
                       'noisy_labels_1': labels,
                       'noisy_labels_5': labels,
                       'noisy_labels_25': labels,
                       'noisy_labels_50': labels,
                       'is_valid': [True]*len(_files)
                       })
df = pd.concat([train_df,val_df])
df.head()
| | path | noisy_labels_1 | noisy_labels_5 | noisy_labels_25 | noisy_labels_50 | is_valid |
|---|---|---|---|---|---|---|
| 0 | train/n02979186/n02979186_9036.JPEG | n02979186 | n02979186 | n02979186 | n02979186 | False |
| 1 | train/n02979186/n02979186_11957.JPEG | n02979186 | n02979186 | n02979186 | n03000684 | False |
| 2 | train/n02979186/n02979186_9715.JPEG | n02979186 | n02979186 | n03417042 | n03000684 | False |
| 3 | train/n02979186/n02979186_21736.JPEG | n02979186 | n02979186 | n02979186 | n03417042 | False |
| 4 | train/n02979186/ILSVRC2012_val_00046953.JPEG | n02979186 | n02979186 | n02979186 | n03394916 | False |
I export to a CSV file and I'm done:
df.to_csv('noisy_imagenette.csv', index=False)
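Because each path still contains the true class folder, the noise level can be recovered from the CSV alone. Here is a minimal sketch using only the standard library. The helper name noise_rates and the three-row sample are hypothetical, and it assumes paths use / as the separator, as in the table above. In practice the file object would come from open('noisy_imagenette.csv').

```python
import csv
import io

def noise_rates(csv_file):
    """Fraction of training rows whose label differs from the folder name in `path`."""
    mismatches, n_train = {}, 0
    for row in csv.DictReader(csv_file):
        if row['is_valid'] == 'True':  # validation rows keep their true labels
            continue
        n_train += 1
        true_label = row['path'].split('/')[1]  # train/<class>/<file name>
        for col, value in row.items():
            if col.startswith('noisy_labels_'):
                mismatches[col] = mismatches.get(col, 0) + (value != true_label)
    return {col: count / n_train for col, count in mismatches.items()}

# tiny in-memory sample with the same kind of columns as the exported file
sample = io.StringIO(
    "path,noisy_labels_1,noisy_labels_50,is_valid\n"
    "train/n02979186/a.JPEG,n02979186,n02979186,False\n"
    "train/n02979186/b.JPEG,n02979186,n03000684,False\n"
    "val/n02979186/c.JPEG,n02979186,n02979186,True\n"
)
print(noise_rates(sample))
# {'noisy_labels_1': 0.0, 'noisy_labels_50': 0.5}
```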
I repeat the same process for the Imagewoof dataset. Again, I load the dataset and obtain the labels and classes.
source = untar_data(URLs.IMAGEWOOF_320)
train_files = get_image_files(source/'train')
labels = get_labels(train_files)
unique_labels = list(set(labels))
Again, I generate noisy labels for Imagewoof for the desired noise levels.
noisy_labels_1, noisy_idxs_1 = generate_noisy_labels(labels, unique_labels, 0.01)
print(f'percentage noise: {100*len(noisy_idxs_1)/len(noisy_labels_1)}%')
percentage noise: 1.0083102493074791%
example_idx = np.random.randint(len(noisy_idxs_1))
print(noisy_labels_1[noisy_idxs_1[example_idx]], labels[noisy_idxs_1[example_idx]])
n02089973 n02087394
noisy_labels_5, noisy_idxs_5 = generate_noisy_labels(labels, unique_labels, 0.05)
noisy_labels_25, noisy_idxs_25 = generate_noisy_labels(labels, unique_labels, 0.25)
noisy_labels_50, noisy_idxs_50 = generate_noisy_labels(labels, unique_labels, 0.50)
I make the DataFrame for Imagewoof:
_files = get_imagenette_relative_path(train_files)
train_df = pd.DataFrame({'path': _files,
                         'noisy_labels_1': noisy_labels_1,
                         'noisy_labels_5': noisy_labels_5,
                         'noisy_labels_25': noisy_labels_25,
                         'noisy_labels_50': noisy_labels_50,
                         'is_valid': [False]*len(_files)
                         })
val_files = get_image_files(source/'val')
labels = get_labels(val_files)
_files = get_imagenette_relative_path(val_files)
val_df = pd.DataFrame({'path': _files,
                       'noisy_labels_1': labels,
                       'noisy_labels_5': labels,
                       'noisy_labels_25': labels,
                       'noisy_labels_50': labels,
                       'is_valid': [True]*len(_files)
                       })
df = pd.concat([train_df,val_df])
df.head()
| | path | noisy_labels_1 | noisy_labels_5 | noisy_labels_25 | noisy_labels_50 | is_valid |
|---|---|---|---|---|---|---|
| 0 | train/n02115641/n02115641_3995.JPEG | n02115641 | n02115641 | n02115641 | n02115641 | False |
| 1 | train/n02115641/n02115641_843.JPEG | n02115641 | n02105641 | n02115641 | n02088364 | False |
| 2 | train/n02115641/n02115641_2953.JPEG | n02115641 | n02115641 | n02111889 | n02099601 | False |
| 3 | train/n02115641/n02115641_6458.JPEG | n02115641 | n02115641 | n02093754 | n02115641 | False |
| 4 | train/n02115641/n02115641_19414.JPEG | n02115641 | n02115641 | n02115641 | n02088364 | False |
Export and done:
df.to_csv('noisy_imagewoof.csv', index=False)