!pip install selfclean -Uq
!pip freeze | grep selfclean
selfclean==0.0.28
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False
import os
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
import torch
from torchvision import datasets, transforms
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import copy
import sys
from selfclean import SelfClean
from selfclean.cleaner.selfclean import PretrainingType, DINO_STANDARD_HYPERPARAMETERS
from selfclean.utils.data_downloading import get_oxford_pets3t
if IN_COLAB or IN_KAGGLE:
    !git clone https://github.com/Digital-Dermatology/selfclean.git
    sys.path.append("selfclean")
else:
    sys.path.append("../")
if IN_COLAB or IN_KAGGLE:
    pre_computed_path = Path("selfclean/assets/pre_trained_models")
else:
    pre_computed_path = Path("../assets/pre_trained_models")
We start by downloading our dataset to analyze.
dataset_name = "OxfordIIITPet"
data_path = Path("../data/") / dataset_name
dataset, df = get_oxford_pets3t(
    root_path=data_path,
    return_dataframe=True,
    transform=transforms.Resize((256, 256)),
)
dataset
Oxford PetIIIT already downloaded to `../data/OxfordIIITPet`.
Dataset ImageFolder
    Number of datapoints: 7390
    Root location: ../data/OxfordIIITPet
    StandardTransform
Transform: Resize(size=(256, 256), interpolation=bilinear, max_size=None, antialias=None)
fig, axes = plt.subplots(3, 6)
for h_idx, h_ax in enumerate(axes):
    for v_idx, ax in enumerate(h_ax):
        # show a randomly chosen sample in each subplot
        index = np.random.randint(0, high=len(dataset))
        ax.imshow(dataset[index][0])
        ax.set_xticks([])
        ax.set_yticks([])
fig.tight_layout()
plt.show()
We can get a quick overview of the dataset's quality by using the representations of a model pretrained on ImageNet.
parameters = copy.deepcopy(DINO_STANDARD_HYPERPARAMETERS)
# set the model to a pretrained ImageNet one
parameters['model']['base_model'] = 'pretrained_imagenet_vit_tiny'
selfclean = SelfClean(
    plot_top_N=7,
    auto_cleaning=True,
)
_ = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
    epochs=0,  # set the number of epochs to 0 to run without pre-training
    batch_size=16,
    hyperparameters=parameters,
)
2024-10-04 10:33:45.926 | INFO | Running on: cuda
2024-10-04 10:33:45.927 | INFO | Data loaded: there are 7390 train images and 462 batches with a batch size of 16.
Some weights of the model checkpoint at WinKawaks/vit-tiny-patch16-224 were not used when initializing ViTModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing ViTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ViTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ViTModel were not initialized from the model checkpoint at WinKawaks/vit-tiny-patch16-224 and are newly initialized: ['vit.pooler.dense.bias', 'vit.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-10-04 10:33:46.708 | INFO | Student and Teacher are built: they are both pretrained_imagenet_vit_tiny network.
2024-10-04 10:33:46.710 | INFO | Pre-trained weights not found. Training from scratch.
Creating dataset representation: 0%| | 0/462 [00:00<?, ?it/s]
2024-10-04 10:34:36.329 | INFO | Fitting cleaner on representation space: (7390, 192)
Creating distance matrix: 0%| | 0/74 [00:00<?, ?it/s]
Processing possible near duplicates: 0%| | 0/2731 [00:00<?, ?it/s]
Processing possible irrelevant samples: 0it [00:00, ?it/s]
We have identified that there are indeed data quality issues, and we can now start running SelfClean. As a first step, this will train a model using self-supervised learning on the provided dataset. Afterward, it will use the learned representations to detect data quality issues using simple scoring functions.
Self-supervised pre-training can take some time, so we set the number of pre-training epochs here to 10. However, we suggest letting it run for longer to achieve optimal performance.
Here, we have already carried out the SSL pre-training to speed things up.
selfclean = SelfClean(
    plot_top_N=7,
    auto_cleaning=True,
)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
    pretraining_type=PretrainingType.DINO,
    epochs=10,
    batch_size=16,
    save_every_n_epochs=1,
    dataset_name=dataset_name,
    work_dir=pre_computed_path,
    hyperparameters=DINO_STANDARD_HYPERPARAMETERS,
)
2024-10-04 10:35:34.041 | INFO | Running on: cuda
2024-10-04 10:35:34.042 | INFO | Data loaded: there are 7390 train images and 462 batches with a batch size of 16.
2024-10-04 10:35:58.078 | INFO | Student and Teacher are built: they are both pretrained_imagenet_dino network.
2024-10-04 10:35:58.080 | INFO | Found checkpoint at ../assets/pre_trained_models/DINO-OxfordIIITPet/checkpoints/model_best.pth
Creating dataset representation: 0%| | 0/462 [00:00<?, ?it/s]
2024-10-04 10:36:43.516 | INFO | Fitting cleaner on representation space: (7390, 192)
Creating distance matrix: 0%| | 0/74 [00:00<?, ?it/s]
Processing possible near duplicates: 0%| | 0/2731 [00:00<?, ?it/s]
Processing possible irrelevant samples: 0it [00:00, ?it/s]
print(f"Automatic `near duplicates` detected: {len(issues.get_issues('near_duplicates')['auto_issues'])}")
print(f"Automatic `irrelevant samples` detected: {len(issues.get_issues('irrelevants')['auto_issues'])}")
print(f"Automatic `label errors` detected: {len(issues.get_issues('label_errors')['auto_issues'])}")
Automatic `near duplicates` detected: 89
Automatic `irrelevant samples` detected: 0
Automatic `label errors` detected: 4
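To put these counts into perspective, we can relate them to the dataset size of 7,390 images reported earlier. A minimal sketch, using only the numbers printed above:

```python
# Share of the 7390 samples flagged automatically, per issue type.
n_samples = 7390
auto_counts = {"near_duplicates": 89, "irrelevants": 0, "label_errors": 4}
shares = {k: round(100 * v / n_samples, 2) for k, v in auto_counts.items()}
print(shares)  # → {'near_duplicates': 1.2, 'irrelevants': 0.0, 'label_errors': 0.05}
```

Even at roughly one percent, near duplicates can inflate evaluation scores if the pair members end up in different splits.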
Let's look at each issue type in more detail.
# remove the resize transform so images are shown at their original size
dataset.transforms = None
r_index = 0
fig, axes = plt.subplots(6, 5, figsize=(10, 13))
for h_idx, h_ax in enumerate(axes):
    # every odd row shows the duplicate partners of the row above
    if h_idx % 2 == 1:
        continue
    for v_idx, ax in enumerate(h_ax):
        idx1, idx2 = issues.get_issues('near_duplicates')['indices'][r_index]
        idx1, idx2 = int(idx1), int(idx2)
        ax.imshow(dataset[idx1][0])
        axes[h_idx + 1, v_idx].imshow(dataset[idx2][0])
        ax.set_title(
            f"Ranking: {r_index+1}"
            f"\nIdx1: {idx1}"
            f"\nIdx2: {idx2}"
        )
        ax.set_xticks([])
        ax.set_yticks([])
        axes[h_idx + 1, v_idx].set_xticks([])
        axes[h_idx + 1, v_idx].set_yticks([])
        r_index += 1
fig.tight_layout()
plt.show()
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_near_duplicates.head()
2024-10-04 10:37:41.031 | WARNING | Returning as dataframe requires extensive memory.
| | indices_1 | indices_2 | scores | auto_issues | path_indices_1 | path_indices_2 | label_indices_1 | label_indices_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 698 | 729 | 0.0 | True | ../data/OxfordIIITPet/images/Bombay_194.jpg | ../data/OxfordIIITPet/images/Bombay_32.jpg | Bombay | Bombay |
| 1 | 1112 | 1141 | 0.0 | True | ../data/OxfordIIITPet/images/Egyptian_Mau_210.jpg | ../data/OxfordIIITPet/images/Egyptian_Mau_41.jpg | Egyptian_Mau | Egyptian_Mau |
| 2 | 3659 | 3671 | 0.0 | True | ../data/OxfordIIITPet/images/english_cocker_sp... | ../data/OxfordIIITPet/images/english_cocker_sp... | english_cocker_spaniel | english_cocker_spaniel |
| 3 | 629 | 718 | 0.0 | True | ../data/OxfordIIITPet/images/Bombay_126.jpg | ../data/OxfordIIITPet/images/Bombay_220.jpg | Bombay | Bombay |
| 4 | 621 | 710 | 0.0 | True | ../data/OxfordIIITPet/images/Bombay_118.jpg | ../data/OxfordIIITPet/images/Bombay_209.jpg | Bombay | Bombay |
r_index = 0
fig, axes = plt.subplots(3, 5, figsize=(10, 7))
for h_ax in axes:
    for ax in h_ax:
        idx = issues.get_issues('irrelevants')['indices'][r_index]
        ax.imshow(dataset[idx][0])
        ax.set_title(f"Ranking: {r_index+1}, Idx: {idx}")
        ax.set_xticks([])
        ax.set_yticks([])
        r_index += 1
fig.tight_layout()
plt.show()
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_irrelevants.head()
2024-10-04 10:38:12.871 | WARNING | Returning as dataframe requires extensive memory.
| | indices | scores | auto_issues | path | label |
|---|---|---|---|---|---|
| 0 | 4946 | 0.768865 | False | ../data/OxfordIIITPet/images/keeshond_50.jpg | keeshond |
| 1 | 5090 | 0.771782 | False | ../data/OxfordIIITPet/images/leonberger_180.jpg | leonberger |
| 2 | 3463 | 0.771972 | False | ../data/OxfordIIITPet/images/chihuahua_156.jpg | chihuahua |
| 3 | 2904 | 0.773895 | False | ../data/OxfordIIITPet/images/basset_hound_193.jpg | basset_hound |
| 4 | 849 | 0.774212 | False | ../data/OxfordIIITPet/images/British_Shorthair... | British_Shorthair |
r_index = 0
fig, axes = plt.subplots(3, 5, figsize=(10, 7))
for h_ax in axes:
    for ax in h_ax:
        idx = issues.get_issues('label_errors')['indices'][r_index]
        ax.imshow(dataset[idx][0])
        ax.set_title(
            f"Ranking: {r_index+1}, Idx: {idx}"
            f"\n{dataset.classes[dataset[idx][1]]}"
        )
        ax.set_xticks([])
        ax.set_yticks([])
        r_index += 1
fig.tight_layout()
plt.show()
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
df_label_errors.head()
2024-10-04 10:38:13.621 | WARNING | Returning as dataframe requires extensive memory.
| | indices | scores | auto_issues | path | label |
|---|---|---|---|---|---|
| 0 | 1813 | 0.069511 | True | ../data/OxfordIIITPet/images/Russian_Blue_112.jpg | Russian_Blue |
| 1 | 836 | 0.076745 | True | ../data/OxfordIIITPet/images/British_Shorthair... | British_Shorthair |
| 2 | 7240 | 0.087889 | True | ../data/OxfordIIITPet/images/yorkshire_terrier... | yorkshire_terrier |
| 3 | 4301 | 0.096670 | True | ../data/OxfordIIITPet/images/great_pyrenees_19... | great_pyrenees |
| 4 | 2758 | 0.100555 | False | ../data/OxfordIIITPet/images/american_pit_bull... | american_pit_bull_terrier |
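Once the issues are inspected, a natural next step is to exclude the automatically flagged samples from a cleaned training split. Below is a minimal sketch of this filtering, using the `auto_issues` column of the dataframes shown above; the toy dataframes are hypothetical stand-ins with the same columns, and for near-duplicate pairs only the second member is dropped:

```python
import pandas as pd

# Toy stand-ins mirroring the issue dataframes above (hypothetical values).
df_label_errors = pd.DataFrame({"indices": [1813, 836], "auto_issues": [True, True]})
df_irrelevants = pd.DataFrame({"indices": [4946], "auto_issues": [False]})
df_near_duplicates = pd.DataFrame(
    {"indices_1": [698, 1112], "indices_2": [729, 1141], "auto_issues": [True, True]}
)

# Collect every automatically flagged sample into one exclusion set.
flagged = set()
flagged |= set(df_label_errors.loc[df_label_errors["auto_issues"], "indices"])
flagged |= set(df_irrelevants.loc[df_irrelevants["auto_issues"], "indices"])
# For near-duplicate pairs, keep the first image and drop the second.
flagged |= set(df_near_duplicates.loc[df_near_duplicates["auto_issues"], "indices_2"])

print(sorted(flagged))  # → [729, 836, 1141, 1813]
```

With the real dataframes, the remaining indices could then be wrapped in `torch.utils.data.Subset(dataset, clean_indices)` to train on the cleaned data.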