!pip install selfclean -Uq
!pip freeze | grep selfclean
selfclean==0.0.28
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False
import os
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
import torch
from torchvision import datasets, transforms
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import copy
import sys
from selfclean import SelfClean
from selfclean.cleaner.selfclean import PretrainingType, DINO_STANDARD_HYPERPARAMETERS
from selfclean.utils.data_downloading import get_oxford_pets3t
if IN_COLAB or IN_KAGGLE:
    !git clone https://github.com/Digital-Dermatology/selfclean.git
    sys.path.append("selfclean")
else:
    sys.path.append("../")
if IN_COLAB or IN_KAGGLE:
    pre_computed_path = Path("selfclean/assets/pre_trained_models")
else:
    pre_computed_path = Path("../assets/pre_trained_models")
We start by downloading our dataset to analyze.
dataset_name = "OxfordIIITPet"
data_path = Path("../data/") / dataset_name
dataset, df = get_oxford_pets3t(
    root_path=data_path,
    return_dataframe=True,
    transform=transforms.Resize((256, 256)),
)
dataset
Oxford PetIIIT already downloaded to `../data/OxfordIIITPet`.
Dataset ImageFolder
    Number of datapoints: 7390
    Root location: ../data/OxfordIIITPet
    StandardTransform
Transform: Resize(size=(256, 256), interpolation=bilinear, max_size=None, antialias=None)
fig, axes = plt.subplots(3, 6)
for h_idx, h_ax in enumerate(axes):
    for v_idx, ax in enumerate(h_ax):
        # show a randomly chosen sample in each subplot
        index = np.random.randint(0, high=len(dataset))
        ax.imshow(dataset[index][0])
        ax.set_xticks([])
        ax.set_yticks([])
fig.tight_layout()
plt.show()
We can get a quick overview of the dataset's quality by using the representations of a model pretrained on ImageNet.
parameters = copy.deepcopy(DINO_STANDARD_HYPERPARAMETERS)
# set the model to a pretrained ImageNet one
parameters['model']['base_model'] = 'pretrained_imagenet_vit_tiny'
selfclean = SelfClean(
    plot_top_N=7,
    auto_cleaning=True,
)
_ = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
    epochs=0,  # set the number of epochs to 0 to run without pre-training
    batch_size=16,
    hyperparameters=parameters,
)
2024-10-04 10:33:45.926 | INFO | Running on: cuda
2024-10-04 10:33:45.927 | INFO | Data loaded: there are 7390 train images and 462 batches with a batch size of 16.
Some weights of the model checkpoint at WinKawaks/vit-tiny-patch16-224 were not used when initializing ViTModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing ViTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ViTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ViTModel were not initialized from the model checkpoint at WinKawaks/vit-tiny-patch16-224 and are newly initialized: ['vit.pooler.dense.bias', 'vit.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-10-04 10:33:46.708 | INFO | Student and Teacher are built: they are both pretrained_imagenet_vit_tiny network.
2024-10-04 10:33:46.710 | INFO | Pre-trained weights not found. Training from scratch.
Creating dataset representation: 0%| | 0/462 [00:00<?, ?it/s]
2024-10-04 10:34:36.329 | INFO | Fitting cleaner on representation space: (7390, 192)
Creating distance matrix: 0%| | 0/74 [00:00<?, ?it/s]
Processing possible near duplicates: 0%| | 0/2731 [00:00<?, ?it/s]
Processing possible irrelevant samples: 0it [00:00, ?it/s]
We have identified that there are indeed data quality issues, and we can now start running SelfClean. As a first step, this will train a model using self-supervised learning on the provided dataset. Afterward, it will use the learned representations to detect data quality issues using simple scoring functions.
Self-supervised pre-training can take some time, so we set the number of pre-training epochs here to 10. However, we suggest letting it run for longer to achieve optimal performance.
Here, we have already carried out the SSL pre-training to speed things up.
selfclean = SelfClean(
    plot_top_N=7,
    auto_cleaning=True,
)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
    pretraining_type=PretrainingType.DINO,
    epochs=10,
    batch_size=16,
    save_every_n_epochs=1,
    dataset_name=dataset_name,
    work_dir=pre_computed_path,
    hyperparameters=DINO_STANDARD_HYPERPARAMETERS,
)
2024-10-04 10:35:34.041 | INFO | Running on: cuda
2024-10-04 10:35:34.042 | INFO | Data loaded: there are 7390 train images and 462 batches with a batch size of 16.
2024-10-04 10:35:58.078 | INFO | Student and Teacher are built: they are both pretrained_imagenet_dino network.
2024-10-04 10:35:58.080 | INFO | Found checkpoint at ../assets/pre_trained_models/DINO-OxfordIIITPet/checkpoints/model_best.pth
Creating dataset representation: 0%| | 0/462 [00:00<?, ?it/s]
2024-10-04 10:36:43.516 | INFO | Fitting cleaner on representation space: (7390, 192)
Creating distance matrix: 0%| | 0/74 [00:00<?, ?it/s]
Processing possible near duplicates: 0%| | 0/2731 [00:00<?, ?it/s]
Processing possible irrelevant samples: 0it [00:00, ?it/s]
print(f"Automatic `near duplicates` detected: {len(issues.get_issues('near_duplicates')['auto_issues'])}")
print(f"Automatic `irrelevant samples` detected: {len(issues.get_issues('irrelevants')['auto_issues'])}")
print(f"Automatic `label errors` detected: {len(issues.get_issues('label_errors')['auto_issues'])}")
Automatic `near duplicates` detected: 89
Automatic `irrelevant samples` detected: 0
Automatic `label errors` detected: 4
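To put these counts into perspective, we can relate them to the dataset size of 7,390 images reported earlier. A minimal sketch, using only the numbers printed above:

```python
# Share of the 7390 samples flagged automatically, per issue type.
n_samples = 7390
auto_counts = {"near_duplicates": 89, "irrelevants": 0, "label_errors": 4}
shares = {k: round(100 * v / n_samples, 2) for k, v in auto_counts.items()}
print(shares)  # → {'near_duplicates': 1.2, 'irrelevants': 0.0, 'label_errors': 0.05}
```

Even at roughly one percent, near duplicates can inflate evaluation scores if the pair members end up in different splits.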
Let's look at each issue type in more detail.
# remove the resize transform so images are shown at their original size
dataset.transforms = None
r_index = 0
fig, axes = plt.subplots(6, 5, figsize=(10, 13))
for h_idx, h_ax in enumerate(axes):
    # every odd row shows the duplicate partners of the row above
    if h_idx % 2 == 1:
        continue
    for v_idx, ax in enumerate(h_ax):
        idx1, idx2 = issues.get_issues('near_duplicates')['indices'][r_index]
        idx1, idx2 = int(idx1), int(idx2)
        ax.imshow(dataset[idx1][0])
        axes[h_idx + 1, v_idx].imshow(dataset[idx2][0])
        ax.set_title(
            f"Ranking: {r_index+1}"
            f"\nIdx1: {idx1}"
            f"\nIdx2: {idx2}"
        )
        ax.set_xticks([])
        ax.set_yticks([])
        axes[h_idx + 1, v_idx].set_xticks([])
        axes[h_idx + 1, v_idx].set_yticks([])
        r_index += 1
fig.tight_layout()
plt.show()
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_near_duplicates.head()
2024-10-04 10:37:41.031 | WARNING | Returning as dataframe requires extensive memory.
| | indices_1 | indices_2 | scores | auto_issues | path_indices_1 | path_indices_2 | label_indices_1 | label_indices_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 698 | 729 | 0.0 | True | ../data/OxfordIIITPet/images/Bombay_194.jpg | ../data/OxfordIIITPet/images/Bombay_32.jpg | Bombay | Bombay |
| 1 | 1112 | 1141 | 0.0 | True | ../data/OxfordIIITPet/images/Egyptian_Mau_210.jpg | ../data/OxfordIIITPet/images/Egyptian_Mau_41.jpg | Egyptian_Mau | Egyptian_Mau |
| 2 | 3659 | 3671 | 0.0 | True | ../data/OxfordIIITPet/images/english_cocker_sp... | ../data/OxfordIIITPet/images/english_cocker_sp... | english_cocker_spaniel | english_cocker_spaniel |
| 3 | 629 | 718 | 0.0 | True | ../data/OxfordIIITPet/images/Bombay_126.jpg | ../data/OxfordIIITPet/images/Bombay_220.jpg | Bombay | Bombay |
| 4 | 621 | 710 | 0.0 | True | ../data/OxfordIIITPet/images/Bombay_118.jpg | ../data/OxfordIIITPet/images/Bombay_209.jpg | Bombay | Bombay |
r_index = 0
fig, axes = plt.subplots(3, 5, figsize=(10, 7))
for h_ax in axes:
    for ax in h_ax:
        idx = issues.get_issues('irrelevants')['indices'][r_index]
        ax.imshow(dataset[idx][0])
        ax.set_title(f"Ranking: {r_index+1}, Idx: {idx}")
        ax.set_xticks([])
        ax.set_yticks([])
        r_index += 1
fig.tight_layout()
plt.show()
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_irrelevants.head()
2024-10-04 10:38:12.871 | WARNING | Returning as dataframe requires extensive memory.
| | indices | scores | auto_issues | path | label |
|---|---|---|---|---|---|
| 0 | 4946 | 0.768865 | False | ../data/OxfordIIITPet/images/keeshond_50.jpg | keeshond |
| 1 | 5090 | 0.771782 | False | ../data/OxfordIIITPet/images/leonberger_180.jpg | leonberger |
| 2 | 3463 | 0.771972 | False | ../data/OxfordIIITPet/images/chihuahua_156.jpg | chihuahua |
| 3 | 2904 | 0.773895 | False | ../data/OxfordIIITPet/images/basset_hound_193.jpg | basset_hound |
| 4 | 849 | 0.774212 | False | ../data/OxfordIIITPet/images/British_Shorthair... | British_Shorthair |
r_index = 0
fig, axes = plt.subplots(3, 5, figsize=(10, 7))
for h_ax in axes:
    for ax in h_ax:
        idx = issues.get_issues('label_errors')['indices'][r_index]
        ax.imshow(dataset[idx][0])
        ax.set_title(
            f"Ranking: {r_index+1}, Idx: {idx}"
            f"\n{dataset.classes[dataset[idx][1]]}"
        )
        ax.set_xticks([])
        ax.set_yticks([])
        r_index += 1
fig.tight_layout()
plt.show()
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
df_label_errors.head()
2024-10-04 10:38:13.621 | WARNING | Returning as dataframe requires extensive memory.
| | indices | scores | auto_issues | path | label |
|---|---|---|---|---|---|
| 0 | 1813 | 0.069511 | True | ../data/OxfordIIITPet/images/Russian_Blue_112.jpg | Russian_Blue |
| 1 | 836 | 0.076745 | True | ../data/OxfordIIITPet/images/British_Shorthair... | British_Shorthair |
| 2 | 7240 | 0.087889 | True | ../data/OxfordIIITPet/images/yorkshire_terrier... | yorkshire_terrier |
| 3 | 4301 | 0.096670 | True | ../data/OxfordIIITPet/images/great_pyrenees_19... | great_pyrenees |
| 4 | 2758 | 0.100555 | False | ../data/OxfordIIITPet/images/american_pit_bull... | american_pit_bull_terrier |
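Once the issues are inspected, a natural next step is to exclude the automatically flagged samples from a cleaned training split. Below is a minimal sketch of this filtering, using the `auto_issues` column of the dataframes shown above; the toy dataframes are hypothetical stand-ins with the same columns, and for near-duplicate pairs only the second member is dropped:

```python
import pandas as pd

# Toy stand-ins mirroring the issue dataframes above (hypothetical values).
df_label_errors = pd.DataFrame({"indices": [1813, 836], "auto_issues": [True, True]})
df_irrelevants = pd.DataFrame({"indices": [4946], "auto_issues": [False]})
df_near_duplicates = pd.DataFrame(
    {"indices_1": [698, 1112], "indices_2": [729, 1141], "auto_issues": [True, True]}
)

# Collect every automatically flagged sample into one exclusion set.
flagged = set()
flagged |= set(df_label_errors.loc[df_label_errors["auto_issues"], "indices"])
flagged |= set(df_irrelevants.loc[df_irrelevants["auto_issues"], "indices"])
# For near-duplicate pairs, keep the first image and drop the second.
flagged |= set(df_near_duplicates.loc[df_near_duplicates["auto_issues"], "indices_2"])

print(sorted(flagged))  # → [729, 836, 1141, 1813]
```

With the real dataframes, the remaining indices could then be wrapped in `torch.utils.data.Subset(dataset, clean_indices)` to train on the cleaned data.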