#!/usr/bin/env python
# coding: utf-8

# [![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# # Analyzing Image Classification Dataset
# 
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
# [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
# 
# This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze an image classification dataset for:
# 
# + Duplicates
# + Outliers
# + Wrong labels
# + Image clusters
# 
# 
# > **Note** - No GPU needed! You can run this notebook on a CPU-only instance.
# 
# 

# ## Installation
# 
# First, install [fastdup](https://github.com/visual-layer/fastdup) from PyPI:

# In[1]:


get_ipython().system('pip install -Uq fastdup')


# Now, test the installation. If there's no error message, we are ready to go.

# In[2]:


import fastdup
fastdup.__version__


# ## Download Dataset
# 
# We will analyze the [Imagenette](https://github.com/fastai/imagenette) dataset, a subset of 10 easily classified classes from ImageNet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).

# In[ ]:


get_ipython().system('wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz')
get_ipython().system('tar -xf imagenette2-160.tgz')


# ## Load and Format Annotations

# In[3]:


import pandas as pd


# In[4]:


data_dir = 'imagenette2-160/'
csv_path = 'imagenette2-160/noisy_imagenette.csv'


# In[5]:


label_map = {
    'n02979186': 'cassette_player', 
    'n03417042': 'garbage_truck', 
    'n01440764': 'tench', 
    'n02102040': 'English_springer', 
    'n03028079': 'church',
    'n03888257': 'parachute', 
    'n03394916': 'French_horn', 
    'n03000684': 'chain_saw', 
    'n03445777': 'golf_ball', 
    'n03425413': 'gas_pump'
}


# Load the annotations provided with the dataset.

# In[6]:


df_annot = pd.read_csv(csv_path)
df_annot.head(3)


# Transform the annotations into the format fastdup supports.
# 
# fastdup expects an annotation `DataFrame` that contains the following columns:
# 
# + filename - the path to the image file
# + label - the label of the image
# + split - whether the image belongs to the training, validation, or test subset

# In[7]:


# take relevant columns
df_annot = df_annot[['path', 'noisy_labels_0']]

# rename columns to fastdup's column names
df_annot = df_annot.rename({'noisy_labels_0': 'label', 'path': 'filename'}, axis='columns')

# prepend data_dir to each relative path
df_annot['filename'] = df_annot['filename'].apply(lambda x: data_dir + x)

# derive the split ('train' or 'val') from the second path component
df_annot['split'] = df_annot['filename'].apply(lambda x: x.split("/")[1])

# map label ids to regular labels
df_annot['label'] = df_annot['label'].map(label_map)

# show formatted annotations
df_annot
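

# Before running fastdup, it can help to sanity-check the formatted annotations. The cell below is a minimal sketch and not part of fastdup: `find_annotation_problems` is a hypothetical helper that takes plain Python rows (for example `df_annot.to_dict('records')`) and reports missing columns and rows without a label, which is what `map` produces when a class id is absent from `label_map`.

# In[ ]:


import math

REQUIRED_COLUMNS = ('filename', 'label', 'split')


def find_annotation_problems(rows):
    """Return (missing_columns, indexes_of_rows_without_a_label)."""
    missing_columns = set()
    unlabeled_rows = []
    for index, row in enumerate(rows):
        missing_columns.update(c for c in REQUIRED_COLUMNS if c not in row)
        label = row.get('label')
        # pandas represents an unmapped label as NaN, which is a float
        if label is None or (isinstance(label, float) and math.isnan(label)):
            unlabeled_rows.append(index)
    return sorted(missing_columns), unlabeled_rows


# toy rows standing in for df_annot.to_dict('records')
example_rows = [
    {'filename': 'imagenette2-160/train/n02979186/a.JPEG', 'label': 'cassette_player', 'split': 'train'},
    {'filename': 'imagenette2-160/val/n01440764/b.JPEG', 'label': float('nan'), 'split': 'val'},
]
find_annotation_problems(example_rows)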


# ## Run fastdup
# 
# With the images and annotations ready, we can proceed with running an analysis on the data.

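# + `input_dir` is the path to the downloaded images
# + `work_dir` is the path to store the artifacts from the analysis (optional; omitted here, so fastdup falls back to a default location)
# + `threshold` is the similarity cutoff (between 0 and 1) above which two images are reported as similar
# + `ccthreshold` is the similarity cutoff used when grouping images into connected components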

# In[8]:


fd = fastdup.create(input_dir=data_dir) 
fd.run(annotations=df_annot, ccthreshold=0.9, threshold=0.8)


# ## Outliers
# 
# Visualize outliers from the dataset.

# In[9]:


fd.vis.outliers_gallery()


# Show the outlier metadata as a `DataFrame`.

# In[10]:


fd.outliers().head(5)


# ## Comparing Labels of Similar Images
# 
# Find possible mislabels by comparing the label of a query image with the labels of its most similar images in the dataset.

# In[11]:


fd.vis.similarity_gallery() 
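

# The idea behind this gallery can be sketched in plain Python: if most of an image's nearest neighbors carry a different label, the image is a candidate mislabel. This is only an illustration under stated assumptions. `flag_suspect_label` is a hypothetical helper, and the 0.5 agreement cutoff and the toy labels are made up for the example; it is not fastdup's own logic.

# In[ ]:


from collections import Counter


def flag_suspect_label(query_label, neighbor_labels, min_agreement=0.5):
    """Return the majority neighbor label if it disagrees with query_label, else None."""
    if not neighbor_labels:
        return None
    majority_label, votes = Counter(neighbor_labels).most_common(1)[0]
    agreement = votes / len(neighbor_labels)
    if majority_label != query_label and agreement > min_agreement:
        return majority_label
    return None


# toy example: an image labeled 'church' whose neighbors are mostly 'gas_pump'
flag_suspect_label('church', ['gas_pump', 'gas_pump', 'church', 'gas_pump'])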


# ## Similar Image Pairs
# 
# Find similar image pairs within and across the train and validation subfolders. Pairs may include train-train, train-val, val-train, and val-val.

# In[12]:


fd.vis.duplicates_gallery()


# Show the similar image pairs as a `DataFrame`.

# In[13]:


fd.similarity().head(5)
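

# Pairs that cross the train/val boundary deserve a closer look, because the same image in both subsets leaks training data into validation. The cell below is a minimal sketch that counts pairs by split. It assumes you have extracted the two file paths of each pair from the similarity table into plain tuples; `split_of`, `count_pair_kinds`, and the example paths are hypothetical, and no particular fastdup column names are assumed.

# In[ ]:


from collections import Counter


def split_of(path):
    """Return the split folder ('train' or 'val') from a path like 'imagenette2-160/train/...'."""
    return path.split('/')[1]


def count_pair_kinds(pairs):
    """Count pairs by the splits of their two images, e.g. 'train-val'."""
    return Counter(f'{split_of(a)}-{split_of(b)}' for a, b in pairs)


# hypothetical pairs of file paths
example_pairs = [
    ('imagenette2-160/train/n03417042/x.JPEG', 'imagenette2-160/val/n03417042/y.JPEG'),
    ('imagenette2-160/train/n03417042/x.JPEG', 'imagenette2-160/train/n03417042/z.JPEG'),
]
count_pair_kinds(example_pairs)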


# ## Image Clusters

# In[14]:


fd.vis.component_gallery()


# You can also visualize clusters with specific labels using the `slice` parameter. For example, let's visualize clusters with the `chain_saw` label:

# In[15]:


fd.vis.component_gallery(slice='chain_saw')


# ## Connected Components

# In[16]:


cc_df, _ = fd.connected_components()
cc_df.sort_values('count', ascending=False).head(5)
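

# Conceptually, a connected component is a group of images linked, directly or through a chain of other images, by pairs whose similarity exceeds the threshold. The cell below is a small union-find sketch of that idea on toy image ids. `group_into_components` is a hypothetical helper for illustration only and is not how fastdup computes its components.

# In[ ]:


def group_into_components(pairs):
    """Group items linked by pairs into components, largest first, using union-find."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving keeps trees shallow
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(groups.values(), key=len, reverse=True)


# toy pairs of image ids whose similarity is assumed to exceed the threshold
example_links = [(1, 2), (2, 3), (7, 8)]
group_into_components(example_links)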


# We can also look up the metadata of an individual image by its `fastdup_id`, which is listed in `fd.annotations()`:

# In[17]:


fd[349]


# ## Wrap Up
# 
# Next, feel free to check out the other tutorials:
# 
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled dataset in an ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. 

# 
# ## VL Profiler
# If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). It is our first no-code commercial product and lets you visualize and inspect your dataset in your browser.
# 
# [Sign up](https://app.visual-layer.com) now, it's free.
# 
# [![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)
# 
# As usual, feedback is welcome! 
# 
# Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).