#!/usr/bin/env python
# coding: utf-8

# [![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# # Clean Image Folder
#
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb)
# [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb)
#
# This notebook shows how you can use fastdup to analyze an image folder for potential issues and export a list of problematic files for further action.
#
# By the end of this notebook you will learn how to:
#
# + Find various dataset issues with fastdup.
# + Export a list of problematic images for further action.

# ## Installation
#
# If you're new, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/cleaning-image-dataset.ipynb) for the best experience. If you'd just like to view and skim through the notebook, we recommend viewing it using [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb).
#
# Let's start with the installation:

# In[1]:

get_ipython().system('pip install fastdup -Uq')

# Now, test the installation by printing out the version. If there's no error message, we are ready to go!

# In[2]:

import fastdup
fastdup.__version__

# ## Download Dataset
#
# In this notebook, let's use the widely available and relatively well curated [Food-101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) dataset.
#
# The Food-101 dataset consists of 101 food classes with 1,000 images per class.
# That is a total of 101,000 images.
#
# Let's download the dataset and extract it into our local directory:

# In[ ]:

get_ipython().system('wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz')
get_ipython().system('tar -xf food-101.tar.gz')

# ## Run fastdup
#
# Once the extraction completes, we can run fastdup on the images.
#
# For that, let's create a `fastdup` object and specify the input directory which points to the folder of images.

# In[3]:

fd = fastdup.create(input_dir="food-101/images/")

# > **Note**: If you're running this example on Google Colab, we recommend running with `num_images=40000` in the following cell. This limits fastdup to 40,000 images instead of the entire dataset, which completes faster on Google Colab.

# In[4]:

# fd.run(num_images=40000, ccthreshold=0.9)  # runs fastdup on a subset of 40,000 images from the dataset
fd.run(ccthreshold=0.9)  # runs fastdup on the entire dataset

# > **Note**: `ccthreshold` is a similarity threshold used in the connected components algorithm. Read more [here](https://visual-layer.readme.io/docs/dataset-cleanup#threshold-for-similarity-clusters) on how to set an appropriate value for your dataset.

# Get a summary of the run showing potentially problematic files.

# In[5]:

fd.summary()

# ## Broken Images
#
# The lowest-hanging fruit is to find broken images and remove them from your dataset. These are most probably corrupted files that could not be loaded.
#
# To get the broken images, simply run:

# In[6]:

broken_images = fd.invalid_instances()
broken_images

# This dataset is carefully curated, so we did not find any broken images. Which is great!

# ## List of Broken Images
#
# If there are broken images, however, you can easily get a list of them.

# In[7]:

list_of_broken_images = broken_images['filename'].to_list()
list_of_broken_images

# ## Duplicate Image Pairs
#
# Show a gallery of duplicate image pairs. A distance of `1.0` indicates that the image pairs are exact copies.
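# As a cross-check, exact copies (distance `1.0`) can also be confirmed byte-for-byte with a plain file hash, independent of fastdup. The snippet below is a minimal sketch run on a temporary folder of synthetic files; the helper name and file names are illustrative, not part of the fastdup API.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def group_by_md5(paths):
    """Group file paths by the MD5 digest of their contents,
    keeping only digests shared by more than one file."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        groups[digest].append(str(p))
    return {d: files for d, files in groups.items() if len(files) > 1}

# Synthetic example: two identical files and one distinct file.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.jpg").write_bytes(b"same-bytes")
(tmp / "b.jpg").write_bytes(b"same-bytes")
(tmp / "c.jpg").write_bytes(b"other-bytes")

exact_dups = group_by_md5(sorted(tmp.glob("*.jpg")))
print(exact_dups)  # one group containing a.jpg and b.jpg
```

# Note that hashing only catches bit-identical copies; resized or re-encoded near-duplicates need embedding-based similarity like the gallery below.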
# In[8]:

fd.vis.duplicates_gallery(num_images=5)

# ## Image Clusters
#
# Visualize image clusters from the dataset.
#
# > **Note**: Setting `num_images=5` shows a gallery with 5 rows. Change this value to view more or fewer.

# In[9]:

fd.vis.component_gallery(num_images=5)

# ## List of Duplicates
#
# Now let's single out all duplicates and near-duplicates using the connected components function:

# In[10]:

connected_components_df, _ = fd.connected_components()
connected_components_df.head()

# Let's now write a utility function to get the clusters:

# In[11]:

# a function to group connected components
def get_clusters(df, sort_by='count', min_count=2, ascending=False):
    # columns to aggregate
    agg_dict = {'filename': list, 'mean_distance': max, 'count': len}

    if 'label' in df.columns:
        agg_dict['label'] = list

    # filter by count
    df = df[df['count'] >= min_count]

    # group and aggregate columns
    grouped_df = df.groupby('component_id').agg(agg_dict)

    # sort
    grouped_df = grouped_df.sort_values(by=[sort_by], ascending=ascending)
    return grouped_df

# In[12]:

clusters_df = get_clusters(connected_components_df)
clusters_df.head()

# The above shows the components (clusters) with the most duplicates/near-duplicates.
#
# Now let's keep one image from each cluster and remove the rest:

# In[13]:

# the first sample from each cluster is kept
cluster_images_to_keep = []
list_of_duplicates = []

for cluster_file_list in clusters_df.filename:
    # keep first file, discard rest
    keep = cluster_file_list[0]
    discard = cluster_file_list[1:]

    cluster_images_to_keep.append(keep)
    list_of_duplicates.extend(discard)

print(f"Found {len(set(list_of_duplicates))} highly similar images to discard")

# In[14]:

list_of_duplicates

# ## Outliers
#
# Visualize a gallery of outliers. A lower `Distance` value indicates a higher chance of being an outlier.
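# Rather than hard-coding a cutoff for the outlier distance, one option is to derive it from the distance distribution itself, e.g. flagging the lowest few percent. The sketch below runs on synthetic distances standing in for the real outliers `DataFrame`; the 5th-percentile choice is an assumption for illustration, not a fastdup default.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the outliers DataFrame: most distances are
# high, a few are unusually low (the likely outliers).
rng = np.random.default_rng(0)
outlier_df = pd.DataFrame({
    "filename_outlier": [f"img_{i}.jpg" for i in range(100)],
    "distance": np.concatenate([rng.uniform(0.7, 1.0, 95),
                                rng.uniform(0.3, 0.6, 5)]),
})

# Flag everything below the 5th percentile of the distances.
threshold = np.percentile(outlier_df["distance"], 5)
flagged = outlier_df[outlier_df["distance"] < threshold]
print(f"threshold={threshold:.3f}, flagged {len(flagged)} images")
```

# A percentile-based cutoff adapts to each dataset, at the cost of always flagging a fixed fraction of images; eyeball the gallery below before committing to either approach.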
# In[15]:

fd.vis.outliers_gallery(num_images=5)

# ## List of Outliers
#
# Let's first get the outliers `DataFrame`:

# In[17]:

outlier_df = fd.outliers()
outlier_df.head()

# Let's treat all images with `distance < 0.68` as outliers.

# In[18]:

list_of_outliers = outlier_df[outlier_df.distance < 0.68].filename_outlier.tolist()
list_of_outliers

# ## Dark, Bright and Blurry Images
#
# Visualize images with statistical metrics.
#
# Visualize dark images from the dataset in ascending order.

# In[19]:

fd.vis.stats_gallery(metric='dark', num_images=5)

# ## List of Dark Images
#
# Get a `DataFrame` of image statistics.

# In[20]:

stats_df = fd.img_stats()

# If an image has `mean < 13`, we conclude it's a dark image:

# In[21]:

dark_images = stats_df[stats_df['mean'] < 13]
dark_images

# To get a list of the dark images:

# In[22]:

list_of_dark_images = dark_images['filename'].to_list()
list_of_dark_images

# ## List of Bright Images
#
# Visualize bright images from the dataset in descending order.

# In[23]:

fd.vis.stats_gallery(metric='bright', num_images=5)

# Let's say that if `mean > 220.5`, we conclude it's a bright image. You can set your own mean threshold depending on your data.

# In[24]:

bright_images = stats_df[stats_df['mean'] > 220.5]
bright_images.head()

# Get a list of bright images:

# In[25]:

list_of_bright_images = bright_images['filename'].to_list()
list_of_bright_images

# ## List of Blurry Images
#
# Visualize blurry images from the dataset in ascending order.

# In[26]:

fd.vis.stats_gallery(metric='blur', num_images=5)

# In[27]:

blurry_images = stats_df[stats_df['blur'] < 50]
blurry_images.head()

# Get a list of blurry images:

# In[28]:

list_of_blurry_images = blurry_images['filename'].to_list()
list_of_blurry_images

# ## Summary
#
# Let's print out a summary of the lists of files we got from above.
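# Note that a file can be flagged by more than one check (an image can be both dark and blurry, for instance), so the combined total below is computed over a `set` to avoid double counting. A tiny illustration with hypothetical file names:

```python
# Hypothetical small lists standing in for the real ones above.
duplicates = ["a.jpg", "b.jpg"]
dark = ["b.jpg", "c.jpg"]
blurry = ["c.jpg"]

combined = duplicates + dark + blurry
unique = set(combined)
print(len(combined))  # 5 entries in total
print(len(unique))    # but only 3 unique files
```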
# In[29]:

print(f"Broken: {len(list_of_broken_images)}")
print(f"Duplicates: {len(list_of_duplicates)}")
print(f"Outliers: {len(list_of_outliers)}")
print(f"Dark: {len(list_of_dark_images)}")
print(f"Bright: {len(list_of_bright_images)}")
print(f"Blurry: {len(list_of_blurry_images)}")

problem_images = list_of_duplicates + list_of_broken_images + list_of_outliers + list_of_dark_images + list_of_bright_images + list_of_blurry_images
print(f"Total unique images: {len(set(problem_images))}")

# ## Wrap Up
#
# That's a wrap! In this notebook we showed how you can run fastdup on a dataset or any folder of images.
#
# We've seen how to use fastdup to:
#
# + Find various dataset issues.
# + Export a list of problematic images for further action.
#
# For each problem we got a list of file names for further action. Depending on your use case, you might choose to delete the images, relabel them, or simply move them elsewhere.
#
# Next, feel free to check out other tutorials:
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues.
# If you have a labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.

# ## VL Profiler
#
# If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser.
#
# [Sign up](https://app.visual-layer.com) now, it's free.
#
# [![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)
#
# As usual, feedback is welcome!
#
# Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).