#!/usr/bin/env python
# coding: utf-8

# [![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# # Clean Image Folder
#
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb)
# [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb)
#
# This notebook shows how you can use fastdup to analyze an image folder for potential issues and export a list of problematic files for further action.
#
# By the end of this notebook you will learn how to:
#
# + Find various dataset issues with fastdup.
# + Export a list of problematic images for further action.

# ## Installation
#
# If you're new, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/cleaning-image-dataset.ipynb) for the best experience. If you'd just like to view and skim through the notebook, we recommend viewing it using [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb).
#
# Let's start with the installation:

# In[1]:

get_ipython().system('pip install fastdup -Uq')

# Now, test the installation by printing out the version. If there's no error message, we are ready to go!

# In[2]:

import fastdup
fastdup.__version__

# ## Download Dataset
#
# In this notebook, let's use the widely available and relatively well curated [Food-101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) dataset.
#
# The Food-101 dataset consists of 101 food classes with 1,000 images per class.
# That is a total of 101,000 images.
#
# Let's download the dataset and extract it into our local directory:

# In[ ]:

get_ipython().system('wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz')
get_ipython().system('tar -xf food-101.tar.gz')

# ## Run fastdup
#
# Once the extraction completes, we can run fastdup on the images.
#
# For that, let's create a `fastdup` object and specify the input directory which points to the folder of images.

# In[3]:

fd = fastdup.create(input_dir="food-101/images/")

# > **Note**: If you're running this example on Google Colab, we recommend running with `num_images=40000` in the following cell. This limits fastdup to 40,000 images instead of the entire dataset, which completes faster on Google Colab.

# In[4]:

# fd.run(num_images=40000, ccthreshold=0.9)  # runs fastdup on a subset of 40,000 images from the dataset
fd.run(ccthreshold=0.9)  # runs fastdup on the entire dataset

# > **Note**: `ccthreshold` is a similarity threshold used in the connected components algorithm. Read more [here](https://visual-layer.readme.io/docs/dataset-cleanup#threshold-for-similarity-clusters) on how to set an appropriate value for your dataset.

# Get a summary of the run showing potentially problematic files.

# In[5]:

fd.summary()

# ## Broken Images
#
# The lowest-hanging fruit is to find broken images and remove them from your dataset. These are most probably corrupted files that could not be loaded.
#
# To get the broken images, simply run:

# In[6]:

broken_images = fd.invalid_instances()
broken_images

# This dataset is carefully curated, so we did not find any broken images. Which is great!

# ## List of Broken Images
#
# If there are broken images, however, you can easily get a list of them.

# In[7]:

list_of_broken_images = broken_images['filename'].to_list()
list_of_broken_images

# ## Duplicate Image Pairs
#
# Show a gallery of duplicate image pairs. A distance of `1.0` indicates that the image pairs are exact copies.
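# As a cross-check, exact copies (distance `1.0`) can also be confirmed byte-for-byte with a plain file hash, independent of fastdup. The snippet below is a minimal sketch run on a temporary folder of synthetic files; the helper name and file names are illustrative, not part of the fastdup API.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def group_by_md5(paths):
    """Group file paths by the MD5 digest of their contents,
    keeping only digests shared by more than one file."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        groups[digest].append(str(p))
    return {d: files for d, files in groups.items() if len(files) > 1}

# Synthetic example: two identical files and one distinct file.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.jpg").write_bytes(b"same-bytes")
(tmp / "b.jpg").write_bytes(b"same-bytes")
(tmp / "c.jpg").write_bytes(b"other-bytes")

exact_dups = group_by_md5(sorted(tmp.glob("*.jpg")))
print(exact_dups)  # one group containing a.jpg and b.jpg
```

# Note that hashing only catches bit-identical copies; resized or re-encoded near-duplicates need embedding-based similarity like the gallery below.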
# In[8]:

fd.vis.duplicates_gallery(num_images=5)

# ## Image Clusters
#
# Visualize image clusters from the dataset.
#
# > **Note**: Setting `num_images=5` shows a gallery with 5 rows. Change this value to view more or fewer.

# In[9]:

fd.vis.component_gallery(num_images=5)

# ## List of Duplicates
#
# Now let's single out all duplicates and near-duplicates using the connected components function:

# In[10]:

connected_components_df, _ = fd.connected_components()
connected_components_df.head()

# Let's now write a utility function to get the clusters:

# In[11]:

# a function to group connected components
def get_clusters(df, sort_by='count', min_count=2, ascending=False):
    # columns to aggregate
    agg_dict = {'filename': list, 'mean_distance': max, 'count': len}

    if 'label' in df.columns:
        agg_dict['label'] = list

    # filter by count
    df = df[df['count'] >= min_count]

    # group and aggregate columns
    grouped_df = df.groupby('component_id').agg(agg_dict)

    # sort
    grouped_df = grouped_df.sort_values(by=[sort_by], ascending=ascending)
    return grouped_df

# In[12]:

clusters_df = get_clusters(connected_components_df)
clusters_df.head()

# The above shows the components (clusters) with the most duplicates/near-duplicates.
#
# Now let's keep one image from each cluster and remove the rest:

# In[13]:

# the first sample from each cluster is kept
cluster_images_to_keep = []
list_of_duplicates = []

for cluster_file_list in clusters_df.filename:
    # keep first file, discard rest
    keep = cluster_file_list[0]
    discard = cluster_file_list[1:]

    cluster_images_to_keep.append(keep)
    list_of_duplicates.extend(discard)

print(f"Found {len(set(list_of_duplicates))} highly similar images to discard")

# In[14]:

list_of_duplicates

# ## Outliers
#
# Visualize a gallery of outliers. A lower `Distance` value indicates a higher chance of being an outlier.
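# Rather than hard-coding a cutoff for the outlier distance, one option is to derive it from the distance distribution itself, e.g. flagging the lowest few percent. The sketch below runs on synthetic distances standing in for the real outliers `DataFrame`; the 5th-percentile choice is an assumption for illustration, not a fastdup default.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the outliers DataFrame: most distances are
# high, a few are unusually low (the likely outliers).
rng = np.random.default_rng(0)
outlier_df = pd.DataFrame({
    "filename_outlier": [f"img_{i}.jpg" for i in range(100)],
    "distance": np.concatenate([rng.uniform(0.7, 1.0, 95),
                                rng.uniform(0.3, 0.6, 5)]),
})

# Flag everything below the 5th percentile of the distances.
threshold = np.percentile(outlier_df["distance"], 5)
flagged = outlier_df[outlier_df["distance"] < threshold]
print(f"threshold={threshold:.3f}, flagged {len(flagged)} images")
```

# A percentile-based cutoff adapts to each dataset, at the cost of always flagging a fixed fraction of images; eyeball the gallery below before committing to either approach.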
# In[15]:

fd.vis.outliers_gallery(num_images=5)

# ## List of Outliers
#
# Let's first get the outliers `DataFrame`:

# In[17]:

outlier_df = fd.outliers()
outlier_df.head()

# Let's treat all images with `distance < 0.68` as outliers.

# In[18]:

list_of_outliers = outlier_df[outlier_df.distance < 0.68].filename_outlier.tolist()
list_of_outliers

# ## Dark, Bright and Blurry Images
#
# Visualize images with statistical metrics.
#
# Visualize dark images from the dataset in ascending order.

# In[19]:

fd.vis.stats_gallery(metric='dark', num_images=5)

# ## List of Dark Images
#
# Get a `DataFrame` of image statistics.

# In[20]:

stats_df = fd.img_stats()

# If an image has `mean < 13`, we conclude it's a dark image:

# In[21]:

dark_images = stats_df[stats_df['mean'] < 13]
dark_images

# To get a list of the dark images:

# In[22]:

list_of_dark_images = dark_images['filename'].to_list()
list_of_dark_images

# ## List of Bright Images
#
# Visualize bright images from the dataset in descending order.

# In[23]:

fd.vis.stats_gallery(metric='bright', num_images=5)

# Let's say that if `mean > 220.5`, we conclude it's a bright image. You can set your own mean threshold depending on your data.

# In[24]:

bright_images = stats_df[stats_df['mean'] > 220.5]
bright_images.head()

# Get a list of bright images:

# In[25]:

list_of_bright_images = bright_images['filename'].to_list()
list_of_bright_images

# ## List of Blurry Images
#
# Visualize blurry images from the dataset in ascending order.

# In[26]:

fd.vis.stats_gallery(metric='blur', num_images=5)

# In[27]:

blurry_images = stats_df[stats_df['blur'] < 50]
blurry_images.head()

# Get a list of blurry images:

# In[28]:

list_of_blurry_images = blurry_images['filename'].to_list()
list_of_blurry_images

# ## Summary
#
# Let's print out a summary of the lists of files we got from above.
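# Note that a file can be flagged by more than one check (an image can be both dark and blurry, for instance), so the combined total below is computed over a `set` to avoid double counting. A tiny illustration with hypothetical file names:

```python
# Hypothetical small lists standing in for the real ones above.
duplicates = ["a.jpg", "b.jpg"]
dark = ["b.jpg", "c.jpg"]
blurry = ["c.jpg"]

combined = duplicates + dark + blurry
unique = set(combined)
print(len(combined))  # 5 entries in total
print(len(unique))    # but only 3 unique files
```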
# In[29]:

print(f"Broken: {len(list_of_broken_images)}")
print(f"Duplicates: {len(list_of_duplicates)}")
print(f"Outliers: {len(list_of_outliers)}")
print(f"Dark: {len(list_of_dark_images)}")
print(f"Bright: {len(list_of_bright_images)}")
print(f"Blurry: {len(list_of_blurry_images)}")

problem_images = list_of_duplicates + list_of_broken_images + list_of_outliers + list_of_dark_images + list_of_bright_images + list_of_blurry_images
print(f"Total unique images: {len(set(problem_images))}")

# ## Wrap Up
#
# That's a wrap! In this notebook we showed how you can run fastdup on a dataset or any folder of images.
#
# We've seen how to use fastdup to:
#
# + Find various dataset issues.
# + Export a list of problematic images for further action.
#
# For each problem we got a list of file names for further action. Depending on your use case, you might choose to delete the images, relabel them, or simply move them elsewhere.
#
# Next, feel free to check out other tutorials:
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues.
# If you have a labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.

# ## VL Profiler
#
# If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser.
#
# [Sign up](https://app.visual-layer.com) now, it's free.
#
# [![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)
#
# As usual, feedback is welcome!
#
# Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).