#!/usr/bin/env python
# coding: utf-8

# [![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)
#
# # Analyzing Image Classification Dataset
#
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
# [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
#
# This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze an image classification dataset for:
#
# + Duplicates
# + Outliers
# + Wrong labels
# + Image clusters
#
#
# > **Note** - No GPU needed! You can run this notebook on a CPU-only instance.
#
#
# ## Installation
#
# First let's install [fastdup](https://github.com/visual-layer/fastdup) from PyPI with:

# In[1]:


get_ipython().system('pip install -Uq fastdup')


# Now, test the installation. If there's no error message, we are ready to go.

# In[2]:


import fastdup
fastdup.__version__


# ## Download Dataset
#
# We will analyze the [Imagenette](https://github.com/fastai/imagenette) dataset - a subset of 10 easily classified classes from Imagenet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).
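
# This notebook fetches the archive with the `wget` and `tar` shell commands. Where those are unavailable (for example on some Windows setups), the same step can be sketched with the Python standard library. This is a minimal sketch, not part of fastdup: the helper name `fetch_and_extract` is hypothetical, and it assumes the archive is a gzip-compressed tarball.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, dest="."):
    """Download a .tgz archive (unless already present) and unpack it into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last path component of the URL.
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))

    with tarfile.open(archive, "r:gz") as tar:
        # The "data" extraction filter only exists on newer Python versions,
        # so it is passed only when the running interpreter supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return dest
```

# Calling `fetch_and_extract('https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz')` would then unpack the dataset into the current directory, skipping the download when the archive is already there.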

# In[ ]:


get_ipython().system('wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz')
get_ipython().system('tar -xf imagenette2-160.tgz')


# ## Load and Format Annotations

# In[3]:


import pandas as pd


# In[4]:


data_dir = 'imagenette2-160/'
csv_path = 'imagenette2-160/noisy_imagenette.csv'


# In[5]:


label_map = {
    'n02979186': 'cassette_player',
    'n03417042': 'garbage_truck',
    'n01440764': 'tench',
    'n02102040': 'English_springer',
    'n03028079': 'church',
    'n03888257': 'parachute',
    'n03394916': 'French_horn',
    'n03000684': 'chain_saw',
    'n03445777': 'golf_ball',
    'n03425413': 'gas_pump'
}


# Load the annotations provided with the dataset.

# In[6]:


df_annot = pd.read_csv(csv_path)
df_annot.head(3)


# Transform the annotations to the format fastdup supports.
#
# fastdup expects an annotation `DataFrame` that contains the following columns:
#
# + filename - the path to the image file
# + label - the label of the image
# + split - whether the image belongs to the training, validation or test subset

# In[7]:


# take relevant columns
df_annot = df_annot[['path', 'noisy_labels_0']]

# rename columns to fastdup's column names
df_annot = df_annot.rename({'noisy_labels_0': 'label', 'path': 'filename'}, axis='columns')

# prepend data_dir to each path
df_annot['filename'] = df_annot['filename'].apply(lambda x: data_dir + x)

# create split column from the second path component (train or val)
df_annot['split'] = df_annot['filename'].apply(lambda x: x.split("/")[1])

# map label ids to regular labels
df_annot['label'] = df_annot['label'].map(label_map)

# show formatted annotations
df_annot


# ## Run fastdup
#
# With the images and annotations ready, we can proceed with running an analysis on the data.
#
# + `input_dir` is the path to the downloaded images
# + `work_dir` is the path to store the artifacts from the analysis (optional; not set in this example)

# In[8]:


fd = fastdup.create(input_dir=data_dir)
fd.run(annotations=df_annot, ccthreshold=0.9, threshold=0.8)


# ## Outliers
#
# Visualize outliers from the dataset.
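
# As rough intuition for what an outlier is here, think of an image whose closest match in the dataset is still not very similar to it. The toy sketch below ranks items by their best cosine similarity to any other item, least similar first. It is an illustration under that assumption, not fastdup's implementation; the helper names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_outliers(embeddings):
    """Return (name, best_similarity) pairs, least similar first."""
    scores = []
    for name, vec in embeddings.items():
        # Best similarity to any *other* item in the collection.
        best = max(
            cosine(vec, other)
            for other_name, other in embeddings.items()
            if other_name != name
        )
        scores.append((name, best))
    return sorted(scores, key=lambda item: item[1])


# Hypothetical 2-D "embeddings": a.jpg and b.jpg are close, c.jpg is not.
toy = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.9, 0.1],
    "c.jpg": [0.0, 1.0],
}
ranked = rank_outliers(toy)
print(ranked[0][0])  # c.jpg has the weakest nearest neighbor
```

# In this toy data `c.jpg` comes out first because nothing else points in its direction, which is the kind of item an outlier gallery is meant to surface.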

# In[9]:


fd.vis.outliers_gallery()


# Show the outlier image data.

# In[10]:


fd.outliers().head(5)


# ## Comparing Labels of Similar Images
# Find possible mislabels by comparing a query image to other images in the dataset.

# In[11]:


fd.vis.similarity_gallery()


# ## Similar Image Pairs
#
# Find similar image pairs within and across the train and validation subfolders. Pairs may include train-train, train-val, val-train, and val-val.

# In[12]:


fd.vis.duplicates_gallery()


# Show similar image pairs.

# In[13]:


fd.similarity().head(5)


# ## Image Clusters

# In[14]:


fd.vis.component_gallery()


# You can also visualize clusters with specific labels using the `slice` parameter. For example, let's visualize clusters with the `chain_saw` label.

# In[15]:


fd.vis.component_gallery(slice='chain_saw')


# ## Connected Components

# In[16]:


cc_df, _ = fd.connected_components()
cc_df.sort_values('count', ascending=False).head(5)


# We can also get metadata for individual images using their `fastdup_id`, available in `fd.annotations()`.

# In[17]:


fd[349]


# ## Wrap Up
#
# Next, feel free to check out other tutorials:
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
#
# ## VL Profiler
# If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). It is our first no-code commercial product and lets you visualize and inspect your dataset in your browser.
#
# [Sign up](https://app.visual-layer.com) now, it's free.
#
# [![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)
#
# As usual, feedback is welcome!
#
# Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).