#!/usr/bin/env python
# coding: utf-8
# # Analyze Torchvision Datasets
#
# [![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-torchvision-datasets.ipynb)
# [![Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-torchvision-datasets.ipynb)
# [![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/analyzing-torchvision-datasets)
#
# This notebook shows how you can analyze [Torchvision Datasets](https://pytorch.org/vision/main/datasets.html) for issues using fastdup.

# ## Installation
#
# First, let's install the necessary packages.

# In[ ]:


get_ipython().system('pip install -Uq fastdup torchvision')


# Now, test the installation. If there's no error message, we are ready to go.

# In[1]:


import fastdup
fastdup.__version__


# ## Download Dataset
# Torchvision provides many built-in datasets in the `torchvision.datasets` module. The datasets span various tasks such as image classification, object detection, and segmentation, to name a few.
#
# Let's download the [Caltech 256](https://data.caltech.edu/records/nyy15-4j048) dataset to our local directory.
#
# The Caltech 256 dataset consists of 256 object categories containing a total of 30,607 images for image classification.

# In[2]:


from torchvision.datasets import Caltech256
caltech256 = Caltech256(root='./', download=True)


# The dataset is downloaded into the `caltech256` folder in the root directory.

# In[3]:


caltech256.root


# ## Construct Annotation DataFrame
# Although you can run fastdup without annotations, specifying the labels lets us do more analysis with fastdup, such as inspecting mislabels.
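# As a minimal sketch of where such labels can come from, assuming an ImageNet-style layout in which every image sits in a folder named after its category, the label can be read off the parent folder name. The example paths and the `label_from_path` helper are hypothetical and only illustrate the idea; they are not part of fastdup or Torchvision.

```python
import os

# Hypothetical example paths; the folder and file names are illustrative only.
example_paths = [
    "caltech256/256_ObjectCategories/001.ak47/001_0001.jpg",
    "caltech256/256_ObjectCategories/002.american-flag/002_0001.jpg",
]


def label_from_path(filepath):
    """Return the name of the folder that directly contains the file."""
    return os.path.basename(os.path.dirname(filepath))


# One label per path, taken from the parent folder name.
example_labels = [label_from_path(p) for p in example_paths]
print(example_labels)
```

# This rule only holds when the dataset is stored one folder per class; a dataset with a different layout would need its own mapping from file to label.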
# Since the dataset is labeled, let's make use of the labels and feed them into fastdup.
#
# fastdup expects the labels to be formatted as a Pandas `DataFrame` with the columns `filename` and `label`.
# Let's recursively search the directory for the filenames and labels, and format them into a DataFrame.

# In[4]:


import glob
import os
import pandas as pd

# Define the path
path = "caltech256/"

# Define patterns for the JPEG images found in the dataset
patterns = ['*jpg', '*jpeg']

# Use glob to get all image filenames for both extensions
filenames = [f for pattern in patterns for f in glob.glob(path + '**/' + pattern, recursive=True)]

# Extract the parent folder name for each filename
label = [os.path.basename(os.path.dirname(filename)) for filename in filenames]

# Convert to a pandas DataFrame with the filename and label columns
df = pd.DataFrame({
    'filename': filenames,
    'label': label
})
df


# ## Run fastdup
# Once the dataset download completes, analyze the image folder where the dataset is stored.
#
# Point `input_dir` to the directory where the images are stored.

# In[5]:


fd = fastdup.create(input_dir="caltech256")
fd.run(annotations=df)


# ## View Galleries
#
# You can use all of fastdup's gallery methods to view duplicates, clusters, etc.
#
# ```python
# fd.vis.duplicates_gallery()    # create a visual gallery of duplicates
# fd.vis.outliers_gallery()      # create a visual gallery of anomalies
# fd.vis.component_gallery()     # create a visualization of connected components
# fd.vis.stats_gallery()         # create a visualization of images statistics (e.g. blur)
# fd.vis.similarity_gallery()    # create a gallery of similar images
# ```

# Let's view some of the image clusters in the dataset.

# In[6]:


fd.vis.component_gallery()


# And also inspect duplicates.

# In[7]:


fd.vis.duplicates_gallery()


# You can also see potential mislabels.
# In[8]:


fd.vis.similarity_gallery(slice='diff')


# ## Wrap Up
# In this tutorial, we showed how you can analyze datasets from Torchvision Datasets using fastdup.
#
# Next, feel free to check out other tutorials:
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
#
# ## VL Profiler - A faster and easier way to diagnose and visualize dataset issues
#
# If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser.
#
# VL Profiler is free to get started.
# Upload up to 1,000,000 images for analysis at zero cost!
#
# [Sign up](https://app.visual-layer.com) now.
#
# [![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/github_banner_profiler.gif)](https://app.visual-layer.com)
#
# As usual, feedback is welcome! Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).