# # Quickstart - Analyze Dataset for Potential Issues
#
# [![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=google-colab&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quickstart.ipynb)
# [![Open in Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=kaggle&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/quickstart.ipynb)
# [![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/quickstart)
#
# This notebook shows how to quickly analyze an image dataset for potential issues using [fastdup](https://github.com/visual-layer/fastdup). We'll take you on a high-level tour showcasing the core functions of fastdup in the shortest time.
#
# By the end of this notebook, you will learn how to find out if your dataset has issues such as:
#
# + Broken images.
# + Duplicates/near-duplicates.
# + Outliers.
# + Dark/bright/blurry images.
#
# We'll also visualize clusters of visually similar images to provide a bird's-eye view and help you understand the data's structure for further analysis.
# ## Installation
# First, let's start with the installation:
#
# > ✅ **Tip** - If you're new to fastdup, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/quick-dataset-analysis.ipynb) for the best experience. If you'd like to just view and skim through the notebook, we recommend viewing using [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb).
#
#
# In[ ]:
get_ipython().system('pip install fastdup -Uq')
# Now, test the installation by printing out the version. If there's no error message, we are ready to go!
# In[1]:
import fastdup
fastdup.__version__
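# If you'd rather check the installation without importing the package, here is a minimal sketch that uses only the Python standard library. The helper name `installed_version` is our own and not part of fastdup.
# In[ ]:
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if fastdup is installed, otherwise None.
print(installed_version("fastdup"))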
# ## Download Dataset
#
# For demonstration, we will use the [Oxford IIIT Pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/), which is generally well curated. Feel free to swap this dataset with your own.
#
# The dataset consists of images and annotations for 37 pet categories, with roughly 200 images per class.
#
# > 🗒 **Note** - fastdup works on both unlabeled and labeled images. But for now, we are only interested in finding issues in the images and not the annotations.
# > If you're interested in finding annotation issues, head to:
# > + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
# > + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb).
#
#
# Let's download only the images from the dataset and extract them into the local directory:
# In[ ]:
get_ipython().system('wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz -O images.tar.gz')
get_ipython().system('tar xf images.tar.gz')
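# As a quick sanity check after extraction, the sketch below counts the image files under a folder. The helper `count_images` and the extension list are our own assumptions, not part of fastdup. With 37 classes of roughly 200 images each, the count for `images/` should be on the order of 37 × 200.
# In[ ]:
from pathlib import Path

# Assumed set of extensions to treat as images; adjust for your own data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Count files under `root` (recursively) whose extension looks like an image."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


# Prints 0 if the folder does not exist yet.
print(count_images("images/"))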
# ## Run fastdup
#
# Once the extraction completes, we can run fastdup on the images.
#
# To do that, let's initialize fastdup and specify the input directory, which points to the folder of images.
# In[2]:
fd = fastdup.create(input_dir="images/")
# > 🗒 **Note** - The `.create` method also has an optional `work_dir` parameter which specifies the directory to store artifacts from the run.
#
# In other words, you can run `fastdup.create(input_dir="images/", work_dir="my_work_dir/")` if you'd like to store the artifacts in `my_work_dir`.
#
# Now, let's run fastdup.
# In[14]:
fd.run()
# ## View Run Summary
#
# After the run is completed, you can optionally view the summary with:
# In[15]:
fd.summary()
# ## Invalid Images
# From the summary above, we see there are a few invalid images. These are broken images that cannot be read.
#
# You can get a list of broken images with:
# In[5]:
fd.invalid_instances()
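# To build some intuition for what a "broken" file can look like, here is a minimal sketch that checks whether a file starts with a known JPEG or PNG signature. This is not how fastdup detects invalid images; it is a cheap header check of our own (`sniff_image_type` is a hypothetical helper) and it will miss files that have a valid header but are truncated or corrupted later on.
# In[ ]:
from pathlib import Path

# Leading bytes ("magic numbers") of two common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def sniff_image_type(path):
    """Return "jpeg" or "png" if the file starts with that signature, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    for name, magic in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


# List files in images/ whose header is not recognized (skipped if the folder is missing).
images_dir = Path("images")
if images_dir.is_dir():
    suspicious = [p.name for p in sorted(images_dir.glob("*.jpg")) if sniff_image_type(p) is None]
    print(suspicious)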
# ## Duplicate/Near-duplicates
#
# One of the lowest-hanging fruits in cleaning a dataset is finding and eliminating duplicates.
#
# fastdup provides a handy way of visualizing duplicates/near-duplicates using the `duplicates_gallery` method. The `Distance` value indicates how visually similar the image pairs in the gallery are. A `Distance` of `1.0` indicates an exact copy, and lower values indicate less similar pairs.
# In[6]:
fd.vis.duplicates_gallery()
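# To see why a score of `1.0` can mean "exact copy", here is a minimal sketch that assumes the score behaves like a cosine similarity between image embedding vectors. That is our reading rather than something this notebook states, so check the fastdup docs for the exact definition. The vectors below are made up for illustration.
# In[ ]:
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Identical vectors score (approximately) 1.0; unrelated directions score near 0.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))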
# ## Outliers
#
# Similar to duplicate pairs, you can visualize potential outliers in your dataset with:
# In[7]:
fd.vis.outliers_gallery()
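# One common way to think about outliers is "images whose closest neighbor is still not very similar". The sketch below ranks items by their best similarity to any other item, using made-up 2-D embeddings. It illustrates the idea only; we are not claiming this is fastdup's exact procedure, and `rank_outliers` is a hypothetical helper.
# In[ ]:
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_outliers(embeddings):
    """Return (name, best_similarity) pairs, most isolated items first."""
    scores = []
    for name, vec in embeddings.items():
        best = max(
            cosine_similarity(vec, other)
            for other_name, other in embeddings.items()
            if other_name != name
        )
        scores.append((name, best))
    return sorted(scores, key=lambda item: item[1])


# Two similar "cat" vectors and one vector pointing elsewhere (all values are illustrative).
toy_embeddings = {
    "cat_1": [1.0, 0.1],
    "cat_2": [0.9, 0.2],
    "odd": [-0.2, 1.0],
}
print(rank_outliers(toy_embeddings)[0][0])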
# ## Dark, Bright and Blurry Images
#
# fastdup also lets you visualize images from your dataset using statistical metrics.
#
# For example, with `metric='dark'` we can visualize the darkest images from the dataset.
# In[8]:
fd.vis.stats_gallery(metric='dark')
# In[9]:
fd.vis.stats_gallery(metric='bright')
# In[10]:
fd.vis.stats_gallery(metric='blur')
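# For intuition about what such statistics can measure, here is a minimal sketch on tiny grayscale images represented as lists of rows. Mean pixel value is a simple proxy for dark/bright, and the variance of a Laplacian filter is a commonly used proxy for blur. These are generic techniques we picked for illustration; fastdup's own metrics may be computed differently.
# In[ ]:
def mean_brightness(gray):
    """Average pixel value of a grayscale image given as a list of rows."""
    pixels = [p for row in gray for p in row]
    return sum(pixels) / len(pixels)


def laplacian_variance(gray):
    """Variance of the 4-neighbor Laplacian over interior pixels (needs at least 3x3).

    Low values mean few sharp edges, which often indicates a blurry image.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# A flat gray patch (no edges) versus a checkerboard (many sharp edges).
flat = [[128] * 4 for _ in range(4)]
checkerboard = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]
print(mean_brightness(flat), laplacian_variance(flat))
print(mean_brightness(checkerboard), laplacian_variance(checkerboard))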
# ## Visualize Image Clusters
#
# One of fastdup's coolest features is visualizing image clusters. Earlier, we saw how to visualize similar image pairs. In this section, we group similar-looking images (or even duplicates) into clusters and visualize them in the gallery.
#
# To do so, run:
#
#
# > **Note**: fastdup uses default parameter values when creating image clusters. Depending on your data and use case, the best value may vary. Read more [here](https://visual-layer.readme.io/docs/dataset-cleanup) on how to change parameter values to cluster images.
# In[11]:
fd.vis.component_gallery()
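# One way to picture this grouping is as connected components in a graph: images are nodes, and two images are linked when their similarity passes a threshold. The sketch below shows that idea with a small union-find over made-up pairs. The function name, file names, and threshold are illustrative assumptions, not fastdup's API or defaults.
# In[ ]:
def connected_components(pairs, threshold):
    """Group items linked by (a, b, similarity) pairs with similarity >= threshold."""
    parent = {}

    def find(item):
        # Follow parent links to the root, shortening the path along the way.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b, similarity in pairs:
        if similarity >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    # Largest clusters first; names sorted for a stable display.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# a-b and b-c pass the threshold, so a, b and c end up in one cluster; d-e does not.
toy_pairs = [
    ("a.jpg", "b.jpg", 0.98),
    ("b.jpg", "c.jpg", 0.97),
    ("d.jpg", "e.jpg", 0.60),
]
print(connected_components(toy_pairs, threshold=0.9))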
# ## Interactive Exploration
# In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.
#
# To explore the dataset and issues interactively in a browser, run:
# In[ ]:
fd.explore()
# > 🗒 **Note** - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup results in a non-interactive way using fastdup's built-in galleries shown in the previous cells.
#
# You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.
#
#
# ![image.png](https://vl-blog.s3.us-east-2.amazonaws.com/fastdup_assets/cloud_preview.gif)
# ## Wrap Up
#
# That's a wrap! In this notebook we showed how you can run fastdup on a dataset or any folder of images.
#
# We've seen how to use fastdup to find:
#
# + Broken images.
# + Duplicates/near-duplicates.
# + Outliers.
# + Dark, bright and blurry images.
# + Image clusters.
#
# Next, feel free to check out other tutorials -
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
#
# As usual, feedback is welcome! Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).
#
#