#!/usr/bin/env python
# coding: utf-8

# [![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# # Analyzing Kaggle Datasets
#
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)
# [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-kaggle-datasets.ipynb)
#
# This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze any computer vision dataset from [Kaggle](https://kaggle.com).

# ## Install Kaggle API
#
# To load data programmatically from Kaggle, we will need to install the [Kaggle API](https://github.com/Kaggle/kaggle-api). The API lets us pull data from Kaggle using Python.
#
# To install the API, run:

# In[1]:


get_ipython().system('pip install -Uq kaggle')


# Note: to use the Kaggle API, you'll need to sign up for a Kaggle account at https://www.kaggle.com/.
#
# Go to the 'Account' tab and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials.
#
# Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\Users\<Windows-username>\.kaggle\kaggle.json`).
#
# For more information on the Kaggle API, click [here](https://github.com/Kaggle/kaggle-api#api-credentials).

# If the setup is done correctly, you should be able to run `kaggle` commands in your terminal. For instance, to list Kaggle datasets that match the term "computer vision", run:

# In[2]:


get_ipython().system('kaggle datasets list -s "computer vision"')


# See more commands [here](https://github.com/Kaggle/kaggle-api#commands).
#
# You can also browse the Kaggle website to find the dataset you want to download.
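
# Before downloading anything, it can help to confirm that the credentials file described above is in place. The sketch below is illustrative and not part of the Kaggle API: `check_kaggle_credentials` is a hypothetical helper, and it assumes the token file lives at `~/.kaggle/kaggle.json` and holds `username` and `key` fields.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    # Default location taken from the setup notes; `path` is an override for testing.
    path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} not found"]

    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumption: the token file stores "username" and "key" entries.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")

    # On POSIX systems, flag a file that group or other users can access.
    if os.name == "posix" and path.stat().st_mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("file is accessible by other users; consider chmod 600")

    return problems
```

# Calling `check_kaggle_credentials()` with no arguments inspects the default location. An empty list means the helper found no problems; it does not verify that the token is still valid on Kaggle's side.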

# ## Download Dataset

# Let's say we want to analyze the [RVL-CDIP Test Dataset](https://www.kaggle.com/datasets/pdavpoojan/the-rvlcdip-dataset-test). You can head to the dataset page, click on "Copy API command", and paste it in your terminal.
#
# ![image.png](attachment:4ea6f203-55bd-4ca7-817d-ad2a16721ed0.png)

# Let's run the command here, which will trigger a download of the RVL-CDIP test dataset into our current working directory.

# In[ ]:


get_ipython().system('kaggle datasets download -d pdavpoojan/the-rvlcdip-dataset-test')


# Once done, we should have a `the-rvlcdip-dataset-test.zip` in the current directory.
#
# Let's unzip the file to prepare it for further analysis with fastdup in the next section.

# In[3]:


get_ipython().system('unzip -q the-rvlcdip-dataset-test.zip')


# Once completed, we should have a folder with the name `test/`, which contains all the images from the dataset.

# ## Install fastdup
#
# Next, install fastdup and verify the installation.

# In[4]:


get_ipython().system('pip install -Uq fastdup')


# Now, test the installation. If there's no error message, we are ready to go.

# In[5]:


import fastdup
fastdup.__version__


# ## Run fastdup

# To run fastdup, we simply point `input_dir` to the folder containing the images from the dataset.

# In[6]:


fd = fastdup.create(input_dir='test')
fd.run()


# ## Inspect Issues

# From the summary above, we have 1 corrupted image. Let's get some more details:

# In[7]:


fd.invalid_instances()


# There are several other methods we can use to inspect and visualize the issues found:
#
# ```python
# fd.vis.duplicates_gallery()    # create a visual gallery of duplicates
# fd.vis.outliers_gallery()      # create a visual gallery of anomalies
# fd.vis.component_gallery()     # create a visualization of connected components
# fd.vis.stats_gallery()         # create a visualization of image statistics (e.g. blur)
# fd.vis.similarity_gallery()    # create a gallery of similar images
# ```

# In[8]:


fd.vis.duplicates_gallery()


# In[9]:


fd.vis.component_gallery()


# In[11]:


fd.vis.stats_gallery(metric='dark')


# In[12]:


fd.vis.stats_gallery(metric='bright')


# ## Wrap Up
#
# That's a wrap! In this notebook we showed how to load a dataset from Kaggle and analyze it using fastdup. You can use the same steps to analyze other datasets on [Kaggle](https://kaggle.com).
#
# Try it out and let us know what issues you find.
#
# Next, feel free to check out other tutorials:
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
#
# # VL Profiler
# If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser.
#
# [Sign up](https://app.visual-layer.com) now, it's free.
#
# [![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)
#
# As usual, feedback is welcome!
#
# Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).