#!/usr/bin/env python
# coding: utf-8
# # Image Search in Large Datasets
# 
# [![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=google-colab&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/image-search.ipynb)
# [![Open in Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=kaggle&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/image-search.ipynb)
# [![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/image-search)
# 
# With the ever-increasing volume of data generated every day, it's important to have efficient ways to search through large image datasets and find the images you need.
# 
# If you have a CPU-only machine and want to search through a large dataset using images as queries, this tutorial is for you.
# 
# We will walk you through how to use fastdup to search through thousands of images and find images that look similar to your query image.

# ## Installation

# In[ ]:


get_ipython().system('pip install fastdup -Uq')


# In[1]:


import fastdup
fastdup.__version__


# ## Shopee Product Match Dataset
# 
# In this notebook we will use a dataset from the [Shopee Product Match Kaggle Competition](https://www.kaggle.com/competitions/shopee-product-matching/data). In this competition, participants must determine whether two products are the same based on their images.
# 
# Head to Kaggle and download the dataset into your local directory. You should have a folder named `shopee-product-matching` in your current working directory.

# With the dataset downloaded, let's randomly pick a few images and preview them.
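# The preview cell that follows lists images with the shell's `find` command, which may not be available on every platform. As a rough, stdlib-only alternative, the sketch below collects `.jpg` paths and draws a reproducible random sample. `sample_image_paths` is a hypothetical helper written for this tutorial, not part of fastdup.

```python
import random
from pathlib import Path


def sample_image_paths(root, n, seed=0):
    """Return up to n randomly chosen .jpg paths found under root."""
    # Sort first so the sample depends only on the seed, not on directory order.
    paths = sorted(str(p) for p in Path(root).rglob("*.jpg"))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(paths, min(n, len(paths)))


# Assumes the dataset folder from this tutorial sits in the working directory.
sample_images = sample_image_paths("shopee-product-matching", 55)
print(f"picked {len(sample_images)} images")
```

# The resulting list of paths could be passed to `fastdup.generate_sprite_image` in place of the `find` output.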
# In[2]:


# Collect every .jpg path in the dataset, then build a sprite from 55 sampled images.
sample_images = get_ipython().getoutput("find shopee-product-matching/ -name '*.jpg'")
ret = fastdup.generate_sprite_image(sample_images, 55, ".")[0]


# In[3]:


from IPython.display import Image
Image(filename=ret)


# ## Run fastdup
# 
# Point `input_dir` to the location where you store the images.

# In[4]:


input_dir = "./shopee-product-matching"
work_dir = "./my-fastdup-workdir"

fastdup.run(input_dir, work_dir)


# ## Restart Runtime
# 
# Once the run is complete, you can terminate the session and use the generated artifacts to run an image search.
# 
# Let's restart the kernel to simulate a different session.

# In[5]:


import IPython

# Shut down the kernel and restart it (True requests a restart).
app = IPython.Application.instance()
app.kernel.do_shutdown(True)


# ## Initialize Search Parameters
# 
# To start searching, we must first initialize the search parameters.
# 
# The first positional argument is `k`, the number of nearest neighbors to search for.
# 
# In this case we want to search for the 10 nearest neighbors. Feel free to experiment with your own value of `k`.

# In[2]:


import fastdup

# Reuse the work directory produced by the earlier fastdup run.
work_dir = "./my-fastdup-workdir"
fastdup.init_search(10, work_dir=work_dir)


# ## Search with a Query Image
# 
# Let's use our own image and find out whether there are matches in the Shopee dataset.

# In[4]:


from IPython.display import Image
Image(filename="shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")


# Specify the query image filename and search for similar images in the indexed dataset.

# In[5]:


df = fastdup.search("shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")


# Inspect the search result.
# 
# The `distance` value indicates how similar your query image is to each returned image.
# 
# Despite its name, `distance` is a similarity score here: a value of `1.0` indicates the images are exact duplicates, and the lower the value, the less similar the images are.

# In[6]:


df


# You can repeat the search as many times as you wish, as long as the model is loaded in memory.
# 
# Let's try searching with another query image.
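# To make `k` and the `distance` score concrete before running another query, here is a brute-force sketch of a top-`k` lookup. It assumes the score behaves like a cosine similarity between feature vectors, where `1.0` means identical direction. The toy vectors and the helpers `cosine_similarity` and `top_k` are illustrative only and do not reflect how fastdup is implemented internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k):
    """Return the k (name, score) pairs most similar to query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real feature vectors are much longer.
index = {
    "a.jpg": [1.0, 0.0, 0.0],
    "b.jpg": [0.9, 0.1, 0.0],
    "c.jpg": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
for name, score in results:
    print(name, round(score, 3))
```

# With `k=2`, only the two highest-scoring entries are returned, which mirrors how a larger `k` in `init_search` yields more rows per query.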
# In[7]:


Image(filename="shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")


# In[8]:


df2 = fastdup.search("shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")


# In[9]:


df2


# ## Visualize Results
# 
# This step is optional. fastdup provides a convenient way to visualize your search results as galleries of duplicate and similar-looking images.

# In[10]:


fastdup.create_duplicates_gallery(df, work_dir, input_dir="./shopee-product-matching")


# In[12]:


from IPython.display import HTML
HTML(filename="./my-fastdup-workdir/duplicates.html")


# In[14]:


fastdup.create_similarity_gallery(df, work_dir, input_dir="./shopee-product-matching", min_items=3)
HTML(filename="./my-fastdup-workdir/similarity.html")


# Feel free to repeat the search using other images and visualize the results.

# ## Interactive Exploration
# In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.
# 
# To explore the dataset and its issues interactively in a browser, run:

# In[ ]:


# NOTE: `fd` is not defined earlier in this notebook. It is assumed to be a
# fastdup object created with `fastdup.create(...)` that has already been run.
fd.explore()


# > 🗒 **Note** - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup results in a non-interactive way using fastdup's built-in galleries shown in the cells above.
# 
# You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.
# 
# 
# ![image.png](https://vl-blog.s3.us-east-2.amazonaws.com/fastdup_assets/cloud_preview.gif)

# ## Wrap up
# Congratulations! You've made it to the end of the tutorial!
# 
# Image similarity search is an incredibly powerful tool to have in your arsenal as a machine learning practitioner.
# 
# For example, if your model is not performing well on a particular category of images, you could use image search to find more examples of that category and add them to your training data.
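# As a rough sketch of that workflow, the snippet below keeps only the search hits whose score clears a threshold, so they can be reviewed and added to a training set. It assumes the result rows carry `from`, `to`, and `distance` fields, as in `df.to_dict("records")`. The file names, scores, the `0.9` threshold, and the helper `select_matches` are all made up for illustration.

```python
def select_matches(rows, min_distance=0.9):
    """Return the matched file names whose score is at least min_distance."""
    return [row["to"] for row in rows if row["distance"] >= min_distance]


# Hypothetical search results; in practice these would come from the
# DataFrame returned by the search step.
rows = [
    {"from": "query.jpg", "to": "train_images/a.jpg", "distance": 0.97},
    {"from": "query.jpg", "to": "train_images/b.jpg", "distance": 0.91},
    {"from": "query.jpg", "to": "train_images/c.jpg", "distance": 0.62},
]

candidates = select_matches(rows)
print(candidates)
```

# A suitable threshold depends on the dataset, so it is worth inspecting a few borderline matches before adding them in bulk.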
# 
# Next, feel free to check out other tutorials:
# 
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze a folder of images, clean it of potential issues, and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled, ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
# Copyright © 2024 Visual Layer. All rights reserved.