# # Image Search in Large Datasets
#
# [![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=google-colab&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/image-search.ipynb)
# [![Open in Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=kaggle&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/image-search.ipynb)
# [![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/image-search)
#
# With ever-increasing amounts of data generated every day, it's important to have efficient ways to search through large image datasets and find the images you need.
#
# If you only have a CPU-only machine and want to search through a large dataset using images as queries, this tutorial is for you.
#
# We will walk you through how to use fastdup to search through thousands of images and find images that look similar to your query image.
# ## Installation
# In[ ]:
get_ipython().system('pip install fastdup -Uq')
# In[1]:
import fastdup
fastdup.__version__
# ## Shopee Product Match Dataset
#
# In this notebook we will use a dataset from the [Shopee Product Match Kaggle Competition](https://www.kaggle.com/competitions/shopee-product-matching/data). In this competition, participants must determine whether two products are the same based on their images.
#
# Head to Kaggle and download the dataset into your local directory. You should have a folder named `shopee-product-matching` in your current working directory.
# With the dataset downloaded, let's randomly pick a few images and preview them.
# In[2]:
sample_images = get_ipython().getoutput("find shopee-product-matching/ -name '*.jpg'")
ret = fastdup.generate_sprite_image(sample_images, 55, ".")[0]
# In[3]:
from IPython.display import Image
Image(filename=ret)
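# The text above mentions picking images at random. If you would like to control that pick yourself, the sketch below shows one way to draw a reproducible random sample of image paths using only the Python standard library. The helper name `sample_image_paths` and its parameters are assumptions made for illustration and are not part of fastdup.
# In[ ]:
import random
from pathlib import Path


def sample_image_paths(root, n, pattern="*.jpg", seed=0):
    """Return up to `n` randomly chosen paths matching `pattern` under `root`."""
    # Sort first so the sample does not depend on filesystem ordering.
    paths = sorted(str(p) for p in Path(root).rglob(pattern))
    rng = random.Random(seed)  # fixed seed: the same sample is drawn on every run
    return rng.sample(paths, min(n, len(paths)))


# Hypothetical usage with the dataset folder from this notebook:
# sample_images = sample_image_paths("shopee-product-matching", 55)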
# ## Run fastdup
#
# Point `input_dir` to the directory where your images are stored, and `work_dir` to the directory where fastdup should write its output.
# In[4]:
input_dir = "./shopee-product-matching"
work_dir = "./my-fastdup-workdir"
fastdup.run(input_dir, work_dir)
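# The search steps later in this notebook rely on whatever the run wrote into `work_dir`. As a minimal sanity check, the sketch below only confirms that the directory exists and is not empty. It makes no assumption about the specific files fastdup produces, and the helper name `has_artifacts` is hypothetical.
# In[ ]:
from pathlib import Path


def has_artifacts(directory):
    """Return True if `directory` exists and contains at least one entry."""
    path = Path(directory)
    return path.is_dir() and any(path.iterdir())


# Hypothetical usage with the work directory from this notebook:
# has_artifacts("./my-fastdup-workdir")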
# ## Restart Runtime
#
# Once the run is complete, you can terminate the session and use the generated artifacts to run an image search later.
#
# Let's restart the kernel to simulate a different session.
# In[5]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
# ## Initialize Search Parameters
#
# To start searching we must first initialize the search parameters.
#
# The first positional argument is `k`, the number of nearest neighbors to search for.
#
# In this case we want to search for the 10 nearest neighbors. Feel free to experiment with your own value of `k`.
# In[2]:
import fastdup
work_dir = "./my-fastdup-workdir"
fastdup.init_search(10, work_dir=work_dir)
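# To build some intuition for what `k` controls, here is a minimal brute-force sketch of a k-nearest-neighbor lookup over feature vectors using cosine similarity. It is an illustration only: this notebook does not show how fastdup indexes or compares images internally, and the names `cosine_similarity`, `k_nearest` and the toy vectors below are assumptions.
# In[ ]:
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest(query, vectors, k):
    """Return the `k` (name, similarity) pairs most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]


# Made-up 2-D "feature vectors" standing in for image embeddings.
toy_vectors = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.9, 0.1],
    "c.jpg": [0.0, 1.0],
}
nearest = k_nearest([1.0, 0.0], toy_vectors, k=2)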
# ## Search with a Query Image
#
# Let's pick a query image and find out whether there are matches in the Shopee dataset.
# In[4]:
from IPython.display import Image
Image(filename="shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")
# Specify the query image filename and search for similar images in the images directory.
# In[5]:
df = fastdup.search("shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg")
# Inspect the search result.
#
# The `distance` value indicates how similar your query image is to each returned image.
#
# A `distance` of `1.0` indicates the images are exact duplicates. Despite the name, the value behaves like a similarity score: the lower the value, the less similar the images are.
# In[6]:
df
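# Since higher `distance` values mean more similar images, a common follow-up is to keep only the strongest matches. A minimal sketch, assuming each result row exposes a `distance` field as described above. The helper `filter_matches`, the `0.9` threshold, and the example rows are hypothetical and not taken from the actual search output.
# In[ ]:
def filter_matches(rows, min_distance=0.9):
    """Keep rows whose `distance` is at least `min_distance`, most similar first."""
    kept = [row for row in rows if row["distance"] >= min_distance]
    return sorted(kept, key=lambda row: row["distance"], reverse=True)


# Made-up rows for illustration only. If the result is a pandas DataFrame,
# rows in this shape can be obtained with df.to_dict("records").
example_rows = [
    {"to": "x.jpg", "distance": 0.95},
    {"to": "y.jpg", "distance": 0.50},
    {"to": "z.jpg", "distance": 1.00},
]
strong_matches = filter_matches(example_rows)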
# You can repeat the search as many times as you wish as long as the model is loaded in memory.
#
# Let's try to search using another query image.
# In[7]:
Image(filename="shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")
# In[8]:
df2 = fastdup.search("shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg")
# In[9]:
df2
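# When you have several query images, repeating the same call by hand gets tedious. The sketch below wraps the repetition in a small helper that takes the search function as an argument, so it stays independent of any particular library. The name `search_many` is an assumption made for illustration.
# In[ ]:
def search_many(query_paths, search_fn):
    """Run `search_fn` on each path and collect the results keyed by path."""
    return {path: search_fn(path) for path in query_paths}


# Hypothetical usage with the two query images from this notebook:
# results = search_many(
#     [
#         "shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg",
#         "shopee-product-matching/test_images/0007585c4d0f932859339129f709bfdc.jpg",
#     ],
#     fastdup.search,
# )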
# ## Visualize Results
#
# This step is optional. fastdup provides a convenient way to visualize your search results as galleries of duplicate and similar-looking images.
# In[10]:
fastdup.create_duplicates_gallery(df, work_dir, input_dir="./shopee-product-matching")
# In[12]:
from IPython.display import HTML
HTML(filename="./my-fastdup-workdir/duplicates.html")
# In[14]:
fastdup.create_similarity_gallery(df, work_dir, input_dir="./shopee-product-matching", min_items=3)
HTML(filename="./my-fastdup-workdir/similarity.html")
# Feel free to repeat the search using other images and visualize them.
# ## Interactive Exploration
# In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.
#
# To explore the dataset and issues interactively in a browser, run:
# In[ ]:
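fd = fastdup.create(work_dir=work_dir, input_dir="./shopee-product-matching")  # `fd` was never defined above; this assumes the existing work_dir can be reused (otherwise call fd.run() first)
fd.explore()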
# > 🗒 **Note** - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup results in a non-interactive way using the built-in galleries shown above.
#
# You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.
#
#
# ![image.png](https://vl-blog.s3.us-east-2.amazonaws.com/fastdup_assets/cloud_preview.gif)
# ## Wrap up
# Congratulations! You've made it to the end of the tutorial!
#
# Image similarity search is an incredibly powerful tool to have in your arsenal as a machine learning practitioner.
#
# For example, if your model is not performing well on a particular category of images, you could use image search to find more examples of that category and add them to your training data.
#
# Next, feel free to check out our other tutorials:
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled dataset in an ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
#