#!/usr/bin/env python
# coding: utf-8

# [![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)
#
# # Investigating BLIP model performance with fastdup
#
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/blip_laion_captions.ipynb)
# [![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/blip_laion_captions.ipynb)
#
# This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup)
# to analyze model captions generated with BLIP.

# ## Generate Labels With BLIP
# First, we will initiate the BLIP model using the transformers library.

# In[28]:

from tqdm import tqdm
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import cv2
import numpy as np

# Load the BLIP captioning model and its processor once at module level so
# every batch below reuses the same weights.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")


def generate_blip_labels(filenames, kwargs):
    """Generate a BLIP caption for each image in ``filenames``.

    Parameters
    ----------
    filenames : list[str]
        Paths of the images to caption; they are processed as one batch.
    kwargs : dict
        Unused; kept so the signature stays compatible with callers that
        pass an options dict.

    Returns
    -------
    list[str] or None
        One caption per successfully opened image, in input order (an
        empty list if none opened), or ``None`` if captioning the batch
        failed. Callers must handle the ``None`` error signal.
    """
    try:
        images = []
        for image_path in filenames:
            try:
                # BUG FIX: PIL already decodes to RGB, so the previous
                # cv2.COLOR_BGR2RGB round-trip swapped the channels and fed
                # BGR data to BLIP. convert("RGB") normalizes palette/CMYK/
                # grayscale inputs without reordering channels.
                images.append(Image.open(image_path).convert("RGB"))
            except OSError as err:
                # Image.open raises (it never returns None) on non-image or
                # unreadable files; skip the file instead of aborting the batch.
                print('Non image ' + image_path + ': ' + str(err))
        if not images:
            return []
        inputs = processor(images, return_tensors="pt")
        out = model.generate(**inputs)
        return [processor.decode(tokens, skip_special_tokens=True) for tokens in out]
    except Exception as e:
        # Best-effort captioning: report the failure and signal it with None.
        print(e)
        # fastdup_capture_exception("Auto caption image blip", e)
        return None

# Next, we will load in a LAION image dataset and generate captions for the images.
# In[29]:

# Collect every .jpg in the LAION sample directory (IPython shell capture).
files = get_ipython().getoutput("find laion_10K/ -name '*.jpg'")

# In[30]:

from tqdm import tqdm

images = []
out = []
BATCH = 10
# Caption the images in fixed-size batches. Note: integer division drops a
# trailing partial batch, matching the original notebook's behavior.
for i in tqdm(range(len(files) // BATCH)):
    curfiles = files[i * BATCH:(i + 1) * BATCH]
    curout = generate_blip_labels(curfiles, {})
    # BUG FIX: generate_blip_labels returns None when a batch fails;
    # out.extend(None) would raise TypeError. Skip the failed batch instead.
    if curout is None:
        continue
    images.extend(curfiles)
    out.extend(curout)

import pandas as pd
import fastdup

df = pd.DataFrame({'from': images, 'to': images, 'label': out})
df.to_csv('all_labels')

# In[36]:

df.head()

# ## Install fastdup
#
# Next, install fastdup and verify the installation.

# In[ ]:

get_ipython().system('pip install -Uq fastdup')

# Now, test the installation. If there's no error message, we are ready to go.

# In[1]:

import fastdup
fastdup.__version__

# # Run fastdup to cluster images
# To run fastdup, simply point `input_dir` to the folder containing images from the dataset.

# In[40]:

fd = fastdup.create(input_dir='laion_10k', work_dir='laion_10k_out')
fd.run(overwrite=True)

# In[47]:

comps = fastdup.find_top_components('laion_10k_out')

# In[52]:

# Normalize doubled path separators so paths match fastdup's output.
df['from'] = df['from'].apply(lambda x: x.replace('//', '/'))
df.head()

# In[54]:

# Map each image path to its BLIP caption.
# BUG FIX: the original indexed the undefined name `df2` (NameError); the
# captions live in `df`.
label_dict = pd.Series(df.label.values, index=df['from']).to_dict()

# In[65]:

# fastdup lower-cased the directory name; restore the on-disk 'laion_10K'
# spelling before looking each file's caption up.
comps['label'] = comps['files'].apply(
    lambda x: [label_dict.get(y.replace('laion_10k', 'laion_10K'), 'N/A') for y in x]
)

# In[66]:

comps.head()

# In[79]:

from IPython.display import HTML

# BUG FIX: label_col must name the column holding the captions ('label',
# created above), not the stray literal 'baby'.
fd.vis.component_gallery(external_df=comps, label_col='label')

# # Wrap Up
#
# Next, feel free to check out other tutorials -
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have a labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
#
# ## VL Profiler
# If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com) - VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser.
#
# [Sign up](https://app.visual-layer.com) now, it's free.
#
# [![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)
#
# As usual, feedback is welcome!
#
# Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).
#