Optical Character Recognition¶

This notebook shows how to use fastdup to extract text from a video or image dataset with optical character recognition (OCR).

We will use fastdup together with PaddleOCR, a lightweight, multilingual OCR package.

In [1]:

!pip install -Uqq fastdup paddleocr paddlepaddle gdown numpy

In [2]:

import fastdup
fastdup.__version__

/usr/bin/dpkg

Out[2]:

'1.34'

If you're running this notebook in Google Colab, uncomment the following cell and run it before proceeding to the next cell.

In [ ]:

# !wget http://nz2.archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2.19_amd64.deb
# !sudo dpkg -i libssl1.1_1.1.1f-1ubuntu2.19_amd64.deb
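
To decide programmatically whether the cell above applies, one common heuristic is to check whether the `google.colab` package has been loaded into the running interpreter. The sketch below assumes that heuristic holds for your runtime; `running_in_colab` is a hypothetical helper name and is not part of fastdup.

```python
import sys


def running_in_colab() -> bool:
    """Heuristic check: the Colab runtime is assumed to preload `google.colab`."""
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: uncomment and run the libssl cell above first.")
else:
    print("Not running in Colab: the libssl cell can stay commented out.")
```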

Download Video Dataset¶

We'll use a subset of the TikTok Trending Videos dataset from Kaggle.

In [3]:

!gdown https://drive.google.com/uc?id=1FsBQTZKsfApEn99g_BhkTbdz0SmQgkKv

Downloading...
From (uriginal): https://drive.google.com/uc?id=1FsBQTZKsfApEn99g_BhkTbdz0SmQgkKv
From (redirected): https://drive.google.com/uc?id=1FsBQTZKsfApEn99g_BhkTbdz0SmQgkKv&confirm=t&uuid=2e6206bc-cbe1-4642-a0f9-f849a42c4b6e
To: /media/dnth/Active-Projects/fastdup/examples/tiktok-trending-subset.zip
100%|██████████████████████████████████████| 61.3M/61.3M [00:06<00:00, 9.67MB/s]

Now, unzip the dataset into the local directory. You'll find a folder named tiktok-trending-subset that contains all the trending clips.

In [4]:

!unzip -q tiktok-trending-subset.zip

Extract frames¶

To run fastdup on videos, we first need to extract frames from the clips in the tiktok-trending-subset folder and store them in another folder, which we'll name frames. fastdup provides a convenience function for that.

In [5]:

fastdup.extract_video_frames('tiktok-trending-subset', 'frames')

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-08-09 16:17:49 [INFO] Going to loop over dir tiktok-trending-subset
2023-08-09 16:17:49 [INFO] Found total 19 videos to run on, 19 train, 0 test, name list 19, counter 19

Out[5]:
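
As a quick sanity check after extraction, it can help to count how many frames were written for each video. The sketch below assumes the layout that appears in the file paths later in this notebook, where every video gets its own subfolder of .jpg frames under frames; `count_frames_per_video` is a hypothetical helper, not a fastdup function.

```python
from collections import Counter
from pathlib import Path


def count_frames_per_video(frames_dir: str) -> Counter:
    """Count .jpg files under frames_dir, keyed by the name of their parent folder."""
    counts = Counter()
    for image_path in Path(frames_dir).rglob("*.jpg"):
        # Assumption: the parent folder name identifies the source video
        counts[image_path.parent.name] += 1
    return counts


if __name__ == "__main__":
    for video, n_frames in sorted(count_frames_per_video("frames").items()):
        print(f"{video}: {n_frames} frame(s)")
```

The sum of the counts should match the number of images fastdup reports when it scans the frames folder.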

Run fastdup¶

With the extracted frames, we can run fastdup to analyze them.

To use the optical character recognition feature, specify bounding_box='ocr' in the run method.

The run method also accepts a num_images parameter that limits the run to a given number of images (for example, num_images=1000). This subset yields only 54 frames, so we omit the parameter here and run on the entire dataset.

In [6]:

fd = fastdup.create(input_dir='./frames')
fd.run(bounding_box='ocr')

Warning: fastdup create() without work_dir argument, output is stored in a folder named work_dir in your current working path.
FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-08-09 16:17:52 [INFO] Going to loop over dir frames
2023-08-09 16:17:52 [INFO] Found total 54 images to run on, 54 train, 0 test, name list 54, counter 54 
FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-08-09 16:18:45 [INFO] Going to loop over dir /tmp/crops_input.csv
2023-08-09 16:18:45 [INFO] Found total 259 images to run on, 259 train, 0 test, name list 259, counter 259 
2023-08-09 16:18:46 [INFO] Found total 259 images to run onstimated: 0 Minutes
Finished histogram 0.280
Finished bucket sort 0.294
2023-08-09 16:18:46 [INFO] 31) Finished write_index() NN model
2023-08-09 16:18:46 [INFO] Stored nn model index file work_dir/nnf.index
2023-08-09 16:18:46 [INFO] Total time took 1043 ms
2023-08-09 16:18:46 [INFO] Found a total of 83 fully identical images (d>0.990), which are 16.02 %
2023-08-09 16:18:46 [INFO] Found a total of 57 nearly identical images(d>0.980), which are 11.00 %
2023-08-09 16:18:46 [INFO] Found a total of 394 above threshold images (d>0.900), which are 76.06 %
2023-08-09 16:18:46 [INFO] Found a total of 25 outlier images         (d<0.050), which are 4.83 %
2023-08-09 16:18:46 [INFO] Min distance found 0.448 max distance 0.999
2023-08-09 16:18:46 [INFO] Running connected components for ccthreshold 0.960000 
.0
 ########################################################################################

Dataset Analysis Summary: 

    Dataset contains 259 objects
    Valid objects are 88.00% (259) of the data, invalid are 0.00% (0) of the data
    Similarity:  48.25% (142) belong to 11 similarity clusters (components).
    39.75% (117) images do not belong to any similarity cluster.
    Largest cluster has 58 (19.71%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.96).

    Outliers: 5.10% (15) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.

########################################################################################
Would you like to see awesome visualizations for some of the most popular academic datasets?
Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup
########################################################################################

Out[6]:

Extracted OCR Information¶

In [7]:

stats = fd.img_stats()
stats.head()

Out[7]:

	index	img_w	img_h	unique	blur	mean	min	max	stdv	file_size	contrast	filename	crop_filename	x1	y1	x2	y2	x3	y3	x4	y4	confidence	label	is_valid
0	0	24	14	183	345.0356	178.0270	51.0	255.0	57.0515	916	0.6667	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg	work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_436_953_456_953_456_965_436_965.jpg	436	953	456	953	456	965	436	965	0.662398	1	True
1	1	112	36	195	688.3301	180.5150	41.0	255.0	47.7177	2614	0.7230	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg	work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_466_945_560_945_560_975_466_975.jpg	466	945	560	945	560	975	466	975	0.785349	TikTok	True
2	2	108	28	186	739.3006	164.1137	70.0	255.0	40.6979	2074	0.5692	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg	work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_472_983_562_986_562_1008_471_1006.jpg	472	983	562	986	562	1008	471	1006	0.989322	@caitlinjs	True
3	3	106	35	159	208.8422	222.4114	48.0	255.0	22.7722	2404	0.6832	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg	work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_61_26_149_26_149_55_61_55.jpg	61	26	149	26	149	55	61	55	0.829294	TikTOK	True
4	4	159	25	95	360.8476	215.8326	0.0	255.0	29.6220	1811	1.0000	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg	work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_12_66_146_66_146_87_12_87.jpg	12	66	146	66	146	87	12	87	0.990695	@marc.koolen	True

That's a lot of information. Let's slice the table to only include the columns we need.

In [8]:

# Take a copy so that adding preview columns later does not modify a slice of `stats`
df = stats[["filename", "crop_filename", "label"]].copy()
df

Out[8]:

	filename	crop_filename	label
0	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg	work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_436_953_456_953_456_965_436_965.jpg	1
1	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg	work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_466_945_560_945_560_975_466_975.jpg	TikTok
2	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg	work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_472_983_562_986_562_1008_471_1006.jpg	@caitlinjs
3	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg	work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_61_26_149_26_149_55_61_55.jpg	TikTOK
4	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg	work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_12_66_146_66_146_87_12_87.jpg	@marc.koolen
...	...	...	...
254	frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg	work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_78_604_308_604_308_622_78_622.jpg	haveadecentamountof
255	frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg	work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_78_628_317_628_317_650_78_650.jpg	bruisingandpainwithmy
256	frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg	work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_77_654_168_654_168_672_77_672.jpg	newlever
257	frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg	work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_433_946_561_946_561_976_433_976.jpg	JTikTok
258	frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg	work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_376_984_564_986_564_1008_376_1007.jpg	@timmytimmadome

259 rows × 3 columns
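
Each crop_filename appears to encode the source video, the frame number, and the four corner coordinates of the detected text box. The sketch below parses those parts back out of a path. The regular expression is an assumption inferred from the sample paths in the table above, not a documented fastdup naming scheme, and `CropInfo` and `parse_crop_filename` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class CropInfo(NamedTuple):
    video_id: str   # digits of the source video file name
    frame: int      # frame number taken from output_<frame>.jpg
    corners: list   # four (x, y) corner points of the text box


# Assumed pattern, inferred from the sample crop paths shown in the table
CROP_PATTERN = re.compile(
    r"subset(?P<video>\d+)\.mp4output_(?P<frame>\d+)\.jpg_"
    r"(?P<coords>\d+(?:_\d+){7})\.jpg$"
)


def parse_crop_filename(path: str) -> Optional[CropInfo]:
    """Return the parsed parts of a crop path, or None if it does not match."""
    match = CROP_PATTERN.search(path)
    if match is None:
        return None
    values = [int(v) for v in match.group("coords").split("_")]
    # Pair up x1, y1, ..., x4, y4 into (x, y) tuples
    corners = list(zip(values[0::2], values[1::2]))
    return CropInfo(match.group("video"), int(match.group("frame")), corners)


example = (
    "work_dir/crops/framestmptiktok-trending-subset6875373441432816898"
    ".mp4output_000002.jpg_436_953_456_953_456_965_436_965.jpg"
)
info = parse_crop_filename(example)
print(info)
```

The same pattern also matches the shorter /crops/... paths shown in the gallery reports, which makes it possible to group detections by video or by frame.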

Let's use itables, a handy package for rendering DataFrames as interactive tables, to preview the images alongside their labels.

In [9]:

!pip install -Uqq itables

Now, we run the cell below to convert the image paths into HTML <img> tags so we can preview the images in the table.

In [10]:

import base64
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)

# Convert an image file path into an inline HTML <img> tag for preview
def to_img_tag(path):
    if isinstance(path, str):
        with open(path, 'rb') as f:
            image_data = f.read()
        base64_image = base64.b64encode(image_data).decode('utf-8')
        # The frames and crops are JPEG files, so use the matching MIME type
        return '<img src="data:image/jpeg;base64,' + base64_image + '" width="150" >'
    else:
        # Leave non-string values (e.g. NaN) untouched
        return path

df["filename_preview"] = df["filename"].apply(to_img_tag)
df["crop_filename_preview"] = df["crop_filename"].apply(to_img_tag)

# Reorder the columns so each preview sits next to its path
df = df.loc[:, ["filename", "filename_preview", "crop_filename", "crop_filename_preview", "label"]]
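
The helper above hard-codes the MIME type in the data URI. If a dataset mixes image formats, one option is to guess the type from the file extension with the standard library's mimetypes module. This is a sketch under that assumption; `to_img_tag_auto` is a hypothetical variant and is not part of fastdup or itables.

```python
import base64
import mimetypes
from pathlib import Path


def to_img_tag_auto(path, width=150):
    """Return an inline <img> tag for an image path; pass non-strings through."""
    if not isinstance(path, str):
        return path
    # Guess the MIME type from the file extension, e.g. .jpg -> image/jpeg
    mime_type, _ = mimetypes.guess_type(Path(path).name)
    if mime_type is None:
        mime_type = "application/octet-stream"  # fallback for unknown extensions
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f'<img src="data:{mime_type};base64,{encoded}" width="{width}">'
```

It can be applied to a column in the same way as the original helper, for example `df["filename"].apply(to_img_tag_auto)`.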

In [11]:

show(df, classes="display compact")

filename	filename_preview	crop_filename	crop_filename_preview	label

Duplicate/Near-duplicate Detections¶

In [12]:

fd.vis.duplicates_gallery()

100%|█| 20/20 [00:00<00:00, 468.54it

Stored similarity visual view in  work_dir/galleries/duplicates.html

Duplicates Report

Info
Distance	0.999353
From	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000004.jpg_79_605_308_605_308_623_79_623.jpg
To	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000003.jpg_79_605_308_605_308_623_79_623.jpg
From_Label	haveadecentamountof
To_Label	haveadecentamountof

Info
Distance	0.998665
From	/crops/tmptiktok-trending-subset6875436892226178305.mp4output_000004.jpg_153_239_424_242_423_272_152_269.jpg
To	/crops/tmptiktok-trending-subset6875436892226178305.mp4output_000002.jpg_153_239_424_242_423_272_152_269.jpg
From_Label	Someofthemost
To_Label	Someofthemost

Info
Distance	0.998448
From	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000004.jpg_77_185_316_186_315_205_77_204.jpg
To	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000002.jpg_77_185_316_186_315_205_77_204.jpg
From_Label	feelheavierthanthenom
To_Label	feelheavierthanthenom

Info
Distance	0.998334
From	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000005.jpg_78_531_277_531_277_546_78_546.jpg
To	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000002.jpg_78_531_277_531_277_546_78_546.jpg
From_Label	Reply to dolphinarmss comment
To_Label	Reply to dolphinarmss comment

Info
Distance	0.998313
From	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000005.jpg_78_137_271_137_271_151_78_151.jpg
To	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000002.jpg_78_137_271_137_271_151_78_151.jpg
From_Label	Replytojessca4765scomment
To_Label	Replytojessca4765scomment

Info
Distance	0.998252
From	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000001.jpg_77_185_316_186_315_205_77_204.jpg
To	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000006.jpg_77_185_316_186_315_205_77_204.jpg
From_Label	feelheavierthanthenom
To_Label	feelheavierthanthenom

Info
Distance	0.998226
From	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000005.jpg_77_627_317_628_317_651_77_650.jpg
To	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000001.jpg_77_627_317_628_317_651_77_650.jpg
From_Label	bruisingandpainwithmy
To_Label	bruisingandpainwithmy

Info
Distance	0.998044
From	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000005.jpg_79_605_309_605_309_623_79_623.jpg
To	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000002.jpg_79_605_309_605_309_623_79_623.jpg
From_Label	haveadecentamountof
To_Label	haveadecentamountof

Info
Distance	0.997594
From	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000004.jpg_78_531_277_531_277_546_78_546.jpg
To	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000003.jpg_78_531_277_531_277_546_78_546.jpg
From_Label	Reply to dolphinarmss comment
To_Label	Reply to dolphinarmss comment

Info
Distance	0.99756
From	/crops/tmptiktok-trending-subset6875436892226178305.mp4output_000003.jpg_153_239_424_242_423_272_152_269.jpg
To	/crops/tmptiktok-trending-subset6875436892226178305.mp4output_000004.jpg_153_239_424_242_423_272_152_269.jpg
From_Label	Someofthemost
To_Label	Someofthemost

Out[12]:

Outliers¶

Let's visualize the outliers in the OCR detections.

In [13]:

fd.vis.outliers_gallery(load_crops=True)

100%|█| 20/20 [00:00<00:00, 27603.19

Stored outliers visual view in  work_dir/galleries/outliers.html

Outliers Report

Showing image outliers, one per row

Info
Distance	0.900267
Path	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_162_662_384_662_384_723_162_723.jpg
label	HEAL TOE

Info
Distance	0.901725
Path	/crops/tmptiktok-trending-subset6875651291343883522.mp4output_000002.jpg_457_943_561_945_561_976_456_974.jpg
label	TikTok

Info
Distance	0.901787
Path	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000002.jpg_464_940_559_940_559_975_464_975.jpg
label	TiKTOK

Info
Distance	0.902096
Path	/crops/tmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_12_66_146_66_146_87_12_87.jpg
label	@marc.koolen

Info
Distance	0.902096
Path	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_13_67_119_69_118_89_13_86.jpg
label	@catocade

Info
Distance	0.905003
Path	/crops/tmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_61_26_149_26_149_55_61_55.jpg
label	TikTOK

Info
Distance	0.905566
Path	/crops/tmptiktok-trending-subset6875651291343883522.mp4output_000003.jpg_399_984_562_987_562_1009_399_1007.jpg
label	@martijn.schilder

Info
Distance	0.906044
Path	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_384_670_470_668_470_690_385_692.jpg
label	Thankyoul

Info
Distance	0.906044
Path	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_344_636_507_632_507_655_344_658.jpg
label	studying aroundyou.

Info
Distance	0.906131
Path	/crops/tmptiktok-trending-subset6875317312082201857.mp4output_000001.jpg_13_68_203_68_203_86_13_86.jpg
label	@sandrovanmunster

Info
Distance	0.907005
Path	/crops/tmptiktok-trending-subset6875342937002085633.mp4output_000001.jpg_12_66_122_66_122_87_12_87.jpg
label	@irmaknol1

Info
Distance	0.90798
Path	/crops/tmptiktok-trending-subset6875528457388903681.mp4output_000003.jpg_893_538_1011_538_1011_559_893_559.jpg
label	@ashena501

Info
Distance	0.908586
Path	/crops/tmptiktok-trending-subset6875621663564680450.mp4output_000002.jpg_121_195_439_195_439_221_121_221.jpg
label	whenwefirststartedvs

Info
Distance	0.908878
Path	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_433_946_561_946_561_976_433_976.jpg
label	JTikTok

Info
Distance	0.90898
Path	/crops/tmptiktok-trending-subset6875749962681044230.mp4output_000002.jpg_347_262_563_210_575_267_361_318.jpg
label	HEYNOOK

Info
Distance	0.90898
Path	/crops/tmptiktok-trending-subset6875749962681044230.mp4output_000001.jpg_346_259_565_210_575_267_359_316.jpg
label	HEYNOOK

Info
Distance	0.909159
Path	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000005.jpg_457_945_561_945_561_975_457_975.jpg
label	TikTok

Info
Distance	0.91165
Path	/crops/tmptiktok-trending-subset6875639469563759873.mp4output_000003.jpg_454_987_563_987_563_1008_454_1008.jpg
label	@justinvanr

Info
Distance	0.911661
Path	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000002.jpg_162_852_381_852_381_913_162_913.jpg
label	TUTORIAL

Info
Distance	0.911806
Path	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_59_26_149_26_149_55_59_55.jpg
label	TikTOK

Out[13]:

Blurry Detections¶

In [15]:

fd.vis.stats_gallery(metric='blur', load_crops=True)

100%|█| 20/20 [00:00<00:00, 216.36it

Stored blur visual view in  work_dir/galleries/blur.html

Blurry Image Report

Showing example images, sort by ascending order

Info
blur	345.0356
filename	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg
label	N/A

Info
blur	688.3301
filename	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg
label	N/A

Info
blur	739.3006
filename	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg
label	N/A

Info
blur	208.8422
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg
label	N/A

Info
blur	360.8476
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg
label	N/A

Info
blur	1321.6235
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg
label	N/A

Info
blur	737.07
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg
label	N/A

Info
blur	364.9784
filename	frames/tmp/tiktok-trending-subset6875342937002085633.mp4/output_000001.jpg
label	N/A

Info
blur	486.7106
filename	frames/tmp/tiktok-trending-subset6875342937002085633.mp4/output_000001.jpg
label	N/A

Info
blur	1193.521
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000003.jpg
label	N/A

Info
blur	895.0907
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000003.jpg
label	N/A

Info
blur	662.8271
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000003.jpg
label	N/A

Info
blur	1708.792
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000003.jpg
label	N/A

Info
blur	3642.5037
filename	frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000003.jpg
label	N/A

Info
blur	253.2622
filename	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000001.jpg
label	N/A

Info
blur	139.6932
filename	frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000001.jpg
label	N/A

Info
blur	1860.1863
filename	frames/tmp/tiktok-trending-subset6875405441472498949.mp4/output_000004.jpg
label	N/A

Info
blur	8614.8867
filename	frames/tmp/tiktok-trending-subset6875405441472498949.mp4/output_000004.jpg
label	N/A

Info
blur	8035.5918
filename	frames/tmp/tiktok-trending-subset6875405441472498949.mp4/output_000004.jpg
label	N/A

Info
blur	3628.0278
filename	frames/tmp/tiktok-trending-subset6875405441472498949.mp4/output_000004.jpg
label	N/A

Out[15]:

Detection Clusters¶

In [16]:

fd.vis.component_gallery()

TikTok

100%|█| 20/20 [00:00<00:00, 713.79it

Finished OK. Components are stored as image files work_dir/galleries/components_[index].jpg
Stored components visual view in  work_dir/galleries/components.html
Execution time in seconds 0.1


Components Report

Showing groups of similar images

Info
component	18
num_images	15
mean_distance	0.9615

Label
feelheavierthanthenom	7
haveadecentamountof	6
wouldyouratherliveInthe	2

Info
component	44
num_images	11
mean_distance	0.963

Label
duringsquatsanddeadlift?I	6
badasscharacterspt29	3
Someonesgetting that	1
badass characterspt29	1

Info
component	17
num_images	7
mean_distance	0.9659

Label
whydoesthemetalplates	7

Info
component	124
num_images	7
mean_distance	0.9626

Label
Reply to dolphinarmss comment	6
Replytot.and.dscomment	1

Info
component	16
num_images	7
mean_distance	0.9828

Label
Replytojessca4765scomment	6
Reply tojessca4765scomment	1

Info
component	129
num_images	6
mean_distance	0.9624

Label
Flat back benching is like	2
locking out your knees	2
Stupid and dangerous.	1
Stupidanddangerous.	1

Info
component	146
num_images	6
mean_distance	0.9696

Label
newlever	6

Info
component	144
num_images	6
mean_distance	0.9785

Label
Wheredoyouputyourbelt	3
OWhere doyouputyour belt	1
Owhere doyouputyourbelt	1
Where doyouputyourbelt	1

Info
component	19
num_images	5
mean_distance	0.9611

Label
metalplates?	5

Info
component	148
num_images	5
mean_distance	0.9602

Label
@timmytimmadome	5

Info
component	71
num_images	4
mean_distance	0.9778

Label
Someofthemost	4

Info
component	72
num_images	4
mean_distance	0.9685

Label
Chikara Ennoshita	2
ChikaraEnnoshita	2

Info
component	149
num_images	3
mean_distance	0.9914

Label
bruisingandpainwithmy	3

Info
component	77
num_images	3
mean_distance	0.9744

Label
@buutterrr	3

Info
component	73
num_images	3
mean_distance	0.9873

Label
TikTok	3

Info
component	110
num_images	3
mean_distance	0.9611

Label
TikTok	3

Info
component	145
num_images	3
mean_distance	0.9814

Label
bruisingandpainwithmy	3

Info
component	147
num_images	3
mean_distance	0.9661

Label
TikTok	3

Info
component	33
num_images	3
mean_distance	0.96

Label
TikTok	3

Info
component	130
num_images	2
mean_distance	0.9718

Label
during a leg press.	2

Out[16]:
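
Several clusters in the report above contain labels that differ only in spacing or letter case, such as "Stupid and dangerous." and "Stupidanddangerous.". A simple way to surface these OCR variants is to compare labels after normalizing them. The sketch below is one possible normalization (drop whitespace, ignore case), chosen here for illustration; `normalize_label` and `group_label_variants` are hypothetical helpers.

```python
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Drop all whitespace and ignore case so spacing/case variants collide."""
    return "".join(label.split()).casefold()


def group_label_variants(labels):
    """Map each normalized label to the sorted raw spellings seen for it."""
    groups = defaultdict(set)
    for label in labels:
        groups[normalize_label(label)].add(label)
    return {key: sorted(variants) for key, variants in groups.items()}


# Labels taken from the reports in this notebook
labels = [
    "Stupid and dangerous.", "Stupidanddangerous.",
    "Chikara Ennoshita", "ChikaraEnnoshita",
    "TikTok", "TikTOK", "TiKTOK",
]
for key, variants in group_label_variants(labels).items():
    print(key, "<-", variants)
```

Groups with more than one raw spelling are candidates for merging before counting or searching the extracted text. This normalization does not catch misread characters such as "JTikTok".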

Similarity Gallery¶

We can compare each detection with its two nearest neighbors to see if there are discrepancies in the labels.

In [17]:

fd.vis.similarity_gallery()

Warning: you are running create_similarity_gallery() without providing get_label_func so similarities are not computed between different classes. It is recommended to run this report with labels. Without labels this report output is similar to create_duplicate_gallery()

100%|█| 20/20 [00:00<00:00, 322.32it

Stored similar images visual view in  work_dir/galleries/similarity.html


Similarity Report

Info From
label	HEAL TOE
from	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_162_662_384_662_384_723_162_723.jpg

Info To
0.964309	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000002.jpg_164_663_385_663_385_723_164_723.jpg	HEALTOE
0.900267	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_162_853_381_853_381_913_162_913.jpg	TUTORIAL

Query Image

Similar

Info From
label	TikTok
from	/crops/tmptiktok-trending-subset6875651291343883522.mp4output_000002.jpg_457_943_561_945_561_976_456_974.jpg

Info To
0.906236	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_433_946_561_946_561_976_433_976.jpg	JTikTok
0.901725	/crops/tmptiktok-trending-subset6875317312082201857.mp4output_000001.jpg_56_25_151_25_151_55_56_55.jpg	TikTok

Query Image

Similar

Info From
label	TiKTOK
from	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000002.jpg_464_940_559_940_559_975_464_975.jpg

Info To
0.924796	/crops/tmptiktok-trending-subset6875373441432816898.mp4output_000001.jpg_59_26_151_26_151_55_59_55.jpg	TikToK
0.901787	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_59_25_150_25_150_55_59_55.jpg	TikTok

Query Image

Similar

Info From
label	@marc.koolen
from	/crops/tmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_12_66_146_66_146_87_12_87.jpg

Info To
0.902096	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_13_67_119_69_118_89_13_86.jpg	@catocade

Query Image

Similar

Info From
label	@catocade
from	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_13_67_119_69_118_89_13_86.jpg

Info To
0.925524	/crops/tmptiktok-trending-subset6875373441432816898.mp4output_000001.jpg_11_63_105_66_104_90_10_86.jpg	@caitlinjs
0.902096	/crops/tmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_12_66_146_66_146_87_12_87.jpg	@marc.koolen

Query Image

Similar

Info From
label	TikTOK
from	/crops/tmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_61_26_149_26_149_55_61_55.jpg

Info To
0.925996	/crops/tmptiktok-trending-subset6875373441432816898.mp4output_000001.jpg_59_26_151_26_151_55_59_55.jpg	TikToK
0.905003	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_59_25_150_25_150_55_59_55.jpg	TikTok

Query Image

Similar

Info From
label	@martijn.schilder
from	/crops/tmptiktok-trending-subset6875651291343883522.mp4output_000003.jpg_399_984_562_987_562_1009_399_1007.jpg

Info To
0.922588	/crops/tmptiktok-trending-subset6875749962681044230.mp4output_000002.jpg_425_1044_563_1046_562_1066_425_1064.jpg	@reece.lowery
0.905566	/crops/tmptiktok-trending-subset6875317312082201857.mp4output_000001.jpg_13_68_203_68_203_86_13_86.jpg	@sandrovanmunster

Query Image

Similar

Info From
label	Thankyoul
from	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_384_670_470_668_470_690_385_692.jpg

Info To
0.906044	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_344_636_507_632_507_655_344_658.jpg	studying aroundyou.

Query Image

Similar

Info From
label	studying aroundyou.
from	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_344_636_507_632_507_655_344_658.jpg

Info To
0.935684	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_347_605_503_599_503_618_348_623.jpg	Please respect those
0.906044	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_384_670_470_668_470_690_385_692.jpg	Thankyoul

Query Image

Similar

Info From
label	@sandrovanmunster
from	/crops/tmptiktok-trending-subset6875317312082201857.mp4output_000001.jpg_13_68_203_68_203_86_13_86.jpg

Info To
0.919321	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000004.jpg_377_987_564_987_564_1008_377_1008.jpg	@timmytimmadome
0.906131	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000003.jpg_376_986_564_988_564_1009_376_1007.jpg	@timmytimmadome

Query Image

Similar

Info From
label	@irmaknol1
from	/crops/tmptiktok-trending-subset6875342937002085633.mp4output_000001.jpg_12_66_122_66_122_87_12_87.jpg

Info To
0.91165	/crops/tmptiktok-trending-subset6875639469563759873.mp4output_000003.jpg_454_987_563_987_563_1008_454_1008.jpg	@justinvanr
0.907005	/crops/tmptiktok-trending-subset6875639469563759873.mp4output_000002.jpg_454_987_563_987_563_1008_454_1008.jpg	@justinvanr

Query Image

Similar

Info From
label	@ashena501
from	/crops/tmptiktok-trending-subset6875528457388903681.mp4output_000003.jpg_893_538_1011_538_1011_559_893_559.jpg

Info To
0.975834	/crops/tmptiktok-trending-subset6875528457388903681.mp4output_000002.jpg_894_538_1011_538_1011_559_894_559.jpg	@ashena501
0.90798	/crops/tmptiktok-trending-subset6875528457388903681.mp4output_000004.jpg_894_538_1013_538_1013_559_894_559.jpg	@ashena501

Query Image

Similar

Info From
label	whenwefirststartedvs
from	/crops/tmptiktok-trending-subset6875621663564680450.mp4output_000002.jpg_121_195_439_195_439_221_121_221.jpg

Info To
0.988705	/crops/tmptiktok-trending-subset6875621663564680450.mp4output_000001.jpg_121_195_439_195_439_221_121_221.jpg	whenwefirststartedvs
0.908586	/crops/tmptiktok-trending-subset6875621663564680450.mp4output_000002.jpg_14_68_199_68_199_86_14_86.jpg	@elirose.equestrian

Query Image

Similar

Info From
label	JTikTok
from	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_433_946_561_946_561_976_433_976.jpg

Info To
0.92183	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000003.jpg_467_946_561_946_561_976_467_976.jpg	TikTok
0.908878	/crops/tmptiktok-trending-subset6875872124968439046.mp4output_000004.jpg_467_946_561_946_561_976_467_976.jpg	TikTok

Query Image

Similar

Info From
label	HEYNOOK
from	/crops/tmptiktok-trending-subset6875749962681044230.mp4output_000001.jpg_346_259_565_210_575_267_359_316.jpg

Info To
0.90898	/crops/tmptiktok-trending-subset6875749962681044230.mp4output_000002.jpg_347_262_563_210_575_267_361_318.jpg	HEYNOOK

Query Image

Similar

Info From
label	HEYNOOK
from	/crops/tmptiktok-trending-subset6875749962681044230.mp4output_000002.jpg_347_262_563_210_575_267_361_318.jpg

Info To
0.90898	/crops/tmptiktok-trending-subset6875749962681044230.mp4output_000001.jpg_346_259_565_210_575_267_359_316.jpg	HEYNOOK

Query Image

Similar

Info From
label	TikTok
from	/crops/tmptiktok-trending-subset6875405441472498949.mp4output_000005.jpg_457_945_561_945_561_975_457_975.jpg

Info To
0.909159	/crops/tmptiktok-trending-subset6875323773755657474.mp4output_000002.jpg_59_25_151_25_151_55_59_55.jpg	TikTok

Query Image

Similar

Info From
label	@justinvanr
from	/crops/tmptiktok-trending-subset6875639469563759873.mp4output_000003.jpg_454_987_563_987_563_1008_454_1008.jpg

Info To
0.969212	/crops/tmptiktok-trending-subset6875639469563759873.mp4output_000002.jpg_454_987_563_987_563_1008_454_1008.jpg	@justinvanr
0.91165	/crops/tmptiktok-trending-subset6875342937002085633.mp4output_000001.jpg_12_66_122_66_122_87_12_87.jpg	@irmaknol1

Query Image

Similar

Info From
label	TUTORIAL
from	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000002.jpg_162_852_381_852_381_913_162_913.jpg

Info To
0.985895	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000001.jpg_162_853_381_853_381_913_162_913.jpg	TUTORIAL
0.911661	/crops/tmptiktok-trending-subset6875453919879908614.mp4output_000002.jpg_106_762_438_762_438_816_106_816.jpg	SHUFFLE STEP

Query Image

Similar

Info From
label	TikTOK
from	/crops/tmptiktok-trending-subset6875468410612993286.mp4output_000001.jpg_59_26_149_26_149_55_59_55.jpg

Info To
0.911806	/crops/tmptiktok-trending-subset6875373441432816898.mp4output_000001.jpg_59_26_151_26_151_55_59_55.jpg	TikToK

Query Image

Similar

Out[17]:

	from	to	label	label2	distance

Wrap Up¶

In this notebook, we showed how to run OCR with fastdup and analyze the detected text bounding boxes for issues.

Next, feel free to check out other tutorials:

⚡ Quickstart: Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
🧹 Clean Image Folder: Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
🖼 Analyze Image Classification Dataset: Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!
🎁 Analyze Object Detection Dataset: Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.

VL Profiler¶

If you prefer a no-code platform to inspect and visualize your dataset, try VL Profiler, our first no-code commercial product. It is free to use and lets you visualize and inspect your dataset in your browser.

Sign up now, it's free.

As usual, feedback is welcome!

Questions? Drop by our Slack channel or open an issue on GitHub.
