!pip install -Uqq fastdup paddleocr paddlepaddle gdown numpy
import fastdup
fastdup.__version__
'1.34'
If you're running this notebook in Google Colab, uncomment the following cell and run it before proceeding to the next cell.
# !wget http://nz2.archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2.19_amd64.deb
# !sudo dpkg -i libssl1.1_1.1.1f-1ubuntu2.19_amd64.deb
We'll use a subset of the TikTok Trending Videos dataset from Kaggle.
!gdown https://drive.google.com/uc?id=1FsBQTZKsfApEn99g_BhkTbdz0SmQgkKv
Downloading... From (uriginal): https://drive.google.com/uc?id=1FsBQTZKsfApEn99g_BhkTbdz0SmQgkKv From (redirected): https://drive.google.com/uc?id=1FsBQTZKsfApEn99g_BhkTbdz0SmQgkKv&confirm=t&uuid=2e6206bc-cbe1-4642-a0f9-f849a42c4b6e To: /media/dnth/Active-Projects/fastdup/examples/tiktok-trending-subset.zip 100%|██████████████████████████████████████| 61.3M/61.3M [00:06<00:00, 9.67MB/s]
Now, unzip the dataset into the local directory. You'll find a folder named tiktok-trending-subset that contains all the trending clips.
!unzip -q tiktok-trending-subset.zip
To run fastdup, we first need to extract frames from the clips in the tiktok-trending-subset folder and store them in another folder, which we'll name frames. fastdup provides a convenience function for that.
fastdup.extract_video_frames('tiktok-trending-subset', 'frames')
FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson. 2023-08-09 16:17:49 [INFO] Going to loop over dir tiktok-trending-subset 2023-08-09 16:17:49 [INFO] Found total 19 videos to run on, 19 train, 0 test, name list 19, counter 19
0
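Before running the analysis, it can help to check how many frames were extracted for each clip. The sketch below assumes the layout that appears later in this notebook's output, where each video gets its own sub-folder of .jpg frames under frames. The helper name count_frames_per_video is hypothetical and not part of fastdup.

```python
from collections import Counter
from pathlib import Path


def count_frames_per_video(frames_dir):
    """Count .jpg frames under frames_dir, grouped by their parent folder.

    Assumes one sub-folder per video (an assumption based on the file
    names shown in this notebook, not a documented fastdup guarantee).
    """
    root = Path(frames_dir)
    counts = Counter()
    for frame in root.rglob("*.jpg"):
        # Key each frame by its folder, relative to the frames root
        counts[str(frame.parent.relative_to(root))] += 1
    return counts


# Example usage in the notebook:
# for video, n in sorted(count_frames_per_video("frames").items()):
#     print(video, n)
```

A clip with very few frames will contribute little text to the OCR results, so this is a quick way to spot clips worth re-extracting.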
With the extracted frames, we can run fastdup to analyze them.
To use the optical character recognition (OCR) feature, specify bounding_box='ocr' in the run method.
On larger datasets, you can also pass num_images to the run method (for example, num_images=1000) to limit the run to that many images. The cell below omits this parameter, so fastdup runs on the entire dataset.
fd = fastdup.create(input_dir='./frames')
fd.run(bounding_box='ocr')
Warning: fastdup create() without work_dir argument, output is stored in a folder named work_dir in your current working path. FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson. 2023-08-09 16:17:52 [INFO] Going to loop over dir frames 2023-08-09 16:17:52 [INFO] Found total 54 images to run on, 54 train, 0 test, name list 54, counter 54 FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson. 2023-08-09 16:18:45 [INFO] Going to loop over dir /tmp/crops_input.csv 2023-08-09 16:18:45 [INFO] Found total 259 images to run on, 259 train, 0 test, name list 259, counter 259 2023-08-09 16:18:46 [INFO] Found total 259 images to run onstimated: 0 Minutes Finished histogram 0.280 Finished bucket sort 0.294 2023-08-09 16:18:46 [INFO] 31) Finished write_index() NN model 2023-08-09 16:18:46 [INFO] Stored nn model index file work_dir/nnf.index 2023-08-09 16:18:46 [INFO] Total time took 1043 ms 2023-08-09 16:18:46 [INFO] Found a total of 83 fully identical images (d>0.990), which are 16.02 % 2023-08-09 16:18:46 [INFO] Found a total of 57 nearly identical images(d>0.980), which are 11.00 % 2023-08-09 16:18:46 [INFO] Found a total of 394 above threshold images (d>0.900), which are 76.06 % 2023-08-09 16:18:46 [INFO] Found a total of 25 outlier images (d<0.050), which are 4.83 % 2023-08-09 16:18:46 [INFO] Min distance found 0.448 max distance 0.999 2023-08-09 16:18:46 [INFO] Running connected components for ccthreshold 0.960000 .0 ######################################################################################## Dataset Analysis Summary: Dataset contains 259 objects Valid objects are 88.00% (259) of the data, invalid are 0.00% (0) of the data Similarity: 48.25% (142) belong to 11 similarity clusters (components). 39.75% (117) images do not belong to any similarity cluster. Largest cluster has 58 (19.71%) images. 
For a detailed analysis, use `.connected_components()` (similarity threshold used is 0.9, connected component threshold used is 0.96). Outliers: 5.10% (15) of images are possible outliers, and fall in the bottom 5.00% of similarity values. For a detailed list of outliers, use `.outliers()`. ######################################################################################## Would you like to see awesome visualizations for some of the most popular academic datasets? Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup ########################################################################################
0
stats = fd.img_stats()
stats.head()
 | index | img_w | img_h | unique | blur | mean | min | max | stdv | file_size | contrast | filename | crop_filename | x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 | confidence | label | is_valid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 24 | 14 | 183 | 345.0356 | 178.0270 | 51.0 | 255.0 | 57.0515 | 916 | 0.6667 | frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg | work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_436_953_456_953_456_965_436_965.jpg | 436 | 953 | 456 | 953 | 456 | 965 | 436 | 965 | 0.662398 | 1 | True |
1 | 1 | 112 | 36 | 195 | 688.3301 | 180.5150 | 41.0 | 255.0 | 47.7177 | 2614 | 0.7230 | frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg | work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_466_945_560_945_560_975_466_975.jpg | 466 | 945 | 560 | 945 | 560 | 975 | 466 | 975 | 0.785349 | TikTok | True |
2 | 2 | 108 | 28 | 186 | 739.3006 | 164.1137 | 70.0 | 255.0 | 40.6979 | 2074 | 0.5692 | frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg | work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_472_983_562_986_562_1008_471_1006.jpg | 472 | 983 | 562 | 986 | 562 | 1008 | 471 | 1006 | 0.989322 | @caitlinjs | True |
3 | 3 | 106 | 35 | 159 | 208.8422 | 222.4114 | 48.0 | 255.0 | 22.7722 | 2404 | 0.6832 | frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg | work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_61_26_149_26_149_55_61_55.jpg | 61 | 26 | 149 | 26 | 149 | 55 | 61 | 55 | 0.829294 | TikTOK | True |
4 | 4 | 159 | 25 | 95 | 360.8476 | 215.8326 | 0.0 | 255.0 | 29.6220 | 1811 | 1.0000 | frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg | work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_12_66_146_66_146_87_12_87.jpg | 12 | 66 | 146 | 66 | 146 | 87 | 12 | 87 | 0.990695 | @marc.koolen | True |
That's a lot of information. Let's slice the table to only include the columns we need.
# Copy the slice so that adding columns later does not trigger a SettingWithCopyWarning
df = stats[["filename", "crop_filename", "label"]].copy()
df
 | filename | crop_filename | label |
---|---|---|---|
0 | frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg | work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_436_953_456_953_456_965_436_965.jpg | 1 |
1 | frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg | work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_466_945_560_945_560_975_466_975.jpg | TikTok |
2 | frames/tmp/tiktok-trending-subset6875373441432816898.mp4/output_000002.jpg | work_dir/crops/framestmptiktok-trending-subset6875373441432816898.mp4output_000002.jpg_472_983_562_986_562_1008_471_1006.jpg | @caitlinjs |
3 | frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg | work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_61_26_149_26_149_55_61_55.jpg | TikTOK |
4 | frames/tmp/tiktok-trending-subset6875323773755657474.mp4/output_000001.jpg | work_dir/crops/framestmptiktok-trending-subset6875323773755657474.mp4output_000001.jpg_12_66_146_66_146_87_12_87.jpg | @marc.koolen |
... | ... | ... | ... |
254 | frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg | work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_78_604_308_604_308_622_78_622.jpg | haveadecentamountof |
255 | frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg | work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_78_628_317_628_317_650_78_650.jpg | bruisingandpainwithmy |
256 | frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg | work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_77_654_168_654_168_672_77_672.jpg | newlever |
257 | frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg | work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_433_946_561_946_561_976_433_976.jpg | JTikTok |
258 | frames/tmp/tiktok-trending-subset6875872124968439046.mp4/output_000006.jpg | work_dir/crops/framestmptiktok-trending-subset6875872124968439046.mp4output_000006.jpg_376_984_564_986_564_1008_376_1007.jpg | @timmytimmadome |
259 rows × 3 columns
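In the rows above, each crop_filename ends with eight integers that match the x1, y1, ..., x4, y4 columns of the full table, so the corners of a detected text region can be recovered from the file name alone. The sketch below relies on that naming pattern, which is inferred from the sample rows rather than from fastdup's documentation. The helper name corners_from_crop_filename is hypothetical.

```python
import re

# Eight underscore-separated integers right before the final ".jpg"
# (pattern inferred from the sample rows, not a documented contract)
_CORNERS = re.compile(r"_(\d+)_(\d+)_(\d+)_(\d+)_(\d+)_(\d+)_(\d+)_(\d+)\.jpg$")


def corners_from_crop_filename(crop_filename):
    """Return the four (x, y) corners encoded in a crop file name, or None."""
    match = _CORNERS.search(crop_filename)
    if match is None:
        return None
    values = [int(v) for v in match.groups()]
    # Pair up x and y values: (x1, y1), (x2, y2), (x3, y3), (x4, y4)
    return list(zip(values[0::2], values[1::2]))
```

When the full stats table is available, reading the x1 to y4 columns directly is the more reliable route. Parsing the name is mainly useful when only the crop files are at hand.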
Let's use the handy itables package to view the images in an interactive table.
!pip install -Uqq itables
Now, we run the function below to convert the image paths into HTML tags so we can preview the images here.
import base64
from itables import init_notebook_mode, show

init_notebook_mode(all_interactive=True)

# Convert an image file path into an HTML <img> tag for preview
def to_img_tag(path):
    if isinstance(path, str):
        with open(path, 'rb') as f:
            image_data = f.read()
        base64_image = base64.b64encode(image_data).decode('utf-8')
        # The frames and crops are JPEG files, so label the data URI as JPEG
        return '<img src="data:image/jpeg;base64,' + base64_image + '" width="150" >'
    else:
        # Leave non-string values (e.g. NaN) untouched
        return path

df["filename_preview"] = df["filename"].apply(to_img_tag)
df["crop_filename_preview"] = df["crop_filename"].apply(to_img_tag)
# Reorder the columns so each path sits next to its preview
df = df.loc[:, ["filename", "filename_preview", "crop_filename", "crop_filename_preview", "label"]]
show(df, classes="display compact")
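The preview helper in this notebook hard-codes a single MIME type in the data URI. If a dataset mixes image formats, one option is to guess the type from the file extension with the standard-library mimetypes module. This is a minimal sketch of that variant, not part of fastdup or itables. The width parameter and the JPEG fallback are assumptions made for illustration.

```python
import base64
import mimetypes
from pathlib import Path


def to_img_tag(path, width=150):
    """Return an HTML <img> tag embedding the image at path as a data URI."""
    if not isinstance(path, str):
        # Leave non-string values (e.g. NaN) untouched
        return path
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        # Assumption: fall back to JPEG when the extension is not recognized
        mime = "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}" width="{width}">'
```

Embedding every image as base64 makes the table self-contained, but it also grows the notebook file, so a smaller width or a subset of rows may be preferable for large datasets.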