This notebook is Part 2 of the enrichment notebook series, where we utilize various zero-shot models to enrich the metadata of an existing dataset.
If you haven't checked out Part 1, we highly encourage you to go through it before proceeding with this notebook.
In this notebook, we show an end-to-end example of how you can enrich the metadata of your visual data with open-source zero-shot models such as Grounding DINO, building on the output we obtained in Part 1.
By the end of this notebook, you'll learn how to:

- Enrich your dataset with bounding boxes using the Grounding DINO zero-shot detection model.
- Run the enrichment with your own custom text prompts, both on a DataFrame and on a single image.
- Export the enriched annotations into the COCO .json format.

First, let's install the necessary packages:
🗒 Note - We highly recommend running this notebook in a CUDA-enabled environment to reduce the run time.
!pip install -Uq fastdup mmengine mmdet groundingdino-py gdown
Now, test the installation. If there's no error message, we are ready to go.
import fastdup
fastdup.__version__
'1.53'
Download the coco-minitrain dataset - a curated mini training set consisting of 20% of the COCO 2017 training dataset. coco-minitrain contains 25,000 images and their annotations.
!gdown --fuzzy https://drive.google.com/file/d/1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK/view
!unzip -qq coco_minitrain_25k.zip
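If you'd like to confirm the download and extraction worked, a quick file count is enough. A minimal sketch (it assumes the coco_minitrain_25k/images/val2017 folder referenced throughout this notebook):
from pathlib import Path
# Count the extracted images in the folder used by the examples below.
image_dir = Path("coco_minitrain_25k/images/val2017")
print(f"Found {len(list(image_dir.glob('*.jpg')))} images in {image_dir}")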
Apart from zero-shot recognition models, fastdup also supports zero-shot detection models like Grounding DINO (and more to come).
Grounding DINO is a powerful zero-shot detection model. It accepts an image-text pair as input and outputs bounding boxes for the objects named in the text.
import pandas as pd
# Dataframe we got from Part 1
data = {
'filename': [
'coco_minitrain_25k/images/val2017/000000382734.jpg',
'coco_minitrain_25k/images/val2017/000000508730.jpg',
'coco_minitrain_25k/images/val2017/000000202339.jpg',
'coco_minitrain_25k/images/val2017/000000460929.jpg',
'coco_minitrain_25k/images/val2017/000000181796.jpg',
'coco_minitrain_25k/images/val2017/000000052565.jpg',
'coco_minitrain_25k/images/val2017/000000503755.jpg',
'coco_minitrain_25k/images/val2017/000000477955.jpg',
'coco_minitrain_25k/images/val2017/000000562229.jpg',
'coco_minitrain_25k/images/val2017/000000528862.jpg',
],
'ram_tags': [
'bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white',
'baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy',
'bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk',
'beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap',
'bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass',
'beach . black . cattle . coast . cow . sea . photo . sand . shore . shoreline . stand . stare . water . white',
'catch . court . hand . goggles . necklace . play . racket . stand . sunglasses . tennis court . tennis player . tennis racket . wear . woman',
'attach . beach . catch . coast . fly . person . kite . man . sea . parachute . parasail . sand . sky . stand . string . surfboard . surfer . wetsuit',
'boy . child . ride . road . skateboard . skateboarder . stand . trick',
'animal . area . dirt field . enclosure . fence . field . giraffe . habitat . herd . lush . pen . savanna . stand . tree . walk . zoo'
]
}
df = pd.DataFrame(data)
df
filename | ram_tags | |
---|---|---|
0 | coco_minitrain_25k/images/val2017/000000382734.jpg | bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white |
1 | coco_minitrain_25k/images/val2017/000000508730.jpg | baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy |
2 | coco_minitrain_25k/images/val2017/000000202339.jpg | bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk |
3 | coco_minitrain_25k/images/val2017/000000460929.jpg | beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap |
4 | coco_minitrain_25k/images/val2017/000000181796.jpg | bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass |
5 | coco_minitrain_25k/images/val2017/000000052565.jpg | beach . black . cattle . coast . cow . sea . photo . sand . shore . shoreline . stand . stare . water . white |
6 | coco_minitrain_25k/images/val2017/000000503755.jpg | catch . court . hand . goggles . necklace . play . racket . stand . sunglasses . tennis court . tennis player . tennis racket . wear . woman |
7 | coco_minitrain_25k/images/val2017/000000477955.jpg | attach . beach . catch . coast . fly . person . kite . man . sea . parachute . parasail . sand . sky . stand . string . surfboard . surfer . wetsuit |
8 | coco_minitrain_25k/images/val2017/000000562229.jpg | boy . child . ride . road . skateboard . skateboarder . stand . trick |
9 | coco_minitrain_25k/images/val2017/000000528862.jpg | animal . area . dirt field . enclosure . fence . field . giraffe . habitat . herd . lush . pen . savanna . stand . tree . walk . zoo |
If you'd like to reproduce the above DataFrame, the Part 1 notebook details the code you need to run.
We can now use the image tags from the above DataFrame in combination with Grounding DINO to further enrich the dataset with bounding boxes.
To run the enrichment on a DataFrame, use the fd.enrich method and specify model='grounding-dino'. By default, fastdup loads the smaller Grounding DINO variant (Swin-T backbone) for enrichment.
Also specify the DataFrame to run the enrichment on and the name of the column that serves as the input to the Grounding DINO model. In this example, we take the text prompt from the ram_tags column, which we computed earlier.
fd = fastdup.create(input_dir='./coco_minitrain_25k')
df = fd.enrich(task='zero-shot-detection',
model='grounding-dino',
input_df=df,
input_col='ram_tags'
)
Warning: fastdup create() without work_dir argument, output is stored in a folder named work_dir in your current working path.
INFO:fastdup.models.grounding_dino:Loading model checkpoint from - /home/dnth/groundingdino_swint_ogc.pth /home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.) return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight'] - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). INFO:fastdup.models.grounding_dino:Model loaded on device - cuda
final text_encoder_type: bert-base-uncased
/home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/transformers/modeling_utils.py:768: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers. warnings.warn( /home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn( /home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn(
Once done, you'll notice that three new columns are appended to the DataFrame, namely grounding_dino_bboxes, grounding_dino_scores, and grounding_dino_labels.
df
filename | ram_tags | grounding_dino_bboxes | grounding_dino_scores | grounding_dino_labels | |
---|---|---|---|---|---|
0 | coco_minitrain_25k/images/val2017/000000382734.jpg | bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white | [(94.36, 479.79, 236.6, 589.37), (4.92, 3.73, 475.19, 637.36), (95.94, 514.92, 376.53, 638.46), (41.91, 37.47, 425.01, 637.09), (115.27, 602.26, 164.17, 635.21)] | [0.5789, 0.3895, 0.4444, 0.3018, 0.3601] | [bath, bathroom, floor, glass door, drain] |
1 | coco_minitrain_25k/images/val2017/000000508730.jpg | baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy | [(3.58, 2.77, 635.13, 475.62), (30.91, 104.91, 301.75, 476.29), (68.59, 105.02, 266.22, 267.8), (359.26, 116.82, 576.6, 475.9), (374.37, 116.77, 557.19, 254.07), (466.9, 0.71, 638.7, 117.05), (266.95, 433.87, 291.04, 476.78), (466.53, 349.26, 525.87, 405.73), (350.62, 272.66, 571.98, 476.46)] | [0.5898, 0.3738, 0.3679, 0.3641, 0.362, 0.3482, 0.3804, 0.3755, 0.3742] | [bathroom, toddler, hair, toddler, hair, bathroom accessory, hairbrush, diaper, chair] |
2 | coco_minitrain_25k/images/val2017/000000202339.jpg | bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk | [(73.28, 256.74, 135.63, 374.42), (103.53, 105.23, 267.7, 410.18), (98.31, 33.85, 271.8, 434.72), (203.78, 63.88, 463.32, 298.29), (147.5, 106.62, 163.49, 172.9), (164.1, 52.93, 272.88, 152.68), (0.49, 0.76, 82.86, 333.41), (1.96, 2.22, 477.75, 636.07), (398.15, 281.2, 479.01, 545.03), (147.02, 106.66, 163.66, 227.86)] | [0.605, 0.4599, 0.47, 0.4068, 0.3591, 0.4449, 0.4371, 0.3444, 0.3043, 0.4775] | [shopping bag, business suit, man, city bus, tie, sign, bus, bus station, carry, tie] |
3 | coco_minitrain_25k/images/val2017/000000460929.jpg | beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap | [(288.11, 1.02, 423.49, 414.45), (178.93, 355.71, 327.01, 636.63), (214.56, 514.74, 280.53, 569.36), (234.05, 369.92, 286.25, 545.46), (1.41, 0.68, 478.3, 279.4), (5.18, 264.26, 476.49, 637.36), (170.14, 351.42, 356.79, 637.53), (18.26, 244.37, 415.31, 637.95), (211.98, 364.96, 287.86, 629.13), (295.23, 275.44, 399.61, 353.15), (1.46, 79.77, 477.84, 233.99)] | [0.5651, 0.3978, 0.3909, 0.3875, 0.3047, 0.3796, 0.4003, 0.3422, 0.4297, 0.3012, 0.3264] | [beer bottle, hot dog, mustard, tomato sauce, car, picnic table, hot dog, tinfoil, chili dog, condiment, car] |
4 | coco_minitrain_25k/images/val2017/000000181796.jpg | bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass | [(105.15, 0.55, 239.98, 190.16), (214.13, 60.52, 298.47, 154.0), (163.61, 136.44, 501.68, 358.82), (495.3, 47.58, 553.6, 98.79), (520.29, 27.48, 564.41, 58.55), (402.73, 177.07, 594.84, 222.61), (136.47, 45.8, 226.12, 98.31), (478.39, 31.42, 524.07, 67.47), (349.11, 27.75, 610.1, 119.71), (364.94, 264.18, 470.49, 335.61), (1.75, 1.75, 637.54, 358.16), (310.99, 48.6, 509.15, 102.31), (359.97, 245.94, 425.25, 299.72), (359.62, 0.36, 517.83, 28.27), (94.78, 165.47, 228.95, 209.8), (532.24, 0.4, 639.09, 80.23), (404.09, 156.78, 638.9, 344.62), (202.34, 144.76, 380.58, 285.73), (179.95, 138.21, 482.34, 358.16)] | [0.859, 0.6904, 0.5239, 0.5113, 0.5294, 0.4608, 0.7976, 0.4085, 0.4648, 0.4421, 0.5061, 0.372, 0.3952, 0.416, 0.3843, 0.3348, 0.3765, 0.3076, 0.3206] | [wine glass, cup, plate, cup, cup, fork, wine, cup, plate, vegetable, table, utensil, potato, platter, utensil, platter, silverware, meat, meal] |
5 | coco_minitrain_25k/images/val2017/000000052565.jpg | beach . black . cattle . coast . cow . sea . photo . sand . shore . shoreline . stand . stare . water . white | [(165.88, 174.02, 474.01, 388.07), (4.12, 4.98, 635.79, 453.29), (2.75, 371.86, 636.64, 455.92), (166.5, 173.98, 473.98, 388.32), (166.03, 172.94, 473.93, 388.82)] | [0.459, 0.5769, 0.3651, 0.3646, 0.3252] | [cattle, photo, beach, cow, cattle] |
6 | coco_minitrain_25k/images/val2017/000000503755.jpg | catch . court . hand . goggles . necklace . play . racket . stand . sunglasses . tennis court . tennis player . tennis racket . wear . woman | [(140.63, 274.89, 246.48, 396.44), (0.41, 377.04, 52.27, 638.84), (15.43, 70.65, 477.15, 635.92), (35.93, 276.65, 420.22, 638.17), (110.67, 159.16, 268.82, 206.84), (3.49, 500.51, 478.05, 638.73), (98.69, 144.98, 283.8, 224.12)] | [0.8102, 0.618, 0.5048, 0.5032, 0.4705, 0.4123, 0.3215] | [necklace, racket, tennis player, wear, sunglasses, tennis court, goggles] |
7 | coco_minitrain_25k/images/val2017/000000477955.jpg | attach . beach . catch . coast . fly . person . kite . man . sea . parachute . parasail . sand . sky . stand . string . surfboard . surfer . wetsuit | [(178.59, 434.5, 259.67, 612.91), (176.17, 476.49, 216.9, 569.34), (178.08, 461.01, 256.94, 608.4), (228.88, 82.1, 361.35, 463.9), (1.46, 1.78, 478.65, 529.36), (1.56, 572.82, 478.52, 639.0), (1.14, 519.61, 478.9, 627.36), (327.8, 10.03, 367.96, 89.58), (333.37, 501.37, 340.3, 511.15)] | [0.5728, 0.5696, 0.5161, 0.4331, 0.4934, 0.4239, 0.5173, 0.4353, 0.3184] | [person, surfboard, wetsuit, string, sky, beach, sea, parachute, kite] |
8 | coco_minitrain_25k/images/val2017/000000562229.jpg | boy . child . ride . road . skateboard . skateboarder . stand . trick | [(247.17, 495.74, 420.17, 582.6), (248.35, 120.74, 455.25, 573.51), (2.14, 269.6, 638.05, 636.89)] | [0.726, 0.4538, 0.3835] | [skateboard, skateboarder, road] |
9 | coco_minitrain_25k/images/val2017/000000528862.jpg | animal . area . dirt field . enclosure . fence . field . giraffe . habitat . herd . lush . pen . savanna . stand . tree . walk . zoo | [(42.47, 105.04, 79.05, 264.81), (162.63, 146.35, 204.98, 264.83), (183.47, 66.26, 221.07, 168.76), (253.46, 152.0, 317.8, 273.61), (231.09, 172.01, 255.38, 249.36), (343.14, 151.77, 367.43, 239.16), (252.07, 0.89, 355.58, 292.78), (397.64, 1.5, 498.46, 356.79), (319.2, 134.25, 342.34, 237.56), (2.0, 305.4, 495.8, 368.44), (2.73, 2.42, 497.21, 370.32), (117.65, 72.43, 141.59, 93.78), (319.68, 134.5, 342.2, 215.2)] | [0.681, 0.6958, 0.6788, 0.6266, 0.5801, 0.5907, 0.3993, 0.4188, 0.3322, 0.4242, 0.3804, 0.3182, 0.3121] | [giraffe, giraffe, giraffe, giraffe, giraffe, giraffe, tree, tree, giraffe, fence, zoo, animal, giraffe] |
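Each of the new columns holds one list per image. If you prefer one detection per row (for example, to filter or sort by score), you can explode the lists. A minimal sketch, assuming pandas >= 1.3 for multi-column explode; the 0.5 cutoff is just an example:
# One row per detection instead of one row per image.
detections = df.explode(['grounding_dino_bboxes',
                         'grounding_dino_scores',
                         'grounding_dino_labels'])
# Example: keep only the more confident detections.
confident = detections[detections['grounding_dino_scores'] >= 0.5]
confident[['filename', 'grounding_dino_labels', 'grounding_dino_scores']].head()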
Now let's plot the results of the enrichment using the plot_annotations
function.
from fastdup.models_utils import plot_annotations
plot_annotations(df,
image_col='filename', # column specifying image filenames
tags_col='ram_tags', # column specifying image labels
bbox_col='grounding_dino_bboxes', # column specifying bounding boxes
scores_col='grounding_dino_scores', # column specifying label scores
labels_col='grounding_dino_labels', # column specifying label text
num_rows=10 # the number of rows in the dataframe to plot
)
Suppose you'd like to search for specific objects in your dataset. You can create a column in the DataFrame specifying the objects of interest and run the .enrich method on it.
Let's create a column in our DataFrame and name it custom_prompt
.
df["custom_prompt"] = "face . eye . hair . "
df
filename | ram_tags | grounding_dino_bboxes | grounding_dino_scores | grounding_dino_labels | custom_prompt | |
---|---|---|---|---|---|---|
0 | coco_minitrain_25k/images/val2017/000000382734.jpg | bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white | [(94.36, 479.79, 236.6, 589.37), (4.92, 3.73, 475.19, 637.36), (95.94, 514.92, 376.53, 638.46), (41.91, 37.47, 425.01, 637.09), (115.27, 602.26, 164.17, 635.21)] | [0.5789, 0.3895, 0.4444, 0.3018, 0.3601] | [bath, bathroom, floor, glass door, drain] | face . eye . hair . |
1 | coco_minitrain_25k/images/val2017/000000508730.jpg | baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy | [(3.58, 2.77, 635.13, 475.62), (30.91, 104.91, 301.75, 476.29), (68.59, 105.02, 266.22, 267.8), (359.26, 116.82, 576.6, 475.9), (374.37, 116.77, 557.19, 254.07), (466.9, 0.71, 638.7, 117.05), (266.95, 433.87, 291.04, 476.78), (466.53, 349.26, 525.87, 405.73), (350.62, 272.66, 571.98, 476.46)] | [0.5898, 0.3738, 0.3679, 0.3641, 0.362, 0.3482, 0.3804, 0.3755, 0.3742] | [bathroom, toddler, hair, toddler, hair, bathroom accessory, hairbrush, diaper, chair] | face . eye . hair . |
2 | coco_minitrain_25k/images/val2017/000000202339.jpg | bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk | [(73.28, 256.74, 135.63, 374.42), (103.53, 105.23, 267.7, 410.18), (98.31, 33.85, 271.8, 434.72), (203.78, 63.88, 463.32, 298.29), (147.5, 106.62, 163.49, 172.9), (164.1, 52.93, 272.88, 152.68), (0.49, 0.76, 82.86, 333.41), (1.96, 2.22, 477.75, 636.07), (398.15, 281.2, 479.01, 545.03), (147.02, 106.66, 163.66, 227.86)] | [0.605, 0.4599, 0.47, 0.4068, 0.3591, 0.4449, 0.4371, 0.3444, 0.3043, 0.4775] | [shopping bag, business suit, man, city bus, tie, sign, bus, bus station, carry, tie] | face . eye . hair . |
3 | coco_minitrain_25k/images/val2017/000000460929.jpg | beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap | [(288.11, 1.02, 423.49, 414.45), (178.93, 355.71, 327.01, 636.63), (214.56, 514.74, 280.53, 569.36), (234.05, 369.92, 286.25, 545.46), (1.41, 0.68, 478.3, 279.4), (5.18, 264.26, 476.49, 637.36), (170.14, 351.42, 356.79, 637.53), (18.26, 244.37, 415.31, 637.95), (211.98, 364.96, 287.86, 629.13), (295.23, 275.44, 399.61, 353.15), (1.46, 79.77, 477.84, 233.99)] | [0.5651, 0.3978, 0.3909, 0.3875, 0.3047, 0.3796, 0.4003, 0.3422, 0.4297, 0.3012, 0.3264] | [beer bottle, hot dog, mustard, tomato sauce, car, picnic table, hot dog, tinfoil, chili dog, condiment, car] | face . eye . hair . |
4 | coco_minitrain_25k/images/val2017/000000181796.jpg | bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass | [(105.15, 0.55, 239.98, 190.16), (214.13, 60.52, 298.47, 154.0), (163.61, 136.44, 501.68, 358.82), (495.3, 47.58, 553.6, 98.79), (520.29, 27.48, 564.41, 58.55), (402.73, 177.07, 594.84, 222.61), (136.47, 45.8, 226.12, 98.31), (478.39, 31.42, 524.07, 67.47), (349.11, 27.75, 610.1, 119.71), (364.94, 264.18, 470.49, 335.61), (1.75, 1.75, 637.54, 358.16), (310.99, 48.6, 509.15, 102.31), (359.97, 245.94, 425.25, 299.72), (359.62, 0.36, 517.83, 28.27), (94.78, 165.47, 228.95, 209.8), (532.24, 0.4, 639.09, 80.23), (404.09, 156.78, 638.9, 344.62), (202.34, 144.76, 380.58, 285.73), (179.95, 138.21, 482.34, 358.16)] | [0.859, 0.6904, 0.5239, 0.5113, 0.5294, 0.4608, 0.7976, 0.4085, 0.4648, 0.4421, 0.5061, 0.372, 0.3952, 0.416, 0.3843, 0.3348, 0.3765, 0.3076, 0.3206] | [wine glass, cup, plate, cup, cup, fork, wine, cup, plate, vegetable, table, utensil, potato, platter, utensil, platter, silverware, meat, meal] | face . eye . hair . |
5 | coco_minitrain_25k/images/val2017/000000052565.jpg | beach . black . cattle . coast . cow . sea . photo . sand . shore . shoreline . stand . stare . water . white | [(165.88, 174.02, 474.01, 388.07), (4.12, 4.98, 635.79, 453.29), (2.75, 371.86, 636.64, 455.92), (166.5, 173.98, 473.98, 388.32), (166.03, 172.94, 473.93, 388.82)] | [0.459, 0.5769, 0.3651, 0.3646, 0.3252] | [cattle, photo, beach, cow, cattle] | face . eye . hair . |
6 | coco_minitrain_25k/images/val2017/000000503755.jpg | catch . court . hand . goggles . necklace . play . racket . stand . sunglasses . tennis court . tennis player . tennis racket . wear . woman | [(140.63, 274.89, 246.48, 396.44), (0.41, 377.04, 52.27, 638.84), (15.43, 70.65, 477.15, 635.92), (35.93, 276.65, 420.22, 638.17), (110.67, 159.16, 268.82, 206.84), (3.49, 500.51, 478.05, 638.73), (98.69, 144.98, 283.8, 224.12)] | [0.8102, 0.618, 0.5048, 0.5032, 0.4705, 0.4123, 0.3215] | [necklace, racket, tennis player, wear, sunglasses, tennis court, goggles] | face . eye . hair . |
7 | coco_minitrain_25k/images/val2017/000000477955.jpg | attach . beach . catch . coast . fly . person . kite . man . sea . parachute . parasail . sand . sky . stand . string . surfboard . surfer . wetsuit | [(178.59, 434.5, 259.67, 612.91), (176.17, 476.49, 216.9, 569.34), (178.08, 461.01, 256.94, 608.4), (228.88, 82.1, 361.35, 463.9), (1.46, 1.78, 478.65, 529.36), (1.56, 572.82, 478.52, 639.0), (1.14, 519.61, 478.9, 627.36), (327.8, 10.03, 367.96, 89.58), (333.37, 501.37, 340.3, 511.15)] | [0.5728, 0.5696, 0.5161, 0.4331, 0.4934, 0.4239, 0.5173, 0.4353, 0.3184] | [person, surfboard, wetsuit, string, sky, beach, sea, parachute, kite] | face . eye . hair . |
8 | coco_minitrain_25k/images/val2017/000000562229.jpg | boy . child . ride . road . skateboard . skateboarder . stand . trick | [(247.17, 495.74, 420.17, 582.6), (248.35, 120.74, 455.25, 573.51), (2.14, 269.6, 638.05, 636.89)] | [0.726, 0.4538, 0.3835] | [skateboard, skateboarder, road] | face . eye . hair . |
9 | coco_minitrain_25k/images/val2017/000000528862.jpg | animal . area . dirt field . enclosure . fence . field . giraffe . habitat . herd . lush . pen . savanna . stand . tree . walk . zoo | [(42.47, 105.04, 79.05, 264.81), (162.63, 146.35, 204.98, 264.83), (183.47, 66.26, 221.07, 168.76), (253.46, 152.0, 317.8, 273.61), (231.09, 172.01, 255.38, 249.36), (343.14, 151.77, 367.43, 239.16), (252.07, 0.89, 355.58, 292.78), (397.64, 1.5, 498.46, 356.79), (319.2, 134.25, 342.34, 237.56), (2.0, 305.4, 495.8, 368.44), (2.73, 2.42, 497.21, 370.32), (117.65, 72.43, 141.59, 93.78), (319.68, 134.5, 342.2, 215.2)] | [0.681, 0.6958, 0.6788, 0.6266, 0.5801, 0.5907, 0.3993, 0.4188, 0.3322, 0.4242, 0.3804, 0.3182, 0.3121] | [giraffe, giraffe, giraffe, giraffe, giraffe, giraffe, tree, tree, giraffe, fence, zoo, animal, giraffe] | face . eye . hair . |
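Because the prompt is read from a column, it does not have to be identical for every row. A minimal, purely illustrative sketch that sets per-image prompts on a copy of the DataFrame (so the outputs below stay reproducible):
# Work on a copy so the DataFrame used in the next cell is unchanged.
# The prompts here are hypothetical examples.
df_alt = df.copy()
df_alt.loc[df_alt['filename'].str.contains('000000503755'), 'custom_prompt'] = 'tennis racket . wristband . '
df_alt.loc[df_alt['filename'].str.contains('000000528862'), 'custom_prompt'] = 'giraffe head . fence post . '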
Now let's run the enrichment with the custom_prompt
column.
df = fd.enrich(task='zero-shot-detection',
model='grounding-dino',
input_df=df,
input_col='custom_prompt'
)
INFO:fastdup.models.grounding_dino:Loading model checkpoint from - /home/dnth/groundingdino_swint_ogc.pth
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight'] - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). INFO:fastdup.models.grounding_dino:Model loaded on device - cuda
final text_encoder_type: bert-base-uncased
/home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/transformers/modeling_utils.py:768: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers. warnings.warn( /home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn( /home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn(
df
filename | ram_tags | grounding_dino_bboxes | grounding_dino_scores | grounding_dino_labels | custom_prompt | |
---|---|---|---|---|---|---|
0 | coco_minitrain_25k/images/val2017/000000382734.jpg | bath . bathroom . doorway . drain . floor . glass door . room . screen door . shower . white | [] | [] | [] | face . eye . hair . |
1 | coco_minitrain_25k/images/val2017/000000508730.jpg | baby . bathroom . bathroom accessory . bin . boy . brush . chair . child . comb . diaper . hair . hairbrush . play . potty . sit . stool . tile wall . toddler . toilet bowl . toilet seat . toy | [(111.61, 183.91, 211.03, 300.08), (373.03, 117.48, 557.48, 255.6), (429.51, 205.87, 512.15, 275.5), (68.17, 105.78, 267.42, 265.82), (167.08, 234.47, 190.86, 247.08), (486.49, 222.14, 503.53, 232.73), (121.71, 238.59, 144.87, 249.47), (449.39, 223.3, 466.83, 232.74)] | [0.5332, 0.5661, 0.4495, 0.5786, 0.3995, 0.3641, 0.397, 0.3431] | [face, hair, face, hair, eye, eye, eye, eye] | face . eye . hair . |
2 | coco_minitrain_25k/images/val2017/000000202339.jpg | bus . bus station . business suit . carry . catch . city bus . pillar . man . shopping bag . sign . suit . tie . tour bus . walk | [(135.15, 44.15, 172.74, 96.72), (133.59, 34.84, 179.46, 67.8), (153.42, 59.65, 163.44, 66.16)] | [0.5035, 0.3673, 0.3156] | [face, hair, eye] | face . eye . hair . |
3 | coco_minitrain_25k/images/val2017/000000460929.jpg | beer . beer bottle . beverage . blanket . bottle . roll . can . car . chili dog . condiment . table . dog . drink . foil . hot . hot dog . mustard . picnic table . sit . soda . tinfoil . tomato sauce . wrap | [] | [] | [] | face . eye . hair . |
4 | coco_minitrain_25k/images/val2017/000000181796.jpg | bean . cup . table . dinning table . plate . food . fork . fruit . wine . meal . meat . peak . platter . potato . silverware . utensil . vegetable . white . wine glass | [] | [] | [] | face . eye . hair . |
5 | coco_minitrain_25k/images/val2017/000000052565.jpg | beach . black . cattle . coast . cow . sea . photo . sand . shore . shoreline . stand . stare . water . white | [(197.29, 193.85, 237.18, 263.0), (229.53, 214.17, 237.03, 223.19), (197.72, 213.68, 204.14, 222.31)] | [0.5046, 0.3701, 0.3383] | [face, eye, eye] | face . eye . hair . |
6 | coco_minitrain_25k/images/val2017/000000503755.jpg | catch . court . hand . goggles . necklace . play . racket . stand . sunglasses . tennis court . tennis player . tennis racket . wear . woman | [(121.77, 138.54, 265.7, 292.26)] | [0.6137] | [face] | face . eye . hair . |
7 | coco_minitrain_25k/images/val2017/000000477955.jpg | attach . beach . catch . coast . fly . person . kite . man . sea . parachute . parasail . sand . sky . stand . string . surfboard . surfer . wetsuit | [(192.48, 434.84, 229.47, 477.17)] | [0.7636] | [hair] | face . eye . hair . |
8 | coco_minitrain_25k/images/val2017/000000562229.jpg | boy . child . ride . road . skateboard . skateboarder . stand . trick | [(326.74, 154.97, 369.22, 202.22), (349.83, 163.57, 362.47, 171.97), (332.27, 165.95, 343.6, 172.67)] | [0.6392, 0.3289, 0.3172] | [face, eye, eye] | face . eye . hair . |
9 | coco_minitrain_25k/images/val2017/000000528862.jpg | animal . area . dirt field . enclosure . fence . field . giraffe . habitat . herd . lush . pen . savanna . stand . tree . walk . zoo | [] | [] | [] | face . eye . hair . |
Not all images contain "face", "eye", and "hair", so let's remove the rows with no detections and plot the rows that have detections.
df = df[df['grounding_dino_labels'].astype(bool)]
plot_annotations(df,
image_col='filename',
tags_col='custom_prompt',
bbox_col='grounding_dino_bboxes',
scores_col='grounding_dino_scores',
labels_col='grounding_dino_labels',
num_rows=10
)
fastdup provides an easy way to load the Grounding DINO model and run inference on a single image.
Let's suppose we have the following image and would like to run inference on it with the Grounding DINO model.
from IPython.display import Image
Image("coco_minitrain_25k/images/val2017/000000449996.jpg")
You'll have to import the module and provide an image-text input pair.
Note: Object names in the text prompt must be separated with " . ".
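If your object names live in a Python list, you can build a correctly formatted prompt by joining them with that separator. A small sketch (tags is just an example list):
# Join object names with the " . " separator expected in the text prompt.
tags = ["air field", "airliner", "plane", "airport runway", "sky", "tarmac"]
text_prompt = " . ".join(tags)
print(text_prompt)  # air field . airliner . plane . airport runway . sky . tarmac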
By default fastdup uses the smaller variant of Grounding DINO (Swin-T backbone).
from fastdup.models_grounding_dino import GroundingDINO
model = GroundingDINO()
results = model.run_inference(image_path="coco_minitrain_25k/images/val2017/000000449996.jpg",
text_prompt="air field . airliner . plane . airport . airport runway . airport terminal . jet . land . park . raceway . sky . tarmac . terminal",
box_threshold=0.3,
text_threshold=0.25
)
INFO:fastdup.models.grounding_dino:Loading model checkpoint from - /home/dnth/groundingdino_swint_ogc.pth
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight'] - This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). INFO:fastdup.models.grounding_dino:Model loaded on device - cuda
final text_encoder_type: bert-base-uncased
/home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/transformers/modeling_utils.py:768: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers. warnings.warn( /home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn( /home/dnth/anaconda3/envs/fastdup/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn(
results
{'labels': ['sky', 'airport terminal', 'plane', 'airliner', 'jet', 'jet', 'tarmac'], 'scores': [0.5286, 0.3451, 0.3822, 0.4872, 0.3853, 0.3502, 0.3026], 'boxes': [(1.47, 1.45, 638.46, 241.37), (329.38, 291.55, 468.1, 319.69), (142.03, 247.3, 261.96, 296.55), (443.6, 111.93, 495.47, 130.84), (113.85, 290.28, 246.55, 340.23), (391.59, 271.73, 465.1, 295.48), (2.35, 277.69, 637.63, 425.32)]}
Fine-tune the detection output by varying the box_threshold and text_threshold values.
The outputs are stored in a Python dict.
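Because the outputs are a plain dict, post-filtering is straightforward. A minimal sketch that keeps only the higher-confidence detections (the 0.5 cutoff is arbitrary; tune it for your data):
# Keep only detections whose score clears a stricter cutoff than the
# box_threshold used at inference time.
min_score = 0.5
keep = [i for i, s in enumerate(results['scores']) if s >= min_score]
filtered = {
    'labels': [results['labels'][i] for i in keep],
    'scores': [results['scores'][i] for i in keep],
    'boxes': [results['boxes'][i] for i in keep],
}
print(filtered['labels'])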
Let's plot the image and results using the annotate_image
convenience function.
from fastdup.models_utils import annotate_image
annotate_image("coco_minitrain_25k/images/val2017/000000449996.jpg", results)
You can optionally load another variant of Grounding DINO (Swin-B backbone) from the official Grounding DINO repo.
Download the weights and config into your local directory and pass them as arguments to the GroundingDINO constructor.
model = GroundingDINO(model_config="GroundingDINO_SwinB_cfg.py",
model_weights="groundingdino_swinb_cogcoor.pth")
results = model.run_inference(image_path="coco_minitrain_25k/images/val2017/000000449996.jpg",
text_prompt="air field . airliner . plane . airport . airport runway . airport terminal . jet . land . park . raceway . sky . tarmac . terminal",
box_threshold=0.3,
text_threshold=0.25)
Once the enrichment is complete, you can also conveniently export the DataFrame into the COCO .json
annotation format. For now, only the bounding boxes and labels are exported. Masks will be added in a future release.
from fastdup.models_utils import export_to_coco
export_to_coco(df,
bbox_col='grounding_dino_bboxes',
label_col='grounding_dino_labels',
json_filename='grounding_dino_annot_coco_format.json'
)
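To sanity-check the export, you can load the file back and inspect it. A minimal sketch; the key names assume the standard COCO layout (images, annotations, categories):
import json
# Load the exported annotations and report the size of each top-level section.
with open('grounding_dino_annot_coco_format.json') as f:
    coco = json.load(f)
for key in ('images', 'annotations', 'categories'):
    if key in coco:
        print(f"{key}: {len(coco[key])} entries")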
In this tutorial, we showed how you can run zero-shot image detection models to enrich your dataset.
This notebook is Part 2 of the dataset enrichment notebook series where we utilize various zero-shot models to enrich datasets.
Please check out Part 3 of the series where we explore how to generate masks from the bounding boxes using zero-shot segmentation models like Segment Anything. See you there!
Questions about this tutorial? Reach out to us on our Slack channel!
Next, feel free to check out other tutorials -
If you prefer a no-code platform to inspect and visualize your dataset, try our free cloud product VL Profiler, which lets you visualize and inspect your dataset right in your browser.
VL Profiler is free to get started. Upload up to 1,000,000 images for analysis at zero cost!
Sign up now.
As usual, feedback is welcome! Questions? Drop by our Slack channel or open an issue on GitHub.