This notebook shows how to quickly analyze an image dataset for potentially mislabeled images and export the list of suspects for further inspection.
First, let's start with the installation:
Tip - If you're new to fastdup, we encourage you to run the notebook in Google Colab or Kaggle for the best experience. If you'd just like to skim through the notebook, we recommend viewing it with nbviewer.
!pip install fastdup -Uq
Now, test the installation by printing out the version. If there's no error message, we are ready to go!
import fastdup
fastdup.__version__
'2.0.21'
In this notebook, let's use the widely available and relatively well-curated Food-101 dataset.
The Food-101 dataset consists of 101 food classes with 1,000 images per class. That is a total of 101,000 images.
Note - fastdup works on both unlabeled and labeled images. But for now, we are only interested in finding issues in the images and not the annotations. If you're interested in finding annotation issues, head to:
Let's download the dataset and extract it into the local directory:
!wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz
!tar -xf food-101.tar.gz
The Food-101 dataset stores its images in folders named after the class. Let's create a DataFrame with the annotations.
import os
import pandas as pd
dataset_dir = 'food-101/images/'
filenames = []
labels = []
# Iterate over the directory and subdirectories
for root, dirs, files in os.walk(dataset_dir):
    # Skip the root directory
    if root == dataset_dir:
        continue
    label = os.path.basename(root)
    for filename in files:
        filenames.append(os.path.join(root, filename))
        labels.append(label)
data = {'filename': filenames, 'label': labels}
df = pd.DataFrame(data)
df
| | filename | label |
---|---|---|
0 | food-101/images/gnocchi/1642469.jpg | gnocchi |
1 | food-101/images/gnocchi/1598303.jpg | gnocchi |
2 | food-101/images/gnocchi/79585.jpg | gnocchi |
3 | food-101/images/gnocchi/2397771.jpg | gnocchi |
4 | food-101/images/gnocchi/2388954.jpg | gnocchi |
... | ... | ... |
100995 | food-101/images/bread_pudding/2415610.jpg | bread_pudding |
100996 | food-101/images/bread_pudding/723067.jpg | bread_pudding |
100997 | food-101/images/bread_pudding/1051348.jpg | bread_pudding |
100998 | food-101/images/bread_pudding/3607583.jpg | bread_pudding |
100999 | food-101/images/bread_pudding/1907181.jpg | bread_pudding |
101000 rows × 2 columns
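Before running fastdup, it can be worth checking that the annotations match the dataset description above (101 classes with 1,000 images each). The sketch below is not part of the original notebook: `check_class_balance` is a hypothetical helper, and the toy DataFrame with made-up file names stands in for the real `df`.

```python
import pandas as pd


def check_class_balance(annotations: pd.DataFrame, expected_per_class: int) -> pd.Series:
    """Return the classes whose image count differs from expected_per_class."""
    counts = annotations["label"].value_counts()
    return counts[counts != expected_per_class]


# Toy stand-in for the real annotations DataFrame (hypothetical file names).
toy_df = pd.DataFrame(
    {
        "filename": ["a/1.jpg", "a/2.jpg", "b/1.jpg"],
        "label": ["a", "a", "b"],
    }
)

# Classes that do not have exactly 2 images in the toy data.
unbalanced = check_class_balance(toy_df, expected_per_class=2)
print(unbalanced)
```

On the full dataset, calling `check_class_balance(df, expected_per_class=1000)` would be expected to return an empty Series if the download and extraction completed.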
Once the extraction completes, we can run fastdup on the images.
For that, let's initialize fastdup and specify the input directory, which points to the folder of images.
Note - The .create method also has an optional work_dir parameter which specifies the directory to store artifacts from the run. In other words, you can run fastdup.create(input_dir="images/", work_dir="my_work_dir/") if you'd like to store the artifacts in my_work_dir.
Now, let's run fastdup.
fd = fastdup.create(input_dir="food-101/images/")
fd.run(annotations=df)
outliers_df = fd.outliers()
outliers_df
| | outlier | nearest | distance | filename_outlier | label_outlier | index_x | error_code_outlier | is_valid_outlier | fd_index_outlier | filename_nearest | label_nearest | index_y | error_code_nearest | is_valid_nearest | fd_index_nearest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75368 | 13490 | 0.379365 | food-101/images/breakfast_burrito/462294.jpg | breakfast_burrito | 75368 | VALID | True | 75368 | food-101/images/tacos/1505262.jpg | tacos | 13490 | VALID | True | 13490 |
1 | 41508 | 16764 | 0.429240 | food-101/images/macarons/2117640.jpg | macarons | 41508 | VALID | True | 41508 | food-101/images/fish_and_chips/2079080.jpg | fish_and_chips | 16764 | VALID | True | 16764 |
2 | 13490 | 19357 | 0.515785 | food-101/images/tacos/1505262.jpg | tacos | 13490 | VALID | True | 13490 | food-101/images/red_velvet_cake/3143813.jpg | red_velvet_cake | 19357 | VALID | True | 19357 |
3 | 3049 | 98686 | 0.528563 | food-101/images/shrimp_and_grits/1047420.jpg | shrimp_and_grits | 3049 | VALID | True | 3049 | food-101/images/club_sandwich/2465517.jpg | club_sandwich | 98686 | VALID | True | 98686 |
4 | 30949 | 65310 | 0.547157 | food-101/images/sushi/3100962.jpg | sushi | 30949 | VALID | True | 30949 | food-101/images/deviled_eggs/3145324.jpg | deviled_eggs | 65310 | VALID | True | 65310 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6045 | 40611 | 40758 | 0.772242 | food-101/images/chocolate_cake/2533462.jpg | chocolate_cake | 40611 | VALID | True | 40611 | food-101/images/chocolate_cake/652245.jpg | chocolate_cake | 40758 | VALID | True | 40758 |
6046 | 22826 | 66582 | 0.772263 | food-101/images/dumplings/1325469.jpg | dumplings | 22826 | VALID | True | 22826 | food-101/images/escargots/1488896.jpg | escargots | 66582 | VALID | True | 66582 |
6047 | 96748 | 96668 | 0.772266 | food-101/images/steak/513129.jpg | steak | 96748 | VALID | True | 96748 | food-101/images/steak/3113772.jpg | steak | 96668 | VALID | True | 96668 |
6048 | 61641 | 78643 | 0.772278 | food-101/images/chocolate_mousse/1463326.jpg | chocolate_mousse | 61641 | VALID | True | 61641 | food-101/images/tiramisu/849295.jpg | tiramisu | 78643 | VALID | True | 78643 |
6049 | 84483 | 84139 | 0.772282 | food-101/images/baby_back_ribs/645544.jpg | baby_back_ribs | 84483 | VALID | True | 84483 | food-101/images/baby_back_ribs/1571645.jpg | baby_back_ribs | 84139 | VALID | True | 84139 |
6050 rows × 15 columns
outliers_df = outliers_df[['filename_outlier', 'filename_nearest', 'distance', 'label_outlier', 'label_nearest']]
outliers_df
| | filename_outlier | filename_nearest | distance | label_outlier | label_nearest |
---|---|---|---|---|---|
0 | food-101/images/breakfast_burrito/462294.jpg | food-101/images/tacos/1505262.jpg | 0.379365 | breakfast_burrito | tacos |
1 | food-101/images/macarons/2117640.jpg | food-101/images/fish_and_chips/2079080.jpg | 0.429240 | macarons | fish_and_chips |
2 | food-101/images/tacos/1505262.jpg | food-101/images/red_velvet_cake/3143813.jpg | 0.515785 | tacos | red_velvet_cake |
3 | food-101/images/shrimp_and_grits/1047420.jpg | food-101/images/club_sandwich/2465517.jpg | 0.528563 | shrimp_and_grits | club_sandwich |
4 | food-101/images/sushi/3100962.jpg | food-101/images/deviled_eggs/3145324.jpg | 0.547157 | sushi | deviled_eggs |
... | ... | ... | ... | ... | ... |
6045 | food-101/images/chocolate_cake/2533462.jpg | food-101/images/chocolate_cake/652245.jpg | 0.772242 | chocolate_cake | chocolate_cake |
6046 | food-101/images/dumplings/1325469.jpg | food-101/images/escargots/1488896.jpg | 0.772263 | dumplings | escargots |
6047 | food-101/images/steak/513129.jpg | food-101/images/steak/3113772.jpg | 0.772266 | steak | steak |
6048 | food-101/images/chocolate_mousse/1463326.jpg | food-101/images/tiramisu/849295.jpg | 0.772278 | chocolate_mousse | tiramisu |
6049 | food-101/images/baby_back_ribs/645544.jpg | food-101/images/baby_back_ribs/1571645.jpg | 0.772282 | baby_back_ribs | baby_back_ribs |
6050 rows × 5 columns
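Not every outlier is a mislabel: several rows in the table pair two images that share a label (steak with steak, for example). One possible way to narrow the list, which the original notebook does not do, is to keep only the rows where the outlier's label disagrees with its nearest neighbor's label. The sketch below runs on three rows copied from the table; `label_disagreements` is a hypothetical helper name.

```python
import pandas as pd


def label_disagreements(outliers: pd.DataFrame) -> pd.DataFrame:
    """Keep rows where the outlier's label differs from its nearest neighbor's label."""
    mask = outliers["label_outlier"] != outliers["label_nearest"]
    return outliers[mask].reset_index(drop=True)


# Three rows copied from the outlier table above.
sample = pd.DataFrame(
    {
        "filename_outlier": [
            "food-101/images/breakfast_burrito/462294.jpg",
            "food-101/images/dumplings/1325469.jpg",
            "food-101/images/steak/513129.jpg",
        ],
        "filename_nearest": [
            "food-101/images/tacos/1505262.jpg",
            "food-101/images/escargots/1488896.jpg",
            "food-101/images/steak/3113772.jpg",
        ],
        "distance": [0.379365, 0.772263, 0.772266],
        "label_outlier": ["breakfast_burrito", "dumplings", "steak"],
        "label_nearest": ["tacos", "escargots", "steak"],
    }
)

candidates = label_disagreements(sample)
print(candidates[["label_outlier", "label_nearest"]])
```

A label disagreement is only a hint. The image may simply be unusual for its class, so the remaining rows still need visual inspection.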
Let's select the top 30 outliers and display them.
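# Take a copy so that adding the preview columns below does not raise SettingWithCopyWarning
outliers_df = outliers_df.head(30).copy()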
import base64
from io import BytesIO
from PIL import Image
def resize_and_encode_image(image_path, width=100):
    with Image.open(image_path) as img:
        wpercent = (width / float(img.size[0]))
        height = int((float(img.size[1]) * float(wpercent)))
        resized_img = img.resize((width, height))
        buffered = BytesIO()
        resized_img.save(buffered, format="PNG")
        encoded_string = base64.b64encode(buffered.getvalue()).decode('utf-8')
    return f'<img src="data:image/png;base64,{encoded_string}" width="{width}">'

def display_image(image_path, width=100):
    if isinstance(image_path, str):
        return resize_and_encode_image(image_path, width)
    else:
        return ''
outliers_df['filename_outlier_preview'] = outliers_df['filename_outlier'].apply(lambda x: display_image(x, width=100))
outliers_df['filename_nearest_preview'] = outliers_df['filename_nearest'].apply(lambda x: display_image(x, width=100))
display(outliers_df.style)
| | filename_outlier | filename_nearest | distance | label_outlier | label_nearest | filename_outlier_preview | filename_nearest_preview |
---|---|---|---|---|---|---|---|
0 | food-101/images/breakfast_burrito/462294.jpg | food-101/images/tacos/1505262.jpg | 0.379365 | breakfast_burrito | tacos | ||
1 | food-101/images/macarons/2117640.jpg | food-101/images/fish_and_chips/2079080.jpg | 0.429240 | macarons | fish_and_chips | ||
2 | food-101/images/tacos/1505262.jpg | food-101/images/red_velvet_cake/3143813.jpg | 0.515785 | tacos | red_velvet_cake | ||
3 | food-101/images/shrimp_and_grits/1047420.jpg | food-101/images/club_sandwich/2465517.jpg | 0.528563 | shrimp_and_grits | club_sandwich | ||
4 | food-101/images/sushi/3100962.jpg | food-101/images/deviled_eggs/3145324.jpg | 0.547157 | sushi | deviled_eggs | ||
5 | food-101/images/pho/2399877.jpg | food-101/images/hot_dog/1823010.jpg | 0.573438 | pho | hot_dog | ||
6 | food-101/images/pho/1840846.jpg | food-101/images/chocolate_mousse/456162.jpg | 0.574433 | pho | chocolate_mousse | ||
7 | food-101/images/chocolate_cake/2518457.jpg | food-101/images/paella/3838854.jpg | 0.576987 | chocolate_cake | paella | ||
8 | food-101/images/tacos/1091159.jpg | food-101/images/ice_cream/618711.jpg | 0.583393 | tacos | ice_cream | ||
9 | food-101/images/red_velvet_cake/2894652.jpg | food-101/images/red_velvet_cake/2750594.jpg | 0.589379 | red_velvet_cake | red_velvet_cake | ||
10 | food-101/images/waffles/720603.jpg | food-101/images/lasagna/1142842.jpg | 0.591061 | waffles | lasagna | ||
11 | food-101/images/pad_thai/2614597.jpg | food-101/images/apple_pie/2008772.jpg | 0.592497 | pad_thai | apple_pie | ||
12 | food-101/images/prime_rib/587532.jpg | food-101/images/poutine/529562.jpg | 0.594438 | prime_rib | poutine | ||
13 | food-101/images/macarons/2591602.jpg | food-101/images/cup_cakes/3299930.jpg | 0.594465 | macarons | cup_cakes | ||
14 | food-101/images/hamburger/1608876.jpg | food-101/images/cheese_plate/2206573.jpg | 0.596191 | hamburger | cheese_plate | ||
15 | food-101/images/macaroni_and_cheese/912672.jpg | food-101/images/falafel/2666983.jpg | 0.596902 | macaroni_and_cheese | falafel | ||
16 | food-101/images/peking_duck/388951.jpg | food-101/images/macarons/2710408.jpg | 0.601192 | peking_duck | macarons | ||
17 | food-101/images/steak/2788759.jpg | food-101/images/pulled_pork_sandwich/2098588.jpg | 0.605568 | steak | pulled_pork_sandwich | ||
18 | food-101/images/ice_cream/1837798.jpg | food-101/images/chocolate_cake/662729.jpg | 0.610101 | ice_cream | chocolate_cake | ||
19 | food-101/images/grilled_salmon/795787.jpg | food-101/images/prime_rib/3286982.jpg | 0.611880 | grilled_salmon | prime_rib | ||
20 | food-101/images/miso_soup/881247.jpg | food-101/images/fried_calamari/440673.jpg | 0.615745 | miso_soup | fried_calamari | ||
21 | food-101/images/creme_brulee/1661605.jpg | food-101/images/pork_chop/1569230.jpg | 0.616932 | creme_brulee | pork_chop | ||
22 | food-101/images/ice_cream/1793992.jpg | food-101/images/hot_dog/502977.jpg | 0.619381 | ice_cream | hot_dog | ||
23 | food-101/images/cup_cakes/1005580.jpg | food-101/images/chocolate_cake/2480326.jpg | 0.622404 | cup_cakes | chocolate_cake | ||
24 | food-101/images/onion_rings/2447676.jpg | food-101/images/donuts/921183.jpg | 0.622859 | onion_rings | donuts | ||
25 | food-101/images/bread_pudding/1375816.jpg | food-101/images/chocolate_mousse/2177988.jpg | 0.624186 | bread_pudding | chocolate_mousse | ||
26 | food-101/images/chicken_curry/2523126.jpg | food-101/images/pulled_pork_sandwich/1782028.jpg | 0.625223 | chicken_curry | pulled_pork_sandwich | ||
27 | food-101/images/pho/3642399.jpg | food-101/images/grilled_cheese_sandwich/1709486.jpg | 0.628450 | pho | grilled_cheese_sandwich | ||
28 | food-101/images/cheesecake/2160930.jpg | food-101/images/mussels/2039320.jpg | 0.632786 | cheesecake | mussels | ||
29 | food-101/images/takoyaki/914304.jpg | food-101/images/grilled_salmon/2429320.jpg | 0.633307 | takoyaki | grilled_salmon |
Now we can export the results to a CSV file for further analysis and correction of labels.
outliers_df.drop(columns=['filename_outlier_preview', 'filename_nearest_preview']).to_csv('outliers.csv', index=False)
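For the follow-up analysis, one option is to load the exported file and count how often each class appears as an outlier, which gives a rough idea of where to start reviewing. This is a sketch that assumes the column layout written above; the inline CSV text stands in for outliers.csv so that the example runs on its own.

```python
import io

import pandas as pd

# Inline stand-in for outliers.csv, using rows from the table above.
csv_text = """filename_outlier,filename_nearest,distance,label_outlier,label_nearest
food-101/images/pho/2399877.jpg,food-101/images/hot_dog/1823010.jpg,0.573438,pho,hot_dog
food-101/images/pho/1840846.jpg,food-101/images/chocolate_mousse/456162.jpg,0.574433,pho,chocolate_mousse
food-101/images/tacos/1091159.jpg,food-101/images/ice_cream/618711.jpg,0.583393,tacos,ice_cream
"""

exported = pd.read_csv(io.StringIO(csv_text))

# Number of outlier rows per class, most frequent first.
per_class = exported["label_outlier"].value_counts()
print(per_class)
```

With the real export, replacing `io.StringIO(csv_text)` with `'outliers.csv'` would read the file written in the previous step.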
In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.
To explore the dataset and issues interactively in a browser, run:
fd.explore()
Note - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize the results in a non-interactive way using fastdup's built-in galleries.
You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.
That's a wrap! In this notebook, we showed how to find potential mislabels in a labeled dataset.
Next, feel free to check out other tutorials -
As usual, feedback is welcome! Questions? Drop by our Slack channel or open an issue on GitHub.