Creating your own dataset from Google Images

by: Francisco Ingham and Jeremy Howard. Inspired by Adrian Rosebrock

In this tutorial we will see how to easily create an image dataset through Google Images. Note: You will have to repeat these steps for any new category you want to Google (e.g. once for dogs and once for cats).

Get a list of URLs

Search and scroll

Go to Google Images and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.

It is a good idea to put things you want to exclude into the search query. For instance, if you are searching for the Eurasian wolf, "canis lupus lupus", it might be a good idea to exclude other variants:

"canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis
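
If you search for many categories, it may be convenient to assemble such queries programmatically. The following is a minimal sketch; `build_query` is a hypothetical helper introduced here for illustration, not part of fastai or of any Google API. It only produces the text that you would then paste into the search box.

```python
def build_query(phrase, exclude):
    """Quote the exact phrase and prefix each unwanted term with a minus sign."""
    excluded = " ".join(f"-{term}" for term in exclude)
    return f'"{phrase}" {excluded}'.strip()


query = build_query(
    "canis lupus lupus",
    ["dog", "arctos", "familiaris", "baileyi", "occidentalis"],
)
print(query)
```

The printed string matches the example query above.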

You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.

Download into file

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

Press Ctrl+Shift+J on Windows/Linux or Cmd+Opt+J on Mac, and a small window, the JavaScript 'Console', will appear. That is where you will paste the JavaScript commands.

You will need to get the URLs of each of the images. You can do this by running the following commands:

urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));

Create directory and upload urls file into your server

In [1]:
from fastai import *
from fastai.vision import *

Choose an appropriate name for your labeled images. You can run these steps multiple times to grab different labels.

In [4]:
folder = 'powdery_mildew'
file = 'urls_powdery_mildew.txt'
In [17]:
folder = 'blight'
file = 'urls_blight.txt'
In [22]:
folder = 'rust'
file = 'urls_rust.txt'
In [26]:
folder = 'mosaic'
file = 'urls_mosaic.txt'

You will need to run this line once per category.

In [2]:
path = Path('data/plant_diseases')
In [5]:
dest = path / folder
dest.mkdir(parents=True, exist_ok=True)
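
Instead of re-running the cells above for each category, the folders could also be created in one loop. This is only a sketch using the standard library: `make_class_dirs` is a hypothetical helper, and the `urls_<class>.txt` naming is an assumption that follows the file names used above. It runs in a temporary directory here so that it does not touch your data; in the notebook the root would be `path`.

```python
from pathlib import Path
import tempfile

classes = ['powdery_mildew', 'blight', 'rust', 'mosaic']


def make_class_dirs(root, classes):
    """Create root/<class> for each class; return (urls file, folder) pairs."""
    pairs = []
    for name in classes:
        dest = Path(root) / name
        dest.mkdir(parents=True, exist_ok=True)
        # Assumed naming convention: one URL file per class, stored in root.
        pairs.append((Path(root) / f"urls_{name}.txt", dest))
    return pairs


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    pairs = make_class_dirs(Path(tmp) / "plant_diseases", classes)
    created = sorted(d.name for _, d in pairs if d.is_dir())
    print(created)
```

Each returned pair holds the two arguments that the download step needs for one category.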

Finally, upload your urls file. You just need to press 'Upload' in your working directory and select your file, then click 'Upload' for each of the displayed files.
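
Before downloading, it can be worth sanity-checking the uploaded file. The sketch below keeps only unique `http`/`https` entries, which would filter out entries such as the `x-raw-image:///...` one that shows up as an error in the download log later in this notebook. `clean_urls` is a hypothetical helper and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse


def clean_urls(lines):
    """Keep unique http(s) URLs, preserving their original order."""
    seen = set()
    kept = []
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        if urlparse(url).scheme in ("http", "https"):
            seen.add(url)
            kept.append(url)
    return kept


# Made-up sample standing in for the lines of a urls file.
sample = [
    "https://example.com/a.jpg",
    "x-raw-image:///225f4da7",
    "https://example.com/a.jpg",
    "",
    "http://example.org/b.png",
]
print(clean_urls(sample))
```

To apply this to a real file, you could pass it the result of `Path(...).read_text().splitlines()` and write the cleaned list back out.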

Download images

Now you will need to download your images from their respective URLs.

fast.ai has a function that allows you to do just that. You just have to specify the URLs filename and the destination folder, and this function will download and save all images that can be opened. Images that cannot be opened will not be saved.

Let's download our images! Notice you can choose a maximum number of images to be downloaded. In this case we will not download all the urls.

You will need to run this line once for every category.

In [6]:
classes = ['powdery_mildew', 'blight', 'rust', 'mosaic']

Download images for powdery mildew:

In [12]:
download_images(path / file, dest, max_pics=200)
100.00% [200/200 00:15<00:00]
Error https://www.skynursery.com/wp-content/uploads/2016/07/PowderyMildewOnSquash.jpg HTTPSConnectionPool(host='www.skynursery.com', port=443): Max retries exceeded with url: /wp-content/uploads/2016/07/PowderyMildewOnSquash.jpg (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')])")))

Download images for blight:

In [21]:
download_images(path / file, dest, max_pics=200)
100.00% [200/200 00:39<00:00]
Error https://www.veggiegardener.com/wp-content/uploads/sites/3/2009/06/Tips-for-Preventing-and-Treating-Tomato-Blights.jpg HTTPConnectionPool(host='127.0.0.1', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc491b62e48>: Failed to establish a new connection: [Errno 111] Connection refused'))
Error https://extension.umd.edu/sites/default/files/_images/programs/grow_it_eat_it/diseases/EarlyBlight/20080710-early%20blight%20lesions%20with%20yellow%20haloes.jpg HTTPSConnectionPool(host='extension.umd.edu', port=443): Max retries exceeded with url: /sites/default/files/_images/programs/grow_it_eat_it/diseases/EarlyBlight/20080710-early%20blight%20lesions%20with%20yellow%20haloes.jpg (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')])")))
Error https://extension.umd.edu/sites/default/files/_images/programs/grow_it_eat_it/diseases/EarlyBlight/20020531-early%20blight%20%20starts%20on%20lower%20leaves.jpg HTTPSConnectionPool(host='extension.umd.edu', port=443): Max retries exceeded with url: /sites/default/files/_images/programs/grow_it_eat_it/diseases/EarlyBlight/20020531-early%20blight%20%20starts%20on%20lower%20leaves.jpg (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')])")))
Error https://www2.gov.bc.ca/assets/gov/farming-natural-resources-and-industry/agriculture-and-seafood/animal-and-crops/plant-health-images/hazelnt_bl1.jpg ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Download images for rust:

In [25]:
download_images(path / file, dest, max_pics=200)
100.00% [200/200 00:18<00:00]
Error https://extension.umd.edu/sites/default/files/_images/programs/hgic/Diseases/CedarAppleRustGalls.jpg HTTPSConnectionPool(host='extension.umd.edu', port=443): Max retries exceeded with url: /sites/default/files/_images/programs/hgic/Diseases/CedarAppleRustGalls.jpg (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')])")))
Error http://agriculture.vic.gov.au/__data/assets/image/0014/228002/stem-rust-example.jpg HTTPConnectionPool(host='agriculture.vic.gov.au', port=80): Read timed out. (read timeout=4)
Error http://agriculture.vic.gov.au/__data/assets/image/0019/228025/blueberry-rust2.jpg HTTPConnectionPool(host='agriculture.vic.gov.au', port=80): Read timed out. (read timeout=4)

Download images for mosaic:

In [28]:
download_images(path / file, dest, max_pics=200)
100.00% [200/200 00:35<00:00]
Error http://nwdistrict.ifas.ufl.edu/phag/files/2014/11/Paret-Fig-4.jpg HTTPConnectionPool(host='nwdistrict.ifas.ufl.edu', port=80): Read timed out. (read timeout=4)
Error x-raw-image:///225f4da7727fd1423ba9342df7c704665f43a39b6852dc6a0cac295f63fe2824 No connection adapters were found for 'x-raw-image:///225f4da7727fd1423ba9342df7c704665f43a39b6852dc6a0cac295f63fe2824'
In [ ]:
# If you have problems downloading, try with `max_workers=0` to see exceptions:
# download_images(path/file, dest, max_pics=20, max_workers=0)

Then we can remove any images that can't be opened:

In [29]:
for c in classes:
    print(c)
    verify_images(path / c, delete=True, max_workers=8)
powdery_mildew
100.00% [199/199 00:00<00:00]
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000087.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000076.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000181.jpeg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000162.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000113.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000099.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000097.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000021.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/powdery_mildew/00000184.jpeg'
blight
100.00% [196/196 00:00<00:00]
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/blight/00000119.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/blight/00000193.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/blight/00000044.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/blight/00000169.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/blight/00000081.jpg'
rust
100.00% [194/194 00:00<00:00]
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/rust/00000138.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/rust/00000043.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/rust/00000072.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/rust/00000052.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/rust/00000012.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/rust/00000193.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/rust/00000063.png'
mosaic
100.00% [197/197 00:00<00:00]
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000088.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000156.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000023.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000148.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000192.gif'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000157.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000179.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000018.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000122.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000055.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000143.jpg'
cannot identify image file '/home/cedric/course-v3/nbs/dl1/data/plant_diseases/mosaic/00000177.jpg'
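
The "cannot identify image file" messages above indicate files whose contents were not recognized as images, whatever their extension says. As a rough illustration of the idea, the sketch below checks the first bytes of a file against a few well-known format signatures. This is a much cruder check than actually opening the image and is not how `verify_images` is implemented; `sniff_image_type` and the sample files are hypothetical.

```python
from pathlib import Path
import tempfile

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(file_path):
    """Return a format name if the file starts with a known signature, else None."""
    with open(file_path, "rb") as f:
        header = f.read(8)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "00000001.png"
    good.write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    # A file with an image extension whose bytes are actually HTML.
    bad = Path(tmp) / "00000002.jpg"
    bad.write_bytes(b"<html>Not found</html>")
    results = {p.name: sniff_image_type(p) for p in (good, bad)}
    print(results)
```

A signature match does not guarantee that the rest of the file is intact, so this is best seen as a quick pre-check rather than a replacement for the verification step.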

View data

In [7]:
np.random.seed(42)
In [8]:
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
In [ ]:
# If you already cleaned your data, run this cell instead of the one before
# np.random.seed(42)
# data = ImageDataBunch.from_csv(".", folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
#         ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
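
Setting the random seed before building the data matters because the validation set is chosen at random: with a fixed seed, the same 20% of images end up in the validation set on every run. The sketch below illustrates that idea with the standard library only. It is not fastai's internal implementation, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_items, valid_pct, seed):
    """Shuffle item indices with a fixed seed and split off a validation share."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    n_valid = int(n_items * valid_pct)
    return idxs[n_valid:], idxs[:n_valid]  # (train, valid)


train_a, valid_a = split_indices(100, 0.2, seed=42)
train_b, valid_b = split_indices(100, 0.2, seed=42)

print(len(train_a), len(valid_a))  # 80 20
print(valid_a == valid_b)          # True: same seed, same split
```

Keeping the split stable makes validation results comparable across runs, since the model is always evaluated on the same held-out images.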

Good! Let's take a look at some of our pictures then.

In [9]:
data.classes
Out[9]:
['blight', 'mosaic', 'powdery_mildew', 'rust']
In [10]:
data.show_batch(rows=5, figsize=(13, 12))
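
Alongside looking at a batch, it may be useful to check how many files remain in each class folder after the verification step, since a strongly unbalanced dataset can be worth fixing before training. A minimal sketch with the standard library follows; `count_files_per_class` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory rather than the real data.

```python
from pathlib import Path
import tempfile


def count_files_per_class(root):
    """Map each immediate subfolder name to the number of files it contains."""
    return {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in sorted(Path(root).iterdir())
        if d.is_dir()
    }


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder folders and files standing in for downloaded images.
    for name, n in [("blight", 3), ("rust", 2)]:
        folder = Path(tmp) / name
        folder.mkdir()
        for i in range(n):
            (folder / f"{i:08d}.jpg").write_bytes(b"")
    counts = count_files_per_class(tmp)
    print(counts)
```

In the notebook you would call it as `count_files_per_class(path)`.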