Smithsonian Open Access allows us to download, share, and reuse millions of the Smithsonian’s images and data from across its 19 museums, nine research centers, libraries, archives, and the National Zoo.
This notebook introduces how to explore the repository and create a CSV dataset. In addition, this example applies computer vision methods based on face detection, which has gained relevance in fields such as photography and marketing.
The Open Access API requires an API key to access the endpoints. Please register with https://api.data.gov/signup/ to get a key.
import requests, csv
import json
import pandas as pd
import cv2
import os
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw, ImageFont, ImageOps
from io import BytesIO
In this section, we set our api_key, the query text used to search for records, and the number of records to retrieve.
api_key = 'YOUR_API_KEY' # add your own api_key
q = 'theodore roosevelt' # querystring
rows = '100' # number of records to retrieve
Please visit https://edan.si.edu/openaccess/apidocs/#api-search-search for more information.
url = 'https://api.si.edu/openaccess/api/v1.0/search'
r = requests.get(url, params = {'q': q, 'start':'0', 'rows': rows, 'api_key': api_key })
print(r.url)
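The printed URL can also be assembled by hand, which makes the query encoding explicit. The sketch below uses a hypothetical helper, `build_search_url`, that mirrors what `requests` builds from the `params` dictionary (spaces become `+`, and the key/value pairs are URL-encoded):

```python
from urllib.parse import urlencode

def build_search_url(q, rows, api_key, start=0):
    """Assemble the Open Access search URL by hand (hypothetical helper)."""
    base = 'https://api.si.edu/openaccess/api/v1.0/search'
    query = urlencode({'q': q, 'start': start, 'rows': rows, 'api_key': api_key})
    return f'{base}?{query}'

print(build_search_url('theodore roosevelt', 100, 'YOUR_API_KEY'))
```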
response = r.text
csv_out = csv.writer(open('si_records.csv', 'w'), delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
csv_out.writerow(['id','title', 'date', 'media_usage', 'data_source', 'dimensions', 'sitter', 'type', 'medium', 'artist', 'manifestUrl', 'imageUrl'])
results = json.loads(response)
for r in results['response']['rows']:
    print(r['id'] + ' ' + r['title'])
    # print(r)  # uncomment to inspect the full record
    # getting the identifiers of the records to access the IIIF manifests
    try:
        media = r['content']['descriptiveNonRepeating']['online_media']['media']
        for m in media:
            idsId = m['idsId']
            print(idsId)
            # retrieving the manifest
            iiifUrl = 'https://ids.si.edu/ids/manifest/' + idsId
            iiifItemResponse = requests.get(iiifUrl)
            imageUrl = 'https://ids.si.edu/ids/iiif/' + idsId + '/full/full/0/default.jpg'
            print(imageUrl)
            iiifItem = json.loads(iiifItemResponse.text)
            # retrieving metadata
            title = date = licence = datasource = dimensions = sitter = typem = medium = artist = ''
            for item in iiifItem['metadata']:
                if item['label'] == 'Title':
                    title = item['value']
                elif item['label'] == 'Date':
                    date = item['value']
                elif item['label'] == 'Media Usage':
                    licence = item['value']
                elif item['label'] == 'Data Source':
                    datasource = item['value']
                elif item['label'] == 'Dimensions':
                    dimensions = item['value']
                elif item['label'] == 'Sitter':
                    sitter = item['value']
                elif item['label'] == 'Type':
                    typem = item['value']
                elif item['label'] == 'Medium':
                    medium = item['value']
                elif item['label'] == 'Artist':
                    artist = item['value']
            csv_out.writerow([idsId, title, date, licence, datasource, dimensions,
                              sitter, typem, medium, artist, iiifUrl, imageUrl])
    except Exception as e:
        print('An exception occurred:', e)
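The CSV writer above is created from a bare `open()` call, so the file is only flushed when the interpreter gets around to closing it. A `with` block guarantees the file is closed as soon as writing finishes. Below is a minimal restructuring sketch; the `sample_row` values are hypothetical stand-ins for one parsed manifest, not real records:

```python
import csv

fields = ['id', 'title', 'date', 'media_usage', 'data_source', 'dimensions',
          'sitter', 'type', 'medium', 'artist', 'manifestUrl', 'imageUrl']

# Hypothetical sample row standing in for one parsed IIIF manifest
sample_row = ['NPG-12345', 'Theodore Roosevelt', '1903', 'CC0',
              'National Portrait Gallery', '61cm x 47cm', 'Theodore Roosevelt',
              'Photograph', 'Gelatin silver print', 'Unknown',
              'https://ids.si.edu/ids/manifest/NPG-12345',
              'https://ids.si.edu/ids/iiif/NPG-12345/full/full/0/default.jpg']

# The with block closes and flushes the file automatically
with open('si_records_safe.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(fields)
    writer.writerow(sample_row)
```

Passing `newline=''` is the csv module's recommended way to avoid blank lines on Windows.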
We can use Pandas to give us a quick overview of the dataset.
# Load the CSV file we just created.
# This puts the data in a Pandas DataFrame
df = pd.read_csv('si_records.csv')
df
# How many items?
len(df)
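Not every record carries every metadata field, so it is worth checking how sparse each column is before analysing it. `isna().sum()` counts the missing values per column; the sketch below runs it on a hypothetical miniature stand-in for `si_records.csv`:

```python
import pandas as pd

# Hypothetical miniature stand-in for si_records.csv
df_demo = pd.DataFrame({
    'title': ['Theodore Roosevelt', 'Theodore Roosevelt', None],
    'artist': ['Pach Brothers', None, None],
})

# Count how many values are missing in each column
missing = df_demo.isna().sum()
print(missing)
```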
# Get unique values
artist = pd.unique(df['artist'].str.split('|', expand=True).stack()).tolist()
for a in sorted(artist):
    print(a)
# Splits the artist column and counts frequencies
artist_counts = (
    df['artist']
    .str.split('|')
    .apply(lambda x: pd.Series(x).value_counts())
    .sum()
    .astype('int')
    .sort_values(ascending=False)
    .to_frame()
    .reset_index(level=0)
)
# Add column names
artist_counts.columns = ['name', 'count']
# Display with horizontal bars
display(artist_counts.style.bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'}))
# Get unique values
types = pd.unique(df['type']).tolist()
for t in sorted(types, key=str.lower):
    print(t)
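Beyond listing the unique types, `value_counts` tallies how often each one occurs, which is handy for spotting the dominant medium in the results. A small sketch on a hypothetical sample of the `type` column:

```python
import pandas as pd

# Hypothetical sample of the 'type' column
types = pd.Series(['Photograph', 'Print', 'Photograph', 'Drawing', 'Photograph'])

# Tally each type, most frequent first
counts = types.value_counts()
print(counts)
```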
Face detection is a computer vision technique that locates human faces in digital images and videos. Let's try to identify the faces in our portraits.
Open Source Computer Vision Library (OpenCV) is an open-source computer vision and machine learning software library. In this example, images are treated as standard NumPy arrays of pixel data.
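To make that array representation concrete, here is a tiny synthetic "image" built directly with NumPy: shape `(height, width, 3)` with 8-bit pixel values, exactly what `cv2.imread` returns for a color image:

```python
import numpy as np

# A tiny synthetic 2x2 "image": height x width x 3 channels, 8-bit pixels
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [255, 0, 0]  # set the three channel values of one pixel

print(img.shape)   # (2, 2, 3)
print(img.dtype)   # uint8
```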
Read in the image using the imread function. We will use a sample Smithsonian portrait for demonstration purposes.
test_image = cv2.imread('smithsonian-example.jpg')
plt.imshow(test_image)
Let's inspect the type and shape of the array.
#type(test_image)
print(test_image.shape)
Regarding the color, we expected a brightly colored image but obtained a bluish one. That happens because OpenCV and matplotlib assume different channel orders: OpenCV reads images as BGR, while matplotlib expects RGB. To avoid this issue, we convert the image to the order matplotlib expects using the cvtColor function.
rgb_image = cv2.cvtColor(test_image, cv2.COLOR_BGR2RGB)
plt.imshow(rgb_image)
# We define a helper function for the conversion
def convertToRGB(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
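Under the hood, swapping BGR for RGB just reverses the channel axis, so the conversion can also be written as a plain NumPy slice, `image[..., ::-1]`. A pure-NumPy sketch of that equivalence (so it runs without OpenCV installed):

```python
import numpy as np

def bgr_to_rgb(image):
    """Reverse the channel axis: same effect as cv2.cvtColor(image, cv2.COLOR_BGR2RGB)."""
    return image[..., ::-1]

bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)  # one pure-blue pixel in BGR order
rgb = bgr_to_rgb(bgr)
print(rgb[0, 0].tolist())  # [0, 0, 255] -- blue now sits in the last (blue) channel
```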
haar_cascade_face = cv2.CascadeClassifier('opencv/haarcascade_frontalface_default.xml')
We shall be using the detectMultiScale method of the classifier. It returns a rectangle with coordinates (x, y, w, h) around each detected face.
faces_rects = haar_cascade_face.detectMultiScale(test_image, scaleFactor=1.2, minNeighbors=5)
# Let us print the no. of faces found
print('Faces found: ', len(faces_rects))
Our next step is to loop over the coordinates it returned and draw rectangles around them using OpenCV. We will draw a green rectangle with a thickness of 2.
for (x, y, w, h) in faces_rects:
    cv2.rectangle(test_image, (x, y), (x + w, y + h), (0, 255, 0), 2)
Finally, we display the image to check whether the faces have been detected correctly.
plt.imshow(convertToRGB(test_image))
First we download all the portraits. It may take a while due to the size and quality of the images.
os.makedirs('is-images', exist_ok=True)
for index, row in df.iterrows():
    print(index, row['imageUrl'])
    response = requests.get(row['imageUrl'])
    img = Image.open(BytesIO(response.content))
    img.save('is-images/m-{}.jpg'.format(row['id']), quality=90)
Finally, we process all the images, detecting faces in each portrait and displaying the results in a grid.
rows = 20
files = os.listdir('is-images')
fig = plt.figure(figsize=(100, 300))
for num, filename in enumerate(files):
    img = cv2.imread('is-images/' + filename)
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces_rects = haar_cascade_face.detectMultiScale(img_gray, scaleFactor=1.2, minNeighbors=5)
    for (x, y, w, h) in faces_rects:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    # convert image to RGB and show it in the grid
    img_face = convertToRGB(img)
    plt.subplot(rows, 5, num + 1)
    plt.axis('off')
    plt.imshow(img_face)
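The detected rectangles can also be used to crop the faces out of each portrait for further processing. Since an image is just an array, a crop is plain NumPy slicing: rows correspond to the y axis and columns to the x axis. A sketch on a synthetic image with a hypothetical detection rectangle:

```python
import numpy as np

# Synthetic 100x100 grayscale "image" and a hypothetical detection rectangle
img = np.zeros((100, 100), dtype=np.uint8)
x, y, w, h = 30, 40, 20, 25  # detectMultiScale returns (x, y, w, h) per face

# Slice rows (y axis) first, then columns (x axis)
face_crop = img[y:y + h, x:x + w]
print(face_crop.shape)  # (25, 20)
```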