The Europeana IIIF APIs allow us to download, share, and reuse the images and text of Europeana newspapers.
This notebook shows how to explore the repository: run a search, read a record, obtain the full text, and create a CSV dataset.
The Europeana IIIF APIs require an API key to access the endpoints. Please register at https://pro.europeana.eu/page/get-api to get a key.
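Since the key is personal, one option is to keep it out of the notebook and read it from the environment. The following is only a sketch: the variable name `EUROPEANA_API_KEY` and the helper `load_api_key` are hypothetical names chosen for illustration and are not part of the Europeana APIs.

```python
import os

# Placeholder used in this notebook when no real key has been configured.
PLACEHOLDER_KEY = 'add_your_api'

def load_api_key(env=os.environ, var_name='EUROPEANA_API_KEY'):
    """Return the key stored in `var_name`, or the placeholder if it is unset.

    `var_name` is a hypothetical variable name, not one defined by Europeana.
    """
    return env.get(var_name, PLACEHOLDER_KEY)

key = load_api_key()
if key == PLACEHOLDER_KEY:
    print('No API key found in the environment; using the placeholder.')
else:
    print('API key loaded from the environment.')
```

Passing the environment as a parameter keeps the helper easy to check with a plain dictionary.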
import requests, csv
import json
import pandas as pd
In this section, we set our API key and the text that we want to search for.
api_key = 'add_your_api'  # replace with your own Europeana API key
query = 'paris'
The API lets us search the text and retrieve the matching passages (hits) with the search term highlighted, as in traditional search systems (e.g. Lucene and Solr).
url = 'https://newspapers.eanadev.org/api/v2/search.json'
r = requests.get(url, params = {'query': query, 'profile': 'hits', 'wskey': api_key })
print(r.url)
response = r.text
#print(response)
results = json.loads(response)
for hit in results['hits']:
    print('id:' + hit['scope'])
    for s in hit['selectors']:
        print(s.get('prefix', '') + s.get('exact', '') + s.get('suffix', ''))
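Each selector of a hit carries the matched text (`exact`) and, optionally, the text before and after it (`prefix`, `suffix`). As a small sketch, the helper below rebuilds one snippet per selector and wraps the match in a marker so that it stands out. The function name `highlight_snippets`, the `**` marker, and the sample hit are assumptions made for illustration; the sample is made-up data, not an actual API response.

```python
def highlight_snippets(hit, marker='**'):
    """Return one snippet per selector, with the matched text wrapped in `marker`."""
    snippets = []
    for selector in hit.get('selectors', []):
        prefix = selector.get('prefix', '')
        exact = selector.get('exact', '')
        suffix = selector.get('suffix', '')
        snippets.append(f'{prefix}{marker}{exact}{marker}{suffix}')
    return snippets

# Made-up hit with the same keys used in the loop above.
sample_hit = {
    'scope': '/example/record_1',  # hypothetical record id
    'selectors': [
        {'prefix': 'A trip to ', 'exact': 'Paris', 'suffix': ' in spring'},
        {'exact': 'Paris'},  # prefix and suffix may be missing
    ],
}

for snippet in highlight_snippets(sample_hit):
    print(snippet)
```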
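# newline='' avoids blank rows on Windows; keep the handle so the file can be closed (csv_file.close()) after writing.
csv_file = open('eu_records.csv', 'w', newline='', encoding='utf-8')
csv_out = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
csv_out.writerow(['title', 'thumbnail', 'date', 'license', 'typem', 'language', 'fulltextUrl', 'manifestUrl', 'fulltext'])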
A manifest describes the information needed for a viewer to present a digital object to the user, such as the title and the sequence of views/images. We can also retrieve the manifest of each item. According to the Europeana documentation, the request follows the pattern https://iiif.europeana.eu/presentation/[RECORD_ID]/manifest
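The URL pattern above can be captured in two small helpers. This is a sketch only: the function names are hypothetical, the `annopage/1` path is taken from the code in this notebook rather than from the documentation, and stripping the slashes around the record ID is an assumption about how the ID may be delivered by the search API.

```python
IIIF_BASE = 'https://iiif.europeana.eu/presentation'

def manifest_url(record_id):
    """Build the manifest URL for a record ID such as '9200303/BibliographicResource_...'."""
    return f"{IIIF_BASE}/{record_id.strip('/')}/manifest"

def annopage_url(record_id, page=1):
    """Build the URL of an annotation page (the notebook only requests page 1)."""
    return f"{IIIF_BASE}/{record_id.strip('/')}/annopage/{page}"

record_id = '/9200303/BibliographicResource_3000059898023'
print(manifest_url(record_id))
print(annopage_url(record_id))
```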
The manifest includes the metadata; some of the attributes are multivalued.
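Because an attribute can hold several values, taking only the first one may lose information. The sketch below collects every value for a given label. It assumes the metadata layout used elsewhere in this notebook (a list of entries with a `label` and a `value` list of objects holding `@value`); the helper name `metadata_values` and the sample manifest are made up for illustration.

```python
def metadata_values(manifest, label):
    """Return all values stored under `label` in a manifest's metadata list."""
    values = []
    for entry in manifest.get('metadata', []):
        if entry.get('label') != label:
            continue
        raw = entry.get('value', [])
        if not isinstance(raw, list):
            raw = [raw]  # tolerate a single value that is not wrapped in a list
        for item in raw:
            if isinstance(item, dict):
                values.append(item.get('@value', ''))
            else:
                values.append(str(item))
    return values

# Made-up manifest fragment, not an actual API response.
sample_manifest = {
    'metadata': [
        {'label': 'type', 'value': [{'@value': 'TEXT'}]},
        {'label': 'language', 'value': [{'@value': 'fr'}, {'@value': 'de'}]},
    ]
}

print(metadata_values(sample_manifest, 'language'))
print('; '.join(metadata_values(sample_manifest, 'language')))
```

Joining the values with a separator such as `'; '` is one way to keep them in a single CSV column.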
The full text of a page is available at a URL such as https://www.europeana.eu/api/fulltext/9200303/BibliographicResource_3000059898023/472ef0641de5cce2ba8eb26d67110ed6#char=0,10o
results = json.loads(response)
for r in results['hits']:
    title = thumbnail = date = license = typem = language = fulltextUrl = manifestUrl = fulltext = ''
    manifestUrl = 'https://iiif.europeana.eu/presentation/' + r['scope'] + '/manifest'
    responseManifest = requests.get(manifestUrl, params={'wskey': api_key})
    print(responseManifest.url)
    # parse the manifest JSON
    m = json.loads(responseManifest.text)
    # retrieve the metadata
    title = m['label'][0]['@value']
    thumbnail = m['thumbnail']['@id']
    if 'navDate' in m:
        date = m['navDate']
    license = m['license']
    # only the first value of each attribute is kept
    for i in m['metadata']:
        if i['label'] == 'type':
            typem = i['value'][0]['@value']
        elif i['label'] == 'language':
            language = i['value'][0]['@value']
    # get the full text: request the first annotation page and read the resource URL from it
    annopageUrl = 'https://iiif.europeana.eu/presentation/' + r['scope'] + '/annopage/1'
    responseAnnopage = requests.get(annopageUrl, params={'wskey': api_key})
    print(responseAnnopage.url)
    a = json.loads(responseAnnopage.text)
    if 'resources' in a:
        fulltextUrl = a['resources'][0]['resource']['@id']
        print(fulltextUrl)
    if fulltextUrl != '':
        responseFulltext = requests.get(fulltextUrl, params={'wskey': api_key})
        # parse the full-text response
        f = json.loads(responseFulltext.text)
        # TODO check encoding
        fulltext = f['value']
    print('-------')
    csv_out.writerow([title, thumbnail, date, license, typem, language, fulltextUrl, manifestUrl, fulltext])
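Rows written through a file that is never closed may still sit in a buffer when the CSV is read back. One way to avoid this, sketched below, is to write inside a `with` block so that the file is flushed and closed on exit. The helper names, the demo file name, and the sample row are hypothetical and only serve the illustration.

```python
import csv
import os
import tempfile

def write_records(path, header, rows):
    """Write a header and rows to a CSV file; the file is closed when the block exits."""
    with open(path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(header)
        writer.writerows(rows)

def read_records(path):
    """Read the CSV file back as a list of rows (each row is a list of strings)."""
    with open(path, newline='', encoding='utf-8') as csv_file:
        return list(csv.reader(csv_file))

# Demo with a made-up row, written to the system's temporary directory.
demo_path = os.path.join(tempfile.gettempdir(), 'eu_records_demo.csv')
write_records(demo_path, ['title', 'language'], [['Sample title, with a comma', 'fr']])
print(read_records(demo_path))
```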
# Load the CSV file written above (a local file)
# into a pandas DataFrame
df = pd.read_csv('eu_records.csv')
df
Once we have queried the repository and we have the metadata as a CSV file, let's show the results as a thumbnail gallery.
from IPython.display import HTML, Image
def _src_from_data(data):
    """Base64 encodes image bytes for inclusion in an HTML img element"""
    img_obj = Image(data=data)
    for bundle in img_obj._repr_mimebundle_():
        for mimetype, b64value in bundle.items():
            if mimetype.startswith('image/'):
                return f'data:{mimetype};base64,{b64value}'

def gallery(images, row_height='auto'):
    """Shows a set of images in a gallery that flexes with the width of the notebook.

    Parameters
    ----------
    images: list of str or bytes
        URLs or bytes of images to display
    row_height: str
        CSS height value to assign to all images. Set to 'auto' by default to show images
        with their native dimensions. Set to a value like '250px' to make all rows
        in the gallery equal height.
    """
    figures = []
    for image in images:
        if isinstance(image, bytes):
            src = _src_from_data(image)
            caption = ''
        else:
            src = image
            caption = f'<figcaption style="font-size: 0.6em">{image}</figcaption>'
        figures.append(f'''
            <figure style="margin: 5px !important;">
              <img src="{src}" style="height: {row_height}">
            </figure>
        ''')
    return HTML(data=f'''
        <div style="display: flex; flex-flow: row wrap; text-align: center;">
        {''.join(figures)}
        </div>
    ''')
#gallery(urls, row_height='150px')
gallery(df['thumbnail'], row_height='150px')