
Accessing Europeana IIIF APIs¶

The Europeana IIIF APIs allow us to download, share, and reuse the images and text of Europeana newspapers.

This notebook shows how to explore the repository, run a search, read a record, obtain the full text, and create a CSV dataset.

The Europeana IIIF APIs require an API key to access the endpoints. Please register at https://pro.europeana.eu/page/get-api to get a key.

Setting up things¶

In [ ]:

import requests, csv
import json
import pandas as pd

Global configuration¶

In this section, we set our API key and the text that we want to search for.

In [ ]:

api_key = 'add_your_api'  # replace with your own Europeana API key
query = 'paris'
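Hard-coding a key in a shared notebook makes it easy to leak. As a minimal sketch, the key could instead be read from an environment variable. The variable name EUROPEANA_API_KEY and the helper load_api_key are assumptions made for this example, not names defined by Europeana.

```python
import os

def load_api_key(env=os.environ, var_name='EUROPEANA_API_KEY', default='add_your_api'):
    """Return the API key from the environment, or the placeholder if unset.

    EUROPEANA_API_KEY is a hypothetical variable name used for illustration.
    """
    return env.get(var_name, default)

# Possible usage in the configuration cell:
# api_key = load_api_key()
```

If the variable is not set, the helper falls back to the same placeholder used above, so the rest of the notebook behaves as before.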

Performing a search using the API¶

The API lets us search the text and retrieve the hits highlighted, as in traditional search systems (e.g. Lucene and Solr).

In [ ]:

url = 'https://newspapers.eanadev.org/api/v2/search.json'
r = requests.get(url, params = {'query': query, 'profile': 'hits', 'wskey': api_key })
print(r.url)
response = r.text
#print(response)
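To see how the three parameters map onto the request, the query string can be reproduced with the standard library. This sketch only constructs the URL and sends nothing. build_search_url is a helper name introduced here, and the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://newspapers.eanadev.org/api/v2/search.json'

def build_search_url(query, api_key, profile='hits'):
    """Build the search URL from the query text, the profile and the API key."""
    params = {'query': query, 'profile': profile, 'wskey': api_key}
    return SEARCH_URL + '?' + urlencode(params)

print(build_search_url('paris', 'add_your_api'))
```

Multi-word queries are percent-encoded by urlencode, so spaces and accented characters are safe to pass as the query text.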

Displaying the mentions in the transcribed text where the search keyword was found¶

In [ ]:

results = json.loads(response)

for r in results['hits']:
    print('id:' + r['scope'])
    for s in r['selectors']:
        # join the text before, at, and after the match
        print(s.get('prefix', '') + s.get('exact', '') + s.get('suffix', ''))
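The loop above relies on each hit having a scope (the record identifier) and a list of selectors with optional prefix, exact and suffix parts. The sketch below runs the same logic on a small hand-written payload. The sample values are invented for illustration and are not real API output, and format_hits is a helper name introduced here.

```python
def format_hits(results):
    """Return (record_id, snippet) pairs, one per selector of each hit."""
    rows = []
    for hit in results.get('hits', []):
        for selector in hit.get('selectors', []):
            snippet = (selector.get('prefix', '')
                       + selector.get('exact', '')
                       + selector.get('suffix', ''))
            rows.append((hit['scope'], snippet))
    return rows

# Invented sample shaped like the fields used above (not real API output).
sample = {
    'hits': [
        {'scope': '/123/record_a',
         'selectors': [{'prefix': 'arrived in ', 'exact': 'Paris', 'suffix': ' yesterday'},
                       {'exact': 'Paris'}]}
    ]
}

for record_id, snippet in format_hits(sample):
    print(record_id, '|', snippet)
```

Returning the pairs instead of printing them directly makes it easier to reuse the snippets later, for example as an extra CSV column.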

Creating a CSV file¶

In [ ]:

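# newline='' prevents blank rows on Windows; buffering=1 (line buffering) flushes
# each row to disk so the file can be read back later in this notebook
csv_out = csv.writer(open('eu_records.csv', 'w', newline='', encoding='utf-8', buffering=1), delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)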
csv_out.writerow(['title', 'thumbnail', 'date', 'license', 'typem', 'language', 'fulltextUrl', 'manifestUrl', 'fulltext'])
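Full-text fields often contain commas, quotes and line breaks, which is why the writer is configured with a quote character. A small in-memory sketch, using io.StringIO instead of a file and an invented sample row, shows that such a field survives a write and read round trip with the same settings.

```python
import csv
import io

header = ['title', 'fulltext']
row = ['Le Journal', 'Paris, le 3 mai.\n"Nouvelles" du jour']  # invented sample values

# Write to an in-memory buffer with the same dialect settings as above.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(header)
writer.writerow(row)

# Read everything back and compare with what was written.
buffer.seek(0)
rows_read = list(csv.reader(buffer))
print(rows_read[1] == row)
```

With QUOTE_MINIMAL only the fields that contain the delimiter, the quote character or a line break are quoted, so simple values such as the title stay unquoted in the file.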

Retrieving the manifests¶

A manifest describes the information a viewer needs to present a digital object to the user, such as the title and the sequence of views/images. We can also retrieve the manifest of each item. According to the Europeana documentation, the request follows the pattern https://iiif.europeana.eu/presentation/[RECORD_ID]/manifest

The manifest includes the metadata; some of the attributes are multivalued.
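Because some attributes are multivalued, reading only the first entry of a value list can drop information. The sketch below collects every value stored under a label. The manifest fragment is invented and only mirrors the fields this notebook reads (metadata, label, value, @value); metadata_values is a helper name introduced here.

```python
def metadata_values(manifest, label):
    """Collect every '@value' stored under the given metadata label."""
    values = []
    for entry in manifest.get('metadata', []):
        if entry.get('label') == label:
            for item in entry.get('value', []):
                values.append(item['@value'])
    return values

# Invented manifest fragment shaped like the fields this notebook reads.
sample_manifest = {
    'metadata': [
        {'label': 'type', 'value': [{'@value': 'Newspaper'}, {'@value': 'Journal'}]},
        {'label': 'language', 'value': [{'@value': 'fr'}]},
    ]
}

print(metadata_values(sample_manifest, 'type'))
print('; '.join(metadata_values(sample_manifest, 'language')))
```

Joining the values with a separator such as '; ' is one way to keep them all in a single CSV cell.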

For example, the full text of one record is available at https://www.europeana.eu/api/fulltext/9200303/BibliographicResource_3000059898023/472ef0641de5cce2ba8eb26d67110ed6#char=0,10o

In [ ]:

results = json.loads(response)

for r in results['hits']:
    
    title = thumbnail = date = license = typem = language = fulltextUrl = manifestUrl = fulltext =''
    
    manifestUrl = 'https://iiif.europeana.eu/presentation/' + r['scope'] + '/manifest'
    responseManifest = requests.get(manifestUrl, params = {'wskey': api_key })
    print(responseManifest.url)
    
    # parse the manifest JSON
    m = json.loads(responseManifest.text)
    
    # extract the descriptive metadata
    title = m['label'][0]['@value']
    thumbnail = m['thumbnail']['@id']
    if 'navDate' in m:
        date = m['navDate']
    license = m['license']

    for i in m['metadata']:
        if i['label'] == 'type':
            typem = i['value'][0]['@value']
        elif i['label'] == 'language':
            language = i['value'][0]['@value']
        
    ## getting the full text
    annopageUrl = 'https://iiif.europeana.eu/presentation/' + r['scope'] + '/annopage/1'
    responseAnnopage = requests.get(annopageUrl, params = {'wskey': api_key })
    print(responseAnnopage.url)
    
    a = json.loads(responseAnnopage.text)
    fulltextUrl = ''
    if 'resources' in a:
        fulltextUrl = a['resources'][0]['resource']['@id']
        print(fulltextUrl)
    
    responseFulltext = ''
    if fulltextUrl != '':
        responseFulltext = requests.get(fulltextUrl, params = {'wskey': api_key })
   
        # parse the full-text response
        f = json.loads(responseFulltext.text)
        # TODO check encoding
        fulltext = f['value']
   
    print('-------')
    
    csv_out.writerow([title, thumbnail, date, license, typem, language, fulltextUrl, manifestUrl, fulltext])
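Not every record necessarily has a transcription, so the annotation page may come back without a resources list, or with an empty one. A defensive sketch of the lookup used in the loop, exercised on invented dictionaries (the example.org URL is a placeholder, and first_fulltext_url is a helper name introduced here):

```python
def first_fulltext_url(annopage):
    """Return the '@id' of the first resource, or '' when none is present."""
    resources = annopage.get('resources') or []
    if not resources:
        return ''
    return resources[0].get('resource', {}).get('@id', '')

# Invented annotation pages shaped like the fields read in the loop.
with_text = {'resources': [{'resource': {'@id': 'https://example.org/fulltext/1'}}]}
without_text = {'resources': []}

print(first_fulltext_url(with_text))
print(repr(first_fulltext_url(without_text)))
```

Returning an empty string keeps the behaviour of the loop, which skips the full-text request when fulltextUrl is empty.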

In [ ]:

# Load the CSV file created above.
# This puts the data in a pandas DataFrame
df = pd.read_csv('eu_records.csv')

Have a peek¶

In [ ]:

df

Showing the thumbnails as a gallery¶

Once we have queried the repository and we have the metadata as a CSV file, let's show the results as a thumbnail gallery.

In [ ]:

from IPython.display import HTML, Image

def _src_from_data(data):
    """Base64 encodes image bytes for inclusion in an HTML img element"""
    img_obj = Image(data=data)
    for bundle in img_obj._repr_mimebundle_():
        for mimetype, b64value in bundle.items():
            if mimetype.startswith('image/'):
                return f'data:{mimetype};base64,{b64value}'

def gallery(images, row_height='auto'):
    """Shows a set of images in a gallery that flexes with the width of the notebook.
    
    Parameters
    ----------
    images: list of str or bytes
        URLs or bytes of images to display

    row_height: str
        CSS height value to assign to all images. Set to 'auto' by default to show images
        with their native dimensions. Set to a value like '250px' to make all rows
        in the gallery equal height.
    """
    figures = []
    for image in images:
        if isinstance(image, bytes):
            src = _src_from_data(image)
            caption = ''
        else:
            src = image
            caption = f'<figcaption style="font-size: 0.6em">{image}</figcaption>'
        figures.append(f'''
            <figure style="margin: 5px !important;">
              <img src="{src}" style="height: {row_height}">
              
            </figure>
        ''')
    return HTML(data=f'''
        <div style="display: flex; flex-flow: row wrap; text-align: center;">
        {''.join(figures)}
        </div>
    ''')

In [ ]:

#gallery(urls, row_height='150px')
gallery(df['thumbnail'], row_height='150px')
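If a thumbnail cell is empty in the CSV, pandas reads it back as a missing value (NaN), which would end up in the gallery as a broken image. A small sketch of a filter that keeps only usable URLs; usable_thumbnails is a helper name introduced here and the example.org URLs are placeholders.

```python
def usable_thumbnails(values):
    """Keep only non-empty string URLs, dropping blanks and missing values."""
    return [v for v in values if isinstance(v, str) and v.strip()]

# Invented sample mixing valid URLs with the kinds of gaps a CSV can contain.
sample = ['https://example.org/a.jpg', '', float('nan'), None, 'https://example.org/b.jpg']
print(usable_thumbnails(sample))
```

The filtered list can then be passed to the gallery in place of the raw column, for example gallery(usable_thumbnails(df['thumbnail']), row_height='150px').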