New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
Web archives don't just contain HTML pages! Using the filter parameter in CDX queries we can limit our results to particular types of files, for example Powerpoint presentations.
This notebook helps you find, download, and explore all the presentation files captured from a particular domain, like defence.gov.au. It uses the Internet Archive by default, as their CDX API allows domain-level queries and pagination; however, you could try using the UKWA or the National Library of Australia (prefix queries only).
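For example, a CDX query for Powerpoint files captured from a single domain looks something like this raw API request (the parameter values here are illustrative; the code below builds the same sort of query with requests):

https://web.archive.org/cdx/search/cdx?url=defence.gov.au&matchType=domain&filter=mimetype:application/vnd.ms-powerpoint&output=json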
This notebook includes a series of processing steps:
- harvest capture data for presentation files using the Internet Archive's CDX API
- download the unique files identified in the capture data, adding extra metadata as we go
- convert the downloaded files to PDFs
- extract text and a first-page image from each PDF
- save everything to a SQLite database for exploration in Datasette
Here's an example of the SQLite database created by harvesting Powerpoint files from the defence.gov.au domain, running in Datasette on Glitch.
Moving large files around and extracting useful data from proprietary formats is not a straightforward process. While this notebook has been tested and will work running on Binder, you'll probably want to shift across to a local machine if you're doing any large-scale harvesting. That'll make it easier for you to deal with corrupted files, broken downloads etc.
For more examples of harvesting domain-level data see Harvesting data about a domain using the IA CDX API.
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
import pandas as pd
import time
import os
from urllib.parse import urlparse
from pathlib import Path
import math
import io
from PIL import Image
import PIL
# pyMuPDF (aka Fitz) seems to do a better job of converting PDFs to images than pdf2image
import fitz
import sqlite_utils
from slugify import slugify
import arrow
from notebook.notebookapp import list_running_servers
from IPython.display import display, HTML, FileLink, FileLinks
# You'll also need LibreOffice installed on your system -- see the conversion section below
# Set up a requests session that automatically retries on server errors
s = requests.Session()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
def convert_lists_to_dicts(results):
    '''
    Converts an IA-style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA results with other timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts
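To show what this conversion does, here's a quick example using a made-up capture record (the values are invented, but the keys match what the IA CDX API returns):

# Example only -- a fake capture record in the IA CDX API's array-of-arrays format
sample = [
    ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
    ['au,gov,defence)/brief.ppt', '20130401123456', 'http://www.defence.gov.au/brief.ppt',
     'application/vnd.ms-powerpoint', '200', 'EXAMPLEDIGEST', '12345']
]
convert_lists_to_dicts(sample)
# [{'urlkey': 'au,gov,defence)/brief.ppt', 'timestamp': '20130401123456',
#   'digest': 'EXAMPLEDIGEST', 'length': '12345', 'status': '200',
#   'mime': 'application/vnd.ms-powerpoint', 'url': 'http://www.defence.gov.au/brief.ppt'}]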
def get_total_pages(params):
    '''
    Get the number of pages in the full results set.
    '''
    these_params = params.copy()
    these_params['showNumPages'] = 'true'
    response = s.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
    return int(response.text)
def query_cdx_ia(url, **kwargs):
    '''
    Harvest capture data from the IA CDX API, paging through the complete results set.
    '''
    results = []
    page = 0
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    total_pages = get_total_pages(params)
    with tqdm(total=total_pages - page) as pbar:
        while page < total_pages:
            params['page'] = page
            response = s.get('http://web.archive.org/cdx/search/cdx', params=params, headers={'User-Agent': ''})
            response.raise_for_status()
            results += convert_lists_to_dicts(response.json())
            page += 1
            pbar.update(1)
            # Pause between requests to be nice to the server
            time.sleep(0.2)
    return results
See Exploring the Internet Archive's CDX API and Comparing CDX APIs for more information on getting CDX data.
Setting filter=mimetype:application/vnd.ms-powerpoint would get standard MS .ppt files, but what about .pptx or OpenOffice files? Using a regular expression like filter=mimetype:.*(powerpoint|presentation).* should get most presentation software variants.
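If you want to check what the regular expression will match, you can test it against some common presentation mimetypes (this list is illustrative, not exhaustive):

import re
# Test the mimetype filter regex against some common presentation formats
pattern = re.compile(r'.*(powerpoint|presentation).*')
mimetypes = [
    'application/vnd.ms-powerpoint',  # .ppt
    'application/vnd.openxmlformats-officedocument.presentationml.presentation',  # .pptx
    'application/vnd.oasis.opendocument.presentation',  # .odp
    'text/html'  # shouldn't match
]
[m for m in mimetypes if pattern.match(m)]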
# Set the domain that we're interested in
domain = 'defence.gov.au'
# Create output directory for data
domain_dir = Path('domains', slugify(domain))
domain_dir.mkdir(parents=True, exist_ok=True)
# Domain or prefix search? Domain...
# Collapse on digest? Only removes adjacent captures with the same digest, so probably won't make much difference
# Note the use of regex in the mimetype filter -- should capture all(?) presentations.
results = query_cdx_ia(f'*.{domain}', filter='mimetype:.*(powerpoint|presentation).*')
Let's convert the harvested data into a dataframe.
df = pd.DataFrame(results)
How many captures are there?
df.shape
How many unique files are there? This drops duplicates based on the digest field.
df_unique = df.drop_duplicates(subset=['digest'], keep='first')
df_unique.shape
What mimetype values are there?
df_unique['mime'].value_counts()
In this step we'll work through the capture data and attempt to download each unique file. We'll also add in some extra metadata:
- first_capture – the first time this file appears in the archive (ISO formatted date, YYYY-MM-DD)
- last_capture – the last time this file appears in the archive (ISO formatted date, YYYY-MM-DD)
- current_status – the HTTP status code currently returned by the original url

The downloaded Powerpoint files will be stored in /domains/[this domain]/powerpoints/original. The script won't overwrite any existing files, so if it fails and you need to restart, it'll just pick up from where it left off.
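The files themselves are fetched using the Wayback Machine's 'id_' form of a capture url, which returns the original file without the Wayback Machine's usual banner and link rewriting. A sketch of the pattern used in save_files() below (the timestamp and url are made up):

# Illustrative example of the download url pattern
# The 'id_' suffix asks for the original, unmodified capture
'https://web.archive.org/web/20130401123456id_/http://www.defence.gov.au/brief.ppt'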
def format_timestamp_as_iso(timestamp):
    '''
    Convert a 14-digit CDX timestamp to an ISO formatted date (YYYY-MM-DD).
    '''
    return arrow.get(timestamp, 'YYYYMMDDHHmmss').format('YYYY-MM-DD')
def get_date_range(df, digest):
    '''
    Find the first and last captures identified by digest.
    Return the dates in ISO format.
    '''
    captures = df.loc[df['digest'] == digest]
    from_date = format_timestamp_as_iso(captures['timestamp'].min())
    to_date = format_timestamp_as_iso(captures['timestamp'].max())
    return from_date, to_date
def check_if_exists(url):
    '''
    Check to see if a file still exists on the web.
    Return the current status code.
    '''
    try:
        response = s.head(url, allow_redirects=True)
    # Catch any problems that don't generate status codes
    except requests.exceptions.ConnectionError:
        return 'unreachable'
    return response.status_code
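As a quick sanity check of these helpers (the values below are examples; check_if_exists will return whatever the live server sends back, or 'unreachable'):

format_timestamp_as_iso('20130401123456')
# '2013-04-01'
check_if_exists('http://www.defence.gov.au/')
# e.g. 200, 404, or 'unreachable'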
def save_files(df, add_current_status=True):
    '''
    Attempts to download each unique file.
    Adds first_capture, last_capture, and file_path to the file metadata.
    Optionally checks to see if the file is still accessible on the live web, adding the current status code to the metadata.
    '''
    metadata = df.drop_duplicates(subset=['digest'], keep='first').to_dict('records')
    for row in tqdm(metadata):
        # The 'id_' suffix asks the Wayback Machine for the original, unmodified capture
        url = f'https://web.archive.org/web/{row["timestamp"]}id_/{row["url"]}'
        parsed = urlparse(row['url'])
        suffix = Path(parsed.path).suffix
        # This should give a sortable and unique filename if there are multiple versions of a file
        file_name = f'{slugify(row["urlkey"])}-{row["timestamp"]}{suffix}'
        # Create the output directory for the downloaded files
        output_dir = Path(domain_dir, 'powerpoints', 'original')
        output_dir.mkdir(parents=True, exist_ok=True)
        file_path = Path(output_dir, file_name)
        # Don't re-get files we already got
        if not file_path.exists():
            response = s.get(url=url)
            file_path.write_bytes(response.content)
            time.sleep(1)
        # Get first and last capture dates
        first, last = get_date_range(df, row["digest"])
        row['first_capture'] = first
        row['last_capture'] = last
        # This can slow things down a bit, so make it optional
        if add_current_status:
            row['current_status'] = check_if_exists(row["url"])
        row['file_path'] = str(file_path)
        row['memento'] = f'https://web.archive.org/web/{row["timestamp"]}/{row["url"]}'
    return metadata
metadata = save_files(df)
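If you want to see what the enriched metadata looks like before saving it, you can peek at the first record (the field values will depend on your harvest):

# Inspect the first enriched record
metadata[0]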
Let's save the enriched data as a CSV file.
df_md = pd.DataFrame(metadata)
csv_file = Path(domain_dir, f'{slugify(domain)}-powerpoints.csv')
df_md.to_csv(csv_file, index=False)
display(FileLink(csv_file))
How many of the archived Powerpoints are still accessible on the live web?
df_md['current_status'].value_counts()
In this step we'll convert the Powerpoint files to PDFs. PDFs are easier to work with, and we can use them to extract text and images from the original presentation files. However, the conversion process can be a bit fiddly. Before you undertake this, you'll need to make sure that you have a recent version of LibreOffice installed on your system.
If you're using this notebook on Binder, or have created a virtualenv using this repository's requirements.txt, you'll have installed the Python command-line program unoconv. Unoconv wraps around LibreOffice to provide a standard interface for converting files from one file type to another. However, while unoconv seems to work ok when running a notebook in a conda environment, it fails in a standard virtualenv. It's got something to do with the default interpreter, but even using LibreOffice's own packaged Python doesn't seem to help.
Fortunately, you can just call LibreOffice directly. To do this, you'll need to know the location of the LibreOffice program. For example, on macOS this is probably /Applications/LibreOffice.app/Contents/MacOS/soffice. Once you have that, it's easy to convert a directory full of Powerpoint files to PDFs, for example:
!/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to [format to convert to, eg pdf] --outdir [output directory] [input files]
I've included both the unoconv and direct LibreOffice options below so you can use the one that works best for you. Just comment/uncomment lines as necessary. If you're using this notebook on Binder, it should just work with the default settings.
Some other issues:
# Note: the shell glob below matches any file whose name ends in one of the
# bracketed characters (not the literal extensions) -- loose, but good enough
# since this directory only contains presentation files
input_dir = str(Path(domain_dir, 'powerpoints', 'original')) + '/*[ppt,pps,pptx]'
output_dir = Path(domain_dir, 'powerpoints', 'pdfs')
output_dir.mkdir(parents=True, exist_ok=True)
# Try with unoconv?
#!unoconv -v -f pdf -o {output_dir} {input_dir}
# Using the typical LibreOffice location on Ubuntu (ie Binder)
!/usr/bin/soffice --headless --convert-to pdf --outdir {output_dir} {input_dir}
# Typical macOS installation of the most recent LibreOffice
#!/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir {output_dir} {input_dir}
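Conversion can fail on individual files (corrupted downloads and the like), so it's worth checking how many PDFs you actually ended up with. A minimal check, assuming the directory layout used above:

# Compare the number of PDFs created with the number of files downloaded
originals = len(list(Path(domain_dir, 'powerpoints', 'original').glob('*')))
pdfs = len(list(output_dir.glob('*.pdf')))
print(f'{pdfs} PDFs created from {originals} downloaded files')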
Now that we have PDF-formatted versions of the presentations, we can save individual slides as images, and extract all the text from the presentation. The function below works its way through our metadata file, looking for the PDF version of each of our downloaded Powerpoint files. If the PDF exists, it:
- saves the first page of the PDF as a PNG image in the /domains/[your selected domain]/powerpoints/images directory
- creates a 300 pixel wide thumbnail of the first page to store in the database
- extracts the text from every page

We end up with a metadata file that includes the full text of the file as well as an image of the first page. In the next step, we'll turn this into a database.
def get_data_from_pdfs(metadata):
    '''
    Save the first page of each PDF as an image, and extract the text from all pages.
    '''
    pdf_data = metadata.copy()
    for pdf in tqdm(pdf_data):
        # See if a PDF version of this file exists
        pdf_name = f'{slugify(pdf["urlkey"])}-{pdf["timestamp"]}.pdf'
        input_dir = Path(domain_dir, 'powerpoints', 'pdfs')
        output_dir = Path(domain_dir, 'powerpoints', 'images')
        output_dir.mkdir(parents=True, exist_ok=True)
        pdf_path = Path(input_dir, pdf_name)
        if pdf_path.exists():
            doc = fitz.open(pdf_path)
            page_count = doc.pageCount
            page_num = 0
            text = ''
            # You can define a matrix to increase resolution / size -- https://pymupdf.readthedocs.io/en/latest/faq/
            # Increase the numbers below to zoom
            mat = fitz.Matrix(1, 1)
            while page_num < page_count:
                page = doc.loadPage(page_num)
                # Convert the first page to an image
                if page_num == 0:
                    # Get the page as an image and save it
                    pix = page.getPixmap(matrix=mat)
                    image_file = Path(output_dir, f'{pdf_path.stem}.png')
                    pix.writePNG(str(image_file))
                    # Resize the image as a thumbnail for saving in the db
                    img_data = pix.getPNGData()
                    img = Image.open(io.BytesIO(img_data))
                    ratio = 300 / img.width
                    (width, height) = (300, math.ceil(img.height * ratio))
                    resized_img = img.resize((width, height), PIL.Image.LANCZOS)
                    # Save the thumbnail as binary data
                    buffer = io.BytesIO()
                    resized_img.save(buffer, format='PNG')
                    pdf['image'] = buffer.getvalue()
                text += page.getText()
                page_num += 1
            pdf['text'] = text
        else:
            pdf['text'] = ''
            pdf['image'] = ''
    return pdf_data
pdf_data = get_data_from_pdfs(metadata)
In this step we import the enriched metadata into a SQLite database using sqlite-utils. Then we can open the database in Datasette and explore the results.
There are a few interesting things to note: the thumbnail images are stored in the database as binary data, and using sqlite-utils we can index the text column so that it's searchable as full text.
# Create the db
db_file = Path(domain_dir, f'{slugify(domain)}-powerpoints.db')
db = sqlite_utils.Database(db_file)
db['files'].delete_where()
# Insert the metadata into the db
db['files'].insert_all(pdf_data, column_order=('image', 'timestamp', 'memento', 'first_capture', 'last_capture', 'current_status', 'text', 'mime', 'status', 'url'))
# Make the text column full text searchable
db['files'].enable_fts(['text'], fts_version='FTS4')
db['files'].optimize()
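Once the index is built, you can also run a full-text search directly with sqlite-utils before launching Datasette. A minimal sketch (the search term is just an example):

# Search the indexed text column and print the urls of matching files
for row in db['files'].search('exercise'):
    print(row['url'])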
The following code should launch Datasette correctly on either Binder or a local system.
# Get current running servers
servers = list_running_servers()
# Get the current base url
base_url = next(servers)['base_url']
# Create a base url for Datasette using the proxy path
proxy_url = f'{base_url}proxy/absolute/8001/'
# Display a link to Datasette
display(HTML(f'<p><a style="display: block; border: 1px solid #307fc1; background-color: #1976d2; color: #ffffff; padding: 10px; text-align: center;"href="{proxy_url}">View Datasette</a> (Click on the stop button in the top menu bar to close the Datasette server)</p>'))
# Launch Datasette
!datasette {str(db_file)} --port 8001 --config base_url:{proxy_url} --plugins-dir plugins
If you're using Jupyter Lab, you can browse the results of this notebook by just looking inside the domains folder. I've also enabled the jupyter-archive extension, which adds a download option to the right-click menu. Just right click on a folder and you'll see an option to 'Download as an Archive'. This will zip up and download the folder. Remember, however, that some of these folders will contain a LOT of data, so downloading everything might not always work.
The cells below provide a couple of alternative ways of viewing and downloading the results.
# Display all the files under the current domain folder (this could be a long list)
display(FileLinks(str(domain_dir)))
# Tar/gzip the current domain folder
!tar -czf {str(domain_dir)}.tar.gz {str(domain_dir)}
# Display a link to the gzipped data
# In JupyterLab you'll need to Shift+right-click on the link and choose 'Download link'
display(FileLink(f'{str(domain_dir)}.tar.gz'))
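Once you've downloaded the tar.gz file to your own machine, you can unpack it from a terminal; for example (assuming the defence.gov.au harvest, so the slugified file name below is illustrative):

tar -xzf defence-gov-au.tar.gz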
Created by Tim Sherratt for the GLAM Workbench.
Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.