import requests
from bs4 import BeautifulSoup
First we'll load the page.
# Note the 'id_' in the url to get the original page without the IA navigation.
response = requests.get('https://web.archive.org/web/20190716210347id_/http://www.naa.gov.au/collection/fact-sheets/by-number/index.aspx')
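# Name the parser explicitly so the result doesn't depend on which parsers are installed.
soup = BeautifulSoup(response.content, 'html.parser')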
Then we'll extract the rows from the index table.
fs_list = soup.find('table', title='Numerical list of fact sheets').find_all('tr')[1:]
Let's loop through all the rows in the fact sheet index, extracting the fact sheet number, title and URL. Then we'll try loading each URL, saving the details and the HTTP status code for further exploration.
fact_sheets = []
for row in fs_list:
    num = row.td.text
    fs = row.find('a')
    title = fs.text
    url = f'http://naa.gov.au{fs["href"]}'
    response = requests.get(url)
    status = response.status_code
    print(f'{title}: {status}')
    fact_sheets.append({'number': num, 'title': title, 'url': url, 'status': status})
Reading room addresses and hours of opening: 200
Using our collection: 404
Addresses of Australian archival institutions: 404
Reading room rules: 404
What are archives?: 404
Archival terms: 200
The Commonwealth Record Series (CRS) system: 200
Citing archival records: 200
Copyright: 200
Searching for records: 404
Access to records under the Archives Act: 200
Viewing records in the reading room: 404
What to do if we refuse you access: 200
RecordSearch: an overview: 404
Keyword searching in RecordSearch Advanced search screens: 404
Release of records containing personal information: 200
Service guidelines for the National Reference Service: 404
NameSearch: 200
PhotoSearch: 404
Parliamentary Papers: 404
Commonwealth of Australia Gazettes: 404
Customs House, Sydney: 404
Coastal fortifications in New South Wales: 404
Commonwealth Film Unit: 404
The wine industry in South Australia: 404
Tasmanian railways: 404
Australia First Movement: 404
Commonwealth banking policy: 404
Navy service records: 404
Navy crew and ships records: 404
RAAF service records: 404
Security intelligence records held in Canberra: 404
Cabinet records: 404
Administration of the Australian Capital Territory: 404
Military records held in Hobart: 404
Maritime records held in Hobart: 404
Passenger records held in Canberra: 404
Civilian service in World War II: 404
Research agents – Canberra: 404
Research agents – Sydney: 404
Research agents – Brisbane: 404
Research agents – Adelaide and Darwin: 404
Research agents – Melbourne and Hobart: 404
Research agents – Perth: 404
Why we refuse access: 200
Australian Overseas Information Service photographs: 404
Papua New Guinea patrol reports: 404
D Notices: 404
Post Office records: 404
Copying charges: 200
Exempt information in ASIO records: 404
Personal information in ASIO records: 404
Veterans' case files: 404
Fremantle Harbour: 404
Passenger records held in Perth: 404
Melbourne Olympics, 1956: 404
World War I internee, alien and POW records held in Canberra: 404
World War II internee, alien and POW records held in Canberra: 404
Design and development of the national capital: 404
World War II war crimes: 404
Indonesian independence: 404
War service information: 404
Passenger records held in Sydney: 404
Customs shipping records held in Sydney: 404
Migrant selection documents held in Canberra: 404
Boer War records: 404
Naturalisation records held in Canberra: 404
ASIO files on writers and literary groups: 404
Prime ministers of Australia: 404
Prime Minister Joseph Cook: 404
Prime Minister William Morris Hughes: 404
Prime Minister Stanley Melbourne Bruce: 404
Prime Minister James Henry Scullin: 404
Prime Minister Joseph Aloysius Lyons: 404
Prime Minister Earle Christmas Grafton Page: 404
Prime Minister Robert Gordon Menzies: 404
Prime Minister Arthur William Fadden: 404
Prime Minister John Joseph Ambrose Curtin: 404
Prime Minister Francis Michael Forde: 404
Prime Minister Joseph Benedict Chifley: 404
Prime Minister Harold Edward Holt: 404
Prime Minister John McEwen: 404
Prime Minister John Grey Gorton: 404
Family history sources held in Canberra: 404
Family history sources held in Adelaide: 404
Australia and the United Nations: 404
Births, deaths and marriages: 404
Cyclones and the Northern Territory: 404
Coastal fortifications in South Australia: 404
Customs houses in South Australia: 404
Customs House, Port Adelaide, South Australia: 404
Excise control of distilled products in South Australia: 404
Walter Burley Griffin and the design of Canberra: 404
J T Lang and Lang Labor: 404
Regulation of beer and brewing in South Australia: 404
Sir Frederick Shedden and the Shedden collection: 404
Records relating to Italian migration held in Sydney: 404
World War II internee, alien and POW records held in Sydney: 404
The Australian flag: 404
The Cocos (Keeling) Islands: 404
Commonwealth electoral rolls held in Perth: 404
Copyright records: 404
World War I internee, alien and POW records held in Adelaide: 404
World War II internee, alien and POW records held in Adelaide: 404
The Pastoral industry in the Northern Territory: 404
Building the provisional Parliament House: 404
When to use the Freedom of Information, Archives and Privacy Acts: 200
The sinking of HMAS Sydney, November 1941: 404
Royal Commission into Aboriginal Deaths in Custody: 404
Aboriginal and Torres Strait Islander people: 404
Memorandum of Understanding with Northern Territory Aboriginal people: 404
Introducing television to Australia, 1956: 404
Guides to the collection: 404
Australia's involvement in the Vietnam War: 404
Computer resources in reading rooms: 404
Commonwealth electoral rolls held in Brisbane: 404
Bankruptcy records held in Sydney: 404
General Sir John Monash: 404
Lighthouse records held in Hobart: 404
Records of British migrants held in Canberra: 404
Child migration to Australia: 404
Radar research in Australia during World War II: 404
Radar production and use during World War II: 404
War Cabinet records: 404
Cabinet notebooks: 404
British nuclear tests at Maralinga: 404
The Royal Commission on Espionage, 1954–55: 404
Posters: 404
World War ll Army pay files held in Adelaide: 404
Defence and service records held in Melbourne: 404
Colonial defence personnel records held in Melbourne: 404
Army administrative records held in Melbourne: 404
Army service records: 404
Navy administrative records held in Melbourne: 404
Navy service records held in Melbourne: 404
Royalty and Australian society: 404
Cockatoo Island Dockyard: 404
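The loop above assumes every request comes back with a response. If a connection drops or times out, `requests.get` raises an exception and the whole run stops. Here's a minimal sketch of one way to make each check fail softly. The `check_status` helper, the `fake_get` stand-in and the example paths are all hypothetical, not part of the original notebook; the stand-in is only there so the sketch runs offline.

```python
from urllib.parse import urljoin

BASE_URL = 'http://naa.gov.au'  # same host the loop above prepends to each href


def check_status(href, fetch, base=BASE_URL):
    """Return (url, status); status is None if the request itself fails.

    `fetch` is any callable that takes a url and returns an object with a
    `status_code` attribute, for example `requests.get`.
    """
    url = urljoin(base, href)  # also copes with hrefs that are already absolute
    try:
        return url, fetch(url).status_code
    except Exception as err:  # with requests, narrow this to requests.RequestException
        print(f'{url}: request failed ({err})')
        return url, None


# Offline demonstration: a stand-in for requests.get and made-up paths.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


def fake_get(url):
    if 'missing' in url:
        return FakeResponse(404)
    if 'down' in url:
        raise ConnectionError('no route to host')
    return FakeResponse(200)


hrefs = ['/collection/ok.aspx', '/collection/missing.aspx', '/collection/down.aspx']
results = [check_status(href, fake_get) for href in hrefs]
```

In the notebook itself you could pass `functools.partial(requests.get, timeout=30)` as `fetch`, so that a stalled server can't hang the loop. Rows with a status of `None` can then be retried or counted separately from genuine 404s.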
import pandas as pd
df = pd.DataFrame(fact_sheets)
Let's break down the results by HTTP status code.
df['status'].value_counts()
404    251
200     15
Name: status, dtype: int64
print(f'{251 / (251+15):.2%} of fact sheets are kaput!')
94.36% of fact sheets are kaput!
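The percentage above is typed in from the counts by hand, so it would go stale if the check were re-run later. As a rough sketch, the same figure can be derived from the collected records themselves. The `sample` list below is made-up data standing in for the `fact_sheets` list built by the loop, and treating every status other than 200 as broken is an assumption.

```python
from collections import Counter

# Made-up stand-in for the `fact_sheets` list of dicts built by the loop.
sample = [
    {'number': '1', 'status': 200},
    {'number': '2', 'status': 404},
    {'number': '3', 'status': 404},
    {'number': '4', 'status': 404},
]

# Tally the status codes, much as value_counts() does for the DataFrame column.
counts = Counter(fs['status'] for fs in sample)

# Assumption: anything other than 200 counts as broken.
broken = sum(n for status, n in counts.items() if status != 200)
share = broken / sum(counts.values())

print(f'{share:.2%} of fact sheets are kaput!')
```

Swapping `sample` for `fact_sheets` should give the same figure as the hard-coded calculation, as long as the only statuses returned are 200 and 404.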
df.loc[df['status'] == 200]
|  | number | title | url | status |
|---|---|---|---|---|
| 0 | 1 | Reading room addresses and hours of opening | http://naa.gov.au/collection/fact-sheets/fs01.... | 200 |
| 5 | 5 | Archival terms | http://naa.gov.au/collection/fact-sheets/fs05.... | 200 |
| 6 | 6 | The Commonwealth Record Series (CRS) system | http://naa.gov.au/collection/fact-sheets/fs06.... | 200 |
| 7 | 7 | Citing archival records | http://naa.gov.au/collection/fact-sheets/fs07.... | 200 |
| 8 | 8 | Copyright | http://naa.gov.au/collection/fact-sheets/fs08.... | 200 |
| 10 | 10 | Access to records under the Archives Act | http://naa.gov.au/collection/fact-sheets/fs10.... | 200 |
| 12 | 12 | What to do if we refuse you access | http://naa.gov.au/collection/fact-sheets/fs12.... | 200 |
| 15 | 15 | Release of records containing personal informa... | http://naa.gov.au/collection/fact-sheets/fs15.... | 200 |
| 17 | 18 | NameSearch | http://naa.gov.au/collection/fact-sheets/fs18.... | 200 |
| 44 | 46 | Why we refuse access | http://naa.gov.au/collection/fact-sheets/fs46.... | 200 |
| 49 | 51 | Copying charges | http://naa.gov.au/collection/fact-sheets/fs51.... | 200 |
| 106 | 110 | When to use the Freedom of Information, Archiv... | http://naa.gov.au/collection/fact-sheets/fs110... | 200 |
| 170 | 175 | Bringing Them Home name index | http://naa.gov.au/collection/fact-sheets/fs175... | 200 |
| 190 | 195 | The bombing of Darwin | http://naa.gov.au/collection/fact-sheets/fs195... | 200 |
| 214 | 220 | Passenger arrivals index | http://naa.gov.au/collection/fact-sheets/fs220... | 200 |
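from pathlib import Path
# to_csv won't create a missing directory, so make sure 'data' exists first.
Path('data').mkdir(exist_ok=True)
df.to_csv('data/fact_sheets.csv', index=False)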