One frequently encountered task when working with regularly produced datasets is downloading the 'latest release' spreadsheet or CSV data files. These can typically take one of three forms:
In each case, there is a requirement to regularly download the newly released file. In the first case, when starting a longitudinal analysis, we may also need to download the historical data files (the monthly releases for previous months, for example) from which a longitudinal dataset can be created.
Harvesting data refers to the collection of data from a known location. Harvesting may occur over time, for example according to a schedule - collecting a monthly dataset when it has been published - or it may involve collecting a set of data from multiple files or locations in order to seed a larger dataset.
Where regularly released datafiles are posted to a known web location with a long-lived URL, in the form of a link on that page to the datafile itself, possibly along with a set of previously posted links to historical datasets, we can often automate the collection of the data: scrape the links to the datafile or datafiles, then use those links as the basis of an automatic downloader.
import pandas as pd
#requests is a library that simplifies grabbing web pages and downloading files from web addresses / URLs
import requests
#BeautifulSoup is a library that helps parse HTML web pages and XML files
from bs4 import BeautifulSoup
url='http://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/weekly-ae-sitreps-2014-15/'
#Grab the contents of a file of an arbitrary type that is published at a particular web location using an HTTP GET
page = requests.get(url)
#The file we have grabbed in this case is a web page - that is, an HTML file
#We can get the content of the page and parse it
soup=BeautifulSoup(page.content,'html.parser')
#BeautifulSoup has a routine - find_all() - that will find all the HTML tags of a particular sort
#Links are represented in HTML pages in the form <a href="http://example.com/page.html">link text</a>
#Grab all the <a> (anchor) tags...
links=soup.find_all('a')
#We can iterate through all the links one at a time, calling each link "link" temporarily as we do so
for link in links:
    #Do a logical test to see if the linked URL ends in .csv, the conventional suffix for a CSV file
    #(using .get() in case an anchor tag has no href attribute)
    if link.get('href','').endswith('.csv'):
        #If the link does look like a link to a CSV file, print the URL out
        print(link['href'],link.text)
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.03.01-WSitAE-Extractions-AE4g5fd5.csv', u'A&E Attendances Week Ending 01.03.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv', u'A&E Admissions Week Ending 01.03.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.22-WSitAE-Extractions-AE3h6yd5.csv', u'A&E Attendances Week Ending 22.02.2015 (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv', u'A&E Admissions Week Ending 22.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.15-WSitAE-Extractions-AE9th3bd.csv', u'A&E Attendances Week Ending 15.02.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.15-WSitAE-Extractions-Adm9th3bd.csv', u'A&E Admissions Week Ending 15.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.08-WSitAE-Extractions-AE4df65w.csv', u'A&E Attendances Week Ending 08.02.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.08-WSitAE-Extractions-Adm4df65w.csv', u' A&E Admissions Week Ending 08.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.01-WSitAE-Extractions-AE9rb63g.csv', u'A&E Attendances Week Ending 01.02.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.01-WSitAE-Extractions-Adm9rb63g.csv', u'A&E Admissions Week Ending 01.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.25-WSitAE-Extractions-AE4rt7hg.csv', u'A&E Attendances Week Ending 25.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv', u'A&E Admissions Week Ending 25.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv', u'A&E Attendances Week Ending 18.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv', u'A&E Admissions Week Ending 18.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv', u'A&E Attendances Week Ending 11.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv', u'A&E Admissions Week Ending 11.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.04-WSitAE-Extractions-AEp19ak47t.csv', u'A&E Attendances Week Ending 04.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv', u'A&E Admissions Week Ending 04.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.28-WSitAE-Extractions-AE4d6h8g.csv', u'A&E Attendances Week Ending 28.12.2014 (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv', u'A&E Admissions Week Ending 28.12.2014 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.21-WSitAE-Extractions-AE9yf54s.csv', u'A&E Attendances Week Ending 21.12.2014 (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.21-WSitAE-Extractions-Adm8yf54s.csv', u'A&E Admissions Week Ending 21.12.2014 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.14-WSitAE-Extractions-AE2yg54s1.csv', u'A&E Attendances Week Ending 14.12.2014 (Revised 06.01.2015) (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv', u'A&E Admissions Week Ending 14.12.2014 (Revised 06.01.2015) (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.07-WSitAE-Extractions-AE9ty4h21.csv', u'A&E Attendances Week Ending 07.12.2014 (Revised 06.01.2015) (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv', u'A&E Admissions Week Ending 07.12.2014 (Revised 06.01.2015) (CSV, 32K)')
So we've seen how we can grab a list of links to CSV files that are published via a single web page. How does that help us harvest these files?
import os
directory='AEdata'
#Create a new directory - AEdata - if it doesn't already exist
if not os.path.exists(directory):
    os.makedirs(directory)
##Via http://stackoverflow.com/a/16696317/454773 (with a path argument added)
#The following function seems to be a pretty robust, generic function for downloading an arbitrary file to our local machine
import requests

def download_file(url,path):
    ''' Download a file from a particular URL into the directory at path, keeping the original filename '''
    local_filename = url.split('/')[-1]
    local_filename='/'.join([path,local_filename])
    # NOTE the stream=True parameter, which lets us download the file in chunks
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename
#In the same way we iterated through the list of links extracted from a web page so we could print them out,
#we can iterate through the list of links so we can download the corresponding files
for link in links:
    if link.get('href','').endswith('.csv'):
        fn=download_file(link['href'],directory)
        print("Downloaded {}".format(fn))
Downloaded AEdata/2015.03.01-WSitAE-Extractions-AE4g5fd5.csv
Downloaded AEdata/2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv
Downloaded AEdata/2015.02.22-WSitAE-Extractions-AE3h6yd5.csv
Downloaded AEdata/2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv
Downloaded AEdata/2015.02.15-WSitAE-Extractions-AE9th3bd.csv
Downloaded AEdata/2015.02.15-WSitAE-Extractions-Adm9th3bd.csv
Downloaded AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv
Downloaded AEdata/2015.02.08-WSitAE-Extractions-Adm4df65w.csv
Downloaded AEdata/2015.02.01-WSitAE-Extractions-AE9rb63g.csv
Downloaded AEdata/2015.02.01-WSitAE-Extractions-Adm9rb63g.csv
Downloaded AEdata/2015.01.25-WSitAE-Extractions-AE4rt7hg.csv
Downloaded AEdata/2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv
Downloaded AEdata/2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv
Downloaded AEdata/2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv
Downloaded AEdata/2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv
Downloaded AEdata/2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv
Downloaded AEdata/2015.01.04-WSitAE-Extractions-AEp19ak47t.csv
Downloaded AEdata/2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv
Downloaded AEdata/2014.12.28-WSitAE-Extractions-AE4d6h8g.csv
Downloaded AEdata/2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv
Downloaded AEdata/2014.12.21-WSitAE-Extractions-AE9yf54s.csv
Downloaded AEdata/2014.12.21-WSitAE-Extractions-Adm8yf54s.csv
Downloaded AEdata/2014.12.14-WSitAE-Extractions-AE2yg54s1.csv
Downloaded AEdata/2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv
Downloaded AEdata/2014.12.07-WSitAE-Extractions-AE9ty4h21.csv
Downloaded AEdata/2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv
#The os.listdir(PATH) command returns a list of the files contained in the directory at PATH
os.listdir('./AEdata')[:3]
['111-for-winter-daily-sitrep-up-to-20150301.xlsx', '2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv', '2014.12.07-WSitAE-Extractions-AE9ty4h21.csv']
#ACTIVITY - How would you identify whether or not a link to a file you haven't already downloaded has been posted?
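One possible approach, sketched here rather than given as a definitive answer, is to compare the filename part of each linked URL against the files already present in the download directory:
#Files we already have local copies of
already=set(os.listdir(directory))
for link in links:
    href=link.get('href','')
    #Flag up any CSV links whose filename we haven't already downloaded
    if href.endswith('.csv') and href.split('/')[-1] not in already:
        print('Not yet downloaded:',href)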
#Automation
#Once we have worked out a recipe of one or more steps that is likely to be reusable,
#we might create a function that combines those steps so they can be called from a single command.
#For example, we could combine the command that gets the links from a page
#with a line that finds the ones corresponding to a particular filetype.
#Bear in mind that the function may depend on imported libraries or other functions:
#if we put the function into a file, we should make sure we also import or declare anything else it requires
def linkscraper(url,path,filetypes):
    ''' url: a web page containing a list of links
        path: a path to a local download directory
        filetypes: a list of filetype suffixes we want to check for'''
    #Create the download directory if it doesn't already exist
    if not os.path.exists(path):
        os.makedirs(path)
    page = requests.get(url)
    soup=BeautifulSoup(page.content,'html.parser')
    links=soup.find_all('a')
    for link in links:
        filetype=link.get('href','').split('.')[-1]
        if filetype in filetypes:
            fn=download_file(link['href'],path)
            print("Downloaded {}".format(fn))
#We now have a single command that will download files of a particular type linked to from a given URL
url='http://www.england.nhs.uk/statistics/statistical-work-areas/winter-daily-sitreps/winter-daily-sitrep-2014-15-data/'
linkscraper(url,directory,['xlsx'])
Downloaded AEdata/DailySR-Timeseries-WE-01.03.15.xlsx
Downloaded AEdata/111-for-winter-daily-sitrep-up-to-20150301.xlsx
Downloaded AEdata/DailySR-Web-file-WE-01.03.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-22.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-15.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-08.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-01.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-25.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-18.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-11.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-04.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-28.12.14.xlsx
Downloaded AEdata/DailySR-Web-file-WE-21.12.14.xlsx
Downloaded AEdata/DailySR-Timeseries-WE-14.12.141.xlsx
Downloaded AEdata/DailySR-Web-file-WE-07.12.14.xlsx
We might want to further refine the downloader so that in future it only downloads files we haven't already downloaded. Note that this might also require checking against an exceptions list, e.g. for spreadsheets or CSV files that keep the same filename but are updated on a regular basis.
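A sketch of such a refinement might look something like the following, where always_fetch is a hypothetical exceptions list of filenames that should be re-downloaded even if we already have a copy:
def download_file_if_new(url,path,always_fetch=None):
    ''' Download a file only if we don't already have a local copy,
        unless its filename appears in the always_fetch exceptions list '''
    if always_fetch is None: always_fetch=[]
    local_filename=url.split('/')[-1]
    #Skip files we already hold, unless they are flagged as regularly updated
    if local_filename in os.listdir(path) and local_filename not in always_fetch:
        return None
    return download_file(url,path)
The linkscraper() function could then call download_file_if_new() rather than download_file().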
!ls AEdata/
111-for-winter-daily-sitrep-up-to-20150301.xlsx 2014.12.07-WSitAE-Extractions-AE9ty4h21.csv 2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv 2014.12.14-WSitAE-Extractions-AE2yg54s1.csv 2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv 2014.12.21-WSitAE-Extractions-AE9yf54s.csv 2014.12.21-WSitAE-Extractions-Adm8yf54s.csv 2014.12.28-WSitAE-Extractions-AE4d6h8g.csv 2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv 2015.01.04-WSitAE-Extractions-AEp19ak47t.csv 2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv 2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv 2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv 2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv 2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv 2015.01.25-WSitAE-Extractions-AE4rt7hg.csv 2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv 2015.02.01-WSitAE-Extractions-AE9rb63g.csv 2015.02.01-WSitAE-Extractions-Adm9rb63g.csv 2015.02.08-WSitAE-Extractions-AE4df65w.csv 2015.02.08-WSitAE-Extractions-Adm4df65w.csv 2015.02.15-WSitAE-Extractions-AE9th3bd.csv 2015.02.15-WSitAE-Extractions-Adm9th3bd.csv 2015.02.22-WSitAE-Extractions-AE3h6yd5.csv 2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv 2015.03.01-WSitAE-Extractions-AE4g5fd5.csv 2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv DailySR-Timeseries-WE-01.03.15.xlsx DailySR-Timeseries-WE-14.12.141.xlsx DailySR-Web-file-WE-01.02.15.xlsx DailySR-Web-file-WE-01.03.15.xlsx DailySR-Web-file-WE-04.01.15.xlsx DailySR-Web-file-WE-07.12.14.xlsx DailySR-Web-file-WE-08.02.15.xlsx DailySR-Web-file-WE-11.01.15.xlsx DailySR-Web-file-WE-15.02.15.xlsx DailySR-Web-file-WE-18.01.15.xlsx DailySR-Web-file-WE-21.12.14.xlsx DailySR-Web-file-WE-22.02.15.xlsx DailySR-Web-file-WE-25.01.15.xlsx DailySR-Web-file-WE-28.12.14.xlsx
When working with third party spreadsheets, it is not uncommon to find layouts that include: metadata elements in the first sheet or at the top of a sheet; hierarchically declared column headings, where, for example, a heading spans several columns in one row and is then subdivided into subheadings that appear in the next row; footnotes that appear at the bottom of a spreadsheet; and even sidenotes that are pasted into additional columns.
Even simpler CSV files may include hierarchical indices and leading and trailing metadata or additional notes.
In the following example we'll start to look at the WSitAE reports, followed by the DailySR reports.
!head AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv
"Weekly A&E Sitreps - A&E Activity" "Year:2014-15","Period Name:W/E 08/02/2015","Provider Org Name:<All>" "","","Number of A&E Attendances",,,"Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission",,,"Number of patients spending >4 hours but <12 hours from decision to admit to admission",,,,"Number of patients spending >12 hours from decision to admit to admission",,, "","","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total 4 to 12 hour Trolley Waits","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total over 12 hour Trolley Waits" "AAH","TETBURY HOSPITAL TRUST LTD",0,0,63,0,0,0,0,0,0,0,0,0,0,0 "AD913","BECKENHAM BEACON UCC",0,0,796,0,0,0,0,0,0,0,0,0,0,0 "AJN","WORKINGTON HEALTH LIMITED",0,0,288,0,0,0,0,0,0,0,0,0,0,0 "C82009","MARKET HARBOROUGH MED.CTR",0,0,298,0,0,0,0,0,0,0,0,0,0,0
#The following recipe will open a file and then read and print the first few lines
with open('AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv','r') as f:
    for i in range(0,7):
        print('>>{}<<'.format(f.readline()))
>>"Weekly A&E Sitreps - A&E Activity" << >> << >>"Year:2014-15","Period Name:W/E 08/02/2015","Provider Org Name:<All>" << >> << >>"","","Number of A&E Attendances",,,"Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission",,,"Number of patients spending >4 hours but <12 hours from decision to admit to admission",,,,"Number of patients spending >12 hours from decision to admit to admission",,, << >>"","","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total 4 to 12 hour Trolley Waits","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total over 12 hour Trolley Waits" << >>"AAH","TETBURY HOSPITAL TRUST LTD",0,0,63,0,0,0,0,0,0,0,0,0,0,0 <<
Inspecting the first few rows of the file, we notice that we need some way of representing a hierarchical index. In pandas, this is referred to as a MultiIndex. Working with such structures can be confusing: it is worth keeping notes of any useful recipes you work out, along with an up-to-date set of worked examples demonstrating principles and techniques that may be useful to you.
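As a minimal sketch of the idea (the group and column names here are made up purely for illustration), a two-level set of column headings can be built from two aligned lists of labels:
#A hypothetical two-level column MultiIndex over a one row dataframe
pd.DataFrame([[1,2,3]],
             columns=pd.MultiIndex.from_arrays([['Group A','Group A','Group B'],
                                                ['x','y','x']]))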
?pd.read_csv
df=pd.read_csv("AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv", header=[4,5])
df.head(3)
Unnamed: 0_level_0 | Unnamed: 1_level_0 | Number of A&E Attendances | Unnamed: 3_level_0 | Unnamed: 4_level_0 | Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission | Unnamed: 6_level_0 | Unnamed: 7_level_0 | Number of patients spending >4 hours but <12 hours from decision to admit to admission | Unnamed: 9_level_0 | Unnamed: 10_level_0 | Unnamed: 11_level_0 | Number of patients spending >12 hours from decision to admit to admission | Unnamed: 13_level_0 | Unnamed: 14_level_0 | Unnamed: 15_level_0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 0_level_1 | Unnamed: 1_level_1 | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Total 4 to 12 hour Trolley Waits | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Total over 12 hour Trolley Waits | |
0 | AAH | TETBURY HOSPITAL TRUST LTD | 0 | 0 | 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | AD913 | BECKENHAM BEACON UCC | 0 | 0 | 796 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | 0 | 0 | 288 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#Flatten hierarchical column names
#http://stackoverflow.com/a/14508355/454773
df.columns = ['::'.join(col).strip() for col in df.columns.values]
df.head(3)
Unnamed: 0_level_0::Unnamed: 0_level_1 | Unnamed: 1_level_0::Unnamed: 1_level_1 | Number of A&E Attendances::Type 1 Departments - Major A&E | Unnamed: 3_level_0::Type 2 Departments - Single Specialty | Unnamed: 4_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission::Type 1 Departments - Major A&E | Unnamed: 6_level_0::Type 2 Departments - Single Specialty | Unnamed: 7_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Number of patients spending >4 hours but <12 hours from decision to admit to admission::Type 1 Departments - Major A&E | Unnamed: 9_level_0::Type 2 Departments - Single Specialty | Unnamed: 10_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Unnamed: 11_level_0::Total 4 to 12 hour Trolley Waits | Number of patients spending >12 hours from decision to admit to admission::Type 1 Departments - Major A&E | Unnamed: 13_level_0::Type 2 Departments - Single Specialty | Unnamed: 14_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Unnamed: 15_level_0::Total over 12 hour Trolley Waits | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | 0 | 0 | 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | AD913 | BECKENHAM BEACON UCC | 0 | 0 | 796 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | 0 | 0 | 288 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
df_long=pd.melt(df,id_vars=["Unnamed: 0_level_0::Unnamed: 0_level_1","Unnamed: 1_level_0::Unnamed: 1_level_1"])
df_long.head(3)
Unnamed: 0_level_0::Unnamed: 0_level_1 | Unnamed: 1_level_0::Unnamed: 1_level_1 | variable | value | |
---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | Number of A&E Attendances::Type 1 Departments ... | 0 |
1 | AD913 | BECKENHAM BEACON UCC | Number of A&E Attendances::Type 1 Departments ... | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | Number of A&E Attendances::Type 1 Departments ... | 0 |
df_long.columns=['Code','Name',"Type",'Count']
df_long.head(3)
Code | Name | Type | Count | |
---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | Number of A&E Attendances::Type 1 Departments ... | 0 |
1 | AD913 | BECKENHAM BEACON UCC | Number of A&E Attendances::Type 1 Departments ... | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | Number of A&E Attendances::Type 1 Departments ... | 0 |
df_long['Type0']=df_long['Type'].map(lambda x: x.split('::')[0])
df_long['Type1']=df_long['Type'].map(lambda x: x.split('::')[1])
df_long.head()
Code | Name | Type | Count | Type0 | Type1 | |
---|---|---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
1 | AD913 | BECKENHAM BEACON UCC | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
2 | AJN | WORKINGTON HEALTH LIMITED | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
3 | C82009 | MARKET HARBOROUGH MED.CTR | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
4 | C82010 | OAKHAM MEDICAL PRACTICE | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
df_long['Type0'].unique()
array(['Number of A&E Attendances', 'Unnamed: 3_level_0', 'Unnamed: 4_level_0', 'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission', 'Unnamed: 6_level_0', 'Unnamed: 7_level_0', 'Number of patients spending >4 hours but <12 hours from decision to admit to admission', 'Unnamed: 9_level_0', 'Unnamed: 10_level_0', 'Unnamed: 11_level_0', 'Number of patients spending >12 hours from decision to admit to admission', 'Unnamed: 13_level_0', 'Unnamed: 14_level_0', 'Unnamed: 15_level_0'], dtype=object)
#There's more than one way to do it...
#use .replace() with a dict, along the lines of:
#df_long['Type0']=df_long['Type0'].replace({'Unnamed: 3_level_0':'Number of A&E Attendances'})
#etc
#use .map()
colmap={'Number of A&E Attendances':'Number of A&E Attendances',
'Unnamed: 3_level_0':'Number of A&E Attendances',
'Unnamed: 4_level_0':'Number of A&E Attendances',
'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission':'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission',
'Unnamed: 6_level_0':'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission',
'Unnamed: 7_level_0':'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission',
'Number of patients spending >4 hours but <12 hours from decision to admit to admission':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Unnamed: 9_level_0':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Unnamed: 10_level_0':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Unnamed: 11_level_0':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Number of patients spending >12 hours from decision to admit to admission':'Number of patients spending >12 hours from decision to admit to admission',
'Unnamed: 13_level_0':'Number of patients spending >12 hours from decision to admit to admission',
'Unnamed: 14_level_0':'Number of patients spending >12 hours from decision to admit to admission',
'Unnamed: 15_level_0':'Number of patients spending >12 hours from decision to admit to admission'}
df_long['Type0']=df_long['Type0'].map(colmap)
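Rather than hand-coding the map, a more generic alternative (a sketch, untested against edge cases) is to detect pandas' placeholder labels and forward fill from the previous real heading; ffill_headings here is a made-up helper, not part of pandas:
import re

def ffill_headings(labels):
    ''' Carry the last real heading forward over "Unnamed: N_level_M" placeholder labels '''
    filled=[]
    for label in labels:
        #Assume pandas names placeholder columns in the form 'Unnamed: N_level_M'
        if re.match(r'Unnamed: \d+_level_\d+$', str(label)) and filled:
            filled.append(filled[-1])
        else:
            filled.append(label)
    return filled

#For example, because melt stacks the columns in order, this should also work directly:
#df_long['Type0']=ffill_headings(df_long['Type0'])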
One of the things we might want to do with this dataset is add back in some metadata relating to the date the data applies to. One source of this information is the filename:
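For example, for the files downloaded above, the date appears as the leading, dot-separated element of the filename, so something along these lines might recover it:
fn='2015.02.08-WSitAE-Extractions-AE4df65w.csv'
#The date is the part of the filename before the first hyphen
fn.split('-')[0]
#Returns: '2015.02.08'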
Another source is the metadata rows contained within the file itself:
import os

def WSitAE_AE_read(date,path='AEdata'):
    #Find the attendance file for the given date in the download directory
    ls=os.listdir(path)
    fn=None
    for item in ls:
        if item.find(date)!=-1 and item.find('WSitAE-Extractions-AE')!=-1:
            fn=item
    if fn is None: return pd.DataFrame()
    ffn='{path}/{fn}'.format(path=path,fn=fn)
    #Parse any key:value metadata pairs out of the first few lines of the file
    z={}
    with open(ffn,'r') as f:
        for i in range(0,3):
            s=f.readline()
            for a in s.split(','):
                if ':' in a:
                    z[a.strip().strip('"').split(':',1)[0]]=a.strip().strip('"').split(':',1)[1]
    #Read the file in without setting any headers
    dfw=pd.read_csv(ffn, header=None,skiprows=4)
    #Get rid of any empty columns
    dfw.dropna(how='all',axis=1,inplace=True)
    #The contents of the first two rows will be our column headers
    #Fill across so blank columns are named according to the previously named column
    dfw[:2].fillna(method='ffill', axis=1,inplace=True)
    #Generate a multiindex in the columns from the first two rows
    dfw.columns=pd.MultiIndex.from_arrays(dfw[:2].values, names=['Category','SubCategory'])
    #Flatten the columns multi-index to give a single set of flat column names
    dfw.columns = ['Code','Name']+['::'.join(col).strip() for col in dfw.columns.values[2:]]
    ##Hack a date from the period name - should really cast to a pandas period
    dfw['Period Name']=z['Period Name'].strip()
    dfw['Date']=z['Period Name'].replace('W/E','').strip()
    dfw['Year']=z['Year'].strip()
    return dfw[2:]
def WSitAE_AE_melt(df):
    #Declare the id_vars
    idx=['Code','Name','Year','Period Name','Date']
    df=pd.melt(df,id_vars=idx, var_name='Type',value_name='Value')
    df['SubType']=df['Type'].map(lambda x: x.split('::')[1])
    df['Type']=df['Type'].map(lambda x: x.split('::')[0])
    return df[idx+['Type','SubType','Value']]

def WSitAE_AE_long(date,path='AEdata'):
    df=WSitAE_AE_read(date,path)
    return WSitAE_AE_melt(df)
!ls AEdata
111-for-winter-daily-sitrep-up-to-20150301.xlsx 2014.12.07-WSitAE-Extractions-AE9ty4h21.csv 2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv 2014.12.14-WSitAE-Extractions-AE2yg54s1.csv 2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv 2014.12.21-WSitAE-Extractions-AE9yf54s.csv 2014.12.21-WSitAE-Extractions-Adm8yf54s.csv 2014.12.28-WSitAE-Extractions-AE4d6h8g.csv 2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv 2015.01.04-WSitAE-Extractions-AEp19ak47t.csv 2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv 2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv 2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv 2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv 2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv 2015.01.25-WSitAE-Extractions-AE4rt7hg.csv 2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv 2015.02.01-WSitAE-Extractions-AE9rb63g.csv 2015.02.01-WSitAE-Extractions-Adm9rb63g.csv 2015.02.08-WSitAE-Extractions-AE4df65w.csv 2015.02.08-WSitAE-Extractions-Adm4df65w.csv 2015.02.15-WSitAE-Extractions-AE9th3bd.csv 2015.02.15-WSitAE-Extractions-Adm9th3bd.csv 2015.02.22-WSitAE-Extractions-AE3h6yd5.csv 2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv 2015.03.01-WSitAE-Extractions-AE4g5fd5.csv 2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv 2015.03.08-AEHLnoU.xls DailySR-Timeseries-WE-01.03.15.xlsx DailySR-Timeseries-WE-14.12.141.xlsx DailySR-Web-file-WE-01.02.15.xlsx DailySR-Web-file-WE-01.03.15.xlsx DailySR-Web-file-WE-04.01.15.xlsx DailySR-Web-file-WE-07.12.14.xlsx DailySR-Web-file-WE-08.02.15.xlsx DailySR-Web-file-WE-11.01.15.xlsx DailySR-Web-file-WE-15.02.15.xlsx DailySR-Web-file-WE-18.01.15.xlsx DailySR-Web-file-WE-21.12.14.xlsx DailySR-Web-file-WE-22.02.15.xlsx DailySR-Web-file-WE-25.01.15.xlsx DailySR-Web-file-WE-28.12.14.xlsx
dfw=WSitAE_AE_long('2015.02.08')
dfw[:3]
Code | Name | Year | Period Name | Date | Type | SubType | Value | |
---|---|---|---|---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | 2014-15 | W/E 08/02/2015 | 08/02/2015 | Number of A&E Attendances | Type 1 Departments - Major A&E | 0 |
1 | AD913 | BECKENHAM BEACON UCC | 2014-15 | W/E 08/02/2015 | 08/02/2015 | Number of A&E Attendances | Type 1 Departments - Major A&E | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | 2014-15 | W/E 08/02/2015 | 08/02/2015 | Number of A&E Attendances | Type 1 Departments - Major A&E | 0 |
def WSitAE_Adm_read(date,path='AEdata'):
    #Find the admissions file for the given date in the download directory
    ls=os.listdir(path)
    fn=None
    for item in ls:
        if item.find(date)!=-1 and item.find('WSitAE-Extractions-Adm')!=-1:
            fn=item
    if fn is None: return pd.DataFrame()
    ffn='{path}/{fn}'.format(path=path,fn=fn)
    dfw=pd.read_csv(ffn, header=4)
    #Get rid of any empty columns
    dfw.dropna(how='all',axis=1,inplace=True)
    #Hack a date out of the period name
    dfw['Date']=dfw['Period Name'].str.replace('W/E','').str.strip()
    #Move the Date column to sit after the first three metadata columns
    cols=dfw.columns.tolist()
    cols=cols[:3]+cols[-1:]+cols[3:-1]
    return dfw[cols]
WSitAE_Adm_read('2015.02.15')[:3]
Basis | Year | Period Name | Date | Provider Parent Name | Provider Org Code | Provider Org Name | Emergency Adm Type 1 | Emergency Adm Type 2 | Emergency Adm Type 3 | Emergency Adm Other | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | PROV | 2014-15 | W/E 15/02/2015 | 15/02/2015 | CHESHIRE, WARRINGTON AND WIRRAL AREA TEAM | RTV | 5 BOROUGHS PARTNERSHIP NHS FOUNDATION TRUST | 0 | 0 | 0 | 0 |
1 | PROV | 2014-15 | W/E 15/02/2015 | 15/02/2015 | MERSEYSIDE AREA TEAM | REM | AINTREE UNIVERSITY HOSPITAL NHS FOUNDATION TRUST | 612 | 0 | 0 | 240 |
2 | PROV | 2014-15 | W/E 15/02/2015 | 15/02/2015 | WEST YORKSHIRE AREA TEAM | RCF | AIREDALE NHS FOUNDATION TRUST | 247 | 0 | 0 | 182 |
#Alternatively, we can load the data contained in the two CSV files from a single sheet of a published spreadsheet
dfx=pd.read_excel('AEdata/2015.03.08-AEHLnoU.xls',header=None,skiprows=14)
dfx.dropna(how='all',axis=1,inplace=True)
dfx[:1].fillna(method='ffill', axis=1,inplace=True)
dfx.columns=pd.MultiIndex.from_arrays(dfx[:2].values, names=['Category','SubCategory'])
dfx.columns = ['Area Team','Code','Name']+['::'.join(col).strip() for col in dfx.columns.values[3:]]
dfx[:3]
dfx[:3]
Area Team | Code | Name | A&E attendances::Type 1 Departments - Major A&E | A&E attendances::Type 2 Departments - Single Specialty | A&E attendances::Type 3 Departments - Other A&E/Minor Injury Unit | A&E attendances::Total Attendances | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Type 1 Departments - Major A&E | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Type 2 Departments - Single Specialty | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Type 3 Departments - Other A&E/Minor Injury Unit | ... | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Percentage in 4 hours or less (type 1) | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Percentage in 4 hours or less (all) | Emergency Admissions::Emergency Admissions via Type 1 A&E | Emergency Admissions::Emergency Admissions via Type 2 A&E | Emergency Admissions::Emergency Admissions via Type 3 and 4 A&E | Emergency Admissions::Total Emergency Admissions via A&E | Emergency Admissions::Other Emergency Admissions (i.e not via A&E) | Emergency Admissions::Total Emergency Admissions | Emergency Admissions::Number of patients spending >4 hours from decision to admit to admission | Emergency Admissions::Number of patients spending >12 hours from decision to admit to admission | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | A&E attendances | A&E attendances | A&E attendances | A&E attendances | A&E attendances > 4 hours from arrival to admi... | A&E attendances > 4 hours from arrival to admi... | A&E attendances > 4 hours from arrival to admi... | ... | A&E attendances > 4 hours from arrival to admi... | A&E attendances > 4 hours from arrival to admi... | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions |
1 | Area Team | Code | Name | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Total Attendances | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | ... | Percentage in 4 hours or less (type 1) | Percentage in 4 hours or less (all) | Emergency Admissions via Type 1 A&E | Emergency Admissions via Type 2 A&E | Emergency Admissions via Type 3 and 4 A&E | Total Emergency Admissions via A&E | Other Emergency Admissions (i.e not via A&E) | Total Emergency Admissions | Number of patients spending >4 hours from deci... | Number of patients spending >12 hours from dec... |
2 | - | - | England | 280281 | 12803 | 141690 | 434774 | 30734 | 41 | 388 | ... | 0.8903458 | 0.9283237 | 76472 | 318 | 964 | 77754 | 28890 | 106644 | 6911 | 17 |
3 rows × 21 columns
#Alternatively, we could have separated out the A&E data from the Emergency Admissions Data
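A sketch of that separation, assuming the '::'-flattened column names generated for dfx above, might select the columns for each grouping by their heading prefix:
meta_cols=['Area Team','Code','Name']
#Columns relating to A&E attendance measures (both attendance heading groups start 'A&E attendances')
ae_cols=[c for c in dfx.columns if c.startswith('A&E attendances')]
#Columns relating to emergency admissions
adm_cols=[c for c in dfx.columns if c.startswith('Emergency Admissions')]
df_ae=dfx[meta_cols+ae_cols]
df_adm=dfx[meta_cols+adm_cols]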
import pandas as pd

def DailySR_read(f):
    return pd.ExcelFile(f)
#xl=DailySR_read('AEdata/DailySR-Web-file-WE-01.03.15.xlsx')
xl=DailySR_read('AEdata/DailySR-Web-file-WE-07.12.14.xlsx')
xl.sheet_names
[u'A&E closures', u'A&E diverts', u'Cancelled operations', u'Critical care transfers', u'Ambulances queuing', u'G&A beds', u'D&V, Norovirus', u'Delayed transfers of care', u'Adult critical care', u'Paediatric intensive care', u'Neonatal intensive care ', u'Trusts with Ops problems', u'A&E Performance', u'Macro1']
def DailySR_parse(xl,sheet,skiprows=13,header=None):
    #Classify each sheet according to the layout it uses
    xlclass={'A&E closures':0,
             'A&E diverts':0,
             'Cancelled operations':1,
             'Critical care transfers':1,
             'Ambulances queuing':0,
             'G&A beds':1,
             'D&V, Norovirus':1,
             'Delayed transfers of care':0,
             'Adult critical care':1,
             'Paediatric intensive care':1,
             'Neonatal intensive care ':1,
             'Trusts with Ops problems':0,
             'A&E Performance':1#,
             #'Macro1':-1
            }
    if sheet not in xlclass: return pd.DataFrame()
    dfx=xl.parse(sheet,skiprows=skiprows,header=header)
    dfx.dropna(how='all',inplace=True)
    dfx.dropna(how='all',axis=1,inplace=True)
    if xlclass[sheet]==0:
        #Drop an empty row and the total count for England row
        dfx.drop([2,3],inplace=True)
        dfx.dropna(how='all',axis=1,inplace=True)
        dfx.columns=dfx.ix[1].values
        dfx=dfx[1:]
    elif xlclass[sheet]==1:
        #This is a real hack, trying to think what a sensible shape is and how to get there
        #Drop an empty row and the total count for England row
        dfx.drop([2,3],inplace=True)
        #Get rid of empty columns
        dfx.dropna(how='all',axis=1,inplace=True)
        #Fill across on dates
        dfx[:1].fillna(method='ffill', axis=1,inplace=True)
        #Patch the leading empty columns
        dfx[:1].fillna('Metadata',inplace=True)
        #Generate a multi-index across the columns from the first two rows
        dfx.columns=pd.MultiIndex.from_arrays(dfx[:2].values, names=['Category','SubCategory'])
        #Drop the rows we used to make the multi-index
        dfx=dfx[2:]
        #Transpose rows and columns
        dfx=dfx.T
        #Set the column names to be the hospital codes
        dfx.columns=dfx.ix['Metadata','Code'].values
        #And then throw away the metadata rows
        dfx=dfx[3:]
    #Tidy up by throwing away any empty columns we created
    dfx.dropna(how='all',axis=1,inplace=True)
    return dfx
xx=DailySR_parse(xl,'G&A beds')
xx[:4]
RJC | RKB | RLQ | RLT | RWP | RD1 | RN3 | RNZ | RTE | RBK | ... | RD3 | RDZ | RHM | RHU | RN5 | RAE | RCF | RR8 | RWY | RXF | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Category | SubCategory | |||||||||||||||||||||
2014-12-01 00:00:00 | Core beds avail | 485 | 1114 | 208 | 268 | 799 | 595 | 469 | 410 | 911 | 410 | ... | 410 | 582 | 1205 | 995 | 827 | 661 | 325 | 1960 | 596 | 930 |
Escalation beds avail | 0 | 8 | 0 | 18 | 51 | 2 | 18 | 20 | 38 | 56 | ... | 50 | 0 | 12 | 58 | 33 | 0 | 8 | 42 | 28 | 36 | |
Total beds avail | 485 | 1122 | 208 | 286 | 850 | 597 | 487 | 430 | 949 | 466 | ... | 460 | 582 | 1217 | 1053 | 860 | 661 | 333 | 2002 | 624 | 966 | |
Total beds occ'd | 445 | 1116 | 208 | 286 | 848 | 569 | 486 | 421 | 922 | 466 | ... | 435 | 537 | 1186 | 931 | 831 | 629 | 317 | 1738 | 613 | 933 |
4 rows × 153 columns
xx['RNZ'][:,'Escalation beds avail']
Category
2014-12-01 00:00:00    20
2014-12-02 00:00:00    28
2014-12-03 00:00:00    27
2014-12-04 00:00:00    20
05-07-Dec-14           21
Name: RNZ, dtype: object
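Alternatively, the .xs() method will take a cross-section through a level of the row multi-index; a sketch, assuming the Category/SubCategory level names set in DailySR_parse():
#Select the 'Escalation beds avail' rows for a single hospital code
xx.xs('Escalation beds avail',level='SubCategory')['RNZ']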
DailySR_parse(xl,'Paediatric intensive care')
RJC | RKB | RLQ | RLT | RWP | RD1 | RN3 | RNZ | RTE | RBK | ... | RD3 | RDZ | RHM | RHU | RN5 | RAE | RCF | RR8 | RWY | RXF | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Category | SubCategory | |||||||||||||||||||||
2014-12-01 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | |
2014-12-02 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | |
2014-12-03 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | |
2014-12-04 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | |
05-07-Dec-14 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | ... | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 12 | 0 | 0 |
10 rows × 153 columns
So now we've worked out the steps for getting the separate waiting time and sitrep data into shape. Where spreadsheets contain multiple heterogeneous sheets, we need to load them in using the pandas ExcelFile class, as the following example using a DailySR report shows.
dfx=pd.ExcelFile('AEdata/DailySR-Web-file-WE-01.02.15.xlsx')
dfx.sheet_names
[u'A&E closures', u'A&E diverts', u'Cancelled operations', u'Critical care transfers', u'Ambulances queuing', u'G&A beds', u'D&V, Norovirus', u'Delayed transfers of care', u'Adult critical care', u'Paediatric intensive care', u'Neonatal intensive care ', u'Trusts with Ops problems', u'A&E Performance', u'Macro1']
df=dfx.parse('G&A beds',header=None,skiprows=13,parse_cols="B,D:Y")
df.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | 2015-01-26 00:00:00 | NaN | NaN | NaN | 2015-01-27 00:00:00 | NaN | NaN | ... | NaN | NaN | 2015-01-29 00:00:00 | NaN | NaN | NaN | 30-Jan-01-Feb-15 | NaN | NaN | NaN |
1 | Area Team | Code | Name | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | ... | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd |
2 | - | - | ENGLAND | 97161 | 4589 | 101750 | 97278 | 97169 | 4439 | 101608 | ... | 101564 | 96301 | 97224 | 4206 | 101430 | 95890 | 97009 | 4352 | 101361 | 95320 |
3 | NaN | NaN | NaN | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
4 | Arden, Herefordshire And Worcestershire Area Team | RJC | South Warwickshire NHS Foundation Trust | 485 | 0 | 485 | 469 | 485 | 0 | 485 | ... | 485 | 467 | 485 | 0 | 485 | 482 | 485 | 0 | 485 | 470 |
5 rows × 23 columns
df.set_index([0,1,2],inplace=True)
df.head(2)
3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | ||||||||||||||||||||
NaN | NaN | NaN | 2015-01-26 00:00:00 | NaN | NaN | NaN | 2015-01-27 00:00:00 | NaN | NaN | NaN | 2015-01-28 00:00:00 | NaN | NaN | NaN | 2015-01-29 00:00:00 | NaN | NaN | NaN | 30-Jan-01-Feb-15 | NaN | NaN | NaN |
Area Team | Code | Name | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd |
#http://stackoverflow.com/a/24154138/454773
#fill in Null values in "Headers"
df[:1].fillna(method='ffill', axis=1,inplace=True)
#df[:1].fillna(value='Base',axis=1,inplace=True)
df.head(3)
3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | ||||||||||||||||||||
NaN | NaN | NaN | 2015-01-26 00:00:00 | 2015-01-26 00:00:00 | 2015-01-26 00:00:00 | 2015-01-26 00:00:00 | 2015-01-27 00:00:00 | 2015-01-27 00:00:00 | 2015-01-27 00:00:00 | 2015-01-27 00:00:00 | 2015-01-28 00:00:00 | 2015-01-28 00:00:00 | 2015-01-28 00:00:00 | 2015-01-28 00:00:00 | 2015-01-29 00:00:00 | 2015-01-29 00:00:00 | 2015-01-29 00:00:00 | 2015-01-29 00:00:00 | 30-Jan-01-Feb-15 | 30-Jan-01-Feb-15 | 30-Jan-01-Feb-15 | 30-Jan-01-Feb-15 |
Area Team | Code | Name | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd |
- | - | ENGLAND | 97161 | 4589 | 101750 | 97278 | 97169 | 4439 | 101608 | 96826 | 97281 | 4283 | 101564 | 96301 | 97224 | 4206 | 101430 | 95890 | 97009 | 4352 | 101361 | 95320 |
#http://stackoverflow.com/a/24154138/454773
#create multiindex column names
df.columns=pd.MultiIndex.from_arrays(df[:2].values, names=['Date','Report'])
df[4:].head(3).T
0 | Arden, Herefordshire And Worcestershire Area Team | |||
---|---|---|---|---|
1 | RJC | RKB | RLQ | |
2 | South Warwickshire NHS Foundation Trust | University Hospitals Coventry And Warwickshire NHS Trust | Wye Valley NHS Trust | |
Date | Report | |||
2015-01-26 00:00:00 | Core beds avail | 485 | 1112 | 208 |
Escalation beds avail | 0 | 27 | 0 | |
Total beds avail | 485 | 1139 | 208 | |
Total beds occ'd | 469 | 1127 | 200 | |
2015-01-27 00:00:00 | Core beds avail | 485 | 1113 | 208 |
Escalation beds avail | 0 | 28 | 0 | |
Total beds avail | 485 | 1141 | 208 | |
Total beds occ'd | 477 | 1132 | 203 | |
2015-01-28 00:00:00 | Core beds avail | 485 | 1112 | 208 |
Escalation beds avail | 0 | 13 | 0 | |
Total beds avail | 485 | 1125 | 208 | |
Total beds occ'd | 467 | 1108 | 204 | |
2015-01-29 00:00:00 | Core beds avail | 485 | 1113 | 208 |
Escalation beds avail | 0 | 21 | 0 | |
Total beds avail | 485 | 1134 | 208 | |
Total beds occ'd | 482 | 1115 | 201 | |
30-Jan-01-Feb-15 | Core beds avail | 485 | 1112 | 208 |
Escalation beds avail | 0 | 28 | 2 | |
Total beds avail | 485 | 1140 | 210 | |
Total beds occ'd | 470 | 1135 | 204 |
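As a final sketch, we might recast this frame into the same sort of long form we used for the WSitAE data by flattening the (Date, Report) column multi-index and melting; this is untested scaffolding in the spirit of the recipes above rather than a definitive implementation:
#Work on a copy of the data rows, skipping the header rows at the top of the frame
dfb=df[4:].copy()
#Flatten the (Date, Report) column multi-index into single '::' separated names, as before
dfb.columns=['::'.join([str(c) for c in col]) for col in dfb.columns.values]
#Bring the (Area Team, Code, Name) row index back in as columns
dfb=dfb.reset_index()
dfb.columns=['Area Team','Code','Name']+list(dfb.columns[3:])
#Melt into long form, then split the flattened names back out
dfb_long=pd.melt(dfb,id_vars=['Area Team','Code','Name'],var_name='Type',value_name='Value')
dfb_long['Date']=dfb_long['Type'].map(lambda x: x.split('::')[0])
dfb_long['Report']=dfb_long['Type'].map(lambda x: x.split('::')[1])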