One frequently encountered task when working with regularly produced datasets is downloading the 'latest release' spreadsheet or CSV data files. These can typically take one of three forms:
In each case, there is a requirement to regularly download the newly released file. In the first case, when starting a longitudinal analysis, we may also need to download the historical data files (the monthly releases for previous months, for example) from which a longitudinal dataset can be created.
Harvesting data refers to the collection of data from a known location. Harvesting may occur over time, for example according to a schedule - collecting a monthly dataset when it has been published - or it may involve collecting a set of data from multiple files or locations in order to seed a larger dataset.
Where regularly released datafiles are posted to a known web location with a long-lived URL, in the form of a link on that page to the datafile itself, possibly along with a set of previously posted links to historical datasets, we can often automate the collection of the data: scrape the links to the datafile or datafiles, then use those links as the basis of an automatic downloader.
import pandas as pd
#requests is a library that simplifies grabbing web pages and downloading files from web addresses / URLs
import requests
#BeautifulSoup is a library that helps parse HTML web pages and XML files
from bs4 import BeautifulSoup
url='http://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/weekly-ae-sitreps-2014-15/'
#Grab the contents of a file of an arbitrary type that is published at a particular web location using an HTTP GET
page = requests.get(url)
#The file we have grabbed in this case is a web page - that is, an HTML file
#We can get the content of the page and parse it
soup=BeautifulSoup(page.content,'html.parser')
#BeautifulSoup has a routine - find_all() - that will find all the HTML tags of a particular sort
#Links are represented in HTML pages in the form <a href="http://example.com/page.html">link text</a>
#Grab all the <a> (anchor) tags...
links=soup.find_all('a')
#We can iterate through all the links one at a time, calling each link "link" temporarily as we do so
for link in links:
    #Do a logical test to see if the linked URL ends in .csv, the conventional suffix for a CSV file
    #(using .get() in case an anchor tag has no href attribute)
    if link.get('href','').endswith('.csv'):
        #If the link does look like a link to a CSV file, print the URL out
        print(link['href'],link.text)
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.03.01-WSitAE-Extractions-AE4g5fd5.csv', u'A&E Attendances Week Ending 01.03.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv', u'A&E Admissions Week Ending 01.03.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.22-WSitAE-Extractions-AE3h6yd5.csv', u'A&E Attendances Week Ending 22.02.2015 (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv', u'A&E Admissions Week Ending 22.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.15-WSitAE-Extractions-AE9th3bd.csv', u'A&E Attendances Week Ending 15.02.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.15-WSitAE-Extractions-Adm9th3bd.csv', u'A&E Admissions Week Ending 15.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.08-WSitAE-Extractions-AE4df65w.csv', u'A&E Attendances Week Ending 08.02.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.08-WSitAE-Extractions-Adm4df65w.csv', u' A&E Admissions Week Ending 08.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.01-WSitAE-Extractions-AE9rb63g.csv', u'A&E Attendances Week Ending 01.02.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.02.01-WSitAE-Extractions-Adm9rb63g.csv', u'A&E Admissions Week Ending 01.02.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.25-WSitAE-Extractions-AE4rt7hg.csv', u'A&E Attendances Week Ending 25.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv', u'A&E Admissions Week Ending 25.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv', u'A&E Attendances Week Ending 18.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv', u'A&E Admissions Week Ending 18.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv', u'A&E Attendances Week Ending 11.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv', u'A&E Admissions Week Ending 11.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.04-WSitAE-Extractions-AEp19ak47t.csv', u'A&E Attendances Week Ending 04.01.2015 (CSV, 22K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv', u'A&E Admissions Week Ending 04.01.2015 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.28-WSitAE-Extractions-AE4d6h8g.csv', u'A&E Attendances Week Ending 28.12.2014 (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv', u'A&E Admissions Week Ending 28.12.2014 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.21-WSitAE-Extractions-AE9yf54s.csv', u'A&E Attendances Week Ending 21.12.2014 (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.21-WSitAE-Extractions-Adm8yf54s.csv', u'A&E Admissions Week Ending 21.12.2014 (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.14-WSitAE-Extractions-AE2yg54s1.csv', u'A&E Attendances Week Ending 14.12.2014 (Revised 06.01.2015) (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv', u'A&E Admissions Week Ending 14.12.2014 (Revised 06.01.2015) (CSV, 32K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.07-WSitAE-Extractions-AE9ty4h21.csv', u'A&E Attendances Week Ending 07.12.2014 (Revised 06.01.2015) (CSV, 21K)')
('http://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2014/04/2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv', u'A&E Admissions Week Ending 07.12.2014 (Revised 06.01.2015) (CSV, 32K)')
So we've seen how we can grab a list of links to CSV files that are published via a single web page. How does that help us harvest these files?
import os
directory='AEdata'
#Create a new directory - AEdata - if it doesn't already exist
if not os.path.exists(directory):
    os.makedirs(directory)
##Via http://stackoverflow.com/a/16696317/454773 (with a path argument added)
#The following function seems to be a pretty robust, generic function for downloading an arbitrary file to our local machine
import requests

def download_file(url,path):
    ''' Download a file from a particular URL into the directory at path, keeping the original filename '''
    local_filename = url.split('/')[-1]
    local_filename='/'.join([path,local_filename])
    # NOTE the stream=True parameter, which lets us download the file in chunks
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename
#In the same way we iterated through the list of links extracted from a web page so we could print them out,
#we can iterate through the list of links so we can download the corresponding files
for link in links:
    if link.get('href','').endswith('.csv'):
        fn=download_file(link['href'],directory)
        print("Downloaded {}".format(fn))
Downloaded AEdata/2015.03.01-WSitAE-Extractions-AE4g5fd5.csv
Downloaded AEdata/2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv
Downloaded AEdata/2015.02.22-WSitAE-Extractions-AE3h6yd5.csv
Downloaded AEdata/2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv
Downloaded AEdata/2015.02.15-WSitAE-Extractions-AE9th3bd.csv
Downloaded AEdata/2015.02.15-WSitAE-Extractions-Adm9th3bd.csv
Downloaded AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv
Downloaded AEdata/2015.02.08-WSitAE-Extractions-Adm4df65w.csv
Downloaded AEdata/2015.02.01-WSitAE-Extractions-AE9rb63g.csv
Downloaded AEdata/2015.02.01-WSitAE-Extractions-Adm9rb63g.csv
Downloaded AEdata/2015.01.25-WSitAE-Extractions-AE4rt7hg.csv
Downloaded AEdata/2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv
Downloaded AEdata/2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv
Downloaded AEdata/2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv
Downloaded AEdata/2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv
Downloaded AEdata/2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv
Downloaded AEdata/2015.01.04-WSitAE-Extractions-AEp19ak47t.csv
Downloaded AEdata/2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv
Downloaded AEdata/2014.12.28-WSitAE-Extractions-AE4d6h8g.csv
Downloaded AEdata/2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv
Downloaded AEdata/2014.12.21-WSitAE-Extractions-AE9yf54s.csv
Downloaded AEdata/2014.12.21-WSitAE-Extractions-Adm8yf54s.csv
Downloaded AEdata/2014.12.14-WSitAE-Extractions-AE2yg54s1.csv
Downloaded AEdata/2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv
Downloaded AEdata/2014.12.07-WSitAE-Extractions-AE9ty4h21.csv
Downloaded AEdata/2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv
#The os.listdir(PATH) command returns a list of the files contained in the directory at PATH
os.listdir('./AEdata')[:3]
['111-for-winter-daily-sitrep-up-to-20150301.xlsx', '2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv', '2014.12.07-WSitAE-Extractions-AE9ty4h21.csv']
#ACTIVITY - How would you identify whether or not a link to a file you haven't already downloaded has been posted?
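One possible approach, sketched here rather than given as a definitive answer, is to compare the filename part of each linked URL against the files already present in the download directory:
#Files we already have local copies of
already=set(os.listdir(directory))
for link in links:
    href=link.get('href','')
    #Flag up any CSV links whose filename we haven't already downloaded
    if href.endswith('.csv') and href.split('/')[-1] not in already:
        print('Not yet downloaded:',href)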
#Automation
#Once we have worked out a recipe of one or more steps that is likely to be reusable,
#we might create a function that combines those steps so they can be called from a single command.
#For example, we could combine the command that gets the links from a page
#with a line that finds the ones corresponding to a particular filetype.
#Bear in mind that the function may depend on imported libraries or other functions:
#if we put the function into a file, we should make sure we also import or declare anything else it requires
def linkscraper(url,path,filetypes):
    ''' url: a web page containing a list of links
        path: a path to a local download directory
        filetypes: a list of filetype suffixes we want to check for'''
    #Create the download directory if it doesn't already exist
    if not os.path.exists(path):
        os.makedirs(path)
    page = requests.get(url)
    soup=BeautifulSoup(page.content,'html.parser')
    links=soup.find_all('a')
    for link in links:
        filetype=link.get('href','').split('.')[-1]
        if filetype in filetypes:
            fn=download_file(link['href'],path)
            print("Downloaded {}".format(fn))
#We now have a single command that will download files of a particular type linked to from a given URL
url='http://www.england.nhs.uk/statistics/statistical-work-areas/winter-daily-sitreps/winter-daily-sitrep-2014-15-data/'
linkscraper(url,directory,['xlsx'])
Downloaded AEdata/DailySR-Timeseries-WE-01.03.15.xlsx
Downloaded AEdata/111-for-winter-daily-sitrep-up-to-20150301.xlsx
Downloaded AEdata/DailySR-Web-file-WE-01.03.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-22.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-15.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-08.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-01.02.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-25.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-18.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-11.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-04.01.15.xlsx
Downloaded AEdata/DailySR-Web-file-WE-28.12.14.xlsx
Downloaded AEdata/DailySR-Web-file-WE-21.12.14.xlsx
Downloaded AEdata/DailySR-Timeseries-WE-14.12.141.xlsx
Downloaded AEdata/DailySR-Web-file-WE-07.12.14.xlsx
We might want to further refine the downloader so that in future it only downloads files we haven't already downloaded. Note that this might also require checking against an exceptions list, e.g. for spreadsheets or CSV files that keep the same filename but are updated on a regular basis.
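A sketch of such a refinement might look something like the following, where always_fetch is a hypothetical exceptions list of filenames that should be re-downloaded even if we already have a copy:
def download_file_if_new(url,path,always_fetch=None):
    ''' Download a file only if we don't already have a local copy,
        unless its filename appears in the always_fetch exceptions list '''
    if always_fetch is None: always_fetch=[]
    local_filename=url.split('/')[-1]
    #Skip files we already hold, unless they are flagged as regularly updated
    if local_filename in os.listdir(path) and local_filename not in always_fetch:
        return None
    return download_file(url,path)
The linkscraper() function could then call download_file_if_new() rather than download_file().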
!ls AEdata/
111-for-winter-daily-sitrep-up-to-20150301.xlsx 2014.12.07-WSitAE-Extractions-AE9ty4h21.csv 2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv 2014.12.14-WSitAE-Extractions-AE2yg54s1.csv 2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv 2014.12.21-WSitAE-Extractions-AE9yf54s.csv 2014.12.21-WSitAE-Extractions-Adm8yf54s.csv 2014.12.28-WSitAE-Extractions-AE4d6h8g.csv 2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv 2015.01.04-WSitAE-Extractions-AEp19ak47t.csv 2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv 2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv 2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv 2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv 2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv 2015.01.25-WSitAE-Extractions-AE4rt7hg.csv 2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv 2015.02.01-WSitAE-Extractions-AE9rb63g.csv 2015.02.01-WSitAE-Extractions-Adm9rb63g.csv 2015.02.08-WSitAE-Extractions-AE4df65w.csv 2015.02.08-WSitAE-Extractions-Adm4df65w.csv 2015.02.15-WSitAE-Extractions-AE9th3bd.csv 2015.02.15-WSitAE-Extractions-Adm9th3bd.csv 2015.02.22-WSitAE-Extractions-AE3h6yd5.csv 2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv 2015.03.01-WSitAE-Extractions-AE4g5fd5.csv 2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv DailySR-Timeseries-WE-01.03.15.xlsx DailySR-Timeseries-WE-14.12.141.xlsx DailySR-Web-file-WE-01.02.15.xlsx DailySR-Web-file-WE-01.03.15.xlsx DailySR-Web-file-WE-04.01.15.xlsx DailySR-Web-file-WE-07.12.14.xlsx DailySR-Web-file-WE-08.02.15.xlsx DailySR-Web-file-WE-11.01.15.xlsx DailySR-Web-file-WE-15.02.15.xlsx DailySR-Web-file-WE-18.01.15.xlsx DailySR-Web-file-WE-21.12.14.xlsx DailySR-Web-file-WE-22.02.15.xlsx DailySR-Web-file-WE-25.01.15.xlsx DailySR-Web-file-WE-28.12.14.xlsx
When working with third party spreadsheets, it is not uncommon to find layouts that include: metadata elements in the first sheet or at the top of a sheet; hierarchically declared column headings, where, for example, a heading spans several columns in one row and is then subdivided into subheadings that appear in the next row; footnotes that appear at the bottom of a spreadsheet; and even sidenotes that are pasted into additional columns.
Even simpler CSV files may include hierarchical indices and leading and trailing metadata or additional notes.
In the following example we'll start to look at the WSitAE reports, followed by the DailySR reports.
!head AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv
"Weekly A&E Sitreps - A&E Activity" "Year:2014-15","Period Name:W/E 08/02/2015","Provider Org Name:<All>" "","","Number of A&E Attendances",,,"Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission",,,"Number of patients spending >4 hours but <12 hours from decision to admit to admission",,,,"Number of patients spending >12 hours from decision to admit to admission",,, "","","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total 4 to 12 hour Trolley Waits","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total over 12 hour Trolley Waits" "AAH","TETBURY HOSPITAL TRUST LTD",0,0,63,0,0,0,0,0,0,0,0,0,0,0 "AD913","BECKENHAM BEACON UCC",0,0,796,0,0,0,0,0,0,0,0,0,0,0 "AJN","WORKINGTON HEALTH LIMITED",0,0,288,0,0,0,0,0,0,0,0,0,0,0 "C82009","MARKET HARBOROUGH MED.CTR",0,0,298,0,0,0,0,0,0,0,0,0,0,0
#The following recipe will open a file and then read and print the first few lines
with open('AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv','r') as f:
    for i in range(0,7):
        print('>>{}<<'.format(f.readline()))
>>"Weekly A&E Sitreps - A&E Activity" << >> << >>"Year:2014-15","Period Name:W/E 08/02/2015","Provider Org Name:<All>" << >> << >>"","","Number of A&E Attendances",,,"Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission",,,"Number of patients spending >4 hours but <12 hours from decision to admit to admission",,,,"Number of patients spending >12 hours from decision to admit to admission",,, << >>"","","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total 4 to 12 hour Trolley Waits","Type 1 Departments - Major A&E","Type 2 Departments - Single Specialty","Type 3 Departments - Other A&E/Minor Injury Unit","Total over 12 hour Trolley Waits" << >>"AAH","TETBURY HOSPITAL TRUST LTD",0,0,63,0,0,0,0,0,0,0,0,0,0,0 <<
Inspecting the first few rows of the file, we notice that we need some way of representing a hierarchical index. In pandas, this is referred to as a MultiIndex. Working with such structures can be confusing: it is worth keeping notes of any useful recipes you work out, along with an up-to-date set of worked examples demonstrating principles and techniques that may be useful to you.
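As a minimal sketch of the idea (the group and column names here are made up purely for illustration), a two-level set of column headings can be built from two aligned lists of labels:
#A hypothetical two-level column MultiIndex over a one row dataframe
pd.DataFrame([[1,2,3]],
             columns=pd.MultiIndex.from_arrays([['Group A','Group A','Group B'],
                                                ['x','y','x']]))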
?pd.read_csv
df=pd.read_csv("AEdata/2015.02.08-WSitAE-Extractions-AE4df65w.csv", header=[4,5])
df.head(3)
Unnamed: 0_level_0 | Unnamed: 1_level_0 | Number of A&E Attendances | Unnamed: 3_level_0 | Unnamed: 4_level_0 | Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission | Unnamed: 6_level_0 | Unnamed: 7_level_0 | Number of patients spending >4 hours but <12 hours from decision to admit to admission | Unnamed: 9_level_0 | Unnamed: 10_level_0 | Unnamed: 11_level_0 | Number of patients spending >12 hours from decision to admit to admission | Unnamed: 13_level_0 | Unnamed: 14_level_0 | Unnamed: 15_level_0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 0_level_1 | Unnamed: 1_level_1 | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Total 4 to 12 hour Trolley Waits | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Total over 12 hour Trolley Waits | |
0 | AAH | TETBURY HOSPITAL TRUST LTD | 0 | 0 | 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | AD913 | BECKENHAM BEACON UCC | 0 | 0 | 796 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | 0 | 0 | 288 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#Flatten hierarchical column names
#http://stackoverflow.com/a/14508355/454773
df.columns = ['::'.join(col).strip() for col in df.columns.values]
df.head(3)
Unnamed: 0_level_0::Unnamed: 0_level_1 | Unnamed: 1_level_0::Unnamed: 1_level_1 | Number of A&E Attendances::Type 1 Departments - Major A&E | Unnamed: 3_level_0::Type 2 Departments - Single Specialty | Unnamed: 4_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission::Type 1 Departments - Major A&E | Unnamed: 6_level_0::Type 2 Departments - Single Specialty | Unnamed: 7_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Number of patients spending >4 hours but <12 hours from decision to admit to admission::Type 1 Departments - Major A&E | Unnamed: 9_level_0::Type 2 Departments - Single Specialty | Unnamed: 10_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Unnamed: 11_level_0::Total 4 to 12 hour Trolley Waits | Number of patients spending >12 hours from decision to admit to admission::Type 1 Departments - Major A&E | Unnamed: 13_level_0::Type 2 Departments - Single Specialty | Unnamed: 14_level_0::Type 3 Departments - Other A&E/Minor Injury Unit | Unnamed: 15_level_0::Total over 12 hour Trolley Waits | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | 0 | 0 | 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | AD913 | BECKENHAM BEACON UCC | 0 | 0 | 796 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | 0 | 0 | 288 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
df_long=pd.melt(df,id_vars=["Unnamed: 0_level_0::Unnamed: 0_level_1","Unnamed: 1_level_0::Unnamed: 1_level_1"])
df_long.head(3)
Unnamed: 0_level_0::Unnamed: 0_level_1 | Unnamed: 1_level_0::Unnamed: 1_level_1 | variable | value | |
---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | Number of A&E Attendances::Type 1 Departments ... | 0 |
1 | AD913 | BECKENHAM BEACON UCC | Number of A&E Attendances::Type 1 Departments ... | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | Number of A&E Attendances::Type 1 Departments ... | 0 |
df_long.columns=['Code','Name',"Type",'Count']
df_long.head(3)
Code | Name | Type | Count | |
---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | Number of A&E Attendances::Type 1 Departments ... | 0 |
1 | AD913 | BECKENHAM BEACON UCC | Number of A&E Attendances::Type 1 Departments ... | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | Number of A&E Attendances::Type 1 Departments ... | 0 |
df_long['Type0']=df_long['Type'].map(lambda x: x.split('::')[0])
df_long['Type1']=df_long['Type'].map(lambda x: x.split('::')[1])
df_long.head()
Code | Name | Type | Count | Type0 | Type1 | |
---|---|---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
1 | AD913 | BECKENHAM BEACON UCC | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
2 | AJN | WORKINGTON HEALTH LIMITED | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
3 | C82009 | MARKET HARBOROUGH MED.CTR | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
4 | C82010 | OAKHAM MEDICAL PRACTICE | Number of A&E Attendances::Type 1 Departments ... | 0 | Number of A&E Attendances | Type 1 Departments - Major A&E |
df_long['Type0'].unique()
array(['Number of A&E Attendances', 'Unnamed: 3_level_0', 'Unnamed: 4_level_0', 'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission', 'Unnamed: 6_level_0', 'Unnamed: 7_level_0', 'Number of patients spending >4 hours but <12 hours from decision to admit to admission', 'Unnamed: 9_level_0', 'Unnamed: 10_level_0', 'Unnamed: 11_level_0', 'Number of patients spending >12 hours from decision to admit to admission', 'Unnamed: 13_level_0', 'Unnamed: 14_level_0', 'Unnamed: 15_level_0'], dtype=object)
#There's more than one way to do it...
#use .replace() with a dict, along the lines of:
#df_long['Type0']=df_long['Type0'].replace({'Unnamed: 3_level_0':'Number of A&E Attendances'})
#etc
#use .map()
colmap={'Number of A&E Attendances':'Number of A&E Attendances',
'Unnamed: 3_level_0':'Number of A&E Attendances',
'Unnamed: 4_level_0':'Number of A&E Attendances',
'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission':'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission',
'Unnamed: 6_level_0':'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission',
'Unnamed: 7_level_0':'Number of patients spending >4 hours in A&E from arrival to discharge, transfer or admission',
'Number of patients spending >4 hours but <12 hours from decision to admit to admission':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Unnamed: 9_level_0':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Unnamed: 10_level_0':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Unnamed: 11_level_0':'Number of patients spending >4 hours but <12 hours from decision to admit to admission',
'Number of patients spending >12 hours from decision to admit to admission':'Number of patients spending >12 hours from decision to admit to admission',
'Unnamed: 13_level_0':'Number of patients spending >12 hours from decision to admit to admission',
'Unnamed: 14_level_0':'Number of patients spending >12 hours from decision to admit to admission',
'Unnamed: 15_level_0':'Number of patients spending >12 hours from decision to admit to admission'}
df_long['Type0']=df_long['Type0'].map(colmap)
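Rather than hand-coding the map, a more generic alternative (a sketch, untested against edge cases) is to detect pandas' placeholder labels and forward fill from the previous real heading; ffill_headings here is a made-up helper, not part of pandas:
import re

def ffill_headings(labels):
    ''' Carry the last real heading forward over "Unnamed: N_level_M" placeholder labels '''
    filled=[]
    for label in labels:
        #Assume pandas names placeholder columns in the form 'Unnamed: N_level_M'
        if re.match(r'Unnamed: \d+_level_\d+$', str(label)) and filled:
            filled.append(filled[-1])
        else:
            filled.append(label)
    return filled

#For example, because melt stacks the columns in order, this should also work directly:
#df_long['Type0']=ffill_headings(df_long['Type0'])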
One of the things we might want to do with this dataset is add back in some metadata relating to the date the data applies to. One source of this information is the filename:
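For example, for the files downloaded above, the date appears as the leading, dot-separated element of the filename, so something along these lines might recover it:
fn='2015.02.08-WSitAE-Extractions-AE4df65w.csv'
#The date is the part of the filename before the first hyphen
fn.split('-')[0]
#Returns: '2015.02.08'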
Another source is the metadata rows contained within the file itself:
import os

def WSitAE_AE_read(date,path='AEdata'):
    #Find the attendance file for the given date in the download directory
    ls=os.listdir(path)
    fn=None
    for item in ls:
        if item.find(date)!=-1 and item.find('WSitAE-Extractions-AE')!=-1:
            fn=item
    if fn is None: return pd.DataFrame()
    ffn='{path}/{fn}'.format(path=path,fn=fn)
    #Parse any key:value metadata pairs out of the first few lines of the file
    z={}
    with open(ffn,'r') as f:
        for i in range(0,3):
            s=f.readline()
            for a in s.split(','):
                if ':' in a:
                    z[a.strip().strip('"').split(':',1)[0]]=a.strip().strip('"').split(':',1)[1]
    #Read the file in without setting any headers
    dfw=pd.read_csv(ffn, header=None,skiprows=4)
    #Get rid of any empty columns
    dfw.dropna(how='all',axis=1,inplace=True)
    #The contents of the first two rows will be our column headers
    #Fill across so blank columns are named according to the previously named column
    dfw[:2].fillna(method='ffill', axis=1,inplace=True)
    #Generate a multiindex in the columns from the first two rows
    dfw.columns=pd.MultiIndex.from_arrays(dfw[:2].values, names=['Category','SubCategory'])
    #Flatten the columns multi-index to give a single set of flat column names
    dfw.columns = ['Code','Name']+['::'.join(col).strip() for col in dfw.columns.values[2:]]
    ##Hack a date from the period name - should really cast to a pandas period
    dfw['Period Name']=z['Period Name'].strip()
    dfw['Date']=z['Period Name'].replace('W/E','').strip()
    dfw['Year']=z['Year'].strip()
    return dfw[2:]
def WSitAE_AE_melt(df):
    #Declare the id_vars
    idx=['Code','Name','Year','Period Name','Date']
    df=pd.melt(df,id_vars=idx, var_name='Type',value_name='Value')
    df['SubType']=df['Type'].map(lambda x: x.split('::')[1])
    df['Type']=df['Type'].map(lambda x: x.split('::')[0])
    return df[idx+['Type','SubType','Value']]

def WSitAE_AE_long(date,path='AEdata'):
    df=WSitAE_AE_read(date,path)
    return WSitAE_AE_melt(df)
!ls AEdata
111-for-winter-daily-sitrep-up-to-20150301.xlsx 2014.12.07-WSitAE-Extractions-AE9ty4h21.csv 2014.12.07-WSitAE-Extractions-Adm9rh46d1.csv 2014.12.14-WSitAE-Extractions-AE2yg54s1.csv 2014.12.14-WSitAE-Extractions-Adm2yg54s1.csv 2014.12.21-WSitAE-Extractions-AE9yf54s.csv 2014.12.21-WSitAE-Extractions-Adm8yf54s.csv 2014.12.28-WSitAE-Extractions-AE4d6h8g.csv 2014.12.28-WSitAE-Extractions-Adm4d6h8g.csv 2015.01.04-WSitAE-Extractions-AEp19ak47t.csv 2015.01.04-WSitAE-Extractions-Adm19pa47kt.csv 2015.01.11-WSitAE-Extractions-AEy5rh47rt.csv 2015.01.11-WSitAE-Extractions-Admy5rh47rt.csv 2015.01.18-WSitAE-Extractions-AE8y7t4gd7.csv 2015.01.18-WSitAE-Extractions-Adm8y7t4gd7.csv 2015.01.25-WSitAE-Extractions-AE4rt7hg.csv 2015.01.25-WSitAE-Extractions-Adm4rt7hg.csv 2015.02.01-WSitAE-Extractions-AE9rb63g.csv 2015.02.01-WSitAE-Extractions-Adm9rb63g.csv 2015.02.08-WSitAE-Extractions-AE4df65w.csv 2015.02.08-WSitAE-Extractions-Adm4df65w.csv 2015.02.15-WSitAE-Extractions-AE9th3bd.csv 2015.02.15-WSitAE-Extractions-Adm9th3bd.csv 2015.02.22-WSitAE-Extractions-AE3h6yd5.csv 2015.02.22-WSitAE-Extractions-Adm3h6yd5.csv 2015.03.01-WSitAE-Extractions-AE4g5fd5.csv 2015.03.01-WSitAE-Extractions-Adm4g5fd5.csv 2015.03.08-AEHLnoU.xls DailySR-Timeseries-WE-01.03.15.xlsx DailySR-Timeseries-WE-14.12.141.xlsx DailySR-Web-file-WE-01.02.15.xlsx DailySR-Web-file-WE-01.03.15.xlsx DailySR-Web-file-WE-04.01.15.xlsx DailySR-Web-file-WE-07.12.14.xlsx DailySR-Web-file-WE-08.02.15.xlsx DailySR-Web-file-WE-11.01.15.xlsx DailySR-Web-file-WE-15.02.15.xlsx DailySR-Web-file-WE-18.01.15.xlsx DailySR-Web-file-WE-21.12.14.xlsx DailySR-Web-file-WE-22.02.15.xlsx DailySR-Web-file-WE-25.01.15.xlsx DailySR-Web-file-WE-28.12.14.xlsx
dfw=WSitAE_AE_long('2015.02.08')
dfw[:3]
Code | Name | Year | Period Name | Date | Type | SubType | Value | |
---|---|---|---|---|---|---|---|---|
0 | AAH | TETBURY HOSPITAL TRUST LTD | 2014-15 | W/E 08/02/2015 | 08/02/2015 | Number of A&E Attendances | Type 1 Departments - Major A&E | 0 |
1 | AD913 | BECKENHAM BEACON UCC | 2014-15 | W/E 08/02/2015 | 08/02/2015 | Number of A&E Attendances | Type 1 Departments - Major A&E | 0 |
2 | AJN | WORKINGTON HEALTH LIMITED | 2014-15 | W/E 08/02/2015 | 08/02/2015 | Number of A&E Attendances | Type 1 Departments - Major A&E | 0 |
def WSitAE_Adm_read(date,path='AEdata'):
    #Find the admissions file for the given date in the download directory
    ls=os.listdir(path)
    fn=None
    for item in ls:
        if item.find(date)!=-1 and item.find('WSitAE-Extractions-Adm')!=-1:
            fn=item
    if fn is None: return pd.DataFrame()
    ffn='{path}/{fn}'.format(path=path,fn=fn)
    dfw=pd.read_csv(ffn, header=4)
    #Get rid of any empty columns
    dfw.dropna(how='all',axis=1,inplace=True)
    #Hack a date out of the period name
    dfw['Date']=dfw['Period Name'].str.replace('W/E','').str.strip()
    #Move the Date column to sit after the first three metadata columns
    cols=dfw.columns.tolist()
    cols=cols[:3]+cols[-1:]+cols[3:-1]
    return dfw[cols]
WSitAE_Adm_read('2015.02.15')[:3]
Basis | Year | Period Name | Date | Provider Parent Name | Provider Org Code | Provider Org Name | Emergency Adm Type 1 | Emergency Adm Type 2 | Emergency Adm Type 3 | Emergency Adm Other | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | PROV | 2014-15 | W/E 15/02/2015 | 15/02/2015 | CHESHIRE, WARRINGTON AND WIRRAL AREA TEAM | RTV | 5 BOROUGHS PARTNERSHIP NHS FOUNDATION TRUST | 0 | 0 | 0 | 0 |
1 | PROV | 2014-15 | W/E 15/02/2015 | 15/02/2015 | MERSEYSIDE AREA TEAM | REM | AINTREE UNIVERSITY HOSPITAL NHS FOUNDATION TRUST | 612 | 0 | 0 | 240 |
2 | PROV | 2014-15 | W/E 15/02/2015 | 15/02/2015 | WEST YORKSHIRE AREA TEAM | RCF | AIREDALE NHS FOUNDATION TRUST | 247 | 0 | 0 | 182 |
#Alternatively, we can load the data contained in the two CSV files from a single sheet of a published spreadsheet
dfx=pd.read_excel('AEdata/2015.03.08-AEHLnoU.xls',header=None,skiprows=14)
dfx.dropna(how='all',axis=1,inplace=True)
dfx[:1].fillna(method='ffill', axis=1,inplace=True)
dfx.columns=pd.MultiIndex.from_arrays(dfx[:2].values, names=['Category','SubCategory'])
dfx.columns = ['Area Team','Code','Name']+['::'.join(col).strip() for col in dfx.columns.values[3:]]
dfx[:3]
dfx[:3]
Area Team | Code | Name | A&E attendances::Type 1 Departments - Major A&E | A&E attendances::Type 2 Departments - Single Specialty | A&E attendances::Type 3 Departments - Other A&E/Minor Injury Unit | A&E attendances::Total Attendances | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Type 1 Departments - Major A&E | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Type 2 Departments - Single Specialty | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Type 3 Departments - Other A&E/Minor Injury Unit | ... | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Percentage in 4 hours or less (type 1) | A&E attendances > 4 hours from arrival to admission, transfer or discharge::Percentage in 4 hours or less (all) | Emergency Admissions::Emergency Admissions via Type 1 A&E | Emergency Admissions::Emergency Admissions via Type 2 A&E | Emergency Admissions::Emergency Admissions via Type 3 and 4 A&E | Emergency Admissions::Total Emergency Admissions via A&E | Emergency Admissions::Other Emergency Admissions (i.e not via A&E) | Emergency Admissions::Total Emergency Admissions | Emergency Admissions::Number of patients spending >4 hours from decision to admit to admission | Emergency Admissions::Number of patients spending >12 hours from decision to admit to admission | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | A&E attendances | A&E attendances | A&E attendances | A&E attendances | A&E attendances > 4 hours from arrival to admi... | A&E attendances > 4 hours from arrival to admi... | A&E attendances > 4 hours from arrival to admi... | ... | A&E attendances > 4 hours from arrival to admi... | A&E attendances > 4 hours from arrival to admi... | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions | Emergency Admissions |
1 | Area Team | Code | Name | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | Total Attendances | Type 1 Departments - Major A&E | Type 2 Departments - Single Specialty | Type 3 Departments - Other A&E/Minor Injury Unit | ... | Percentage in 4 hours or less (type 1) | Percentage in 4 hours or less (all) | Emergency Admissions via Type 1 A&E | Emergency Admissions via Type 2 A&E | Emergency Admissions via Type 3 and 4 A&E | Total Emergency Admissions via A&E | Other Emergency Admissions (i.e not via A&E) | Total Emergency Admissions | Number of patients spending >4 hours from deci... | Number of patients spending >12 hours from dec... |
2 | - | - | England | 280281 | 12803 | 141690 | 434774 | 30734 | 41 | 388 | ... | 0.8903458 | 0.9283237 | 76472 | 318 | 964 | 77754 | 28890 | 106644 | 6911 | 17 |
3 rows × 21 columns
#Alternatively, we could have separated out the A&E data from the Emergency Admissions Data
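A sketch of that separation, assuming the '::'-flattened column names generated for dfx above, might select the columns for each grouping by their heading prefix:
meta_cols=['Area Team','Code','Name']
#Columns relating to A&E attendance measures (both attendance heading groups start 'A&E attendances')
ae_cols=[c for c in dfx.columns if c.startswith('A&E attendances')]
#Columns relating to emergency admissions
adm_cols=[c for c in dfx.columns if c.startswith('Emergency Admissions')]
df_ae=dfx[meta_cols+ae_cols]
df_adm=dfx[meta_cols+adm_cols]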
import pandas as pd

def DailySR_read(f):
    return pd.ExcelFile(f)
#xl=DailySR_read('AEdata/DailySR-Web-file-WE-01.03.15.xlsx')
xl=DailySR_read('AEdata/DailySR-Web-file-WE-07.12.14.xlsx')
xl.sheet_names
[u'A&E closures', u'A&E diverts', u'Cancelled operations', u'Critical care transfers', u'Ambulances queuing', u'G&A beds', u'D&V, Norovirus', u'Delayed transfers of care', u'Adult critical care', u'Paediatric intensive care', u'Neonatal intensive care ', u'Trusts with Ops problems', u'A&E Performance', u'Macro1']
def DailySR_parse(xl,sheet,skiprows=13,header=None):
    #Classify each sheet according to the layout it uses
    xlclass={'A&E closures':0,
             'A&E diverts':0,
             'Cancelled operations':1,
             'Critical care transfers':1,
             'Ambulances queuing':0,
             'G&A beds':1,
             'D&V, Norovirus':1,
             'Delayed transfers of care':0,
             'Adult critical care':1,
             'Paediatric intensive care':1,
             'Neonatal intensive care ':1,
             'Trusts with Ops problems':0,
             'A&E Performance':1#,
             #'Macro1':-1
            }
    if sheet not in xlclass: return pd.DataFrame()
    dfx=xl.parse(sheet,skiprows=skiprows,header=header)
    dfx.dropna(how='all',inplace=True)
    dfx.dropna(how='all',axis=1,inplace=True)
    if xlclass[sheet]==0:
        #Drop an empty row and the total count for England row
        dfx.drop([2,3],inplace=True)
        dfx.dropna(how='all',axis=1,inplace=True)
        dfx.columns=dfx.ix[1].values
        dfx=dfx[1:]
    elif xlclass[sheet]==1:
        #This is a real hack, trying to think what a sensible shape is and how to get there
        #Drop an empty row and the total count for England row
        dfx.drop([2,3],inplace=True)
        #Get rid of empty columns
        dfx.dropna(how='all',axis=1,inplace=True)
        #Fill across on dates
        dfx[:1].fillna(method='ffill', axis=1,inplace=True)
        #Patch the leading empty columns
        dfx[:1].fillna('Metadata',inplace=True)
        #Generate a multi-index across the columns from the first two rows
        dfx.columns=pd.MultiIndex.from_arrays(dfx[:2].values, names=['Category','SubCategory'])
        #Drop the rows we used to make the multi-index
        dfx=dfx[2:]
        #Transpose rows and columns
        dfx=dfx.T
        #Set the column names to be the hospital codes
        dfx.columns=dfx.ix['Metadata','Code'].values
        #And then throw away the metadata rows
        dfx=dfx[3:]
    #Tidy up by throwing away any empty columns we created
    dfx.dropna(how='all',axis=1,inplace=True)
    return dfx
xx=DailySR_parse(xl,'G&A beds')
xx[:4]
RJC | RKB | RLQ | RLT | RWP | RD1 | RN3 | RNZ | RTE | RBK | ... | RD3 | RDZ | RHM | RHU | RN5 | RAE | RCF | RR8 | RWY | RXF | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Category | SubCategory | |||||||||||||||||||||
2014-12-01 00:00:00 | Core beds avail | 485 | 1114 | 208 | 268 | 799 | 595 | 469 | 410 | 911 | 410 | ... | 410 | 582 | 1205 | 995 | 827 | 661 | 325 | 1960 | 596 | 930 |
Escalation beds avail | 0 | 8 | 0 | 18 | 51 | 2 | 18 | 20 | 38 | 56 | ... | 50 | 0 | 12 | 58 | 33 | 0 | 8 | 42 | 28 | 36 | |
Total beds avail | 485 | 1122 | 208 | 286 | 850 | 597 | 487 | 430 | 949 | 466 | ... | 460 | 582 | 1217 | 1053 | 860 | 661 | 333 | 2002 | 624 | 966 | |
Total beds occ'd | 445 | 1116 | 208 | 286 | 848 | 569 | 486 | 421 | 922 | 466 | ... | 435 | 537 | 1186 | 931 | 831 | 629 | 317 | 1738 | 613 | 933 |
4 rows × 153 columns
xx['RNZ'][:,'Escalation beds avail']
Category
2014-12-01 00:00:00    20
2014-12-02 00:00:00    28
2014-12-03 00:00:00    27
2014-12-04 00:00:00    20
05-07-Dec-14           21
Name: RNZ, dtype: object
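Alternatively, the .xs() method will take a cross-section through a level of the row multi-index; a sketch, assuming the Category/SubCategory level names set in DailySR_parse():
#Select the 'Escalation beds avail' rows for a single hospital code
xx.xs('Escalation beds avail',level='SubCategory')['RNZ']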
DailySR_parse(xl,'Paediatric intensive care')
RJC | RKB | RLQ | RLT | RWP | RD1 | RN3 | RNZ | RTE | RBK | ... | RD3 | RDZ | RHM | RHU | RN5 | RAE | RCF | RR8 | RWY | RXF | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Category | SubCategory | |||||||||||||||||||||
2014-12-01 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | |
2014-12-02 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | |
2014-12-03 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | |
2014-12-04 00:00:00 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | |
05-07-Dec-14 | Paed Int Care Avail | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
Paed Int Care Occ | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | ... | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 12 | 0 | 0 |
10 rows × 153 columns
So now we've worked out the steps for getting the separate waiting time and sitrep data into shape. Where spreadsheets contain multiple heterogeneous sheets, we need to load them in using the pandas ExcelFile class, as the following example using a DailySR report shows.
dfx=pd.ExcelFile('AEdata/DailySR-Web-file-WE-01.02.15.xlsx')
dfx.sheet_names
[u'A&E closures', u'A&E diverts', u'Cancelled operations', u'Critical care transfers', u'Ambulances queuing', u'G&A beds', u'D&V, Norovirus', u'Delayed transfers of care', u'Adult critical care', u'Paediatric intensive care', u'Neonatal intensive care ', u'Trusts with Ops problems', u'A&E Performance', u'Macro1']
df=dfx.parse('G&A beds',header=None,skiprows=13,parse_cols="B,D:Y")
df.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | 2015-01-26 00:00:00 | NaN | NaN | NaN | 2015-01-27 00:00:00 | NaN | NaN | ... | NaN | NaN | 2015-01-29 00:00:00 | NaN | NaN | NaN | 30-Jan-01-Feb-15 | NaN | NaN | NaN |
1 | Area Team | Code | Name | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | ... | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd |
2 | - | - | ENGLAND | 97161 | 4589 | 101750 | 97278 | 97169 | 4439 | 101608 | ... | 101564 | 96301 | 97224 | 4206 | 101430 | 95890 | 97009 | 4352 | 101361 | 95320 |
3 | NaN | NaN | NaN | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
4 | Arden, Herefordshire And Worcestershire Area Team | RJC | South Warwickshire NHS Foundation Trust | 485 | 0 | 485 | 469 | 485 | 0 | 485 | ... | 485 | 467 | 485 | 0 | 485 | 482 | 485 | 0 | 485 | 470 |
5 rows × 23 columns
df.set_index([0,1,2],inplace=True)
df.head(2)
3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | ||||||||||||||||||||
NaN | NaN | NaN | 2015-01-26 00:00:00 | NaN | NaN | NaN | 2015-01-27 00:00:00 | NaN | NaN | NaN | 2015-01-28 00:00:00 | NaN | NaN | NaN | 2015-01-29 00:00:00 | NaN | NaN | NaN | 30-Jan-01-Feb-15 | NaN | NaN | NaN |
Area Team | Code | Name | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd |
#http://stackoverflow.com/a/24154138/454773
#fill in Null values in "Headers"
df[:1].fillna(method='ffill', axis=1,inplace=True)
#df[:1].fillna(value='Base',axis=1,inplace=True)
df.head(3)
3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | ||||||||||||||||||||
NaN | NaN | NaN | 2015-01-26 00:00:00 | 2015-01-26 00:00:00 | 2015-01-26 00:00:00 | 2015-01-26 00:00:00 | 2015-01-27 00:00:00 | 2015-01-27 00:00:00 | 2015-01-27 00:00:00 | 2015-01-27 00:00:00 | 2015-01-28 00:00:00 | 2015-01-28 00:00:00 | 2015-01-28 00:00:00 | 2015-01-28 00:00:00 | 2015-01-29 00:00:00 | 2015-01-29 00:00:00 | 2015-01-29 00:00:00 | 2015-01-29 00:00:00 | 30-Jan-01-Feb-15 | 30-Jan-01-Feb-15 | 30-Jan-01-Feb-15 | 30-Jan-01-Feb-15 |
Area Team | Code | Name | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd | Core beds avail | Escalation beds avail | Total beds avail | Total beds occ'd |
- | - | ENGLAND | 97161 | 4589 | 101750 | 97278 | 97169 | 4439 | 101608 | 96826 | 97281 | 4283 | 101564 | 96301 | 97224 | 4206 | 101430 | 95890 | 97009 | 4352 | 101361 | 95320 |
#http://stackoverflow.com/a/24154138/454773
#create multiindex column names
df.columns=pd.MultiIndex.from_arrays(df[:2].values, names=['Date','Report'])
df[4:].head(3).T
0 | Arden, Herefordshire And Worcestershire Area Team | |||
---|---|---|---|---|
1 | RJC | RKB | RLQ | |
2 | South Warwickshire NHS Foundation Trust | University Hospitals Coventry And Warwickshire NHS Trust | Wye Valley NHS Trust | |
Date | Report | |||
2015-01-26 00:00:00 | Core beds avail | 485 | 1112 | 208 |
Escalation beds avail | 0 | 27 | 0 | |
Total beds avail | 485 | 1139 | 208 | |
Total beds occ'd | 469 | 1127 | 200 | |
2015-01-27 00:00:00 | Core beds avail | 485 | 1113 | 208 |
Escalation beds avail | 0 | 28 | 0 | |
Total beds avail | 485 | 1141 | 208 | |
Total beds occ'd | 477 | 1132 | 203 | |
2015-01-28 00:00:00 | Core beds avail | 485 | 1112 | 208 |
Escalation beds avail | 0 | 13 | 0 | |
Total beds avail | 485 | 1125 | 208 | |
Total beds occ'd | 467 | 1108 | 204 | |
2015-01-29 00:00:00 | Core beds avail | 485 | 1113 | 208 |
Escalation beds avail | 0 | 21 | 0 | |
Total beds avail | 485 | 1134 | 208 | |
Total beds occ'd | 482 | 1115 | 201 | |
30-Jan-01-Feb-15 | Core beds avail | 485 | 1112 | 208 |
Escalation beds avail | 0 | 28 | 2 | |
Total beds avail | 485 | 1140 | 210 | |
Total beds occ'd | 470 | 1135 | 204 |
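As a final sketch, we might recast this frame into the same sort of long form we used for the WSitAE data by flattening the (Date, Report) column multi-index and melting; this is untested scaffolding in the spirit of the recipes above rather than a definitive implementation:
#Work on a copy of the data rows, skipping the header rows at the top of the frame
dfb=df[4:].copy()
#Flatten the (Date, Report) column multi-index into single '::' separated names, as before
dfb.columns=['::'.join([str(c) for c in col]) for col in dfb.columns.values]
#Bring the (Area Team, Code, Name) row index back in as columns
dfb=dfb.reset_index()
dfb.columns=['Area Team','Code','Name']+list(dfb.columns[3:])
#Melt into long form, then split the flattened names back out
dfb_long=pd.melt(dfb,id_vars=['Area Team','Code','Name'],var_name='Type',value_name='Value')
dfb_long['Date']=dfb_long['Type'].map(lambda x: x.split('::')[0])
dfb_long['Report']=dfb_long['Type'].map(lambda x: x.split('::')[1])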