Quick Look At TEF Data¶

Tony Hirst, @psychemedia

Notebook to load and have a quick peek at the Teaching Excellence Framework (TEF) data: Teaching Excellence Framework.

Example spreadsheet format used to develop the original code can be found here: Teaching Excellence Framework: year 2 specification.

Final spreadsheets available from: http://www.hefce.ac.uk/lt/tef/data/

User Settings¶

In [10]:

#Set to false for full run
#TEST: True, False
TEST=False

In [6]:

#Put Excel files into specified (sub)directory and load them from there
directory='TEFYearTwo_AllMetrics'#'tef'

Utils¶

Bits and bobs that may or may not be useful...

In [3]:

#Packages
import os
import pandas as pd

#Support inline plotting
%matplotlib inline

In [4]:

#Set letter vars as numeric - may be useful for referencing cells?
import string
k=0
for l in string.ascii_uppercase:
    exec(l+"="+str(k))
    k=k+1
    
A, D, Z

Out[4]:

(0, 3, 25)

In [5]:

#Get some UK HE admin data - may be useful?
#Via http://learning-provider.data.ac.uk/
#lp=pd.read_csv('learning-providers-plus.csv')
lp=pd.read_csv('http://learning-provider.data.ac.uk/data/learning-providers-plus.csv')
lp.head()

Out[5]:

	UKPRN	PROVIDER_NAME	VIEW_NAME	SORT_NAME	ALIAS	FLAT_NAME_NUMBER	BUILDING_NAME_NUMBER	LOCALITY	STREET_NAME	TOWN	POSTCODE	WEBSITE_URL	WIKIPEDIA_URL	GROUPS	LONGITUDE	LATITUDE	EASTING	NORTHING	GTR_ID	HESA_ID
0	10008640	FALMOUTH UNIVERSITY	University College Falmouth	Falmouth, University College	NaN	NaN	Woodlane Campus	NaN	Woodlane	Falmouth	TR11 4RH	http://www.falmouth.ac.uk/	http://en.wikipedia.org/wiki/University_Colleg...	NaN	-5.070901	50.149168	180711.0	32196.0	E84FC550-A4CC-4B98-A6F9-D15A33829D83	17.0
1	10007774	UNIVERSITY OF OXFORD	University of Oxford	Oxford, University of	NaN	NaN	UNIVERSITY OFFICES	NaN	WELLINGTON SQUARE	OXFORD	OX1 2JD	http://www.ox.ac.uk/	http://en.wikipedia.org/wiki/University_of_Oxford	Science_and_Engineering_South, Russell_Group, ...	-1.262868	51.757644	450974.0	206807.0	B1F0E8FE-FE3C-49ED-9C96-1ED75312A8A0	156.0
2	10007768	UNIVERSITY OF LANCASTER	University of Lancaster	Lancaster, University of	NaN	NaN	UNIVERSITY HOUSE	BAILRIGG	NaN	LANCASTER	LA1 4YW	http://www.lancs.ac.uk/	http://en.wikipedia.org/wiki/Lancaster_University	1994_Group, N8_Research_Partnership	-2.786905	54.010480	348528.0	457448.0	F8C7F869-77D6-4859-96F1-3550A4951F6C	123.0
3	10000571	BATH SPA UNIVERSITY	Bath Spa University	Bath Spa University	NaN	NaN	NEWTON PARK	NEWTON ST. LOE	NaN	BATH	BA2 9BN	http://www.bathspa.ac.uk/	http://en.wikipedia.org/wiki/Bath_Spa_University	Million_Plus	-2.437400	51.378739	369654.0	164501.0	1A469850-02BA-4814-81EE-E60C851CABC3	48.0
4	10007814	CARDIFF UNIVERSITY	Cardiff University	Cardiff University	NaN	NaN	NaN	NaN	PARK PLACE	CARDIFF	CF10 3AT	http://www.cardiff.ac.uk/	http://en.wikipedia.org/wiki/Cardiff_University	GW4, Russell_Group	-3.179907	51.489093	318176.0	177343.0	9C10D78F-6430-4CA7-9528-B96B0762A4C6	179.0

Get the Data¶

Data is published as via http://www.hefce.ac.uk/lt/tef/data/.

Originally it was thought that data files would be provided just as individual Excel spreadsheets (original example), one per institution, hence the need for this notebook. In the end, multiple versions of the data were made available, including monolithic CSV documents (of file for all institutions), rendering this notebook superfluous.

The final released spreadsheets had some minor differences in layout and sheet labeling from the original example. The original notebook scripts were not intelligent enough to automatically cope with these and nor is this one. For example, cell ranges are hard coded rather than being autodetected.

In [177]:

#Download the zipped Excel workbooks
!wget http://www.hefce.ac.uk/media/HEFCE,2014/Content/Learning,and,teaching/TEF/TEFYearTwo/data/TEFYearTwo_AllMetrics.zip
    

--2017-06-22 08:41:13--  http://www.hefce.ac.uk/media/HEFCE,2014/Content/Learning,and,teaching/TEF/TEFYearTwo/data/TEFYearTwo_AllMetrics.zip
Resolving www.hefce.ac.uk (www.hefce.ac.uk)... 195.194.167.210
Connecting to www.hefce.ac.uk (www.hefce.ac.uk)|195.194.167.210|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44520361 (42M) [application/x-zip-compressed]
Saving to: ‘TEFYearTwo_AllMetrics.zip’

TEFYearTwo_AllMetri 100%[===================>]  42.46M  4.60MB/s    in 12s     

2017-06-22 08:41:25 (3.61 MB/s) - ‘TEFYearTwo_AllMetrics.zip’ saved [44520361/44520361]

Put all the spreadsheet files into a single folder; the script will then load them in and generate some monolithic CSV files of various flavours.

Note the processing of the files by the rest of the notebook is not very efficient...

In [184]:

!mkdir -p {directory}
!unzip -q TEFYearTwo_AllMetrics.zip -d {directory}

In [35]:

#Column names are crappy with white space - clean it off
def cleanCols(df):
    df.rename(columns=lambda x: str(x).strip(),inplace=True)
    return df


#Hack the PRN out of each sheet and use it to key data values with a new prn column in each dataset
def getPRN2(fn,sn):
    i=pd.read_excel(fn, sheetname=sn, parse_cols=[0,1],header=None).head()
    institution=None
    prn=None
    institution=str(i[i[0].notnull() & i[0].str.contains('Institution')][1].iloc[0]).strip()
    prn=str(i[i[0].notnull() & i[0].str.contains('UKPRN')][1].iloc[0]).strip()
    return institution,prn

def getPRN(fn,sn):
    #TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
    prn=fn.split('/')[1].split('_')[0]
    institution=fn.split('/')[1].split('_')[1]
    return institution,prn

#Quick helper that would let us leave test spreadsheets in the data dir.
def fntest(f):
    if TEST: return not f.startswith('~') and f.startswith('TEST')
    return not f.startswith('~') and not f.startswith('TEST')

In [167]:

fn=os.listdir(directory)[0]
xl=pd.read_excel('{}/{}'.format(directory,fn), sheetname=None)
xl.keys()

Out[167]:

odict_keys(['Coversheet', 'Contextual data', 'Core metrics', 'Core metrics and splits', 'BME breakdown', 'Indicator (a)', 'Benchmark (b)', 'Difference (a-b)', 'Z-score', 'Numerators and denominators'])

In [169]:

#get PRN from filename
getPRN('{}/{}'.format(directory,fn),list(xl.keys())[1])

Out[169]:

('University of Exeter', '10007792')

Notes on Parsing the spreadsheets¶

The spreadsheets include various compound tables which need processing to split out the separate subtables, such as data for full-time versus part-time students.

The final spreadsheets also had populated, administrative(?) cells at the extremity of many, if not all, sheets. Using the pandas.read_excel() function's parse_cols parameter allow the ingest of only the desired columns. (Loading in the whole sheet width often caused the multiple header definition part of the ingest to break.)

Core metrics¶

Core metrics and splits¶

Guess that the spreadsheets are regular and use absolute range finding... Which is really dangerous...

In [104]:

def get_coremetricsAndSplits(fn):
    sn='Core metrics and splits'
    print(fn)
    xls=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=7,
                              header=[0,1]).dropna(how='all',axis=0))
    institution,prn=getPRN(fn,sn)
    i_fulltime=xls[xls.columns[0:27]][:6]
    i_fulltime.index.names=['Topic']
    i_fulltime['PRN']=prn
    i_fulltime['type']='FT'

    i_fulltime=i_fulltime.reset_index().set_index(['Topic','PRN'])

    i_fulltime.drop('Unnamed: 0_level_0', level=0, axis=1, inplace=True)

    i_parttime=xls[xls.columns[0:27]][7:13]

    i_parttime.index.names=['Topic']
    i_parttime['PRN']=prn
    i_parttime['type']='PT'

    i_parttime=i_parttime.reset_index().set_index(['Topic','PRN'])

    i_parttime.drop('Unnamed: 0_level_0', level=0, axis=1, inplace=True)
    return i_fulltime, i_parttime
    
cms_ft=pd.DataFrame()
cms_pt=pd.DataFrame()
cms_complete=pd.DataFrame()

#Loop through all the files in the declared directory, and parse the contents of a particular sheet out
for filename in [f for f in os.listdir(directory) if fntest(f)]:
    cms_ft_tmp, cms_pt_tmp=get_coremetricsAndSplits('{}/{}'.format(directory,filename))
    cms_ft=pd.concat([cms_ft,cms_ft_tmp])
    cms_pt=pd.concat([cms_pt,cms_pt_tmp])
    cms_complete=pd.concat([cms_complete,cms_pt_tmp,cms_ft_tmp])

cms_complete.to_csv('tef_coremetricandsplits.csv')

TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx
TEFYearTwo_AllMetrics/10007796_The University of Leicester_Metrics.xlsx
TEFYearTwo_AllMetrics/10007798_The University of Manchester_Metrics.xlsx

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-104-932eeaaf104f> in <module>()
     31 #Loop through all the files in the declared directory, and parse the contents of a particular sheet out
     32 for filename in [f for f in os.listdir(directory) if fntest(f)]:
---> 33     cms_ft_tmp, cms_pt_tmp=get_coremetricsAndSplits('{}/{}'.format(directory,filename))
     34     cms_ft=pd.concat([cms_ft,cms_ft_tmp])
     35     cms_pt=pd.concat([cms_pt,cms_pt_tmp])

<ipython-input-104-932eeaaf104f> in get_coremetricsAndSplits(fn)
      3     print(fn)
      4     xls=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=7,
----> 5                               header=[0,1]).dropna(how='all',axis=0))
      6     institution,prn=getPRN(fn,sn)
      7     i_fulltime=xls[xls.columns[0:27]][:6]

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    198 
    199     if not isinstance(io, ExcelFile):
--> 200         io = ExcelFile(io, engine=engine)
    201 
    202     return io._parse_excel(

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    255             self.book = xlrd.open_workbook(file_contents=data)
    256         elif isinstance(io, compat.string_types):
--> 257             self.book = xlrd.open_workbook(io)
    258         else:
    259             raise ValueError('Must explicitly set engine if not passing in'

/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    544         row_tag = U_SSML12 + "row"
    545         self_do_row = self.do_row
--> 546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
    548                 self_do_row(elem)

/usr/lib/python3.5/xml/etree/ElementTree.py in __next__(self)
   1302                 data = self._file.read(16 * 1024)
   1303                 if data:
-> 1304                     self._parser.feed(data)
   1305                 else:
   1306                     self._root = self._parser._close_and_return_root()

/usr/lib/python3.5/xml/etree/ElementTree.py in feed(self, data)
   1235         if data:
   1236             try:
-> 1237                 self._parser.feed(data)
   1238             except SyntaxError as exc:
   1239                 self._events_queue.append(exc)

KeyboardInterrupt:

In [105]:

cms_complete.head()

Out[105]:

		Core metrics					Splits	type
		Indicator\n(a) %	Benchmark\n(b) %	Difference \n(a)-(b) *	Z-score	Flag	Years
Topic	PRN
Assessment and feedback	10007792	SUP	SUP	SUP	SUP	SUP	SUP	PT
Academic support	10007792	SUP	SUP	SUP	SUP	SUP	SUP	PT
Non-continuation	10007792	NaN	NaN	NaN	NaN	NaN	N	PT
Employment or further study	10007792	NaN	NaN	NaN	NaN	NaN	N	PT
Highly skilled employment or further study	10007792	++	++	++	++	++	N	PT

In [106]:

#It may be easier to work with if we put everything into long format?
cms_complete_long=cms_complete.reset_index().melt(id_vars=['Topic','PRN', 'type'],
                                                  var_name=['Heading','Subheading'])
cms_complete_long.head()

Out[106]:

	Topic	PRN	type	Heading	Subheading	value
0	Assessment and feedback	10007792	PT	Core metrics	Indicator\n(a) %	SUP
1	Academic support	10007792	PT	Core metrics	Indicator\n(a) %	SUP
2	Non-continuation	10007792	PT	Core metrics	Indicator\n(a) %	NaN
3	Employment or further study	10007792	PT	Core metrics	Indicator\n(a) %	NaN
4	Highly skilled employment or further study	10007792	PT	Core metrics	Indicator\n(a) %	++

In [107]:

cms_complete_long['type'].unique()

Out[107]:

array(['PT', 'FT'], dtype=object)

In [108]:

#In long format it's easy to write queires... For example, report on ++ rows:
cms_complete_long[cms_complete_long['value']=='++'].head()

Out[108]:

	Topic	PRN	type	Heading	Subheading	value
4	Highly skilled employment or further study	10007792	PT	Core metrics	Indicator\n(a) %	++
8	Assessment and feedback	10007792	FT	Core metrics	Indicator\n(a) %	++
10	Highly skilled employment or further study	10007792	FT	Core metrics	Indicator\n(a) %	++
12	Highly skilled employment or further study	10007795	PT	Core metrics	Indicator\n(a) %	++
21	The teaching on my course	10007795	FT	Core metrics	Indicator\n(a) %	++

In [511]:

#A query over more columns...
cms_complete_long[(cms_complete_long['value']=='++') & 
                  (cms_complete_long['Heading']=='Disadvantaged') &
                  (cms_complete_long['Subheading']=='Yes')]

Out[511]:

	Topic	PRN	type	Heading	Subheading	value
298	Employment or further study	nan	FT	Disadvantaged	Yes	++
299	Highly skilled employment or further study	nan	FT	Disadvantaged	Yes	++
310	Employment or further study	nan	FT	Disadvantaged	Yes	++
311	Highly skilled employment or further study	nan	FT	Disadvantaged	Yes	++

In [783]:

cms_complete.to_csv('tef_coremetricandsplits_long.csv', index=False)

In [943]:

#Would it be easier/more useful to demo SQL queries?
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

q='''
SELECT * FROM cms_complete_long
WHERE value='++' AND Heading='Disadvantaged'AND Subheading='Yes' LIMIT 10;
'''
pysqldf(q)

Out[943]:

	Topic	PRN	type	Heading	Subheading	value
0	Employment or further study	nan	FT	Disadvantaged	Yes	++
1	Highly skilled employment or further study	nan	FT	Disadvantaged	Yes	++
2	Employment or further study	nan	FT	Disadvantaged	Yes	++
3	Highly skilled employment or further study	nan	FT	Disadvantaged	Yes	++

Contextual data¶

In [49]:

def get_contextualdata(fn):
    sn='Contextual data'
    print(fn)
    
    institution,prn=getPRN(fn,sn)
    
    _demog=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=11,parse_cols=8))
    _demog.columns=['Category','Dummy','Group','FT Headcount','FT %','PT Headcount','PT %','Total Headcount','Total %']
    _demog = _demog.drop('Dummy', axis=1)
    _demog.fillna(method='ffill',inplace=True)
    _demog['PRN']=prn
    
    _subj=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=9,parse_cols=range(10,17))).dropna(how='all',axis=1).dropna(how='all',axis=0)
    _subj.columns=['Subject of study','FT Headcount','FT %','PT Headcount','PT %','Total Headcount','Total %']
    _subj['PRN']=prn
    return _demog, _subj

demog=pd.DataFrame()
subj=pd.DataFrame()


for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
    demog_tmp,subj_tmp=get_contextualdata('{}/{}'.format(directory,filename))
    demog=pd.concat([demog,demog_tmp])
    subj=pd.concat([subj,subj_tmp])

TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx
TEFYearTwo_AllMetrics/10007796_The University of Leicester_Metrics.xlsx
TEFYearTwo_AllMetrics/10007798_The University of Manchester_Metrics.xlsx
TEFYearTwo_AllMetrics/10007799_University of Newcastle upon Tyne_Metrics.xlsx
TEFYearTwo_AllMetrics/10007801_University of Plymouth_Metrics.xlsx
TEFYearTwo_AllMetrics/10007802_The University of Reading_Metrics.xlsx
TEFYearTwo_AllMetrics/10007803_University of St Andrews_Metrics.xlsx
TEFYearTwo_AllMetrics/10007806_University of Sussex_Metrics.xlsx
TEFYearTwo_AllMetrics/10007811_Bishop Grosseteste University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007814_Cardiff University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007816_The Royal Central School of Speech and Drama_Metrics.xlsx
TEFYearTwo_AllMetrics/10007817_Chichester College_Metrics.xlsx
TEFYearTwo_AllMetrics/10007823_Edge Hill University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007825_Guildhall School of Music & Drama_Metrics.xlsx
TEFYearTwo_AllMetrics/10007832_Newman University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007833_Wrexham GlyndèÊr University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007835_The Royal Academy of Music_Metrics.xlsx
TEFYearTwo_AllMetrics/10007837_Royal Northern College of Music_Metrics.xlsx
TEFYearTwo_AllMetrics/10007842_The University of Cumbria_Metrics.xlsx
TEFYearTwo_AllMetrics/10007843_St Marys University  Twickenham_Metrics.xlsx
TEFYearTwo_AllMetrics/10007848_University of Chester_Metrics.xlsx
TEFYearTwo_AllMetrics/10007849_University of Abertay Dundee_Metrics.xlsx
TEFYearTwo_AllMetrics/10007850_The University of Bath_Metrics.xlsx
TEFYearTwo_AllMetrics/10007851_University of Derby_Metrics.xlsx
TEFYearTwo_AllMetrics/10007852_University of Dundee_Metrics.xlsx
TEFYearTwo_AllMetrics/10007854_Cardiff Metropolitan University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007855_Swansea University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007857_Bangor University_Metrics.xlsx
TEFYearTwo_AllMetrics/10007858_University of Wales Trinity Saint David_Metrics.xlsx
TEFYearTwo_AllMetrics/10007859_Warwickshire College_Metrics.xlsx
TEFYearTwo_AllMetrics/10007938_Grimsby Institute of Further and Higher Education_Metrics.xlsx

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-49-5b07e77609b8> in <module>()
     21 
     22 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
---> 23     demog_tmp,subj_tmp=get_contextualdata('{}/{}'.format(directory,filename))
     24     demog=pd.concat([demog,demog_tmp])
     25     subj=pd.concat([subj,subj_tmp])

<ipython-input-49-5b07e77609b8> in get_contextualdata(fn)
      5     institution,prn=getPRN(fn,sn)
      6 
----> 7     _demog=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=11,parse_cols=8))
      8     _demog.columns=['Category','Dummy','Group','FT Headcount','FT %','PT Headcount','PT %','Total Headcount','Total %']
      9     _demog = _demog.drop('Dummy', axis=1)

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    198 
    199     if not isinstance(io, ExcelFile):
--> 200         io = ExcelFile(io, engine=engine)
    201 
    202     return io._parse_excel(

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    255             self.book = xlrd.open_workbook(file_contents=data)
    256         elif isinstance(io, compat.string_types):
--> 257             self.book = xlrd.open_workbook(io)
    258         else:
    259             raise ValueError('Must explicitly set engine if not passing in'

/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in do_row(self, row_elem)
    652                         if c == '$':
    653                             continue
--> 654                         lv = letter_value[c]
    655                         if lv:
    656                             colx = colx * 26 + lv

KeyboardInterrupt:

In [170]:

demog.head()

Out[170]:

	Category	Group	FT Headcount	PT Headcount	PRN
0	Level of study	First degree	14425	65	10007792
1	Level of study	Other UG	610	55	10007792
2	Age	Under 21	13795	25	10007792
3	Age	21 to 30	1080	10	10007792
4	Age	Over 30	150	85	10007792

In [585]:

subj.head()

Out[585]:

	Subject of study	FT Headcount	FT %	PT Headcount	PT %	Total Headcount	Total %	PRN
0	Medicine & dentistry and veterinary science	300.0	0.03	30.0	0.00	330.0	0.03	nan
1	Subjects allied to medicine	1000.0	0.10	100.0	0.11	1100.0	0.10	nan
2	Biological sciences	1100.0	0.11	110.0	0.14	1210.0	0.11	nan
3	Agriculture & related subjects	100.0	0.01	10.0	0.00	110.0	0.01	nan
4	Physical sciences	500.0	0.05	50.0	0.03	550.0	0.05	nan

In [594]:

subj[['Subject of study','PRN','Total Headcount']].head()

Out[594]:

	Subject of study	PRN	Total Headcount
0	Medicine & dentistry and veterinary science	nan	330.0
1	Subjects allied to medicine	nan	1100.0
2	Biological sciences	nan	1210.0
3	Agriculture & related subjects	nan	110.0
4	Physical sciences	nan	550.0

In [825]:

subj_long=subj.melt(id_vars=['Subject of study','PRN'],var_name='Category')
subj_long['value']= pd.to_numeric(subj_long['value'],errors='coerce')
subj_long.head()

Out[825]:

	Subject of study	PRN	Category	value
0	Medicine & dentistry and veterinary science	nan	FT Headcount	300.0
1	Subjects allied to medicine	nan	FT Headcount	1000.0
2	Biological sciences	nan	FT Headcount	1100.0
3	Agriculture & related subjects	nan	FT Headcount	100.0
4	Physical sciences	nan	FT Headcount	500.0

In [872]:

def longify(df,id_vars=['Topic','PRN'],var_name=['Heading','Subheading'],valname='value',resetIndex=True):
    if resetIndex: df=df.reset_index()
    df=df.melt(id_vars=id_vars,
               var_name=var_name,
               value_name=valname)
    df[valname]= pd.to_numeric(df[valname],errors='coerce')
    return df

In [873]:

subj_long=longify(subj,id_vars=['Subject of study','PRN'],var_name='Category', resetIndex=False)
subj_long.head()

Out[873]:

	Subject of study	PRN	Category	value
0	Medicine & dentistry and veterinary science	nan	FT Headcount	300.0
1	Subjects allied to medicine	nan	FT Headcount	1000.0
2	Biological sciences	nan	FT Headcount	1100.0
3	Agriculture & related subjects	nan	FT Headcount	100.0
4	Physical sciences	nan	FT Headcount	500.0

In [785]:

subj_long.to_csv('tef_subj_long.csv', index=False)

Indicators (a)¶

In [143]:

def get_indicators(fn):
    sn='Indicator (a)'
    print(fn)
    institution,prn=getPRN(fn,sn)

    i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,
                              header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1))
    
    i['PRN']=prn
    
    ift=i[1:7][:]
    ift['Type']='FT'
    ipt=i[8:14][:]
    ipt['Type']='PT'
    
    i=pd.concat([ift,ipt]).dropna(how='all',axis=1)
    i.index.names=['Topic']


    i=i.reset_index().set_index(['Topic','PRN'])

    return i

In [144]:

indicators=pd.DataFrame()

for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
    indicators=pd.concat([indicators,get_indicators('{}/{}'.format(directory,filename))])
    
indicators.to_csv('tef_indicators.csv')

TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx

/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':

TEFYearTwo_AllMetrics/10007796_The University of Leicester_Metrics.xlsx
TEFYearTwo_AllMetrics/10007798_The University of Manchester_Metrics.xlsx
TEFYearTwo_AllMetrics/10007799_University of Newcastle upon Tyne_Metrics.xlsx

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-144-0a4c8ab949a1> in <module>()
      2 
      3 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
----> 4     indicators=pd.concat([indicators,get_indicators('{}/{}'.format(directory,filename))])
      5 
      6 indicators.to_csv('tef_indicators.csv')

<ipython-input-143-c1c21329cc55> in get_indicators(fn)
      5 
      6     i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,
----> 7                               header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1))
      8 
      9     i['PRN']=prn

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    198 
    199     if not isinstance(io, ExcelFile):
--> 200         io = ExcelFile(io, engine=engine)
    201 
    202     return io._parse_excel(

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    255             self.book = xlrd.open_workbook(file_contents=data)
    256         elif isinstance(io, compat.string_types):
--> 257             self.book = xlrd.open_workbook(io)
    258         else:
    259             raise ValueError('Must explicitly set engine if not passing in'

/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in do_row(self, row_elem)
    636         letter_value = _UPPERCASE_1_REL_INDEX
    637         for cell_elem in row_elem:
--> 638             cell_name = cell_elem.get('r')
    639             if cell_name is None: # Yes, it's optional.
    640                 colx += 1

KeyboardInterrupt:

In [171]:

indicators.head(15)

Out[171]:

		All years	Years			Level of study		Age		Disadvantaged		Ethnicity		Disabled		Sex		Domicile			Type
		Unnamed: 1_level_1	1	2	3	First degree	Other\nUG	Young	Mature	Yes	No	White	BME	Yes	No	Male	Female	UK	Other EU	Non EU
Topic	PRN
The teaching on my course	10007792	90.6797	90.8339	89.8528	91.2401	90.6797	NaN	90.7622	89.6226	91.6187	91.5394	91.8816	86.7551	91.2695	90.6016	89.0266	91.9115	91.5804	92.212	85.0954	FT
Assessment and feedback	10007792	75.517	76.4804	75.0354	75.0201	75.517	NaN	75.4647	76.1871	74.9424	75.5491	75.8037	73.7306	74.0418	75.7124	74.0954	76.5766	75.4667	75.1963	75.9003	FT
Academic support	10007792	83.8984	84.4193	83.9359	83.3757	83.8984	NaN	84.0099	82.4686	84.2166	83.9893	84.4267	81.8901	84.3709	83.8358	84.2393	83.6444	83.958	85.6894	83.0154	FT
Non-continuation	10007792	2.2783	2.12284	2.39577	2.3185	2.27954	N	1.75764	9.23277	3.33333	1.52757	2.2543	2.61283	3.98773	2.0488	2.44962	2.11622	0	0	0	FT
Employment or further study	10007792	94.4993	93.9013	95.2135	94.332	94.5236	R	94.592	93.0521	94.2377	94.6537	94.5472	93.9394	93.7793	94.6043	93.4256	95.4714	0	0	0	FT
Highly skilled employment or further study	10007792	79.3124	79.3296	79.1721	79.4422	79.3698	R	79.3065	79.4045	74.7899	80.011	79.0003	83.6364	77.8169	79.5307	80.8116	77.955	0	0	0	FT
The teaching on my course	10007792	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	N	R	N	SUP	SUP	SUP	SUP	R	N	PT
Assessment and feedback	10007792	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	N	R	N	SUP	SUP	SUP	SUP	R	N	PT
Academic support	10007792	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	N	R	N	SUP	SUP	SUP	SUP	R	N	PT
Non-continuation	10007792	21.0526	N	N	N	21.0526	0	N	DP	NaN	N	DP	N	N	DP	N	DP	0	0	0	PT
Employment or further study	10007792	91.1111	N	DP	DP	DP	SUP	N	DP	R	N	DP	NaN	N	DP	DP	DP	0	0	0	PT
Highly skilled employment or further study	10007792	88.8889	N	DP	DP	86.9565	SUP	N	DP	R	N	DP	NaN	N	DP	DP	DP	0	0	0	PT
The teaching on my course	10007795	90.9203	89.8789	91.7786	91.254	90.9274	89.5833	90.8864	91.2909	91.3506	90.8578	91.3007	89.5158	90.8638	90.9258	89.7398	91.633	90.9966	88.4162	90.9528	FT
Assessment and feedback	10007795	72.4769	70.8454	74.1373	72.6671	72.4931	69.4444	72.2322	75.1547	74.1532	71.0338	71.997	73.4978	69.9169	72.7273	73.136	72.079	71.7057	74.4241	80.7418	FT
Academic support	10007795	83.5246	82.0285	84.5623	84.2109	83.507	86.8056	83.433	84.5259	85.7576	82.5839	83.5152	83.2889	81.4646	83.7261	84.0149	83.2285	83.1844	84.1187	87.2634	FT

In [877]:

!head tef_indicators.csv

,,All years,Years,Years,Years,Level of study,Level of study,Age,Age,Disadvantaged,Disadvantaged,Ethnicity,Ethnicity,Disabled,Disabled,Sex,Sex,Welsh medium,Domicile,Domicile,Domicile
,,Unnamed: 1_level_1,1,2,3,First Degree,Other UG,Young,Mature,Yes,No,White,BME,Yes,No,Male,Female,Yes,UK,Other EU,Non EU
Topic,PRN,,,,,,,,,,,,,,,,,,,,
The teaching on my course,dfdf,87.2,90.68,88.92,85.84,92.03,80,83.07000000000001,88.43,83.44,90.91,89.23,88.45,84.94,91.65,82.42,90.66,,87.3,83.25,91.89
Assessment and feedback ,dfdf,73.3,72.14,75.83,70.06,74.24,69.09,68.67,75.75,76.24,71.81,68.87,76.31,77.02,69.86999999999999,68.9,74.96,,73.86999999999999,68.53999999999999,72.61999999999999
Academic support,dfdf,81.99,79.76,81.80000000000001,82.32000000000001,76.56,75.58,83.2,79.02000000000001,77.16000000000001,77.57000000000001,81.65,78.79,81.04,75.46000000000001,83.72,76.7,,78.41000000000001,83.42,80.22
Non-continuation,dfdf,5.317601422337159,6.427601422337159,3.527601422337159,4.847601422337159,4.887601422337159,6.617601422337159,4.6676014223371585,7.017601422337159,4.237601422337159,4.9176014223371585,7.287601422337159,6.327601422337159,3.687601422337159,4.307601422337159,5.227601422337159,7.55,,,,
Employment or further study ,dfdf,92.2,93.39,92.02,96.39,90.21000000000001,92.84,89.37,93.11,97.12,87.48,94.9,90.19,91.78,91.84,91.07000000000001,95.51,,,,
Highly skilled employment,dfdf,74.1,73.3,77.24,76.24,70.30999999999999,74.38,75.83,76.91,76.58999999999999,72.13,72.75999999999999,76.14,74.14,69.86999999999999,77.13,70.52,,,,
The teaching on my course ,dfdf,90.1,91.38,86.30999999999999,85.55,92.78999999999999,86.86,N,86.1,89.5,87.5,90.99,91.28,N,93.28,94.78,90.94999999999999,,94.85,89.96,89.52

In [878]:

pd.read_csv('tef_indicators.csv',header=[0,1],index_col=[0,1]).head()

Out[878]:

		All years	Years			Level of study		Age		Disadvantaged		Ethnicity		Disabled		Sex		Welsh medium	Domicile
		Unnamed: 1_level_1	1	2	3	First Degree	Other UG	Young	Mature	Yes	No	White	BME	Yes	No	Male	Female	Yes	UK	Other EU	Non EU
Topic	PRN
The teaching on my course	dfdf	87.200000	90.680000	88.920000	85.840000	92.030000	80.000000	83.07000000000001	88.430000	83.440000	90.910000	89.230000	88.450000	84.94	91.650000	82.420000	90.66	NaN	87.30	83.25	91.89
Assessment and feedback	dfdf	73.300000	72.140000	75.830000	70.060000	74.240000	69.090000	68.67	75.750000	76.240000	71.810000	68.870000	76.310000	77.02	69.870000	68.900000	74.96	NaN	73.87	68.54	72.62
Academic support	dfdf	81.990000	79.760000	81.800000	82.320000	76.560000	75.580000	83.2	79.020000	77.160000	77.570000	81.650000	78.790000	81.04	75.460000	83.720000	76.70	NaN	78.41	83.42	80.22
Non-continuation	dfdf	5.317601	6.427601	3.527601	4.847601	4.887601	6.617601	4.6676014223371585	7.017601	4.237601	4.917601	7.287601	6.327601	3.687601422337159	4.307601	5.227601	7.55	NaN	NaN	NaN	NaN
Employment or further study	dfdf	92.200000	93.390000	92.020000	96.390000	90.210000	92.840000	89.37	93.110000	97.120000	87.480000	94.900000	90.190000	91.78	91.840000	91.070000	95.51	NaN	NaN	NaN	NaN

In [122]:

dummyPRN=indicators.index.get_level_values('PRN').unique().tolist()[0]
dummyPRN

Out[122]:

'10007792'

In [123]:

#Example cross-section
indicators.xs(dummyPRN, level='PRN')['Age','Young']

Out[123]:

Topic
Full-time headcount:                                                NaN
The teaching on my course                                       90.7622
Assessment and feedback                                         75.4647
Academic support                                                84.0099
Non-continuation                                                1.75764
Employment or further study                                      94.592
Highly skilled employment or further study                      79.3065
Part-time headcount:                                                NaN
The teaching on my course                                             N
Assessment and feedback                                               N
Academic support                                                      N
Non-continuation                                                      N
Employment or further study                                           N
Highly skilled employment or further study                            N
NaN                                                           MET_WHITE
NaN                                                         TEFXXXAGE_Y
NaN                                           INDICATOR_\nTEFXXXAGE_\nY
NaN                                                                 NaN
NaN                                                                 NaN
NaN                                                                 NaN
N                                                                   NaN
R                                                                   NaN
SUP                                                                 NaN
NaN                                                                 NaN
DP                                                                  NaN
Name: (Age, Young), dtype: object

In [881]:

#Display the PRNs in the index
indicators.index.get_level_values('PRN').unique().tolist()

Out[881]:

['dfdf', 'JHDH']

In [882]:

#Example filter on column and PRN
indicators.xs('JHDH', level='PRN')['Domicile']

Out[882]:

	UK	Other EU	Non EU
Topic
The teaching on my course	87.30	83.25	91.89
Assessment and feedback	73.87	68.54	72.62
Academic support	78.41	83.42	80.22
Non-continuation	NaN	NaN	NaN
Employment or further study	NaN	NaN	NaN
Highly skilled employment	NaN	NaN	NaN
The teaching on my course	94.85	89.96	89.52
Assessment and feedback	74.95	77.69	82.98
Academic support	86.12	83.30	83.60
Non-continuation	NaN	NaN	NaN
Employment or further study	NaN	NaN	NaN
Highly skilled employment	NaN	NaN	NaN

In [883]:

indicators_long=longify(indicators)
indicators_long.tail(50).head(10)

Out[883]:

	Topic	PRN	Heading	Subheading	value
430	Employment or further study	JHDH	Domicile	UK	NaN
431	Highly skilled employment	JHDH	Domicile	UK	NaN
432	The teaching on my course	dfdf	Domicile	Other EU	83.25
433	Assessment and feedback	dfdf	Domicile	Other EU	68.54
434	Academic support	dfdf	Domicile	Other EU	83.42
435	Non-continuation	dfdf	Domicile	Other EU	NaN
436	Employment or further study	dfdf	Domicile	Other EU	NaN
437	Highly skilled employment	dfdf	Domicile	Other EU	NaN
438	The teaching on my course	dfdf	Domicile	Other EU	89.96
439	Assessment and feedback	dfdf	Domicile	Other EU	77.69

In [821]:

indicators_long.to_csv('tef_indicators_long.csv', index=False)

Benchmarks (b)¶

In [131]:

#This code is similar to that used for Zscores sheet
def get_benchmarks(fn):
    sn='Benchmark (b)'
    print(fn)
    institution,prn=getPRN(fn,sn)

    z=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=21, header=[0,1]))
    z['PRN']=prn
    zft=z[1:7][:]
    zft['Type']='FT'
    zpt=z[8:14][:]
    zpt['Type']='PT'
    z=pd.concat([zft,zpt]).dropna(how='all',axis=1)
    
    sd=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=30,parse_cols=21, header=[0,1]))
    sd['PRN']=prn
    sdft=sd[1:7][:]
    sdft['Type']='FT'
    sdpt=sd[8:14][:]
    sdpt['Type']='PT'
    
    sd=pd.concat([sdft,sdpt]).dropna(how='all',axis=1)
    
    z['Benchmark type']='benchmark'
    sd['Benchmark type']='providercontrib'
    
    z.index.names=['Topic']
    sd.index.names=['Topic']
    
    return z,sd

In [132]:

benchmark=pd.DataFrame()
providercontrib=pd.DataFrame()

for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
    benchmark_tmp,providercontrib_tmp= get_benchmarks(fn='{}/{}'.format(directory,filename))
    benchmark=pd.concat([benchmark,benchmark_tmp])
    providercontrib=pd.concat([providercontrib,providercontrib_tmp])

TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-132-06b8eac6d95c> in <module>()
      3 
      4 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
----> 5     benchmark_tmp,providercontrib_tmp= get_benchmarks(fn='{}/{}'.format(directory,filename))
      6     benchmark=pd.concat([benchmark,benchmark_tmp])
      7     providercontrib=pd.concat([providercontrib,providercontrib_tmp])

<ipython-input-131-af1240caaa5a> in get_benchmarks(fn)
     12     zpt['Type']='PT'
     13 
---> 14     sd=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=30,parse_cols=21, header=[0,1]))
     15     sd['PRN']=prn
     16     sdft=sd[1:7][:]

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    198 
    199     if not isinstance(io, ExcelFile):
--> 200         io = ExcelFile(io, engine=engine)
    201 
    202     return io._parse_excel(

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    255             self.book = xlrd.open_workbook(file_contents=data)
    256         elif isinstance(io, compat.string_types):
--> 257             self.book = xlrd.open_workbook(io)
    258         else:
    259             raise ValueError('Must explicitly set engine if not passing in'

/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in do_row(self, row_elem)
    663                 if explicit_row_number and cell_name[charx:] != row_number:
    664                     raise Exception('cell name %r but row number is %r' % (cell_name, row_number))
--> 665             xf_index = int(cell_elem.get('s', '0'))
    666             cell_type = cell_elem.get('t', 'n')
    667             tvalue = None

KeyboardInterrupt:

In [172]:

benchmark.head()

Out[172]:

	All years	Years			Level of study		Age		Disadvantaged		...	Ethnicity	Disabled		Sex		Domicile		PRN	Type	Benchmark type
	Unnamed: 1_level_1	1	2	3	First degree	Other\nUG	Young	Mature	Yes	No	...	BME	Yes	No	Male	Female	UK	Other EU
Topic
The teaching on my course	88.3309	87.7923	88.632	88.5606	88.4478	NaN	88.4035	87.3995	88.7461	89.4115	...	84.7989	87.9403	88.3826	87.3999	89.0245	89.1227	86.6562	10007792	FT	benchmark
Assessment and feedback	72.0203	71.0119	72.1508	72.8637	71.8199	NaN	71.9077	73.463	73.4956	72.1592	...	70.5695	70.3514	72.2414	71.8281	72.1635	72.1379	69.7941	10007792	FT	benchmark
Academic support	81.8189	81.2489	82.1147	82.1206	81.8246	NaN	81.9071	80.689	82.9026	82.2227	...	79.5073	80.9901	81.9287	82.543	81.2794	82.1206	81.6937	10007792	FT	benchmark
Non-continuation	3.44304	3.37228	3.56405	3.42482	3.39128	N	2.83198	11.612	3.84689	2.46721	...	4.03444	5.27223	3.21838	4.05566	2.94869	0	0	10007792	FT	benchmark
Employment or further study	93.9778	93.6149	94.0882	94.311	93.9752	R	94.0575	92.7333	94.0585	94.1474	...	92.4344	92.3242	94.1863	92.4983	95.3174	0	0	10007792	FT	benchmark

5 rows × 21 columns

In [861]:

providercontrib.head()

Out[861]:

	All years	Years			Level of study		Age		Disadvantaged		...	Disabled		Sex		Domicile			PRN	Type	Benchmark type
	Unnamed: 1_level_1	1	2	3	First Degree	Other UG	Young	Mature	Yes	No	...	Yes	No	Male	Female	UK	Other EU	Non EU
Topic
The teaching on my course	4.46	3.81	2.65	2.2	2.57	2.85	3.39	4.27	3.3	4.26	...	4.21	2.32	2.16	3.77	3.25	3.52	4.69	nan	FT	providercontrib
Assessment and feedback	3.06	4.4	4.88	4.75	4.33	2.64	2.03	4.48	4.7	2.79	...	3.81	4.34	4.74	2.05	4.1	4.52	4.24	nan	FT	providercontrib
Academic support	2.98	3.48	3.71	3.09	2.54	3.36	2.75	2.58	4.17	2.74	...	3.14	3.19	2.19	3.71	3.07	3.08	2.03	nan	FT	providercontrib
Non-continuation	4.91	0	2.69	4.89	2.1	4.86	4.07	3.02	2.08	3.35	...	2.29	2.5	3.71	2.61	NaN	NaN	NaN	nan	FT	providercontrib
Employment or further study	2.3	0	3.42	2.69	3.92	3.11	2.02	4.35	2.53	4.36	...	4.91	3.1	3.01	3.60	NaN	NaN	NaN	nan	FT	providercontrib

5 rows × 22 columns

In [891]:

benchmark_long=longify(benchmark,id_vars=['Topic','PRN','Benchmark type'],valname='value')
benchmark_long.head()

Out[891]:

	Topic	PRN	Benchmark type	Heading	Subheading	value
0	The teaching on my course	nan	benchmark	All years	Unnamed: 1_level_1	85.737368
1	Assessment and feedback	nan	benchmark	All years	Unnamed: 1_level_1	71.064694
2	Academic support	nan	benchmark	All years	Unnamed: 1_level_1	80.825662
3	Non-continuation	nan	benchmark	All years	Unnamed: 1_level_1	6.100000
4	Employment or further study	nan	benchmark	All years	Unnamed: 1_level_1	86.800000

In [892]:

providercontrib_long=longify(providercontrib,id_vars=['Topic','PRN','Benchmark type'],valname='value')
providercontrib_long.head()

Out[892]:

	Topic	PRN	Benchmark type	Heading	Subheading	value
0	The teaching on my course	nan	providercontrib	All years	Unnamed: 1_level_1	4.46
1	Assessment and feedback	nan	providercontrib	All years	Unnamed: 1_level_1	3.06
2	Academic support	nan	providercontrib	All years	Unnamed: 1_level_1	2.98
3	Non-continuation	nan	providercontrib	All years	Unnamed: 1_level_1	4.91
4	Employment or further study	nan	providercontrib	All years	Unnamed: 1_level_1	2.30

In [937]:

benchmarkprovider=pd.concat([benchmark_long,providercontrib_long])
benchmarkprovider.head()

Out[937]:

	Topic	PRN	Benchmark type	Heading	Subheading	value
0	The teaching on my course	nan	benchmark	All years	Unnamed: 1_level_1	85.737368
1	Assessment and feedback	nan	benchmark	All years	Unnamed: 1_level_1	71.064694
2	Academic support	nan	benchmark	All years	Unnamed: 1_level_1	80.825662
3	Non-continuation	nan	benchmark	All years	Unnamed: 1_level_1	6.100000
4	Employment or further study	nan	benchmark	All years	Unnamed: 1_level_1	86.800000

In [941]:

benchmarkprovider_wide=benchmarkprovider.pivot_table(index=['Topic','PRN','Heading','Subheading'],
                                                     columns='Benchmark type',
                                                     values='value').reset_index()
benchmarkprovider_wide.head()

Out[941]:

Benchmark type	Topic	PRN	Heading	Subheading	benchmark	providercontrib
0	Academic support	nan	Age	Mature	76.865662	2.58
1	Academic support	nan	Age	Young	78.745662	2.75
2	Academic support	nan	All years	Unnamed: 1_level_1	80.825662	2.98
3	Academic support	nan	Disabled	No	76.395662	3.19
4	Academic support	nan	Disabled	Yes	77.925662	3.14

In [ ]:

benchmarkprovider_wide.to_csv('tef_benchmarkprovider_wide.csv', index=False)

Differences (a-b)¶

In [149]:

#Thic code is actually same as for indicators I think?
def get_differences(fn):
    sn='Difference (a-b)'
    print(fn)
    institution,prn=getPRN(fn,sn)

    i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,
                              header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1))
    i.index.names=['Topic']
    i['PRN']=prn
    
    ift=i[1:7][:]
    ift['Type']='FT'
    ipt=i[8:14][:]
    ipt['Type']='PT'
    
    i=pd.concat([ift,ipt]).dropna(how='all',axis=1)

    i=i[:].reset_index().set_index(['Topic','PRN'])

    return i

In [150]:

differences=pd.DataFrame()

for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
    differences=pd.concat([differences,get_differences('{}/{}'.format(directory,filename))])
    
differences.head()#.to_csv('tef_differences.csv')

TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-150-b6079aa62e26> in <module>()
      2 
      3 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
----> 4     differences=pd.concat([differences,get_differences('{}/{}'.format(directory,filename))])
      5 
      6 differences.head()#.to_csv('tef_differences.csv')

<ipython-input-149-a35ae05f22a5> in get_differences(fn)
      6 
      7     i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,
----> 8                               header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1))
      9     i.index.names=['Topic']
     10     i['PRN']=prn

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    198 
    199     if not isinstance(io, ExcelFile):
--> 200         io = ExcelFile(io, engine=engine)
    201 
    202     return io._parse_excel(

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    255             self.book = xlrd.open_workbook(file_contents=data)
    256         elif isinstance(io, compat.string_types):
--> 257             self.book = xlrd.open_workbook(io)
    258         else:
    259             raise ValueError('Must explicitly set engine if not passing in'

/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    544         row_tag = U_SSML12 + "row"
    545         self_do_row = self.do_row
--> 546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
    548                 self_do_row(elem)

/usr/lib/python3.5/xml/etree/ElementTree.py in __next__(self)
   1295         try:
   1296             while 1:
-> 1297                 for event in self._parser.read_events():
   1298                     return event
   1299                 if self._parser._parser is None:

KeyboardInterrupt:

In [173]:

differences.head(20)

Out[173]:

		All years	Years			Level of study		Age		Disadvantaged		Ethnicity		Disabled		Sex		Domicile			Type
		Unnamed: 1_level_1	1	2	3	First degree	Other\nUG	Young	Mature	Yes	No	White	BME	Yes	No	Male	Female	UK	Other EU	Non EU
Topic	PRN
The teaching on my course	10007792	2.34886	3.04161	1.22077	2.67949	2.23199	NaN	2.35867	2.22319	2.87256	2.12792	2.60783	1.95624	3.3292	2.21903	1.62661	2.88701	2.45772	5.55582	-0.547378	FT
Assessment and feedback	10007792	3.4967	5.46856	2.88462	2.15646	3.69709	NaN	3.55698	2.72407	1.44683	3.38994	3.4811	3.16111	3.69043	3.47104	2.26725	4.41311	3.32879	5.40222	1.57468	FT
Academic support	10007792	2.07944	3.17045	1.8212	1.2551	2.07377	NaN	2.10283	1.77954	1.31398	1.76657	2.04698	2.38284	3.38082	1.90708	1.69626	2.36494	1.83747	3.99566	1.23495	FT
Non-continuation (b-a)	10007792	1.16474	1.24943	1.16828	1.10632	1.11174	N	1.07434	2.37925	0.513553	0.939633	1.07013	1.42161	1.2845	1.16958	1.60604	0.832465	0	0	0	FT
Employment or further study	10007792	0.521486	0.286413	1.12521	0.0209712	0.548394	R	0.534478	0.318813	0.179146	0.506285	0.436058	1.50502	1.45517	0.417999	0.927354	0.153997	0	0	0	FT
Highly skilled employment or further study	10007792	4.00179	4.95353	3.77736	3.01536	4.04543	R	4.03955	3.41268	1.62383	3.62774	3.97737	4.89422	3.35868	4.09564	3.6354	4.33353	0	0	0	FT
The teaching on my course	10007792	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	N	R	N	SUP	SUP	SUP	SUP	R	N	PT
Assessment and feedback	10007792	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	N	R	N	SUP	SUP	SUP	SUP	R	N	PT
Academic support	10007792	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	N	R	N	SUP	SUP	SUP	SUP	R	N	PT
Non-continuation (b-a)	10007792	16.2163	N	N	N	16.2163	0	N	DP	NaN	N	DP	N	N	DP	N	DP	0	0	0	PT
Employment or further study	10007792	-0.722503	N	DP	DP	DP	SUP	N	DP	R	N	DP	NaN	N	DP	DP	DP	0	0	0	PT
Highly skilled employment or further study	10007792	19.8536	N	DP	DP	27.152	SUP	N	DP	R	N	DP	NaN	N	DP	DP	DP	0	0	0	PT

In [853]:

differences_long=longify(differences,'value')
differences_long.head()

Out[853]:

	Topic	PRN	Heading	Subheading	value
0	The teaching on my course	nan	All years	Unnamed: 1_level_1	1.462632
1	Assessment and feedback	nan	All years	Unnamed: 1_level_1	2.235306
2	Academic support	nan	All years	Unnamed: 1_level_1	1.164338
3	Non-continuation (b-a)	nan	All years	Unnamed: 1_level_1	0.782399
4	Employment or further study	nan	All years	Unnamed: 1_level_1	5.400000

In [854]:

differences_long.to_csv('tef_differences_long.csv', index=False)

Zscores¶

These relate to the indicators.

In [165]:

def get_zscores(fn):
    sn='Z-score'
    print(fn)
    institution,prn=getPRN(fn,sn)

    z=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,header=[0,1]))
    z['PRN']=prn
    zft=z[1:7][:]
    zft['Type']='FT'
    zpt=z[8:14][:]
    zpt['Type']='PT'
    
    sd=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=30,parse_cols=22, header=[0,1]))
    sd['PRN']=prn
    sdft=sd[1:7][:]
    sdft['Type']='FT'
    sdpt=sd[8:14][:]
    sdpt['Type']='PT'
    
    z=pd.concat([zft,zpt]).dropna(how='all',axis=1)
    sd=pd.concat([sdft,sdpt]).dropna(how='all',axis=1)
    
    z['Error type']='z'
    sd['Error type']='sd'
    
    z.index.names=['Topic']
    sd.index.names=['Topic']
    
    return z,sd

#fn='tef/TEST_TEF_Metrics_workbook_exemplar.xlsx'
#z,sd= get_zscores(fn)

z=pd.DataFrame()
sd=pd.DataFrame()

for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
    z_tmp,sd_tmp= get_zscores(fn='{}/{}'.format(directory,filename))
    z=pd.concat([z,z_tmp])
    sd=pd.concat([sd,sd_tmp])

TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-165-a2cc9a145e05> in <module>()
     36 
     37 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
---> 38     z_tmp,sd_tmp= get_zscores(fn='{}/{}'.format(directory,filename))
     39     z=pd.concat([z,z_tmp])
     40     sd=pd.concat([sd,sd_tmp])

<ipython-input-165-a2cc9a145e05> in get_zscores(fn)
      4     institution,prn=getPRN(fn,sn)
      5 
----> 6     z=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,header=[0,1]))
      7     z['PRN']=prn
      8     zft=z[1:7][:]

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    198 
    199     if not isinstance(io, ExcelFile):
--> 200         io = ExcelFile(io, engine=engine)
    201 
    202     return io._parse_excel(

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    255             self.book = xlrd.open_workbook(file_contents=data)
    256         elif isinstance(io, compat.string_types):
--> 257             self.book = xlrd.open_workbook(io)
    258         else:
    259             raise ValueError('Must explicitly set engine if not passing in'

/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    544         row_tag = U_SSML12 + "row"
    545         self_do_row = self.do_row
--> 546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
    548                 self_do_row(elem)

/usr/lib/python3.5/xml/etree/ElementTree.py in __next__(self)
   1295         try:
   1296             while 1:
-> 1297                 for event in self._parser.read_events():
   1298                     return event
   1299                 if self._parser._parser is None:

/usr/lib/python3.5/xml/etree/ElementTree.py in read_events(self)
   1263             index = self._index
   1264             try:
-> 1265                 event = events[self._index]
   1266                 # Avoid retaining references to past events
   1267                 events[self._index] = None

KeyboardInterrupt:

In [174]:

z.head(20)

Out[174]:

	All years	Years			Level of study		Age		Disadvantaged		...	Disabled		Sex		Domicile			PRN	Type	Error type
	Unnamed: 1_level_1	1	2	3	First degree	Other\nUG	Young	Mature	Yes	No	...	Yes	No	Male	Female	UK	Other EU	Non EU
Topic
The teaching on my course	7.10342	5.3252	2.03351	4.94598	6.77154	NaN	6.85589	1.80743	2.75972	5.39015	...	3.45701	6.30257	3.06317	6.87317	6.85111	3.56633	-0.569411	10007792	FT	z
Assessment and feedback	7.6247	6.93072	3.48961	2.83967	8.04842	NaN	7.43428	1.639	0.989925	6.05305	...	2.73797	7.11754	3.19442	7.34698	6.53844	2.50602	1.34462	10007792	FT	z
Academic support	5.27047	4.67032	2.57395	1.91782	5.25888	NaN	5.13993	1.19991	1.05724	3.68713	...	2.94066	4.53997	2.8402	4.50317	4.20403	2.22035	1.19656	10007792	FT	z
Non-continuation	5.74893	3.6081	3.10899	3.35249	5.70975	N	5.63506	2.20931	0.824919	4.90464	...	1.97203	5.51903	5.08287	3.19744	0	0	0	10007792	FT	z
Employment or further study	1.84464	0.546067	2.44282	0.0442631	1.91896	R	1.84295	0.267295	0.222057	1.63644	...	1.69252	1.4045	2.04653	0.436756	0	0	0	10007792	FT	z
Highly skilled employment or further study	7.86696	5.49485	4.41337	3.52352	7.89039	R	7.59202	1.84611	1.07268	6.43625	...	2.38469	7.50914	4.99549	6.1014	0	0	0	10007792	FT	z
The teaching on my course	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	...	N	SUP	SUP	SUP	SUP	R	N	10007792	PT	z
Assessment and feedback	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	...	N	SUP	SUP	SUP	SUP	R	N	10007792	PT	z
Academic support	SUP	SUP	SUP	N	N	SUP	N	SUP	N	N	...	N	SUP	SUP	SUP	SUP	R	N	10007792	PT	z
Non-continuation	1.67569	N	N	N	1.67569	0	N	DP	NaN	N	...	N	DP	N	DP	0	0	0	10007792	PT	z
Employment or further study	-0.266142	N	DP	DP	DP	SUP	N	DP	R	N	...	N	DP	DP	DP	0	0	0	10007792	PT	z
Highly skilled employment or further study	4.05138	N	DP	DP	3.70309	SUP	N	DP	R	N	...	N	DP	DP	DP	0	0	0	10007792	PT	z

12 rows × 22 columns

In [841]:

sd.head()

Out[841]:

	All years	Years			Level of study		Age		Disadvantaged		...	Sex		Welsh medium		Domicile			PRN	Type	Error type
	Unnamed: 1_level_1	1	2	3	First Degree	Other UG	Young	Mature	Yes	No	...	Male	Female	Yes	No	UK	Other EU	Non EU
Topic
The teaching on my course	1.21886	4.44397	0.100994	6.0307	3.47268	1.08085	1.74223	3.14273	3.84986	0.287595	...	4.33997	1.712574	0	0	0.906851	7.05848	0.907585	dfdf	FT	sd
Assessment and feedback	0.620918	0.505516	1.22337	1.21236	0.0956798	0.754485	0.522697	1.45514	0.0966519	0.849326	...	0.347848	1.653479	0	0	0.175016	1.16463	0.810035	dfdf	FT	sd
Academic support	0.415835	0.926926	0.040144	1.17222	1.605	0.671054	1.70013	0.940759	2.50489	0.570067	...	1.61696	1.058166	0	0	0.360268	2.53371	0.0795091	dfdf	FT	sd
Non-continuation	0.325999	0.3527	0.90899	0.330694	1.65284	1.21958	1.01048	0.251151	1.60483	0.491675	...	1.17956	0.547368	0	0	NaN	NaN	NaN	dfdf	FT	sd
Employment or further study	0.642857	0.900592	0.0382883	0.57619	0.151839	0.408096	0.293578	0.911861	1.25866	0.0658683	...	0.413673	0.759783	0	0	NaN	NaN	NaN	dfdf	FT	sd

5 rows × 24 columns

In [855]:

sd_long=longify(sd,'sd value')
sd_long.head()

Out[855]:

	Topic	PRN	Heading	Subheading	sd value
0	The teaching on my course	dfdf	All years	Unnamed: 1_level_1	1.218860
1	Assessment and feedback	dfdf	All years	Unnamed: 1_level_1	0.620918
2	Academic support	dfdf	All years	Unnamed: 1_level_1	0.415835
3	Non-continuation	dfdf	All years	Unnamed: 1_level_1	0.325999
4	Employment or further study	dfdf	All years	Unnamed: 1_level_1	0.642857

In [856]:

z_long=longify(z,'z value')
z_long.head()

Out[856]:

	Topic	PRN	Heading	Subheading	z value
0	The teaching on my course	dfdf	All years	Unnamed: 1_level_1	1.2
1	Assessment and feedback	dfdf	All years	Unnamed: 1_level_1	3.6
2	Academic support	dfdf	All years	Unnamed: 1_level_1	-2.8
3	Non-continuation	dfdf	All years	Unnamed: 1_level_1	2.4
4	Employment or further study	dfdf	All years	Unnamed: 1_level_1	8.4

In [758]:

indicators_long.head()

Out[758]:

	Topic	PRN	Heading	Subheading	value
0	The teaching on my course	dfdf	All years	Unnamed: 1_level_1	87.2
1	Assessment and feedback	dfdf	All years	Unnamed: 1_level_1	73.3
2	Academic support	dfdf	All years	Unnamed: 1_level_1	81.99
3	Non-continuation	dfdf	All years	Unnamed: 1_level_1	5.3176
4	Employment or further study	dfdf	All years	Unnamed: 1_level_1	92.2

In [795]:

ifull_long=indicators_long.merge( sd_long, on=['Topic','PRN','Heading','Subheading'])
ifull_long=ifull_long.merge( z_long, on=['Topic','PRN','Heading','Subheading'])
ifull_long.head()

Out[795]:

	Topic	PRN	Heading	Subheading	value	sd value	z value
0	The teaching on my course	dfdf	All years	Unnamed: 1_level_1	87.2	1.21886	1.2
1	Assessment and feedback	dfdf	All years	Unnamed: 1_level_1	73.3	0.620918	3.6
2	Academic support	dfdf	All years	Unnamed: 1_level_1	81.99	0.415835	-2.8
3	Non-continuation	dfdf	All years	Unnamed: 1_level_1	5.3176	0.325999	2.4
4	Non-continuation	dfdf	All years	Unnamed: 1_level_1	5.3176	0.325999	1.4

In [796]:

ifull_long.to_csv('tef_indicators_with_error_long.csv', index=False)

Ethnicity¶

In [82]:

def get_ethnicity(fn):
    sn='BME breakdown'
    print(fn)
    institution,prn=getPRN(fn,sn)

    e=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=9,parse_cols=5))
    e.columns=['Category','NULL','BME','Black','Asian','Other']
    e['PRN']=prn
    
    ft=e[:6].dropna(how='all',axis=1)
    ft['Type']='FT'
    
    pt=e[7:13].dropna(how='all',axis=1)#.dropna(how='all',axis=1))[:-1]
    pt['Type']='PT'

    return pd.concat([ft,pt])

In [175]:

#fn='tef/TEF_Metrics_workbook_exemplar.xlsx'
#get_ethnicity(fn)

ethnicity=pd.DataFrame()

for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
    ethnicity=pd.concat([ethnicity,get_ethnicity('{}/{}'.format(directory,filename))])
    
ethnicity.to_csv('tef_ethnicity.csv',index=False)

TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-175-69c6fa652c51> in <module>()
      5 
      6 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
----> 7     ethnicity=pd.concat([ethnicity,get_ethnicity('{}/{}'.format(directory,filename))])
      8 
      9 ethnicity.to_csv('tef_ethnicity.csv',index=False)

<ipython-input-82-2790a51d3336> in get_ethnicity(fn)
      4     institution,prn=getPRN(fn,sn)
      5 
----> 6     e=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=9,parse_cols=5))
      7     e.columns=['Category','NULL','BME','Black','Asian','Other']
      8     e['PRN']=prn

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    198 
    199     if not isinstance(io, ExcelFile):
--> 200         io = ExcelFile(io, engine=engine)
    201 
    202     return io._parse_excel(

/usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    255             self.book = xlrd.open_workbook(file_contents=data)
    256         elif isinstance(io, compat.string_types):
--> 257             self.book = xlrd.open_workbook(io)
    258         else:
    259             raise ValueError('Must explicitly set engine if not passing in'

/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

/usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    550             elif elem.tag == U_SSML12 + "dimension":
    551                 self.do_dimension(elem)
--> 552             elif elem.tag == U_SSML12 + "mergeCell":
    553                 self.do_merge_cell(elem)
    554         self.finish_off()

KeyboardInterrupt:

In [631]:

!head tef_ethnicity.csv

Category,BME,Black,Asian,Other,PRN,Type
The teaching on my course,,,-,-,nan,FT
Assessment and feedback ,++,++,,+,nan,FT
Academic support,,-,,-,nan,FT
Non-continuation,,++,-,-,nan,FT
Employment or further study ,++,,+,+,nan,FT
Highly skilled employment,++,++,,++,nan,FT
The teaching on my course ,,+,- ,,nan,PT
Assessment and feedback,,- ,-,,nan,PT
Academic support ,,+,,,nan,PT

In [176]:

ethnicity.head()

Out[176]:

	Asian	BME	Black	Category	Other	PRN	Type
0	NaN	NaN	NaN	The teaching on my course	NaN	10007792	FT
1	NaN	+	NaN	Assessment and feedback	NaN	10007792	FT
2	+	+	NaN	Academic support	NaN	10007792	FT
3	NaN	NaN	NaN	Non-continuation	NaN	10007792	FT
4	NaN	NaN	NaN	Employment or further study	NaN	10007792	FT

In [ ]: