Tony Hirst, @psychemedia
Notebook to load and have a quick peek at the Teaching Excellence Framework (TEF) data: Teaching Excellence Framework.
Example spreadsheet format used to develop the original code can be found here: Teaching Excellence Framework: year 2 specification.
Final spreadsheets available from: http://www.hefce.ac.uk/lt/tef/data/
#Set to false for full run
#TEST: True, False
TEST=False
#Put Excel files into specified (sub)directory and load them from there
directory='TEFYearTwo_AllMetrics'#'tef'
Bits and bobs that may or may not be useful...
#Packages
import os
import pandas as pd
#Support inline plotting
%matplotlib inline
#Set letter vars as numeric - may be useful for referencing cells?
import string
k=0
for l in string.ascii_uppercase:
exec(l+"="+str(k))
k=k+1
A, D, Z
(0, 3, 25)
#Get some UK HE admin data - may be useful?
#Via http://learning-provider.data.ac.uk/
#lp=pd.read_csv('learning-providers-plus.csv')
lp=pd.read_csv('http://learning-provider.data.ac.uk/data/learning-providers-plus.csv')
lp.head()
UKPRN | PROVIDER_NAME | VIEW_NAME | SORT_NAME | ALIAS | FLAT_NAME_NUMBER | BUILDING_NAME_NUMBER | LOCALITY | STREET_NAME | TOWN | POSTCODE | WEBSITE_URL | WIKIPEDIA_URL | GROUPS | LONGITUDE | LATITUDE | EASTING | NORTHING | GTR_ID | HESA_ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10008640 | FALMOUTH UNIVERSITY | University College Falmouth | Falmouth, University College | NaN | NaN | Woodlane Campus | NaN | Woodlane | Falmouth | TR11 4RH | http://www.falmouth.ac.uk/ | http://en.wikipedia.org/wiki/University_Colleg... | NaN | -5.070901 | 50.149168 | 180711.0 | 32196.0 | E84FC550-A4CC-4B98-A6F9-D15A33829D83 | 17.0 |
1 | 10007774 | UNIVERSITY OF OXFORD | University of Oxford | Oxford, University of | NaN | NaN | UNIVERSITY OFFICES | NaN | WELLINGTON SQUARE | OXFORD | OX1 2JD | http://www.ox.ac.uk/ | http://en.wikipedia.org/wiki/University_of_Oxford | Science_and_Engineering_South, Russell_Group, ... | -1.262868 | 51.757644 | 450974.0 | 206807.0 | B1F0E8FE-FE3C-49ED-9C96-1ED75312A8A0 | 156.0 |
2 | 10007768 | UNIVERSITY OF LANCASTER | University of Lancaster | Lancaster, University of | NaN | NaN | UNIVERSITY HOUSE | BAILRIGG | NaN | LANCASTER | LA1 4YW | http://www.lancs.ac.uk/ | http://en.wikipedia.org/wiki/Lancaster_University | 1994_Group, N8_Research_Partnership | -2.786905 | 54.010480 | 348528.0 | 457448.0 | F8C7F869-77D6-4859-96F1-3550A4951F6C | 123.0 |
3 | 10000571 | BATH SPA UNIVERSITY | Bath Spa University | Bath Spa University | NaN | NaN | NEWTON PARK | NEWTON ST. LOE | NaN | BATH | BA2 9BN | http://www.bathspa.ac.uk/ | http://en.wikipedia.org/wiki/Bath_Spa_University | Million_Plus | -2.437400 | 51.378739 | 369654.0 | 164501.0 | 1A469850-02BA-4814-81EE-E60C851CABC3 | 48.0 |
4 | 10007814 | CARDIFF UNIVERSITY | Cardiff University | Cardiff University | NaN | NaN | NaN | NaN | PARK PLACE | CARDIFF | CF10 3AT | http://www.cardiff.ac.uk/ | http://en.wikipedia.org/wiki/Cardiff_University | GW4, Russell_Group | -3.179907 | 51.489093 | 318176.0 | 177343.0 | 9C10D78F-6430-4CA7-9528-B96B0762A4C6 | 179.0 |
Data is published as via http://www.hefce.ac.uk/lt/tef/data/.
Originally it was thought that data files would be provided just as individual Excel spreadsheets (original example), one per institution, hence the need for this notebook. In the end, multiple versions of the data were made available, including monolithic CSV documents (of file for all institutions), rendering this notebook superfluous.
The final released spreadsheets had some minor differences in layout and sheet labeling from the original example. The original notebook scripts were not intelligent enough to automatically cope with these and nor is this one. For example, cell ranges are hard coded rather than being autodetected.
#Download the zipped Excel workbooks
!wget http://www.hefce.ac.uk/media/HEFCE,2014/Content/Learning,and,teaching/TEF/TEFYearTwo/data/TEFYearTwo_AllMetrics.zip
--2017-06-22 08:41:13-- http://www.hefce.ac.uk/media/HEFCE,2014/Content/Learning,and,teaching/TEF/TEFYearTwo/data/TEFYearTwo_AllMetrics.zip Resolving www.hefce.ac.uk (www.hefce.ac.uk)... 195.194.167.210 Connecting to www.hefce.ac.uk (www.hefce.ac.uk)|195.194.167.210|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 44520361 (42M) [application/x-zip-compressed] Saving to: ‘TEFYearTwo_AllMetrics.zip’ TEFYearTwo_AllMetri 100%[===================>] 42.46M 4.60MB/s in 12s 2017-06-22 08:41:25 (3.61 MB/s) - ‘TEFYearTwo_AllMetrics.zip’ saved [44520361/44520361]
Put all the spreadsheet files into a single folder; the script will then load them in and generate some monolithic CSV files of various flavours.
Note the processing of the files by the rest of the notebook is not very efficient...
!mkdir -p {directory}
!unzip -q TEFYearTwo_AllMetrics.zip -d {directory}
#Column names are crappy with white space - clean it off
def cleanCols(df):
df.rename(columns=lambda x: str(x).strip(),inplace=True)
return df
#Hack the PRN out of each sheet and use it to key data values with a new prn column in each dataset
def getPRN2(fn,sn):
i=pd.read_excel(fn, sheetname=sn, parse_cols=[0,1],header=None).head()
institution=None
prn=None
institution=str(i[i[0].notnull() & i[0].str.contains('Institution')][1].iloc[0]).strip()
prn=str(i[i[0].notnull() & i[0].str.contains('UKPRN')][1].iloc[0]).strip()
return institution,prn
def getPRN(fn,sn):
#TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx
prn=fn.split('/')[1].split('_')[0]
institution=fn.split('/')[1].split('_')[1]
return institution,prn
#Quick helper that would let us leave test spreadsheets in the data dir.
def fntest(f):
if TEST: return not f.startswith('~') and f.startswith('TEST')
return not f.startswith('~') and not f.startswith('TEST')
fn=os.listdir(directory)[0]
xl=pd.read_excel('{}/{}'.format(directory,fn), sheetname=None)
xl.keys()
odict_keys(['Coversheet', 'Contextual data', 'Core metrics', 'Core metrics and splits', 'BME breakdown', 'Indicator (a)', 'Benchmark (b)', 'Difference (a-b)', 'Z-score', 'Numerators and denominators'])
#get PRN from filename
getPRN('{}/{}'.format(directory,fn),list(xl.keys())[1])
('University of Exeter', '10007792')
The spreadsheets include various compound tables which need processing to split out the separate subtables, such as data for full-time versus part-time students.
The final spreadsheets also had populated, administrative(?) cells at the extremity of many, if not all, sheets. Using the pandas.read_excel()
function's parse_cols
parameter allow the ingest of only the desired columns. (Loading in the whole sheet width often caused the multiple header definition part of the ingest to break.)
Guess that the spreadsheets are regular and use absolute range finding... Which is really dangerous...
def get_coremetricsAndSplits(fn):
sn='Core metrics and splits'
print(fn)
xls=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=7,
header=[0,1]).dropna(how='all',axis=0))
institution,prn=getPRN(fn,sn)
i_fulltime=xls[xls.columns[0:27]][:6]
i_fulltime.index.names=['Topic']
i_fulltime['PRN']=prn
i_fulltime['type']='FT'
i_fulltime=i_fulltime.reset_index().set_index(['Topic','PRN'])
i_fulltime.drop('Unnamed: 0_level_0', level=0, axis=1, inplace=True)
i_parttime=xls[xls.columns[0:27]][7:13]
i_parttime.index.names=['Topic']
i_parttime['PRN']=prn
i_parttime['type']='PT'
i_parttime=i_parttime.reset_index().set_index(['Topic','PRN'])
i_parttime.drop('Unnamed: 0_level_0', level=0, axis=1, inplace=True)
return i_fulltime, i_parttime
cms_ft=pd.DataFrame()
cms_pt=pd.DataFrame()
cms_complete=pd.DataFrame()
#Loop through all the files in the declared directory, and parse the contents of a particular sheet out
for filename in [f for f in os.listdir(directory) if fntest(f)]:
cms_ft_tmp, cms_pt_tmp=get_coremetricsAndSplits('{}/{}'.format(directory,filename))
cms_ft=pd.concat([cms_ft,cms_ft_tmp])
cms_pt=pd.concat([cms_pt,cms_pt_tmp])
cms_complete=pd.concat([cms_complete,cms_pt_tmp,cms_ft_tmp])
cms_complete.to_csv('tef_coremetricandsplits.csv')
TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx TEFYearTwo_AllMetrics/10007796_The University of Leicester_Metrics.xlsx TEFYearTwo_AllMetrics/10007798_The University of Manchester_Metrics.xlsx
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-104-932eeaaf104f> in <module>() 31 #Loop through all the files in the declared directory, and parse the contents of a particular sheet out 32 for filename in [f for f in os.listdir(directory) if fntest(f)]: ---> 33 cms_ft_tmp, cms_pt_tmp=get_coremetricsAndSplits('{}/{}'.format(directory,filename)) 34 cms_ft=pd.concat([cms_ft,cms_ft_tmp]) 35 cms_pt=pd.concat([cms_pt,cms_pt_tmp]) <ipython-input-104-932eeaaf104f> in get_coremetricsAndSplits(fn) 3 print(fn) 4 xls=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=7, ----> 5 header=[0,1]).dropna(how='all',axis=0)) 6 institution,prn=getPRN(fn,sn) 7 i_fulltime=xls[xls.columns[0:27]][:6] /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds) 198 199 if not isinstance(io, ExcelFile): --> 200 io = ExcelFile(io, engine=engine) 201 202 return io._parse_excel( /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds) 255 self.book = xlrd.open_workbook(file_contents=data) 256 elif isinstance(io, compat.string_types): --> 257 self.book = xlrd.open_workbook(io) 258 else: 259 raise ValueError('Must explicitly set engine if not passing in' /usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading) 544 row_tag = U_SSML12 + "row" 545 self_do_row = self.do_row --> 546 for event, elem in ET.iterparse(stream): 547 if elem.tag == row_tag: 548 self_do_row(elem) /usr/lib/python3.5/xml/etree/ElementTree.py in __next__(self) 1302 data = self._file.read(16 * 1024) 1303 if data: -> 1304 self._parser.feed(data) 1305 else: 1306 self._root = self._parser._close_and_return_root() /usr/lib/python3.5/xml/etree/ElementTree.py in feed(self, data) 1235 if data: 1236 try: -> 1237 self._parser.feed(data) 1238 except SyntaxError as exc: 1239 self._events_queue.append(exc) KeyboardInterrupt:
cms_complete.head()
Core metrics | Splits | type | ||||||
---|---|---|---|---|---|---|---|---|
Indicator\n(a) % | Benchmark\n(b) % | Difference \n(a)-(b) * | Z-score | Flag | Years | |||
Topic | PRN | |||||||
Assessment and feedback | 10007792 | SUP | SUP | SUP | SUP | SUP | SUP | PT |
Academic support | 10007792 | SUP | SUP | SUP | SUP | SUP | SUP | PT |
Non-continuation | 10007792 | NaN | NaN | NaN | NaN | NaN | N | PT |
Employment or further study | 10007792 | NaN | NaN | NaN | NaN | NaN | N | PT |
Highly skilled employment or further study | 10007792 | ++ | ++ | ++ | ++ | ++ | N | PT |
#It may be easier to work with if we put everything into long format?
cms_complete_long=cms_complete.reset_index().melt(id_vars=['Topic','PRN', 'type'],
var_name=['Heading','Subheading'])
cms_complete_long.head()
Topic | PRN | type | Heading | Subheading | value | |
---|---|---|---|---|---|---|
0 | Assessment and feedback | 10007792 | PT | Core metrics | Indicator\n(a) % | SUP |
1 | Academic support | 10007792 | PT | Core metrics | Indicator\n(a) % | SUP |
2 | Non-continuation | 10007792 | PT | Core metrics | Indicator\n(a) % | NaN |
3 | Employment or further study | 10007792 | PT | Core metrics | Indicator\n(a) % | NaN |
4 | Highly skilled employment or further study | 10007792 | PT | Core metrics | Indicator\n(a) % | ++ |
cms_complete_long['type'].unique()
array(['PT', 'FT'], dtype=object)
#In long format it's easy to write queires... For example, report on ++ rows:
cms_complete_long[cms_complete_long['value']=='++'].head()
Topic | PRN | type | Heading | Subheading | value | |
---|---|---|---|---|---|---|
4 | Highly skilled employment or further study | 10007792 | PT | Core metrics | Indicator\n(a) % | ++ |
8 | Assessment and feedback | 10007792 | FT | Core metrics | Indicator\n(a) % | ++ |
10 | Highly skilled employment or further study | 10007792 | FT | Core metrics | Indicator\n(a) % | ++ |
12 | Highly skilled employment or further study | 10007795 | PT | Core metrics | Indicator\n(a) % | ++ |
21 | The teaching on my course | 10007795 | FT | Core metrics | Indicator\n(a) % | ++ |
#A query over more columns...
cms_complete_long[(cms_complete_long['value']=='++') &
(cms_complete_long['Heading']=='Disadvantaged') &
(cms_complete_long['Subheading']=='Yes')]
Topic | PRN | type | Heading | Subheading | value | |
---|---|---|---|---|---|---|
298 | Employment or further study | nan | FT | Disadvantaged | Yes | ++ |
299 | Highly skilled employment or further study | nan | FT | Disadvantaged | Yes | ++ |
310 | Employment or further study | nan | FT | Disadvantaged | Yes | ++ |
311 | Highly skilled employment or further study | nan | FT | Disadvantaged | Yes | ++ |
cms_complete.to_csv('tef_coremetricandsplits_long.csv', index=False)
#Would it be easier/more useful to demo SQL queries?
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
q='''
SELECT * FROM cms_complete_long
WHERE value='++' AND Heading='Disadvantaged'AND Subheading='Yes' LIMIT 10;
'''
pysqldf(q)
Topic | PRN | type | Heading | Subheading | value | |
---|---|---|---|---|---|---|
0 | Employment or further study | nan | FT | Disadvantaged | Yes | ++ |
1 | Highly skilled employment or further study | nan | FT | Disadvantaged | Yes | ++ |
2 | Employment or further study | nan | FT | Disadvantaged | Yes | ++ |
3 | Highly skilled employment or further study | nan | FT | Disadvantaged | Yes | ++ |
def get_contextualdata(fn):
sn='Contextual data'
print(fn)
institution,prn=getPRN(fn,sn)
_demog=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=11,parse_cols=8))
_demog.columns=['Category','Dummy','Group','FT Headcount','FT %','PT Headcount','PT %','Total Headcount','Total %']
_demog = _demog.drop('Dummy', axis=1)
_demog.fillna(method='ffill',inplace=True)
_demog['PRN']=prn
_subj=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=9,parse_cols=range(10,17))).dropna(how='all',axis=1).dropna(how='all',axis=0)
_subj.columns=['Subject of study','FT Headcount','FT %','PT Headcount','PT %','Total Headcount','Total %']
_subj['PRN']=prn
return _demog, _subj
demog=pd.DataFrame()
subj=pd.DataFrame()
for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
demog_tmp,subj_tmp=get_contextualdata('{}/{}'.format(directory,filename))
demog=pd.concat([demog,demog_tmp])
subj=pd.concat([subj,subj_tmp])
TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx TEFYearTwo_AllMetrics/10007796_The University of Leicester_Metrics.xlsx TEFYearTwo_AllMetrics/10007798_The University of Manchester_Metrics.xlsx TEFYearTwo_AllMetrics/10007799_University of Newcastle upon Tyne_Metrics.xlsx TEFYearTwo_AllMetrics/10007801_University of Plymouth_Metrics.xlsx TEFYearTwo_AllMetrics/10007802_The University of Reading_Metrics.xlsx TEFYearTwo_AllMetrics/10007803_University of St Andrews_Metrics.xlsx TEFYearTwo_AllMetrics/10007806_University of Sussex_Metrics.xlsx TEFYearTwo_AllMetrics/10007811_Bishop Grosseteste University_Metrics.xlsx TEFYearTwo_AllMetrics/10007814_Cardiff University_Metrics.xlsx TEFYearTwo_AllMetrics/10007816_The Royal Central School of Speech and Drama_Metrics.xlsx TEFYearTwo_AllMetrics/10007817_Chichester College_Metrics.xlsx TEFYearTwo_AllMetrics/10007823_Edge Hill University_Metrics.xlsx TEFYearTwo_AllMetrics/10007825_Guildhall School of Music & Drama_Metrics.xlsx TEFYearTwo_AllMetrics/10007832_Newman University_Metrics.xlsx TEFYearTwo_AllMetrics/10007833_Wrexham GlyndèÊr University_Metrics.xlsx TEFYearTwo_AllMetrics/10007835_The Royal Academy of Music_Metrics.xlsx TEFYearTwo_AllMetrics/10007837_Royal Northern College of Music_Metrics.xlsx TEFYearTwo_AllMetrics/10007842_The University of Cumbria_Metrics.xlsx TEFYearTwo_AllMetrics/10007843_St Marys University Twickenham_Metrics.xlsx TEFYearTwo_AllMetrics/10007848_University of Chester_Metrics.xlsx TEFYearTwo_AllMetrics/10007849_University of Abertay Dundee_Metrics.xlsx TEFYearTwo_AllMetrics/10007850_The University of Bath_Metrics.xlsx TEFYearTwo_AllMetrics/10007851_University of Derby_Metrics.xlsx TEFYearTwo_AllMetrics/10007852_University of Dundee_Metrics.xlsx TEFYearTwo_AllMetrics/10007854_Cardiff Metropolitan University_Metrics.xlsx TEFYearTwo_AllMetrics/10007855_Swansea University_Metrics.xlsx TEFYearTwo_AllMetrics/10007857_Bangor University_Metrics.xlsx TEFYearTwo_AllMetrics/10007858_University of Wales Trinity Saint David_Metrics.xlsx TEFYearTwo_AllMetrics/10007859_Warwickshire College_Metrics.xlsx TEFYearTwo_AllMetrics/10007938_Grimsby Institute of Further and Higher Education_Metrics.xlsx
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-49-5b07e77609b8> in <module>() 21 22 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]: ---> 23 demog_tmp,subj_tmp=get_contextualdata('{}/{}'.format(directory,filename)) 24 demog=pd.concat([demog,demog_tmp]) 25 subj=pd.concat([subj,subj_tmp]) <ipython-input-49-5b07e77609b8> in get_contextualdata(fn) 5 institution,prn=getPRN(fn,sn) 6 ----> 7 _demog=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=11,parse_cols=8)) 8 _demog.columns=['Category','Dummy','Group','FT Headcount','FT %','PT Headcount','PT %','Total Headcount','Total %'] 9 _demog = _demog.drop('Dummy', axis=1) /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds) 198 199 if not isinstance(io, ExcelFile): --> 200 io = ExcelFile(io, engine=engine) 201 202 return io._parse_excel( /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds) 255 self.book = xlrd.open_workbook(file_contents=data) 256 elif isinstance(io, compat.string_types): --> 257 self.book = xlrd.open_workbook(io) 258 else: 259 raise ValueError('Must explicitly set engine if not passing in' /usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading) 546 for event, elem in ET.iterparse(stream): 547 if elem.tag == row_tag: --> 548 self_do_row(elem) 549 elem.clear() # destroy all child elements (cells) 550 elif elem.tag == U_SSML12 + "dimension": /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in do_row(self, row_elem) 652 if c == '$': 653 continue --> 654 lv = letter_value[c] 655 if lv: 656 colx = colx * 26 + lv KeyboardInterrupt:
demog.head()
Category | Group | FT Headcount | FT % | PT Headcount | PT % | Total Headcount | Total % | PRN | |
---|---|---|---|---|---|---|---|---|---|
0 | Level of study | First degree | 14425 | 0 | 65 | 0 | 0 | 0 | 10007792 |
1 | Level of study | Other UG | 610 | 0 | 55 | 0 | 0 | 0 | 10007792 |
2 | Age | Under 21 | 13795 | 0 | 25 | 0 | 0 | 0 | 10007792 |
3 | Age | 21 to 30 | 1080 | 0 | 10 | 0 | 0 | 0 | 10007792 |
4 | Age | Over 30 | 150 | 0 | 85 | 0 | 0 | 0 | 10007792 |
subj.head()
Subject of study | FT Headcount | FT % | PT Headcount | PT % | Total Headcount | Total % | PRN | |
---|---|---|---|---|---|---|---|---|
0 | Medicine & dentistry and veterinary science | 300.0 | 0.03 | 30.0 | 0.00 | 330.0 | 0.03 | nan |
1 | Subjects allied to medicine | 1000.0 | 0.10 | 100.0 | 0.11 | 1100.0 | 0.10 | nan |
2 | Biological sciences | 1100.0 | 0.11 | 110.0 | 0.14 | 1210.0 | 0.11 | nan |
3 | Agriculture & related subjects | 100.0 | 0.01 | 10.0 | 0.00 | 110.0 | 0.01 | nan |
4 | Physical sciences | 500.0 | 0.05 | 50.0 | 0.03 | 550.0 | 0.05 | nan |
subj[['Subject of study','PRN','Total Headcount']].head()
Subject of study | PRN | Total Headcount | |
---|---|---|---|
0 | Medicine & dentistry and veterinary science | nan | 330.0 |
1 | Subjects allied to medicine | nan | 1100.0 |
2 | Biological sciences | nan | 1210.0 |
3 | Agriculture & related subjects | nan | 110.0 |
4 | Physical sciences | nan | 550.0 |
subj_long=subj.melt(id_vars=['Subject of study','PRN'],var_name='Category')
subj_long['value']= pd.to_numeric(subj_long['value'],errors='coerce')
subj_long.head()
Subject of study | PRN | Category | value | |
---|---|---|---|---|
0 | Medicine & dentistry and veterinary science | nan | FT Headcount | 300.0 |
1 | Subjects allied to medicine | nan | FT Headcount | 1000.0 |
2 | Biological sciences | nan | FT Headcount | 1100.0 |
3 | Agriculture & related subjects | nan | FT Headcount | 100.0 |
4 | Physical sciences | nan | FT Headcount | 500.0 |
def longify(df,id_vars=['Topic','PRN'],var_name=['Heading','Subheading'],valname='value',resetIndex=True):
if resetIndex: df=df.reset_index()
df=df.melt(id_vars=id_vars,
var_name=var_name,
value_name=valname)
df[valname]= pd.to_numeric(df[valname],errors='coerce')
return df
subj_long=longify(subj,id_vars=['Subject of study','PRN'],var_name='Category', resetIndex=False)
subj_long.head()
Subject of study | PRN | Category | value | |
---|---|---|---|---|
0 | Medicine & dentistry and veterinary science | nan | FT Headcount | 300.0 |
1 | Subjects allied to medicine | nan | FT Headcount | 1000.0 |
2 | Biological sciences | nan | FT Headcount | 1100.0 |
3 | Agriculture & related subjects | nan | FT Headcount | 100.0 |
4 | Physical sciences | nan | FT Headcount | 500.0 |
subj_long.to_csv('tef_subj_long.csv', index=False)
def get_indicators(fn):
sn='Indicator (a)'
print(fn)
institution,prn=getPRN(fn,sn)
i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,
header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1))
i['PRN']=prn
ift=i[1:7][:]
ift['Type']='FT'
ipt=i[8:14][:]
ipt['Type']='PT'
i=pd.concat([ift,ipt]).dropna(how='all',axis=1)
i.index.names=['Topic']
i=i.reset_index().set_index(['Topic','PRN'])
return i
indicators=pd.DataFrame()
for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
indicators=pd.concat([indicators,get_indicators('{}/{}'.format(directory,filename))])
indicators.to_csv('tef_indicators.csv')
TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx
/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:12: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy if sys.path[0] == '':
TEFYearTwo_AllMetrics/10007796_The University of Leicester_Metrics.xlsx TEFYearTwo_AllMetrics/10007798_The University of Manchester_Metrics.xlsx TEFYearTwo_AllMetrics/10007799_University of Newcastle upon Tyne_Metrics.xlsx
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-144-0a4c8ab949a1> in <module>() 2 3 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]: ----> 4 indicators=pd.concat([indicators,get_indicators('{}/{}'.format(directory,filename))]) 5 6 indicators.to_csv('tef_indicators.csv') <ipython-input-143-c1c21329cc55> in get_indicators(fn) 5 6 i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22, ----> 7 header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1)) 8 9 i['PRN']=prn /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds) 198 199 if not isinstance(io, ExcelFile): --> 200 io = ExcelFile(io, engine=engine) 201 202 return io._parse_excel( /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds) 255 self.book = xlrd.open_workbook(file_contents=data) 256 elif isinstance(io, compat.string_types): --> 257 self.book = xlrd.open_workbook(io) 258 else: 259 raise ValueError('Must explicitly set engine if not passing in' /usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading) 546 for event, elem in ET.iterparse(stream): 547 if elem.tag == row_tag: --> 548 self_do_row(elem) 549 elem.clear() # destroy all child elements (cells) 550 elif elem.tag == U_SSML12 + "dimension": /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in do_row(self, row_elem) 636 letter_value = _UPPERCASE_1_REL_INDEX 637 for cell_elem in row_elem: --> 638 cell_name = cell_elem.get('r') 639 if cell_name is None: # Yes, it's optional. 640 colx += 1 KeyboardInterrupt:
indicators.head(15)
All years | Years | Level of study | Age | Disadvantaged | Ethnicity | Disabled | Sex | Domicile | Type | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 1_level_1 | 1 | 2 | 3 | First degree | Other\nUG | Young | Mature | Yes | No | White | BME | Yes | No | Male | Female | UK | Other EU | Non EU | |||
Topic | PRN | ||||||||||||||||||||
The teaching on my course | 10007792 | 90.6797 | 90.8339 | 89.8528 | 91.2401 | 90.6797 | NaN | 90.7622 | 89.6226 | 91.6187 | 91.5394 | 91.8816 | 86.7551 | 91.2695 | 90.6016 | 89.0266 | 91.9115 | 91.5804 | 92.212 | 85.0954 | FT |
Assessment and feedback | 10007792 | 75.517 | 76.4804 | 75.0354 | 75.0201 | 75.517 | NaN | 75.4647 | 76.1871 | 74.9424 | 75.5491 | 75.8037 | 73.7306 | 74.0418 | 75.7124 | 74.0954 | 76.5766 | 75.4667 | 75.1963 | 75.9003 | FT |
Academic support | 10007792 | 83.8984 | 84.4193 | 83.9359 | 83.3757 | 83.8984 | NaN | 84.0099 | 82.4686 | 84.2166 | 83.9893 | 84.4267 | 81.8901 | 84.3709 | 83.8358 | 84.2393 | 83.6444 | 83.958 | 85.6894 | 83.0154 | FT |
Non-continuation | 10007792 | 2.2783 | 2.12284 | 2.39577 | 2.3185 | 2.27954 | N | 1.75764 | 9.23277 | 3.33333 | 1.52757 | 2.2543 | 2.61283 | 3.98773 | 2.0488 | 2.44962 | 2.11622 | 0 | 0 | 0 | FT |
Employment or further study | 10007792 | 94.4993 | 93.9013 | 95.2135 | 94.332 | 94.5236 | R | 94.592 | 93.0521 | 94.2377 | 94.6537 | 94.5472 | 93.9394 | 93.7793 | 94.6043 | 93.4256 | 95.4714 | 0 | 0 | 0 | FT |
Highly skilled employment or further study | 10007792 | 79.3124 | 79.3296 | 79.1721 | 79.4422 | 79.3698 | R | 79.3065 | 79.4045 | 74.7899 | 80.011 | 79.0003 | 83.6364 | 77.8169 | 79.5307 | 80.8116 | 77.955 | 0 | 0 | 0 | FT |
The teaching on my course | 10007792 | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | N | R | N | SUP | SUP | SUP | SUP | R | N | PT |
Assessment and feedback | 10007792 | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | N | R | N | SUP | SUP | SUP | SUP | R | N | PT |
Academic support | 10007792 | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | N | R | N | SUP | SUP | SUP | SUP | R | N | PT |
Non-continuation | 10007792 | 21.0526 | N | N | N | 21.0526 | 0 | N | DP | NaN | N | DP | N | N | DP | N | DP | 0 | 0 | 0 | PT |
Employment or further study | 10007792 | 91.1111 | N | DP | DP | DP | SUP | N | DP | R | N | DP | NaN | N | DP | DP | DP | 0 | 0 | 0 | PT |
Highly skilled employment or further study | 10007792 | 88.8889 | N | DP | DP | 86.9565 | SUP | N | DP | R | N | DP | NaN | N | DP | DP | DP | 0 | 0 | 0 | PT |
The teaching on my course | 10007795 | 90.9203 | 89.8789 | 91.7786 | 91.254 | 90.9274 | 89.5833 | 90.8864 | 91.2909 | 91.3506 | 90.8578 | 91.3007 | 89.5158 | 90.8638 | 90.9258 | 89.7398 | 91.633 | 90.9966 | 88.4162 | 90.9528 | FT |
Assessment and feedback | 10007795 | 72.4769 | 70.8454 | 74.1373 | 72.6671 | 72.4931 | 69.4444 | 72.2322 | 75.1547 | 74.1532 | 71.0338 | 71.997 | 73.4978 | 69.9169 | 72.7273 | 73.136 | 72.079 | 71.7057 | 74.4241 | 80.7418 | FT |
Academic support | 10007795 | 83.5246 | 82.0285 | 84.5623 | 84.2109 | 83.507 | 86.8056 | 83.433 | 84.5259 | 85.7576 | 82.5839 | 83.5152 | 83.2889 | 81.4646 | 83.7261 | 84.0149 | 83.2285 | 83.1844 | 84.1187 | 87.2634 | FT |
!head tef_indicators.csv
,,All years,Years,Years,Years,Level of study,Level of study,Age,Age,Disadvantaged,Disadvantaged,Ethnicity,Ethnicity,Disabled,Disabled,Sex,Sex,Welsh medium,Domicile,Domicile,Domicile ,,Unnamed: 1_level_1,1,2,3,First Degree,Other UG,Young,Mature,Yes,No,White,BME,Yes,No,Male,Female,Yes,UK,Other EU,Non EU Topic,PRN,,,,,,,,,,,,,,,,,,,, The teaching on my course,dfdf,87.2,90.68,88.92,85.84,92.03,80,83.07000000000001,88.43,83.44,90.91,89.23,88.45,84.94,91.65,82.42,90.66,,87.3,83.25,91.89 Assessment and feedback ,dfdf,73.3,72.14,75.83,70.06,74.24,69.09,68.67,75.75,76.24,71.81,68.87,76.31,77.02,69.86999999999999,68.9,74.96,,73.86999999999999,68.53999999999999,72.61999999999999 Academic support,dfdf,81.99,79.76,81.80000000000001,82.32000000000001,76.56,75.58,83.2,79.02000000000001,77.16000000000001,77.57000000000001,81.65,78.79,81.04,75.46000000000001,83.72,76.7,,78.41000000000001,83.42,80.22 Non-continuation,dfdf,5.317601422337159,6.427601422337159,3.527601422337159,4.847601422337159,4.887601422337159,6.617601422337159,4.6676014223371585,7.017601422337159,4.237601422337159,4.9176014223371585,7.287601422337159,6.327601422337159,3.687601422337159,4.307601422337159,5.227601422337159,7.55,,,, Employment or further study ,dfdf,92.2,93.39,92.02,96.39,90.21000000000001,92.84,89.37,93.11,97.12,87.48,94.9,90.19,91.78,91.84,91.07000000000001,95.51,,,, Highly skilled employment,dfdf,74.1,73.3,77.24,76.24,70.30999999999999,74.38,75.83,76.91,76.58999999999999,72.13,72.75999999999999,76.14,74.14,69.86999999999999,77.13,70.52,,,, The teaching on my course ,dfdf,90.1,91.38,86.30999999999999,85.55,92.78999999999999,86.86,N,86.1,89.5,87.5,90.99,91.28,N,93.28,94.78,90.94999999999999,,94.85,89.96,89.52
pd.read_csv('tef_indicators.csv',header=[0,1],index_col=[0,1]).head()
All years | Years | Level of study | Age | Disadvantaged | Ethnicity | Disabled | Sex | Welsh medium | Domicile | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 1_level_1 | 1 | 2 | 3 | First Degree | Other UG | Young | Mature | Yes | No | White | BME | Yes | No | Male | Female | Yes | UK | Other EU | Non EU | ||
Topic | PRN | ||||||||||||||||||||
The teaching on my course | dfdf | 87.200000 | 90.680000 | 88.920000 | 85.840000 | 92.030000 | 80.000000 | 83.07000000000001 | 88.430000 | 83.440000 | 90.910000 | 89.230000 | 88.450000 | 84.94 | 91.650000 | 82.420000 | 90.66 | NaN | 87.30 | 83.25 | 91.89 |
Assessment and feedback | dfdf | 73.300000 | 72.140000 | 75.830000 | 70.060000 | 74.240000 | 69.090000 | 68.67 | 75.750000 | 76.240000 | 71.810000 | 68.870000 | 76.310000 | 77.02 | 69.870000 | 68.900000 | 74.96 | NaN | 73.87 | 68.54 | 72.62 |
Academic support | dfdf | 81.990000 | 79.760000 | 81.800000 | 82.320000 | 76.560000 | 75.580000 | 83.2 | 79.020000 | 77.160000 | 77.570000 | 81.650000 | 78.790000 | 81.04 | 75.460000 | 83.720000 | 76.70 | NaN | 78.41 | 83.42 | 80.22 |
Non-continuation | dfdf | 5.317601 | 6.427601 | 3.527601 | 4.847601 | 4.887601 | 6.617601 | 4.6676014223371585 | 7.017601 | 4.237601 | 4.917601 | 7.287601 | 6.327601 | 3.687601422337159 | 4.307601 | 5.227601 | 7.55 | NaN | NaN | NaN | NaN |
Employment or further study | dfdf | 92.200000 | 93.390000 | 92.020000 | 96.390000 | 90.210000 | 92.840000 | 89.37 | 93.110000 | 97.120000 | 87.480000 | 94.900000 | 90.190000 | 91.78 | 91.840000 | 91.070000 | 95.51 | NaN | NaN | NaN | NaN |
dummyPRN=indicators.index.get_level_values('PRN').unique().tolist()[0]
dummyPRN
'10007792'
#Example cross-section
indicators.xs(dummyPRN, level='PRN')['Age','Young']
Topic Full-time headcount: NaN The teaching on my course 90.7622 Assessment and feedback 75.4647 Academic support 84.0099 Non-continuation 1.75764 Employment or further study 94.592 Highly skilled employment or further study 79.3065 Part-time headcount: NaN The teaching on my course N Assessment and feedback N Academic support N Non-continuation N Employment or further study N Highly skilled employment or further study N NaN MET_WHITE NaN TEFXXXAGE_Y NaN INDICATOR_\nTEFXXXAGE_\nY NaN NaN NaN NaN NaN NaN N NaN R NaN SUP NaN NaN NaN DP NaN Name: (Age, Young), dtype: object
#Display the PRNs in the index
indicators.index.get_level_values('PRN').unique().tolist()
['dfdf', 'JHDH']
#Example filter on column and PRN
indicators.xs('JHDH', level='PRN')['Domicile']
UK | Other EU | Non EU | |
---|---|---|---|
Topic | |||
The teaching on my course | 87.30 | 83.25 | 91.89 |
Assessment and feedback | 73.87 | 68.54 | 72.62 |
Academic support | 78.41 | 83.42 | 80.22 |
Non-continuation | NaN | NaN | NaN |
Employment or further study | NaN | NaN | NaN |
Highly skilled employment | NaN | NaN | NaN |
The teaching on my course | 94.85 | 89.96 | 89.52 |
Assessment and feedback | 74.95 | 77.69 | 82.98 |
Academic support | 86.12 | 83.30 | 83.60 |
Non-continuation | NaN | NaN | NaN |
Employment or further study | NaN | NaN | NaN |
Highly skilled employment | NaN | NaN | NaN |
indicators_long=longify(indicators)
indicators_long.tail(50).head(10)
Topic | PRN | Heading | Subheading | value | |
---|---|---|---|---|---|
430 | Employment or further study | JHDH | Domicile | UK | NaN |
431 | Highly skilled employment | JHDH | Domicile | UK | NaN |
432 | The teaching on my course | dfdf | Domicile | Other EU | 83.25 |
433 | Assessment and feedback | dfdf | Domicile | Other EU | 68.54 |
434 | Academic support | dfdf | Domicile | Other EU | 83.42 |
435 | Non-continuation | dfdf | Domicile | Other EU | NaN |
436 | Employment or further study | dfdf | Domicile | Other EU | NaN |
437 | Highly skilled employment | dfdf | Domicile | Other EU | NaN |
438 | The teaching on my course | dfdf | Domicile | Other EU | 89.96 |
439 | Assessment and feedback | dfdf | Domicile | Other EU | 77.69 |
indicators_long.to_csv('tef_indicators_long.csv', index=False)
#This code is similar to that used for Zscores sheet
def get_benchmarks(fn):
sn='Benchmark (b)'
print(fn)
institution,prn=getPRN(fn,sn)
z=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=21, header=[0,1]))
z['PRN']=prn
zft=z[1:7][:]
zft['Type']='FT'
zpt=z[8:14][:]
zpt['Type']='PT'
z=pd.concat([zft,zpt]).dropna(how='all',axis=1)
sd=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=30,parse_cols=21, header=[0,1]))
sd['PRN']=prn
sdft=sd[1:7][:]
sdft['Type']='FT'
sdpt=sd[8:14][:]
sdpt['Type']='PT'
sd=pd.concat([sdft,sdpt]).dropna(how='all',axis=1)
z['Benchmark type']='benchmark'
sd['Benchmark type']='providercontrib'
z.index.names=['Topic']
sd.index.names=['Topic']
return z,sd
benchmark=pd.DataFrame()
providercontrib=pd.DataFrame()
for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
benchmark_tmp,providercontrib_tmp= get_benchmarks(fn='{}/{}'.format(directory,filename))
benchmark=pd.concat([benchmark,benchmark_tmp])
providercontrib=pd.concat([providercontrib,providercontrib_tmp])
TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-132-06b8eac6d95c> in <module>() 3 4 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]: ----> 5 benchmark_tmp,providercontrib_tmp= get_benchmarks(fn='{}/{}'.format(directory,filename)) 6 benchmark=pd.concat([benchmark,benchmark_tmp]) 7 providercontrib=pd.concat([providercontrib,providercontrib_tmp]) <ipython-input-131-af1240caaa5a> in get_benchmarks(fn) 12 zpt['Type']='PT' 13 ---> 14 sd=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=30,parse_cols=21, header=[0,1])) 15 sd['PRN']=prn 16 sdft=sd[1:7][:] /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds) 198 199 if not isinstance(io, ExcelFile): --> 200 io = ExcelFile(io, engine=engine) 201 202 return io._parse_excel( /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds) 255 self.book = xlrd.open_workbook(file_contents=data) 256 elif isinstance(io, compat.string_types): --> 257 self.book = xlrd.open_workbook(io) 258 else: 259 raise ValueError('Must explicitly set engine if not passing in' /usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading) 546 for event, elem in ET.iterparse(stream): 547 if elem.tag == row_tag: --> 548 self_do_row(elem) 549 elem.clear() # destroy all child elements (cells) 550 elif elem.tag == U_SSML12 + "dimension": /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in do_row(self, row_elem) 663 if explicit_row_number and cell_name[charx:] != row_number: 664 raise Exception('cell name %r but row number is %r' % (cell_name, row_number)) --> 665 xf_index = int(cell_elem.get('s', '0')) 666 cell_type = cell_elem.get('t', 'n') 667 tvalue = None KeyboardInterrupt:
benchmark.head()
All years | Years | Level of study | Age | Disadvantaged | ... | Ethnicity | Disabled | Sex | Domicile | PRN | Type | Benchmark type | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 1_level_1 | 1 | 2 | 3 | First degree | Other\nUG | Young | Mature | Yes | No | ... | BME | Yes | No | Male | Female | UK | Other EU | ||||
Topic | |||||||||||||||||||||
The teaching on my course | 88.3309 | 87.7923 | 88.632 | 88.5606 | 88.4478 | NaN | 88.4035 | 87.3995 | 88.7461 | 89.4115 | ... | 84.7989 | 87.9403 | 88.3826 | 87.3999 | 89.0245 | 89.1227 | 86.6562 | 10007792 | FT | benchmark |
Assessment and feedback | 72.0203 | 71.0119 | 72.1508 | 72.8637 | 71.8199 | NaN | 71.9077 | 73.463 | 73.4956 | 72.1592 | ... | 70.5695 | 70.3514 | 72.2414 | 71.8281 | 72.1635 | 72.1379 | 69.7941 | 10007792 | FT | benchmark |
Academic support | 81.8189 | 81.2489 | 82.1147 | 82.1206 | 81.8246 | NaN | 81.9071 | 80.689 | 82.9026 | 82.2227 | ... | 79.5073 | 80.9901 | 81.9287 | 82.543 | 81.2794 | 82.1206 | 81.6937 | 10007792 | FT | benchmark |
Non-continuation | 3.44304 | 3.37228 | 3.56405 | 3.42482 | 3.39128 | N | 2.83198 | 11.612 | 3.84689 | 2.46721 | ... | 4.03444 | 5.27223 | 3.21838 | 4.05566 | 2.94869 | 0 | 0 | 10007792 | FT | benchmark |
Employment or further study | 93.9778 | 93.6149 | 94.0882 | 94.311 | 93.9752 | R | 94.0575 | 92.7333 | 94.0585 | 94.1474 | ... | 92.4344 | 92.3242 | 94.1863 | 92.4983 | 95.3174 | 0 | 0 | 10007792 | FT | benchmark |
5 rows × 21 columns
providercontrib.head()
All years | Years | Level of study | Age | Disadvantaged | ... | Disabled | Sex | Domicile | PRN | Type | Benchmark type | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 1_level_1 | 1 | 2 | 3 | First Degree | Other UG | Young | Mature | Yes | No | ... | Yes | No | Male | Female | UK | Other EU | Non EU | ||||
Topic | |||||||||||||||||||||
The teaching on my course | 4.46 | 3.81 | 2.65 | 2.2 | 2.57 | 2.85 | 3.39 | 4.27 | 3.3 | 4.26 | ... | 4.21 | 2.32 | 2.16 | 3.77 | 3.25 | 3.52 | 4.69 | nan | FT | providercontrib |
Assessment and feedback | 3.06 | 4.4 | 4.88 | 4.75 | 4.33 | 2.64 | 2.03 | 4.48 | 4.7 | 2.79 | ... | 3.81 | 4.34 | 4.74 | 2.05 | 4.1 | 4.52 | 4.24 | nan | FT | providercontrib |
Academic support | 2.98 | 3.48 | 3.71 | 3.09 | 2.54 | 3.36 | 2.75 | 2.58 | 4.17 | 2.74 | ... | 3.14 | 3.19 | 2.19 | 3.71 | 3.07 | 3.08 | 2.03 | nan | FT | providercontrib |
Non-continuation | 4.91 | 0 | 2.69 | 4.89 | 2.1 | 4.86 | 4.07 | 3.02 | 2.08 | 3.35 | ... | 2.29 | 2.5 | 3.71 | 2.61 | NaN | NaN | NaN | nan | FT | providercontrib |
Employment or further study | 2.3 | 0 | 3.42 | 2.69 | 3.92 | 3.11 | 2.02 | 4.35 | 2.53 | 4.36 | ... | 4.91 | 3.1 | 3.01 | 3.60 | NaN | NaN | NaN | nan | FT | providercontrib |
5 rows × 22 columns
benchmark_long=longify(benchmark,id_vars=['Topic','PRN','Benchmark type'],valname='value')
benchmark_long.head()
Topic | PRN | Benchmark type | Heading | Subheading | value | |
---|---|---|---|---|---|---|
0 | The teaching on my course | nan | benchmark | All years | Unnamed: 1_level_1 | 85.737368 |
1 | Assessment and feedback | nan | benchmark | All years | Unnamed: 1_level_1 | 71.064694 |
2 | Academic support | nan | benchmark | All years | Unnamed: 1_level_1 | 80.825662 |
3 | Non-continuation | nan | benchmark | All years | Unnamed: 1_level_1 | 6.100000 |
4 | Employment or further study | nan | benchmark | All years | Unnamed: 1_level_1 | 86.800000 |
providercontrib_long=longify(providercontrib,id_vars=['Topic','PRN','Benchmark type'],valname='value')
providercontrib_long.head()
Topic | PRN | Benchmark type | Heading | Subheading | value | |
---|---|---|---|---|---|---|
0 | The teaching on my course | nan | providercontrib | All years | Unnamed: 1_level_1 | 4.46 |
1 | Assessment and feedback | nan | providercontrib | All years | Unnamed: 1_level_1 | 3.06 |
2 | Academic support | nan | providercontrib | All years | Unnamed: 1_level_1 | 2.98 |
3 | Non-continuation | nan | providercontrib | All years | Unnamed: 1_level_1 | 4.91 |
4 | Employment or further study | nan | providercontrib | All years | Unnamed: 1_level_1 | 2.30 |
benchmarkprovider=pd.concat([benchmark_long,providercontrib_long])
benchmarkprovider.head()
Topic | PRN | Benchmark type | Heading | Subheading | value | |
---|---|---|---|---|---|---|
0 | The teaching on my course | nan | benchmark | All years | Unnamed: 1_level_1 | 85.737368 |
1 | Assessment and feedback | nan | benchmark | All years | Unnamed: 1_level_1 | 71.064694 |
2 | Academic support | nan | benchmark | All years | Unnamed: 1_level_1 | 80.825662 |
3 | Non-continuation | nan | benchmark | All years | Unnamed: 1_level_1 | 6.100000 |
4 | Employment or further study | nan | benchmark | All years | Unnamed: 1_level_1 | 86.800000 |
benchmarkprovider_wide=benchmarkprovider.pivot_table(index=['Topic','PRN','Heading','Subheading'],
columns='Benchmark type',
values='value').reset_index()
benchmarkprovider_wide.head()
Benchmark type | Topic | PRN | Heading | Subheading | benchmark | providercontrib |
---|---|---|---|---|---|---|
0 | Academic support | nan | Age | Mature | 76.865662 | 2.58 |
1 | Academic support | nan | Age | Young | 78.745662 | 2.75 |
2 | Academic support | nan | All years | Unnamed: 1_level_1 | 80.825662 | 2.98 |
3 | Academic support | nan | Disabled | No | 76.395662 | 3.19 |
4 | Academic support | nan | Disabled | Yes | 77.925662 | 3.14 |
benchmarkprovider_wide.to_csv('tef_benchmarkprovider_wide.csv', index=False)
#Thic code is actually same as for indicators I think?
def get_differences(fn):
sn='Difference (a-b)'
print(fn)
institution,prn=getPRN(fn,sn)
i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,
header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1))
i.index.names=['Topic']
i['PRN']=prn
ift=i[1:7][:]
ift['Type']='FT'
ipt=i[8:14][:]
ipt['Type']='PT'
i=pd.concat([ift,ipt]).dropna(how='all',axis=1)
i=i[:].reset_index().set_index(['Topic','PRN'])
return i
differences=pd.DataFrame()
for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
differences=pd.concat([differences,get_differences('{}/{}'.format(directory,filename))])
differences.head()#.to_csv('tef_differences.csv')
TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-150-b6079aa62e26> in <module>() 2 3 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]: ----> 4 differences=pd.concat([differences,get_differences('{}/{}'.format(directory,filename))]) 5 6 differences.head()#.to_csv('tef_differences.csv') <ipython-input-149-a35ae05f22a5> in get_differences(fn) 6 7 i=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22, ----> 8 header=[0,1]).dropna(how='all',axis=0).dropna(how='all',axis=1)) 9 i.index.names=['Topic'] 10 i['PRN']=prn /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds) 198 199 if not isinstance(io, ExcelFile): --> 200 io = ExcelFile(io, engine=engine) 201 202 return io._parse_excel( /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds) 255 self.book = xlrd.open_workbook(file_contents=data) 256 elif isinstance(io, compat.string_types): --> 257 self.book = xlrd.open_workbook(io) 258 else: 259 raise ValueError('Must explicitly set engine if not passing in' /usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading) 544 row_tag = U_SSML12 + "row" 545 self_do_row = self.do_row --> 546 for event, elem in ET.iterparse(stream): 547 if elem.tag == row_tag: 548 self_do_row(elem) /usr/lib/python3.5/xml/etree/ElementTree.py in __next__(self) 1295 try: 1296 while 1: -> 1297 for event in self._parser.read_events(): 1298 return event 1299 if self._parser._parser is None: KeyboardInterrupt:
differences.head(20)
All years | Years | Level of study | Age | Disadvantaged | Ethnicity | Disabled | Sex | Domicile | Type | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 1_level_1 | 1 | 2 | 3 | First degree | Other\nUG | Young | Mature | Yes | No | White | BME | Yes | No | Male | Female | UK | Other EU | Non EU | |||
Topic | PRN | ||||||||||||||||||||
The teaching on my course | 10007792 | 2.34886 | 3.04161 | 1.22077 | 2.67949 | 2.23199 | NaN | 2.35867 | 2.22319 | 2.87256 | 2.12792 | 2.60783 | 1.95624 | 3.3292 | 2.21903 | 1.62661 | 2.88701 | 2.45772 | 5.55582 | -0.547378 | FT |
Assessment and feedback | 10007792 | 3.4967 | 5.46856 | 2.88462 | 2.15646 | 3.69709 | NaN | 3.55698 | 2.72407 | 1.44683 | 3.38994 | 3.4811 | 3.16111 | 3.69043 | 3.47104 | 2.26725 | 4.41311 | 3.32879 | 5.40222 | 1.57468 | FT |
Academic support | 10007792 | 2.07944 | 3.17045 | 1.8212 | 1.2551 | 2.07377 | NaN | 2.10283 | 1.77954 | 1.31398 | 1.76657 | 2.04698 | 2.38284 | 3.38082 | 1.90708 | 1.69626 | 2.36494 | 1.83747 | 3.99566 | 1.23495 | FT |
Non-continuation (b-a) | 10007792 | 1.16474 | 1.24943 | 1.16828 | 1.10632 | 1.11174 | N | 1.07434 | 2.37925 | 0.513553 | 0.939633 | 1.07013 | 1.42161 | 1.2845 | 1.16958 | 1.60604 | 0.832465 | 0 | 0 | 0 | FT |
Employment or further study | 10007792 | 0.521486 | 0.286413 | 1.12521 | 0.0209712 | 0.548394 | R | 0.534478 | 0.318813 | 0.179146 | 0.506285 | 0.436058 | 1.50502 | 1.45517 | 0.417999 | 0.927354 | 0.153997 | 0 | 0 | 0 | FT |
Highly skilled employment or further study | 10007792 | 4.00179 | 4.95353 | 3.77736 | 3.01536 | 4.04543 | R | 4.03955 | 3.41268 | 1.62383 | 3.62774 | 3.97737 | 4.89422 | 3.35868 | 4.09564 | 3.6354 | 4.33353 | 0 | 0 | 0 | FT |
The teaching on my course | 10007792 | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | N | R | N | SUP | SUP | SUP | SUP | R | N | PT |
Assessment and feedback | 10007792 | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | N | R | N | SUP | SUP | SUP | SUP | R | N | PT |
Academic support | 10007792 | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | N | R | N | SUP | SUP | SUP | SUP | R | N | PT |
Non-continuation (b-a) | 10007792 | 16.2163 | N | N | N | 16.2163 | 0 | N | DP | NaN | N | DP | N | N | DP | N | DP | 0 | 0 | 0 | PT |
Employment or further study | 10007792 | -0.722503 | N | DP | DP | DP | SUP | N | DP | R | N | DP | NaN | N | DP | DP | DP | 0 | 0 | 0 | PT |
Highly skilled employment or further study | 10007792 | 19.8536 | N | DP | DP | 27.152 | SUP | N | DP | R | N | DP | NaN | N | DP | DP | DP | 0 | 0 | 0 | PT |
differences_long=longify(differences,'value')
differences_long.head()
Topic | PRN | Heading | Subheading | value | |
---|---|---|---|---|---|
0 | The teaching on my course | nan | All years | Unnamed: 1_level_1 | 1.462632 |
1 | Assessment and feedback | nan | All years | Unnamed: 1_level_1 | 2.235306 |
2 | Academic support | nan | All years | Unnamed: 1_level_1 | 1.164338 |
3 | Non-continuation (b-a) | nan | All years | Unnamed: 1_level_1 | 0.782399 |
4 | Employment or further study | nan | All years | Unnamed: 1_level_1 | 5.400000 |
differences_long.to_csv('tef_differences_long.csv', index=False)
These relate to the indicators.
def get_zscores(fn):
sn='Z-score'
print(fn)
institution,prn=getPRN(fn,sn)
z=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,header=[0,1]))
z['PRN']=prn
zft=z[1:7][:]
zft['Type']='FT'
zpt=z[8:14][:]
zpt['Type']='PT'
sd=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=30,parse_cols=22, header=[0,1]))
sd['PRN']=prn
sdft=sd[1:7][:]
sdft['Type']='FT'
sdpt=sd[8:14][:]
sdpt['Type']='PT'
z=pd.concat([zft,zpt]).dropna(how='all',axis=1)
sd=pd.concat([sdft,sdpt]).dropna(how='all',axis=1)
z['Error type']='z'
sd['Error type']='sd'
z.index.names=['Topic']
sd.index.names=['Topic']
return z,sd
#fn='tef/TEST_TEF_Metrics_workbook_exemplar.xlsx'
#z,sd= get_zscores(fn)
z=pd.DataFrame()
sd=pd.DataFrame()
for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
z_tmp,sd_tmp= get_zscores(fn='{}/{}'.format(directory,filename))
z=pd.concat([z,z_tmp])
sd=pd.concat([sd,sd_tmp])
TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-165-a2cc9a145e05> in <module>() 36 37 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]: ---> 38 z_tmp,sd_tmp= get_zscores(fn='{}/{}'.format(directory,filename)) 39 z=pd.concat([z,z_tmp]) 40 sd=pd.concat([sd,sd_tmp]) <ipython-input-165-a2cc9a145e05> in get_zscores(fn) 4 institution,prn=getPRN(fn,sn) 5 ----> 6 z=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=8,parse_cols=22,header=[0,1])) 7 z['PRN']=prn 8 zft=z[1:7][:] /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds) 198 199 if not isinstance(io, ExcelFile): --> 200 io = ExcelFile(io, engine=engine) 201 202 return io._parse_excel( /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds) 255 self.book = xlrd.open_workbook(file_contents=data) 256 elif isinstance(io, compat.string_types): --> 257 self.book = xlrd.open_workbook(io) 258 else: 259 raise ValueError('Must explicitly set engine if not passing in' /usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading) 544 row_tag = U_SSML12 + "row" 545 self_do_row = self.do_row --> 546 for event, elem in ET.iterparse(stream): 547 if elem.tag == row_tag: 548 self_do_row(elem) /usr/lib/python3.5/xml/etree/ElementTree.py in __next__(self) 1295 try: 1296 while 1: -> 1297 for event in self._parser.read_events(): 1298 return event 1299 if self._parser._parser is None: /usr/lib/python3.5/xml/etree/ElementTree.py in read_events(self) 1263 index = self._index 1264 try: -> 1265 event = events[self._index] 1266 # Avoid retaining references to past events 1267 events[self._index] = None KeyboardInterrupt:
z.head(20)
All years | Years | Level of study | Age | Disadvantaged | ... | Disabled | Sex | Domicile | PRN | Type | Error type | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 1_level_1 | 1 | 2 | 3 | First degree | Other\nUG | Young | Mature | Yes | No | ... | Yes | No | Male | Female | UK | Other EU | Non EU | ||||
Topic | |||||||||||||||||||||
The teaching on my course | 7.10342 | 5.3252 | 2.03351 | 4.94598 | 6.77154 | NaN | 6.85589 | 1.80743 | 2.75972 | 5.39015 | ... | 3.45701 | 6.30257 | 3.06317 | 6.87317 | 6.85111 | 3.56633 | -0.569411 | 10007792 | FT | z |
Assessment and feedback | 7.6247 | 6.93072 | 3.48961 | 2.83967 | 8.04842 | NaN | 7.43428 | 1.639 | 0.989925 | 6.05305 | ... | 2.73797 | 7.11754 | 3.19442 | 7.34698 | 6.53844 | 2.50602 | 1.34462 | 10007792 | FT | z |
Academic support | 5.27047 | 4.67032 | 2.57395 | 1.91782 | 5.25888 | NaN | 5.13993 | 1.19991 | 1.05724 | 3.68713 | ... | 2.94066 | 4.53997 | 2.8402 | 4.50317 | 4.20403 | 2.22035 | 1.19656 | 10007792 | FT | z |
Non-continuation | 5.74893 | 3.6081 | 3.10899 | 3.35249 | 5.70975 | N | 5.63506 | 2.20931 | 0.824919 | 4.90464 | ... | 1.97203 | 5.51903 | 5.08287 | 3.19744 | 0 | 0 | 0 | 10007792 | FT | z |
Employment or further study | 1.84464 | 0.546067 | 2.44282 | 0.0442631 | 1.91896 | R | 1.84295 | 0.267295 | 0.222057 | 1.63644 | ... | 1.69252 | 1.4045 | 2.04653 | 0.436756 | 0 | 0 | 0 | 10007792 | FT | z |
Highly skilled employment or further study | 7.86696 | 5.49485 | 4.41337 | 3.52352 | 7.89039 | R | 7.59202 | 1.84611 | 1.07268 | 6.43625 | ... | 2.38469 | 7.50914 | 4.99549 | 6.1014 | 0 | 0 | 0 | 10007792 | FT | z |
The teaching on my course | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | ... | N | SUP | SUP | SUP | SUP | R | N | 10007792 | PT | z |
Assessment and feedback | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | ... | N | SUP | SUP | SUP | SUP | R | N | 10007792 | PT | z |
Academic support | SUP | SUP | SUP | N | N | SUP | N | SUP | N | N | ... | N | SUP | SUP | SUP | SUP | R | N | 10007792 | PT | z |
Non-continuation | 1.67569 | N | N | N | 1.67569 | 0 | N | DP | NaN | N | ... | N | DP | N | DP | 0 | 0 | 0 | 10007792 | PT | z |
Employment or further study | -0.266142 | N | DP | DP | DP | SUP | N | DP | R | N | ... | N | DP | DP | DP | 0 | 0 | 0 | 10007792 | PT | z |
Highly skilled employment or further study | 4.05138 | N | DP | DP | 3.70309 | SUP | N | DP | R | N | ... | N | DP | DP | DP | 0 | 0 | 0 | 10007792 | PT | z |
12 rows × 22 columns
sd.head()
All years | Years | Level of study | Age | Disadvantaged | ... | Sex | Welsh medium | Domicile | PRN | Type | Error type | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 1_level_1 | 1 | 2 | 3 | First Degree | Other UG | Young | Mature | Yes | No | ... | Male | Female | Yes | No | UK | Other EU | Non EU | ||||
Topic | |||||||||||||||||||||
The teaching on my course | 1.21886 | 4.44397 | 0.100994 | 6.0307 | 3.47268 | 1.08085 | 1.74223 | 3.14273 | 3.84986 | 0.287595 | ... | 4.33997 | 1.712574 | 0 | 0 | 0.906851 | 7.05848 | 0.907585 | dfdf | FT | sd |
Assessment and feedback | 0.620918 | 0.505516 | 1.22337 | 1.21236 | 0.0956798 | 0.754485 | 0.522697 | 1.45514 | 0.0966519 | 0.849326 | ... | 0.347848 | 1.653479 | 0 | 0 | 0.175016 | 1.16463 | 0.810035 | dfdf | FT | sd |
Academic support | 0.415835 | 0.926926 | 0.040144 | 1.17222 | 1.605 | 0.671054 | 1.70013 | 0.940759 | 2.50489 | 0.570067 | ... | 1.61696 | 1.058166 | 0 | 0 | 0.360268 | 2.53371 | 0.0795091 | dfdf | FT | sd |
Non-continuation | 0.325999 | 0.3527 | 0.90899 | 0.330694 | 1.65284 | 1.21958 | 1.01048 | 0.251151 | 1.60483 | 0.491675 | ... | 1.17956 | 0.547368 | 0 | 0 | NaN | NaN | NaN | dfdf | FT | sd |
Employment or further study | 0.642857 | 0.900592 | 0.0382883 | 0.57619 | 0.151839 | 0.408096 | 0.293578 | 0.911861 | 1.25866 | 0.0658683 | ... | 0.413673 | 0.759783 | 0 | 0 | NaN | NaN | NaN | dfdf | FT | sd |
5 rows × 24 columns
sd_long=longify(sd,'sd value')
sd_long.head()
Topic | PRN | Heading | Subheading | sd value | |
---|---|---|---|---|---|
0 | The teaching on my course | dfdf | All years | Unnamed: 1_level_1 | 1.218860 |
1 | Assessment and feedback | dfdf | All years | Unnamed: 1_level_1 | 0.620918 |
2 | Academic support | dfdf | All years | Unnamed: 1_level_1 | 0.415835 |
3 | Non-continuation | dfdf | All years | Unnamed: 1_level_1 | 0.325999 |
4 | Employment or further study | dfdf | All years | Unnamed: 1_level_1 | 0.642857 |
z_long=longify(z,'z value')
z_long.head()
Topic | PRN | Heading | Subheading | z value | |
---|---|---|---|---|---|
0 | The teaching on my course | dfdf | All years | Unnamed: 1_level_1 | 1.2 |
1 | Assessment and feedback | dfdf | All years | Unnamed: 1_level_1 | 3.6 |
2 | Academic support | dfdf | All years | Unnamed: 1_level_1 | -2.8 |
3 | Non-continuation | dfdf | All years | Unnamed: 1_level_1 | 2.4 |
4 | Employment or further study | dfdf | All years | Unnamed: 1_level_1 | 8.4 |
indicators_long.head()
Topic | PRN | Heading | Subheading | value | |
---|---|---|---|---|---|
0 | The teaching on my course | dfdf | All years | Unnamed: 1_level_1 | 87.2 |
1 | Assessment and feedback | dfdf | All years | Unnamed: 1_level_1 | 73.3 |
2 | Academic support | dfdf | All years | Unnamed: 1_level_1 | 81.99 |
3 | Non-continuation | dfdf | All years | Unnamed: 1_level_1 | 5.3176 |
4 | Employment or further study | dfdf | All years | Unnamed: 1_level_1 | 92.2 |
ifull_long=indicators_long.merge( sd_long, on=['Topic','PRN','Heading','Subheading'])
ifull_long=ifull_long.merge( z_long, on=['Topic','PRN','Heading','Subheading'])
ifull_long.head()
Topic | PRN | Heading | Subheading | value | sd value | z value | |
---|---|---|---|---|---|---|---|
0 | The teaching on my course | dfdf | All years | Unnamed: 1_level_1 | 87.2 | 1.21886 | 1.2 |
1 | Assessment and feedback | dfdf | All years | Unnamed: 1_level_1 | 73.3 | 0.620918 | 3.6 |
2 | Academic support | dfdf | All years | Unnamed: 1_level_1 | 81.99 | 0.415835 | -2.8 |
3 | Non-continuation | dfdf | All years | Unnamed: 1_level_1 | 5.3176 | 0.325999 | 2.4 |
4 | Non-continuation | dfdf | All years | Unnamed: 1_level_1 | 5.3176 | 0.325999 | 1.4 |
ifull_long.to_csv('tef_indicators_with_error_long.csv', index=False)
def get_ethnicity(fn):
sn='BME breakdown'
print(fn)
institution,prn=getPRN(fn,sn)
e=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=9,parse_cols=5))
e.columns=['Category','NULL','BME','Black','Asian','Other']
e['PRN']=prn
ft=e[:6].dropna(how='all',axis=1)
ft['Type']='FT'
pt=e[7:13].dropna(how='all',axis=1)#.dropna(how='all',axis=1))[:-1]
pt['Type']='PT'
return pd.concat([ft,pt])
#fn='tef/TEF_Metrics_workbook_exemplar.xlsx'
#get_ethnicity(fn)
ethnicity=pd.DataFrame()
for filename in [f for f in os.listdir(directory) if not f.startswith('~')]:
ethnicity=pd.concat([ethnicity,get_ethnicity('{}/{}'.format(directory,filename))])
ethnicity.to_csv('tef_ethnicity.csv',index=False)
TEFYearTwo_AllMetrics/10007792_University of Exeter_Metrics.xlsx TEFYearTwo_AllMetrics/10007795_The University of Leeds_Metrics.xlsx
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-175-69c6fa652c51> in <module>() 5 6 for filename in [f for f in os.listdir(directory) if not f.startswith('~')]: ----> 7 ethnicity=pd.concat([ethnicity,get_ethnicity('{}/{}'.format(directory,filename))]) 8 9 ethnicity.to_csv('tef_ethnicity.csv',index=False) <ipython-input-82-2790a51d3336> in get_ethnicity(fn) 4 institution,prn=getPRN(fn,sn) 5 ----> 6 e=cleanCols(pd.read_excel(fn, sheetname=sn, skiprows=9,parse_cols=5)) 7 e.columns=['Category','NULL','BME','Black','Asian','Other'] 8 e['PRN']=prn /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds) 198 199 if not isinstance(io, ExcelFile): --> 200 io = ExcelFile(io, engine=engine) 201 202 return io._parse_excel( /usr/local/lib/python3.5/dist-packages/pandas/io/excel.py in __init__(self, io, **kwds) 255 self.book = xlrd.open_workbook(file_contents=data) 256 elif isinstance(io, compat.string_types): --> 257 self.book = xlrd.open_workbook(io) 258 else: 259 raise ValueError('Must explicitly set engine if not passing in' /usr/local/lib/python3.5/dist-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows) 420 formatting_info=formatting_info, 421 on_demand=on_demand, --> 422 ragged_rows=ragged_rows, 423 ) 424 return bk /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows) 831 x12sheet = X12Sheet(sheet, logfile, verbosity) 832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname) --> 833 x12sheet.process_stream(zflo, heading) 834 del zflo 835 /usr/local/lib/python3.5/dist-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading) 550 elif elem.tag == U_SSML12 + "dimension": 551 self.do_dimension(elem) --> 552 elif elem.tag == U_SSML12 + "mergeCell": 553 self.do_merge_cell(elem) 554 self.finish_off() KeyboardInterrupt:
!head tef_ethnicity.csv
Category,BME,Black,Asian,Other,PRN,Type The teaching on my course,,,-,-,nan,FT Assessment and feedback ,++,++,,+,nan,FT Academic support,,-,,-,nan,FT Non-continuation,,++,-,-,nan,FT Employment or further study ,++,,+,+,nan,FT Highly skilled employment,++,++,,++,nan,FT The teaching on my course ,,+,- ,,nan,PT Assessment and feedback,,- ,-,,nan,PT Academic support ,,+,,,nan,PT
ethnicity.head()
Asian | BME | Black | Category | Other | PRN | Type | |
---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | The teaching on my course | NaN | 10007792 | FT |
1 | NaN | + | NaN | Assessment and feedback | NaN | 10007792 | FT |
2 | + | + | NaN | Academic support | NaN | 10007792 | FT |
3 | NaN | NaN | NaN | Non-continuation | NaN | 10007792 | FT |
4 | NaN | NaN | NaN | Employment or further study | NaN | 10007792 | FT |