#!/usr/bin/env python
# coding: utf-8

# In[152]:

get_ipython().run_cell_magic('HTML', '', ' \n')

# In[2]:

from IPython.display import Image
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from matplotlib import pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
graph_figsize = (10, 6)  # I'm forgetful and lazy
plt.rcParams.update({'figure.figsize': graph_figsize})
get_ipython().run_line_magic('matplotlib', 'nbagg')  # Jupyter Magics!

import geopandas as gp
import urllib.request
import zipfile
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns

import moviepy.editor as mpy
import moviepy.video as mpyv

import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from datetime import datetime
import cufflinks

init_notebook_mode(connected=False)
pd.Series([0, 1, 2, 3, 4, 5]).iplot()

# # Idiots Guide to (Open) Data Science
#
# Andrew Bolster
# * [bolster.online](https://bolster.online)
# * Tweets [@bolster](https://twitter.bolster.online)
# * GitHubs at [andrewbolster](https://github.com/andrewbolster)
# * Machine Learning Team Lead at [WhiteHatSec](https://www.whitehatsec.com)
# * **We're hiring** (Placements also available)
# * Plays at [Farset Labs](https://www.farsetlabs.org.uk)
# * **[THIS NOTEBOOK IS AT present.bolster.online](http://present.bolster.online)**
# * [Also available in source at presentgh.bolster.online](http://presentgh.bolster.online)

# ## What is Data Science?
# * "Deriving actionable business and operational insights from multi-modal data sources"
# * AKA: "Turning Numbers into Other Numbers, and occasionally pretty graphs"
#
# ![](img/decline.png)

# ## Data Science in Northern Ireland
# ![](img/dozens.gif)

# ### Vibrant Corporate Ecosystem
#
# * [Analytics Engines](http://www.analyticsengines.com/)
# * [AquaQ Analytics](https://www.aquaq.co.uk/)
# * [**WhiteHat Security**😜](https://www.whitehatsec.com)
# * [Kainos](https://www.kainos.com/)
# * [Flexera|BDNA](https://www.bdna.com/category/flexera/)
# * [First Derivatives](https://www.firstderivatives.com/)
# * [Neueda](http://www.neueda.com/)
# * [Sensum](https://sensum.co/)
# * [BrainwaveBank](https://www.brainwavebank.com/)
# * Many Many _Many_ More

# ### Open Source / Meetup Ecosystem too
# ***Hint: If you really want to learn data science, go to/get involved in some of these***
#
# * [IoT Belfast](https://www.meetup.com/IOTBelfast/)
# * [IoT Alliance](https://www.meetup.com/Belfast-New-Technology-Meetup/)
# * [PyBelfast](https://www.meetup.com/PyBelfast/)
# * [Code Co-op NI](https://www.meetup.com/CodeCoop-NI/)
# * [Data Art Belfast](https://www.meetup.com/Data-Art-Belfast/)
# * [Big Data Belfast Breakout](https://www.meetup.com/Big-Data-Belfast-Breakout/)
# * [Women Who Code Belfast](https://www.meetup.com/Women-Who-Code-Belfast/)
# * [Docker Belfast](https://www.meetup.com/Docker-Belfast/)
# * [DevOps Belfast](https://www.meetup.com/DevOps-Belfast/)
# * [DevSecOps Belfast](https://www.meetup.com/devsecops-belfast/)
#
# ***Hint the Second: These are the _best_ ways to get a job in the field***

# ### Engaged local Government/ComVol support
#
# * [OpenData NI](https://www.opendatani.gov.uk/)
# * [DetailData](http://data.nicva.org/)
# * [Code4Good](https://wedonthaveasiteyet.lol)
# * [DoF Open Data Advisory Board]()

# ### NI Data Science on the world stage
#
# * [Big Data Belfast Conference](http://www.bigdatabelfast.com/)
# * [ODCamp](http://odcamp.co.uk/)
# * [TIMON Hackathon](https://www.eventbrite.co.uk/e/timon-open-transport-hackathon-tickets-40149226417#)

# # What is Open Data?
#
# > Open data and content can be freely used, modified, and shared by anyone for any purpose - [The Open Definition](https://opendefinition.org/)
#
# > Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. - [Wikipedia](https://en.wikipedia.org/wiki/Open_data)
#
# > Open data is data that's available to everyone to access, use and share. Yep, even your nan. - [The Open Data Institute](https://theodi.org/article/what-is-open-data-and-why-should-we-care/)

# ## Key Principles of Open Data
#
# * Open
# * Accessible
# * Available
# * Understandable
# * Traceable

# ## What's the point?
#
# (The following slides are stolen shamelessly, but with attribution, from [The ODI](https://theodi.org/article/what-is-open-data-and-why-should-we-care/), because they're awesome.)
#
# ![](img/odi_perm.png)

# ### Open data should be easy to access. Especially for your nan.
#
# > Open data is only useful if it's shared in ways that people can actually understand. **It needs to be shared in a standardised format and easily traced back to where it came from.**
#
# ![](img/homer.webp)

# ### Open data isn't the same as big data, but big data can be open data too
#
# > When people talk about 'big data' they mean a lot of data. Obviously. But 'big data', like government statistics on health care, is often closed to the public (or expensive to access, like the terrifyingly-named TWITTER FIREHOSE). **Opening up big data lets people use it to spot trends, fill gaps and improve services.**
#
# ![](img/data.jpg)

# ### Oh, and it's also not the same as 'shared data'
#
# > If you're worried about big companies being fed all your private details, that's got nothing to do with open data. Groups sharing information with each other is different from opening it up for all to access. **Your private data should only be open if you choose to share it.** (But if you want to know who's accessing or sharing your data, open data can help.)
#
# ![](img/rugrats.gif)

# ### Open data is good for democracy
#
# > If citizens know about their governments they can hold leaders to account, make more informed decisions and demand better services. **Open data can also help governments stay on their toes and make better policies for society, the economy and the environment.**
#
# ![](img/vote.webp)

# ### Open data can help fight crime
#
# > It's helped people in London to track [stolen bikes](https://www.nesta.org.uk/blog/check-that-bike-wins-crime-justice-open-data-challenge/) and police in [Vancouver](http://www.cbc.ca/bc/features/vancouver-property-crime-map/) to stay one step ahead of criminals.
#
# ![](img/superman.webp)

# ### Open data is good for your health
#
# > Tools like [FoodTradeMenu](https://www.socialtech.org.uk/projects/food-trade-menu/) use it to help restaurants make sure they don't serve you food you're allergic to without realising.
#
# ![](img/sneeze.webp)

# ### Open data can save lives
#
# > It helps groups to [coordinate aid delivery in humanitarian disasters](https://www.sciencedirect.com/science/article/pii/S0740624X13000737).
#
# ![](img/aid.jpg)

# ### Open data helps you get around your city, and saves you money
#
# > Apps like [CityMapper](https://citymapper.com) use open data from groups like Transport for London to help you find the quickest and cheapest way to get from A to B. Even maps can be open, like OpenStreetMap, which powers map data for websites and humanitarian crisis relief around the world.
#
# ![](img/dr.webp)
#
# __No, NI isn't on CityMapper, woooooooo__
# ### Open data is actually really pretty
#
# > [Look](https://nats.aero/blog/2014/03/europe-24-air-traffic-data-visualisation/) what happened when NATS used UK radar and European flight plan data to visualise air traffic on a typical summer's day in 2013.
#
# ![https://nats.aero/blog/2014/03/europe-24-air-traffic-data-visualisation/](img/nats.jpg)

# ## Scales of Data Openness
#
# This is a complex area; if you care, google for ["5 star data"](http://5stardata.info/en/)
#
# TL;DR
# ![](img/stars.png)

# ### And how shit is NI? Cus we're shit at everything right?
#
# ![https://index.okfn.org/place/nir/](img/nidata.png)
#
# ### NOTBAD.JPG
#
# ![](img/notbad.png)

# # What is Data Driven Journalism?
#
# > 'Data journalism' only differs from 'words journalism' in that we use a different kit. We all sniff out, report, and relate stories for a living. It's like 'photo journalism'; just swap the camera for a laptop. - Brian Boyer, Chicago Tribune
#
# > [...] providing information and analysis to help inform us all about important issues of the day. - Aron Pilhofer, New York Times
#
# > Using Open Data and Data Science principles to ask, analyse and answer complex or contentious questions or areas of inquiry using available evidence. - Bolster, Here

# ## What do you want to know?
#
# This is usually the wrong question;

# ## What do you want to know *more about*?
# ![](img/google-niedu.png)

# ## What relevant data is available?
# ![](img/opendata-fp.png)
# ![](img/opendata-places.png)
# ![](img/opendata-pp.png)

# ## So what do we *think* we have at the moment?
#
# * List of schools with at least council/constituency locations and postcodes
# * Enrolments per school for 2016/17
# * Demographic comparisons (R/FSM/SEN 0+ or 5)
# * A 'churn' rate (newcomers/movers/immigration)

# ## What else could we do with?
#
# * More years, ideally in the same format
# * _Which schools have fewer places per thousand of council population?_
# * Correlations to deprivation
# * Correlations to monoculture/political alignment
# * Correlations to population *change*
# * Correlations to achievement/mobility
# * Correlations to claimant count etc.

# I *think* we can get most of these... but first things first

# # Extraction, Transformation and Loading
#
# * Get the data
# * Clean the data
# * Store the data

# Gov Open Data is notoriously difficult to 'wander around'

# ## Extraction
# When in doubt, bring in `pandas`
# ![](img/panda.png)

# ### Easy mode: `read_csv`

# In[3]:

import pandas as pd

url_from_odni = "https://www.opendatani.gov.uk/dataset/6058be29-b2e1-4253-bab8-8a018568560a/resource/b4fcfed1-2dc1-4f61-8968-573535522d53/download/school-level-post-primary-reference-data.csv"
pd.read_csv(url_from_odni)

# ### Hard mode: APIs

# In[4]:

from ckanapi import RemoteCKAN  # it's on pip

def build_odni_connection():
    """Be nice to OpenDataNI and tell them how old I am. (And that it's me)"""
    version_no = (pd.to_datetime('now') - pd.to_datetime('1988/05/17')).days / 365
    ua = f'@Bolster/{version_no:.2f} (+http://bolster.online/)'
    return RemoteCKAN('https://www.opendatani.gov.uk/', user_agent=ua)

# In[5]:

odni = build_odni_connection()
odni.action.package_list()

# In[6]:

resource = 'school-census-post-primary'
dfs = {}
for dataset in odni.action.package_show(id=resource)['resources']:
    df = pd.read_csv(dataset['url'], index_col=False)
    dfs[dataset['name']] = df
print(list(dfs.keys()))

# In[7]:

dataset

# ### Those who ignore history...
# There's a problem; we want to know about change over time.
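# If OpenDataNI published one file per year in the same schema, getting that time dimension would be a
# one-liner: concatenate the per-year frames with a year key. A minimal sketch of that happy path, using
# made-up toy frames rather than the real data:

# In[ ]:

toy_years = {
    2016: pd.DataFrame({'total pupils': [500, 750]}, index=[101, 102]),
    2017: pd.DataFrame({'total pupils': [520, 740]}, index=[101, 102]),
}
pd.concat(toy_years, names=['year', 'de ref'])  # one (year, school) MultiIndex: change over time for free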
#
# ![](img/doe_data_list.png)
#
# [Dept of Ed have this data 2009-18 on their website in Excel format](https://www.education-ni.gov.uk/articles/school-enrolments-school-level-data)
#
# OpenDataNI *do not* 😭

# ### We must deeeeel weeeth eet
#
# ![](img/fine.jpg)

# In[8]:

test_data = "https://www.education-ni.gov.uk/sites/default/files/publications/education/School%20level%20-%20post%20primary%20schools%20data%20supp%201718.xls"
pd.read_excel(test_data).head()

# ### Common Data Extraction Hellscapes
#
# * Inconsistent naming conventions
# * Inconsistent format conventions
# * Inconsistent schema conventions
# * Inconsistent lexical conventions
# * Inconsistent *policy changes*
# * Shifting *definitions/classifications*
# * Lack of meta-linkage during all of the above changes

# ### Spot the difference...
#
# * `School%20level%20-%20post%20primary%20schools%20data%20supp%201718.xls`
# * `School%20level%20-%20post%20primary%20schools%20data%201617%20supp_0.XLS`
# * order of fields (`supp 1718` vs `201617 supp_0`)
# * extra junk (`supp` vs `supp_0`)
# * inconsistent extensions (`xls` vs `XLS`)

# ### Web Crawling to the rescue!
#
# ![](img/doe_data_list.png)

# In[9]:

from bs4 import BeautifulSoup
import requests

base_url = "https://www.education-ni.gov.uk"
listing_path = "/articles/school-enrolments-school-level-data"

soup = BeautifulSoup(requests.get(base_url + listing_path).text, 'lxml')
for link in soup.find_all('a'):
    if 'School enrolments - school level data 20' in ' '.join(map(str, link.contents)):
        print(link)

# ![](img/yey_laura.webp)

# ### Then for each of those, get the href associated with the year

# In[11]:

soup = BeautifulSoup(requests.get(base_url + listing_path).text, 'lxml')
year_urls = {}
for link in soup.find_all('a'):
    contents = ' '.join(map(str, link.contents))
    if 'School enrolments - school level data 20' in contents:
        # year = link.get('href')[-6:-2]
        year = link.get_text()[-7:-3]
        year_urls[int(year)] = link.get('href')
year_urls

# ### Then walk *those* pages and grab the Excel files...
# Using `urllib.request.urlretrieve` to store them on the local filesystem

# In[13]:

import urllib.request
from pathlib import Path
from collections import defaultdict

year_files = defaultdict(list)
dest = Path('./data/education-ni/')

for year, path in year_urls.items():
    dest.joinpath(str(year)).mkdir(parents=True, exist_ok=True)
    if path.startswith('http'):  # new docs use real references
        soup = BeautifulSoup(requests.get(path).text, 'lxml')
    else:
        soup = BeautifulSoup(requests.get(base_url + path).text, 'lxml')
    for link in soup.find_all('a'):
        href = link.get('href', "")
        filename = href.split('/')[-1]
        contents = ' '.join(map(str, link.contents))
        if 'xls' in href.split('.')[-1].lower():
            # See this requirement right 👆
            # Mix of xls, XLSX, xlsx and XLS
            if not dest.joinpath(str(year), filename).exists():
                urllib.request.urlretrieve(href, dest.joinpath(str(year), filename))
            year_files[year].append(filename)
    print(f"{year}:{len(year_files[year])} docs")

# ## Transformation
#
# We've technically 'Extracted' the data we wanted from the source.
#
# Now: clean and twist the data until it's useful.
#
# This is the most frustrating part of being a data scientist/engineer...
#
# ![](img/frustrated.webp)

# ### Quick inspection of filename 'conventions'

# In[14]:

for year, filenames in year_files.items():
    for file in filenames:
        print(f"{year}:/{file}")

# ### Zero in on just post-primary data

# In[15]:

for year, filenames in year_files.items():
    for file in filenames:
        if 'post' in file:
            print(f"{year}:/{file}")

# ### Don't try and tackle it all at once
#
# Right, let's limit our scope to look at places available to get a better understanding of the data

# In[16]:

for year, filenames in year_files.items():
    print(f"Year: {year}")
    for file in filenames:
        if file.lower().startswith('available'):
            print(file)

# ![](img/available_2018.png)
# ![](img/available_2015.png)

# ### Try and identify subclasses of datasets that may be tackled automagically
#
# Ok, what about actually parsing the excel sheets?
#
# Will *they* be consistent?

# In[17]:

sheets = defaultdict(list)
for year, files in year_files.items():
    for file in files:
        if 'post' in file:
            xls = pd.ExcelFile(f'data/education-ni/{year}/{file}')
            sheets[year].extend(xls.sheet_names)
dict(sheets)

# In[18]:

from collections import Counter

all_sheet_names = Counter([_ for d in sheets.values() for _ in d])
all_sheet_names.most_common()

# In[19]:

df = pd.DataFrame.from_dict({
    year: [sn in sheet_names for sn in all_sheet_names]
    for year, sheet_names in sheets.items()
}, orient='index')
df.columns = all_sheet_names.keys()
df.T

# ### Visualisation isn't just for output data...
# Use visualisation tools to validate your transformation assumptions...
# All. The. Time.

# In[20]:

import seaborn as sns

f, ax = plt.subplots(figsize=graph_figsize)
sns.heatmap(df.T, ax=ax)

# ![](img/ahhfuck.webp)
# ![](img/2000.png)

# ## A few highlights from the hellscape

# ### Best thing about standards is how many there are to choose from...
#
# ![](img/reference_year.png)

# ### Sometimes understanding the data is more painful (and ugly) than expected
# This was the only document that seemed to explain what 'supernumerary' actually means
# ![](img/comic_sans.png)
# It means 'extra'...

# ### Using multiple tools simultaneously can be problematic...
# Excel on Mac helpfully puts a lock entry on open Excel files, which prevents them from being opened by other tools.
#
# Pandas occasionally has unhelpful error messages in this regard
#
# ![](img/geethanks.png)

# ### Sometimes you have to give up on stuff because it's too dangerous an area to walk into.
#
# Special Educational Needs classification is governed by a >100-page policy that appears to change every year.
#
# I am not an expert on education policy, so I'm not sticking my foot in that one without being sure I understand the implications.
#
# It's your responsibility as a Data Scientist to be able to make that call, and to know your limits.
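# Back in the hellscape: heatmaps catch the schema drift you can *see*; cheap assertions catch the drift
# you forget to look for. A rough sketch of the kind of pre-parse check I mean - the helper name and the
# expected sheet list here are mine, not anything the DE publishes:

# In[ ]:

def expect_sheets(xls, expected=('reference data', 'enrolments', 'free school meals')):
    """Warn (rather than crash) when a workbook is missing a sheet we were counting on."""
    missing = [name for name in expected if name not in xls.sheet_names]
    if missing:
        print(f"Missing sheets: {missing} (workbook has: {xls.sheet_names})")
    return not missing

# Run against each year's workbook before the parsing functions below, this turns "which years break?"
# into an explicit report rather than a surprise TypeError halfway through a parse.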
# ### "Machine Readable" # # ![](img/machinereadable.jpg) # ![](img/starman.png) # # # ![](img/footer.png) # In[31]: def join_n_strip_n_lower(l): return ' '.join(l).strip().lower() def strip_n_lower(s): return s.strip().lower() def unyearify(s): return int(s.replace('year ', '')) def parse_reference_table(xls): """ From an ExcelFile, clean up: * School Mgmt Type disaster * Inconsistent header depth * Multi-row header names * inconsistent headers (ref_key_map) * inconsistent col order * inconsistent caps/spacing (strip|lower) """ cols= [ 'de ref', 'school name', 'school type', 'address 1', 'postcode', 'urban_rural', 'school management type', 'district council', 'parliamentary constituency', 'town' ] categories = [ 'school type', 'urban_rural', 'school management type', 'district council', 'parliamentary constituency', 'town' ] ref_key_map={ 'denino':'de ref', 'urban/ rural': 'urban_rural', 'schoolname': 'school name' } reference_value_rename = { 'school management type':{ 'gmi':'integrated', 'controlled integrated':'integrated', 'roman catholic maintained':'rc maintained', 'grant maintained integrated':'integrated', 'voluntary - other managed':'voluntary', 'voluntary - rc managed':'voluntary', 'catholic maintained':'rc maintained' } } if 'reference data' in xls.sheet_names: df = pd.read_excel(xls, 'reference data', header=None) h_range = 2 if isinstance(df.ix[3,0], int) else 3 try: df.columns=df.ix[1:h_range].fillna('').apply(join_n_strip_n_lower, axis=0).values df.rename(columns=ref_key_map, inplace=True) df = df.drop(df.index[0:h_range+1]).reset_index(drop=True) df = df[cols] if hasattr(df['de ref'], 'str'): try: df = df[df['de ref'].str.isdigit().fillna(True)].dropna(how='all', axis=0) except KeyError: pass df['de ref'] = df['de ref'].astype(int) df.set_index('de ref', inplace=True) for c in df: df[c]=df[c].str.lower().str.strip() df.replace(reference_value_rename, inplace=True) for c in categories: df[c].fillna('NA', inplace=True) df[c] = df[c].astype('category') except TypeError as e: print(e) else: df=None return df def parse_enrolments_table(xls): """From an ExcelFile, clean up: * Inconsistent header depth * fucked up nans/nulls all over the place * inconsistent *footer* depth... * Set de ref as join index and drop pointless fields """ if 'enrolments' in xls.sheet_names: df = pd.read_excel(xls, 'enrolments', header=None, skip_footer=5) h_range = 2 if isinstance(df.ix[3, 0], int) else 3 try: df.columns = df.ix[3].fillna('').apply(strip_n_lower).values df = df.drop(df.index[0:h_range+1]).reset_index(drop=True) df.dropna(how='all', inplace=True, axis=0) if hasattr(df['de ref'], 'str'): try: df = df[df['de ref'].str.isdigit().fillna(True)].dropna(how='all', axis=0) except KeyError: pass df['de ref'] = df['de ref'].astype(int) df.drop('schoolname', axis=1, inplace=True) df.drop('total pupils', axis=1, inplace=True) df.set_index('de ref', inplace=True) df.rename(columns=unyearify, inplace=True) df = df.astype(float) except TypeError as e: print(e) else: df = None return df def parse_fsm_table(xls): """From an ExcelFile, clean up: * Inconsistent header depth * fucked up nans/nulls all over the place * inconsistent *footer* depth... 
* Set de ref as join index and drop pointless fields """ join_n_strip_n_lower = lambda l: ' '.join(l).strip().lower() strip_n_lower = lambda s: s.strip().lower() unyearify = lambda s: int(s.replace('year ','')) if 'free school meals' in xls.sheet_names: df = pd.read_excel(xls, 'free school meals', header=None, skip_footer=5) h_range = 2 if isinstance(df.ix[3,0], int) else 3 try: df.columns=df.ix[3].fillna('').apply(strip_n_lower).values df = df.drop(df.index[0:h_range+1]).reset_index(drop=True) df.dropna(how='all', inplace=True, axis=0) df['de ref'] = df['de ref'].astype(int) df.drop('schoolname',axis=1, inplace=True) df.drop('free school meals', axis=1, inplace=True) df.set_index('de ref', inplace=True) df.replace('#',pd.np.nan, inplace=True) # # = Undisclosed df.replace('*',2.0, inplace=True) # * == < 5 df.replace('!',1, inplace=True) # ! avoid identification, so it's prob one or two df=df.astype(float) except TypeError as e: print(e) else: df=None return df def parse_available_table(xls): """From an ExcelFile, clean up: * Inconsistent header depth * fucked up nans/nulls all over the place * inconsistent *footer* depth... * Set de ref as join index and drop pointless fields * Totally different schemas between years * Inconsistent metric naming * non numerical data flags (*/!) """ ref_key_map={ 'schoolname': 'school name', 'total unfilled places': 'available places', 'unfilled places': 'available places', 'total approved enrolment number': 'approved enrolments' } join_n_strip_n_lower = lambda l: ' '.join(l).strip().lower() if 'School level data' in xls.sheet_names: df = pd.read_excel(xls, 'School level data', header=None) h_range = 2 if isinstance(df.ix[3,0], int) else 3 elif 'unfilled places' in xls.sheet_names: df = pd.read_excel(xls, 'unfilled places', header=None) h_range = 2 if isinstance(df.ix[3,0], int) else 3 else: df=None if df is not None: try: df.columns=df.ix[1:h_range].fillna('').apply(join_n_strip_n_lower, axis=0).values df.rename(columns=ref_key_map, inplace=True) df = df.drop(df.index[0:h_range+1]).reset_index(drop=True) df=df.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x) df.dropna(how='all', axis=1, inplace=True) df.dropna(how='any', axis=0, inplace=True) if df.shape[1] == 6: # recent doesn't have fecking headers cols = list(df.columns) cols[0] = 'de ref' cols[1] = 'school name' df.columns=cols df.drop('school name', axis=1, inplace=True) df['de ref'] = df['de ref'].astype(int) df.set_index('de ref', inplace=True) df.replace('*',2.0, inplace=True) # * == < 5 df.replace('!',1, inplace=True) # ! 
# In[32]:

re_dfs = {}
av_dfs = {}
en_dfs = {}
fsm_dfs = {}
for year, files in year_files.items():
    for file in files:
        if 'post' in file:
            xls = pd.ExcelFile(f'data/education-ni/{year}/{file}')
            df = parse_reference_table(xls)
            if df is not None:
                print(f'Got reference data for {year}')
                re_dfs[year] = df
            df = parse_enrolments_table(xls)
            if df is not None:
                print(f'Got enrolment data for {year}')
                en_dfs[year] = df
            df = parse_available_table(xls)
            if df is not None:
                print(f'Got available data for {year}')
                av_dfs[year] = df
            df = parse_fsm_table(xls)
            if df is not None:
                print(f'Got fsm data for {year}')
                fsm_dfs[year] = df

reference = pd.Panel(re_dfs).sort_index()
available = pd.Panel(av_dfs).sort_index()
enrolment = pd.Panel(en_dfs).sort_index()
fsm = pd.Panel(fsm_dfs).sort_index()

# # Bring it all together
# Panels are 3D DataFrames, i.e. [Year x School x Metric]
#
# * Reference: School name, address, denomination, type, etc.
# * Available: Allocated places information
# * Enrolment: Actual student numbers
# * FSM: Free School Meals: Laziest possible deprivation measure, probably doesn't need to be a panel

# # Show me the data
#
# ## Line graphs & boxplots & pie charts, Oh My
#
# ![](img/ohmy.gif)

# In[33]:

enrolment.sum()

# In[153]:

enrolment.sum().sum().iplot()

# In[154]:

enrolment.sum().iplot(kind='bar')

# In[155]:

enrolment.sum().T.iplot(kind='bar', barmode='stack')

# In[156]:

# What is wrong with this "sum" table?
enrolment.sum(axis=2).head()

# ### Skippable aside: Danger Zone
# ![](img/danger.gif)
#
# __When is a zero not nothing?__

# In[38]:

[10] + [np.nan] * 10

# In[39]:

sum([10] + [np.nan] * 10)

# In[40]:

pd.Series([10] + [np.nan] * 10).sum()

# In[41]:

pd.Series([10] + [np.nan] * 10).mean()

# In[42]:

pd.Series([np.nan] * 10).sum()

# In[43]:

pd.Series([10] + [0.0] * 10).mean()

# In[44]:

# These 0's are actually sum([np.nan])
# The school was closed in 2014
enrolment.sum(axis=2).head()

# In[45]:

reference[2013].loc[1210266]

# In[46]:

reference[2014].loc[1210266]

# In[47]:

# It *should* look like this
enrolment.sum(axis=2).head().replace(0.0, np.nan)

# ## AUDIENCE PARTICIPATION
#
# Which of the below is the correct answer to:
# "What is the average number of pupils per school in Northern Ireland from 2009 to 2017?"

# In[48]:

# A: Sum of years, swap 0 for nan, mean over schools, mean over classes
enrolment.sum(axis=1).replace(0.0, np.nan).mean(axis=1).mean()

# In[49]:

# B: Sum over classes, mean over schools, swap 0 for nan, mean over years
enrolment.sum(axis=2).mean().replace(0.0, np.nan).mean()

# In[50]:

# C: Sum over classes, swap 0.0 for nan, mean over schools, mean over years
enrolment.sum(axis=2).replace(0.0, np.nan).mean().mean()

# ## C: Sum over classes, swap 0.0 for nan, mean over schools, mean over years!
#
# But why not B? Isn't it doing the same thing?

# In[51]:

enrolment.sum(axis=2).head()

# In[52]:

enrolment.sum(axis=2).head().mean().to_frame().T

# In[53]:

enrolment.sum(axis=2).replace(0.0, np.nan).head()

# In[54]:

enrolment.sum(axis=2).replace(0.0, np.nan).mean().to_frame().T

# It's your responsibility as a Data Scientist to understand what questions your algorithms are *really* answering, and to make sure that the people asking those questions are *crystal clear* about what your data means.
#
# Data Science is as much about comprehension and communication as it is about statistics and programming.
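# The difference between B and C is purely the order of operations, so a toy example makes it concrete
# (made-up numbers, nothing to do with the DE data):

# In[ ]:

toy_avg = pd.DataFrame({'open school': [100.0, 110.0],
                        'closed school': [90.0, 0.0]},  # that 0.0 really means "no longer exists"
                       index=[2016, 2017])
# C-style: mask the 0 *before* averaging, so the closed school contributes its last real year (90)
print(toy_avg.replace(0.0, np.nan).mean().mean())  # (105 + 90) / 2 = 97.5
# B-style: average first, so the bogus 0 drags the closed school down to 45 before it can be masked
print(toy_avg.mean().replace(0.0, np.nan).mean())  # (105 + 45) / 2 = 75.0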
# ### Return of regular schedule - What happened to average cohort sizes in 2015?

# In[158]:

enrolment.mean(axis=2).mean().iplot(title='Annual Average Cohort (school,class,year)')

# In[159]:

enrolment.mean(axis=2).T.iplot(title='Annual Average Cohort (school,class,year) per school', legend=False)

# In[59]:

annual_school_enrolment = enrolment.sum(axis=2).T
annual_school_enrolment.head()

# In[160]:

name_lookup = reference.minor_xs('school name').T.describe().ix['top']
annual_school_enrolment.rename(columns=name_lookup).iplot(legend=False)

# In[161]:

annual_school_enrolment.index.min()

# In[66]:

first = annual_school_enrolment.replace(0.0, np.nan)\
    .apply(pd.Series.first_valid_index)\
    .replace(annual_school_enrolment.index.min(), np.nan)
last = annual_school_enrolment.replace(0.0, np.nan)\
    .apply(pd.Series.last_valid_index)\
    .replace(annual_school_enrolment.index.max(), np.nan)
first.dropna()

# In[67]:

close_vec = pd.DataFrame({
    'opened': first.groupby(first).size(),
    'closed': last.groupby(last).size()
})
close_vec

# In[162]:

close_vec.cumsum().fillna(method='ffill').iplot(title="Cumulative number of schools opened/closed since 2009")

# ### Skippable Aside: Slicing Panels

# In[69]:

up_to_ks3 = enrolment.minor_axis <= 10
up_to_ks3

# In[70]:

enrolment[:, :, up_to_ks3].sum()

# In[71]:

cowtown = reference.major_axis[(reference.minor_xs('town') == 'ballymoney').any(axis=1)]
cowtown

# In[72]:

enrolment.sum(axis=2).loc[cowtown].sum()

# ## Mapping
# * Maps are *hard* in NI
# * No Open Data mapping between Postcode -> Boundary
# * Also doesn't help that ESRI's APIs have been broken recently....
#
# However... we do get towns and administrative boundaries... from something called the "Gazetteer"

# In[73]:

if not Path('data/gztr').exists():
    gztr_json = requests.get("http://osni-spatial-ni.opendata.arcgis.com/datasets/117e5c3d0f0b41208d1caec1ddbd1330_1.geojson").json()
    gztr = gp.GeoDataFrame.from_features(gztr_json).set_index('OBJECTID')
    gztr['PLACENAME'] = gztr['PLACENAME'].str.lower()
    gztr.to_file('data/gztr')
gztr = gp.read_file('data/gztr')
gztr.head()

# In[74]:

gztr.plot()

# ### And we have a base map...

# In[75]:

base_map_zip = "http://osni-spatial-ni.opendata.arcgis.com/datasets/d9dfdaf77847401e81efc9471dcd09e1_0.zip"
base_map_shp = "OSNI_Open_Data_Largescale_Boundaries__NI_Outline.shp"
if not Path('data/' + base_map_shp).exists():
    urllib.request.urlretrieve(base_map_zip, 'data/_tmp.zip')
    with zipfile.ZipFile('data/_tmp.zip') as z:
        z.extractall('data/')
    Path('data/_tmp.zip').unlink()
base_map = gp.GeoDataFrame.from_file('data/' + base_map_shp)

# In[77]:

f, ax = plt.subplots(figsize=graph_figsize)
base_map.plot(ax=ax, color='gray')
gztr.plot(ax=ax, color='b')

# In[78]:

from scipy.stats import mode

school_towns = reference.minor_xs('town').mode(axis=1)[0].value_counts()
school_towns

# In[79]:

school_towns_gp = gztr[gztr['PLACENAME'].isin(school_towns.keys())].set_index('PLACENAME')
school_towns_gp.head()

# In[80]:

school_towns_gp['school_count'] = school_towns.astype(float)
school_towns_gp.head()

# In[81]:

f, ax = plt.subplots(figsize=graph_figsize)
base_map.plot(ax=ax, color='gray')
school_towns_gp.plot(ax=ax, column='school_count', legend=True)

# ### Points aren't really very good for seeing trends though...
# Need some regions

# In[82]:

cons_map_zip = "http://osni-spatial-ni.opendata.arcgis.com/datasets/563dc2ec3d9943428e3fe68966d40deb_3.zip"
cons_map_shp = "OSNI_Open_Data_Largescale_Boundaries__Parliamentary_Constituencies_2008.shp"
if not Path('data/' + cons_map_shp).exists():
    urllib.request.urlretrieve(cons_map_zip, 'data/_tmp.zip')
    with zipfile.ZipFile('data/_tmp.zip') as z:
        z.extractall('data/')
    Path('data/_tmp.zip').unlink()
cons_map = gp.GeoDataFrame.from_file('data/' + cons_map_shp)
cons_map.rename(columns={'PC_NAME': 'constituency'}, inplace=True)
cons_map.drop(['OBJECTID', 'PC_ID'], axis=1, inplace=True)
cons_map['constituency'] = cons_map['constituency'].str.lower().str.strip()
cons_map.set_index('constituency', inplace=True)
cons_map.head()

# In[83]:

f, ax = plt.subplots(figsize=graph_figsize)
cons_map.plot(ax=ax)
cons_map.apply(lambda x: ax.annotate(x.name, xy=x.geometry.centroid.coords[0], ha='center'), axis=1);

# In[84]:

reference.minor_xs('parliamentary constituency').head()

# In[85]:

cons = set(reference.minor_xs('parliamentary constituency').values.ravel('k'))
en_df = pd.DataFrame.from_dict({
    con: enrolment[:, reference.major_axis[(reference.minor_xs('parliamentary constituency') == con).any(axis=1)], :].sum().sum()
    for con in cons
}).T.drop(np.nan)
av_df = pd.DataFrame.from_dict({
    con: available[:, reference.major_axis[(reference.minor_xs('parliamentary constituency') == con).any(axis=1)], 'available places'].sum()
    for con in cons
}).T.drop(np.nan)

cons_stats = gp.GeoDataFrame(geometry=cons_map.geometry)
for c in av_df:
    cons_stats[f"av_{c}"] = av_df[c]
    cons_stats[f"av_{c}_rat"] = (av_df[c] / en_df[c])
for c in en_df:
    cons_stats[f"en_{c}"] = en_df[c]
    cons_stats[f"en_{c}_pk"] = en_df[c] / cons_map["Area_sqkm"]

# In[86]:

cons_stats.head()

# In[87]:

f, ax = plt.subplots(figsize=graph_figsize)
ax.set_title("Per Constituency Total Enrolments [2017]")
cons_stats.plot(column='en_2017', ax=ax, legend=True)
cons_stats.apply(lambda x: ax.annotate(x.name,
                                       xy=x.geometry.centroid.coords[0],
                                       ha='center',
                                       color='#eeeeee',
                                       fontsize=50 * x.geometry.area), axis=1);

# In[88]:

def annotate_cons(x, ax=ax):
    ax.annotate(x.name,
                xy=x.geometry.centroid.coords[0],
                ha='center',
                color='#eeeeee',
                fontsize=50 * x.geometry.area)

# In[89]:

f, ax = plt.subplots(figsize=graph_figsize)
ax.set_title("Per Constituency Total Enrolments per $km^2$ [2017]")
cons_stats.plot(column='en_2017_pk', ax=ax, legend=True)
cons_stats.apply(annotate_cons, ax=ax, axis=1);

# In[90]:

cons_stats["en_2017_pk"].sort_values(ascending=False)

# ## But what does it all mean
#
# Gather data from other sources to put things in context;
#
# How many people are in each region?
#
# ![](img/pop_cons.png)

# In[102]:

if not Path('data/cons_pop.csv').exists():
    for dataset in odni.action.package_show(id='population-estimates-for-northern-ireland')['resources']:
        if dataset['name'] == "Parliamentary Constituencies by single year of age and gender (mid-2001 to mid-2018)":
            cons_pop = pd.read_csv(dataset['url'])
            cons_pop.to_csv("data/cons_pop.csv", index=False)
            cons_pop.head()
cons_pop = pd.read_csv('data/cons_pop.csv')
cons_pop['Mid_Year_Ending'] = cons_pop.Mid_Year_Ending.astype(int)
cons_pop['Population_Estimate'] = cons_pop.Population_Estimate.astype(float)
cons_pop['Age'] = cons_pop.Age.astype(int)
cons_pop.rename(columns={'Geo_Name': 'constituency'}, inplace=True)
cons_pop['constituency'] = cons_pop.constituency.str.strip().str.lower()
cons_pop.head()

# In[103]:

cons_pop[(cons_pop.Gender == 'All persons') & (cons_pop.Mid_Year_Ending == 2018)].head()

# ### What's the distribution of ages in NI?

# In[104]:

f, ax = plt.subplots(figsize=graph_figsize)
ax.set_title('Population Distribution (2018)')
cons_pop[(cons_pop.Gender == 'All persons') & (cons_pop.Mid_Year_Ending == 2018)]\
    .groupby('Age')['Population_Estimate'].sum().plot.line(ax=ax)

# ### Constituency breakdown

# In[105]:

f, ax = plt.subplots(figsize=graph_figsize)
ax.set_title('Constituency Population Distribution (2018)')
cons_pop[(cons_pop.Gender == 'All persons') & (cons_pop.Mid_Year_Ending == 2018)]\
    .groupby(['Age', 'constituency'])['Population_Estimate'].sum().unstack().plot.bar(stacked=True, ax=ax)

# ### That's not a great way of summarising age distributions....

# In[106]:

def idxquantile(s, q=0.5, *args, **kwargs):
    qv = s.quantile(q, *args, **kwargs)
    return (s.sort_values()[::-1] <= qv).idxmax()

f, ax = plt.subplots(figsize=graph_figsize)
ax.set_title('Cumulative Population Distribution (2018)')
pop_cumsum = cons_pop[(cons_pop.Gender == 'All persons') & (cons_pop.Mid_Year_Ending == 2018)]\
    .groupby('Age')['Population_Estimate'].sum().cumsum()
pop_cumsum.plot(ax=ax)
age_quartiles = [idxquantile(pop_cumsum, q) for q in [0.25, 0.5, 0.75]]
ax.vlines(age_quartiles, ymin=pop_cumsum.min(), ymax=pop_cumsum[age_quartiles])
age_quartiles

# In[107]:

f, ax = plt.subplots(figsize=graph_figsize)
ax.set_title('Cumulative Population Distribution (2009-2018)')
yrs = sorted(cons_pop.Mid_Year_Ending.unique())
colors = plt.cm.jet(np.linspace(0, 1, len(yrs)))
for yr, c in zip(yrs, colors):
    pop_cumsum = cons_pop[(cons_pop.Gender == 'All persons')
                          & (cons_pop.Mid_Year_Ending == yr)]\
        .groupby('Age')['Population_Estimate'].sum().cumsum()
    pop_cumsum.plot(ax=ax, label=yr, c=c)
    age_quartiles = [idxquantile(pop_cumsum, q) for q in [0.25, 0.5, 0.75]]
    ax.vlines(age_quartiles, ymin=pop_cumsum.min(), ymax=pop_cumsum[age_quartiles], color=c, alpha=0.2)
ax.legend()

# In[108]:

def constituency_age_profile(constituency=None):
    # I have a hunch we might want to do this with lists....
    if constituency is None:
        constituency = cons_pop.constituency.unique()
        label = f"Population profile for NI"
    elif not isinstance(constituency, list):
        label = f"Population profile for {constituency}"
        constituency = [constituency]
    else:
        label = f"Population profile for {constituency}"
    f, ax = plt.subplots(figsize=graph_figsize)
    yrs = sorted(cons_pop.Mid_Year_Ending.unique())
    colors = plt.cm.jet(np.linspace(0, 1, len(yrs)))
    for yr, c in zip(yrs, colors):
        pop_cumsum = cons_pop[(cons_pop.Gender == 'All persons')
                              & (cons_pop.constituency.isin(constituency))
                              & (cons_pop.Mid_Year_Ending == yr)]\
            .groupby('Age')['Population_Estimate'].sum().cumsum()
        pop_cumsum.plot(ax=ax, label=yr, c=c)
        age_quartiles = [idxquantile(pop_cumsum, q) for q in [0.25, 0.5, 0.75]]
        ax.vlines(age_quartiles, ymin=pop_cumsum.min(), ymax=pop_cumsum[age_quartiles], color=c, alpha=0.2)
        for i, age in enumerate(age_quartiles):
            ax.annotate(f"<={age}", xy=(age, i))
    ax.legend()
    ax.set_title(label)

# ### South Belfast: Kinky Constituency

# In[109]:

constituency_age_profile('belfast south')

# ### Newry and Armagh: Middle Age Mourneing

# In[110]:

constituency_age_profile('newry and armagh')

# ### Foyled Again: Cross-generational Brain Drain

# In[111]:

constituency_age_profile('foyle')

# ### North Antrim: 40 is the new 30

# In[112]:

constituency_age_profile('north antrim')

# ### Growing Up in the (Lagan) Valley

# In[113]:

constituency_age_profile('lagan valley')

# ### Fast times at Belfast Unis/Colleges

# In[163]:

df = cons_pop[(cons_pop.Gender == 'All persons')
              & (cons_pop.Mid_Year_Ending == 2018)]\
    .groupby(['Age', 'constituency'])['Population_Estimate'].sum().unstack()
df.iplot()

# In[115]:

f, ax = plt.subplots(figsize=graph_figsize)
sns.heatmap(df.T, ax=ax)

# In[164]:

def colnorm(df):
    return (df - df.min()) / (df.max() - df.min())

colnorm(df).iplot()

# In[117]:

f, ax = plt.subplots(figsize=graph_figsize)
sns.heatmap(colnorm(df).T, ax=ax)

# In[118]:

def year_age_profile(yr=2016):
    constituencies = list(cons_pop.groupby("constituency")['Population_Estimate'].sum().sort_values(ascending=True).keys())
    label = f"Population profile for NI in {yr}"
    f, ax = plt.subplots(figsize=graph_figsize)
    colors = plt.cm.jet(np.linspace(0, 1, len(constituencies)))
    for constituency, c in zip(constituencies, colors):
        pop_cumsum = cons_pop[(cons_pop.Gender == 'All persons')
                              & (cons_pop.constituency == constituency)
                              & (cons_pop.Mid_Year_Ending == yr)]\
            .groupby('Age')['Population_Estimate'].sum().cumsum()
        pop_cumsum.plot(ax=ax, label=constituency, c=c)
        age_quartiles = [idxquantile(pop_cumsum, q) for q in [0.25, 0.5, 0.75]]
        ax.vlines(age_quartiles, ymin=pop_cumsum.min(), ymax=pop_cumsum[age_quartiles], color=c, alpha=0.2)
    ax.legend()
    ax.set_title(label)

# In[119]:

year_age_profile()

# In[120]:

cons_pop_qilted = pd.DataFrame.from_dict(
    {qtile: cons_pop[(cons_pop.Gender == 'All persons')
                     & (cons_pop.Mid_Year_Ending == 2018)
                     & (lotile

# # Conclusions

# ## Hypotheses / conjectures
# * East Antrim Constituency is literally 'little britain'
# * No one wants to go to school in West Belfast for some reason
# * Everyone wants to go to school in North Down
# * The 'places' scandal is either overblown, or more related to 'preference'
# * There is a demographic wave of <11's that is gonna hit the education system in 2 years. Good luck.
# * South Belfast steals all-the-18-year-olds, keeps em for a decade or so and then they go home
# * Foyle has been experiencing a sustained brain drain of < 35's that is now impacting services

# ## Final Comments
# * Data is hard
# * Open Data is hard without a budget
# * The 80/20 rule is more like 90/10

# # Idiots Guide to (Open) Data Science
#
# Andrew Bolster
# * [bolster.online](https://bolster.online)
# * Tweets [@bolster](https://twitter.bolster.online)
# * GitHubs at [andrewbolster](https://github.com/andrewbolster)
# * Machine Learning Team Lead at [WhiteHatSec](https://www.whitehatsec.com)
# * **We're hiring** (Placements also available)
# * Plays at [Farset Labs](https://www.farsetlabs.org.uk)
# * **[THIS NOTEBOOK IS AT present.bolster.online](http://present.bolster.online)**
# * [Also available in source at presentgh.bolster.online](http://presentgh.bolster.online)

# ### NOTES FOR LATER
#
# #### Fucked Notes:
# ##### DoE
# * inconsistent file names
# * inconsistent extensions
# * inconsistent cases in extensions 🤦‍
# * changing schema almost every year
# * recent years don't have headers for de ref/school name
# * multi row headers
# * Changing "All People" to "All persons" silently...
#
# #### Housing Stock Data
# https://www.opendatani.gov.uk/dataset/ni-housing-stock
#
# #### SEN scaling is complex AF
# https://www.education-ni.gov.uk/sites/default/files/publications/de/the-code-of-practice.pdf
#
# A ~hundred page doc to work out that 5 is 'confirmed and fully stated special educational needs'. Everything less is 'suspected and in need of advice and support'.
#
# "Controlled and Maintained schools have [a budget] for this"
# "Grant maintained integrated schools also receive this through their annual budget"

# In[ ]:
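# #### `pd.Panel` is on borrowed time
# The [Year x School x Metric] cubes above lean on `pd.Panel`, which was deprecated in pandas 0.20 and removed in 1.0.
# The same shape survives as a DataFrame with a (year, school) MultiIndex; a rough sketch of the equivalent,
# assuming the `en_dfs` dict built in In[32] is still in scope:

# In[ ]:

enrolment_mi = pd.concat(en_dfs, names=['year', 'de ref']).sort_index()
per_school = enrolment_mi.sum(axis=1)     # total pupils per (year, school)
per_school.groupby(level='year').mean()   # average school size per year; no phantom zeros here,
                                          # because schools missing in a year simply have no row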