
A Quick Look at Election Betting Data¶

For the hundred days or so leading up to the election, I was scraping the betting odds listed on oddschecker for most of the UK parliamentary constituencies (I didn't check that I was scraping them all; this was just a side-side-project...).

I didn't look at the data at all in the run-up to the election (my original plan was to look at timeseries within each constituency to try to detect sudden changes in odds that might indicate some sort of major shift in sentiment in each constituency), but with things all settled now, I thought I'd look to see what tales - if any - the betting data might tell, at least at a high level. For example, what did the betting odds have to say about the likely number of seats taken by each party...?

This notebook was developed as part of an exploration into the possible forms a student-produced notebook could take as part of an assessment process for a course on data management and analysis, using a dataset selected by the student. Intended student worktime for the assessment: <10 hours. Comments appreciated...

In [320]:

#I'm going to use the pandas library to analyse the data
import pandas as pd

Data Acquisition¶

The data can be pulled directly from the scraper - https://morph.io/psychemedia/electionodds. Using the API, we can write a SQLite query to grab the odds for a particular day:

select * from 'Constituency2015GE' constituency where strftime('%d-%m-%Y',time)="06-05-2015"
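To see how the strftime() filter behaves without calling the API, here is a minimal sketch that runs an equivalent query against an in-memory SQLite table using Python's built-in sqlite3 module. The table layout follows the columns shown later in this notebook; the sample rows (including the row dated 6th May) and the odds_for_day() helper are made up for illustration.

```python
import sqlite3

# In-memory database with a table shaped like the scraper output (assumed layout)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Constituency2015GE '
             '(time TEXT, bookie TEXT, party TEXT, odds REAL, oddsraw TEXT, constituency TEXT)')

# Illustrative rows only; the last row is a hypothetical observation from a different day
rows = [
    ('2015-05-05T09:54:19+00:00', 'LD', 'green', 100.0, '100', 'aberavon'),
    ('2015-05-05T09:54:19+00:00', 'FB', 'labour', 0.01, '1/100', 'aberavon'),
    ('2015-05-06T09:54:19+00:00', 'WH', 'labour', 0.01, '1/100', 'aberavon'),
]
conn.executemany('INSERT INTO Constituency2015GE VALUES (?,?,?,?,?,?)', rows)

def odds_for_day(conn, day):
    '''Return all rows whose timestamp falls on the given day (formatted dd-mm-YYYY)'''
    sql = "SELECT * FROM Constituency2015GE WHERE strftime('%d-%m-%Y', time) = ?"
    return conn.execute(sql, (day,)).fetchall()

may5 = odds_for_day(conn, '05-05-2015')
may6 = odds_for_day(conn, '06-05-2015')
print(len(may5), len(may6))
```

Passing the date as a bound parameter (the ? placeholder) avoids having to juggle nested quote marks inside the query string.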

The data can be pulled directly into a pandas dataframe from the morph.io scraper API:

ASSESSMENT - IDENTIFY AN APPROPRIATE DATASET FOR USE IN A DATA INVESTIGATION

ASSESSMENT - DEMONSTRATE HOW TO USE AN SQL STATEMENT TO RETRIEVE A DATASET

NOTE: WE COULD USE pandasql TO RUN A SQL QUERY ON A DATAFRAME PUSHED INTO A SQLITE DATABASE

In [321]:

#df=pd.read_csv('https://api.morph.io/psychemedia/electionodds/data.csv?key='+SECRETKEY+'&query=select%20*%20from%20%27Constituency2015GE%27%20constituency%20where%20strftime(%27%25d-%25m-%25Y%27%2Ctime)%3D%2206-05-2015%22')
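Percent-encoding the query by hand is fiddly, so one option is to let the standard library do it. The sketch below builds the same sort of URL with urllib.parse.urlencode; the build_query_url() helper is hypothetical, and 'SECRETKEY' is a placeholder for a real morph.io API key.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_BASE = 'https://api.morph.io/psychemedia/electionodds/data.csv'

def build_query_url(query, key):
    '''Return the CSV endpoint URL with the key and SQL query percent-encoded'''
    return API_BASE + '?' + urlencode({'key': key, 'query': query})

query = "select * from 'Constituency2015GE' constituency where strftime('%d-%m-%Y',time)=\"06-05-2015\""
url = build_query_url(query, 'SECRETKEY')  # placeholder key, not a real one

# Round-trip check: decoding the URL should give back the original query
decoded = parse_qs(urlsplit(url).query)['query'][0]
print(decoded == query)
```

The resulting url could then be passed straight to pd.read_csv(url).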

In [322]:

#Or you can download the electionodds data and pop it into a file...
#Don't believe the filename - the data is data collected at some point on May 5th
df=pd.read_csv('electionodds_thurs.csv')

ASSESSMENT - DEMONSTRATE HOW TO LOAD IN FROM, OR SAVE DATA TO, A DATA FILE IN A RECOGNISED FORMAT

Data Review¶

Let's just get a quick view over the data to familiarise ourselves with what it contains.

In [323]:

#Preview the data
df.head()

Out[323]:

	time	bookie	party	odds	oddsraw	constituency
0	2015-05-05T09:54:19+00:00	LD	green	100.00	100	aberavon
1	2015-05-05T09:54:19+00:00	FB	pc	50.00	50	aberavon
2	2015-05-05T09:54:19+00:00	LD	pc	50.00	50	aberavon
3	2015-05-05T09:54:19+00:00	WH	pc	50.00	50	aberavon
4	2015-05-05T09:54:19+00:00	FB	labour	0.01	1/100	aberavon

The odds column is simply a decimalised version of the oddsraw value. Note that this is not strictly the decimal odds, which would be 1 greater.
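As a quick worked example of the relationship between the three forms, the sketch below converts an oddsraw string into the decimalised value used in the odds column, the strict decimal odds (1 greater), and the implied probability (the reciprocal of the decimal odds, a standard conversion rather than something in the scraped data). The function names are my own.

```python
from fractions import Fraction

def decimalise(oddsraw):
    '''Convert fractional odds such as '9/4' or '50' to the profit per unit staked'''
    return float(Fraction(oddsraw))

def decimal_odds(oddsraw):
    '''Strict decimal odds: total return per unit staked, including the stake'''
    return decimalise(oddsraw) + 1

def implied_probability(oddsraw):
    '''Probability implied by the odds, ignoring the bookmaker's margin'''
    return 1 / decimal_odds(oddsraw)

for raw in ['1/100', '9/4', '50']:
    print(raw, decimalise(raw), decimal_odds(raw), round(implied_probability(raw), 4))
```

So odds of 9/4 appear as 2.25 in the odds column, correspond to decimal odds of 3.25, and imply roughly a 31% chance of winning.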

In [324]:

#What bookies did we collect data for?
df['bookie'].unique()

Out[324]:

array(['LD', 'FB', 'WH', 'B3'], dtype=object)

In [325]:

#How many constituencies did we grab data for?
len(df['constituency'].unique())

Out[325]:

In [326]:

#What is the range of odds offered?
(df['odds'].min(), df['odds'].max())

Out[326]:

(0.001, 9999.0)

In [327]:

#What parties did we collect data for and in what numbers?
#Note that we are likely to count the same party several times in each constituency,
# once for each bookmaker offering odds on that party in that constituency
df['party'].value_counts().head(15)

Out[327]:

labour                          2103
liberal democrats               2014
ukip                            1977
conservatives                   1945
any other party or candidate     600
greens                           580
green                            273
snp                              224
conservative                     159
tusc                             130
green party                      105
pc                                91
liberal democrat                  85
alliance                          57
sinn fein                         57
dtype: int64

The count of party appearances in the dataset is not necessarily very useful, because the same party may be counted several times in the same constituency, once for each bookmaker offering odds on it. However, the count does clearly show us that there are multiple representations of what are presumably the same party name.
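One way round the double counting would be to count the number of distinct constituencies in which each party is priced, rather than the number of rows. The sketch below does this on a few made-up rows in the same long format as the scraped data.

```python
import pandas as pd

# Toy rows in the same long format as the scraped data (values are illustrative)
sample = pd.DataFrame({
    'bookie': ['LD', 'FB', 'WH', 'FB', 'WH'],
    'party': ['labour', 'labour', 'labour', 'greens', 'green'],
    'constituency': ['aberavon', 'aberavon', 'aberavon', 'aberavon', 'aberconwy'],
})

# Row counts are inflated by the number of bookmakers quoting a price...
rows_per_party = sample['party'].value_counts()

# ...whereas nunique() counts each constituency at most once per party
seats_per_party = sample.groupby('party')['constituency'].nunique()

print(rows_per_party)
print(seats_per_party)
```

Note that greens and green are still counted separately here, which is the cleaning problem again.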

ASSESSMENT - DEMONSTRATE TWO OR MORE TECHNIQUES THAT PROVIDE AN OVERVIEW OF A NEW DATASET

Cleaning the Data¶

Note there are several opportunities in the party column at least for cleaning the data - for example, green, greens and green party are likely the same party, as are conservative and conservatives. A quick and dirty cleaning approach would be to right strip "s", replace occurrences of "party" at the end of the string, and then strip() just to clear away any whitespace.

In [328]:

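#The regex=True flag makes sure 'party$' is treated as a pattern anchored at the end of the string
df['party_clean']=df['party'].str.rstrip('s').str.replace(r'party$','',regex=True).str.strip()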
df['party_clean'].value_counts()

Out[328]:

conservative                    2104
labour                          2103
liberal democrat                2099
ukip                            1977
green                            958
any other party or candidate     600
snp                              224
tusc                             130
pc                                91
alliance                          57
sinn fein                         57
sdlp                              56
dup                               53
uup                               46
any other                         45
...
lorraine morgan-brinkhurst    1
alfred okam                   1
chaka artwell                 1
john neville hobb             1
any other independant         1
les tallon-morri              1
the whig                      1
christopher tompson           1
robin lambert                 1
residents for uttlesford      1
james kirkcaldy               1
europeans                     1
roy ivinson                   1
christopher gray              1
criag pond                    1
Length: 309, dtype: int64
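The tail of this output suggests the blanket rstrip("s") is a little too eager: entries such as "john neville hobb", "les tallon-morri" and "the whig" look as if they have lost a legitimate final "s". A more conservative sketch is to map known variants explicitly and leave everything else alone. The alias table below is hypothetical and incomplete, and treating "pc" and "plaid cymru" as the same party is my assumption.

```python
# Hypothetical alias table mapping variant spellings to a single canonical name
PARTY_ALIASES = {
    'greens': 'green',
    'green party': 'green',
    'conservatives': 'conservative',
    'liberal democrats': 'liberal democrat',
    'lib dem': 'liberal democrat',
    'plaid cymru': 'pc',  # assumption: 'pc' is shorthand for Plaid Cymru
}

def clean_party(name):
    '''Normalise whitespace and case, then apply a known alias if there is one'''
    key = name.strip().lower()
    return PARTY_ALIASES.get(key, key)

for raw in ['greens', ' Green Party ', 'les tallon-morris', 'lib dem']:
    print(repr(raw), '->', repr(clean_party(raw)))
```

This could be applied to the dataframe with df['party'].map(clean_party), at the cost of having to maintain the alias table by hand.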

In [329]:

#Inspect the full range of parties, if required, perhaps as basis for further cleaning
#df['party_clean'].unique()
#A more advanced approach might be to run the names through a clustering algorithm,
#or partial string matcher, to see if there are any near collisions that should be combined
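As a rough illustration of the partial string matching idea, the sketch below uses difflib from the standard library to list pairs of names that are suspiciously similar. The near_collisions() helper and the 0.8 cutoff are my own choices; a real run would pass in df['party_clean'].unique() rather than the short sample list.

```python
from difflib import get_close_matches

def near_collisions(names, cutoff=0.8):
    '''Return pairs of distinct names whose similarity ratio is at least cutoff'''
    names = sorted(set(names))
    pairs = []
    for i, name in enumerate(names):
        # Only compare against later names so each pair is reported once
        for match in get_close_matches(name, names[i + 1:], n=5, cutoff=cutoff):
            pairs.append((name, match))
    return pairs

sample = ['green', 'greens', 'green party', 'conservative',
          'conservatives', 'labour', 'ukip']
pairs = near_collisions(sample)
print(pairs)
```

With this cutoff, "green party" is not flagged as close to "green", so a purely similarity-based approach would still need some manual review.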

ASSESSMENT - DEMONSTRATE TWO OR MORE TECHNIQUES THAT CAN BE APPLIED TO CLEAN A DATASET

ASSESSMENT - DEMONSTRATE HOW TO SORT A DATASET

If we wanted a strict decimal odds column, we could simply add 1 to the odds column:

In [330]:

df['decimal_odds'] = df['odds']+1
df.head()

Out[330]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
0	2015-05-05T09:54:19+00:00	LD	green	100.00	100	aberavon	green	101.00
1	2015-05-05T09:54:19+00:00	FB	pc	50.00	50	aberavon	pc	51.00
2	2015-05-05T09:54:19+00:00	LD	pc	50.00	50	aberavon	pc	51.00
3	2015-05-05T09:54:19+00:00	WH	pc	50.00	50	aberavon	pc	51.00
4	2015-05-05T09:54:19+00:00	FB	labour	0.01	1/100	aberavon	labour	1.01

ASSESSMENT - ADD A NEW COLUMN TO A DATASET

ASSESSMENT - GENERATE A NEW COLUMN FROM A PRE-EXISTING COLUMN

Filtering / Reducing the Data¶

To start with, let's focus on the odds offered by a single bookmaker. To choose which bookie, let's see how many constituencies each of them has prices for:

In [331]:

for bookie in df['bookie'].unique():
    print('{}: {}'.format(bookie,len(df[df['bookie']==bookie]['constituency'].unique())))

LD: 627
FB: 648
WH: 648
B3: 239

ASSESSMENT - DEMO

So a good candidate would be WH (William Hill) or FB, each of which has prices for 648 constituencies. Let's go with the former...

In [332]:

#Grab the data for a particular bookie into a separate dataframe
df_wh=df[df['bookie']=='WH']

ASSESSMENT - DEMONSTRATE A WAY OF SUBSETTING A DATASET BASED ON ONE OR MORE ROW BASED CRITERIA

One thing we might want to do over the full dataset is see what odds were offered for a particular constituency by a particular bookmaker. We can write a simple convenience function to help us do that.

In [354]:

#Write a convenience function to look up odds by constituency
def oddsForConstituency(df,bookie,constituency):
    ''' Function to find rows associated with a particular bookie in a particular constituency '''
    filterView= df[(df['bookie'].str.upper()==bookie.upper()) & 
                   (df['constituency'].str.lower()==constituency.lower())] 
    return filterView

oddsForConstituency(df,'WH','berwickshire-roxburgh-and-selkirk')

Out[354]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
919	2015-05-05T09:56:50+00:00	WH	snp	2.25	9/4	berwickshire-roxburgh-and-selkirk	snp	3.25
923	2015-05-05T09:56:50+00:00	WH	labour	150.00	150	berwickshire-roxburgh-and-selkirk	labour	151.00
927	2015-05-05T09:56:50+00:00	WH	liberal democrats	2.00	2	berwickshire-roxburgh-and-selkirk	liberal democrat	3.00
935	2015-05-05T09:56:50+00:00	WH	conservatives	1.20	6/5	berwickshire-roxburgh-and-selkirk	conservative	2.20

ASSESSMENT - DEFINE AND APPLY A SIMPLE PYTHON FUNCTION THAT ACCEPTS ONE OR MORE PARAMETERS AND RETURNS ONE OR MORE VALUES

Reshaping the Data¶

The data as it stands is in a relatively tidy (Third Normal Form) long format. Each row contains a single observation that associates the odds for a single party with a particular bookmaker in each constituency.

If we wanted to compare the odds for particular parties within each constituency, it might be more convenient to put the data into a wide format, with a separate column for the odds offered for each party, indexed by constituency:

In [333]:

dfp=df_wh.pivot(index='constituency',columns='party_clean',values='odds')
dfp.head()

Out[333]:

party_clean	al murray - the pub landlord	alliance	bez	bnp	claire wright	conservative	dup	green	ian steven	john bercow	...	respect	robin scott	sdlp	sinn fein	snp	stephen picton	sylvia hermon	tuv	ukip	uup
constituency
aberavon	NaN	NaN	NaN	NaN	NaN	100.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	16	NaN
aberconwy	NaN	NaN	NaN	NaN	NaN	0.4	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	80	NaN
aberdeen-north	NaN	NaN	NaN	NaN	NaN	150.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	0.100000	NaN	NaN	NaN	150	NaN
aberdeen-south	NaN	NaN	NaN	NaN	NaN	50.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	0.142857	NaN	NaN	NaN	250	NaN
aberdeenshire-west-and-kincardine	NaN	NaN	NaN	NaN	NaN	5.5	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	0.142857	NaN	NaN	NaN	NaN	NaN

5 rows × 30 columns

In [334]:

#Preview all the columns
dfp.columns

Out[334]:

Index([u'al murray - the pub landlord', u'alliance', u'bez', u'bnp', u'claire wright', u'conservative', u'dup', u'green', u'ian steven', u'john bercow', u'john blackie', u'kerry smith', u'labour', u'lib dem', u'liberal democrat', u'national health action', u'ni conservative', u'pc', u'pirate', u'plaid cymru', u'respect', u'robin scott', u'sdlp', u'sinn fein', u'snp', u'stephen picton', u'sylvia hermon', u'tuv', u'ukip', u'uup'], dtype='object')

ASSESSMENT - DEMONSTRATE HOW TO RESHAPE A DATASET, EG USING PIVOT, MELT, STACK OR UNSTACK OPERATORS

If we wanted to limit the dataset to just major parties - that is, parties contesting a large number of seats - we could reduce the dataset as follows:

In [335]:

party_subset=['green','labour','liberal democrat','conservative','snp','ukip']

In [336]:

#Limit rows in long dataset to selected parties
example=df_wh[df_wh['party_clean'].isin(party_subset)]
example['party_clean'].unique()

Out[336]:

array(['labour', 'liberal democrat', 'ukip', 'conservative', 'snp', 'green'], dtype=object)

In [337]:

#Limit columns in wide dataset to selected parties
example=dfp[party_subset]
example.columns

Out[337]:

Index([u'green', u'labour', u'liberal democrat', u'conservative', u'snp', u'ukip'], dtype='object')

ASSESSMENT - DEMONSTRATE A WAY OF SUBSETTING A DATASET BASED ON ONE OR MORE COLUMN BASED CRITERIA

However, there are significant risks in taking this sort of approach. For example, where an individual who has a good chance of winning is standing in a particular constituency, perhaps under their own name, their candidacy would not be captured in the reduced dataset.

Perhaps a better dataset would be one that includes all major parties and any candidates who appear to have a reasonable chance of winning (say, 10 to 1 or better).
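A minimal sketch of that idea is to keep a row if it belongs to a major party OR has odds of 10 to 1 or better. The plausible_contenders() helper and the sample rows are made up for illustration (the odds shown are not real prices); on the real data the function would be applied to df_wh.

```python
import pandas as pd

party_subset = ['green', 'labour', 'liberal democrat', 'conservative', 'snp', 'ukip']

def plausible_contenders(df, major_parties, max_odds=10):
    '''Keep rows for major parties plus any other candidate priced at max_odds or shorter'''
    is_major = df['party_clean'].isin(major_parties)
    is_contender = df['odds'] <= max_odds
    return df[is_major | is_contender]

# Illustrative rows only: the odds values here are invented for the example
sample = pd.DataFrame({
    'constituency': ['seat-a', 'seat-a', 'seat-a', 'seat-b'],
    'party_clean': ['conservative', 'labour', 'sylvia hermon', 'robin scott'],
    'odds': [0.4, 3.0, 0.05, 100.0],
})

reduced = plausible_contenders(sample, party_subset)
print(reduced)
```

In this toy example the well-fancied independent is retained alongside the major parties, while the long-shot independent is dropped.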

ASSESSMENT - CRITIQUE THE APPROPRIATENESS OF A PARTICULAR QUESTION ASKED OF THE DATA IN A PARTICULAR WAY

Reducing the Dataset to Scottish Constituencies¶

One reduced dataset we might be interested in working with is the set of Scottish constituencies. We can use the fact that the SNP had a candidate in the seat as a proxy for which constituencies should be included in this set.

One way of identifying those seats is to filter the wide data set (which uses constituencies as the index values) to exclude rows in the SNP column with a null (NaN) value:

In [338]:

dfp['snp'].dropna().index.values

Out[338]:

array(['aberdeen-north', 'aberdeen-south',
       'aberdeenshire-west-and-kincardine', 'airdrie-and-shotts', 'angus',
       'argyll-and-bute', 'ayr-carrick-and-cumnock', 'ayrshire-central',
       'ayrshire-north-and-arran', 'banff-and-buchan',
       'berwickshire-roxburgh-and-selkirk',
       'caithness-sutherland-and-easter-ross',
       'coatbridge-chryston-and-bellshill',
       'cumbernauld-kilsyth-and-kirkintill', 'dumfries-and-galloway',
       'dumfriesshire-clydesdale-and-tweeddale', 'dunbartonshire-east',
       'dunbartonshire-west', 'dundee-east', 'dundee-west',
       'dunfermline-and-west-fife', 'east-kilbride-strathaven-and-lesma',
       'east-lothian', 'edinburgh-east', 'edinburgh-north-and-leith',
       'edinburgh-south', 'edinburgh-south-west', 'edinburgh-west',
       'falkirk', 'fife-north-east', 'glasgow-central', 'glasgow-east',
       'glasgow-north', 'glasgow-north-east', 'glasgow-north-west',
       'glasgow-south', 'glasgow-south-west', 'glenrothes', 'gordon',
       'inverclyde', 'inverness-nairn-badenoch-and-strathspey',
       'kilmarnock-and-loudoun', 'kirkcaldy-and-cowdenbeath',
       'lanark-and-hamilton-east', 'linlithgow-and-falkirk-ea',
       'livingston', 'midlothian', 'moray', 'motherwell-and-wishaw',
       'na-h-eileanan-an-iar', 'ochill-and-sth-perthshire',
       'orkney-and-shetland', 'paisley-and-renf-north',
       'paisley-and-renf-south', 'perth-and-n-perthshire',
       'renfrewshire-east', 'ross-skye-and-lochaber',
       'rutherglen-and-hamilton-west', 'stirling'], dtype=object)

ASSESSMENT - DEMONSTRATE A TECHNIQUE FOR CLEANING OR REDUCING A DATASET BASED ON THE PRESENCE OF NULL VALUES

Which Seats Were Safest?¶

The safest seats are the seats with the shortest (smallest) odds. If we sort the table by increasing odds and select the top few, that should give us the most secure seats.

In [339]:

df_wh.sort_values('odds',ascending=True).head(10)

Out[339]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
8438	2015-05-05T10:14:59+00:00	WH	conservatives	0.005	1/200	runnymede-and-weybridge	conservative	1.005
5376	2015-05-05T10:07:32+00:00	WH	conservatives	0.005	1/200	hertfordshire-ne	conservative	1.005
5362	2015-05-05T10:07:27+00:00	WH	conservatives	0.005	1/200	hertford-and-stortford	conservative	1.005
5319	2015-05-05T10:07:20+00:00	WH	conservatives	0.005	1/200	henley	conservative	1.005
1628	2015-05-05T09:58:35+00:00	WH	conservatives	0.005	1/200	brentwood-and-ongar	conservative	1.005
10097	2015-05-05T10:19:14+00:00	WH	labour	0.005	1/200	tyneside-north	labour	1.005
8941	2015-05-05T10:16:17+00:00	WH	labour	0.005	1/200	south-shields	labour	1.005
5029	2015-05-05T10:06:43+00:00	WH	conservatives	0.005	1/200	hampshire-east	conservative	1.005
4989	2015-05-05T10:06:39+00:00	WH	labour	0.005	1/200	halton	labour	1.005
4916	2015-05-05T10:06:27+00:00	WH	labour	0.005	1/200	hackney-south-and-shoreditch	labour	1.005

In Which Seats Was The Uncertainty Largest?¶

This is not simply a question of finding the constituency with the longest odds, but a question about finding the constituency with the longest (largest) favourite's odds.

If we find the minimum value across each row in the wide dataset, ignoring the missing values, we get the odds of the favourite. We can then sort on that value in a descending fashion to find the constituencies with the longest favourite odds.

In [340]:

dfp.min(axis=1).sort_values(ascending=False).head(10)

Out[340]:

constituency
berwickshire-roxburgh-and-selkirk    1.200000
dumfries-and-galloway                0.909091
edinburgh-south                      0.909091
northampton-north                    0.909091
torbay                               0.909091
st-ives                              0.833333
halesowen-and-rowley-regis           0.833333
pudsey                               0.833333
finchley-and-golders-green           0.800000
cornwall-north                       0.800000
dtype: float64

ASSESSMENT - DEMONSTRATE HOW TO PERFORM OPERATIONS ACROSS A ROW
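The row minimum gives the favourite's odds but not the favourite's name. As a small sketch, idxmin(axis=1) returns the column label of the minimum in each row, which here is the party. The wide table below is a cut-down copy of a few William Hill values shown elsewhere in this notebook; the variable names are my own.

```python
import numpy as np
import pandas as pd

# Cut-down wide table: rows are constituencies, columns are parties, NaN means no price
wide = pd.DataFrame(
    {'conservative': [100.0, 1.2, 0.833333],
     'labour': [0.01, 150.0, 0.833333],
     'snp': [np.nan, 2.25, np.nan],
     'ukip': [16.0, np.nan, 40.0]},
    index=['aberavon', 'berwickshire-roxburgh-and-selkirk', 'pudsey'])

# Row-wise minimum (NaN values are skipped) and the column it came from
favourite_odds = wide.min(axis=1)
favourite_party = wide.idxmin(axis=1)

summary = pd.DataFrame({'party': favourite_party, 'odds': favourite_odds})
summary = summary.sort_values('odds', ascending=False)
print(summary)
```

Where two parties are tied on the shortest odds, as in pudsey, idxmin() reports only the first matching column, so ties need checking separately.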

In Which Seats Did the Green Party Have Odds of 10 to 1 or Better?¶

We can ask this question by filtering the long dataset using two criteria combined using a Boolean operator:

In [341]:

df_wh[ (df_wh['party_clean']=='green') & (df_wh['odds']<=10) ]

Out[341]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
1713	2015-05-05T09:58:49+00:00	WH	greens	0.222222	2/9	brighton-pavilion	green	1.222222
1788	2015-05-05T09:58:56+00:00	WH	greens	4.500000	9/2	bristol-west	green	5.500000
7454	2015-05-05T10:12:36+00:00	WH	greens	5.000000	5	norwich-south	green	6.000000

ASSESSMENT - USE A BOOLEAN OPERATOR TO FILTER A DATASET BASED ON TWO OR MORE CRITERIA

How Many Seats Were Each Party Favourite In?¶

Trivially, we might think to sort the parties by each constituency in terms of increasing odds, then pick the one with the lowest odds.

If we sort the data frame in order of increasing odds, and then group by constituency, the order of the rows within each group will be in increasing order of odds.

ASSESSMENT - GENERATE ONE OR MORE QUESTIONS TO ASK OF A SELECTED DATASET

In [342]:

df_wh.sort_values('odds', ascending=True).groupby('constituency', as_index=False).get_group("aberavon")

Out[342]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
6	2015-05-05T09:54:19+00:00	WH	labour	0.01	1/100	aberavon	labour	1.01
13	2015-05-05T09:54:19+00:00	WH	ukip	16.00	16	aberavon	ukip	17.00
3	2015-05-05T09:54:19+00:00	WH	pc	50.00	50	aberavon	pc	51.00
10	2015-05-05T09:54:19+00:00	WH	liberal democrats	100.00	100	aberavon	liberal democrat	101.00
16	2015-05-05T09:54:19+00:00	WH	conservatives	100.00	100	aberavon	conservative	101.00

ASSESSMENT - DEMONSTRATE HOW TO GROUP A DATASET ACCORDING TO ONE OR MORE CRITERIA

ASSESSMENT - DEMONSTRATE HOW TO ACCESS A PARTICULAR GROUP AS A GROUP

If we pick the first() row in each group, we can generate a dataframe that contains a single row for each constituency, identifying a party with the best (shortest) odds.

We can then group these rows according to the cleaned party name, and count how many rows correspond to each party, ordering the result to show the most heavily favourited party first.

In [343]:

likelyparty=df_wh.sort_values('odds', ascending=True).groupby('constituency', as_index=False).first()
likelyparty.groupby('party_clean').size().sort_values(ascending=False)

Out[343]:

party_clean
conservative        278
labour              263
snp                  55
liberal democrat     26
dup                   9
sinn fein             4
ukip                  3
sdlp                  3
pc                    2
sylvia hermon         1
respect               1
plaid cymru           1
john bercow           1
green                 1
dtype: int64

Interpreting this naively, we see there are 278 seats with the Conservatives as favourite, 263 with Labour as favourite, and so on.

ASSESSMENT - DEMONSTRATE HOW TO PROCESS ELEMENTS IN A GROUP BY GROUP

ASSESSMENT - INTERPRET THE RESULTS GENERATED BY ASKING A PARTICULAR QUESTION OF A SELECTED DATASET

However, this approach would incorrectly predict seats where there are joint favourites, if there are any. Let's see if we can identify constituency seats where low odds are tied...

In [344]:

#Start by considering short odds
#then group by odds in each constituency
#count the rows in each group
#order the result
#and show the top few results
df_wh[df_wh['odds']<=2] \
.groupby(['odds','constituency']) \
.size() \
.sort_values(ascending=False) \
.head()

Out[344]:

odds      constituency              
0.833333  pudsey                        2
0.909091  northampton-north             2
          torbay                        2
0.833333  halesowen-and-rowley-regis    2
0.015152  workington                    1
dtype: int64

My reading of this is that there are two parties tied on odds of 0.8333 in Pudsey. Let's check:

In [345]:

df_wh[df_wh['constituency']=='pudsey']

Out[345]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
8003	2015-05-05T10:13:53+00:00	WH	labour	0.833333	5/6	pudsey	labour	1.833333
8009	2015-05-05T10:13:53+00:00	WH	liberal democrats	100.000000	100	pudsey	liberal democrat	101.000000
8013	2015-05-05T10:13:53+00:00	WH	ukip	40.000000	40	pudsey	ukip	41.000000
8017	2015-05-05T10:13:53+00:00	WH	conservatives	0.833333	5/6	pudsey	conservative	1.833333

Finding Rows in Constituencies Where Odds Are Tied¶

If we wanted a long dataset containing rows where the odds are tied within a constituency, we could use the following filter command:

In [346]:

#Limit rows to rows in constituencies where there are parties with the same odds
#That is, where there is more than one member in groups of odds by constituency
df_sameodds=df_wh.groupby(['odds','constituency']).filter(lambda x: len(x) > 1)
df_sameodds.sort_values(['odds','constituency']).head()

Out[346]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
4933	2015-05-05T10:06:29+00:00	WH	labour	0.833333	5/6	halesowen-and-rowley-regis	labour	1.833333
4948	2015-05-05T10:06:29+00:00	WH	conservatives	0.833333	5/6	halesowen-and-rowley-regis	conservative	1.833333
8003	2015-05-05T10:13:53+00:00	WH	labour	0.833333	5/6	pudsey	labour	1.833333
8017	2015-05-05T10:13:53+00:00	WH	conservatives	0.833333	5/6	pudsey	conservative	1.833333
7387	2015-05-05T10:12:26+00:00	WH	labour	0.909091	10/11	northampton-north	labour	1.909091

ASSESSMENT - DEMONSTRATE HOW TO PROCESS A GROUP BASED ON GROUP PROPERTIES
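Grouping on odds and constituency picks up any tie, including ties between rank outsiders (for example, two parties both priced at 100 in aberavon). If only joint favourites are of interest, one possible sketch is to compare each row with the shortest odds in its constituency first. The joint_favourites() helper is hypothetical; the sample rows copy the William Hill values for pudsey and aberavon shown above.

```python
import pandas as pd

sample = pd.DataFrame({
    'constituency': ['pudsey'] * 4 + ['aberavon'] * 5,
    'party_clean': ['labour', 'liberal democrat', 'ukip', 'conservative',
                    'labour', 'ukip', 'pc', 'liberal democrat', 'conservative'],
    'odds': [0.833333, 100.0, 40.0, 0.833333,
             0.01, 16.0, 50.0, 100.0, 100.0],
})

def joint_favourites(df):
    '''Return rows for constituencies where two or more parties share the shortest odds'''
    # transform('min') broadcasts each constituency's shortest odds back onto its rows
    shortest = df.groupby('constituency')['odds'].transform('min')
    favourites = df[df['odds'] == shortest]
    return favourites.groupby('constituency').filter(lambda g: len(g) > 1)

tied = joint_favourites(sample)
print(tied)
```

Here aberavon is excluded, because its only tie is between two parties at 100 to 1, well behind the favourite.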

Which party is "second" favourite, ordered by shortest odds first?¶

If we want to get a feel for which party is second favourite (or tied on joint odds with the "first" favourite), we can use the nth() rather than first() method on the odds sorted, constituency grouped form of the dataset, noting that nth() starts counting with index 0, so nth(1) corresponds to the second item in the ordered list:

In [347]:

secondparty=df_wh.sort_values('odds', ascending=True).groupby('constituency', as_index=False).nth(1)
secondparty.head()

Out[347]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
4933	2015-05-05T10:06:29+00:00	WH	labour	0.833333	5/6	halesowen-and-rowley-regis	labour	1.833333
8003	2015-05-05T10:13:53+00:00	WH	labour	0.833333	5/6	pudsey	labour	1.833333
4364	2015-05-05T10:05:10+00:00	WH	labour	0.909091	10/11	finchley-and-golders-green	labour	1.909091
9960	2015-05-05T10:18:59+00:00	WH	conservative	0.909091	10/11	torbay	conservative	1.909091
2871	2015-05-05T10:01:50+00:00	WH	liberal democrats	0.909091	10/11	cornwall-north	liberal democrat	1.909091

That result for pudsey looks a little odd? In the data frame above, the conservative entry had a higher index value, so why is the labour party listed as the second item? Let's check:

In [348]:

df_wh.sort_values('odds', ascending=True).groupby('constituency', as_index=False).get_group("pudsey")

Out[348]:

	time	bookie	party	odds	oddsraw	constituency	party_clean	decimal_odds
8017	2015-05-05T10:13:53+00:00	WH	conservatives	0.833333	5/6	pudsey	conservative	1.833333
8003	2015-05-05T10:13:53+00:00	WH	labour	0.833333	5/6	pudsey	labour	1.833333
8013	2015-05-05T10:13:53+00:00	WH	ukip	40.000000	40	pudsey	ukip	41.000000
8009	2015-05-05T10:13:53+00:00	WH	liberal democrats	100.000000	100	pudsey	liberal democrat	101.000000

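Hmm, the ordering here does seem to put the conservative row first. A likely explanation is that the default sort algorithm (quicksort) is not a stable sort, so rows with tied odds are not guaranteed to keep their original relative order.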
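If a predictable order among tied rows matters, one option is to ask pandas for a stable sort. The sketch below rebuilds the pudsey rows and sorts them with kind='mergesort', which keeps tied rows in their original order. Whether this changes the result on the full dataset is something I haven't checked.

```python
import pandas as pd

# The pudsey rows, in their original index order
pudsey = pd.DataFrame(
    {'party_clean': ['labour', 'liberal democrat', 'ukip', 'conservative'],
     'odds': [0.833333, 100.0, 40.0, 0.833333]},
    index=[8003, 8009, 8013, 8017])

# A stable sort leaves rows with equal odds in their original relative order
stable = pudsey.sort_values('odds', ascending=True, kind='mergesort')
print(stable)
```

An alternative would be to break ties explicitly by sorting on more than one column, for example ['odds','party_clean'].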

Comparing Joint/Close First and Second Favourites¶

What happens if there is a swing from first to second favourite in constituencies with short odds across the joint first, or first and second, favourites? Are there particular swings likely from one party to another?

Let's start by creating a dataframe where each row is an observation for a constituency that shows the joint, or first and second, favourites, along with their odds:

In [349]:

col_subset=['constituency','party_clean','odds']
m=pd.merge(secondparty[col_subset],likelyparty[col_subset],on='constituency')
m.columns = ['constituency','secondparty','odds_second','firstparty','odds_first']
m.head()

Out[349]:

	constituency	secondparty	odds_second	firstparty	odds_first
0	halesowen-and-rowley-regis	labour	0.833333	conservative	0.833333
1	pudsey	labour	0.833333	conservative	0.833333
2	finchley-and-golders-green	labour	0.909091	conservative	0.800000
3	torbay	conservative	0.909091	liberal democrat	0.909091
4	cornwall-north	liberal democrat	0.909091	conservative	0.800000

ASSESSMENT - DEMONSTRATE HOW TO MERGE TWO OR MORE DATASETS

ASSESSMENT - DEMONSTRATE HOW TO MANIPULATE THE MARGINAL PROPERTIES OF A DATA TABLE (EG INDICES, COLUMN HEADINGS)

We can now look to see what the possible swings are by party away from the favourite to a joint or close second favourite.

In [350]:

#Start by finding the rows where the odds are short for the second favourite
#then count group sizes swinging from first to second party
m[m['odds_second']<=1.5].groupby(['firstparty','secondparty']).size()

Out[350]:

firstparty        secondparty     
conservative      labour              13
                  liberal democrat     2
labour            conservative        10
                  liberal democrat     1
                  snp                  2
liberal democrat  conservative         6
                  labour               1
                  plaid cymru          1
plaid cymru       labour               1
respect           labour               1
snp               conservative         1
                  labour               4
ukip              conservative         1
dtype: int64

This shows, for example, that there is a good chance that up to 13 seats where the Conservatives are favourite could go to Labour, and up to 6 seats could go from the Liberal Democrats to the Conservatives.

Conclusion¶

In this notebook, I have investigated a data set containing election odds for the majority of UK constituencies on a single day prior to the UK General Election, 2015.

The data predicted that the Conservatives would win the largest number of seats, though not a majority. The data also predicted that the SNP would win a large number of Scottish constituency seats, and that the Pudsey constituency was too close to call between Labour and the Conservatives.

Analysis of constituencies with low-odds, joint or close first and second favourites suggested more possible "swings" from Conservative to Labour than vice versa, with most swings away from the Liberal Democrats going to the Conservatives.

ASSESSMENT NOTES

Sports such as gymnastics use criteria-based scoring where participants must demonstrate several elements from different difficulty groups (eg http://www.british-gymnastics.org/technical-information/selection/womens-artistic/cat_view/334-regions-and-home-countries/467-south-east/578-event-info ). One approach to assessing notebooks on an investigation around a free-data-choice activity might be to require students to demonstrate a range of technical skills (perhaps self-identifying them to reinforce reflection about their work) in an appropriate context.

I have tried to identify - and abstract - assessment opportunities along the way; should students be required to do the same as part of a critique of their own work?

Looking back over the assessment points, many of the steps were included *because the data needed treating in some way in order to ask a particular question or perform a particular transformation*. How can we capture the relationship between questions asked of the data and how those questions prompt certain transformations of the data in order to answer them?

Many data analyses are likely to include false starts that still take time to explore. Students should be allowed to include 'false-start' components in their script if they derive from a plausible initial line of investigation and demonstrate a required element.

This notebook has focussed on the demonstration of particular skills using a particular programming language (Python) and programming library within that language (pandas). Some (many? all?) of the questions could have been asked directly of the dataset using SQL. Should the notebook require students to demonstrate solutions to the same problem in different languages?

What assessment points are missing?

The notebook does not include any graphical representations of the data (no charts).

The notebook does not include any statistical analyses, other than simple rankings, sorting and extrema detection.

The notebook does not require the student to model any data for, or ingest any data into, a database.

The notebook does not require students to do any more than single line programming at each step. That is, the student is not required to develop any functions (other than one line lambda functions) at any stage.

As a rule of thumb, I estimate that each question cell, code cell, interpretation cell combination will take of the order of 5-15 minutes to produce.