GitHub link for the talk. You can clone the data and play with it yourself. Please submit any improvements as pull requests
import datetime
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas
from scipy import stats
np.set_printoptions(precision=4, suppress=True)
pandas.set_printoptions(notebook_repr_html=False,
precision=4,
max_columns=12, column_space=10,
max_colwidth=25)
today = datetime.datetime(2012, 10, 2)
Methodology was obtained from the old 538 Blog, with updates at the new site hosted by The New York Times. The six steps are summarized below; a schematic sketch follows the list.
Polling Average: Aggregate polling data, and weight it according to our reliability scores.
Trend Adjustment: Adjust the polling data for current trends.
Regression: Analyze demographic data in each state by means of regression analysis.
Snapshot: Combine the polling data with the regression analysis to produce an electoral snapshot. This is our estimate of what would happen if the election were held today.
Projection: Translate the snapshot into a projection of what will happen in November, by allocating out undecided voters and applying a discount to current polling leads based on historical trends.
Simulation: Simulate our results 10,000 times based on the results of the projection to account for the uncertainty in our estimates. The end result is a robust probabilistic assessment of what will happen in each state as well as in the nation as a whole.
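Conceptually, these six steps compose into a single pipeline. As a rough sketch of the data flow (the function names here are placeholders for the pieces developed over the rest of this notebook, not code from the actual model):

# Schematic only -- each placeholder corresponds to a numbered step above.
def forecast(polls, demographics, n_sim=10000):
    average = weighted_polling_average(polls)          # 1. Polling Average
    trended = trend_adjust(average)                    # 2. Trend Adjustment
    regressed = demographic_regression(demographics)   # 3. Regression
    snapshot = combine(trended, regressed)             # 4. Snapshot
    projection = project_to_november(snapshot)         # 5. Projection
    return simulate(projection, n_sim)                 # 6. Simulation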
The process for creating an economic index for the 538 model is described here.
forecasts = pandas.read_table("/home/skipper/school/seaboldgit/"
"talks/pydata/data/wsj_forecast.csv", skiprows=2)
forecasts
                  Forecaster                Institution  Q3 2012  Q4 2012
0              Paul Ashworth          Capital Economics      2.0      1.5
1          Nariman Behravesh         IHS Global Insight      1.5      1.6
2   Richard Berner/ David...             Morgan Stanley      NaN      NaN
3            Ram Bhagavatula      Combinatorics Capital      2.0      4.0
4          Beth Ann Bovino *        Standard and Poor's      NaN      NaN
..                       ...                        ...      ...      ...
48  Brian S. Wesbury/ Rob...   First Trust Advisors,...      2.5      3.0
49         William T. Wilson   Skolkovo Institute fo...      1.9      2.2
50              Lawrence Yun   National Association ...      1.7      2.1
forecasts.rename(columns={"Q3 2012" : "gdp_q3_2012",
"Q4 2012" : "gdp_q4_2012"}, inplace=True)
forecasts
                  Forecaster                Institution  gdp_q3_2012  gdp_q4_2012
0              Paul Ashworth          Capital Economics          2.0          1.5
1          Nariman Behravesh         IHS Global Insight          1.5          1.6
2   Richard Berner/ David...             Morgan Stanley          NaN          NaN
3            Ram Bhagavatula      Combinatorics Capital          2.0          4.0
4          Beth Ann Bovino *        Standard and Poor's          NaN          NaN
..                       ...                        ...          ...          ...
48  Brian S. Wesbury/ Rob...   First Trust Advisors,...          2.5          3.0
49         William T. Wilson   Skolkovo Institute fo...          1.9          2.2
50              Lawrence Yun   National Association ...          1.7          2.1
median_forecast = forecasts[['gdp_q3_2012', 'gdp_q4_2012']].median()
median_forecast
gdp_q3_2012    1.8
gdp_q4_2012    1.8
Job Growth (Nonfarm-payrolls) PAYEMS
Personal Income PI
Industrial production INDPRO
Consumption PCEC96
Inflation CPIAUCSL
from pandas.io.data import DataReader
series = dict(jobs = "PAYEMS",
income = "PI",
prod = "INDPRO",
cons = "PCEC96",
prices = "CPIAUCSL")
#indicators = []
#for variable in series:
# data = DataReader(series[variable], "fred", start="2010-1-1")
# # renaming not necessary in master
# data.rename(columns={"VALUE" : variable}, inplace=True)
# indicators.append(data)
#indicators = pandas.concat(indicators, axis=1)
#indicators
I used Python to scrape the Real Clear Politics website and download data for the 2004 and 2008 elections. The scraping scripts are available in the GitHub repository for this talk. State-by-state historical data for the 2004 and 2008 Presidential elections was obtained from electoral-vote.com.
Details can be found at the 538 blog here.
tossup = ["Colorado", "Florida", "Iowa", "New Hampshire", "Nevada",
"Ohio", "Virginia", "Wisconsin"]
national_data2012 = pandas.read_table("/home/skipper/school/seaboldgit/talks/pydata/"
"data/2012_poll_data.csv")
national_data2012
<class 'pandas.core.frame.DataFrame'>
Int64Index: 290 entries, 0 to 289
Data columns:
Poll          290  non-null values
Date          290  non-null values
Sample        290  non-null values
MoE           290  non-null values
Obama (D)     290  non-null values
Romney (R)    290  non-null values
Spread        290  non-null values
dtypes: float64(2), object(5)
national_data2012.rename(columns={"Poll" : "Pollster"}, inplace=True)
national_data2012["obama_spread"] = national_data2012["Obama (D)"] - national_data2012["Romney (R)"]
national_data2012["State"] = "USA"
state_data2012 = pandas.read_table("/home/skipper/school/seaboldgit/talks/pydata/data/2012_poll_data_states.csv")
state_data2012
<class 'pandas.core.frame.DataFrame'>
Int64Index: 767 entries, 0 to 766
Data columns:
Date          767  non-null values
MoE           767  non-null values
Obama (D)     767  non-null values
Poll          767  non-null values
Romney (R)    767  non-null values
Sample        767  non-null values
Spread        767  non-null values
State         767  non-null values
dtypes: float64(2), object(6)
state_data2012["obama_spread"] = state_data2012["Obama (D)"] - state_data2012["Romney (R)"]
state_data2012.rename(columns=dict(Poll="Pollster"), inplace=True);
state_data2012.MoE
0       --
1      4.5
2      4.6
3      5.0
4      4.4
...
764    2.9
765    3.5
766    3.4
Name: MoE, Length: 767
state_data2012.MoE = state_data2012.MoE.replace("--", "nan").astype(float)
state_data2012
<class 'pandas.core.frame.DataFrame'>
Int64Index: 767 entries, 0 to 766
Data columns:
Date            767  non-null values
MoE             736  non-null values
Obama (D)       767  non-null values
Pollster        767  non-null values
Romney (R)      767  non-null values
Sample          767  non-null values
Spread          767  non-null values
State           767  non-null values
obama_spread    767  non-null values
dtypes: float64(4), object(5)
state_data2012 = state_data2012.set_index(["Pollster", "State", "Date"]).drop("RCP Average", level=0).reset_index()
state_data2012.head(5)
            Pollster State         Date  MoE  Obama (D)  Romney (R)  Sample     Spread  obama_spread
0  Rasmussen Reports    WA  9/26 - 9/26  4.5         52          41  500 LV  Obama +11            11
1   Gravis Marketing    WA  9/21 - 9/22  4.6         56          39  625 LV  Obama +17            17
2         Elway Poll    WA   9/9 - 9/12  5.0         53          36  405 RV  Obama +17            17
3          SurveyUSA    WA    9/7 - 9/9  4.4         54          38  524 LV  Obama +16            16
4          SurveyUSA    WA    8/1 - 8/2  4.4         54          37  524 LV  Obama +17            17
Clean up the sample sizes so the Sample column can be treated as a number.
state_data2012.Sample
0       500 LV
1       625 LV
2       405 RV
3       524 LV
4       524 LV
...
736    1176 RV
737     796 LV
738     817 RV
Name: Sample, Length: 739
state_data2012.Sample = state_data2012.Sample.str.replace(r"\s*([LR]V|A)", "")  # strip "LV"/"RV"/"A" voter-type suffixes, e.g. "20 RV"
state_data2012.Sample = state_data2012.Sample.str.replace(r"\s*--", "nan")      # "--" marks missing values
state_data2012.Sample = state_data2012.Sample.str.replace(r"^$", "nan")         # empty strings
national_data2012.Sample = national_data2012.Sample.str.replace(r"\s*([LR]V|A)", "")
national_data2012.Sample = national_data2012.Sample.str.replace(r"\s*--", "nan")
national_data2012.Sample = national_data2012.Sample.str.replace(r"^$", "nan")
state_data2012.Sample.astype(float)
0       500
1       625
2       405
3       524
4       524
...
736    1176
737     796
738     817
Name: Sample, Length: 739
state_data2012.Sample = state_data2012.Sample.astype(float)
national_data2012.Sample = national_data2012.Sample.astype(float)
The 2012 data is in time order within each state, but the dates don't include years, so we need to infer them.
#dates2012.get_group(("OH", "NBC News/Marist"))
state_data2012["start_date"] = ""
state_data2012["end_date"] = ""
dates2012 = state_data2012.groupby(["State", "Pollster"])["Date"]
for _, date in dates2012:
year = 2012
# checked by hand, none straddle years
changes = np.r_[False, np.diff(map(int, [i[0].split('/')[0] for
i in date.str.split(' - ')])) > 0]
for j, (idx, dt) in enumerate(date.iteritems()):
dt1, dt2 = dt.split(" - ")
year -= changes[j]
# check for ones that haven't polled in a year - soft check
# could be wrong for some...
if year == 2012 and (int(dt1.split("/")[0]) > today.month and
int(dt1.split("/")[1]) > today.day):
year -= 1
dt1 += "/" + str(year)
dt2 += "/" + str(year)
state_data2012.set_value(idx, "start_date", dt1)
state_data2012.set_value(idx, "end_date", dt2)
national_data2012["start_date"] = ""
national_data2012["end_date"] = ""
dates2012 = national_data2012.groupby(["Pollster"])["Date"]
for _, date in dates2012:
year = 2012
# checked by hand, none straddle years
changes = np.r_[False, np.diff(map(int, [i[0].split('/')[0] for
i in date.str.split(' - ')])) > 0]
for j, (idx, dt) in enumerate(date.iteritems()):
dt1, dt2 = dt.split(" - ")
year -= changes[j]
dt1 += "/" + str(year)
dt2 += "/" + str(year)
national_data2012.set_value(idx, "start_date", dt1)
national_data2012.set_value(idx, "end_date", dt2)
state_data2012.head(10)
             Pollster State         Date  MoE  Obama (D)  Romney (R)  Sample     Spread  obama_spread start_date   end_date
0   Rasmussen Reports    WA  9/26 - 9/26  4.5         52          41     500  Obama +11            11  9/26/2012  9/26/2012
1    Gravis Marketing    WA  9/21 - 9/22  4.6         56          39     625  Obama +17            17  9/21/2012  9/22/2012
2          Elway Poll    WA   9/9 - 9/12  5.0         53          36     405  Obama +17            17   9/9/2012  9/12/2012
3           SurveyUSA    WA    9/7 - 9/9  4.4         54          38     524  Obama +16            16   9/7/2012   9/9/2012
4           SurveyUSA    WA    8/1 - 8/2  4.4         54          37     524  Obama +17            17   8/1/2012   8/2/2012
5           SurveyUSA    WA  7/16 - 7/18  4.0         46          37     630   Obama +9             9  7/16/2012  7/18/2012
6             PPP (D)    WA  6/14 - 6/17  3.0         54          41    1073  Obama +13            13  6/14/2012  6/17/2012
7          Elway Poll    WA  6/13 - 6/16  5.0         49          41     408   Obama +8             8  6/13/2012  6/16/2012
8  Strategies 360 (D)    WA  5/22 - 5/24  4.4         51          40     500  Obama +11            11  5/22/2012  5/24/2012
9           SurveyUSA    WA    5/8 - 5/9  4.2         50          36     557  Obama +14            14   5/8/2012   5/9/2012
state_data2012.start_date = state_data2012.start_date.apply(pandas.datetools.parse)
state_data2012.end_date = state_data2012.end_date.apply(pandas.datetools.parse)
national_data2012.start_date = national_data2012.start_date.apply(pandas.datetools.parse)
national_data2012.end_date = national_data2012.end_date.apply(pandas.datetools.parse)
def median_date(row):
dates = pandas.date_range(row["start_date"], row["end_date"])
median_idx = int(np.median(range(len(dates)))+.5)
return dates[median_idx]
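As a quick check of the rounding, the Elway Poll row above (9/9 - 9/12) should land on 9/11; since median_date only needs item access, a plain dict can stand in for a row here:

median_date({"start_date": "9/9/2012", "end_date": "9/12/2012"})
# Timestamp('2012-09-11 00:00:00') -- a four-day range rounds up to the third day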
state_data2012["poll_date"] = [median_date(row) for i, row in state_data2012.iterrows()]
del state_data2012["Date"]
del state_data2012["start_date"]
del state_data2012["end_date"]
national_data2012["poll_date"] = [median_date(row) for i, row in national_data2012.iterrows()]
del national_data2012["Date"]
del national_data2012["start_date"]
del national_data2012["end_date"]
state_data2012.head(5)
            Pollster State  MoE  Obama (D)  Romney (R)  Sample     Spread  obama_spread            poll_date
0  Rasmussen Reports    WA  4.5         52          41     500  Obama +11            11  2012-09-26 00:00:00
1   Gravis Marketing    WA  4.6         56          39     625  Obama +17            17  2012-09-22 00:00:00
2         Elway Poll    WA  5.0         53          36     405  Obama +17            17  2012-09-11 00:00:00
3          SurveyUSA    WA  4.4         54          38     524  Obama +16            16  2012-09-08 00:00:00
4          SurveyUSA    WA  4.4         54          37     524  Obama +17            17  2012-08-02 00:00:00
pollsters = state_data2012.Pollster.unique()
pollsters.sort()
len(pollsters)
120
print pandas.Series(pollsters)
0            AFP/Magellan (R)
1          AIF/McLaughlin (R)
2                         ARG
3        Albuquerque Journal*
4               Arizona State
5               Baltimore Sun
...
117              WeAskAmerica
118             WeAskAmerica*
119     Western NE University
Length: 120
weights = pandas.read_table("/home/skipper/school/seaboldgit/talks/pydata/data/pollster_weights.csv")
weights
                    Pollster  Weight   PIE
0      ABC / Washington Post    0.95  1.41
1    American Research Group    0.65  1.76
2       CBS / New York Times    0.66  1.84
3   Chicago Trib. / Marke...    1.16  1.13
4     CNN / Opinion Research    0.77  1.59
5     Columbus Dispatch (OH)    0.50  6.76
6                   EPIC-MRA    0.75  1.65
7   Fairleigh-Dickinson (NJ)    0.71  1.72
8            Field Poll (CA)    1.33  0.88
9     Fox / Opinion Dynamics    0.79  1.60
10      Franklin Pierce (NH)    0.74  1.60
11         Insider Advantage    0.95  1.29
12             Keystone (PA)    0.64  1.55
13      LA Times / Bloomberg    0.83  1.44
14               Marist (NY)    0.69  1.73
15               Mason-Dixon    1.10  1.15
16                  Mitchell    0.96  1.43
17                 Ohio Poll    1.24  1.05
18  Public Opinion Strate...    0.63  1.81
19  Public Policy Polling...    1.05  1.60
20                Quinnipiac    0.95  1.34
21                 Rasmussen    1.30  0.88
22             Research 2000    1.01  1.20
23                    Selzer    1.47  0.92
24         Star Tribune (MN)    0.81  2.01
25          Strategic Vision    0.95  1.45
26           Suffolk (NH/MA)    0.77  1.37
27                 SurveyUSA    1.91  0.72
28       Univ. New Hampshire    1.08  1.26
29        USA Today / Gallup    0.63  2.01
30                     Zogby    0.64  1.72
31         Zogby Interactive    0.43  4.74
weights.mean()
Weight    0.908
PIE       1.707
Clean up the pollster names a bit so we can merge with the weights.
import pickle
pollster_map = pickle.load(open("/home/skipper/school/seaboldgit/talks/pydata/data/pollster_map.pkl", "rb"))
state_data2012.Pollster.replace(pollster_map, inplace=True);
national_data2012.Pollster.replace(pollster_map, inplace=True);
Inner merge the data with the weights.
state_data2012 = state_data2012.merge(weights, how="inner", on="Pollster")
state_data2012.head(5)
                  Pollster State  MoE  Obama (D)  Romney (R)  Sample    Spread  obama_spread            poll_date  Weight   PIE
0  American Research Group    FL  4.0         50          45     600  Obama +5             5  2012-09-21 00:00:00    0.65  1.76
1  American Research Group    NH  4.0         50          45     600  Obama +5             5  2012-09-26 00:00:00    0.65  1.76
2  American Research Group    NH  4.5         48          47     463  Obama +1             1  2012-09-16 00:00:00    0.65  1.76
3  American Research Group    NH  4.2         49          46     417  Obama +3             3  2012-06-23 00:00:00    0.65  1.76
4  American Research Group    NH  4.2         48          41     557  Obama +7             7  2012-03-17 00:00:00    0.65  1.76
The first adjustment is an exponential decay for the recency of the poll. Based on research from prior elections, each poll is assigned a weight that decays with a half-life of 30 days, measured from the median date the poll was in the field.
def exp_decay(days):
# defensive coding, accepts timedeltas
days = getattr(days, "days", days)
return .5 ** (days/30.)
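A couple of spot checks on the half-life behavior:

exp_decay(0), exp_decay(30), exp_decay(60)
# (1.0, 0.5, 0.25) -- full weight today, half at 30 days, a quarter at 60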
fig, ax = plt.subplots(figsize=(12,8), subplot_kw={"xlabel" : "Days",
"ylabel" : "Weight"})
days = np.arange(0, 45)
ax.plot(days, exp_decay(days));
ax.vlines(30, 0, .99, color='r', linewidth=4)
ax.set_ylim(0,1)
ax.set_xlim(0, 45);
The second adjustment is for the sample size of the poll. Polls with a higher sample size receive a higher weight.
Binomial sampling error = $\pm 50 \cdot \frac{1}{\sqrt{nobs}}$, where the 50 depends on the underlying probability or population preferences, in this case assumed to be 50:50. (This is another way of calculating the margin of error.)
def average_error(nobs, p=50.):
return p*nobs**-.5
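For a typical poll of about 600 respondents this gives roughly two points of sampling error:

average_error(600)
# 2.0412414523193148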
The thinking here is that having 5 polls of 1,200 respondents is a lot like having one poll of 6,000. However, we downweight older polls by only including their marginal effective sample size, where the effective sample size is the size of the methodologically perfect poll for which we would be indifferent between it and the poll we actually have, given its total error. Total error is determined as $TE = \text{Average Error} + \text{Long Run Pollster Induced Error}$. See here for the detailed calculations of Pollster Induced Error.
def effective_sample(total_error, p=50.):
return p**2 * (total_error**-2.)
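Note that effective_sample simply inverts average_error, so a poll with no pollster-induced error has an effective sample size equal to its actual sample size:

effective_sample(average_error(600))
# 600.0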
state_pollsters = state_data2012.groupby(["State", "Pollster"])
ppp_az = state_pollsters.get_group(("AZ", "Public Policy Polling (PPP)"))
var_idx = ["Pollster", "State", "Obama (D)", "Romney (R)", "Sample", "poll_date"]
ppp_az[var_idx]
                     Pollster State  Obama (D)  Romney (R)  Sample            poll_date
198  Public Policy Polling...    AZ         44          53     993  2012-09-08 00:00:00
199  Public Policy Polling...    AZ         41          52     833  2012-07-24 00:00:00
200  Public Policy Polling...    AZ         43          50     500  2012-05-19 00:00:00
201  Public Policy Polling...    AZ         47          47     743  2012-02-18 00:00:00
202  Public Policy Polling...    AZ         42          49     500  2011-11-19 00:00:00
203  Public Policy Polling...    AZ         44          48     623  2011-04-30 00:00:00
204  Public Policy Polling...    AZ         43          49     599  2011-01-29 00:00:00
205  Public Policy Polling...    AZ         43          50     617  2010-09-20 00:00:00
ppp_az.sort("poll_date", ascending=False, inplace=True);
ppp_az["cumulative"] = ppp_az["Sample"].cumsum()
ppp_az["average_error"] = average_error(ppp_az["cumulative"])
ppp_az["total_error"] = ppp_az["PIE"] + ppp_az["average_error"]
ppp_az[var_idx + ["cumulative"]]
                     Pollster State  Obama (D)  Romney (R)  Sample            poll_date  cumulative
198  Public Policy Polling...    AZ         44          53     993  2012-09-08 00:00:00         993
199  Public Policy Polling...    AZ         41          52     833  2012-07-24 00:00:00        1826
200  Public Policy Polling...    AZ         43          50     500  2012-05-19 00:00:00        2326
201  Public Policy Polling...    AZ         47          47     743  2012-02-18 00:00:00        3069
202  Public Policy Polling...    AZ         42          49     500  2011-11-19 00:00:00        3569
203  Public Policy Polling...    AZ         44          48     623  2011-04-30 00:00:00        4192
204  Public Policy Polling...    AZ         43          49     599  2011-01-29 00:00:00        4791
205  Public Policy Polling...    AZ         43          50     617  2010-09-20 00:00:00        5408
ppp_az["ESS"] = effective_sample(ppp_az["total_error"])
ppp_az["MESS"] = ppp_az["ESS"].diff()
# fill in first one
ppp_az["MESS"].fillna(ppp_az["ESS"].head(1).item(), inplace=True);
ppp_az[["poll_date", "Sample", "cumulative", "ESS", "MESS"]]
               poll_date  Sample  cumulative      ESS     MESS
198  2012-09-08 00:00:00     993         993  246.182  246.182
199  2012-07-24 00:00:00     833        1826  325.801   79.618
200  2012-05-19 00:00:00     500        2326  359.591   33.791
201  2012-02-18 00:00:00     743        3069  399.185   39.594
202  2011-11-19 00:00:00     500        3569  420.968   21.783
203  2011-04-30 00:00:00     623        4192  444.241   23.273
204  2011-01-29 00:00:00     599        4791  463.531   19.291
205  2010-09-20 00:00:00     617        5408  480.955   17.424
Now let's do it for every polling firm in every state.
def calculate_mess(group):
cumulative = group["Sample"].cumsum()
ae = average_error(cumulative)
total_error = ae + group["PIE"]
ess = effective_sample(total_error)
mess = ess.diff()
mess.fillna(ess.head(1).item(), inplace=True)
#from IPython.core.debugger import Pdb; Pdb().set_trace()
return pandas.concat((ess, mess), axis=1)
#state_data2012["ESS", "MESS"]
df = state_pollsters.apply(calculate_mess)
df.rename(columns={0 : "ESS", 1 : "MESS"}, inplace=True);
state_data2012 = state_data2012.join(df)
Give each poll its time weight.
td = today - state_data2012["poll_date"].head(1).item()
state_data2012["poll_date"].head(1).item()
<Timestamp: 2012-09-21 00:00:00>
td
datetime.timedelta(11)
state_data2012["time_weight"] = (today - state_data2012["poll_date"]).apply(exp_decay)
Now aggregate all of these polls, weighting each by its marginal effective sample size as well as its time_weight.
def weighted_mean(group):
weights1 = group["time_weight"]
weights2 = group["MESS"]
return np.sum(weights1*weights2*group["obama_spread"]/(weights1*weights2).sum())
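A tiny worked example with two hypothetical polls (made-up numbers: spreads of 10 and 2, time weights of 1.0 and 0.5, MESS of 246 and 80):

toy = pandas.DataFrame({"obama_spread": [10., 2.],
                        "time_weight": [1., .5],
                        "MESS": [246., 80.]})
weighted_mean(toy)  # (1*246*10 + .5*80*2) / (1*246 + .5*80)
# 8.8811188811188813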
state_pollsters = state_data2012.groupby(["State", "Pollster"])
state_polls = state_pollsters.apply(weighted_mean)
state_polls
State  Pollster
AZ     Public Policy Polling (PPP)     -9.168
       Rasmussen                      -10.209
CA     Field Poll (CA)                 23.344
       Public Policy Polling (PPP)     20.999
       Rasmussen                       22.000
       SurveyUSA                       22.123
...
WA     Public Policy Polling (PPP)     13.051
       Rasmussen                       11.000
       SurveyUSA                       15.310
WI     CNN / Opinion Research           4.000
       Public Policy Polling (PPP)      5.393
       Rasmussen                        2.116
WV     Public Policy Polling (PPP)    -19.757
Length: 109
state_data2004 = pandas.read_csv("/home/skipper/school/seaboldgit/talks/pydata/data/2004-pres-polls.csv")
state_data2004
<class 'pandas.core.frame.DataFrame'>
Int64Index: 879 entries, 0 to 878
Data columns:
State       879  non-null values
Kerry       879  non-null values
Bush        879  non-null values
Date        879  non-null values
Pollster    879  non-null values
dtypes: int64(2), object(3)
state_data2004.head(5)
  State  Kerry  Bush    Date        Pollster
0    AL     39    57  Oct 25       SurveyUSA
1    AL     32    56  Oct 12  Capital Survey
2    AL     34    62  Oct 01       SurveyUSA
3    AL     40    54  Sep 14             ARG
4    AL     42    53  Sep 06       Rasmussen
state_data2008 = pandas.read_csv("/home/skipper/school/seaboldgit/talks/pydata/data/2008-pres-polls.csv")
state_data2008
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1189 entries, 0 to 1188
Data columns:
State       1189  non-null values
Obama       1189  non-null values
McCain      1189  non-null values
Start       1189  non-null values
End         1189  non-null values
Pollster    1189  non-null values
dtypes: int64(2), object(4)
state_data2008.head(5)
  State  Obama  McCain   Start     End        Pollster
0    AL     36      61  Oct 27  Oct 28       SurveyUSA
1    AL     34      54  Oct 15  Oct 16  Capital Survey
2    AL     35      62  Oct 08  Oct 09       SurveyUSA
3    AL     35      55  Oct 06  Oct 07  Capital Survey
4    AL     39      60  Sep 22  Sep 22       Rasmussen
state_data2008.End + " 2008"
0       Oct 28 2008
1       Oct 16 2008
2       Oct 09 2008
3       Oct 07 2008
4       Sep 22 2008
...
1186    Sep 10 2008
1187    Aug 15 2008
1188    Feb 28 2008
Name: End, Length: 1189
(state_data2008.End + " 2008").apply(pandas.datetools.parse)
0       2008-10-28 00:00:00
1       2008-10-16 00:00:00
2       2008-10-09 00:00:00
3       2008-10-07 00:00:00
4       2008-09-22 00:00:00
...
1186    2008-09-10 00:00:00
1187    2008-08-15 00:00:00
1188    2008-02-28 00:00:00
Name: End, Length: 1189
Need to clean some of the dates in this data. Luckily, pandas makes this easy to do.
state_data2004.Date = state_data2004.Date.str.replace("Nov 00", "Nov 01")
state_data2004.Date = state_data2004.Date.str.replace("Oct 00", "Oct 01")
state_data2008["poll_date"] = (state_data2008.End + " 2008").apply(pandas.datetools.parse)
state_data2004["poll_date"] = (state_data2004.Date + " 2004").apply(pandas.datetools.parse)
del state_data2008["End"]
del state_data2008["Start"]
del state_data2004["Date"]
state_data2008
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1189 entries, 0 to 1188
Data columns:
State        1189  non-null values
Obama        1189  non-null values
McCain       1189  non-null values
Pollster     1189  non-null values
poll_date    1189  non-null values
dtypes: int64(2), object(3)
state_data2004
<class 'pandas.core.frame.DataFrame'>
Int64Index: 879 entries, 0 to 878
Data columns:
State        879  non-null values
Kerry        879  non-null values
Bush         879  non-null values
Pollster     879  non-null values
poll_date    879  non-null values
dtypes: int64(2), object(3)
state_groups = state_data2008.groupby("State")
state_groups.aggregate(dict(Obama=np.mean, McCain=np.mean))
       McCain   Obama
State
AK     52.000  39.429
AL     56.826  34.348
AR     51.000  37.250
AZ     49.333  39.190
CA     37.633  53.267
...
WA     40.424  51.515
WI     41.921  49.684
WV     48.692  42.538
WY     59.333  32.667
Means for the entire country (without weighting by population)
state_groups.aggregate(dict(Obama=np.mean, McCain=np.mean)).mean()
McCain    45.338
Obama     46.082
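A population-weighted version would weight each state's mean by the size of its electorate. A minimal sketch, assuming the census vote_pop column and a mapping from the two-letter abbreviations used here to the full state names in census_data (the states_abbrev_dict defined further down would do):

# sketch only, not run here -- assumes abbreviations mapped to full names
# means = state_groups.aggregate(dict(Obama=np.mean, McCain=np.mean))
# means.index = means.index.map(states_abbrev_dict.get)
# pop = census_data["vote_pop"].reindex(means.index)
# means.mul(pop, axis=0).sum() / pop.sum()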
state_data2004.Pollster.replace(pollster_map, inplace=True)
state_data2008.Pollster.replace(pollster_map, inplace=True);
state_data2004 = state_data2004.merge(weights, how="inner", on="Pollster")
state_data2008 = state_data2008.merge(weights, how="inner", on="Pollster")
len(state_data2004.Pollster.unique())
26
len(state_data2008.Pollster.unique())
21
import datetime
date2004 = datetime.datetime(2004, 11, 2)
date2004
datetime.datetime(2004, 11, 2, 0, 0)
(date2004 - state_data2004.poll_date) < datetime.timedelta(21)
0      False
1      False
2      False
...
9       True
10      True
...
724     True
725     True
...
732    False
733    False
Name: poll_date, Length: 734
Restrict the samples to the three weeks leading up to the election.
state_data2004 = state_data2004.ix[(date2004 - state_data2004.poll_date) <= datetime.timedelta(21)]
state_data2004.reset_index(drop=True, inplace=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 213 entries, 0 to 212
Data columns:
State        213  non-null values
Kerry        213  non-null values
Bush         213  non-null values
Pollster     213  non-null values
poll_date    213  non-null values
Weight       213  non-null values
PIE          213  non-null values
dtypes: float64(2), int64(2), object(3)
date2008 = datetime.datetime(2008, 11, 4)
state_data2008 = state_data2008.ix[(date2008 - state_data2008.poll_date) <= datetime.timedelta(21)]
state_data2008.reset_index(drop=True, inplace=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 0 to 209
Data columns:
State        210  non-null values
Obama        210  non-null values
McCain       210  non-null values
Pollster     210  non-null values
poll_date    210  non-null values
Weight       210  non-null values
PIE          210  non-null values
dtypes: float64(2), int64(2), object(3)
state_data2004["time_weight"] =(date2004 - state_data2004.poll_date).apply(exp_decay)
state_data2008["time_weight"] =(date2008 - state_data2008.poll_date).apply(exp_decay)
state_data2004[["time_weight", "poll_date"]].head(5)
   time_weight            poll_date
0        0.955  2004-10-31 00:00:00
1        0.794  2004-10-23 00:00:00
2        0.831  2004-10-25 00:00:00
3        0.955  2004-10-31 00:00:00
4        0.891  2004-10-28 00:00:00
def max_date(x):
    # flag the most recent poll within each (State, Pollster) group
    return x == x.max()
state_data2004["newest_poll"] = state_data2004.groupby(("State", "Pollster")).poll_date.transform(max_date)
state_data2008["newest_poll"] = state_data2008.groupby(("State", "Pollster")).poll_date.transform(max_date)
Partisan Voting Index data obtained from Wikipedia.
pvi = pandas.read_csv("/home/skipper/school/seaboldgit/talks/pydata/data/partisan_voting.csv")
pvi.set_index("State", inplace=True);
pvi
                       PVI
State
Alabama               R+13
Alaska                R+13
Arizona                R+6
Arkansas               R+9
California             D+7
Colorado              EVEN
...
West Virginia          R+8
Wisconsin              D+2
Wyoming               R+20
pvi.PVI = pvi.PVI.replace({"EVEN" : "0"})
pvi.PVI = pvi.PVI.str.replace("R\+", "-")
pvi.PVI = pvi.PVI.str.replace("D\+", "")
pvi.PVI = pvi.PVI.astype(float)
pvi.PVI
State
Alabama                -13
Alaska                 -13
Arizona                 -6
Arkansas                -9
California               7
Colorado                 0
...
West Virginia           -8
Wisconsin                2
Wyoming                -20
Name: PVI, Length: 51
Party affiliation of the electorate obtained from Gallup.
party_affil = pandas.read_csv("/home/skipper/school/seaboldgit/talks/pydata/data/gallup_electorate.csv")
party_affil.Democrat = party_affil.Democrat.str.replace("%", "").astype(float)
party_affil.Republican = party_affil.Republican.str.replace("%", "").astype(float)
party_affil.set_index("State", inplace=True);
party_affil.rename(columns={"Democrat Advantage" : "dem_adv"}, inplace=True);
party_affil["no_party"] = 100 - party_affil.Democrat - party_affil.Republican
party_affil
                      Democrat  Republican  dem_adv      N  no_party
State
District of Columbia      79.0        12.7    66.30    416       8.3
Rhode Island              52.5        26.5    26.00    623      21.0
Hawaii                    54.3        28.7    25.60    466      17.0
New York                  52.0        30.8    21.20   8674      17.2
Maryland                  54.0        33.8    20.20   3571      12.2
...
Wyoming                   26.7        56.6   -29.90    600      16.7
Idaho                     27.5        57.8   -30.30   1336      14.7
Utah                      24.5        63.8   -39.30   2256      11.7
census_data = pandas.read_csv("/home/skipper/school/seaboldgit/talks/pydata/data/census_demographics.csv")
def capitalize(s):
s = s.title()
s = s.replace("Of", "of")
return s
census_data["State"] = census_data.state.map(capitalize)
del census_data["state"]
census_data.set_index("State", inplace=True)
               per_black  per_hisp  per_white  educ_hs  educ_coll  average_income  median_income  pop_density     vote_pop    older_pop
State
Alabama             26.5       4.0       66.8     81.4       21.7           22984          42081         94.4  3001712.500   672383.600
Alaska               3.6       5.8       63.7     90.7       27.0           30726          66521          1.2   475548.444    58540.158
Arizona              4.5      30.1       57.4     85.0       26.3           25680          50448         56.3  3934880.535   920515.710
...
West Virginia        3.5       1.3       93.0     81.9       17.3           21232          38380         77.1  1170734.684   300568.968
Wisconsin            6.5       6.1       83.1     89.4       25.8           26624          51598        105.0  3592701.443   793935.613
Wyoming              1.1       9.1       85.5     91.3       23.6           27860          53802          5.8   361348.488    72156.066
#loadpy https://raw.github.com/gist/3912533/d958b515f602f6e73f7b16d8bc412bc8d1f433d9/state_abbrevs.py;
states_abbrev_dict = {
'AK': 'Alaska',
'AL': 'Alabama',
'AR': 'Arkansas',
'AS': 'American Samoa',
'AZ': 'Arizona',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DC': 'District of Columbia',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'GU': 'Guam',
'HI': 'Hawaii',
'IA': 'Iowa',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'MA': 'Massachusetts',
'MD': 'Maryland',
'ME': 'Maine',
'MI': 'Michigan',
'MN': 'Minnesota',
'MO': 'Missouri',
'MP': 'Northern Mariana Islands',
'MS': 'Mississippi',
'MT': 'Montana',
'NA': 'National',
'NC': 'North Carolina',
'ND': 'North Dakota',
'NE': 'Nebraska',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NV': 'Nevada',
'NY': 'New York',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'PR': 'Puerto Rico',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VA': 'Virginia',
'VI': 'Virgin Islands',
'VT': 'Vermont',
'WA': 'Washington',
'WI': 'Wisconsin',
'WV': 'West Virginia',
'WY': 'Wyoming'
}
Campaign Contributions from FEC.
obama_give = pandas.read_csv("/home/skipper/school/seaboldgit/talks/pydata/data/obama_indiv_state.csv",
header=None, names=["State", "obama_give"])
romney_give = pandas.read_csv("/home/skipper/school/seaboldgit/talks/pydata/data/romney_indiv_state.csv",
header=None, names=["State", "romney_give"])
obama_give.State.replace(states_abbrev_dict, inplace=True);
romney_give.State.replace(states_abbrev_dict, inplace=True);
obama_give.set_index("State", inplace=True)
romney_give.set_index("State", inplace=True);
demo_data = census_data.join(party_affil[["dem_adv", "no_party"]]).join(pvi)
demo_data = demo_data.join(obama_give).join(romney_give)
giving = demo_data[["obama_give", "romney_give"]].div(demo_data[["vote_pop", "older_pop"]].sum(1), axis=0)
giving
                      obama_give  romney_give
State
Alabama                    0.245        0.366
Alaska                     1.112        0.499
Arizona                    0.569        0.673
Arkansas                   0.247        0.217
California                 1.128        0.618
...
Washington                 1.191        0.476
West Virginia              0.260        0.321
Wisconsin                  0.455        0.238
Wyoming                    0.746        1.080
demo_data[["obama_give", "romney_give"]] = giving
from scipy import cluster as sp_cluster
from sklearn import cluster, neighbors
clean_data = sp_cluster.vq.whiten(demo_data.values)
clean_data.var(axis=0)
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
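whiten simply divides each column by its (population) standard deviation, which is why all of the variances above come out to 1:

np.allclose(clean_data, demo_data.values / demo_data.values.std(axis=0))
# True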
KNN = neighbors.NearestNeighbors(n_neighbors=7)
KNN.fit(clean_data)
KNN.kneighbors(clean_data[0], return_distance=True)
(array([[ 0. , 0.8763, 1.0233, 1.3971, 1.8694, 2.1093, 2.2603]]), array([[ 0, 40, 18, 42, 24, 33, 10]], dtype=int32))
idx = _[1]
demo_data.index[0], demo_data.index[idx]
('Alabama', array([['Alabama', 'South Carolina', 'Louisiana', 'Tennessee', 'Mississippi', 'North Carolina', 'Georgia']], dtype=object))
nearest_neighbor = {}
for i, state in enumerate(demo_data.index):
neighborhood = KNN.kneighbors(clean_data[i], return_distance=True)
nearest_neighbor.update({state : (demo_data.index[neighborhood[1]],
neighborhood[0])})
nearest_neighbor
{'Alabama': (array([['Alabama', 'South Carolina', 'Louisiana', 'Tennessee',
                     'Mississippi', 'North Carolina', 'Georgia']], dtype=object),
             array([[ 0.    ,  0.8763,  1.0233,  1.3971,  1.8694,  2.1093,  2.2603]])),
 'Alaska': (array([['Alaska', 'Wyoming', 'Nebraska', 'Washington', 'Kansas',
                    'Delaware', 'Colorado']], dtype=object),
            array([[ 0.    ,  3.3349,  3.6854,  3.7395,  3.7689,  3.7821,  3.8213]])),
 ...
 'Wyoming': (array([['Wyoming', 'Idaho', 'Nebraska', 'Kansas', 'North Dakota',
                     'Montana', 'Alaska']], dtype=object),
             array([[ 0.    ,  2.0787,  2.4089,  2.6214,  2.6543,  2.8861,  3.3349]]))}
# cluster the states on the demographic data with k-means,
# once with scikit-learn (which keeps the labels and centers on the estimator)...
k_means = cluster.KMeans(n_clusters=5, n_init=50)
k_means.fit(clean_data)
values = k_means.cluster_centers_.squeeze()
labels = k_means.labels_
# ...and once with scipy, which returns the centroids themselves
clusters = sp_cluster.vq.kmeans(clean_data, 5)[0]
def choose_group(data, clusters):
    """
    Return the index of the cluster to which the rows in data
    are "closest" (in the sense of the L2-norm)
    """
    data = data[:, None]  # add an axis for broadcasting
    distances = data - clusters
    groups = []
    for row in distances:
        dists = map(np.linalg.norm, row)
        groups.append(np.argmin(dists))
    return groups
groups = choose_group(clean_data, clusters)
np.array(groups)
array([1, 0, 3, 1, 4, 0, 0, 0, 2, 4, 1, 0, 3, 0, 3, 3, 3, 1, 1, 3, 0, 0, 3, 0, 1, 3, 3, 3, 0, 0, 0, 1, 4, 1, 3, 3, 1, 3, 3, 0, 1, 3, 1, 4, 3, 0, 0, 0, 1, 3, 3])
Or use a one-liner
groups = [np.argmin(map(np.linalg.norm, (clean_data[:,None] - clusters)[i])) for i in range(51)]
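scipy can also do this assignment step directly with its vector-quantization routine; a minimal sketch using the centroids computed above:
codes, dists = sp_cluster.vq.vq(clean_data, clusters)  # index of the nearest centroid and its distance, per state
codes should agree with the groups computed by hand.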
demo_data["kmeans_group"] = groups
demo_data["kmeans_labels"] = labels
for _, group in demo_data.groupby("kmeans_group"):
group = group.index
group.values.sort()
print group.values
['Alaska' 'Colorado' 'Connecticut' 'Delaware' 'Hawaii' 'Illinois' 'Maryland' 'Massachusetts' 'Minnesota' 'Nevada' 'New Hampshire' 'New Jersey' 'Rhode Island' 'Vermont' 'Virginia' 'Washington']
['Alabama' 'Arkansas' 'Georgia' 'Kentucky' 'Louisiana' 'Mississippi' 'New Mexico' 'North Carolina' 'Oklahoma' 'South Carolina' 'Tennessee' 'West Virginia']
['District of Columbia']
['Arizona' 'Idaho' 'Indiana' 'Iowa' 'Kansas' 'Maine' 'Michigan' 'Missouri' 'Montana' 'Nebraska' 'North Dakota' 'Ohio' 'Oregon' 'Pennsylvania' 'South Dakota' 'Utah' 'Wisconsin' 'Wyoming']
['California' 'Florida' 'New York' 'Texas']
labels
array([0, 1, 0, 0, 3, 1, 1, 1, 2, 3, 0, 1, 4, 1, 4, 4, 4, 0, 0, 4, 1, 1, 4, 4, 0, 4, 4, 4, 1, 4, 1, 0, 3, 0, 4, 4, 0, 4, 4, 1, 0, 4, 0, 3, 4, 4, 1, 1, 0, 4, 4])
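As an aside, the fitted scikit-learn estimator can reproduce these assignments itself; predicting on the training data should give back the same labels:
k_means.predict(clean_data)  # nearest-cluster index for each state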
demo_data["kmeans_labels"] = labels
for _, group in demo_data.groupby("kmeans_labels"):
group = group.index.copy()
group.values.sort()
print group.values
['Alabama' 'Arizona' 'Arkansas' 'Georgia' 'Kentucky' 'Louisiana' 'Mississippi' 'New Mexico' 'North Carolina' 'Oklahoma' 'South Carolina' 'Tennessee' 'West Virginia']
['Alaska' 'Colorado' 'Connecticut' 'Delaware' 'Hawaii' 'Illinois' 'Maryland' 'Massachusetts' 'Nevada' 'New Jersey' 'Rhode Island' 'Virginia' 'Washington']
['District of Columbia']
['California' 'Florida' 'New York' 'Texas']
['Idaho' 'Indiana' 'Iowa' 'Kansas' 'Maine' 'Michigan' 'Minnesota' 'Missouri' 'Montana' 'Nebraska' 'New Hampshire' 'North Dakota' 'Ohio' 'Oregon' 'Pennsylvania' 'South Dakota' 'Utah' 'Vermont' 'Wisconsin' 'Wyoming']
demo_data = demo_data.reset_index()
state_data2012.State.replace(states_abbrev_dict, inplace=True);
state_data2012 = state_data2012.merge(demo_data[["State", "kmeans_labels"]], on="State")
kmeans_groups = state_data2012.groupby("kmeans_labels")
group = kmeans_groups.get_group(kmeans_groups.groups.keys()[2])
group.State.unique()
array(['California', 'Florida', 'New York', 'Texas'], dtype=object)
def edit_tick_label(tick_val, tick_pos):
    if tick_val < 0:
        text = str(int(tick_val)).replace("-", "Romney+")
    else:
        text = "Obama+" + str(int(tick_val))
    return text
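A quick sanity check on the formatter (the second argument is the tick position, which we ignore):
print edit_tick_label(-4, None)  # Romney+4
print edit_tick_label(4, None)   # Obama+4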
from pandas import lib
from matplotlib.ticker import FuncFormatter
fig, axes = plt.subplots(figsize=(12,8))
data = group[["poll_date", "obama_spread"]]
data = pandas.concat((data, national_data2012[["poll_date", "obama_spread"]]))
data.sort("poll_date", inplace=True)
dates = pandas.DatetimeIndex(data.poll_date).asi8
loess_res = sm.nonparametric.lowess(data.obama_spread.values, dates,
                                    frac=.2, it=3)
dates_x = lib.ints_to_pydatetime(dates)
axes.scatter(dates_x, data["obama_spread"])
axes.plot(dates_x, loess_res[:,1], color='r')
axes.yaxis.get_major_locator().set_params(nbins=12)
axes.yaxis.set_major_formatter(FuncFormatter(edit_tick_label))
axes.grid(False, axis='x')
axes.hlines(0, dates_x[0], dates_x[-1], color='black', lw=3)
axes.margins(0, .05)
loess_res[-7:,1].mean()  # average of the smoothed spread over the last seven polls
2.3144535643345003
from pandas import lib
from matplotlib.ticker import FuncFormatter
fig, axes = plt.subplots(figsize=(12,8))
national_data2012.sort("poll_date", inplace=True)
dates = pandas.DatetimeIndex(national_data2012.poll_date).asi8
loess_res = sm.nonparametric.lowess(national_data2012.obama_spread.values, dates,
                                    frac=.075, it=3)
dates_x = lib.ints_to_pydatetime(dates)
axes.scatter(dates_x, national_data2012["obama_spread"])
axes.plot(dates_x, loess_res[:,1], color='r')
axes.yaxis.get_major_locator().set_params(nbins=12)
axes.yaxis.set_major_formatter(FuncFormatter(edit_tick_label))
axes.grid(False, axis='x')
axes.hlines(0, dates_x[0], dates_x[-1], color='black', lw=3)
axes.margins(0, .05)
trends = []
for i, group in kmeans_groups:
    data = group[["poll_date", "obama_spread"]]
    data = pandas.concat((data, national_data2012[["poll_date", "obama_spread"]]))
    data.sort("poll_date", inplace=True)
    dates = pandas.DatetimeIndex(data.poll_date).asi8
    loess_res = sm.nonparametric.lowess(data.obama_spread.values, dates,
                                        frac=.1, it=3)
    states = group.State.unique()
    for state in states:
        trends.append([state, loess_res[-7:,1].mean()])
trends
[['Arizona', 2.3149200179538716], ['Georgia', 2.3149200179538716], ['Mississippi', 2.3149200179538716], ['New Mexico', 2.3149200179538716], ['North Carolina', 2.3149200179538716], ['South Carolina', 2.3149200179538716], ['Tennessee', 2.3149200179538716], ['West Virginia', 2.3149200179538716],
 ['Colorado', 18.412063676088412], ['Connecticut', 18.412063676088412], ['Hawaii', 18.412063676088412], ['Illinois', 18.412063676088412], ['Maryland', 18.412063676088412], ['Massachusetts', 18.412063676088412], ['Nevada', 18.412063676088412], ['New Jersey', 18.412063676088412], ['Rhode Island', 18.412063676088412], ['Virginia', 18.412063676088412], ['Washington', 18.412063676088412],
 ['California', 2.73263672729736], ['Florida', 2.73263672729736], ['New York', 2.73263672729736], ['Texas', 2.73263672729736],
 ['Indiana', 6.5865280433068092], ['Iowa', 6.5865280433068092], ['Kansas', 6.5865280433068092], ['Maine', 6.5865280433068092], ['Michigan', 6.5865280433068092], ['Minnesota', 6.5865280433068092], ['Missouri', 6.5865280433068092], ['Montana', 6.5865280433068092], ['Nebraska', 6.5865280433068092], ['New Hampshire', 6.5865280433068092], ['North Dakota', 6.5865280433068092], ['Ohio', 6.5865280433068092], ['Oregon', 6.5865280433068092], ['Pennsylvania', 6.5865280433068092], ['South Dakota', 6.5865280433068092], ['Utah', 6.5865280433068092], ['Vermont', 6.5865280433068092], ['Wisconsin', 6.5865280433068092]]
The trend adjustment fits the fixed-effects model
$$\text{Margin}=X_i+Z_t$$
where $X_i$ are Pollster:State dummies and $Z_t$ are poll-date dummies. In a state with a time-dependent trend, you might instead write
$$\text{Margin}=X_i+mZ_t$$
where $m$ is a multiplier representing uncertainty in the time-trend parameter. The multiplier is then computed as
$$m=\text{Margin}-\frac{X_i}{Z_t}$$
from statsmodels.formula.api import ols, wls
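To make the arithmetic concrete, a toy check with made-up numbers (a poll at Obama+2, a pollster:state effect of 3, and a date effect of 1.5, all hypothetical):
margin, X, Z = 2.0, 3.0, 1.5
m = margin - X / Z  # the same computation applied column-wise below
print m             # 0.0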
#pollster_state_dummy = state_data2012.groupby(["Pollster", "State"])["obama_spread"].mean()
#daily_dummy = state_data2012.groupby(["poll_date"])["obama_spread"].mean()
state_data2012["pollster_state"] = state_data2012["Pollster"] + "-" + state_data2012["State"]
There's actually a bug in pandas when you merge on datetimes. In order to avoid it, we need to sort our data now and once again after we merge on dates.
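The workaround pattern, sketched with hypothetical frames df1 and df2 that share a datetime column:
df1.sort(columns="poll_date", inplace=True)     # sort before merging on dates
merged = df1.merge(df2, on="poll_date", sort=False)
merged.sort(columns="poll_date", inplace=True)  # and sort once more afterwards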
state_data2012.sort(columns=["pollster_state", "poll_date"], inplace=True);
dummy_model = ols("obama_spread ~ C(pollster_state) + C(poll_date)", data=state_data2012).fit()
The base case (the omitted dummy) is American Research Group-Colorado.
state_data2012.irow(0)
Pollster          American Research Group
State             Colorado
MoE               4
Obama (D)         49
Romney (R)        47
Sample            600
Spread            Obama +2
obama_spread      2
poll_date         2012-09-11 00:00:00
Weight            0.65
PIE               1.76
ESS               173
MESS              173
time_weight       0.6156
kmeans_labels     1
pollster_state    American Research Gro...
Name: 25
pollster_state = state_data2012["pollster_state"].unique()
pollster_state.sort()
pollster_state_params = dummy_model.params[1:len(pollster_state)] + dummy_model.params[0]
intercept = dummy_model.params[0]
X = pandas.DataFrame(zip(pollster_state, np.r_[intercept, pollster_state_params]),
                     columns=["pollster_state", "X"])
dates = state_data2012.poll_date.unique()
dates.sort()
dates_params = intercept + dummy_model.params[-len(dates):]
Z = pandas.DataFrame(zip(dates, dates_params), columns=["poll_date", "Z"])
Keep only the date effects whose absolute value is greater than 1; drop the rest.
Z = Z.ix[np.abs(Z.Z) > 1]
state_data2012 = state_data2012.merge(X, on="pollster_state", sort=False)
state_data2012 = state_data2012.merge(Z, on="poll_date", sort=False)
state_data2012.sort(columns=["pollster_state", "poll_date"], inplace=True);
state_data2012["m"] = state_data2012["obama_spread"].sub(state_data2012["X"].div(state_data2012["Z"]))
#m_dataframe.ix[m_dataframe.pollster_state == "American Research Group-New Hampshire"].values
m_dataframe = state_data2012[["State", "m", "poll_date", "Pollster", "pollster_state"]]
m_dataframe["m"].describe()
count    355.000
mean       3.281
std        9.168
min      -52.000
25%       -0.808
50%        2.697
75%        8.145
max       38.723
m_size = m_dataframe.groupby("pollster_state").size()
m_size
pollster_state
American Research Group-Colorado    1
American Research Group-Florida    1
American Research Group-Iowa    1
American Research Group-Nevada    1
American Research Group-New Hampshire    3
American Research Group-North Carolina    1
American Research Group-Ohio    1
American Research Group-Virginia    1
CNN / Opinion Research-Wisconsin    1
Chicago Trib. / MarketShares-Illinois    1
Columbus Dispatch (OH)-Ohio    2
EPIC-MRA-Michigan    8
Fairleigh-Dickinson (NJ)-New Jersey    3
Field Poll (CA)-California    6
Insider Advantage-Georgia    2
LA Times / Bloomberg-New Hampshire    1
Marist (NY)-New York    3
Mason-Dixon-Florida    3
Mason-Dixon-Georgia    1
Mason-Dixon-New Hampshire    1
Mason-Dixon-North Dakota    1
Mason-Dixon-Utah    1
Mason-Dixon-Virginia    1
Mitchell-Michigan    3
Ohio Poll-Ohio    2
Public Policy Polling (PPP)-Arizona    7
Public Policy Polling (PPP)-California    2
Public Policy Polling (PPP)-Colorado    6
Public Policy Polling (PPP)-Connecticut    3
Public Policy Polling (PPP)-Florida    8
Public Policy Polling (PPP)-Georgia    1
Public Policy Polling (PPP)-Hawaii    1
Public Policy Polling (PPP)-Iowa    8
Public Policy Polling (PPP)-Maine    2
Public Policy Polling (PPP)-Maryland    1
Public Policy Polling (PPP)-Massachusetts    6
Public Policy Polling (PPP)-Michigan    6
Public Policy Polling (PPP)-Minnesota    5
Public Policy Polling (PPP)-Mississippi    2
Public Policy Polling (PPP)-Missouri    7
Public Policy Polling (PPP)-Montana    3
Public Policy Polling (PPP)-Nebraska    1
Public Policy Polling (PPP)-Nevada    4
Public Policy Polling (PPP)-New Hampshire    3
Public Policy Polling (PPP)-New Mexico    6
Public Policy Polling (PPP)-North Carolina    22
Public Policy Polling (PPP)-Ohio    9
Public Policy Polling (PPP)-Oregon    2
Public Policy Polling (PPP)-Pennsylvania    5
Public Policy Polling (PPP)-Rhode Island    1
Public Policy Polling (PPP)-South Carolina    3
Public Policy Polling (PPP)-South Dakota    1
Public Policy Polling (PPP)-Tennessee    1
Public Policy Polling (PPP)-Texas    3
Public Policy Polling (PPP)-Utah    1
Public Policy Polling (PPP)-Virginia    7
Public Policy Polling (PPP)-Washington    3
Public Policy Polling (PPP)-West Virginia    3
Public Policy Polling (PPP)-Wisconsin    6
Quinnipiac-Connecticut    4
Quinnipiac-Florida    12
Quinnipiac-New Jersey    8
Quinnipiac-New York    5
Quinnipiac-Ohio    11
Quinnipiac-Pennsylvania    9
Quinnipiac-Virginia    5
Rasmussen-Arizona    3
Rasmussen-California    1
Rasmussen-Colorado    3
Rasmussen-Connecticut    1
Rasmussen-Florida    5
Rasmussen-Indiana    1
Rasmussen-Iowa    3
Rasmussen-Maine    1
Rasmussen-Massachusetts    4
Rasmussen-Michigan    2
Rasmussen-Missouri    6
Rasmussen-Montana    5
Rasmussen-Nebraska    2
Rasmussen-Nevada    3
Rasmussen-New Hampshire    1
Rasmussen-New Jersey    1
Rasmussen-New Mexico    3
Rasmussen-North Carolina    4
Rasmussen-North Dakota    1
Rasmussen-Ohio    7
Rasmussen-Pennsylvania    4
Rasmussen-Virginia    5
Rasmussen-Washington    1
Rasmussen-Wisconsin    7
Suffolk (NH/MA)-Florida    2
SurveyUSA-California    4
SurveyUSA-Florida    2
SurveyUSA-Georgia    4
SurveyUSA-Kansas    2
SurveyUSA-Michigan    1
SurveyUSA-New Jersey    1
SurveyUSA-New York    1
SurveyUSA-North Carolina    2
SurveyUSA-Oregon    4
SurveyUSA-Pennsylvania    1
SurveyUSA-Washington    4
Length: 102
drop_idx = m_size.ix[m_size == 1]
# drop the pollster:state pairs that only have a single poll
m_dataframe = m_dataframe.ix[~m_dataframe.pollster_state.isin(drop_idx.index)]