This is an example of logistic regression in Python with the scikit-learn module, performed for an assignment with my General Assembly Data Science class.

The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.

The dataset contains 6366 observations of 9 variables:

`rate_marriage`

: woman's rating of her marriage (1 = very poor, 5 = very good)`age`

: woman's age`yrs_married`

: number of years married`children`

: number of children`religious`

: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)`educ`

: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)`occupation`

: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)`occupation_husb`

: husband's occupation (same coding as above)`affairs`

: time spent in extra-marital affairs

I decided to treat this as a classification problem by creating a new binary variable `affair`

(did the woman have at least one affair?) and trying to predict the classification for each woman.

Skipper Seabold, one of the primary contributors to Statsmodels, did a similar classification in his Statsmodels demo at a Statistical Programming DC Meetup. However, he used Statsmodels for the classification (whereas I'm using scikit-learn), and he treated the occupation variables as continuous (whereas I'm treating them as categorical).

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score
```

First, let's load the dataset and add a binary `affair`

column.

In [2]:

```
# load dataset
dta = sm.datasets.fair.load_pandas().data
# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)
```

In [3]:

```
dta.groupby('affair').mean()
```

Out[3]:

rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|

affair | |||||||||

0 | 4.329701 | 28.390679 | 7.989335 | 1.238813 | 2.504521 | 14.322977 | 3.405286 | 3.833758 | 0.000000 |

1 | 3.647345 | 30.537019 | 11.152460 | 1.728933 | 2.261568 | 13.972236 | 3.463712 | 3.884559 | 2.187243 |

`rate_marriage`

variable.

In [4]:

```
dta.groupby('rate_marriage').mean()
```

Out[4]:

age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | affair | |
---|---|---|---|---|---|---|---|---|---|

rate_marriage | |||||||||

1 | 33.823232 | 13.914141 | 2.308081 | 2.343434 | 13.848485 | 3.232323 | 3.838384 | 1.201671 | 0.747475 |

2 | 30.471264 | 10.727011 | 1.735632 | 2.330460 | 13.864943 | 3.327586 | 3.764368 | 1.615745 | 0.635057 |

3 | 30.008056 | 10.239174 | 1.638469 | 2.308157 | 14.001007 | 3.402820 | 3.798590 | 1.371281 | 0.550856 |

4 | 28.856601 | 8.816905 | 1.369536 | 2.400981 | 14.144514 | 3.420161 | 3.835861 | 0.674837 | 0.322926 |

5 | 28.574702 | 8.311662 | 1.252794 | 2.506334 | 14.399776 | 3.454918 | 3.892697 | 0.348174 | 0.181446 |

`age`

, `yrs_married`

, and `children`

appears to correlate with a declining marriage rating.

In [5]:

```
# show plots in the notebook
%matplotlib inline
```

Let's start with histograms of education and marriage rating.

In [6]:

```
# histogram of education
dta.educ.hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')
```

Out[6]:

<matplotlib.text.Text at 0x16e48ac8>

In [7]:

```
# histogram of marriage rating
dta.rate_marriage.hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

Out[7]:

<matplotlib.text.Text at 0x16eac550>

In [8]:

```
# barplot of marriage rating grouped by affair (True or False)
pd.crosstab(dta.rate_marriage, dta.affair.astype(bool)).plot(kind='bar')
plt.title('Marriage Rating Distribution by Affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

Out[8]:

<matplotlib.text.Text at 0x1710c5f8>

In [9]:

```
affair_yrs_married = pd.crosstab(dta.yrs_married, dta.affair.astype(bool))
affair_yrs_married.div(affair_yrs_married.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Affair Percentage by Years Married')
plt.xlabel('Years Married')
plt.ylabel('Percentage')
```

Out[9]:

<matplotlib.text.Text at 0x17d83a20>

To prepare the data, I want to add an intercept column as well as dummy variables for `occupation`

and `occupation_husb`

, since I'm treating them as categorial variables. The dmatrices function from the patsy module can do that using formula language.

In [10]:

```
# create dataframes with an intercept column and dummy variables for
# occupation and occupation_husb
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
religious + educ + C(occupation) + C(occupation_husb)',
dta, return_type="dataframe")
print X.columns
```

The column names for the dummy variables are ugly, so let's rename those.

In [11]:

```
# fix column names of X
X = X.rename(columns = {'C(occupation)[T.2.0]':'occ_2',
'C(occupation)[T.3.0]':'occ_3',
'C(occupation)[T.4.0]':'occ_4',
'C(occupation)[T.5.0]':'occ_5',
'C(occupation)[T.6.0]':'occ_6',
'C(occupation_husb)[T.2.0]':'occ_husb_2',
'C(occupation_husb)[T.3.0]':'occ_husb_3',
'C(occupation_husb)[T.4.0]':'occ_husb_4',
'C(occupation_husb)[T.5.0]':'occ_husb_5',
'C(occupation_husb)[T.6.0]':'occ_husb_6'})
```

`y`

into a 1-D array, so that scikit-learn will properly understand it as the response variable.

In [12]:

```
# flatten y into a 1-D array
y = np.ravel(y)
```

Let's go ahead and run logistic regression on the entire data set, and see how accurate it is!

In [13]:

```
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)
# check the accuracy on the training set
model.score(X, y)
```

Out[13]:

0.72588752748978946

73% accuracy seems good, but what's the null error rate?

In [14]:

```
# what percentage had affairs?
y.mean()
```

Out[14]:

0.32249450204209867

Only 32% of the women had affairs, which means that you could obtain 68% accuracy by always predicting "no". So we're doing better than the null error rate, but not by much.

Let's examine the coefficients to see what we learn.

In [15]:

```
# examine the coefficients
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))
```

Out[15]:

0 | 1 | |
---|---|---|

0 | Intercept | [1.48988372957] |

1 | occ_2 | [0.188045598942] |

2 | occ_3 | [0.498926222393] |

3 | occ_4 | [0.25064662018] |

4 | occ_5 | [0.838982982602] |

5 | occ_6 | [0.833921262629] |

6 | occ_husb_2 | [0.190546828287] |

7 | occ_husb_3 | [0.297744578502] |

8 | occ_husb_4 | [0.161319424129] |

9 | occ_husb_5 | [0.187683007035] |

10 | occ_husb_6 | [0.193916860892] |

11 | rate_marriage | [-0.70312052711] |

12 | age | [-0.0584177439191] |

13 | yrs_married | [0.10567679013] |

14 | children | [0.0169195866351] |

15 | religious | [-0.371135218074] |

16 | educ | [0.0040161519299] |

So far, we have trained and tested on the same set. Let's instead split the data into a training set and a testing set.

In [16]:

```
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
```

Out[16]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

In [17]:

```
# predict class labels for the test set
predicted = model2.predict(X_test)
print predicted
```

[ 1. 0. 0. ..., 0. 0. 0.]

In [18]:

```
# generate class probabilities
probs = model2.predict_proba(X_test)
print probs
```

As you can see, the classifier is predicting a 1 (having an affair) any time the probability in the second column is greater than 0.5.

Now let's generate some evaluation metrics.

In [19]:

```
# generate evaluation metrics
print metrics.accuracy_score(y_test, predicted)
print metrics.roc_auc_score(y_test, probs[:, 1])
```

0.729842931937 0.74596198609

The accuracy is 73%, which is the same as we experienced when training and predicting on the same data.

We can also see the confusion matrix and a classification report with other metrics.

In [20]:

```
print metrics.confusion_matrix(y_test, predicted)
print metrics.classification_report(y_test, predicted)
```

Now let's try 10-fold cross-validation, to see if the accuracy holds up more rigorously.

In [21]:

```
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print scores
print scores.mean()
```

Looks good. It's still performing at 73% accuracy.

Just for fun, let's predict the probability of an affair for a random woman not present in the dataset. She's a 25-year-old teacher who graduated college, has been married for 3 years, has 1 child, rates herself as strongly religious, rates her marriage as fair, and her husband is a farmer.

In [22]:

```
model.predict_proba(np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 3, 25, 3, 1, 4,
16]))
```

Out[22]:

array([[ 0.77472334, 0.22527666]])

The predicted probability of an affair is 23%.

There are many different steps that could be tried in order to improve the model:

- including interaction terms
- removing features
- regularization techniques
- using a non-linear model