Bite Size Bayes¶

MIT License: https://opensource.org/licenses/MIT

In [1]:

import pandas as pd
import numpy as np

The dataset includes variables I selected from the General Social Survey, available from this project on the GSS site: https://gssdataexplorer.norc.org/projects/54786

I also store the data in the GitHub repository for this book; the following cell downloads it, if necessary.

In [2]:

# Download and unpack the data file if it isn't already present

import os

if not os.path.exists('gss_bayes.tar.gz'):
    !wget https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.tar.gz
    !tar -xzf gss_bayes.tar.gz

utils.py provides read_stata, which reads the data from the Stata format.
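For readers who don't have utils.py at hand, here is a minimal sketch of the general idea behind reading a fixed-width data file with column positions taken from a dictionary file. It is not the actual implementation of read_stata; it assumes the .dat file is fixed-width text, and the column names, positions, and sample rows below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-column layout; the real positions would come from GSS.dct.
names = ['year', 'relig']
colspecs = [(0, 4), (5, 6)]  # half-open [start, end) character ranges

# Made-up stand-in for the contents of a fixed-width .dat file.
raw = io.StringIO("1972 3\n1972 2\n1972 1\n")

# read_fwf slices each line at the given positions and names the columns.
toy = pd.read_fwf(raw, colspecs=colspecs, names=names, header=None)
print(toy)
```

The dictionary file plays the role of `names` and `colspecs` here: it tells the reader where each variable starts and ends on a line.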

In [3]:

from utils import read_stata

gss = read_stata('GSS.dct', 'GSS.dat')
gss.rename(columns={'id_': 'caseid'}, inplace=True)
gss.index = gss['caseid']
gss.head()

Out[3]:

	year	relig	srcbelt	region	adults	wtssall	ballot	cohort	feminist	polviews	partyid	race	sex	educ	age	indus10	occ10	caseid	realinc
caseid
1	1972	3	3	3	1	0.4446	0	1949	0	0	2	1	2	16	23	5170	520	1	18951.0
2	1972	2	3	3	2	0.8893	0	1902	0	0	1	1	1	10	70	6470	7700	2	24366.0
3	1972	1	3	3	2	0.8893	0	1924	0	0	3	1	2	12	48	7070	4920	3	24366.0
4	1972	5	3	3	2	0.8893	0	1945	0	0	1	1	2	17	27	5170	800	4	30458.0
5	1972	1	3	3	2	0.8893	0	1911	0	0	0	1	2	12	61	6680	5020	5	50763.0

In [4]:

def replace_invalid(series, bad_vals, replacement=np.nan):
    """Replace invalid values with NaN

    Modifies series in place.

    series: Pandas Series
    bad_vals: list of values to replace
    replacement: value to substitute for each bad value
    """
    series.replace(bad_vals, replacement, inplace=True)

The following cell replaces the codes for invalid responses with NaN for the variables we'll use.

In [5]:

replace_invalid(gss['feminist'], [0, 8, 9])
replace_invalid(gss['polviews'], [0, 8, 9])
replace_invalid(gss['partyid'], [8, 9])
replace_invalid(gss['indus10'], [0, 9997, 9999])
replace_invalid(gss['age'], [0, 98, 99])
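One caveat: whether an in-place change to a selected column shows up in the original DataFrame can depend on the pandas version (with copy-on-write enabled, a selected column behaves like a copy). Here is a minimal sketch of an assignment-based variant that does not rely on in-place behavior. The toy DataFrame is made up for illustration; only the bad-value codes are taken from the cell above.

```python
import numpy as np
import pandas as pd

# Made-up responses; 0, 8, and 9 stand for invalid codes.
toy = pd.DataFrame({'feminist': [0, 1, 2, 8, 9, 1]})

# Build a cleaned copy of the column and assign it back explicitly.
toy['feminist'] = toy['feminist'].replace([0, 8, 9], np.nan)

# Count how many responses were marked as missing.
print(toy['feminist'].isna().sum())
```

Assigning the result back to the column makes the update explicit, so it works the same way regardless of how the selection is implemented.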

In [6]:

def values(series):
    """Make a series of values and the number of times they appear.
    
    series: Pandas Series
    
    returns: Pandas Series
    """
    return series.value_counts(dropna=False).sort_index()
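To see what this function returns, here is a small sketch on a made-up Series. It repeats the body of `values` inline so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Made-up responses, including one missing value.
toy = pd.Series([2.0, 1.0, np.nan, 1.0])

# Same expression as in values(): count every value, including NaN,
# then order the result by value rather than by frequency.
counts = toy.value_counts(dropna=False).sort_index()
print(counts)
```

Passing `dropna=False` keeps the NaN row in the result, which is what makes the number of missing responses visible in the outputs below.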

feminist¶

https://gssdataexplorer.norc.org/variables/1698/vshow

This question was asked in only one year of the survey, so we're limited to a small number of responses: 1,381 valid answers out of 62,466 rows.

In [7]:

values(gss['feminist'])

Out[7]:

1.0      298
2.0     1083
NaN    61085
Name: feminist, dtype: int64

polviews¶

https://gssdataexplorer.norc.org/variables/178/vshow

In [8]:

values(gss['polviews'])

Out[8]:

1.0     1560
2.0     6236
3.0     6754
4.0    20515
5.0     8407
6.0     7876
7.0     1733
NaN     9385
Name: polviews, dtype: int64

partyid¶

https://gssdataexplorer.norc.org/variables/141/vshow

In [9]:

values(gss['partyid'])

Out[9]:

0.0     9999
1.0    12942
2.0     7485
3.0     9474
4.0     5462
5.0     9661
6.0     6063
7.0      995
NaN      385
Name: partyid, dtype: int64

race¶

https://gssdataexplorer.norc.org/variables/82/vshow

In [10]:

values(gss['race'])

Out[10]:

1    50340
2     8802
3     3324
Name: race, dtype: int64

sex¶

https://gssdataexplorer.norc.org/variables/81/vshow

In [11]:

values(gss['sex'])

Out[11]:

1    27562
2    34904
Name: sex, dtype: int64

age¶

In [12]:

values(gss['age'])

Out[12]:

18.0     219
19.0     835
20.0     870
21.0     987
22.0    1042
        ... 
86.0     172
87.0     143
88.0     113
89.0     335
NaN      221
Name: age, Length: 73, dtype: int64

indus10¶

https://gssdataexplorer.norc.org/variables/17/vshow

In [13]:

values(gss['indus10'])

Out[13]:

170.0      458
180.0      444
190.0       37
270.0       69
280.0       36
          ... 
9770.0      13
9780.0       8
9790.0      53
9870.0      22
NaN       4704
Name: indus10, Length: 271, dtype: int64

Select subset¶

Here's the subset of the data with valid responses for the variables we'll use.

In [14]:

varnames = ['year', 'age', 'sex', 'polviews', 'partyid', 'indus10']

valid = gss.dropna(subset=varnames)
valid.shape

Out[14]:

(49290, 19)
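As a small illustration of how `dropna(subset=...)` behaves, here is a sketch on a made-up DataFrame. A row is dropped if any of the listed columns is missing; columns outside the list are ignored when deciding, but they are kept in the result.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'race' is complete, the other two columns have gaps.
toy = pd.DataFrame({
    'age': [21.0, np.nan, 58.0],
    'polviews': [4.0, 5.0, np.nan],
    'race': [1, 2, 3],
})

# Keep only rows where both 'age' and 'polviews' are present.
kept = toy.dropna(subset=['age', 'polviews'])
print(kept.shape)
```

This is why `valid` above still has all 19 columns: the subset only controls which columns are checked for missing values.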

In [15]:

subset = valid[varnames]
subset.head()

Out[15]:

	year	age	sex	polviews	partyid	indus10
caseid
1	1974	21.0	1	4.0	2.0	4970.0
2	1974	41.0	1	5.0	0.0	9160.0
5	1974	58.0	2	6.0	1.0	2670.0
6	1974	30.0	1	5.0	4.0	6870.0
7	1974	48.0	1	5.0	4.0	7860.0

Save the data¶

In [20]:

subset.to_csv('gss_bayes.csv')

In [21]:

!ls -l gss_bayes.csv

-rw-rw-r-- 1 downey downey 1546290 Jan 21 10:11 gss_bayes.csv
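When the file is read back, the case identifier comes in as an ordinary column unless it is named as the index. Here is a minimal round-trip sketch that uses an in-memory buffer and a made-up two-row DataFrame in place of gss_bayes.csv.

```python
import io

import pandas as pd

# Made-up stand-in for subset, with caseid as the index.
toy = pd.DataFrame(
    {'year': [1974, 1974], 'age': [21.0, 41.0]},
    index=pd.Index([1, 2], name='caseid'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore caseid as the index when reading the CSV back.
restored = pd.read_csv(buffer, index_col='caseid')
print(restored)
```

With the real file, the same call would be `pd.read_csv('gss_bayes.csv', index_col='caseid')`, assuming the index is still named caseid when the file is written.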
