import pandas as pd
import numpy as np
The dataset includes variables I selected from the General Social Survey, available from this project on the GSS site: https://gssdataexplorer.norc.org/projects/54786
I also store the data in the GitHub repository for this book; the following cell downloads it, if necessary.
# Load the data file
import os
if not os.path.exists('gss_bayes.tar.gz'):
    !wget https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.tar.gz
    !tar -xzf gss_bayes.tar.gz
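The cell above relies on the `wget` and `tar` shell commands, which may not be available on every system. As a rough sketch of the same "download only if necessary, then unpack" step using only the Python standard library, something like the following could work. The helper name `fetch_and_extract` and its `dest` parameter are hypothetical and are not part of this notebook or of `utils.py`; unlike the cell above, this sketch unpacks the archive on every call.

```python
import os
import tarfile
import urllib.request


def fetch_and_extract(url, filename, dest='.'):
    """Download a .tar.gz archive unless it already exists, then unpack it.

    Hypothetical helper: not part of the notebook or of utils.py.
    """
    # Skip the download when the archive is already on disk
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

    # Unpack the archive into dest
    with tarfile.open(filename, 'r:gz') as tar:
        tar.extractall(dest)


# Usage with the URL from the cell above (requires network access):
# fetch_and_extract(
#     'https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.tar.gz',
#     'gss_bayes.tar.gz')
```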
utils.py provides read_stata, which reads the data from the Stata format.
from utils import read_stata
gss = read_stata('GSS.dct', 'GSS.dat')
gss.rename(columns={'id_': 'caseid'}, inplace=True)
gss.index = gss['caseid']
gss.head()
caseid (index) | year | relig | srcbelt | region | adults | wtssall | ballot | cohort | feminist | polviews | partyid | race | sex | educ | age | indus10 | occ10 | caseid | realinc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1972 | 3 | 3 | 3 | 1 | 0.4446 | 0 | 1949 | 0 | 0 | 2 | 1 | 2 | 16 | 23 | 5170 | 520 | 1 | 18951.0 |
2 | 1972 | 2 | 3 | 3 | 2 | 0.8893 | 0 | 1902 | 0 | 0 | 1 | 1 | 1 | 10 | 70 | 6470 | 7700 | 2 | 24366.0 |
3 | 1972 | 1 | 3 | 3 | 2 | 0.8893 | 0 | 1924 | 0 | 0 | 3 | 1 | 2 | 12 | 48 | 7070 | 4920 | 3 | 24366.0 |
4 | 1972 | 5 | 3 | 3 | 2 | 0.8893 | 0 | 1945 | 0 | 0 | 1 | 1 | 2 | 17 | 27 | 5170 | 800 | 4 | 30458.0 |
5 | 1972 | 1 | 3 | 3 | 2 | 0.8893 | 0 | 1911 | 0 | 0 | 0 | 1 | 2 | 12 | 61 | 6680 | 5020 | 5 | 50763.0 |
def replace_invalid(series, bad_vals, replacement=np.nan):
    """Replace invalid values (with NaN by default).

    Modifies series in place.

    series: Pandas Series
    bad_vals: list of values to replace
    replacement: value to substitute for each bad value
    """
    series.replace(bad_vals, replacement, inplace=True)
The following cell replaces invalid responses for the variables we'll use.
replace_invalid(gss['feminist'], [0, 8, 9])
replace_invalid(gss['polviews'], [0, 8, 9])
replace_invalid(gss['partyid'], [8, 9])
replace_invalid(gss['indus10'], [0, 9997, 9999])
replace_invalid(gss['age'], [0, 98, 99])
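The calls above depend on `gss['feminist']` and the other selected columns sharing memory with `gss`, so that an in-place change to the Series shows up in the DataFrame. That may not hold in pandas versions where copy-on-write is enabled. As a hedged alternative, here is a minimal sketch that returns a new Series and assigns it back explicitly. The name `replace_invalid_copy` is hypothetical, and the data is a toy example, not GSS responses.

```python
import numpy as np
import pandas as pd


def replace_invalid_copy(series, bad_vals, replacement=np.nan):
    """Return a new Series with bad_vals replaced; the input is not modified.

    Hypothetical variant of replace_invalid, shown for illustration only.
    """
    # mask() substitutes the replacement wherever the condition is True
    return series.mask(series.isin(bad_vals), replacement)


# Toy data standing in for a survey column
df = pd.DataFrame({'polviews': [0, 4, 8, 9, 7]})

# Assign the result back to the column instead of mutating it in place
df['polviews'] = replace_invalid_copy(df['polviews'], [0, 8, 9])
print(df['polviews'])
```

With this variant, each cleaning step would be written as an assignment, for example `gss['age'] = replace_invalid_copy(gss['age'], [0, 98, 99])`.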
def values(series):
    """Make a series of values and the number of times they appear.

    series: Pandas Series

    returns: Pandas Series
    """
    return series.value_counts(dropna=False).sort_index()
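To make the behavior of `values` concrete, here is a small sketch on a toy Series (made-up numbers, not GSS data). Because of `dropna=False`, missing values get their own row, and `sort_index` orders the rows by value rather than by frequency.

```python
import numpy as np
import pandas as pd


def values(series):
    """Same definition as above, repeated so this example runs on its own."""
    return series.value_counts(dropna=False).sort_index()


# Toy responses: two 1s, three 2s, and one missing value
toy = pd.Series([2.0, 1.0, np.nan, 2.0, 1.0, 2.0])
counts = values(toy)
print(counts)

# Every element, including the missing one, is counted exactly once
print(counts.sum() == len(toy))
```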
The feminist variable is documented here: https://gssdataexplorer.norc.org/variables/1698/vshow
This question was asked in only one year, so we're limited to a small number of responses.
values(gss['feminist'])
1.0      298
2.0     1083
NaN    61085
Name: feminist, dtype: int64
values(gss['polviews'])
1.0     1560
2.0     6236
3.0     6754
4.0    20515
5.0     8407
6.0     7876
7.0     1733
NaN     9385
Name: polviews, dtype: int64
values(gss['partyid'])
0.0     9999
1.0    12942
2.0     7485
3.0     9474
4.0     5462
5.0     9661
6.0     6063
7.0      995
NaN      385
Name: partyid, dtype: int64
values(gss['race'])
1    50340
2     8802
3     3324
Name: race, dtype: int64
values(gss['sex'])
1    27562
2    34904
Name: sex, dtype: int64
values(gss['age'])
18.0     219
19.0     835
20.0     870
21.0     987
22.0    1042
        ...
86.0     172
87.0     143
88.0     113
89.0     335
NaN      221
Name: age, Length: 73, dtype: int64
values(gss['indus10'])
170.0      458
180.0      444
190.0       37
270.0       69
280.0       36
          ...
9770.0      13
9780.0       8
9790.0      53
9870.0      22
NaN       4704
Name: indus10, Length: 271, dtype: int64
Here's the subset of the data with valid responses for the variables we'll use.
varnames = ['year', 'age', 'sex', 'polviews', 'partyid', 'indus10']
valid = gss.dropna(subset=varnames)
valid.shape
(49290, 19)
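The `subset` argument matters here: only the listed columns are checked for missing values, so a row is kept even if a column outside the list (such as `feminist`, which is mostly NaN) is missing. A minimal sketch with toy data (not GSS responses) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy frame: 'feminist' is missing in rows 0 and 2, 'age' is missing in row 1
toy = pd.DataFrame({
    'age': [23.0, np.nan, 48.0],
    'polviews': [4.0, 5.0, 6.0],
    'feminist': [np.nan, 1.0, np.nan],
})

# Only 'age' and 'polviews' are checked, so only row 1 is dropped
kept = toy.dropna(subset=['age', 'polviews'])
print(kept)

# Without subset, every row has at least one NaN, so nothing survives
print(len(toy.dropna()))
```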
subset = valid[varnames]
subset.head()
caseid (index) | year | age | sex | polviews | partyid | indus10 |
---|---|---|---|---|---|---|
1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |
2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |
5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |
6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |
subset.to_csv('gss_bayes.csv')
!ls -l gss_bayes.csv
-rw-rw-r-- 1 downey downey 1546290 Jan 21 10:11 gss_bayes.csv
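Because `to_csv` writes the index as the first column, a reader of the file would probably want to restore it when loading. The following is a sketch of that round trip on a toy frame written to a temporary file. The file name `toy.csv` and the data are made up for illustration; for the real file, the analogous call would be `pd.read_csv('gss_bayes.csv', index_col='caseid')`, assuming the index is still named `caseid`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a named index, mimicking the structure of subset
toy = pd.DataFrame(
    {'year': [1974, 1974], 'age': [21.0, 41.0]},
    index=pd.Index([1, 2], name='caseid'),
)

# Write to a temporary location so nothing in the working directory changes
path = os.path.join(tempfile.mkdtemp(), 'toy.csv')
toy.to_csv(path)

# index_col restores caseid as the index instead of an ordinary column
restored = pd.read_csv(path, index_col='caseid')
print(restored)
```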