Using the h_seq
field to aggregate by household.
Data: CPS | Tax year: 2017 | Author: Max Ghenis | Date run: 2018-04-08
import taxcalc as tc
import pandas as pd
import numpy as np
tc.__version__
'0.19.0'
pd.set_option('precision', 1)
recs = tc.Records.cps_constructor()
calc = tc.Calculator(records=recs, policy=tc.Policy(), verbose=False)
calc.advance_to_year(2017)
calc.calc_all()
tu = calc.dataframe(['s006', 'h_seq', 'ffpos',
# 'FLPDYR', # 2014 for all records.
'XTOT', 'expanded_income'])
tu['filers'] = 1
tu['XTOT_m'] = tu.XTOT * tu.s006 / 1e6
tu['expanded_income_b'] = tu.expanded_income * tu.s006 / 1e9
hh = tu.drop(columns='ffpos').groupby(['h_seq']).sum()
Add maximum s006
per household to test Ernie Tedeschi's idea.
hh_max_s006 = tu[['h_seq', 's006']].groupby('h_seq').max()
hh_max_s006.columns = ['max_s006']
hh = pd.merge(hh, hh_max_s006, left_index=True, right_index=True)
(round(tu.expanded_income_b.sum()), round(hh.expanded_income_b.sum()))
(13196.0, 13196.0)
(round(tu.XTOT_m.sum()), round(hh.XTOT_m.sum()))
(330.0, 330.0)
Four approaches.
Is this right? For each household, assign weight as:
$weight = \frac{\sum_{filers}weight}{n_{filers}}$
hh['hh_s006'] = hh.s006 / hh.filers
https://fred.stlouisfed.org/series/TTLHH estimates 126,224,000 total households in 2017.
Multiply this by a household's total share of s006
.
hh['s006_share'] = hh.s006 / hh.s006.sum()
TOTAL_HHS = 126224e3
hh['hh_s006_census'] = hh.s006_share * TOTAL_HHS
hh['XTOT_s006'] = 1e6 * hh.XTOT_m / hh.XTOT
hh['expanded_income_s006'] = 1e9 * hh.expanded_income_b / hh.expanded_income
In millions.
hh[['hh_s006', 'hh_s006_census', 'XTOT_s006',
'expanded_income_s006', 'max_s006']].sum() / 1e6
hh_s006 52.9 hh_s006_census 126.2 XTOT_s006 51.5 expanded_income_s006 52.3 max_s006 75.2 dtype: float64
Neither weighting approach recovers the true expanded income total.
round((hh.expanded_income * hh.hh_s006).sum() / 1e9)
15991.0
round((hh.expanded_income * hh.hh_s006_census).sum() / 1e9)
109675.0
2017 estimate: 82,827,000 family households (https://fred.stlouisfed.org/series/TTLFHH). This isn't comparable since family households exclude single-person households.
BYVARS = ['h_seq', 'ffpos']
fhh = tu.groupby(BYVARS).sum()
fhh_max_s006 = tu[BYVARS + ['s006']].groupby(BYVARS).max()
fhh_max_s006.columns = ['max_s006']
fhh = pd.merge(fhh, fhh_max_s006, left_index=True, right_index=True)
fhh['hh_s006'] = fhh.s006 / fhh.filers
fhh['XTOT_s006'] = 1e6 * fhh.XTOT_m / fhh.XTOT
fhh['expanded_income_s006'] = (
1e9 * fhh.expanded_income_b / fhh.expanded_income)
fhh[['hh_s006', 'XTOT_s006', 'expanded_income_s006', 'max_s006']].sum() / 1e6
hh_s006 68.9 XTOT_s006 67.4 expanded_income_s006 67.9 max_s006 90.6 dtype: float64