Retrieves data from the Census Bureau's American Community Survey 5-year series API for zctas, pumas, and tracts. A specific list of census variables is passed into the script, which are retrieved from the four ACS profile tables. Variables must be retrieved in chunks because only 50 can be passed to the API at a time, and each url varies by geography and retrieves them in different combinations. After some processing, output is written to a SQLite database. An option to create a metadata table appears at the bottom, but should only be run once for a given extract (acs1 and acs2) and not for each individual geography.
https://www.census.gov/data/developers/data-sets/acs-5year.html
import os, requests, json, sqlite3, random, pandas as pd, numpy as np
from IPython.display import clear_output
# File holding the Census API key (one line, key only)
keyfile = 'census_key.txt'
# API request parameters - UPDATE THE YEAR AND GEO before each run
year = '2018'
geo = 'zip code tabulation area'  # 'zip code tabulation area' or 'public use microdata area' or 'tract'
state = '36'
dsource = 'acs'
dsource2 = 'acs5'
dname = 'profile'
# Worksheet of acs_variables.xlsx to read the retrieval list from - UPDATE WORKSHEET
worksheet = 'acs2'  # 'acs1' or 'acs2'
# Map the API geography name to the workbook sheet / naming convention
geoexcelsheet = {'zip code tabulation area': 'zctas', 'public use microdata area': 'pumas', 'tract': 'tracts'}
geotype = geoexcelsheet.get(geo)
# SQLite output table and database
tabname = f'{geotype}_{year}{worksheet}'
dbname = os.path.join('outputs', 'testdb.sqlite')
# Dump file where the raw API pull is cached between runs
jsonpath = os.path.join('outputs', f'{tabname}_retrieved_data.json')
Get the full list of variables from the API, read in our retrieval list, and compare the variable IDs and names to make sure nothing is missing and that nothing has changed since the last iteration. Don't move on to the next block until both lists match. Lastly, read in the list of geographies.
# Pull the variable dictionaries for the four ACS profile tables and merge
# them into one lookup of {variable_id: metadata}.
datadict = {}
dps = ['DP02', 'DP03', 'DP04', 'DP05']
for p in dps:
    vars_url = f'https://api.census.gov/data/{year}/{dsource}/{dsource2}/{dname}/groups/{p}.json'
    # timeout so a stalled connection can't hang the notebook indefinitely;
    # raise_for_status gives a clear error instead of a cryptic .json() failure
    response = requests.get(vars_url, timeout=60)
    response.raise_for_status()
    var_data = response.json()
    datadict.update(var_data['variables'])
# Spot-check two entries; random.sample needs a sequence, not a dict view
# (passing dict.items() directly raises TypeError on Python 3.11+)
random.sample(list(datadict.items()), 2)
[('DP05_0086MA', {'label': 'Annotation of Margin of Error!!Total housing units', 'predicateType': 'string', 'group': 'DP05', 'limit': 0, 'predicateOnly': True}), ('DP02_0011PM', {'label': 'Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Nonfamily households!!Householder living alone', 'concept': 'SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES', 'predicateType': 'float', 'group': 'DP02', 'limit': 0, 'predicateOnly': True})]
# Load the curated retrieval list (db name, census id, label, dtype) for this extract
xlpath = os.path.join('inputs', 'acs_variables.xlsx')
dfexcel = pd.read_excel(xlpath, sheet_name=worksheet)
dfexcel.head()
db_var | census_var | census_label | dtype | |
---|---|---|---|---|
0 | HOC01_E | DP04_0001E | Estimate!!HOUSING OCCUPANCY!!Total housing units | int |
1 | HOC01_M | DP04_0001M | Margin of Error!!HOUSING OCCUPANCY!!Total hous... | int |
2 | HOC01_PC | DP04_0001PE | Percent Estimate!!HOUSING OCCUPANCY!!Total hou... | int |
3 | HOC01_PM | DP04_0001PM | Percent Margin of Error!!HOUSING OCCUPANCY!!To... | int |
4 | HOC02_E | DP04_0002E | Estimate!!HOUSING OCCUPANCY!!Total housing uni... | int |
# Build a frame of every API variable, keep only those on our retrieval list,
# and confirm the counts match before proceeding.
dfvars = pd.DataFrame.from_dict(datadict, columns=['label'], orient='index')
dfvars_selected = dfvars.loc[dfvars.index.isin(dfexcel['census_var'])]
dfvars_count = len(dfvars_selected)
dfexcel_count = len(dfexcel['census_var'])
if dfvars_count == dfexcel_count:
    print('There are an equal number of variables in both lists:', dfvars_count)
else:
    # Fixed garbled message: original printed "the api has, N" (double comma)
    # and "has N . Missing:" (space before period)
    print('There is a mismatch in the number of variables; the api has', dfvars_count,
          'while the original list has', dfexcel_count, '- Missing:')
    # Variables on our list that the API did not return
    nomatch = dfexcel[~dfexcel['census_var'].isin(dfvars_selected.index)]
    print(nomatch)
There are an equal number of variables in both lists: 236
# Verify the labels haven't changed since the list was last curated; a changed
# label usually means the variable's meaning shifted between ACS vintages.
mismatch = dfexcel[~dfexcel['census_label'].isin(dfvars_selected['label'])]
if mismatch.empty:
    print('All labels match')
else:
    # Pair each stale entry with the current API label for side-by-side review
    compare = pd.merge(mismatch, dfvars_selected, left_on='census_var', right_on=dfvars_selected.index)
    misfile = os.path.join('outputs', '{}_mismatch.csv'.format(worksheet))
    compare.to_csv(misfile, sep=',', index=False,
                   columns=['db_var', 'census_var', 'census_label', 'label'],
                   header=['db_var', 'census_var', 'oldlabel', 'newlabel'])
    print('* Mismatch file printed to outputs folder * \n')
    print('These labels do not match:')
    print(compare[['census_var', 'census_label', 'label']])
All labels match
# Geographic identifiers: zctas to retrieve, pumas to filter by, and counties containing tracts to retrieve
excelgeo = pd.read_excel(os.path.join('inputs', 'acs_variables.xlsx'), sheet_name=geotype, dtype=object)
geoids = excelgeo['GEO'].tolist()
# Fixed typo in the printed message ("indetifiers" -> "identifiers")
print('Number of geographic identifiers:', len(geoids))
Number of geographic indetifiers: 215
Given the large number of variables in the ACS and limits of the API, variables must be passed to the url in separate blocks or chunks. The first chunk that's captured is written to an empty datalist; the header row and then one row for each geography. Each subsequent chunk is iterated through by row, so each row is appended to the correct row in datalist. In all cases, the last values, identifiers automatically returned with each API call, are not appended.
def chunks(l, n):
    """Yield successive n-sized slices of sequence l; the last slice may be shorter."""
    start = 0
    while start < len(l):
        yield l[start:start + n]
        start += n
# Split the retrieval list into API-sized chunks (46 variables each leaves
# room for the identifier fields within the 50-variable limit)
reqvars = list(chunks(dfvars_selected.index.tolist(), 46))
# The first chunk also carries the two identifier fields returned with every call
reqvars[0][:0] = ['GEO_ID', 'NAME']
print('Number of chunks:', len(reqvars))
Number of chunks: 6
# Read the API key from disk and assemble the request root for this dataset
with open(keyfile) as fh:
    api_key = fh.read().strip()
base_url = 'https://api.census.gov/data/{}/{}/{}/{}'.format(year, dsource, dsource2, dname)
base_url
'https://api.census.gov/data/2018/acs/acs5/profile'
#Function for retrieving data; running this block loads it into memory
#Different geographies have different urls,
#and a different number of identifiers tacked on to the end of each request
def getdata():
    """Request every variable chunk for one geography pass and merge the
    chunks row-by-row into a single list of records.

    NOTE(review): relies on module-level names reqvars, base_url, api_key,
    geo, geotype, state, plus `g` (zctas) or `county` (tracts) bound by the
    caller's loop before this is called -- confirm when reusing.
    Returns a list of rows; the API's header row is row 0.
    """
    dlist=[]
    for i, v in enumerate(reqvars):
        batchcols=','.join(v)
        if geotype=='zctas':
            # one specific zcta per call; slice drops the 1 trailing identifier column
            data_url = f'{base_url}?get={batchcols}&for={geo}:{g}&key={api_key}'
            dropvar=-1
        elif geotype=='pumas':
            # every puma in the state; slice drops the 2 trailing identifier columns
            data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&key={api_key}'
            dropvar=-2
        elif geotype=='tracts':
            # every tract in one county; slice drops the 3 trailing identifier columns
            data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&in=county:{county}&key={api_key}'
            dropvar=-3
        else:
            print('Appropriate geography not specified in variables block')
            break
        response=requests.get(data_url)
        if response.status_code==200:
            clear_output(wait=True)
            data=response.json()
            # First chunk seeds one list per row (header + one per geography);
            # each later chunk extends the row at the same position
            for i2, v2 in enumerate(data):
                if i == 0:
                    dlist.append(v2[:dropvar])
                else:
                    for item in v2[:dropvar]:
                        dlist[i2].append(item)
        else:
            # NOTE(review): a failed chunk is reported but not retried; later
            # chunks still append, so rows would be misaligned -- check for
            # this message before trusting the output
            print('***Problem with retrieval***, response code',response.status_code)
    return dlist
NOTE: ZCTA retrieval takes a long time - 15 mins for 215 ZCTAs
#If this block was run successfully for a given table and geography don't rerun - next block pulls from saved json
datalist=[]
if geotype=='zctas':
    # One API pass per zcta: getdata() returns [header, data_row]; keep the
    # header only from the first pass
    for g in geoids:
        georecord=getdata()
        print('Retrieved data for',g)
        if len(datalist)==0:
            datalist.append(georecord[0])
            datalist.append(georecord[1])
        else:
            datalist.append(georecord[1])
elif geotype=='pumas':
    # Single pass returns the header plus every puma in the state
    datalist=getdata()
elif geotype=='tracts':
    # One API pass per county: keep the header row only from the first county
    for county in geoids:
        georecord=getdata()
        print('Retrieved data for',county)
        if len(datalist)==0:
            for tract in georecord:
                datalist.append(tract)
        else:
            for tract in georecord[1:]:
                datalist.append(tract)
# Sanity summary: every record should carry the same number of data points
# (a fractional average signals a failed/misaligned chunk)
dlrows=len(datalist)
dlitems=sum(len(x) for x in datalist)
dlbyrow=dlitems / dlrows
print('Retrieved', dlrows, 'records and', dlitems,'data points with', dlbyrow, 'points for each record...')
# Cache the raw pull so reprocessing doesn't require re-querying the API
with open(jsonpath, 'w') as f:
    json.dump(datalist, f)
print('Done - Data dumped to json file')
Retrieved data for 11697 Retrieved 216 records and 51408 data points with 238.0 points for each record... Done - Data dumped to json file
Replace footnotes with nulls, create a new GEOID2 column, replace census variable names with database variable names.
# Rehydrate the cached pull: first row is the header, the rest are records.
# Everything stays as object/strings until types are applied at the SQL stage.
with open(jsonpath) as f:
    jsondata = json.load(f)
header, records = jsondata[0], jsondata[1:]
alldata = (pd.DataFrame(records, columns=header, dtype=object)
           .rename(columns={'GEO_ID': 'GEOID', 'NAME': 'GEOLABEL'})
           .set_index('GEOID'))
alldata.info()
# Index and column entries should be 1 row and 1 column less than the raw dump
# (header row becomes the columns, GEOID becomes the index)
<class 'pandas.core.frame.DataFrame'> Index: 215 entries, 8600000US10001 to 8600000US11697 Columns: 237 entries, GEOLABEL to DP05_0087PM dtypes: object(237) memory usage: 399.8+ KB
alldata.head(3)
GEOLABEL | DP04_0001E | DP04_0001M | DP04_0001PE | DP04_0001PM | DP04_0002E | DP04_0002M | DP04_0002PE | DP04_0002PM | DP04_0003E | ... | DP05_0082PE | DP05_0082PM | DP05_0083E | DP05_0083M | DP05_0083PE | DP05_0083PM | DP05_0087E | DP05_0087M | DP05_0087PE | DP05_0087PM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GEOID | |||||||||||||||||||||
8600000US10001 | ZCTA5 10001 | 14141 | 539 | 14141 | -888888888 | 12431 | 521 | 87.9 | 2.2 | 1710 | ... | 0.5 | 0.5 | 309 | 133 | 1.3 | 0.6 | 17441 | 981 | 17441 | -888888888 |
8600000US10002 | ZCTA5 10002 | 35724 | 453 | 35724 | -888888888 | 33540 | 614 | 93.9 | 1.2 | 2184 | ... | 0.2 | 0.2 | 1516 | 445 | 2.0 | 0.6 | 53042 | 1715 | 53042 | -888888888 |
8600000US10003 | ZCTA5 10003 | 30876 | 378 | 30876 | -888888888 | 26124 | 703 | 84.6 | 2.1 | 4752 | ... | 0.3 | 0.2 | 1074 | 304 | 2.0 | 0.6 | 44049 | 1560 | 44049 | -888888888 |
3 rows × 237 columns
# Census footnote/annotation values arrive as strings like '-888888888',
# '-888888888.0', or '-888888888.00'. Instead of enumerating every spelling
# (the original "lousy solution"), match them all with one anchored regex:
# a known sentinel integer optionally followed by one or two trailing zeros.
footnote_pattern = r'^-(999999999|888888888|666666666|555555555|333333333|222222222)(\.0{1,2})?$'
alldata.replace(footnote_pattern, np.nan, regex=True, inplace=True)
# GEOID2 = short identifier clipped from the tail of the full GEOID;
# tail length varies by geography (zcta 5, puma 7, tract 11 chars)
idxgeoid2 = {'zctas': -5, 'pumas': -7, 'tracts': -11}
alldata.insert(loc=0, column='GEOID2', value=alldata.index.str[idxgeoid2.get(geotype):])
alldata.head(3)
GEOID2 | GEOLABEL | DP04_0001E | DP04_0001M | DP04_0001PE | DP04_0001PM | DP04_0002E | DP04_0002M | DP04_0002PE | DP04_0002PM | ... | DP05_0082PE | DP05_0082PM | DP05_0083E | DP05_0083M | DP05_0083PE | DP05_0083PM | DP05_0087E | DP05_0087M | DP05_0087PE | DP05_0087PM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GEOID | |||||||||||||||||||||
8600000US10001 | 10001 | ZCTA5 10001 | 14141 | 539 | 14141 | NaN | 12431 | 521 | 87.9 | 2.2 | ... | 0.5 | 0.5 | 309 | 133 | 1.3 | 0.6 | 17441 | 981 | 17441 | NaN |
8600000US10002 | 10002 | ZCTA5 10002 | 35724 | 453 | 35724 | NaN | 33540 | 614 | 93.9 | 1.2 | ... | 0.2 | 0.2 | 1516 | 445 | 2.0 | 0.6 | 53042 | 1715 | 53042 | NaN |
8600000US10003 | 10003 | ZCTA5 10003 | 30876 | 378 | 30876 | NaN | 26124 | 703 | 84.6 | 2.1 | ... | 0.3 | 0.2 | 1074 | 304 | 2.0 | 0.6 | 44049 | 1560 | 44049 | NaN |
3 rows × 238 columns
# PUMA pulls cover the whole state, so keep only the local areas of interest;
# zctas and tracts were already requested individually and pass through as-is
source = alldata.loc[alldata.GEOID2.isin(geoids)] if geotype == 'pumas' else alldata
acsdata = source.copy().astype(object).sort_index()
acsdata.shape
(215, 238)
#Dictionary of column names from the census and the nyc geodatabase
cv_to_db = dict(zip(dfexcel.census_var, dfexcel.db_var))
# Spot-check the mapping; random.sample needs a sequence, not a dict view
# (passing dict.items() directly raises TypeError on Python 3.11+)
random.sample(list(cv_to_db.items()), 5)
[('DP05_0001M', 'SXAG01_M'), ('DP05_0036PM', 'RACE02_PM'), ('DP05_0039M', 'RACE05_M'), ('DP04_0134E', 'RENT09_E'), ('DP05_0037PM', 'RACE03_PM')]
# Swap the census variable IDs for the nyc geodatabase column names
acsdata = acsdata.rename(columns=cv_to_db)
acsdata.head(3)
GEOID2 | GEOLABEL | HOC01_E | HOC01_M | HOC01_PC | HOC01_PM | HOC02_E | HOC02_M | HOC02_PC | HOC02_PM | ... | HISL09_PC | HISL09_PM | HISL10_E | HISL10_M | HISL10_PC | HISL10_PM | VOTE01_E | VOTE01_M | VOTE01_PC | VOTE01_PM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GEOID | |||||||||||||||||||||
8600000US10001 | 10001 | ZCTA5 10001 | 14141 | 539 | 14141 | NaN | 12431 | 521 | 87.9 | 2.2 | ... | 0.5 | 0.5 | 309 | 133 | 1.3 | 0.6 | 17441 | 981 | 17441 | NaN |
8600000US10002 | 10002 | ZCTA5 10002 | 35724 | 453 | 35724 | NaN | 33540 | 614 | 93.9 | 1.2 | ... | 0.2 | 0.2 | 1516 | 445 | 2.0 | 0.6 | 53042 | 1715 | 53042 | NaN |
8600000US10003 | 10003 | ZCTA5 10003 | 30876 | 378 | 30876 | NaN | 26124 | 703 | 84.6 | 2.1 | ... | 0.3 | 0.2 | 1074 | 304 | 2.0 | 0.6 | 44049 | 1560 | 44049 | NaN |
3 rows × 238 columns
Update list of variables and data types, build create table string, create datatable in temporary database.
# Translate spreadsheet dtype tags into SQLite column types
dfexcel.replace({'dtype': {'int': 'INTEGER', 'float': 'REAL'}}, inplace=True)
# Assign the column directly: Series.replace(..., inplace=True) on an
# attribute-accessed column is chained assignment, which pandas warns about
# and no longer guarantees will write back to the frame
dfexcel['census_label'] = dfexcel['census_label'].str.replace('!!', ' - ', regex=False)
dfexcel.head()
db_var | census_var | census_label | dtype | |
---|---|---|---|---|
0 | HOC01_E | DP04_0001E | Estimate - HOUSING OCCUPANCY - Total housing u... | INTEGER |
1 | HOC01_M | DP04_0001M | Margin of Error - HOUSING OCCUPANCY - Total ho... | INTEGER |
2 | HOC01_PC | DP04_0001PE | Percent Estimate - HOUSING OCCUPANCY - Total h... | INTEGER |
3 | HOC01_PM | DP04_0001PM | Percent Margin of Error - HOUSING OCCUPANCY - ... | INTEGER |
4 | HOC02_E | DP04_0002E | Estimate - HOUSING OCCUPANCY - Total housing u... | INTEGER |
# db_var -> [census_var, census_label, sqlite_type] lookup for table building
vardict = dfexcel.set_index('db_var').T.to_dict('list')
# Spot-check two entries; random.sample needs a sequence, not a dict view
# (passing dict.items() directly raises TypeError on Python 3.11+)
random.sample(list(vardict.items()), 2)
[('HISL09_PM', ['DP05_0082PM', 'Percent Margin of Error - HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or Latino - Some other race alone', 'REAL']), ('HISL03_E', ['DP05_0076E', 'Estimate - HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or Latino', 'INTEGER'])]
con = sqlite3.connect(dbname)
cur = con.cursor()
# Rebuild the data table from scratch on every run
cur.execute('DROP TABLE IF EXISTS {}'.format(tabname))
# Fixed identifier columns first, then one typed column per database variable
dbstring="""
CREATE TABLE {} (
GEOID TEXT,
GEOID2 TEXT NOT NULL PRIMARY KEY,
GEOLABEL TEXT,
""".format(tabname)
# Join the column definitions instead of appending-then-slicing a trailing comma
dbstring += ', \n'.join('{} {}'.format(col, spec[2]) for col, spec in vardict.items())
dbstring += ');'
print(dbstring)
CREATE TABLE zctas_2018acs2 ( GEOID TEXT, GEOID2 TEXT NOT NULL PRIMARY KEY, GEOLABEL TEXT, HOC01_E INTEGER, HOC01_M INTEGER, HOC01_PC INTEGER, HOC01_PM INTEGER, HOC02_E INTEGER, HOC02_M INTEGER, HOC02_PC REAL, HOC02_PM REAL, HOC03_E INTEGER, HOC03_M INTEGER, HOC03_PC REAL, HOC03_PM REAL, HOC04_E REAL, HOC04_M REAL, HOC04_PC INTEGER, HOC04_PM INTEGER, HOC05_E REAL, HOC05_M REAL, HOC05_PC INTEGER, HOC05_PM INTEGER, HTEN01_E INTEGER, HTEN01_M INTEGER, HTEN01_PC INTEGER, HTEN01_PM INTEGER, HTEN02_E INTEGER, HTEN02_M INTEGER, HTEN02_PC REAL, HTEN02_PM REAL, HTEN03_E INTEGER, HTEN03_M INTEGER, HTEN03_PC REAL, HTEN03_PM REAL, HVAL01_E INTEGER, HVAL01_M INTEGER, HVAL01_PC INTEGER, HVAL01_PM INTEGER, MORT01_E INTEGER, MORT01_M INTEGER, MORT01_PC INTEGER, MORT01_PM INTEGER, MORT02_E INTEGER, MORT02_M INTEGER, MORT02_PC REAL, MORT02_PM REAL, MORT03_E INTEGER, MORT03_M INTEGER, MORT03_PC REAL, MORT03_PM REAL, RENT01_E INTEGER, RENT01_M INTEGER, RENT01_PC INTEGER, RENT01_PM INTEGER, RENT02_E INTEGER, RENT02_M INTEGER, RENT02_PC REAL, RENT02_PM REAL, RENT03_E INTEGER, RENT03_M INTEGER, RENT03_PC REAL, RENT03_PM REAL, RENT04_E INTEGER, RENT04_M INTEGER, RENT04_PC REAL, RENT04_PM REAL, RENT05_E INTEGER, RENT05_M INTEGER, RENT05_PC REAL, RENT05_PM REAL, RENT06_E INTEGER, RENT06_M INTEGER, RENT06_PC REAL, RENT06_PM REAL, RENT07_E INTEGER, RENT07_M INTEGER, RENT07_PC REAL, RENT07_PM REAL, RENT08_E INTEGER, RENT08_M INTEGER, RENT08_PC REAL, RENT08_PM REAL, RENT09_E INTEGER, RENT09_M INTEGER, RENT09_PC INTEGER, RENT09_PM INTEGER, RENT10_E INTEGER, RENT10_M INTEGER, RENT10_PC INTEGER, RENT10_PM INTEGER, SXAG01_E INTEGER, SXAG01_M INTEGER, SXAG01_PC INTEGER, SXAG01_PM INTEGER, SXAG02_E INTEGER, SXAG02_M INTEGER, SXAG02_PC REAL, SXAG02_PM REAL, SXAG03_E INTEGER, SXAG03_M INTEGER, SXAG03_PC REAL, SXAG03_PM REAL, SXAG04_E INTEGER, SXAG04_M INTEGER, SXAG04_PC REAL, SXAG04_PM REAL, SXAG05_E INTEGER, SXAG05_M INTEGER, SXAG05_PC REAL, SXAG05_PM REAL, SXAG06_E INTEGER, SXAG06_M INTEGER, 
SXAG06_PC REAL, SXAG06_PM REAL, SXAG07_E INTEGER, SXAG07_M INTEGER, SXAG07_PC REAL, SXAG07_PM REAL, SXAG08_E INTEGER, SXAG08_M INTEGER, SXAG08_PC REAL, SXAG08_PM REAL, SXAG09_E INTEGER, SXAG09_M INTEGER, SXAG09_PC REAL, SXAG09_PM REAL, SXAG10_E INTEGER, SXAG10_M INTEGER, SXAG10_PC REAL, SXAG10_PM REAL, SXAG11_E INTEGER, SXAG11_M INTEGER, SXAG11_PC REAL, SXAG11_PM REAL, SXAG12_E INTEGER, SXAG12_M INTEGER, SXAG12_PC REAL, SXAG12_PM REAL, SXAG13_E INTEGER, SXAG13_M INTEGER, SXAG13_PC REAL, SXAG13_PM REAL, SXAG14_E INTEGER, SXAG14_M INTEGER, SXAG14_PC REAL, SXAG14_PM REAL, SXAG15_E INTEGER, SXAG15_M INTEGER, SXAG15_PC REAL, SXAG15_PM REAL, SXAG16_E INTEGER, SXAG16_M INTEGER, SXAG16_PC REAL, SXAG16_PM REAL, SXAG17_E REAL, SXAG17_M REAL, SXAG17_PC INTEGER, SXAG17_PM INTEGER, RACE01_E INTEGER, RACE01_M INTEGER, RACE01_PC INTEGER, RACE01_PM INTEGER, RACE02_E INTEGER, RACE02_M INTEGER, RACE02_PC REAL, RACE02_PM REAL, RACE03_E INTEGER, RACE03_M INTEGER, RACE03_PC REAL, RACE03_PM REAL, RACE04_E INTEGER, RACE04_M INTEGER, RACE04_PC REAL, RACE04_PM REAL, RACE05_E INTEGER, RACE05_M INTEGER, RACE05_PC REAL, RACE05_PM REAL, RACE06_E INTEGER, RACE06_M INTEGER, RACE06_PC REAL, RACE06_PM REAL, RACE07_E INTEGER, RACE07_M INTEGER, RACE07_PC REAL, RACE07_PM REAL, RACE08_E INTEGER, RACE08_M INTEGER, RACE08_PC REAL, RACE08_PM REAL, RACE09_E INTEGER, RACE09_M INTEGER, RACE09_PC REAL, RACE09_PM REAL, HISL01_E INTEGER, HISL01_M INTEGER, HISL01_PC INTEGER, HISL01_PM INTEGER, HISL02_E INTEGER, HISL02_M INTEGER, HISL02_PC REAL, HISL02_PM REAL, HISL03_E INTEGER, HISL03_M INTEGER, HISL03_PC REAL, HISL03_PM REAL, HISL04_E INTEGER, HISL04_M INTEGER, HISL04_PC REAL, HISL04_PM REAL, HISL05_E INTEGER, HISL05_M INTEGER, HISL05_PC REAL, HISL05_PM REAL, HISL06_E INTEGER, HISL06_M INTEGER, HISL06_PC REAL, HISL06_PM REAL, HISL07_E INTEGER, HISL07_M INTEGER, HISL07_PC REAL, HISL07_PM REAL, HISL08_E INTEGER, HISL08_M INTEGER, HISL08_PC REAL, HISL08_PM REAL, HISL09_E INTEGER, HISL09_M INTEGER, HISL09_PC REAL, 
HISL09_PM REAL, HISL10_E INTEGER, HISL10_M INTEGER, HISL10_PC REAL, HISL10_PM REAL, VOTE01_E INTEGER, VOTE01_M INTEGER, VOTE01_PC INTEGER, VOTE01_PM INTEGER);
# Create the typed table, then append the processed dataframe into it
cur.execute(dbstring)
<sqlite3.Cursor at 0x21af7efbea0>
acsdata.to_sql(name=tabname, if_exists='append', index=True, con=con)
# Verify the row count matches the dataframe
cur.execute('SELECT COUNT(*) FROM {};'.format(tabname))
rows = cur.fetchone()
print(rows[0], 'records written to', tabname)
215 records written to zctas_2018acs2
# Verify the column count (read the header via cursor.description)
cur.execute('SELECT * FROM {} LIMIT 1;'.format(tabname))
col_names = [cn[0] for cn in cur.description]
print(len(col_names), 'columns written to', tabname)
#Number should be same as number in df acsdata plus 1, since index not included in df count
239 columns written to zctas_2018acs2
con.close()
DO NOT RERUN THIS SECTION FOR MULTIPLE GEOGRAPHIES. In the NYC Geodatabase there is only one metadata table for all of the ACS tables (acs1 and acs2) for all geographies. For whichever geography is processed first, set action variable to 'create' and run this entire series of blocks for the acs1 table. For the acs2 table, set the action variable to 'append' and skip the table creation and identifier insertion blocks.
#Change table name and specify an action - you're creating the table for the first time with acs1 variables,
#or appending the tables with acs2 variables
# One shared metadata table serves all ACS tables/geographies in the database
metatab='acslookup2018'
action='append' # 'create' or 'append'
con = sqlite3.connect(dbname)
cur = con.cursor()
#Only run this block when creating initial table (action='create')
if action=='create':
    # Minimal metadata schema: source extract tag, variable id, human-readable label
    mdstring="""
CREATE TABLE {} (
tabnum TEXT,
est_id TEXT,
est_value TEXT);
""".format(metatab)
    cur.execute(mdstring)
else:
    print('Block not executed because "create" not selected as an action in earlier block')
#Only run this block when creating initial table
# Seeds the shared rows that apply to both extracts: the suffix explanation
# and the three identifier columns
if action=='create':
    exstring="""
INSERT INTO {} VALUES('both','NOTE','Each variable has 4 values that are identified by a particular suffix: E for estimate, M for margin of error for the estimate, PC for percent total, and PM for margin of error for the percent total');
INSERT INTO {} VALUES('both','GEOID','Id');
INSERT INTO {} VALUES('both','GEOID2','Id2');
INSERT INTO {} VALUES('both','GEOLABEL','Geography');
""".format(metatab,metatab,metatab,metatab)
    cur.executescript(exstring)
    con.commit()
else:
    print('Block not executed because "create" not selected as an action in earlier block')
#Run when creating table or when appending records
#Keys and values - db ids and labels - are simplified and truncated to 1 entry for each 4-column group (E,M,PC,PM)
if action in ('create','append'):
    for mk, mv in vardict.items():
        # Only the estimate column of each group; mk[:-2] drops the '_E' suffix.
        # mv = [census_var, census_label, sqlite_type]; mv[1][11:] strips the
        # leading 'Estimate - ' prefix (11 chars) -- assumes every *_E label
        # starts with exactly that prefix; TODO confirm against the workbook
        if mk.endswith('_E'):
            cur.execute("INSERT INTO {} values(?,?,?)".format(metatab),(worksheet,mk[:-2],mv[1][11:]))
    con.commit()
else:
    print('Block not executed because action not specified in earlier block')
# Report how many metadata rows exist after this pass
cur.execute('SELECT COUNT(*) FROM {};'.format(metatab))
(total,) = cur.fetchone()
print(total, 'records in', metatab)
# Clear the action flag so a stray rerun can't silently create/append again
action = ''
con.close()