Cenpy (pronounced "sen-pie") is a package that exposes APIs from the US Census Bureau and makes it easy to pull down and work with Census data in Pandas. There are two core modules in the package, `base` and `explorer`, each of which does different things. First, let's look at `explorer`.
import cenpy as c
import pandas
On import, `explorer` requests all currently available APIs from the Census Bureau's API listing. In the future, it will also be able to read a JSON collection describing the databases from disk, if asked.
Explorer has two functions, `available` and `explain`. `available` provides a list of the identifiers of all the APIs that `cenpy` knows about. If run with `verbose=True`, `cenpy` will also include the title of each database, returning a dictionary. It's best not to process this output directly; instead, use it to explore the currently available APIs.
Also, beware that the US Census Bureau can change the names of its resources. This means that the index of the following table is not necessarily stable over time; sometimes, the same resource can change its identifier, as when the 2010 decennial census changed from `2010sf1` to `DECENNIALSF12010`. So, consult the table built by `cenpy.explorer.available()` if the keys appear to have changed.
Here, I'll just show the first five entries:
c.explorer.available().head()
 | title | temporal | spatial | publisher | programCode | modified | keyword | distribution | description | contactPoint | ... | c_isTimeseries | c_isCube | c_isAvailable | c_isAggregate | c_groupsLink | c_geographyLink | c_examplesLink | c_dataset | bureauCode | accessLevel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NONEMP2007 | 2007 Nonemployer Statistics: Non Employer Stat... | 2007/2007 | United States | U.S. Census Bureau | 006:007 | 2017-02-09 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | Nonemployer Statistics is an annual series tha... | {'fn': 'Nonemployer Statistics staff', 'hasEma... | ... | NaN | True | True | True | https://api.census.gov/data/2007/nonemp/groups... | https://api.census.gov/data/2007/nonemp/geogra... | https://api.census.gov/data/2007/nonemp/exampl... | (nonemp,) | 006:07 | public |
POPESTagesex | Vintage 2014 Population Estimates: National An... | April 1, 2010 - Current | United States | U.S. Census Bureau | 006:008 | 2015-11-17 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | Annual Estimates of the Resident Population by... | {'fn': 'Population Estimates Branch', 'hasEmai... | ... | NaN | True | True | True | https://api.census.gov/data/2014/pep/agesex/gr... | https://api.census.gov/data/2014/pep/agesex/ge... | https://api.census.gov/data/2014/pep/agesex/ex... | (pep, agesex) | 006:07 | public |
ZBPTotal2011 | 2011 County Business Patterns - Zip Code Busin... | 2011/2011 | United States | U.S. Census Bureau | 006:007 | 2017-05-23 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | ZIP Code Business Patterns (ZBP) is an annual ... | {'fn': 'CBP Staff', 'hasEmail': 'ewd.county.bu... | ... | NaN | NaN | True | True | https://api.census.gov/data/2011/zbp/groups.json | https://api.census.gov/data/2011/zbp/geography... | https://api.census.gov/data/2011/zbp/examples.... | (zbp,) | 006:07 | public |
ITMONTHLYIMPORTSUSDA | Time Series International Trade: Monthly U.S. ... | Time Series | United States | U.S. Census Bureau | 006:004 | 2017-10-27 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | The Census data API provides access to the mos... | {'fn': 'Jeff McHugh', 'hasEmail': 'Jeffrey.P.M... | ... | True | True | True | NaN | https://api.census.gov/data/timeseries/intltra... | https://api.census.gov/data/timeseries/intltra... | https://api.census.gov/data/timeseries/intltra... | (timeseries, intltrade, imports, usda) | 006:07 | public |
POPESTnatmonthly2015 | Vintage 2015 Population Estimates: National Mo... | April 1,2010 | United States | U.S. Census Bureau | 006:008 | 2016-12-21 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | Monthly Population Estimates by Universe, Age,... | {'fn': 'Population Estimates Branch', 'hasEmai... | ... | NaN | True | True | True | https://api.census.gov/data/2015/pep/natmonthl... | https://api.census.gov/data/2015/pep/natmonthl... | https://api.census.gov/data/2015/pep/natmonthl... | (pep, natmonthly) | 006:07 | public |
5 rows × 24 columns
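Since `available()` returns a plain dataframe, one way to track down a renamed identifier is to search its `title` column with ordinary pandas string methods. A minimal sketch, using a toy dataframe standing in for the real `c.explorer.available()` result:

```python
import pandas as pd

# Toy stand-in for c.explorer.available(); the real call queries the
# Census Bureau's API listing.
tables = pd.DataFrame(
    {"title": ["2010 Decennial: Summary File 1",
               "2007 Nonemployer Statistics"]},
    index=["DECENNIALSF12010", "NONEMP2007"],
)

# Case-insensitive substring match on the title column.
matches = tables[tables["title"].str.contains("decennial", case=False)]
print(list(matches.index))  # ['DECENNIALSF12010']
```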
The `explain` command provides the title and full description of the data source. If run in verbose mode, the function returns the full JSON listing of the API.
c.explorer.explain('DECENNIALSF12010')
{'Decennial SF1': 'Summary File 1 (SF 1) contains detailed tables focusing on age, sex, households, families, and housing units. These tables provide in-depth figures by race and Hispanic origin; some tables are repeated for each of nine race/Latino groups. Counts also are provided for over forty American Indian and Alaska Native tribes and for groups within race categories. The race categories include eighteen Asian groups and twelve Native Hawaiian and Other Pacific Islander groups. Counts of persons of Hispanic origin by country of origin (twenty-eight groups) are also shown. Summary File 1 presents data for the United States, the 50 states, and the District of Columbia in a hierarchical sequence down to the block level for many tabulations, but only to the census tract level for others. Summaries are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs) and Congressional districts. Geographic coverage for Puerto Rico is comparable to the 50 states. Data are presented in a hierarchical sequence down to the block level for many tabulations, but only to the census tract level for others. Geographic areas include barrios, barrios-pueblo, subbarrios, places, census tracts, block groups, and blocks. Summaries also are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs).'}
To actually connect to a database resource, you create a `Connection`. A `Connection` works like a very simplified connection from the `sqlalchemy` world. The `Connection` class has a method, `query`, that constructs a query string and requests it from the Census server. The result is then parsed into JSON and returned to the user.
conn = c.base.Connection('DECENNIALSF12010')
That may have taken longer than you'd expect. This is because, when the `Connection` constructor is called, it populates the connection object with a bit of metadata that makes it possible to construct queries without referring to the Census handbooks.

For instance, a connection's `variables` attribute represents all available search parameters for a given dataset.
conn.variables.head()
 | attributes | concept | group | label | limit | predicateOnly | predicateType | required | values |
---|---|---|---|---|---|---|---|---|---|
for | NaN | Census API Geography Specification | N/A | Census API FIPS 'for' clause | 0 | True | fips-for | NaN | NaN |
in | NaN | Census API Geography Specification | N/A | Census API FIPS 'in' clause | 0 | True | fips-in | NaN | NaN |
ucgid | NaN | Census API Geography Specification | N/A | Uniform Census Geography Identifier clause | 0 | True | ucgid | NaN | NaN |
PCT022021 | NaN | GROUP QUARTERS POPULATION BY SEX BY GROUP QUA... | PCT22 | Total!!Female!!Noninstitutionalized population... | 0 | NaN | int | NaN | NaN |
P028E001 | NaN | HOUSEHOLD TYPE BY HOUSEHOLD SIZE (NATIVE HAWAI... | P28E | Total | 0 | NaN | int | NaN | NaN |
This dataframe is populated just like the Census's own table describing the variables on the corresponding API website. Fortunately, this means you can modify and filter this dataframe just like a regular pandas dataframe, so working out the exact codes to use in your query is easy.
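For example, ordinary pandas string filtering works on the `label` column; a small sketch with a toy table standing in for `conn.variables`:

```python
import pandas as pd

# Toy stand-in for conn.variables, indexed by variable code.
variables = pd.DataFrame(
    {"label": ["Total",
               "Population in occupied housing units",
               "Census API FIPS 'for' clause"]},
    index=["P028E001", "H011A001", "for"],
)

# Keep only variables whose label mentions housing.
housing = variables[variables["label"].str.contains("housing")]
print(list(housing.index))  # ['H011A001']
```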
I've added a function, `varslike`, that globs variables matching a regular expression pattern. It can use the builtin Python `re` module or the `fnmatch` module, and it can also use any filtering function you want. So, you can extract the matching rows of the variables table with `df.loc` on the list of indexes that match your expression:
conn.variables.loc[conn.varslike('H011[AB]')]
 | attributes | concept | group | label | limit | predicateOnly | predicateType | required | values |
---|---|---|---|---|---|---|---|---|---|
H011A002 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units!!Owned wi... | 0 | NaN | int | NaN | NaN |
H011A001 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units | 0 | NaN | int | NaN | NaN |
H011A004 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units!!Renter o... | 0 | NaN | int | NaN | NaN |
H011A003 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units!!Owned fr... | 0 | NaN | int | NaN | NaN |
H011B001 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units | 0 | NaN | int | NaN | NaN |
H011B002 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units!!Owned wi... | 0 | NaN | int | NaN | NaN |
H011B003 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units!!Owned fr... | 0 | NaN | int | NaN | NaN |
H011B004 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units!!Renter o... | 0 | NaN | int | NaN | NaN |
Likewise, the different levels of geographic scale are determined from the metadata in the overall API listing and recorded. However, many Census products have multiple possible geographical indexing systems, like the deprecated FIPS code system and the newer Geographic Names Information System (GNIS). Thus, the `geographies` property is a dictionary of dataframes, where each key is the name of an identifier system and each value is the dataframe describing that system.
For the 2010 census, the following systems are available:
conn.geographies.keys()
dict_keys(['fips'])
For an explanation of the geographic hierarchies, the `geographies` tables show the levels of geography at which the data is summarized:
conn.geographies['fips'].head()
 | geoLevelDisplay | name | optionalWithWCFor | referenceDate | requires | wildcard |
---|---|---|---|---|---|---|
0 | 010 | us | NaN | 2010-01-01 | NaN | NaN |
1 | 020 | region | NaN | 2010-01-01 | NaN | NaN |
2 | 030 | division | NaN | 2010-01-01 | NaN | NaN |
3 | 040 | state | NaN | 2010-01-01 | NaN | NaN |
4 | 050 | county | state | 2010-01-01 | [state] | [state] |
Note that some geographies in the `fips` system have a `requires` field, which constrains queries to prevent pulling too much data. This requirement gets passed to the `query` method later.
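Concretely, the county row above requires a state; in cenpy terms, that requirement is satisfied through the `geo_filter` argument to `query` (a sketch using the argument names shown later in this walkthrough):

```python
# The county level `requires` [state], so a county-level query must
# name a containing state in its geo_filter; '04' is Arizona's FIPS code.
geo_unit = "county:*"
geo_filter = {"state": "04"}
print(geo_unit, geo_filter)
```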
So, let's just grab the housing information from the 2010 Census Short Form. Using the variables table above, we pick out a subset of the fields we want. Since the variables table is indexed by the identifiers, we can grab the index of the filtered dataframe to use as query parameters.

In addition, adding the `NAME` field smart-fills the table with the name of the geographic entity being pulled from the Census.
cols = conn.varslike('H00[012]*', engine='fnmatch')
cols.append('NAME')
cols
['H001001', 'H002001', 'H002006', 'H002002', 'H002003', 'H002004', 'H002005', 'NAME']
Now the query. The query is constructed just like the API query string, with three parts:

- `get=` : the columns to pull
- `for=` : the unit of aggregation to return
- `in=` : the containing geography that filters the results

To be specific, a full query tells the server what columns to pull, for what aggregation units, within what containing geography. It's structured using these heterogeneous datatypes so it's easy to change the smallest units quickly, while providing sufficient granularity to change the filters and columns as you go.
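As a rough sketch of how those three parts combine into a request URL (the endpoint shown is the 2010 SF1 one used below; cenpy assembles this for you):

```python
from urllib.parse import urlencode

base = "https://api.census.gov/data/2010/dec/sf1"
params = {
    "get": "H001001,NAME",   # columns to pull
    "for": "place:*",        # aggregation units to return
    "in": "state:04",        # containing geography filter
}
# safe= keeps , : * literal, matching the style of Census API URLs
url = base + "?" + urlencode(params, safe=",:*")
print(url)
```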
The query below grabs the housing variables we picked out, plus the place names, for all places in Arizona.
data = conn.query(cols, geo_unit = 'place:*', geo_filter = {'state':'04'})
Once constructed, the query executes as fast as your internet connection will move. This query returned:
data.shape
(451, 10)
451 rows and 10 columns. So, rather fast.
For validity and ease of use, the last executed query is stored on the object. If you care about your Census API key never being shown in plaintext, never print this property!
conn.last_query
'https://api.census.gov/data/2010/dec/sf1?get=H001001,H002001,H002006,H002002,H002003,H002004,H002005,NAME&for=place:*&in=state:04&key=<your API key>'
So, you have a dataframe with the information requested, plus the fields specified in the `geo_filter` and `geo_unit`. Sometimes, `pandas.DataFrame.infer_objects()` is not able to infer the types or structures of the data in the ways you might expect, so you may need to convert the final data to ensure the types are correct.
So, the following is a dataframe of the data requested. I've filtered it to only look at places where the total housing unit count (`H001001`) is larger than 40,000. Pretty neat!
data[data['H001001'].astype(int) > 40000]
 | H001001 | H002001 | H002006 | H002002 | H002003 | H002004 | H002005 | NAME | state | place |
---|---|---|---|---|---|---|---|---|---|---|
64 | 94404 | 94404 | 0 | 94394 | 94394 | 0 | 10 | Chandler city, Arizona | 04 | 12000 |
110 | 201173 | 201173 | 0 | 200979 | 200979 | 0 | 194 | Mesa city, Arizona | 04 | 46000 |
201 | 90505 | 90505 | 0 | 90493 | 90493 | 0 | 12 | Glendale city, Arizona | 04 | 27820 |
267 | 124001 | 124001 | 0 | 120049 | 120049 | 0 | 3952 | Scottsdale city, Arizona | 04 | 65000 |
305 | 52586 | 52586 | 0 | 51082 | 51082 | 0 | 1504 | Surprise city, Arizona | 04 | 71510 |
314 | 73462 | 73462 | 0 | 73462 | 73462 | 0 | 0 | Tempe city, Arizona | 04 | 73000 |
333 | 229762 | 229762 | 0 | 228577 | 228506 | 71 | 1185 | Tucson city, Arizona | 04 | 77000 |
413 | 74907 | 74907 | 0 | 74880 | 74880 | 0 | 27 | Gilbert town, Arizona | 04 | 27400 |
433 | 64818 | 64818 | 0 | 64133 | 60939 | 3194 | 685 | Peoria city, Arizona | 04 | 54050 |
435 | 590149 | 590149 | 0 | 587936 | 587936 | 0 | 2213 | Phoenix city, Arizona | 04 | 55000 |
And, just in case you're liable to forget your FIPS codes, the explorer module can look up FIPS code listings for you.
c.explorer.fips_table('place', in_state='AZ')
reading https://www2.census.gov/geo/docs/reference/codes/files/st04_az_places.txt
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | AZ | 4 | 730 | Aguila CDP | Census Designated Place | S | Maricopa County |
1 | AZ | 4 | 870 | Ajo CDP | Census Designated Place | S | Pima County |
2 | AZ | 4 | 940 | Ak Chin CDP | Census Designated Place | S | Pima County |
3 | AZ | 4 | 1090 | Ak-Chin Village CDP | Census Designated Place | S | Pinal County |
4 | AZ | 4 | 1170 | Alamo Lake CDP | Census Designated Place | S | La Paz County |
5 | AZ | 4 | 1560 | Ali Chuk CDP | Census Designated Place | S | Pima County |
6 | AZ | 4 | 1570 | Ali Chukson CDP | Census Designated Place | S | Pima County |
7 | AZ | 4 | 1620 | Ali Molina CDP | Census Designated Place | S | Pima County |
8 | AZ | 4 | 1920 | Alpine CDP | Census Designated Place | S | Apache County |
9 | AZ | 4 | 1990 | Amado CDP | Census Designated Place | S | Santa Cruz County |
10 | AZ | 4 | 2270 | Anegam CDP | Census Designated Place | S | Pima County |
11 | AZ | 4 | 2410 | Antares CDP | Census Designated Place | S | Mohave County |
12 | AZ | 4 | 2430 | Anthem CDP | Census Designated Place | S | Maricopa County |
13 | AZ | 4 | 2830 | Apache Junction city | Incorporated Place | A | Maricopa County, Pinal County |
14 | AZ | 4 | 3320 | Arivaca CDP | Census Designated Place | S | Pima County |
15 | AZ | 4 | 3380 | Arivaca Junction CDP | Census Designated Place | S | Pima County |
16 | AZ | 4 | 3530 | Arizona City CDP | Census Designated Place | S | Pinal County |
17 | AZ | 4 | 3915 | Arizona Village CDP | Census Designated Place | S | Mohave County |
18 | AZ | 4 | 4020 | Arlington CDP | Census Designated Place | S | Maricopa County |
19 | AZ | 4 | 4440 | Ash Fork CDP | Census Designated Place | S | Yavapai County |
20 | AZ | 4 | 4710 | Avenue B and C CDP | Census Designated Place | S | Yuma County |
21 | AZ | 4 | 4720 | Avondale city | Incorporated Place | A | Maricopa County |
22 | AZ | 4 | 4880 | Avra Valley CDP | Census Designated Place | S | Pima County |
23 | AZ | 4 | 4930 | Aztec CDP | Census Designated Place | S | Yuma County |
24 | AZ | 4 | 5140 | Bagdad CDP | Census Designated Place | S | Yavapai County |
25 | AZ | 4 | 5450 | Bear Flat CDP | Census Designated Place | S | Gila County |
26 | AZ | 4 | 5490 | Beaver Dam CDP | Census Designated Place | S | Mohave County |
27 | AZ | 4 | 5495 | Beaver Valley CDP | Census Designated Place | S | Gila County |
28 | AZ | 4 | 5770 | Benson city | Incorporated Place | A | Cochise County |
29 | AZ | 4 | 5970 | Beyerville CDP | Census Designated Place | S | Santa Cruz County |
... | ... | ... | ... | ... | ... | ... | ... |
421 | AZ | 4 | 82120 | Wheatfields CDP | Census Designated Place | S | Gila County |
422 | AZ | 4 | 82155 | Whetstone CDP | Census Designated Place | S | Cochise County |
423 | AZ | 4 | 82270 | Whispering Pines CDP | Census Designated Place | S | Gila County |
424 | AZ | 4 | 82390 | Whitecone CDP | Census Designated Place | S | Navajo County |
425 | AZ | 4 | 82425 | White Hills CDP | Census Designated Place | S | Mohave County |
426 | AZ | 4 | 82450 | White Mountain Lake CDP | Census Designated Place | S | Navajo County |
427 | AZ | 4 | 82530 | Whiteriver CDP | Census Designated Place | S | Navajo County |
428 | AZ | 4 | 82660 | Why CDP | Census Designated Place | S | Pima County |
429 | AZ | 4 | 82740 | Wickenburg town | Incorporated Place | A | Maricopa County, Yavapai County |
430 | AZ | 4 | 82810 | Wide Ruins CDP | Census Designated Place | S | Apache County |
431 | AZ | 4 | 82880 | Wikieup CDP | Census Designated Place | S | Mohave County |
432 | AZ | 4 | 82950 | Wilhoit CDP | Census Designated Place | S | Yavapai County |
433 | AZ | 4 | 83090 | Willcox city | Incorporated Place | A | Cochise County |
434 | AZ | 4 | 83160 | Williams city | Incorporated Place | A | Coconino County |
435 | AZ | 4 | 83388 | Williamson CDP | Census Designated Place | S | Yavapai County |
436 | AZ | 4 | 83475 | Willow Canyon CDP | Census Designated Place | S | Pima County |
437 | AZ | 4 | 83570 | Willow Valley CDP | Census Designated Place | S | Mohave County |
438 | AZ | 4 | 83720 | Window Rock CDP | Census Designated Place | S | Apache County |
439 | AZ | 4 | 83790 | Winkelman town | Incorporated Place | A | Gila County, Pinal County |
440 | AZ | 4 | 83930 | Winslow city | Incorporated Place | A | Navajo County |
441 | AZ | 4 | 83960 | Winslow West CDP | Census Designated Place | S | Coconino County, Navajo County |
442 | AZ | 4 | 84000 | Wintersburg CDP | Census Designated Place | S | Maricopa County |
443 | AZ | 4 | 84140 | Wittmann CDP | Census Designated Place | S | Maricopa County |
444 | AZ | 4 | 84350 | Woodruff CDP | Census Designated Place | S | Navajo County |
445 | AZ | 4 | 84980 | Yarnell CDP | Census Designated Place | S | Yavapai County |
446 | AZ | 4 | 85260 | York CDP | Census Designated Place | S | Greenlee County |
447 | AZ | 4 | 85330 | Young CDP | Census Designated Place | S | Gila County |
448 | AZ | 4 | 85400 | Youngtown town | Incorporated Place | A | Maricopa County |
449 | AZ | 4 | 85470 | Yucca CDP | Census Designated Place | S | Mohave County |
450 | AZ | 4 | 85540 | Yuma city | Incorporated Place | A | Yuma County |
451 rows × 7 columns
The Census TIGER geometry API is substantively different from every other API, in that it's an ArcGIS REST API. But, I've tried to expose a consistent interface. It works like this:
import cenpy.tiger as tiger
tiger.available()
[{'name': 'AIANNHA', 'type': 'MapServer'}, {'name': 'CBSA', 'type': 'MapServer'}, {'name': 'Hydro_LargeScale', 'type': 'MapServer'}, {'name': 'Hydro', 'type': 'MapServer'}, {'name': 'Labels', 'type': 'MapServer'}, {'name': 'Legislative', 'type': 'MapServer'}, {'name': 'Places_CouSub_ConCity_SubMCD', 'type': 'MapServer'}, {'name': 'PUMA_TAD_TAZ_UGA_ZCTA', 'type': 'MapServer'}, {'name': 'Region_Division', 'type': 'MapServer'}, {'name': 'School', 'type': 'MapServer'}, {'name': 'Special_Land_Use_Areas', 'type': 'MapServer'}, {'name': 'State_County', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2013', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2014', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2015', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2016', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2017', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2018', 'type': 'MapServer'}, {'name': 'tigerWMS_Census2010', 'type': 'MapServer'}, {'name': 'tigerWMS_Current', 'type': 'MapServer'}, {'name': 'tigerWMS_ECON2012', 'type': 'MapServer'}, {'name': 'tigerWMS_PhysicalFeatures', 'type': 'MapServer'}, {'name': 'Tracts_Blocks', 'type': 'MapServer'}, {'name': 'Transportation_LargeScale', 'type': 'MapServer'}, {'name': 'Transportation', 'type': 'MapServer'}, {'name': 'TribalTracts', 'type': 'MapServer'}, {'name': 'Urban', 'type': 'MapServer'}, {'name': 'USLandmass', 'type': 'MapServer'}]
In some cases, it makes quite a bit of sense to "attach" a map server to your connection. For the 2010 Census data we've been using, there is an obvious data product match in `tigerWMS_Census2010`. So, let's attach it to the connection.
conn.set_mapservice('tigerWMS_Census2010')
conn.mapservice
<cenpy.tiger.TigerConnection at 0x7f1b3f988dd8>
Neat! This is the same as calling:
tiger.TigerConnection('tigerWMS_Census2010')
but this attaches that object to the connection you've been using. The connection also updates with this information:
conn
Connection to Decennial SF1(ID: https://api.census.gov/data/id/DECENNIALSF12010) With MapServer: Census 2010 WMS
An ESRI MapServer is a big thing, and `cenpy` doesn't support all of its features. Since `cenpy` is designed to support retrieval of data from the US Census, we only support `GET` statements for defined geographic units, and ignore the various other functionalities of the service.
To work with a service, note that any map server is composed of layers:
conn.mapservice.layers
{0: (ESRILayer) Public Use Microdata Areas, 1: (ESRILayer) Public Use Microdata Areas Labels, 2: (ESRILayer) Traffic Analysis Districts, 3: (ESRILayer) Traffic Analysis Districts Labels, 4: (ESRILayer) Traffic Analysis Zones, 5: (ESRILayer) Traffic Analysis Zones Labels, 6: (ESRILayer) Urban Growth Areas, 7: (ESRILayer) Urban Growth Areas Labels, 8: (ESRILayer) ZIP Code Tabulation Areas, 9: (ESRILayer) ZIP Code Tabulation Areas Labels, 10: (ESRILayer) Tribal Census Tracts, 11: (ESRILayer) Tribal Census Tracts Labels, 12: (ESRILayer) Tribal Block Groups, 13: (ESRILayer) Tribal Block Groups Labels, 14: (ESRILayer) Census Tracts, 15: (ESRILayer) Census Tracts Labels, 16: (ESRILayer) Census Block Groups, 17: (ESRILayer) Census Block Groups Labels, 18: (ESRILayer) Census Blocks, 19: (ESRILayer) Census Blocks Labels, 20: (ESRILayer) Unified School Districts, 21: (ESRILayer) Unified School Districts Labels, 22: (ESRILayer) Secondary School Districts, 23: (ESRILayer) Secondary School Districts Labels, 24: (ESRILayer) Elementary School Districts, 25: (ESRILayer) Elementary School Districts Labels, 26: (ESRILayer) Estates, 27: (ESRILayer) Estates Labels, 28: (ESRILayer) County Subdivisions, 29: (ESRILayer) County Subdivisions Labels, 30: (ESRILayer) Subbarrios, 31: (ESRILayer) Subbarrios Labels, 32: (ESRILayer) Consolidated Cities, 33: (ESRILayer) Consolidated Cities Labels, 34: (ESRILayer) Incorporated Places, 35: (ESRILayer) Incorporated Places Labels, 36: (ESRILayer) Census Designated Places, 37: (ESRILayer) Census Designated Places Labels, 38: (ESRILayer) Alaska Native Regional Corporations, 39: (ESRILayer) Alaska Native Regional Corporations Labels, 40: (ESRILayer) Tribal Subdivisions, 41: (ESRILayer) Tribal Subdivisions Labels, 42: (ESRILayer) Federal American Indian Reservations, 43: (ESRILayer) Federal American Indian Reservations Labels, 44: (ESRILayer) Off-Reservation Trust Lands, 45: (ESRILayer) Off-Reservation Trust Lands Labels, 46: (ESRILayer) State American 
Indian Reservations, 47: (ESRILayer) State American Indian Reservations Labels, 48: (ESRILayer) Hawaiian Home Lands, 49: (ESRILayer) Hawaiian Home Lands Labels, 50: (ESRILayer) Alaska Native Village Statistical Areas, 51: (ESRILayer) Alaska Native Village Statistical Areas Labels, 52: (ESRILayer) Oklahoma Tribal Statistical Areas, 53: (ESRILayer) Oklahoma Tribal Statistical Areas Labels, 54: (ESRILayer) State Designated Tribal Statistical Areas, 55: (ESRILayer) State Designated Tribal Statistical Areas Labels, 56: (ESRILayer) Tribal Designated Statistical Areas, 57: (ESRILayer) Tribal Designated Statistical Areas Labels, 58: (ESRILayer) American Indian Joint-Use Areas, 59: (ESRILayer) American Indian Joint-Use Areas Labels, 60: (ESRILayer) 113th Congressional Districts, 61: (ESRILayer) 113th Congressional Districts Labels, 62: (ESRILayer) 111th Congressional Districts, 63: (ESRILayer) 111th Congressional Districts Labels, 64: (ESRILayer) 2013 State Legislative Districts - Upper, 65: (ESRILayer) 2013 State Legislative Districts - Upper Labels, 66: (ESRILayer) 2013 State Legislative Districts - Lower, 67: (ESRILayer) 2013 State Legislative Districts - Lower Labels, 68: (ESRILayer) 2010 State Legislative Districts - Upper, 69: (ESRILayer) 2010 State Legislative Districts - Upper Labels, 70: (ESRILayer) 2010 State Legislative Districts - Lower, 71: (ESRILayer) 2010 State Legislative Districts - Lower Labels, 72: (ESRILayer) Voting Districts, 73: (ESRILayer) Voting Districts Labels, 74: (ESRILayer) Census Divisions, 75: (ESRILayer) Census Divisions Labels, 76: (ESRILayer) Census Regions, 77: (ESRILayer) Census Regions Labels, 78: (ESRILayer) Urbanized Areas, 79: (ESRILayer) Urbanized Areas Labels, 80: (ESRILayer) Urban Clusters, 81: (ESRILayer) Urban Clusters Labels, 82: (ESRILayer) Combined New England City and Town Areas, 83: (ESRILayer) Combined New England City and Town Areas Labels, 84: (ESRILayer) New England City and Town Area Divisions, 85: (ESRILayer) New 
England City and Town Area Divisions Labels, 86: (ESRILayer) Metropolitan New England City and Town Areas, 87: (ESRILayer) Metropolitan New England City and Town Areas Labels, 88: (ESRILayer) Micropolitan New England City and Town Areas, 89: (ESRILayer) Micropolitan New England City and Town Areas Labels, 90: (ESRILayer) Combined Statistical Areas, 91: (ESRILayer) Combined Statistical Areas Labels, 92: (ESRILayer) Metropolitan Divisions, 93: (ESRILayer) Metropolitan Divisions Labels, 94: (ESRILayer) Metropolitan Statistical Areas, 95: (ESRILayer) Metropolitan Statistical Areas Labels, 96: (ESRILayer) Micropolitan Statistical Areas, 97: (ESRILayer) Micropolitan Statistical Areas Labels, 98: (ESRILayer) States, 99: (ESRILayer) States Labels, 100: (ESRILayer) Counties, 101: (ESRILayer) Counties Labels}
These layers are what actually implement query operations. For now, let's focus on the same "class" of units we were using before, Census Designated Places:
conn.mapservice.layers[36]
(ESRILayer) Census Designated Places
A query function is implemented both at the mapservice level and at the layer level. At the mapservice level, a layer ID is required to complete the query. Mapservice queries are driven by SQL, so to grab all of the geodata matching the CDPs we pulled before, you could start to construct a query like this.
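For instance, if you wanted only specific places pulled earlier, one way to write the SQL `where` clause is to build it from their FIPS place codes (a hypothetical helper, not part of cenpy):

```python
# Hypothetical sketch: assemble an ESRI-style SQL WHERE clause from a
# list of place FIPS codes (e.g. Chandler, Mesa, Tucson from earlier).
place_codes = ["12000", "46000", "77000"]
where = "PLACE IN ({})".format(", ".join("'{}'".format(c) for c in place_codes))
print(where)  # PLACE IN ('12000', '46000', '77000')
```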
First, just like the main connection, each layer has a set of variables:
conn.mapservice.layers[36].variables
 | alias | domain | length | name | type |
---|---|---|---|---|---|
0 | OBJECTID | None | NaN | OBJECTID | esriFieldTypeOID |
1 | AREALAND | None | NaN | AREALAND | esriFieldTypeDouble |
2 | AREAWATER | None | NaN | AREAWATER | esriFieldTypeDouble |
3 | PLACENS | None | 8.0 | PLACENS | esriFieldTypeString |
4 | PLACECC | None | 2.0 | PLACECC | esriFieldTypeString |
5 | PLACE | None | 5.0 | PLACE | esriFieldTypeString |
6 | FUNCSTAT | None | 1.0 | FUNCSTAT | esriFieldTypeString |
7 | GEOID | None | 7.0 | GEOID | esriFieldTypeString |
8 | LSADC | None | 2.0 | LSADC | esriFieldTypeString |
9 | MTFCC | None | 5.0 | MTFCC | esriFieldTypeString |
10 | BASENAME | None | 100.0 | BASENAME | esriFieldTypeString |
11 | NAME | None | 100.0 | NAME | esriFieldTypeString |
12 | OID | None | NaN | OID | esriFieldTypeDouble |
13 | CBSAPCI | None | 1.0 | CBSAPCI | esriFieldTypeString |
14 | NECTAPCI | None | 1.0 | NECTAPCI | esriFieldTypeString |
15 | STATE | None | 2.0 | STATE | esriFieldTypeString |
16 | UR | None | 1.0 | UR | esriFieldTypeString |
17 | CENTLON | None | 12.0 | CENTLON | esriFieldTypeString |
18 | CENTLAT | None | 11.0 | CENTLAT | esriFieldTypeString |
19 | INTPTLON | None | 12.0 | INTPTLON | esriFieldTypeString |
20 | INTPTLAT | None | 11.0 | INTPTLAT | esriFieldTypeString |
21 | HU100 | None | NaN | HU100 | esriFieldTypeDouble |
22 | POP100 | None | NaN | POP100 | esriFieldTypeDouble |
23 | STGEOMETRY | None | NaN | STGEOMETRY | esriFieldTypeGeometry |
24 | STGEOMETRY.AREA | None | NaN | STGEOMETRY.AREA | esriFieldTypeDouble |
25 | STGEOMETRY.LEN | None | NaN | STGEOMETRY.LEN | esriFieldTypeDouble |
Our prior query grabbed the places in AZ, so we could use a SQL query that focuses on those. I try to pack the geometries into containers that people are used to using. Without knowing whether GEOS is installed on a user's computer, I use `PySAL` as the target geometry type. If you do have GEOS, you can use Shapely or GeoPandas instead. To choose your backend, pass the `pkg` argument to the query function; it lets you choose which of the three types of Python objects to output. PySAL is the default. If you select Shapely, the result will just be a pandas dataframe with Shapely geometries instead of PySAL geometries. If you choose geopandas (or pass the `gpize` option), cenpy will try to convert the pandas dataframe into a GeoPandas dataframe.
geodata = conn.mapservice.query(layer=36, where='STATE = 04')
geodata.head()
 | AREALAND | AREAWATER | BASENAME | CBSAPCI | CENTLAT | CENTLON | FUNCSTAT | GEOID | HU100 | INTPTLAT | ... | OID | PLACE | PLACECC | PLACENS | POP100 | STATE | STGEOMETRY.AREA | STGEOMETRY.LEN | UR | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19032775 | 3749 | Copper Hill | N | +33.4379206 | -110.7516247 | S | 0415600 | 63 | +33.4309411 | ... | 280403717476730 | 15600 | U1 | 02582762 | 108 | 04 | 2.740956e+07 | 29887.334159 | R | POLYGON ((-12332768.2339 3951052.317500003, -1... |
1 | 370140896 | 37430 | Vicksburg | N | +33.7292771 | -113.8254052 | S | 0479940 | 687 | +33.7271570 | ... | 280403717388941 | 79940 | U1 | 02582892 | 597 | 04 | 5.365771e+08 | 139508.893591 | R | POLYGON ((-12681838.7724 3986891.478200004, -1... |
2 | 14935603 | 21306 | Sawmill | N | +35.8924354 | -109.1538097 | S | 0464590 | 243 | +35.8940886 | ... | 280401190808799 | 64590 | U1 | 02409289 | 748 | 04 | 2.283813e+07 | 26370.978055 | R | POLYGON ((-12154943.6963 4285660.0066, -121549... |
3 | 8693601 | 0 | Goodyear Village | N | +33.1973247 | -111.8723879 | S | 0428465 | 121 | +33.1973247 | ... | 280403861091191 | 28465 | U2 | 02612139 | 457 | 04 | 1.244920e+07 | 18140.700058 | M | POLYGON ((-12456882.7869 3922518.464100003, -1... |
4 | 23385412 | 55226 | Carrizo | N | +33.9866282 | -110.3314358 | S | 0410320 | 40 | +33.9793542 | ... | 280403717231648 | 10320 | U1 | 02582748 | 127 | 04 | 3.418072e+07 | 37640.185623 | R | POLYGON ((-12289756.4977 4027500.7016, -122896... |
5 rows × 26 columns
To join the geodata to the other data, use pandas functions:
import pandas as pd
newdata = pd.merge(data, geodata, left_on='place', right_on='PLACE')
newdata.head()
 | H001001 | H002001 | H002006 | H002002 | H002003 | H002004 | H002005 | NAME_x | state | place | ... | OID | PLACE | PLACECC | PLACENS | POP100 | STATE | STGEOMETRY.AREA | STGEOMETRY.LEN | UR | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 621 | 621 | 0 | 0 | 0 | 0 | 621 | Queen Valley CDP, Arizona | 04 | 58290 | ... | 280401260231706 | 58290 | U1 | 02409115 | 788 | 04 | 3.615206e+07 | 29036.707030 | R | POLYGON ((-12394427.7659 3930594.1774, -123943... |
1 | 226 | 226 | 0 | 148 | 0 | 148 | 78 | Rainbow City CDP, Arizona | 04 | 58620 | ... | 280403717231691 | 58620 | U2 | 02582847 | 968 | 04 | 8.203581e+06 | 15645.985013 | M | POLYGON ((-12244069.0863 4010211.943099998, -1... |
2 | 162 | 162 | 0 | 162 | 0 | 162 | 0 | Rancho Mesa Verde CDP, Arizona | 04 | 58870 | ... | 280403717389008 | 58870 | U2 | 02582848 | 625 | 04 | 4.109284e+05 | 2901.827817 | M | POLYGON ((-12763708.1353 3841881.7513, -127634... |
3 | 202 | 202 | 0 | 0 | 0 | 0 | 202 | Red Mesa CDP, Arizona | 04 | 59550 | ... | 280401190808794 | 59550 | U1 | 02409150 | 480 | 04 | 5.222080e+07 | 43104.992136 | R | POLYGON ((-12184877.6187 4433486.655900002, -1... |
4 | 64 | 64 | 0 | 0 | 0 | 0 | 64 | Red Rock CDP (Apache County), Arizona | 04 | 59760 | ... | 280403717231660 | 59760 | U1 | 02582849 | 169 | 04 | 4.694147e+06 | 13576.105484 | R | POLYGON ((-12142700.7787 4383404.183799997, -1... |
5 rows × 36 columns
So, that's how you get your geodata in addition to your regular data!
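One caveat when joining on FIPS-style codes like place or PLACE: both key columns need the same dtype and zero-padding, since pandas will refuse to merge an integer column against a string column (and older versions would silently match nothing). A minimal sketch with made-up codes, not the query results above:

```python
import pandas as pd

# Toy stand-ins for the tabular data and the geodata;
# the place codes here are invented for illustration.
data = pd.DataFrame({'place': ['58290', '58620'], 'H001001': [621, 226]})
geodata = pd.DataFrame({'PLACE': [58290, 58620],
                        'BASENAME': ['Queen Valley', 'Rainbow City']})

# Normalize the integer codes to zero-padded strings so the
# key columns agree before merging.
geodata['PLACE'] = geodata['PLACE'].astype(str).str.zfill(5)
merged = pd.merge(data, geodata, left_on='place', right_on='PLACE')
print(len(merged))  # 2
```

If a merge comes back with fewer rows than expected, checking the dtypes of the two key columns is usually the first thing to do.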
Next, we'll try the 2012 County Business Patterns dataset:
conn2 = c.base.Connection('CBP2012')
Alright, let's look at the available columns:
conn2.variables
attributes | concept | group | label | limit | predicateOnly | predicateType | required | values | |
---|---|---|---|---|---|---|---|---|---|
for | NaN | Census API Geography Specification | N/A | Census API FIPS 'for' clause | 0 | True | fips-for | NaN | NaN |
in | NaN | Census API Geography Specification | N/A | Census API FIPS 'in' clause | 0 | True | fips-in | NaN | NaN |
ucgid | NaN | Census API Geography Specification | N/A | Uniform Census Geography Identifier clause | 0 | True | ucgid | NaN | NaN |
PAYANN | PAYANN_F | NaN | N/A | Annual payroll ($1,000) | 0 | NaN | int | NaN | NaN |
GEOTYPE | NaN | NaN | N/A | Type of geography flag | 0 | NaN | string | NaN | NaN |
PAYQTR1 | PAYQTR1_F | NaN | N/A | First-quarter payroll ($1,000) | 0 | NaN | int | NaN | NaN |
YEAR | YEAR_TTL | NaN | N/A | Year | 0 | NaN | string | NaN | {'item': {'1982': '1982', '1983': '1983', '198... |
COUNTY | NaN | NaN | N/A | FIPS county code | 0 | NaN | NaN | NaN | NaN |
ST | NaN | NaN | N/A | FIPS state code | 0 | NaN | NaN | NaN | NaN |
EMPSZES | EMPSZES_TTL | NaN | N/A | Employment size of establishment | 14 | NaN | string | default displayed | {'item': {'001': 'All establishments', '204': ... |
CSA | NaN | NaN | N/A | FIPS Combined Statistical Area code | 0 | NaN | string | NaN | NaN |
MD | NaN | NaN | N/A | FIPS Metropolitan Division code | 0 | NaN | string | NaN | NaN |
GEO_ID | GEO_TTL,FOOTID_GEO | NaN | N/A | Geographic identifier code | 0 | NaN | string | NaN | NaN |
EMP | EMP_F | NaN | N/A | Paid employees for pay period including March ... | 0 | NaN | int | NaN | NaN |
ESTAB | ESTAB_F | NaN | N/A | Number of establishments | 0 | NaN | int | NaN | NaN |
MSA | NaN | NaN | N/A | FIPS Metropolitan Statistical Area or Micropol... | 0 | NaN | NaN | NaN | NaN |
NAICS2012 | NAICS2012_TTL,FOOTID_NAICS | NaN | N/A | 2012 NAICS code | 2016 | NaN | string | default displayed | {'item': {'111331': 'Apple orchards', '111332'... |
LFO | LFO_TTL | NaN | N/A | Legal form of organization code | 8 | NaN | string | default displayed | {'item': {'001': 'All establishments', '002': ... |
PAYQTR1_N | PAYQTR1_N_F | NaN | N/A | Noise range for first-quarter payroll (%) | 0 | NaN | int | NaN | NaN |
PAYANN_N | PAYANN_N_F | NaN | N/A | Noise range for annual payroll | 0 | NaN | int | NaN | NaN |
EMP_N | EMP_N_F | NaN | N/A | Noise range for number of paid employees for p... | 0 | NaN | int | NaN | NaN |
To show the required predicates, we can filter the variables dataframe by the required field. Note that "required" means the query will fail if these are not passed as keyword arguments. They don't have to specify a single value, though, so they can be left as a wildcard, like we did with place:* in the prior query:
conn2.variables[~ conn2.variables.required.isnull()]
attributes | concept | group | label | limit | predicateOnly | predicateType | required | values | |
---|---|---|---|---|---|---|---|---|---|
EMPSZES | EMPSZES_TTL | NaN | N/A | Employment size of establishment | 14 | NaN | string | default displayed | {'item': {'001': 'All establishments', '204': ... |
NAICS2012 | NAICS2012_TTL,FOOTID_NAICS | NaN | N/A | 2012 NAICS code | 2016 | NaN | string | default displayed | {'item': {'111331': 'Apple orchards', '111332'... |
LFO | LFO_TTL | NaN | N/A | Legal form of organization code | 8 | NaN | string | default displayed | {'item': {'001': 'All establishments', '002': ... |
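The same filter can be written with notnull, which reads a bit more directly. A self-contained sketch using a toy variables table (the column and index names mirror the ones above, but the rows are abbreviated):

```python
import pandas as pd

# A toy stand-in for conn2.variables: most variables have no
# 'required' entry (NaN/None); required ones carry a marker string.
variables = pd.DataFrame(
    {'label': ['Annual payroll ($1,000)',
               '2012 NAICS code',
               'Legal form of organization code'],
     'required': [None, 'default displayed', 'default displayed']},
    index=['PAYANN', 'NAICS2012', 'LFO'])

# Keep only rows where 'required' is populated.
required = variables[variables.required.notnull()]
print(list(required.index))  # ['NAICS2012', 'LFO']
```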
Like before, geographies are shown within a given hierarchy. Here, the only geography is the fips geography.
conn2.geographies.keys()
dict_keys(['fips'])
conn2.geographies['fips']
geoLevelDisplay | limit | name | optionalWithWCFor | referenceDate | requires | wildcard | |
---|---|---|---|---|---|---|---|
0 | 01 | 1 | us | NaN | 2012-01-01 | NaN | NaN |
1 | 02 | 51 | state | NaN | 2012-01-01 | NaN | NaN |
2 | 03 | 3304 | county | state | 2012-01-01 | [state] | [state] |
3 | 08,09 | 917 | metropolitan statistical area/micropolitan sta... | NaN | 2012-01-01 | NaN | NaN |
Now, let's look at how additional arguments get passed to the query. Any "extra" required predicates beyond get, for, and in are added at the end of the query as keyword arguments; these are caught and introduced into the request following the API specifications.
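Under the hood, the request is just a URL in the Census API's parameter format; a rough sketch of how the pieces fit together (the endpoint and predicate values here are illustrative, not what cenpy literally emits):

```python
from urllib.parse import urlencode

base = 'https://api.census.gov/data/2012/cbp'  # illustrative endpoint

params = {
    'get': ','.join(['ESTAB', 'EMP']),  # the requested columns
    'for': 'county:*',                  # the geo_unit
    'in': 'state:06',                   # the geo_filter
    'NAICS2012': '00',                  # an "extra" required predicate
}
# safe= keeps the Census API's separator characters unescaped
url = base + '?' + urlencode(params, safe=':,*')
print(url)
```

So a keyword argument like NAICS2012='00' simply becomes one more key=value pair appended to the query string.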
Here, we query for the establishment-count columns in the dataset for every county in California (state FIPS 06). This dataset's required predicates, shown above, are NAICS2012, LFO, and EMPSZES; just like the other arguments, they map to keywords in the API string, and a wildcard (or the default value) represents a slice across all possible values.
cols = conn2.varslike('ESTAB*', engine='fnmatch')
data2 = conn2.query(cols=cols, geo_unit='county:*', geo_filter={'state':'06'})
data2.head()
ESTAB | state | county | |
---|---|---|---|
0 | 635 | 06 | 999 |
1 | 36700 | 06 | 001 |
2 | 43 | 06 | 003 |
3 | 801 | 06 | 005 |
4 | 4615 | 06 | 007 |
And so you get the table of establishment counts by county in California. Since we're using counties as our unit of analysis, we can grab the geodata for counties.
conn2.set_mapservice('State_County')
But, there are quite a few layers in this MapService:
len(conn2.mapservice.layers)
71
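With that many layers, it helps to scan the layer names for the one you want before picking an index. A sketch with hypothetical names (in practice you'd inspect conn2.mapservice.layers, whose entries print their titles; the list below is invented for illustration):

```python
# Hypothetical layer names standing in for the MapService's layers.
layer_names = ['Census Tracts', 'Counties', 'States', 'County Subdivisions']

# Pair each name with its index, so a match can be passed as layer=...
matches = [(i, name) for i, name in enumerate(layer_names)
           if 'count' in name.lower()]
print(matches)  # [(1, 'Counties'), (3, 'County Subdivisions')]
```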
Oof. If you ever want to check out the web interface to see what it looks like, you can retrieve the URLs of most objects using:
conn2.mapservice._baseurl
'http://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/State_County/MapServer'
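That base URL follows the ArcGIS REST conventions, where each layer's query endpoint lives at service/layer/query. A sketch of the kind of URL a call like mapservice.query(layer=1, where=...) resolves to (the parameter set here is a minimal, illustrative subset):

```python
from urllib.parse import urlencode

baseurl = ('http://tigerweb.geo.census.gov/arcgis/rest/services/'
           'TIGERweb/State_County/MapServer')
layer = 1

# ArcGIS REST layer queries: <service>/<layer>/query?<params>
params = {'where': "STATE = '06'", 'outFields': '*', 'f': 'json'}
query_url = '{}/{}/query?{}'.format(baseurl, layer, urlencode(params))
print(query_url)
```

Pasting a URL like this into a browser is a handy way to debug a where clause before running it through cenpy.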
Anyway, we know counties don't really change all that much. So, let's just pick a counties layer and pull it down for California:
geodata2 = conn2.mapservice.query(layer=1, where='STATE = 06')
newdata2 = pd.merge(data2, geodata2, left_on='county', right_on='COUNTY')
newdata2.head()
ESTAB | state | county | AREALAND | AREAWATER | BASENAME | CENTLAT | CENTLON | COUNTY | COUNTYCC | ... | INTPTLON | LSADC | MTFCC | NAME | OBJECTID | OID | STATE | STGEOMETRY.AREA | STGEOMETRY.LEN | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 36700 | 06 | 001 | 1909598011 | 216923745 | Alameda | +37.6505688 | -121.9177578 | 001 | H1 | ... | -121.9124880 | 06 | G4020 | Alameda County | 2098 | 27590141293924 | 06 | 3.398122e+09 | 309844.151899 | POLYGON ((-13612245.2954 4538149.388899997, -1... |
1 | 43 | 06 | 003 | 1912292631 | 12557304 | Alpine | +38.5971043 | -119.8206026 | 003 | H1 | ... | -119.7983522 | 06 | G4020 | Alpine County | 1317 | 27590289634197 | 06 | 3.156005e+09 | 275565.367754 | POLYGON ((-13366502.0648 4678945.273900002, -1... |
2 | 801 | 06 | 005 | 1539933575 | 29470567 | Amador | +38.4466174 | -120.6516693 | 005 | H1 | ... | -120.6538563 | 06 | G4020 | Amador County | 2724 | 27590143912562 | 06 | 2.562527e+09 | 359598.495341 | POLYGON ((-13472696.4062 4647651.505999997, -1... |
3 | 4615 | 06 | 007 | 4238438258 | 105310959 | Butte | +39.6665788 | -121.6007017 | 007 | H1 | ... | -121.6019188 | 06 | G4020 | Butte County | 2237 | 27590417130535 | 06 | 7.340033e+09 | 525974.965501 | POLYGON ((-13565003.3072 4798393.384000003, -1... |
4 | 891 | 06 | 009 | 2641829200 | 43797659 | Calaveras | +38.2044678 | -120.5546688 | 009 | H1 | ... | -120.5614415 | 06 | G4020 | Calaveras County | 347 | 27590202403841 | 06 | 4.356212e+09 | 367005.819167 | POLYGON ((-13428574.0355 4627724.500200003, -1... |
5 rows × 24 columns
And that's all there is to it! Geodata and tabular data from the Census APIs in one place.
File an issue if you have concerns!