Cenpy (pronounced "sen-pie") is a package that exposes APIs from the US Census Bureau and makes it easy to pull down and work with Census data in Pandas. There are two core modules in the package, `base` and `explorer`, each of which does different things. First, let's look at `explorer`.
import cenpy as c
import pandas
On import, `explorer` requests all currently available APIs from the Census Bureau's API listing. In the future, it will also be able to read a JSON collection describing the databases from disk, if asked.
Explorer has two functions, `available` and `explain`. `available` provides a list of the identifiers of all the APIs that `cenpy` knows about. If run with `verbose=True`, `cenpy` will also include the title of each database, returning a dictionary. It's best not to process this output directly; instead, use it to explore the currently available APIs.
Also, beware that the US Census Bureau can change the names of its resources. This means that the index of the following table is not necessarily stable over time; sometimes, the same resource can change its identifier, as when the 2010 decennial census changed from `2010sf1` to `DECENNIALSF12010`. So, consult the table built by `cenpy.explorer.available()` if the keys appear to have changed.
Here, I'll just show the first five entries:
c.explorer.available().head()
 | title | temporal | spatial | publisher | programCode | modified | keyword | distribution | description | contactPoint | ... | c_isTimeseries | c_isCube | c_isAvailable | c_isAggregate | c_groupsLink | c_geographyLink | c_examplesLink | c_dataset | bureauCode | accessLevel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NONEMP2007 | 2007 Nonemployer Statistics: Non Employer Stat... | 2007/2007 | United States | U.S. Census Bureau | 006:007 | 2017-02-09 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | Nonemployer Statistics is an annual series tha... | {'fn': 'Nonemployer Statistics staff', 'hasEma... | ... | NaN | True | True | True | https://api.census.gov/data/2007/nonemp/groups... | https://api.census.gov/data/2007/nonemp/geogra... | https://api.census.gov/data/2007/nonemp/exampl... | (nonemp,) | 006:07 | public |
POPESTagesex | Vintage 2014 Population Estimates: National An... | April 1, 2010 - Current | United States | U.S. Census Bureau | 006:008 | 2015-11-17 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | Annual Estimates of the Resident Population by... | {'fn': 'Population Estimates Branch', 'hasEmai... | ... | NaN | True | True | True | https://api.census.gov/data/2014/pep/agesex/gr... | https://api.census.gov/data/2014/pep/agesex/ge... | https://api.census.gov/data/2014/pep/agesex/ex... | (pep, agesex) | 006:07 | public |
ZBPTotal2011 | 2011 County Business Patterns - Zip Code Busin... | 2011/2011 | United States | U.S. Census Bureau | 006:007 | 2017-05-23 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | ZIP Code Business Patterns (ZBP) is an annual ... | {'fn': 'CBP Staff', 'hasEmail': 'ewd.county.bu... | ... | NaN | NaN | True | True | https://api.census.gov/data/2011/zbp/groups.json | https://api.census.gov/data/2011/zbp/geography... | https://api.census.gov/data/2011/zbp/examples.... | (zbp,) | 006:07 | public |
ITMONTHLYIMPORTSUSDA | Time Series International Trade: Monthly U.S. ... | Time Series | United States | U.S. Census Bureau | 006:004 | 2017-10-27 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | The Census data API provides access to the mos... | {'fn': 'Jeff McHugh', 'hasEmail': 'Jeffrey.P.M... | ... | True | True | True | NaN | https://api.census.gov/data/timeseries/intltra... | https://api.census.gov/data/timeseries/intltra... | https://api.census.gov/data/timeseries/intltra... | (timeseries, intltrade, imports, usda) | 006:07 | public |
POPESTnatmonthly2015 | Vintage 2015 Population Estimates: National Mo... | April 1,2010 | United States | U.S. Census Bureau | 006:008 | 2016-12-21 | NaN | {'@type': 'dcat:Distribution', 'accessURL': 'h... | Monthly Population Estimates by Universe, Age,... | {'fn': 'Population Estimates Branch', 'hasEmai... | ... | NaN | True | True | True | https://api.census.gov/data/2015/pep/natmonthl... | https://api.census.gov/data/2015/pep/natmonthl... | https://api.census.gov/data/2015/pep/natmonthl... | (pep, natmonthly) | 006:07 | public |
5 rows × 24 columns
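Since `available()` returns a plain dataframe, one way to track down a renamed identifier is to search its `title` column with ordinary pandas string methods. A minimal sketch, using a toy dataframe standing in for the real `c.explorer.available()` result:

```python
import pandas as pd

# Toy stand-in for c.explorer.available(); the real call queries the
# Census Bureau's API listing.
tables = pd.DataFrame(
    {"title": ["2010 Decennial: Summary File 1",
               "2007 Nonemployer Statistics"]},
    index=["DECENNIALSF12010", "NONEMP2007"],
)

# Case-insensitive substring match on the title column.
matches = tables[tables["title"].str.contains("decennial", case=False)]
print(list(matches.index))  # ['DECENNIALSF12010']
```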
The `explain` command provides the title and full description of the data source. If run in verbose mode, the function returns the full JSON listing of the API.
c.explorer.explain('DECENNIALSF12010')
{'Decennial SF1': 'Summary File 1 (SF 1) contains detailed tables focusing on age, sex, households, families, and housing units. These tables provide in-depth figures by race and Hispanic origin; some tables are repeated for each of nine race/Latino groups. Counts also are provided for over forty American Indian and Alaska Native tribes and for groups within race categories. The race categories include eighteen Asian groups and twelve Native Hawaiian and Other Pacific Islander groups. Counts of persons of Hispanic origin by country of origin (twenty-eight groups) are also shown. Summary File 1 presents data for the United States, the 50 states, and the District of Columbia in a hierarchical sequence down to the block level for many tabulations, but only to the census tract level for others. Summaries are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs) and Congressional districts. Geographic coverage for Puerto Rico is comparable to the 50 states. Data are presented in a hierarchical sequence down to the block level for many tabulations, but only to the census tract level for others. Geographic areas include barrios, barrios-pueblo, subbarrios, places, census tracts, block groups, and blocks. Summaries also are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs).'}
To actually connect to a database resource, you create a `Connection`. A `Connection` works like a very simplified connection from the `sqlalchemy` world. The `Connection` class has a method, `query`, that constructs a query string and requests it from the Census server. The result is then parsed into JSON and returned to the user.
conn = c.base.Connection('DECENNIALSF12010')
That may have taken longer than you'd expect. This is because, when the `Connection` constructor is called, it populates the connection object with a bit of metadata that makes it possible to construct queries without referring to the Census handbooks.

For instance, a connection's `variables` attribute represents all available search parameters for a given dataset.
conn.variables.head()
 | attributes | concept | group | label | limit | predicateOnly | predicateType | required | values |
---|---|---|---|---|---|---|---|---|---|
for | NaN | Census API Geography Specification | N/A | Census API FIPS 'for' clause | 0 | True | fips-for | NaN | NaN |
in | NaN | Census API Geography Specification | N/A | Census API FIPS 'in' clause | 0 | True | fips-in | NaN | NaN |
ucgid | NaN | Census API Geography Specification | N/A | Uniform Census Geography Identifier clause | 0 | True | ucgid | NaN | NaN |
PCT022021 | NaN | GROUP QUARTERS POPULATION BY SEX BY GROUP QUA... | PCT22 | Total!!Female!!Noninstitutionalized population... | 0 | NaN | int | NaN | NaN |
P028E001 | NaN | HOUSEHOLD TYPE BY HOUSEHOLD SIZE (NATIVE HAWAI... | P28E | Total | 0 | NaN | int | NaN | NaN |
This dataframe is populated just like the Census's own table describing the variables on the corresponding API website. Fortunately, this means you can modify and filter this dataframe just like a regular pandas dataframe, so working out the exact codes to use in your query is easy.
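For example, ordinary pandas string filtering works on the `label` column; a small sketch with a toy table standing in for `conn.variables`:

```python
import pandas as pd

# Toy stand-in for conn.variables, indexed by variable code.
variables = pd.DataFrame(
    {"label": ["Total",
               "Population in occupied housing units",
               "Census API FIPS 'for' clause"]},
    index=["P028E001", "H011A001", "for"],
)

# Keep only variables whose label mentions housing.
housing = variables[variables["label"].str.contains("housing")]
print(list(housing.index))  # ['H011A001']
```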
I've added a function, `varslike`, that globs variables matching a regular expression pattern. It can use the builtin Python `re` module or the `fnmatch` module, and it can also use any filtering function you want. So, you can extract the matching rows of the variables table with `df.loc` on the list of indexes that match your expression:
conn.variables.loc[conn.varslike('H011[AB]')]
 | attributes | concept | group | label | limit | predicateOnly | predicateType | required | values |
---|---|---|---|---|---|---|---|---|---|
H011A002 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units!!Owned wi... | 0 | NaN | int | NaN | NaN |
H011A001 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units | 0 | NaN | int | NaN | NaN |
H011A004 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units!!Renter o... | 0 | NaN | int | NaN | NaN |
H011A003 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11A | Population in occupied housing units!!Owned fr... | 0 | NaN | int | NaN | NaN |
H011B001 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units | 0 | NaN | int | NaN | NaN |
H011B002 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units!!Owned wi... | 0 | NaN | int | NaN | NaN |
H011B003 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units!!Owned fr... | 0 | NaN | int | NaN | NaN |
H011B004 | NaN | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | H11B | Population in occupied housing units!!Renter o... | 0 | NaN | int | NaN | NaN |
Likewise, the different levels of geographic scale are determined from the metadata in the overall API listing and recorded. However, many Census products have multiple possible geographical indexing systems, like the deprecated FIPS code system and the newer Geographic Names Information System (GNIS). Thus, the `geographies` property is a dictionary of dataframes, where each key is the name of an identifier system and each value is the dataframe describing that system.
For the 2010 census, the following systems are available:
conn.geographies.keys()
dict_keys(['fips'])
For an explanation of the geographic hierarchies, the `geographies` tables show the levels of geography at which the data is summarized:
conn.geographies['fips'].head()
 | geoLevelDisplay | name | optionalWithWCFor | referenceDate | requires | wildcard |
---|---|---|---|---|---|---|
0 | 010 | us | NaN | 2010-01-01 | NaN | NaN |
1 | 020 | region | NaN | 2010-01-01 | NaN | NaN |
2 | 030 | division | NaN | 2010-01-01 | NaN | NaN |
3 | 040 | state | NaN | 2010-01-01 | NaN | NaN |
4 | 050 | county | state | 2010-01-01 | [state] | [state] |
Note that some geographies in the `fips` system have a `requires` field, which constrains queries to prevent pulling too much data. This requirement gets passed to the `query` method later.
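Concretely, the county row above requires a state; in cenpy terms, that requirement is satisfied through the `geo_filter` argument to `query` (a sketch using the argument names shown later in this walkthrough):

```python
# The county level `requires` [state], so a county-level query must
# name a containing state in its geo_filter; '04' is Arizona's FIPS code.
geo_unit = "county:*"
geo_filter = {"state": "04"}
print(geo_unit, geo_filter)
```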
So, let's just grab the housing information from the 2010 Census Short Form. Using the variables table above, we pick out a subset of the fields we want. Since the variables table is indexed by the identifiers, we can grab the index of the filtered dataframe to use as query parameters.

In addition, adding the `NAME` field smart-fills the table with the name of the geographic entity being pulled from the Census.
cols = conn.varslike('H00[012]*', engine='fnmatch')
cols.append('NAME')
cols
['H001001', 'H002001', 'H002006', 'H002002', 'H002003', 'H002004', 'H002005', 'NAME']
Now the query. The query is constructed just like the API query string, with three parts:

- `get=` : the columns to pull
- `for=` : the unit of aggregation to return
- `in=` : the containing geography that filters the results

To be specific, a full query tells the server what columns to pull, for what aggregation units, within what containing geography. It's structured using these heterogeneous datatypes so it's easy to change the smallest units quickly, while providing sufficient granularity to change the filters and columns as you go.
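As a rough sketch of how those three parts combine into a request URL (the endpoint shown is the 2010 SF1 one used below; cenpy assembles this for you):

```python
from urllib.parse import urlencode

base = "https://api.census.gov/data/2010/dec/sf1"
params = {
    "get": "H001001,NAME",   # columns to pull
    "for": "place:*",        # aggregation units to return
    "in": "state:04",        # containing geography filter
}
# safe= keeps , : * literal, matching the style of Census API URLs
url = base + "?" + urlencode(params, safe=",:*")
print(url)
```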
The query below grabs the housing variables we picked out, plus the place names, for all places in Arizona.
data = conn.query(cols, geo_unit = 'place:*', geo_filter = {'state':'04'})
Once constructed, the query executes as fast as your internet connection will move. This query returned:
data.shape
(451, 10)
451 rows and 10 columns. So, rather fast.
For validity and ease of use, the last executed query is stored on the object. If you care about your Census API key never being shown in plaintext, never print this property!
conn.last_query
'https://api.census.gov/data/2010/dec/sf1?get=H001001,H002001,H002006,H002002,H002003,H002004,H002005,NAME&for=place:*&in=state:04&key=<your API key>'
So, you have a dataframe with the information requested, plus the fields specified in the `geo_filter` and `geo_unit`. Sometimes, `pandas.DataFrame.infer_objects()` is not able to infer the types or structures of the data in the ways you might expect, so you may need to convert the final data to ensure the types are correct.
So, the following is a dataframe of the data requested. I've filtered it to only look at places where the total housing unit count (`H001001`) is larger than 40,000. Pretty neat!
data[data['H001001'].astype(int) > 40000]
 | H001001 | H002001 | H002006 | H002002 | H002003 | H002004 | H002005 | NAME | state | place |
---|---|---|---|---|---|---|---|---|---|---|
64 | 94404 | 94404 | 0 | 94394 | 94394 | 0 | 10 | Chandler city, Arizona | 04 | 12000 |
110 | 201173 | 201173 | 0 | 200979 | 200979 | 0 | 194 | Mesa city, Arizona | 04 | 46000 |
201 | 90505 | 90505 | 0 | 90493 | 90493 | 0 | 12 | Glendale city, Arizona | 04 | 27820 |
267 | 124001 | 124001 | 0 | 120049 | 120049 | 0 | 3952 | Scottsdale city, Arizona | 04 | 65000 |
305 | 52586 | 52586 | 0 | 51082 | 51082 | 0 | 1504 | Surprise city, Arizona | 04 | 71510 |
314 | 73462 | 73462 | 0 | 73462 | 73462 | 0 | 0 | Tempe city, Arizona | 04 | 73000 |
333 | 229762 | 229762 | 0 | 228577 | 228506 | 71 | 1185 | Tucson city, Arizona | 04 | 77000 |
413 | 74907 | 74907 | 0 | 74880 | 74880 | 0 | 27 | Gilbert town, Arizona | 04 | 27400 |
433 | 64818 | 64818 | 0 | 64133 | 60939 | 3194 | 685 | Peoria city, Arizona | 04 | 54050 |
435 | 590149 | 590149 | 0 | 587936 | 587936 | 0 | 2213 | Phoenix city, Arizona | 04 | 55000 |
And, just in case you're liable to forget your FIPS codes, the explorer module can look up FIPS code listings for you.
c.explorer.fips_table('place', in_state='AZ')
reading https://www2.census.gov/geo/docs/reference/codes/files/st04_az_places.txt
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | AZ | 4 | 730 | Aguila CDP | Census Designated Place | S | Maricopa County |
1 | AZ | 4 | 870 | Ajo CDP | Census Designated Place | S | Pima County |
2 | AZ | 4 | 940 | Ak Chin CDP | Census Designated Place | S | Pima County |
3 | AZ | 4 | 1090 | Ak-Chin Village CDP | Census Designated Place | S | Pinal County |
4 | AZ | 4 | 1170 | Alamo Lake CDP | Census Designated Place | S | La Paz County |
5 | AZ | 4 | 1560 | Ali Chuk CDP | Census Designated Place | S | Pima County |
6 | AZ | 4 | 1570 | Ali Chukson CDP | Census Designated Place | S | Pima County |
7 | AZ | 4 | 1620 | Ali Molina CDP | Census Designated Place | S | Pima County |
8 | AZ | 4 | 1920 | Alpine CDP | Census Designated Place | S | Apache County |
9 | AZ | 4 | 1990 | Amado CDP | Census Designated Place | S | Santa Cruz County |
10 | AZ | 4 | 2270 | Anegam CDP | Census Designated Place | S | Pima County |
11 | AZ | 4 | 2410 | Antares CDP | Census Designated Place | S | Mohave County |
12 | AZ | 4 | 2430 | Anthem CDP | Census Designated Place | S | Maricopa County |
13 | AZ | 4 | 2830 | Apache Junction city | Incorporated Place | A | Maricopa County, Pinal County |
14 | AZ | 4 | 3320 | Arivaca CDP | Census Designated Place | S | Pima County |
15 | AZ | 4 | 3380 | Arivaca Junction CDP | Census Designated Place | S | Pima County |
16 | AZ | 4 | 3530 | Arizona City CDP | Census Designated Place | S | Pinal County |
17 | AZ | 4 | 3915 | Arizona Village CDP | Census Designated Place | S | Mohave County |
18 | AZ | 4 | 4020 | Arlington CDP | Census Designated Place | S | Maricopa County |
19 | AZ | 4 | 4440 | Ash Fork CDP | Census Designated Place | S | Yavapai County |
20 | AZ | 4 | 4710 | Avenue B and C CDP | Census Designated Place | S | Yuma County |
21 | AZ | 4 | 4720 | Avondale city | Incorporated Place | A | Maricopa County |
22 | AZ | 4 | 4880 | Avra Valley CDP | Census Designated Place | S | Pima County |
23 | AZ | 4 | 4930 | Aztec CDP | Census Designated Place | S | Yuma County |
24 | AZ | 4 | 5140 | Bagdad CDP | Census Designated Place | S | Yavapai County |
25 | AZ | 4 | 5450 | Bear Flat CDP | Census Designated Place | S | Gila County |
26 | AZ | 4 | 5490 | Beaver Dam CDP | Census Designated Place | S | Mohave County |
27 | AZ | 4 | 5495 | Beaver Valley CDP | Census Designated Place | S | Gila County |
28 | AZ | 4 | 5770 | Benson city | Incorporated Place | A | Cochise County |
29 | AZ | 4 | 5970 | Beyerville CDP | Census Designated Place | S | Santa Cruz County |
... | ... | ... | ... | ... | ... | ... | ... |
421 | AZ | 4 | 82120 | Wheatfields CDP | Census Designated Place | S | Gila County |
422 | AZ | 4 | 82155 | Whetstone CDP | Census Designated Place | S | Cochise County |
423 | AZ | 4 | 82270 | Whispering Pines CDP | Census Designated Place | S | Gila County |
424 | AZ | 4 | 82390 | Whitecone CDP | Census Designated Place | S | Navajo County |
425 | AZ | 4 | 82425 | White Hills CDP | Census Designated Place | S | Mohave County |
426 | AZ | 4 | 82450 | White Mountain Lake CDP | Census Designated Place | S | Navajo County |
427 | AZ | 4 | 82530 | Whiteriver CDP | Census Designated Place | S | Navajo County |
428 | AZ | 4 | 82660 | Why CDP | Census Designated Place | S | Pima County |
429 | AZ | 4 | 82740 | Wickenburg town | Incorporated Place | A | Maricopa County, Yavapai County |
430 | AZ | 4 | 82810 | Wide Ruins CDP | Census Designated Place | S | Apache County |
431 | AZ | 4 | 82880 | Wikieup CDP | Census Designated Place | S | Mohave County |
432 | AZ | 4 | 82950 | Wilhoit CDP | Census Designated Place | S | Yavapai County |
433 | AZ | 4 | 83090 | Willcox city | Incorporated Place | A | Cochise County |
434 | AZ | 4 | 83160 | Williams city | Incorporated Place | A | Coconino County |
435 | AZ | 4 | 83388 | Williamson CDP | Census Designated Place | S | Yavapai County |
436 | AZ | 4 | 83475 | Willow Canyon CDP | Census Designated Place | S | Pima County |
437 | AZ | 4 | 83570 | Willow Valley CDP | Census Designated Place | S | Mohave County |
438 | AZ | 4 | 83720 | Window Rock CDP | Census Designated Place | S | Apache County |
439 | AZ | 4 | 83790 | Winkelman town | Incorporated Place | A | Gila County, Pinal County |
440 | AZ | 4 | 83930 | Winslow city | Incorporated Place | A | Navajo County |
441 | AZ | 4 | 83960 | Winslow West CDP | Census Designated Place | S | Coconino County, Navajo County |
442 | AZ | 4 | 84000 | Wintersburg CDP | Census Designated Place | S | Maricopa County |
443 | AZ | 4 | 84140 | Wittmann CDP | Census Designated Place | S | Maricopa County |
444 | AZ | 4 | 84350 | Woodruff CDP | Census Designated Place | S | Navajo County |
445 | AZ | 4 | 84980 | Yarnell CDP | Census Designated Place | S | Yavapai County |
446 | AZ | 4 | 85260 | York CDP | Census Designated Place | S | Greenlee County |
447 | AZ | 4 | 85330 | Young CDP | Census Designated Place | S | Gila County |
448 | AZ | 4 | 85400 | Youngtown town | Incorporated Place | A | Maricopa County |
449 | AZ | 4 | 85470 | Yucca CDP | Census Designated Place | S | Mohave County |
450 | AZ | 4 | 85540 | Yuma city | Incorporated Place | A | Yuma County |
451 rows × 7 columns
The Census TIGER geometry API is substantively different from every other API, in that it's an ArcGIS REST API. But, I've tried to expose a consistent interface. It works like this:
import cenpy.tiger as tiger
tiger.available()
[{'name': 'AIANNHA', 'type': 'MapServer'}, {'name': 'CBSA', 'type': 'MapServer'}, {'name': 'Hydro_LargeScale', 'type': 'MapServer'}, {'name': 'Hydro', 'type': 'MapServer'}, {'name': 'Labels', 'type': 'MapServer'}, {'name': 'Legislative', 'type': 'MapServer'}, {'name': 'Places_CouSub_ConCity_SubMCD', 'type': 'MapServer'}, {'name': 'PUMA_TAD_TAZ_UGA_ZCTA', 'type': 'MapServer'}, {'name': 'Region_Division', 'type': 'MapServer'}, {'name': 'School', 'type': 'MapServer'}, {'name': 'Special_Land_Use_Areas', 'type': 'MapServer'}, {'name': 'State_County', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2013', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2014', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2015', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2016', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2017', 'type': 'MapServer'}, {'name': 'tigerWMS_ACS2018', 'type': 'MapServer'}, {'name': 'tigerWMS_Census2010', 'type': 'MapServer'}, {'name': 'tigerWMS_Current', 'type': 'MapServer'}, {'name': 'tigerWMS_ECON2012', 'type': 'MapServer'}, {'name': 'tigerWMS_PhysicalFeatures', 'type': 'MapServer'}, {'name': 'Tracts_Blocks', 'type': 'MapServer'}, {'name': 'Transportation_LargeScale', 'type': 'MapServer'}, {'name': 'Transportation', 'type': 'MapServer'}, {'name': 'TribalTracts', 'type': 'MapServer'}, {'name': 'Urban', 'type': 'MapServer'}, {'name': 'USLandmass', 'type': 'MapServer'}]
In some cases, it makes quite a bit of sense to "attach" a map server to your connection. For the 2010 Census data we've been using, there is an obvious data product match in `tigerWMS_Census2010`. So, let's attach it to the connection.
conn.set_mapservice('tigerWMS_Census2010')
conn.mapservice
<cenpy.tiger.TigerConnection at 0x7f1b3f988dd8>
Neat! This is the same as calling:
tiger.TigerConnection('tigerWMS_Census2010')
but this attaches that object to the connection you've been using. The connection also updates with this information:
conn
Connection to Decennial SF1(ID: https://api.census.gov/data/id/DECENNIALSF12010) With MapServer: Census 2010 WMS
An ESRI MapServer is a big thing, and `cenpy` doesn't support all of its features. Since `cenpy` is designed to support retrieval of data from the US Census, we only support `GET` statements for defined geographic units, and ignore the various other functionalities of the service.
To work with a service, note that any map server is composed of layers:
conn.mapservice.layers
{0: (ESRILayer) Public Use Microdata Areas, 1: (ESRILayer) Public Use Microdata Areas Labels, 2: (ESRILayer) Traffic Analysis Districts, 3: (ESRILayer) Traffic Analysis Districts Labels, 4: (ESRILayer) Traffic Analysis Zones, 5: (ESRILayer) Traffic Analysis Zones Labels, 6: (ESRILayer) Urban Growth Areas, 7: (ESRILayer) Urban Growth Areas Labels, 8: (ESRILayer) ZIP Code Tabulation Areas, 9: (ESRILayer) ZIP Code Tabulation Areas Labels, 10: (ESRILayer) Tribal Census Tracts, 11: (ESRILayer) Tribal Census Tracts Labels, 12: (ESRILayer) Tribal Block Groups, 13: (ESRILayer) Tribal Block Groups Labels, 14: (ESRILayer) Census Tracts, 15: (ESRILayer) Census Tracts Labels, 16: (ESRILayer) Census Block Groups, 17: (ESRILayer) Census Block Groups Labels, 18: (ESRILayer) Census Blocks, 19: (ESRILayer) Census Blocks Labels, 20: (ESRILayer) Unified School Districts, 21: (ESRILayer) Unified School Districts Labels, 22: (ESRILayer) Secondary School Districts, 23: (ESRILayer) Secondary School Districts Labels, 24: (ESRILayer) Elementary School Districts, 25: (ESRILayer) Elementary School Districts Labels, 26: (ESRILayer) Estates, 27: (ESRILayer) Estates Labels, 28: (ESRILayer) County Subdivisions, 29: (ESRILayer) County Subdivisions Labels, 30: (ESRILayer) Subbarrios, 31: (ESRILayer) Subbarrios Labels, 32: (ESRILayer) Consolidated Cities, 33: (ESRILayer) Consolidated Cities Labels, 34: (ESRILayer) Incorporated Places, 35: (ESRILayer) Incorporated Places Labels, 36: (ESRILayer) Census Designated Places, 37: (ESRILayer) Census Designated Places Labels, 38: (ESRILayer) Alaska Native Regional Corporations, 39: (ESRILayer) Alaska Native Regional Corporations Labels, 40: (ESRILayer) Tribal Subdivisions, 41: (ESRILayer) Tribal Subdivisions Labels, 42: (ESRILayer) Federal American Indian Reservations, 43: (ESRILayer) Federal American Indian Reservations Labels, 44: (ESRILayer) Off-Reservation Trust Lands, 45: (ESRILayer) Off-Reservation Trust Lands Labels, 46: (ESRILayer) State American 
Indian Reservations, 47: (ESRILayer) State American Indian Reservations Labels, 48: (ESRILayer) Hawaiian Home Lands, 49: (ESRILayer) Hawaiian Home Lands Labels, 50: (ESRILayer) Alaska Native Village Statistical Areas, 51: (ESRILayer) Alaska Native Village Statistical Areas Labels, 52: (ESRILayer) Oklahoma Tribal Statistical Areas, 53: (ESRILayer) Oklahoma Tribal Statistical Areas Labels, 54: (ESRILayer) State Designated Tribal Statistical Areas, 55: (ESRILayer) State Designated Tribal Statistical Areas Labels, 56: (ESRILayer) Tribal Designated Statistical Areas, 57: (ESRILayer) Tribal Designated Statistical Areas Labels, 58: (ESRILayer) American Indian Joint-Use Areas, 59: (ESRILayer) American Indian Joint-Use Areas Labels, 60: (ESRILayer) 113th Congressional Districts, 61: (ESRILayer) 113th Congressional Districts Labels, 62: (ESRILayer) 111th Congressional Districts, 63: (ESRILayer) 111th Congressional Districts Labels, 64: (ESRILayer) 2013 State Legislative Districts - Upper, 65: (ESRILayer) 2013 State Legislative Districts - Upper Labels, 66: (ESRILayer) 2013 State Legislative Districts - Lower, 67: (ESRILayer) 2013 State Legislative Districts - Lower Labels, 68: (ESRILayer) 2010 State Legislative Districts - Upper, 69: (ESRILayer) 2010 State Legislative Districts - Upper Labels, 70: (ESRILayer) 2010 State Legislative Districts - Lower, 71: (ESRILayer) 2010 State Legislative Districts - Lower Labels, 72: (ESRILayer) Voting Districts, 73: (ESRILayer) Voting Districts Labels, 74: (ESRILayer) Census Divisions, 75: (ESRILayer) Census Divisions Labels, 76: (ESRILayer) Census Regions, 77: (ESRILayer) Census Regions Labels, 78: (ESRILayer) Urbanized Areas, 79: (ESRILayer) Urbanized Areas Labels, 80: (ESRILayer) Urban Clusters, 81: (ESRILayer) Urban Clusters Labels, 82: (ESRILayer) Combined New England City and Town Areas, 83: (ESRILayer) Combined New England City and Town Areas Labels, 84: (ESRILayer) New England City and Town Area Divisions, 85: (ESRILayer) New 
England City and Town Area Divisions Labels, 86: (ESRILayer) Metropolitan New England City and Town Areas, 87: (ESRILayer) Metropolitan New England City and Town Areas Labels, 88: (ESRILayer) Micropolitan New England City and Town Areas, 89: (ESRILayer) Micropolitan New England City and Town Areas Labels, 90: (ESRILayer) Combined Statistical Areas, 91: (ESRILayer) Combined Statistical Areas Labels, 92: (ESRILayer) Metropolitan Divisions, 93: (ESRILayer) Metropolitan Divisions Labels, 94: (ESRILayer) Metropolitan Statistical Areas, 95: (ESRILayer) Metropolitan Statistical Areas Labels, 96: (ESRILayer) Micropolitan Statistical Areas, 97: (ESRILayer) Micropolitan Statistical Areas Labels, 98: (ESRILayer) States, 99: (ESRILayer) States Labels, 100: (ESRILayer) Counties, 101: (ESRILayer) Counties Labels}
These layers are what actually implement query operations. For now, let's focus on the same "class" of units we were using before, Census Designated Places:
conn.mapservice.layers[36]
(ESRILayer) Census Designated Places
A query function is implemented both at the mapservice level and at the layer level. At the mapservice level, a layer ID is required to complete the query. Mapservice queries are driven by SQL, so to grab all of the geodata matching the CDPs we pulled before, you could start to construct a query like this.
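For instance, if you wanted only specific places pulled earlier, one way to write the SQL `where` clause is to build it from their FIPS place codes (a hypothetical helper, not part of cenpy):

```python
# Hypothetical sketch: assemble an ESRI-style SQL WHERE clause from a
# list of place FIPS codes (e.g. Chandler, Mesa, Tucson from earlier).
place_codes = ["12000", "46000", "77000"]
where = "PLACE IN ({})".format(", ".join("'{}'".format(c) for c in place_codes))
print(where)  # PLACE IN ('12000', '46000', '77000')
```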
First, just like the main connection, each layer has a set of variables:
conn.mapservice.layers[36].variables
 | alias | domain | length | name | type |
---|---|---|---|---|---|
0 | OBJECTID | None | NaN | OBJECTID | esriFieldTypeOID |
1 | AREALAND | None | NaN | AREALAND | esriFieldTypeDouble |
2 | AREAWATER | None | NaN | AREAWATER | esriFieldTypeDouble |
3 | PLACENS | None | 8.0 | PLACENS | esriFieldTypeString |
4 | PLACECC | None | 2.0 | PLACECC | esriFieldTypeString |
5 | PLACE | None | 5.0 | PLACE | esriFieldTypeString |
6 | FUNCSTAT | None | 1.0 | FUNCSTAT | esriFieldTypeString |
7 | GEOID | None | 7.0 | GEOID | esriFieldTypeString |
8 | LSADC | None | 2.0 | LSADC | esriFieldTypeString |
9 | MTFCC | None | 5.0 | MTFCC | esriFieldTypeString |
10 | BASENAME | None | 100.0 | BASENAME | esriFieldTypeString |
11 | NAME | None | 100.0 | NAME | esriFieldTypeString |
12 | OID | None | NaN | OID | esriFieldTypeDouble |
13 | CBSAPCI | None | 1.0 | CBSAPCI | esriFieldTypeString |
14 | NECTAPCI | None | 1.0 | NECTAPCI | esriFieldTypeString |
15 | STATE | None | 2.0 | STATE | esriFieldTypeString |
16 | UR | None | 1.0 | UR | esriFieldTypeString |
17 | CENTLON | None | 12.0 | CENTLON | esriFieldTypeString |
18 | CENTLAT | None | 11.0 | CENTLAT | esriFieldTypeString |
19 | INTPTLON | None | 12.0 | INTPTLON | esriFieldTypeString |
20 | INTPTLAT | None | 11.0 | INTPTLAT | esriFieldTypeString |
21 | HU100 | None | NaN | HU100 | esriFieldTypeDouble |
22 | POP100 | None | NaN | POP100 | esriFieldTypeDouble |
23 | STGEOMETRY | None | NaN | STGEOMETRY | esriFieldTypeGeometry |
24 | STGEOMETRY.AREA | None | NaN | STGEOMETRY.AREA | esriFieldTypeDouble |
25 | STGEOMETRY.LEN | None | NaN | STGEOMETRY.LEN | esriFieldTypeDouble |
Our prior query grabbed the places in AZ, so we could use a SQL query that focuses on those. I try to pack the geometries into containers that people are used to using. Without knowing whether GEOS is installed on a user's computer, I use `PySAL` as the target geometry type. If you do have GEOS, you can use Shapely or GeoPandas instead. To choose your backend, pass the `pkg` argument to the query function; it lets you choose which of the three types of Python objects to output. PySAL is the default. If you select Shapely, the result will just be a pandas dataframe with Shapely geometries instead of PySAL geometries. If you choose geopandas (or pass the `gpize` option), cenpy will try to convert the pandas dataframe into a GeoPandas dataframe.
geodata = conn.mapservice.query(layer=36, where='STATE = 04')
geodata.head()
 | AREALAND | AREAWATER | BASENAME | CBSAPCI | CENTLAT | CENTLON | FUNCSTAT | GEOID | HU100 | INTPTLAT | ... | OID | PLACE | PLACECC | PLACENS | POP100 | STATE | STGEOMETRY.AREA | STGEOMETRY.LEN | UR | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19032775 | 3749 | Copper Hill | N | +33.4379206 | -110.7516247 | S | 0415600 | 63 | +33.4309411 | ... | 280403717476730 | 15600 | U1 | 02582762 | 108 | 04 | 2.740956e+07 | 29887.334159 | R | POLYGON ((-12332768.2339 3951052.317500003, -1... |
1 | 370140896 | 37430 | Vicksburg | N | +33.7292771 | -113.8254052 | S | 0479940 | 687 | +33.7271570 | ... | 280403717388941 | 79940 | U1 | 02582892 | 597 | 04 | 5.365771e+08 | 139508.893591 | R | POLYGON ((-12681838.7724 3986891.478200004, -1... |
2 | 14935603 | 21306 | Sawmill | N | +35.8924354 | -109.1538097 | S | 0464590 | 243 | +35.8940886 | ... | 280401190808799 | 64590 | U1 | 02409289 | 748 | 04 | 2.283813e+07 | 26370.978055 | R | POLYGON ((-12154943.6963 4285660.0066, -121549... |
3 | 8693601 | 0 | Goodyear Village | N | +33.1973247 | -111.8723879 | S | 0428465 | 121 | +33.1973247 | ... | 280403861091191 | 28465 | U2 | 02612139 | 457 | 04 | 1.244920e+07 | 18140.700058 | M | POLYGON ((-12456882.7869 3922518.464100003, -1... |
4 | 23385412 | 55226 | Carrizo | N | +33.9866282 | -110.3314358 | S | 0410320 | 40 | +33.9793542 | ... | 280403717231648 | 10320 | U1 | 02582748 | 127 | 04 | 3.418072e+07 | 37640.185623 | R | POLYGON ((-12289756.4977 4027500.7016, -122896... |
5 rows × 26 columns
To join the geodata to the other data, use pandas functions:
import pandas as pd
newdata = pd.merge(data, geodata, left_on='place', right_on='PLACE')
newdata.head()
 | H001001 | H002001 | H002006 | H002002 | H002003 | H002004 | H002005 | NAME_x | state | place | ... | OID | PLACE | PLACECC | PLACENS | POP100 | STATE | STGEOMETRY.AREA | STGEOMETRY.LEN | UR | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 621 | 621 | 0 | 0 | 0 | 0 | 621 | Queen Valley CDP, Arizona | 04 | 58290 | ... | 280401260231706 | 58290 | U1 | 02409115 | 788 | 04 | 3.615206e+07 | 29036.707030 | R | POLYGON ((-12394427.7659 3930594.1774, -123943... |
1 | 226 | 226 | 0 | 148 | 0 | 148 | 78 | Rainbow City CDP, Arizona | 04 | 58620 | ... | 280403717231691 | 58620 | U2 | 02582847 | 968 | 04 | 8.203581e+06 | 15645.985013 | M | POLYGON ((-12244069.0863 4010211.943099998, -1... |
2 | 162 | 162 | 0 | 162 | 0 | 162 | 0 | Rancho Mesa Verde CDP, Arizona | 04 | 58870 | ... | 280403717389008 | 58870 | U2 | 02582848 | 625 | 04 | 4.109284e+05 | 2901.827817 | M | POLYGON ((-12763708.1353 3841881.7513, -127634... |
3 | 202 | 202 | 0 | 0 | 0 | 0 | 202 | Red Mesa CDP, Arizona | 04 | 59550 | ... | 280401190808794 | 59550 | U1 | 02409150 | 480 | 04 | 5.222080e+07 | 43104.992136 | R | POLYGON ((-12184877.6187 4433486.655900002, -1... |
4 | 64 | 64 | 0 | 0 | 0 | 0 | 64 | Red Rock CDP (Apache County), Arizona | 04 | 59760 | ... | 280403717231660 | 59760 | U1 | 02582849 | 169 | 04 | 4.694147e+06 | 13576.105484 | R | POLYGON ((-12142700.7787 4383404.183799997, -1... |
5 rows × 36 columns
So, that's how you get your geodata in addition to your regular data!
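One caveat when joining on FIPS-style codes like place or PLACE: both key columns need the same dtype and zero-padding, since pandas will refuse to merge an integer column against a string column (and older versions would silently match nothing). A minimal sketch with made-up codes, not the query results above:

```python
import pandas as pd

# Toy stand-ins for the tabular data and the geodata;
# the place codes here are invented for illustration.
data = pd.DataFrame({'place': ['58290', '58620'], 'H001001': [621, 226]})
geodata = pd.DataFrame({'PLACE': [58290, 58620],
                        'BASENAME': ['Queen Valley', 'Rainbow City']})

# Normalize the integer codes to zero-padded strings so the
# key columns agree before merging.
geodata['PLACE'] = geodata['PLACE'].astype(str).str.zfill(5)
merged = pd.merge(data, geodata, left_on='place', right_on='PLACE')
print(len(merged))  # 2
```

If a merge comes back with fewer rows than expected, checking the dtypes of the two key columns is usually the first thing to do.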
Next, we'll try the 2012 County Business Patterns dataset:
conn2 = c.base.Connection('CBP2012')
Alright, let's look at the available columns:
conn2.variables
attributes | concept | group | label | limit | predicateOnly | predicateType | required | values | |
---|---|---|---|---|---|---|---|---|---|
for | NaN | Census API Geography Specification | N/A | Census API FIPS 'for' clause | 0 | True | fips-for | NaN | NaN |
in | NaN | Census API Geography Specification | N/A | Census API FIPS 'in' clause | 0 | True | fips-in | NaN | NaN |
ucgid | NaN | Census API Geography Specification | N/A | Uniform Census Geography Identifier clause | 0 | True | ucgid | NaN | NaN |
PAYANN | PAYANN_F | NaN | N/A | Annual payroll ($1,000) | 0 | NaN | int | NaN | NaN |
GEOTYPE | NaN | NaN | N/A | Type of geography flag | 0 | NaN | string | NaN | NaN |
PAYQTR1 | PAYQTR1_F | NaN | N/A | First-quarter payroll ($1,000) | 0 | NaN | int | NaN | NaN |
YEAR | YEAR_TTL | NaN | N/A | Year | 0 | NaN | string | NaN | {'item': {'1982': '1982', '1983': '1983', '198... |
COUNTY | NaN | NaN | N/A | FIPS county code | 0 | NaN | NaN | NaN | NaN |
ST | NaN | NaN | N/A | FIPS state code | 0 | NaN | NaN | NaN | NaN |
EMPSZES | EMPSZES_TTL | NaN | N/A | Employment size of establishment | 14 | NaN | string | default displayed | {'item': {'001': 'All establishments', '204': ... |
CSA | NaN | NaN | N/A | FIPS Combined Statistical Area code | 0 | NaN | string | NaN | NaN |
MD | NaN | NaN | N/A | FIPS Metropolitan Division code | 0 | NaN | string | NaN | NaN |
GEO_ID | GEO_TTL,FOOTID_GEO | NaN | N/A | Geographic identifier code | 0 | NaN | string | NaN | NaN |
EMP | EMP_F | NaN | N/A | Paid employees for pay period including March ... | 0 | NaN | int | NaN | NaN |
ESTAB | ESTAB_F | NaN | N/A | Number of establishments | 0 | NaN | int | NaN | NaN |
MSA | NaN | NaN | N/A | FIPS Metropolitan Statistical Area or Micropol... | 0 | NaN | NaN | NaN | NaN |
NAICS2012 | NAICS2012_TTL,FOOTID_NAICS | NaN | N/A | 2012 NAICS code | 2016 | NaN | string | default displayed | {'item': {'111331': 'Apple orchards', '111332'... |
LFO | LFO_TTL | NaN | N/A | Legal form of organization code | 8 | NaN | string | default displayed | {'item': {'001': 'All establishments', '002': ... |
PAYQTR1_N | PAYQTR1_N_F | NaN | N/A | Noise range for first-quarter payroll (%) | 0 | NaN | int | NaN | NaN |
PAYANN_N | PAYANN_N_F | NaN | N/A | Noise range for annual payroll | 0 | NaN | int | NaN | NaN |
EMP_N | EMP_N_F | NaN | N/A | Noise range for number of paid employees for p... | 0 | NaN | int | NaN | NaN |
To show the required predicates, we can filter the variables dataframe by the required field. Note that "required" means the query will fail if these are not passed as keyword arguments. They don't have to specify a single value, though, so they can be left as a wildcard, like we did with place:* in the prior query:
conn2.variables[~ conn2.variables.required.isnull()]
attributes | concept | group | label | limit | predicateOnly | predicateType | required | values | |
---|---|---|---|---|---|---|---|---|---|
EMPSZES | EMPSZES_TTL | NaN | N/A | Employment size of establishment | 14 | NaN | string | default displayed | {'item': {'001': 'All establishments', '204': ... |
NAICS2012 | NAICS2012_TTL,FOOTID_NAICS | NaN | N/A | 2012 NAICS code | 2016 | NaN | string | default displayed | {'item': {'111331': 'Apple orchards', '111332'... |
LFO | LFO_TTL | NaN | N/A | Legal form of organization code | 8 | NaN | string | default displayed | {'item': {'001': 'All establishments', '002': ... |
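The same filter can be written with notnull, which reads a bit more directly. A self-contained sketch using a toy variables table (the column and index names mirror the ones above, but the rows are abbreviated):

```python
import pandas as pd

# A toy stand-in for conn2.variables: most variables have no
# 'required' entry (NaN/None); required ones carry a marker string.
variables = pd.DataFrame(
    {'label': ['Annual payroll ($1,000)',
               '2012 NAICS code',
               'Legal form of organization code'],
     'required': [None, 'default displayed', 'default displayed']},
    index=['PAYANN', 'NAICS2012', 'LFO'])

# Keep only rows where 'required' is populated.
required = variables[variables.required.notnull()]
print(list(required.index))  # ['NAICS2012', 'LFO']
```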
Like before, geographies are shown within a given hierarchy. Here, the only geography is the fips geography.
conn2.geographies.keys()
dict_keys(['fips'])
conn2.geographies['fips']
geoLevelDisplay | limit | name | optionalWithWCFor | referenceDate | requires | wildcard | |
---|---|---|---|---|---|---|---|
0 | 01 | 1 | us | NaN | 2012-01-01 | NaN | NaN |
1 | 02 | 51 | state | NaN | 2012-01-01 | NaN | NaN |
2 | 03 | 3304 | county | state | 2012-01-01 | [state] | [state] |
3 | 08,09 | 917 | metropolitan statistical area/micropolitan sta... | NaN | 2012-01-01 | NaN | NaN |
Now, let's look at how additional arguments get passed to the query. Any "extra" required predicates beyond get, for, and in are added at the end of the query as keyword arguments; these are caught and introduced into the request following the API specifications.
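Under the hood, the request is just a URL in the Census API's parameter format; a rough sketch of how the pieces fit together (the endpoint and predicate values here are illustrative, not what cenpy literally emits):

```python
from urllib.parse import urlencode

base = 'https://api.census.gov/data/2012/cbp'  # illustrative endpoint

params = {
    'get': ','.join(['ESTAB', 'EMP']),  # the requested columns
    'for': 'county:*',                  # the geo_unit
    'in': 'state:06',                   # the geo_filter
    'NAICS2012': '00',                  # an "extra" required predicate
}
# safe= keeps the Census API's separator characters unescaped
url = base + '?' + urlencode(params, safe=':,*')
print(url)
```

So a keyword argument like NAICS2012='00' simply becomes one more key=value pair appended to the query string.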
Here, we query for the establishment-count columns in the dataset for every county in California (state FIPS 06). This dataset's required predicates, shown above, are NAICS2012, LFO, and EMPSZES; just like the other arguments, they map to keywords in the API string, and a wildcard (or the default value) represents a slice across all possible values.
cols = conn2.varslike('ESTAB*', engine='fnmatch')
data2 = conn2.query(cols=cols, geo_unit='county:*', geo_filter={'state':'06'})
data2.head()
ESTAB | state | county | |
---|---|---|---|
0 | 635 | 06 | 999 |
1 | 36700 | 06 | 001 |
2 | 43 | 06 | 003 |
3 | 801 | 06 | 005 |
4 | 4615 | 06 | 007 |
And so you get the table of establishment counts by county in California. Since we're using counties as our unit of analysis, we can grab the geodata for counties.
conn2.set_mapservice('State_County')
But, there are quite a few layers in this MapService:
len(conn2.mapservice.layers)
71
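With that many layers, it helps to scan the layer names for the one you want before picking an index. A sketch with hypothetical names (in practice you'd inspect conn2.mapservice.layers, whose entries print their titles; the list below is invented for illustration):

```python
# Hypothetical layer names standing in for the MapService's layers.
layer_names = ['Census Tracts', 'Counties', 'States', 'County Subdivisions']

# Pair each name with its index, so a match can be passed as layer=...
matches = [(i, name) for i, name in enumerate(layer_names)
           if 'count' in name.lower()]
print(matches)  # [(1, 'Counties'), (3, 'County Subdivisions')]
```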
Oof. If you ever want to check out the web interface to see what it looks like, you can retrieve the URLs of most objects using:
conn2.mapservice._baseurl
'http://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/State_County/MapServer'
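That base URL follows the ArcGIS REST conventions, where each layer's query endpoint lives at service/layer/query. A sketch of the kind of URL a call like mapservice.query(layer=1, where=...) resolves to (the parameter set here is a minimal, illustrative subset):

```python
from urllib.parse import urlencode

baseurl = ('http://tigerweb.geo.census.gov/arcgis/rest/services/'
           'TIGERweb/State_County/MapServer')
layer = 1

# ArcGIS REST layer queries: <service>/<layer>/query?<params>
params = {'where': "STATE = '06'", 'outFields': '*', 'f': 'json'}
query_url = '{}/{}/query?{}'.format(baseurl, layer, urlencode(params))
print(query_url)
```

Pasting a URL like this into a browser is a handy way to debug a where clause before running it through cenpy.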
Anyway, we know counties don't really change all that much. So, let's just pick a counties layer and pull it down for California:
geodata2 = conn2.mapservice.query(layer=1, where='STATE = 06')
newdata2 = pd.merge(data2, geodata2, left_on='county', right_on='COUNTY')
newdata2.head()
ESTAB | state | county | AREALAND | AREAWATER | BASENAME | CENTLAT | CENTLON | COUNTY | COUNTYCC | ... | INTPTLON | LSADC | MTFCC | NAME | OBJECTID | OID | STATE | STGEOMETRY.AREA | STGEOMETRY.LEN | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 36700 | 06 | 001 | 1909598011 | 216923745 | Alameda | +37.6505688 | -121.9177578 | 001 | H1 | ... | -121.9124880 | 06 | G4020 | Alameda County | 2098 | 27590141293924 | 06 | 3.398122e+09 | 309844.151899 | POLYGON ((-13612245.2954 4538149.388899997, -1... |
1 | 43 | 06 | 003 | 1912292631 | 12557304 | Alpine | +38.5971043 | -119.8206026 | 003 | H1 | ... | -119.7983522 | 06 | G4020 | Alpine County | 1317 | 27590289634197 | 06 | 3.156005e+09 | 275565.367754 | POLYGON ((-13366502.0648 4678945.273900002, -1... |
2 | 801 | 06 | 005 | 1539933575 | 29470567 | Amador | +38.4466174 | -120.6516693 | 005 | H1 | ... | -120.6538563 | 06 | G4020 | Amador County | 2724 | 27590143912562 | 06 | 2.562527e+09 | 359598.495341 | POLYGON ((-13472696.4062 4647651.505999997, -1... |
3 | 4615 | 06 | 007 | 4238438258 | 105310959 | Butte | +39.6665788 | -121.6007017 | 007 | H1 | ... | -121.6019188 | 06 | G4020 | Butte County | 2237 | 27590417130535 | 06 | 7.340033e+09 | 525974.965501 | POLYGON ((-13565003.3072 4798393.384000003, -1... |
4 | 891 | 06 | 009 | 2641829200 | 43797659 | Calaveras | +38.2044678 | -120.5546688 | 009 | H1 | ... | -120.5614415 | 06 | G4020 | Calaveras County | 347 | 27590202403841 | 06 | 4.356212e+09 | 367005.819167 | POLYGON ((-13428574.0355 4627724.500200003, -1... |
5 rows × 24 columns
And that's all there is to it! Geodata and tabular data from the Census APIs in one place.
File an issue if you have concerns!