The Data Observatory is a spatial data repository that enables data scientists to augment their data and broaden their analysis. It offers a wide range of datasets from around the globe.
This guide is intended for those who want to start augmenting their own data using CARTOframes and wish to explore CARTO's public Data Observatory catalog to find datasets that best fit their use cases and analyses.
Note: The catalog is public, and you don't need a CARTO account to search for available datasets.
In this guide we walk through the Data Observatory catalog looking for demographics data in the US.
The catalog comprises thousands of curated spatial datasets, so the easiest way to find what you are looking for is a faceted search. A faceted (or hierarchical) search lets you narrow down the results by applying multiple filters based on a faceted classification of the catalog datasets.
Datasets are organized in three main hierarchies: country, category, and geography (or spatial resolution).
For our analysis we are looking for demographic datasets in the US with a spatial resolution at the block group level.
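To make the idea of faceted search concrete, here is a minimal sketch in plain Python. It does not use CARTOframes: the `TOY_CATALOG` records and the `faceted_search` helper are hypothetical names invented for illustration, and the facet values are made up. It only shows how applying one filter per facet narrows a result set step by step.

```python
from typing import Dict, List

# Hypothetical toy records; these are NOT real Data Observatory entries.
TOY_CATALOG: List[Dict[str, str]] = [
    {"slug": "dataset_a", "country": "usa", "category": "demographics", "geography": "blockgroup"},
    {"slug": "dataset_b", "country": "usa", "category": "demographics", "geography": "county"},
    {"slug": "dataset_c", "country": "usa", "category": "financial", "geography": "blockgroup"},
    {"slug": "dataset_d", "country": "esp", "category": "demographics", "geography": "blockgroup"},
]


def faceted_search(records: List[Dict[str, str]], **facets: str) -> List[Dict[str, str]]:
    """Keep only the records whose fields match every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


# Each call narrows the previous result, mirroring a chain of catalog filters.
by_country = faceted_search(TOY_CATALOG, country="usa")
by_category = faceted_search(by_country, category="demographics")
by_geography = faceted_search(by_category, geography="blockgroup")

print([r["slug"] for r in by_geography])  # ['dataset_a']
```

The order of the filters does not change the final result here; chaining them simply makes each narrowing step visible.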
We can start by discovering which geographies (or spatial resolutions) are available for demographic data in the US, by filtering the catalog by country and category and listing the available geographies.
Let's start exploring the available categories of data for the US:
from cartoframes.data.observatory import Catalog
Catalog().country('usa').categories
[<Category.get('human_mobility')>, <Category.get('environmental')>, <Category.get('points_of_interest')>, <Category.get('road_traffic')>, <Category.get('demographics')>, <Category.get('financial')>]
For the US, the Data Observatory provides six different categories of datasets. Let's discover the available spatial resolutions for the demographics category (which at first sight should contain the population data we need).
from cartoframes.data.observatory import Catalog
geographies = Catalog().country('usa').category('demographics').geographies
geographies
[<Geography.get('ags_q17_4739be4f')>, <Geography.get('expn_grid_a4075de4')>, <Geography.get('mbi_blockgroups_1ab060a')>, <Geography.get('mbi_counties_141b61cd')>, <Geography.get('mbi_county_subd_e8e6ea23')>, <Geography.get('mbi_pc_5_digit_4b1682a6')>, <Geography.get('usct_blockgroup_f45b6b49')>, <Geography.get('usct_cbsa_6c8b51ef')>, <Geography.get('usct_censustract_bc698c5a')>, <Geography.get('usct_congression_b6336b2c')>, <Geography.get('usct_county_ec40c962')>, <Geography.get('usct_county_92f1b5df')>, <Geography.get('usct_place_12d6699f')>, <Geography.get('usct_puma_b859f0fa')>, <Geography.get('usct_schooldistr_515af763')>, <Geography.get('usct_schooldistr_da72a4cb')>, <Geography.get('usct_schooldistr_287be4f7')>, <Geography.get('usct_state_4c8090b5')>, <Geography.get('usct_zcta5_75071016')>]
Let's filter the geographies down to those that contain information at the blockgroup level. To do so, we convert the geographies to a pandas DataFrame and search for the string blockgroup in the id of the geographies:
df = geographies.to_dataframe()
df[df['id'].str.contains('blockgroup', case=False, na=False)]
| | id | slug | name | description | country_id | provider_id | provider_name | lang | geom_coverage | geom_type | update_frequency | version | is_public_data |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | carto-do.mbi.geography_usa_blockgroups_2019 | mbi_blockgroups_1ab060a | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | usa | mbi | Michael Bauer International | eng | None | MULTIPOLYGON | None | 2019 | False |
6 | carto-do-public-data.usa_carto.geography_usa_b... | usct_blockgroup_f45b6b49 | Census Block Groups (2015) - shoreline clipped | Shoreline clipped TIGER/Line boundaries. More ... | usa | usa_carto | CARTO shoreline-clipped USA Tiger geographies | eng | None | MULTIPOLYGON | None | 2015 | True |
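As a rough illustration of what the case-insensitive substring filter does, the sketch below applies the same idea without pandas. Note the assumptions: it matches against the geography slugs rather than the `id` column the guide uses, `GEOGRAPHY_SLUGS` is only a subset of the slugs listed earlier, and `filter_slugs` is a hypothetical helper name.

```python
from typing import List

# A subset of the geography slugs listed above (not the full catalog output).
GEOGRAPHY_SLUGS: List[str] = [
    "ags_q17_4739be4f",
    "expn_grid_a4075de4",
    "mbi_blockgroups_1ab060a",
    "mbi_counties_141b61cd",
    "usct_blockgroup_f45b6b49",
    "usct_censustract_bc698c5a",
    "usct_county_ec40c962",
    "usct_zcta5_75071016",
]


def filter_slugs(slugs: List[str], needle: str) -> List[str]:
    """Return the slugs that contain `needle`, ignoring upper/lower case."""
    needle = needle.lower()
    return [s for s in slugs if needle in s.lower()]


matches = filter_slugs(GEOGRAPHY_SLUGS, "blockgroup")
print(matches)  # ['mbi_blockgroups_1ab060a', 'usct_blockgroup_f45b6b49']
```

Because a slug is a shortened form of the full id, a slug-based filter can miss geographies whose id mentions the term but whose slug does not, which is one reason to filter on `id` as the guide does.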
We have three available datasets, from three different providers: Michael Bauer International, Open Data and AGS. For this example, we are going to look for demographic datasets for the MBI blockgroups geography mbi_blockgroups_1ab060a:
datasets = Catalog().country('usa').category('demographics').geography('mbi_blockgroups_1ab060a').datasets
datasets
[<Dataset.get('mbi_households__45067b14')>, <Dataset.get('mbi_population_341ee33b')>, <Dataset.get('mbi_purchasing__53ab279d')>, <Dataset.get('mbi_consumer_sp_54c4abc3')>, <Dataset.get('mbi_sociodemogr_b5516832')>, <Dataset.get('mbi_education_20063878')>, <Dataset.get('mbi_households__c943a740')>, <Dataset.get('mbi_retail_spen_c31f0ba0')>, <Dataset.get('mbi_consumer_pr_68d1265a')>]
Let's continue with the data discovery. We have nine datasets in the US with demographics information at the level of MBI blockgroups:
datasets.to_dataframe()
| | id | slug | name | description | country_id | geography_id | geography_name | geography_description | category_id | category_name | provider_id | provider_name | data_source_id | lang | temporal_aggregation | time_coverage | update_frequency | version | is_public_data |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | carto-do.mbi.demographics_householdsbytype_usa... | mbi_households__45067b14 | Households By Type at Blockgroups (micro) leve... | Data is country-specific. | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | households_by_type | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
1 | carto-do.mbi.demographics_population_usa_block... | mbi_population_341ee33b | Population at Blockgroups (micro) level for USA | Population figures are shown as projected aver... | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | population | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
2 | carto-do.mbi.demographics_purchasingpower_usa_... | mbi_purchasing__53ab279d | Purchasing Power at Blockgroups (micro) level ... | Purchasing Power describes the disposable inco... | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | purchasing_power | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
3 | carto-do.mbi.demographics_consumerspending_usa... | mbi_consumer_sp_54c4abc3 | Consumer Spending at Blockgroups (micro) level... | MBI Consumer Spending by product groups quanti... | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | consumer_spending | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
4 | carto-do.mbi.demographics_sociodemographics_us... | mbi_sociodemogr_b5516832 | Sociodemographics at Blockgroups (micro) level... | MBI Sociodemographics includes:\n- Population\... | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | sociodemographics | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
5 | carto-do.mbi.demographics_education_usa_blockg... | mbi_education_20063878 | Education at Blockgroups (micro) level for USA | Data is country-specific. | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | education | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
6 | carto-do.mbi.demographics_householdsbyincomequ... | mbi_households__c943a740 | Households By Income Quintiles at Blockgroups ... | On the national level the number of households... | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | households_by_income_quintiles | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
7 | carto-do.mbi.demographics_retailspending_usa_b... | mbi_retail_spen_c31f0ba0 | Retail Spending at Blockgroups (micro) level f... | Retail Spending relates to the proportion of P... | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | retail_spending | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
8 | carto-do.mbi.demographics_consumerprofiles_usa... | mbi_consumer_pr_68d1265a | Consumer Profiles at Blockgroups (micro) level... | The MB International Consumer Styles describe ... | usa | carto-do.mbi.geography_usa_blockgroups_2019 | USA - Blockgroups | MBI Digital Boundaries for USA at Blockgroups ... | demographics | Demographics | mbi | Michael Bauer International | consumer_profiles | eng | yearly | [2019-01-01, 2020-01-01) | None | 2019 | False |
They comprise different information: consumer spending, retail potential, consumer profiles, etc.
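One way to shortlist candidates is to scan the data_source_id values for keywords of interest. The sketch below hard-codes the slug and data_source_id pairs shown in the table above; `MBI_DATASETS` and `slugs_for_sources` are hypothetical names, and the keyword list is just an assumption about what "population data" might be filed under.

```python
from typing import Dict, List

# slug -> data_source_id, copied from the datasets table above.
MBI_DATASETS: Dict[str, str] = {
    "mbi_households__45067b14": "households_by_type",
    "mbi_population_341ee33b": "population",
    "mbi_purchasing__53ab279d": "purchasing_power",
    "mbi_consumer_sp_54c4abc3": "consumer_spending",
    "mbi_sociodemogr_b5516832": "sociodemographics",
    "mbi_education_20063878": "education",
    "mbi_households__c943a740": "households_by_income_quintiles",
    "mbi_retail_spen_c31f0ba0": "retail_spending",
    "mbi_consumer_pr_68d1265a": "consumer_profiles",
}


def slugs_for_sources(datasets: Dict[str, str], keywords: List[str]) -> List[str]:
    """Return the slugs whose data_source_id contains any of the keywords."""
    return [
        slug
        for slug, source in datasets.items()
        if any(keyword in source for keyword in keywords)
    ]


candidates = slugs_for_sources(MBI_DATASETS, ["population", "sociodemographic"])
print(candidates)  # ['mbi_population_341ee33b', 'mbi_sociodemogr_b5516832']
```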
At first sight, it looks like the dataset with data_source_id sociodemographic might contain the population information we are looking for. Let's try to understand a little better what this dataset contains by looking at its variables:
from cartoframes.data.observatory import Dataset
dataset = Dataset.get('ags_sociodemogr_e92b1637')
variables = dataset.variables
variables
[<Variable.get('AGECY0004_bf30e80a')> #'Population age 0-4 (2019A)', <Variable.get('AGECY0509_c74a565c')> #'Population age 5-9 (2019A)', <Variable.get('AGECY1014_1e97be2e')> #'Population age 10-14 (2019A)', <Variable.get('AGECY1519_66ed0078')> #'Population age 15-19 (2019A)', <Variable.get('AGECY2024_270f4203')> #'Population age 20-24 (2019A)', <Variable.get('AGECY2529_5f75fc55')> #'Population age 25-29 (2019A)', <Variable.get('AGECY3034_86a81427')> #'Population age 30-34 (2019A)', <Variable.get('AGECY3539_fed2aa71')> #'Population age 35-39 (2019A)', <Variable.get('AGECY4044_543eba59')> #'Population age 40-44 (2019A)', <Variable.get('AGECY4549_2c44040f')> #'Population age 45-49 (2019A)', <Variable.get('AGECY5054_f599ec7d')> #'Population age 50-54 (2019A)', <Variable.get('AGECY5559_8de3522b')> #'Population age 55-59 (2019A)', <Variable.get('AGECY6064_cc011050')> #'Population age 60-64 (2019A)', <Variable.get('AGECY6569_b47bae06')> #'Population age 65-69 (2019A)', <Variable.get('AGECY7074_6da64674')> #'Population age 70-74 (2019A)', <Variable.get('AGECY7579_15dcf822')> #'Population age 75-79 (2019A)', <Variable.get('AGECY8084_b25d4aed')> #'Population age 80-84 (2019A)', <Variable.get('AGECYGT15_681a1204')> #'Population Age 15+ (2019A)', <Variable.get('AGECYGT25_433741c7')> #'Population Age 25+ (2019A)', <Variable.get('AGECYGT85_b9d8a94d')> #'Population age 85+ (2019A)', <Variable.get('AGECYMED_b6eaafb4')> #'Median Age (2019A)', <Variable.get('AGEPYMED_91aa42e6')> #'Median Age (2024A)', <Variable.get('BLOCKGROUP_16298bd5')> #'Geographic Identifier', <Variable.get('DWLCY_e0711b62')> #'Housing units (2019A)', <Variable.get('DWLCYOWNED_a34794a5')> #'Occupied units owner (2019A)', <Variable.get('DWLCYRENT_239f79ae')> #'Occupied units renter (2019A)', <Variable.get('DWLCYVACNT_4d5e33e9')> #'Housing units vacant (2019A)', <Variable.get('DWLPY_819e5af0')> #'Housing units (2024A)', <Variable.get('EDUCYASSOC_fa1bcf13')> #'Pop 25+ Associate degree (2019A)', 
<Variable.get('EDUCYBACH_c2295f79')> #'Pop 25+ Bachelors degree (2019A)', <Variable.get('EDUCYGRAD_d0179ccb')> #'Pop 25+ graduate or prof school degree (2019A)', <Variable.get('EDUCYHSCH_b236c803')> #'Pop 25+ HS graduate (2019A)', <Variable.get('EDUCYLTGR9_cbcfcc89')> #'Pop 25+ less than 9th grade (2019A)', <Variable.get('EDUCYSCOLL_1e8c4828')> #'Pop 25+ college no diploma (2019A)', <Variable.get('EDUCYSHSCH_5c444deb')> #'Pop 25+ 9th-12th grade no diploma (2019A)', <Variable.get('HHDCY_23e8e012')> #'Households (2019A)', <Variable.get('HHDCYAVESZ_f4a95c6f')> #'Average Household Size (2019A)', <Variable.get('HHDCYFAM_85548592')> #'Family Households (2019A)', <Variable.get('HHDCYMEDAG_69c53f22')> #'Median Age of Householder (2019A)', <Variable.get('HHDPY_4207a180')> #'Households (2024A)', <Variable.get('HHSCYLPFCH_e4112270')> #'Families female no husband children (2019A)', <Variable.get('HHSCYLPMCH_e844cd91')> #'Families male no wife w children (2019A)', <Variable.get('HHSCYMCFCH_9bddf3b1')> #'Families married couple w children (2019A)', <Variable.get('HINCY10025_665c9060')> #'Household Income $100000-$124999 (2019A)', <Variable.get('HINCY1015_d2be7e2b')> #'Household Income $10000-$14999 (2019A)', <Variable.get('HINCY12550_f5b5f848')> #'Household Income $125000-$149999 (2019A)', <Variable.get('HINCY15020_21e894dd')> #'Household Income $150000-$199999 (2019A)', <Variable.get('HINCY1520_8f321b8c')> #'Household Income $15000-$19999 (2019A)', <Variable.get('HINCY2025_eb268206')> #'Household Income $20000-$24999 (2019A)', <Variable.get('HINCY2530_849c8523')> #'Household Income $25000-$29999 (2019A)', <Variable.get('HINCY3035_4a81d422')> #'Household Income $30000-$34999 (2019A)', <Variable.get('HINCY3540_73617481')> #'Household Income $35000-$39999 (2019A)', <Variable.get('HINCY4045_98177a5c')> #'Household Income $40000-$44999 (2019A)', <Variable.get('HINCY4550_f7ad7d79')> #'Household Income $45000-$49999 (2019A)', <Variable.get('HINCY5060_62f78b34')> #'Household Income 
$50000-$59999 (2019A)', <Variable.get('HINCY6075_1933e114')> #'Household Income $60000-$74999 (2019A)', <Variable.get('HINCY75100_9d5c69c8')> #'Household Income $75000-$99999 (2019A)', <Variable.get('HINCYGT200_e552a738')> #'Household Income > $200000 (2019A)', <Variable.get('HINCYLT10_745f9119')> #'Household Income < $10000 (2019A)', <Variable.get('HINCYMED24_22603d1a')> #'Median Household Income: Age < 25 (2019A)', <Variable.get('HINCYMED25_55670d8c')> #'Median Household Income: Age 25-34 (2019A)', <Variable.get('HINCYMED35_4c7c3ccd')> #'Median Household Income: Age 35-44 (2019A)', <Variable.get('HINCYMED45_33daa0a')> #'Median Household Income: Age 45-54 (2019A)', <Variable.get('HINCYMED55_1a269b4b')> #'Median Household Income: Age 55-64 (2019A)', <Variable.get('HINCYMED65_310bc888')> #'Median Household Income: Age 65-74 (2019A)', <Variable.get('HINCYMED75_2810f9c9')> #'Median Household Income: Age 75+ (2019A)', <Variable.get('HISCYHISP_f3b3a31e')> #'Population Hispanic (2019A)', <Variable.get('HOOEXMED_c2d4b5b')> #'Median Value of Owner Occupied Housing Units', <Variable.get('HUSEX1DET_3684405c')> #'UNITS IN STRUCTURE: 1 DETACHED', <Variable.get('HUSEXAPT_988f452f')> #'UNITS IN STRUCTURE: 20 OR MORE', <Variable.get('INCCYAVEHH_383bfd10')> #'Average household Income (2019A)', <Variable.get('INCCYMEDFA_59fa177d')> #'Median family income (2019A)', <Variable.get('INCCYMEDHH_bea58257')> #'Median household income (2019A)', <Variable.get('INCCYPCAP_691da8ff')> #'Per capita income (2019A)', <Variable.get('INCPYAVEHH_6e0d7b43')> #'Average household Income (2024A)', <Variable.get('INCPYMEDHH_e8930404')> #'Median household income (2024A)', <Variable.get('INCPYPCAP_ec5fd8ca')> #'Per capita income (2024A)', <Variable.get('LBFCYARM_8c06223a')> #'Pop 16+ in Armed Forces (2019A)', <Variable.get('LBFCYEMPL_c9c22a0')> #'Pop 16+ civilian employed (2019A)', <Variable.get('LBFCYLBF_59ce7ab0')> #'Population In Labor Force (2019A)', <Variable.get('LBFCYNLF_c4c98350')> #'Pop 16+ not in 
labor force (2019A)', <Variable.get('LBFCYPOP16_53fa921c')> #'Population Age 16+ (2019A)', <Variable.get('LBFCYUNEM_1e711de4')> #'Pop 16+ civilian unemployed (2019A)', <Variable.get('LNIEXISOL_d776b2f7')> #'LINGUISTICALLY ISOLATED HOUSEHOLDS (NON-ENGLISH SP...', <Variable.get('LNIEXSPAN_9a19f7f7')> #'SPANISH SPEAKING HOUSEHOLDS', <Variable.get('MARCYDIVOR_32a11923')> #'Divorced (2019A)', <Variable.get('MARCYMARR_26e07b7')> #'Now Married (2019A)', <Variable.get('MARCYNEVER_c82856b0')> #'Never Married (2019A)', <Variable.get('MARCYSEP_9024e7e5')> #'Separated (2019A)', <Variable.get('MARCYWIDOW_7a2977e0')> #'Widowed (2019A)', <Variable.get('POPCY_f5800f44')> #'Population (2019A)', <Variable.get('POPCYGRP_74c19673')> #'Population in Group Quarters (2019A)', <Variable.get('POPCYGRPI_147af7a9')> #'Institutional Group Quarters Population (2019A)', <Variable.get('POPPY_946f4ed6')> #'Population (2024A)', <Variable.get('RCHCYAMNHS_4a788a9d')> #'Non Hispanic American Indian (2019A)', <Variable.get('RCHCYASNHS_fabeaa31')> #'Non Hispanic Asian (2019A)', <Variable.get('RCHCYBLNHS_b5649728')> #'Non Hispanic Black (2019A)', <Variable.get('RCHCYHANHS_dbe5754')> #'Non Hispanic Hawaiian/Pacific Islander (2019A)', <Variable.get('RCHCYMUNHS_1a2518ec')> #'Non Hispanic Multiple Race (2019A)', <Variable.get('RCHCYOTNHS_d8592ce9')> #'Non Hispanic Other Race (2019A)', <Variable.get('RCHCYWHNHS_9206188d')> #'Non Hispanic White (2019A)', <Variable.get('RNTEXMED_2e309f54')> #'Median Cash Rent', <Variable.get('SEXCYFEM_d52acecb')> #'Population female (2019A)', <Variable.get('SEXCYMAL_ca14d4b8')> #'Population male (2019A)', <Variable.get('UNECYRATE_b3dc32ba')> #'Unemployment Rate (2019A)', <Variable.get('VPHCY1_53dc760f')> #'Households: One Vehicle Available (2019A)', <Variable.get('VPHCYGT1_a052056d')> #'Households: Two or More Vehicles Available (2019A)', <Variable.get('VPHCYNONE_22cb7350')> #'Households: No Vehicle Available (2019A)']
from cartoframes.data.observatory import Dataset
vdf = variables.to_dataframe()
vdf
| | id | slug | name | description | column_name | db_type | dataset_id | agg_method | variable_group_id | starred |
---|---|---|---|---|---|---|---|---|---|---|
0 | carto-do.ags.demographics_sociodemographic_usa... | AGECY0004_bf30e80a | AGECY0004 | Population age 0-4 (2019A) | AGECY0004 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
1 | carto-do.ags.demographics_sociodemographic_usa... | AGECY0509_c74a565c | AGECY0509 | Population age 5-9 (2019A) | AGECY0509 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
2 | carto-do.ags.demographics_sociodemographic_usa... | AGECY1014_1e97be2e | AGECY1014 | Population age 10-14 (2019A) | AGECY1014 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
3 | carto-do.ags.demographics_sociodemographic_usa... | AGECY1519_66ed0078 | AGECY1519 | Population age 15-19 (2019A) | AGECY1519 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
4 | carto-do.ags.demographics_sociodemographic_usa... | AGECY2024_270f4203 | AGECY2024 | Population age 20-24 (2019A) | AGECY2024 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103 | carto-do.ags.demographics_sociodemographic_usa... | SEXCYMAL_ca14d4b8 | SEXCYMAL | Population male (2019A) | SEXCYMAL | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
104 | carto-do.ags.demographics_sociodemographic_usa... | UNECYRATE_b3dc32ba | UNECYRATE | Unemployment Rate (2019A) | UNECYRATE | FLOAT | carto-do.ags.demographics_sociodemographic_usa... | AVG | None | False |
105 | carto-do.ags.demographics_sociodemographic_usa... | VPHCY1_53dc760f | VPHCY1 | Households: One Vehicle Available (2019A) | VPHCY1 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
106 | carto-do.ags.demographics_sociodemographic_usa... | VPHCYGT1_a052056d | VPHCYGT1 | Households: Two or More Vehicles Available (20... | VPHCYGT1 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
107 | carto-do.ags.demographics_sociodemographic_usa... | VPHCYNONE_22cb7350 | VPHCYNONE | Households: No Vehicle Available (2019A) | VPHCYNONE | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
108 rows × 10 columns
We can see there are several variables related to population, so this is the Dataset we are looking for.
vdf[vdf['description'].str.contains('pop', case=False, na=False)]
| | id | slug | name | description | column_name | db_type | dataset_id | agg_method | variable_group_id | starred |
---|---|---|---|---|---|---|---|---|---|---|
0 | carto-do.ags.demographics_sociodemographic_usa... | AGECY0004_bf30e80a | AGECY0004 | Population age 0-4 (2019A) | AGECY0004 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
1 | carto-do.ags.demographics_sociodemographic_usa... | AGECY0509_c74a565c | AGECY0509 | Population age 5-9 (2019A) | AGECY0509 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
2 | carto-do.ags.demographics_sociodemographic_usa... | AGECY1014_1e97be2e | AGECY1014 | Population age 10-14 (2019A) | AGECY1014 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
3 | carto-do.ags.demographics_sociodemographic_usa... | AGECY1519_66ed0078 | AGECY1519 | Population age 15-19 (2019A) | AGECY1519 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
4 | carto-do.ags.demographics_sociodemographic_usa... | AGECY2024_270f4203 | AGECY2024 | Population age 20-24 (2019A) | AGECY2024 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
5 | carto-do.ags.demographics_sociodemographic_usa... | AGECY2529_5f75fc55 | AGECY2529 | Population age 25-29 (2019A) | AGECY2529 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
6 | carto-do.ags.demographics_sociodemographic_usa... | AGECY3034_86a81427 | AGECY3034 | Population age 30-34 (2019A) | AGECY3034 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
7 | carto-do.ags.demographics_sociodemographic_usa... | AGECY3539_fed2aa71 | AGECY3539 | Population age 35-39 (2019A) | AGECY3539 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
8 | carto-do.ags.demographics_sociodemographic_usa... | AGECY4044_543eba59 | AGECY4044 | Population age 40-44 (2019A) | AGECY4044 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
9 | carto-do.ags.demographics_sociodemographic_usa... | AGECY4549_2c44040f | AGECY4549 | Population age 45-49 (2019A) | AGECY4549 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
10 | carto-do.ags.demographics_sociodemographic_usa... | AGECY5054_f599ec7d | AGECY5054 | Population age 50-54 (2019A) | AGECY5054 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
11 | carto-do.ags.demographics_sociodemographic_usa... | AGECY5559_8de3522b | AGECY5559 | Population age 55-59 (2019A) | AGECY5559 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
12 | carto-do.ags.demographics_sociodemographic_usa... | AGECY6064_cc011050 | AGECY6064 | Population age 60-64 (2019A) | AGECY6064 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
13 | carto-do.ags.demographics_sociodemographic_usa... | AGECY6569_b47bae06 | AGECY6569 | Population age 65-69 (2019A) | AGECY6569 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
14 | carto-do.ags.demographics_sociodemographic_usa... | AGECY7074_6da64674 | AGECY7074 | Population age 70-74 (2019A) | AGECY7074 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
15 | carto-do.ags.demographics_sociodemographic_usa... | AGECY7579_15dcf822 | AGECY7579 | Population age 75-79 (2019A) | AGECY7579 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
16 | carto-do.ags.demographics_sociodemographic_usa... | AGECY8084_b25d4aed | AGECY8084 | Population age 80-84 (2019A) | AGECY8084 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
17 | carto-do.ags.demographics_sociodemographic_usa... | AGECYGT15_681a1204 | AGECYGT15 | Population Age 15+ (2019A) | AGECYGT15 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
18 | carto-do.ags.demographics_sociodemographic_usa... | AGECYGT25_433741c7 | AGECYGT25 | Population Age 25+ (2019A) | AGECYGT25 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
19 | carto-do.ags.demographics_sociodemographic_usa... | AGECYGT85_b9d8a94d | AGECYGT85 | Population age 85+ (2019A) | AGECYGT85 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
28 | carto-do.ags.demographics_sociodemographic_usa... | EDUCYASSOC_fa1bcf13 | EDUCYASSOC | Pop 25+ Associate degree (2019A) | EDUCYASSOC | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
29 | carto-do.ags.demographics_sociodemographic_usa... | EDUCYBACH_c2295f79 | EDUCYBACH | Pop 25+ Bachelors degree (2019A) | EDUCYBACH | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
30 | carto-do.ags.demographics_sociodemographic_usa... | EDUCYGRAD_d0179ccb | EDUCYGRAD | Pop 25+ graduate or prof school degree (2019A) | EDUCYGRAD | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
31 | carto-do.ags.demographics_sociodemographic_usa... | EDUCYHSCH_b236c803 | EDUCYHSCH | Pop 25+ HS graduate (2019A) | EDUCYHSCH | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
32 | carto-do.ags.demographics_sociodemographic_usa... | EDUCYLTGR9_cbcfcc89 | EDUCYLTGR9 | Pop 25+ less than 9th grade (2019A) | EDUCYLTGR9 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
33 | carto-do.ags.demographics_sociodemographic_usa... | EDUCYSCOLL_1e8c4828 | EDUCYSCOLL | Pop 25+ college no diploma (2019A) | EDUCYSCOLL | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
34 | carto-do.ags.demographics_sociodemographic_usa... | EDUCYSHSCH_5c444deb | EDUCYSHSCH | Pop 25+ 9th-12th grade no diploma (2019A) | EDUCYSHSCH | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
66 | carto-do.ags.demographics_sociodemographic_usa... | HISCYHISP_f3b3a31e | HISCYHISP | Population Hispanic (2019A) | HISCYHISP | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
77 | carto-do.ags.demographics_sociodemographic_usa... | LBFCYARM_8c06223a | LBFCYARM | Pop 16+ in Armed Forces (2019A) | LBFCYARM | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
78 | carto-do.ags.demographics_sociodemographic_usa... | LBFCYEMPL_c9c22a0 | LBFCYEMPL | Pop 16+ civilian employed (2019A) | LBFCYEMPL | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
79 | carto-do.ags.demographics_sociodemographic_usa... | LBFCYLBF_59ce7ab0 | LBFCYLBF | Population In Labor Force (2019A) | LBFCYLBF | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
80 | carto-do.ags.demographics_sociodemographic_usa... | LBFCYNLF_c4c98350 | LBFCYNLF | Pop 16+ not in labor force (2019A) | LBFCYNLF | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
81 | carto-do.ags.demographics_sociodemographic_usa... | LBFCYPOP16_53fa921c | LBFCYPOP16 | Population Age 16+ (2019A) | LBFCYPOP16 | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
82 | carto-do.ags.demographics_sociodemographic_usa... | LBFCYUNEM_1e711de4 | LBFCYUNEM | Pop 16+ civilian unemployed (2019A) | LBFCYUNEM | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
90 | carto-do.ags.demographics_sociodemographic_usa... | POPCY_f5800f44 | POPCY | Population (2019A) | POPCY | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
91 | carto-do.ags.demographics_sociodemographic_usa... | POPCYGRP_74c19673 | POPCYGRP | Population in Group Quarters (2019A) | POPCYGRP | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
92 | carto-do.ags.demographics_sociodemographic_usa... | POPCYGRPI_147af7a9 | POPCYGRPI | Institutional Group Quarters Population (2019A) | POPCYGRPI | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
93 | carto-do.ags.demographics_sociodemographic_usa... | POPPY_946f4ed6 | POPPY | Population (2024A) | POPPY | FLOAT | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
102 | carto-do.ags.demographics_sociodemographic_usa... | SEXCYFEM_d52acecb | SEXCYFEM | Population female (2019A) | SEXCYFEM | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
103 | carto-do.ags.demographics_sociodemographic_usa... | SEXCYMAL_ca14d4b8 | SEXCYMAL | Population male (2019A) | SEXCYMAL | INTEGER | carto-do.ags.demographics_sociodemographic_usa... | SUM | None | False |
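The substring 'pop' is a deliberately broad filter: it matches age bands, education levels and labor force counts as well as total population. If only the total population variables are wanted, a stricter pattern can help. The sketch below is plain Python over a handful of descriptions copied from the table above; the regular expression and the `DESCRIPTIONS` sample are assumptions chosen for illustration, not part of CARTOframes.

```python
import re
from typing import Dict

# column_name -> description, a small sample taken from the variables table.
DESCRIPTIONS: Dict[str, str] = {
    "POPCY": "Population (2019A)",
    "POPPY": "Population (2024A)",
    "AGECY0004": "Population age 0-4 (2019A)",
    "EDUCYASSOC": "Pop 25+ Associate degree (2019A)",
    "POPCYGRP": "Population in Group Quarters (2019A)",
    "AGECYMED": "Median Age (2019A)",
}

# Matches only descriptions of the exact form "Population (<year>A)".
TOTAL_POPULATION = re.compile(r"^Population \(\d{4}A\)$")

# Broad filter: any description containing "pop", ignoring case.
broad = [name for name, text in DESCRIPTIONS.items() if "pop" in text.lower()]

# Strict filter: total population variables only.
totals = [name for name, text in DESCRIPTIONS.items() if TOTAL_POPULATION.match(text)]

print(broad)   # ['POPCY', 'POPPY', 'AGECY0004', 'EDUCYASSOC', 'POPCYGRP']
print(totals)  # ['POPCY', 'POPPY']
```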
The Data Observatory catalog is not only a repository of curated spatial datasets; it also contains valuable information that helps you better understand the underlying data of every dataset, so you can make an informed decision about which data best fits your problem.
Some of the augmented metadata you can find for each dataset in the catalog is:
- head and tail methods to get a glimpse of the actual data. This helps you understand the available columns, data types, etc., so you can start modelling your problem right away.
- geom_coverage to visualize on a map the geographical coverage of the data in the Dataset.
- counts, fields_by_type and a full describe method with stats of the actual values in the dataset, such as average, stdev, quantiles, min, max and median for each of the variables of the dataset.

You don't need a subscription to a dataset to query the augmented metadata; it is publicly available to anyone exploring the Data Observatory catalog.
Let's review some of that information, starting with a glimpse of the first or last ten rows of the actual data in the dataset:
from cartoframes.data.observatory import Dataset
dataset = Dataset.get('ags_sociodemogr_e92b1637')
dataset.head()
| | DWLCY | HHDCY | POPCY | VPHCY1 | AGECYMED | HHDCYFAM | HOOEXMED | HUSEXAPT | LBFCYARM | LBFCYLBF | ... | MARCYDIVOR | MARCYNEVER | MARCYWIDOW | RCHCYAMNHS | RCHCYASNHS | RCHCYBLNHS | RCHCYHANHS | RCHCYMUNHS | RCHCYOTNHS | RCHCYWHNHS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 5 | 6 | 0 | 64.00 | 1 | 63749 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
1 | 2 | 2 | 5 | 1 | 36.50 | 2 | 124999 | 0 | 0 | 2 | ... | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 |
2 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 21 | 11 | 22 | 4 | 64.00 | 6 | 74999 | 0 | 0 | 10 | ... | 4 | 13 | 2 | 0 | 0 | 22 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 959 | 0 | 18.91 | 0 | 0 | 0 | 0 | 378 | ... | 0 | 959 | 0 | 5 | 53 | 230 | 0 | 25 | 0 | 609 |
5 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 101 columns
Alternatively, you can get the last ten rows with dataset.tail().
An overview of the coverage of the dataset:
dataset.geom_coverage()
Some stats about the dataset:
dataset.counts()
rows 217182.0
cells 22369746.0
null_cells 0.0
null_cells_percent 0.0
dtype: float64
dataset.fields_by_type()
float 4
string 1
integer 96
dtype: int64
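The two outputs above can be cross-checked with a little arithmetic. The sketch below copies the reported numbers into plain dictionaries (it does not call CARTOframes) and derives the number of columns implied by each output. The variable names are hypothetical.

```python
# Values copied from the counts() and fields_by_type() outputs above.
counts = {
    "rows": 217182.0,
    "cells": 22369746.0,
    "null_cells": 0.0,
    "null_cells_percent": 0.0,
}
fields_by_type = {"float": 4, "string": 1, "integer": 96}

# Columns implied by the cell count: cells = rows x columns.
columns_from_counts = counts["cells"] / counts["rows"]

# Columns implied by the per-type field counts.
columns_from_types = sum(fields_by_type.values())

# Share of null cells, recomputed from the raw counts.
null_percent = 100 * counts["null_cells"] / counts["cells"]

print(columns_from_counts)  # 103.0
print(columns_from_types)   # 101
print(null_percent)         # 0.0
```

The per-type total of 101 agrees with the "101 columns" reported by head(), while the cell count implies 103 columns. The guide does not explain the difference of two, so treat it as something to check against the dataset itself rather than as a documented behavior.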
dataset.describe()
| | AGECY0004 | AGECY0509 | AGECY1014 | AGECY1519 | AGECY2024 | AGECY2529 | AGECY3034 | AGECY3539 | AGECY4044 | AGECY4549 | ... | RCHCYMUNHS | RCHCYOTNHS | RCHCYWHNHS | RNTEXMED | SEXCYFEM | SEXCYMAL | UNECYRATE | VPHCY1 | VPHCYGT1 | VPHCYNONE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
avg | 9.072047e+01 | 9.311367e+01 | 9.591034e+01 | 9.722016e+01 | 1.001196e+02 | 1.087202e+02 | 1.036462e+02 | 1.003712e+02 | 9.199482e+01 | 9.412861e+01 | ... | 3.505126e+01 | 3.673164 | 9.111044e+02 | 9.315027e+02 | 7.691157e+02 | 7.464722e+02 | 3.687263 | 1.922163e+02 | 3.509257e+02 | 5.008733e+01 |
max | 5.007000e+03 | 5.274000e+03 | 5.225000e+03 | 7.607000e+03 | 1.489400e+04 | 5.746000e+03 | 4.936000e+03 | 5.451000e+03 | 5.052000e+03 | 4.596000e+03 | ... | 2.110000e+03 | 950.000000 | 3.681800e+04 | 3.999000e+03 | 3.255200e+04 | 3.104300e+04 | 100.000000 | 1.681400e+04 | 1.696200e+04 | 4.945000e+03 |
min | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
sum | 1.970285e+07 | 2.022261e+07 | 2.083000e+07 | 2.111447e+07 | 2.174418e+07 | 2.361208e+07 | 2.251009e+07 | 2.179881e+07 | 1.997962e+07 | 2.044304e+07 | ... | 7.612502e+06 | 797745.000000 | 1.978755e+08 | 2.023056e+08 | 1.670381e+08 | 1.621203e+08 | 800807.220000 | 4.174592e+07 | 7.621475e+07 | 1.087807e+07 |
range | 5.007000e+03 | 5.274000e+03 | 5.225000e+03 | 7.607000e+03 | 1.489400e+04 | 5.746000e+03 | 4.936000e+03 | 5.451000e+03 | 5.052000e+03 | 4.596000e+03 | ... | 2.110000e+03 | 950.000000 | 3.681800e+04 | 3.999000e+03 | 3.255200e+04 | 3.104300e+04 | 100.000000 | 1.681400e+04 | 1.696200e+04 | 4.945000e+03 |
stdev | 7.802265e+01 | 8.034981e+01 | 8.116058e+01 | 1.107727e+02 | 1.230680e+02 | 9.159219e+01 | 8.815390e+01 | 8.482190e+01 | 7.528368e+01 | 7.152112e+01 | ... | 5.045176e+01 | 14.906111 | 7.440860e+02 | 4.772473e+02 | 5.222389e+02 | 5.242907e+02 | 3.774735 | 1.561162e+02 | 2.771389e+02 | 8.571871e+01 |
q1 | 4.400000e+01 | 4.500000e+01 | 4.600000e+01 | 4.500000e+01 | 4.400000e+01 | 5.100000e+01 | 5.000000e+01 | 5.000000e+01 | 4.600000e+01 | 4.900000e+01 | ... | 1.100000e+01 | 0.000000 | 3.670000e+02 | 5.520000e+02 | 4.350000e+02 | 4.180000e+02 | 0.970000 | 8.800000e+01 | 1.700000e+02 | 5.000000e+00 |
q3 | 8.400000e+01 | 8.600000e+01 | 8.900000e+01 | 8.700000e+01 | 8.600000e+01 | 1.000000e+02 | 9.600000e+01 | 9.300000e+01 | 8.600000e+01 | 8.900000e+01 | ... | 2.900000e+01 | 0.000000 | 9.250000e+02 | 9.250000e+02 | 7.400000e+02 | 7.130000e+02 | 3.460000 | 1.830000e+02 | 3.410000e+02 | 3.400000e+01 |
median | 6.200000e+01 | 6.400000e+01 | 6.500000e+01 | 6.400000e+01 | 6.200000e+01 | 7.300000e+01 | 7.000000e+01 | 6.900000e+01 | 6.400000e+01 | 6.700000e+01 | ... | 1.900000e+01 | 0.000000 | 6.550000e+02 | 7.190000e+02 | 5.730000e+02 | 5.490000e+02 | 2.130000 | 1.310000e+02 | 2.520000e+02 | 1.700000e+01 |
interquartile_range | 4.000000e+01 | 4.100000e+01 | 4.300000e+01 | 4.200000e+01 | 4.200000e+01 | 4.900000e+01 | 4.600000e+01 | 4.300000e+01 | 4.000000e+01 | 4.000000e+01 | ... | 1.800000e+01 | 0.000000 | 5.580000e+02 | 3.730000e+02 | 3.050000e+02 | 2.950000e+02 | 2.490000 | 9.500000e+01 | 1.710000e+02 | 2.900000e+01 |
10 rows × 107 columns
Every Dataset instance in the catalog contains other useful metadata:
dataset.to_dict()
{'id': 'carto-do.ags.demographics_sociodemographic_usa_blockgroup_2015_yearly_2019', 'slug': 'ags_sociodemogr_e92b1637', 'name': 'Sociodemographic', 'description': 'Census and ACS sociodemographic data estimated for the current year and data projected to five years. Projected fields are general aggregates (total population, total households, median age, avg income etc.)', 'country_id': 'usa', 'geography_id': 'carto-do-public-data.usa_carto.geography_usa_blockgroup_2015', 'geography_name': 'Census Block Groups (2015) - shoreline clipped', 'geography_description': 'Shoreline clipped TIGER/Line boundaries. More info: https://carto.com/blog/tiger-shoreline-clip/', 'category_id': 'demographics', 'category_name': 'Demographics', 'provider_id': 'ags', 'provider_name': 'Applied Geographic Solutions', 'data_source_id': 'sociodemographic', 'lang': 'eng', 'temporal_aggregation': 'yearly', 'time_coverage': '[2019-01-01, 2020-01-01)', 'update_frequency': None, 'version': '2019', 'is_public_data': False}
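The time_coverage value in the dictionary above looks like standard interval notation, where a square bracket includes the boundary and a parenthesis excludes it. Assuming that reading is correct, the sketch below parses the string and checks whether a given day falls inside the covered period. The `parse_time_coverage` and `covers` helpers are hypothetical and are not part of CARTOframes.

```python
from datetime import date


def parse_time_coverage(text):
    """Parse '[start, end)' into (start, end, start_inclusive, end_inclusive)."""
    text = text.strip()
    start_inclusive = text[0] == "["
    end_inclusive = text[-1] == "]"
    start_text, end_text = (part.strip() for part in text[1:-1].split(","))
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    return start, end, start_inclusive, end_inclusive


def covers(text, day):
    """Return True when `day` lies inside the interval described by `text`."""
    start, end, start_inclusive, end_inclusive = parse_time_coverage(text)
    after_start = day >= start if start_inclusive else day > start
    before_end = day <= end if end_inclusive else day < end
    return after_start and before_end


coverage = "[2019-01-01, 2020-01-01)"  # value copied from the metadata above
print(covers(coverage, date(2019, 12, 31)))
print(covers(coverage, date(2020, 1, 1)))
```

Under this interpretation the dataset covers the whole of 2019 and stops just before 1 January 2020.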
There is also some interesting metadata for each variable in the dataset:
Variables are the most important asset in the catalog and when exploring datasets in the Data Observatory catalog it's very important that you understand clearly what variables are available to enrich your own data.
For each Variable in each dataset, the Data Observatory provides (as it does with datasets) a set of methods and attributes to understand their underlying data.
Some of them are:
- head and tail methods to get a glimpse of the actual data and start modelling your problem right away.
- counts, quantiles and a full describe method with stats of the actual values in the dataset, such as average, stdev, quantiles, min, max and median for each of the variables of the dataset.
- histogram plot with the distribution of the values on each variable.
Let's overview some of that augmented metadata for the variables in the AGS population dataset.
from cartoframes.data.observatory import Variable
variable = Variable.get('POPPY_946f4ed6')
variable
<Variable.get('POPPY_946f4ed6')> #'Population (2024A)'
variable.to_dict()
{'id': 'carto-do.ags.demographics_sociodemographic_usa_blockgroup_2015_yearly_2019.POPPY', 'slug': 'POPPY_946f4ed6', 'name': 'POPPY', 'description': 'Population (2024A)', 'column_name': 'POPPY', 'db_type': 'FLOAT', 'dataset_id': 'carto-do.ags.demographics_sociodemographic_usa_blockgroup_2015_yearly_2019', 'agg_method': 'SUM', 'variable_group_id': None, 'starred': False}
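In the dictionary above, the variable id appears to be the dataset id followed by a dot and the column name. The sketch below checks that observation on the values printed above. Treat it as a pattern seen in this one example rather than a documented guarantee of the catalog.

```python
# Values copied from the variable.to_dict() output above.
variable_meta = {
    "id": "carto-do.ags.demographics_sociodemographic_usa_blockgroup_2015_yearly_2019.POPPY",
    "column_name": "POPPY",
    "dataset_id": "carto-do.ags.demographics_sociodemographic_usa_blockgroup_2015_yearly_2019",
    "agg_method": "SUM",
}

# Split on the last dot: everything before it should be the dataset id.
dataset_id, _, column_name = variable_meta["id"].rpartition(".")

print(dataset_id == variable_meta["dataset_id"])
print(column_name == variable_meta["column_name"])
```

If the pattern holds for other variables too, it offers a quick way to tell which dataset a variable id belongs to.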
There are also some utility methods to understand the underlying data for each variable:
variable.head()
0     0
1     0
2     8
3     0
4     0
5     0
6     4
7     0
8     2
9    59
dtype: int64
variable.counts()
all                 217182.000000
null                     0.000000
zero                   303.000000
extreme               9380.000000
distinct              6947.000000
outliers             27571.000000
null_percent             0.000000
zero_percent             0.139514
extreme_percent          0.043190
distinct_percent         3.198700
outliers_percent         0.126949
dtype: float64
variable.quantiles()
q1                      867
q3                     1490
median                 1149
interquartile_range     623
dtype: int64
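As a reminder of what these four numbers mean, here is a minimal sketch that computes quartiles with Python's standard `statistics` module. The sample is made up so that its quartiles coincide with the values reported above; it is not catalog data. The Data Observatory may also use a different quantile definition, so results on real data could differ slightly.

```python
import statistics

# Hypothetical block group populations, invented for illustration only.
sample = [400, 867, 950, 1149, 1300, 1490, 2100]

# n=4 splits the data into quartiles and returns the three cut points.
q1, median, q3 = statistics.quantiles(sample, n=4)

summary = {
    "q1": q1,
    "q3": q3,
    "median": median,
    "interquartile_range": q3 - q1,  # spread of the middle half of the values
}
print(summary)
```

The interquartile range is simply q3 minus q1, which for the reported values is 1490 - 867 = 623.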
variable.histogram()
<Figure size 1200x700 with 1 Axes>
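The plot itself is not reproduced here. As a rough idea of what a histogram summarises, the sketch below groups values into equal-width bins and prints a text bar per bin. It uses the ten values shown by variable.head() above. The `histogram_counts` helper and the choice of three bins are assumptions made for illustration, not the binning used by the histogram method.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    # Fall back to a width of 1 when all values are equal.
    width = (high - low) / bins or 1
    counts = [0] * bins
    for value in values:
        # Clamp so that the maximum value lands in the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# The ten values printed by variable.head() above.
head_values = [0, 0, 8, 0, 0, 0, 4, 0, 2, 59]

bin_counts = histogram_counts(head_values, bins=3)
for index, count in enumerate(bin_counts):
    print(f"bin {index}: {'#' * count} ({count})")
```

With so few values most of them fall in the first bin, which is why a histogram over the full variable is more informative than a glimpse of its first rows.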
variable.describe()
avg                    1.564793e+03
max                    7.127400e+04
min                    0.000000e+00
sum                    3.398448e+08
range                  7.127400e+04
stdev                  1.098193e+03
q1                     8.670000e+02
q3                     1.490000e+03
median                 1.149000e+03
interquartile_range    6.230000e+02
dtype: float64
Once you have explored the catalog and found a dataset with the variables you need for your analysis and the right spatial resolution, look at its is_public_data attribute to know whether you can use it directly from CARTOframes or first need to subscribe for a license.
Subscriptions to datasets allow you to use them from CARTOframes to enrich your own data or to download them. See the enrichment guides for more information about this.
Let's see the dataset and geography in our previous example:
dataset = Dataset.get('ags_sociodemogr_e92b1637')
dataset.is_public_data
False
from cartoframes.data.observatory import Geography
geography = Geography.get(dataset.geography)
geography.is_public_data
True
In the output above the dataset is not public data, which means you need a subscription to be able to use it to enrich your own data, while the geography is reported as public data.
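The check above can be wrapped in a small helper. The sketch below uses stand-in objects (`SimpleNamespace`) instead of real catalog entities, so it runs without a CARTO account. `needs_subscription` is a hypothetical name, and the flags simply mirror the two outputs printed above.

```python
from types import SimpleNamespace


def needs_subscription(entity):
    """Return True when an entity is not flagged as public data."""
    return not entity.is_public_data


# Stand-ins for the Dataset and Geography objects, using the flags printed above.
entities = {
    "dataset": SimpleNamespace(is_public_data=False),
    "geography": SimpleNamespace(is_public_data=True),
}

to_subscribe = [name for name, entity in entities.items() if needs_subscription(entity)]
print(to_subscribe)
```

Because the helper only reads the is_public_data attribute, it should work the same way with any object exposing that attribute.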
To subscribe to data in the Data Observatory catalog you need a CARTO account with access to the Data Observatory:
from cartoframes.auth import set_default_credentials
set_default_credentials('creds.json')
dataset.subscribe()
geography.subscribe()
Licenses to data in the Data Observatory grant you the right to use the subscribed data for a period of one year. Every dataset or geography you want to use to enrich your own data requires a valid license, unless it is public data.
You can check the current status of your subscriptions directly from the catalog:
Catalog().subscriptions()
Datasets: [<Dataset.get('ags_businesscou_df363a87')>, <Dataset.get('ags_retailpoten_aaf25a8c')>, <Dataset.get('ags_sociodemogr_e92b1637')>, <Dataset.get('ags_crimerisk_e9cfa4d4')>]
Geographies: [<Geography.get('usct_blockgroup_f45b6b49')>, <Geography.get('ags_blockgroup_1c63771c')>]
In this guide you've seen how to explore the Data Observatory catalog to identify variables of datasets that you can use to enrich your own data.
You've learned how to explore the main entities in the catalog: Geography, Dataset and their Variables.
We also recommend checking out the resources below to learn more about the Data Observatory catalog: