In this notebook, we'll load the five years of CPI data that we collected from the Bureau of Labor Statistics (https://www.bls.gov/cpi) using the public API, clean and prep the data for processing, and complete exploratory data analysis.
The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. The index measures price changes from a designed reference date, and in this case, the reference base is 1982-84 (CPI = 100). The data we collected is for the four census regions (South, West, Northeast, and Midwest) and five consumer price indexes (All items, All items less food and energy, Energy, Food, and Shelter).
The West region includes Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, Nevada, New Mexico, Oregon, Utah, Washington, and Wyoming.
The Midwest region is comprised of Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, and Wisconsin.
The South region is comprised of Alabama, Arkansas, Delaware, District of Columbia, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Virginia, and West Virginia.
The Northeast region is comprised of Connecticut, Maine, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, and Vermont.
We'll look to answer the following questions:
Which region has had the largest CPI increase for all items?
Which region has had the largest CPI increases for energy, food, and shelter?
Which commodity saw the largest CPI increase?
First, we'll start with importing the libraries and tools that we'll need.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from dataprep.eda import plot, create_report
Next, we'll read in the Excel file that we created that lists the Series Report IDs and the region and consumer price index that the are tied to. We'll use this document as a dictionary to map the decriptions to our data since the API only returns the Series Report IDs and values.
# create dataframe from series mapping document
cpi_mapping = pd.read_excel('BLS CPI/BLS Series Numbers.xlsx', sheet_name='Sheet1')
# lower column names
cpi_mapping.columns = cpi_mapping.columns.str.lower()
# view mapping document
cpi_mapping
bls_series | region | description | |
---|---|---|---|
0 | CUUR0300SA0 | South | All items |
1 | CUUR0400SA0 | West | All items |
2 | CUUR0100SA0 | Northeast | All items |
3 | CUUR0200SA0 | Midwest | All items |
4 | CUUR0300SA0L1E | South | All items less food and energy |
5 | CUUR0400SA0L1E | West | All items less food and energy |
6 | CUUR0100SA0L1E | Northeast | All items less food and energy |
7 | CUUR0200SA0L1E | Midwest | All items less food and energy |
8 | CUUR0300SA0E | South | Energy |
9 | CUUR0400SA0E | West | Energy |
10 | CUUR0100SA0E | Northeast | Energy |
11 | CUUR0200SA0E | Midwest | Energy |
12 | CUUR0300SAF1 | South | Food |
13 | CUUR0400SAF1 | West | Food |
14 | CUUR0100SAF1 | Northeast | Food |
15 | CUUR0200SAF1 | Midwest | Food |
16 | CUUR0300SAH1 | South | Shelter |
17 | CUUR0400SAH1 | West | Shelter |
18 | CUUR0100SAH1 | Northeast | Shelter |
19 | CUUR0200SAH1 | Midwest | Shelter |
Now we'll read in our collected data and map the region and category descriptions to the series_id.
# create cpi data df
cpi_data = pd.read_csv('BLS CPI/CPI Data 2017 - 2021.csv')
# change column names to lowercase
cpi_data.columns = cpi_data.columns.str.lower()
# change series id column name to snake_case
cpi_data = cpi_data.rename(columns={'series id':'series_id'})
cpi_data.sample(5)
series_id | month | value | |
---|---|---|---|
1123 | CUUR0400SAH1 | 1-Aug-20 | 356.336 |
333 | CUUR0400SA0L1E | 1-May-21 | 288.442 |
48 | CUUR0300SA0 | 1-Apr-18 | 242.486 |
992 | CUUR0200SAF1 | 1-Sep-20 | 254.773 |
347 | CUUR0400SA0L1E | 1-Apr-20 | 279.072 |
# create dictionaries to map region and description vaules to the cpi data
bls_area = dict(zip(cpi_mapping.bls_series, cpi_mapping.region))
bls_desc = dict(zip(cpi_mapping.bls_series, cpi_mapping.description))
# map values from dictionary
cpi_data['region'] = cpi_data['series_id'].map(bls_area)
cpi_data['description'] = cpi_data['series_id'].map(bls_desc)
# rename description column
cpi_data.rename(columns={'description':'category'}, inplace=True)
# reorder columns and drop series_id column
cpi_data = cpi_data[['region', 'category', 'month', 'value']]
# view a sample of the data
cpi_data.sample(10)
region | category | month | value | |
---|---|---|---|---|
969 | Northeast | Food | 1-Jun-17 | 254.968 |
781 | South | Food | 1-Dec-21 | 282.547 |
1103 | South | Shelter | 1-Feb-17 | 262.560 |
739 | Midwest | Energy | 1-Feb-20 | 194.298 |
503 | Midwest | All items less food and energy | 1-Apr-18 | 239.531 |
45 | South | All items | 1-Jul-18 | 243.776 |
873 | West | Food | 1-Nov-19 | 267.107 |
696 | Northeast | Energy | 1-Jun-18 | 224.930 |
203 | Midwest | All items | 1-May-21 | 250.582 |
621 | West | Energy | 1-Mar-19 | 251.002 |
cpi_data.info() # view number of observations and check for null values
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1300 entries, 0 to 1299 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 region 1300 non-null object 1 category 1300 non-null object 2 month 1300 non-null object 3 value 1300 non-null float64 dtypes: float64(1), object(3) memory usage: 40.8+ KB
Since our data contains both monthy and yearly averages, we will subset the dataframe into to one dataframe containing yearly values, and one containing monthly values.
# subset dataframe for annual values
cpi_ann = cpi_data[cpi_data['month'].str.contains('Annual')].copy()
# trim year from month column
cpi_ann['year'] = cpi_ann['month'].str[-4:]
# change year to datetime data type
cpi_ann['year'] = pd.to_datetime(cpi_ann['year']).dt.strftime('%Y')
# drop month column
cpi_ann.drop(columns={'month'}, inplace=True)
# sort by year
cpi_ann = cpi_ann.sort_values(by=['region','category','year'])
cpi_ann.head()
region | category | value | year | |
---|---|---|---|---|
247 | Midwest | All items | 229.874 | 2017 |
234 | Midwest | All items | 234.290 | 2018 |
221 | Midwest | All items | 237.776 | 2019 |
208 | Midwest | All items | 240.040 | 2020 |
195 | Midwest | All items | 252.242 | 2021 |
# subset dataframe for monthly values
cpi_mth = cpi_data[~cpi_data['month'].str.contains('Annual')].copy()
# change month column to datetime data type
cpi_mth['month'] = pd.to_datetime(cpi_mth['month'])
cpi_mth.head()
region | category | month | value | |
---|---|---|---|---|
1 | South | All items | 2021-12-01 | 269.263 |
2 | South | All items | 2021-11-01 | 268.360 |
3 | South | All items | 2021-10-01 | 267.160 |
4 | South | All items | 2021-09-01 | 264.593 |
5 | South | All items | 2021-08-01 | 263.728 |
# export cleaned dataframes a csvs
cpi_mth.to_csv('BLS CPI/FWT_CPI_Month.csv', index=False)
cpi_ann.to_csv('BLS CPI/FWT_CPI_Year.csv', index=False)
We'll start the EDA by plotting the monthly values for each region.
px.line(cpi_mth.query("region == 'South'").
sort_values(by='month'),
x='month',
y='value',
color='category',
labels={'value':'CPI', 'month':'', 'category':'Categories:'},
title="Consumer Price Index for Major Categories: Southern Region",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
px.line(cpi_mth.query("region == 'West'").
sort_values(by='month'),
x='month',
y='value',
color='category',
labels={'value':'CPI', 'month':'', 'category':'Categories:'},
title="Consumer Price Index for Major Categories: Western Region",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
px.line(cpi_mth.query("region == 'Northeast'").
sort_values(by='month'),
x='month',
y='value',
color='category',
labels={'value':'CPI', 'month':'', 'category':'Categories:'},
title="Consumer Price Index for Major Categories: Northeast Region",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
px.line(cpi_mth.query("region == 'Midwest'").
sort_values(by='month'),
x='month',
y='value',
color='category',
labels={'value':'CPI', 'month':'', 'category':'Categories:'},
title="Consumer Price Index for Major Categories: Midwest Region",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
Now we'll calculate the percentage change of each category from 2017 to 2021 using the annual averages and plot them as a bar graph by region.
cpi_ann['pct_change']= cpi_ann\
.groupby(['region','category'])['value']\
.pct_change(periods=4).mul(100)
cpi_ann_pct = cpi_ann[['region', 'category', 'pct_change']]\
.dropna().sort_values(by='pct_change',ascending=False).copy()
cpi_ann_pct.query("region == 'South'")
region | category | pct_change | |
---|---|---|---|
520 | South | Energy | 13.194426 |
1040 | South | Shelter | 12.245735 |
0 | South | All items | 10.024173 |
780 | South | Food | 10.000723 |
260 | South | All items less food and energy | 9.764752 |
fig = px.bar(cpi_ann_pct.query("region == 'South'"),
x='category',
y='pct_change',
color='category',
text_auto='.3s',
labels={'pct_change':'', 'category':''},
title="Southern Region Consumer Price Index % Change: 2017 to 2021",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
fig.update_yaxes(ticksuffix="%")
fig.show()
fig = px.bar(cpi_ann_pct.query("region == 'Northeast'"),
x='category',
y='pct_change',
color='category',
text_auto='.3s',
labels={'pct_change':'', 'category':''},
title="Northeast Region Consumer Price Index % Change: 2017 to 2021",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
fig.update_yaxes(ticksuffix="%")
fig.show()
fig = px.bar(cpi_ann_pct.query("region == 'West'"),
x='category',
y='pct_change',
color='category',
text_auto='.3s',
labels={'pct_change':'', 'category':''},
title="Western Region Consumer Price Index % Change: 2017 to 2021",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
fig.update_yaxes(ticksuffix="%")
fig.show()
fig = px.bar(cpi_ann_pct.query("region == 'Midwest'"),
x='category',
y='pct_change',
color='category',
text_auto='.3s',
labels={'pct_change':'', 'category':''},
title="Midwest Region Consumer Price Index % Change: 2017 to 2021",
height=600,
category_orders={'category':['All items','Food','Energy','Shelter','All items less food and energy']})
fig.update_yaxes(ticksuffix="%")
fig.show()
Now lets compare the regional percentage changes for each category.
fig = px.bar(cpi_ann_pct.query("category == 'All items'"),
x='region',
y='pct_change',
color='region',
text_auto='.3s',
labels={'pct_change':'', 'region':''},
title="Consumer Price Index for All Items % Change: 2017 to 2021",
height=600,
category_orders={'region':['South','West','Northeast','Midwest']})
fig.update_yaxes(ticksuffix="%")
fig.show()
fig = px.bar(cpi_ann_pct.query("category == 'Food'"),
x='region',
y='pct_change',
color='region',
text_auto='.3s',
labels={'pct_change':'', 'region':''},
title="Consumer Price Index for Food % Change: 2017 to 2021",
height=600,
category_orders={'region':['South','West','Northeast','Midwest']})
fig.update_yaxes(ticksuffix="%")
fig.show()
fig = px.bar(cpi_ann_pct.query("category == 'Energy'"),
x='region',
y='pct_change',
color='region',
text_auto='.3s',
labels={'pct_change':'', 'region':''},
title="Consumer Price Index for Energy % Change: 2017 to 2021",
height=600,
category_orders={'region':['South','West','Northeast','Midwest']})
fig.update_yaxes(ticksuffix="%")
fig.show()
fig = px.bar(cpi_ann_pct.query("category == 'Shelter'"),
x='region',
y='pct_change',
color='region',
text_auto='.3s',
labels={'pct_change':'', 'region':''},
title="Consumer Price Index for Shelter % Change: 2017 to 2021",
height=600,
category_orders={'region':['South','West','Northeast','Midwest']})
fig.update_yaxes(ticksuffix="%")
fig.show()
fig = px.bar(cpi_ann_pct.query("category == 'All items less food and energy'"),
x='region',
y='pct_change',
color='region',
text_auto='.3s',
labels={'pct_change':'', 'region':''},
title="Consumer Price Index for All Items (less food and energy) % Change: 2017 to 2021",
height=600,
category_orders={'region':['South','West','Northeast','Midwest']})
fig.update_yaxes(ticksuffix="%")
fig.show()
1. Which region has had the largest CPI increases for all items?
The Western region saw the largest CPI increase for all items (+12.9%).
2. Which region has had the largest CPI increases for energy, food, and shelter?
Energy: West (+28.4%)
Food: West (+13.7%)
Shelter: West (+14.8%)
3. Which commodity saw the largest CPI increase?
In all four regions, the commodity that saw the largest CPI increase was Energy.
In the next part of our analysis, we will take a look at State and metro area population data.