This notebook is intended to show how to use some intermediate Altair functionality around filtering, tidying, and interactivity with an interesting public dataset. It is explicitly not intended to convey any epidemiological insight (raw case counts are not necessarily useful).
We're going to start with data from covidtracking.com. This is an excellent resource that provides an API for curated, historical, state-by-state data.
import pandas as pd
import altair as alt
alt.data_transformers.enable('json')
df=pd.read_csv("https://covidtracking.com/api/v1/states/daily.csv")
We're first going to reformat the dates to add hyphens between the year, month, and day, so 20200228 becomes 2020-02-28.
def datemunge(di):
    # 20200228 -> "2020-02-28"
    d = str(di)
    return "%s-%s-%s" % (d[0:4], d[4:6], d[6:8])
cleaned = df.copy()
cleaned["date"] = cleaned["date"].apply(datemunge)
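As a possible alternative to slicing the string by hand, the standard library can parse the same integer dates, and it also rejects impossible values such as 20200230. This is only a sketch; the helper name datemunge_strict is hypothetical and not part of the original notebook.

```python
from datetime import datetime


def datemunge_strict(di):
    """Convert an integer date such as 20200228 to the string '2020-02-28'.

    Hypothetical helper (not in the original notebook): unlike plain string
    slicing, strptime raises ValueError for dates that cannot exist.
    """
    return datetime.strptime(str(di), "%Y%m%d").date().isoformat()


print(datemunge_strict(20200228))  # 2020-02-28
```

Under these assumptions it could be passed to apply in place of datemunge.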
We'll then melt the data frame so that each observation is in its own row, so that (for example) state, date, positive, negative, hospitalized, icu becomes state, date, observation_type, observation_value, where observation_type is one of positive, negative, hospitalized, or icu.
cleaned = pd.melt(cleaned,
                  id_vars=['date', 'state', 'fips'],
                  value_vars=list(set(df.columns) - set(['date', 'state', 'fips', 'hash', 'dateChecked', 'lastUpdateEt', 'dataQualityGrade'])),
                  value_name="cases",
                  var_name="case type")
We can see the difference between these representations by looking at the source data (df) for Wisconsin on April 9th and the melted data (cleaned) for Wisconsin on April 9th.
df[(df["state"] == "WI") & (df["date"] == 20200409)]
cleaned[(cleaned["state"] == "WI") & (cleaned["date"] == "2020-04-09")].dropna()
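The wide-to-long reshaping is easier to see on a tiny frame. The following is a minimal sketch with invented state codes and counts (wide_form and long_form are illustrative names, not from the notebook); it uses the same "case type" and "cases" column names as the melt call above.

```python
import pandas as pd

# Toy wide-form frame: one row per state and date, one column per measurement.
# The state codes and numbers are made up and are not real case counts.
wide_form = pd.DataFrame({
    "date": ["2020-04-09", "2020-04-09"],
    "state": ["AA", "BB"],
    "positive": [10, 20],
    "negative": [100, 200],
})

# Long form: one row per (date, state, measurement) combination.
long_form = pd.melt(
    wide_form,
    id_vars=["date", "state"],
    value_vars=["positive", "negative"],
    var_name="case type",
    value_name="cases",
)

print(wide_form.shape)  # (2, 4)
print(long_form.shape)  # (4, 4)
print(long_form)
```

Two rows with two measurement columns become four rows, each carrying a single measurement.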
The next cell shows a function that operates on our cleaned (long-form) data frame to produce a chart of results for a specific state. We're using Altair's transform_filter function to postprocess the cleaned data to select only the rows for the requested state, with a positive (non-NaN) case count, and with one of the case types we want to plot.
def cases_for_state(state, show_points=False):
    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']
    chart = alt.Chart(cleaned).\
        encode(alt.X("date:N"),
               alt.Y("cases", scale=alt.Scale(type="log")),
               alt.Color("case type",
                         sort=alt.EncodingSortField(field="cases",
                                                    order="descending",
                                                    op="max")),
               tooltip=['date', 'state', 'case type', 'cases']).\
        transform_filter(alt.datum.state == state).\
        transform_filter(alt.datum.cases > 0).\
        transform_filter(alt.FieldOneOfPredicate("case type", case_types))
    return chart.mark_line() + chart.mark_point() if show_points else chart.mark_line()
cases_for_state("WI")
Of course, we could generate a data frame that solely has the rows we care about for a given state, like this:
case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']
wi_cases = cleaned[(cleaned["state"] == "WI") &
(cleaned["case type"].isin(case_types)) &
(pd.to_numeric(cleaned["cases"], errors="coerce") > 0)]
alt.Chart(
wi_cases
).mark_line(
).encode(
alt.X("date:N"),
alt.Y("cases", scale=alt.Scale(type="log")),
alt.Color("case type",
sort=alt.EncodingSortField(field="cases",
order="descending",
op="max"))
)
Filtering our data in Altair can be more convenient, though, and it enables interactive charting, as we'll see shortly.
We used the melt function in Pandas to go from a wide-form table (df), in which each observation is a column, to a long-form table (cleaned), in which each observation is a row.
We can also do this transformation in Altair, with the transform_fold function, as in the next cell. The fold parameter takes a list of columns to break out into new observation types, and the as_ parameter takes a two-element list consisting of what to call the observation type column (whose values are the names of the columns from fold) and what to call the observation value column (whose values are the values of the columns from fold).
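To make the roles of fold and as_ concrete, here is a pure-Python imitation of what a fold does to each row. It is only a sketch of the row-level behavior, not Altair's or Vega's implementation; fold_rows is a hypothetical helper, the data is invented, and keeping the original columns on each output row is an assumption about how the transform behaves.

```python
def fold_rows(rows, fold, as_=("key", "value")):
    """Emit one output row per input row and folded column (illustrative only)."""
    key_name, value_name = as_
    out = []
    for row in rows:
        for column in fold:
            new_row = dict(row)                # assumed: original fields are kept
            new_row[key_name] = column         # observation type = column name
            new_row[value_name] = row[column]  # observation value = column value
            out.append(new_row)
    return out


# Invented wide-form row, shaped loosely like a row of df.
rows = [{"state": "AA", "date": 20200409, "positive": 10, "death": 1}]
folded = fold_rows(rows, fold=["positive", "death"], as_=["case type", "cases"])
for r in folded:
    print(r["state"], r["case type"], r["cases"])
# AA positive 10
# AA death 1
```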
As a bonus, we'll also convert from "date integers" of the form 20200410 to actual date-time objects in Altair (instead of using DataFrame.apply). We'll construct these by dividing the date value by 10,000 and rounding down to get the year, taking the remainder of that division, dividing it by 100, and rounding down to get the month, and taking the remainder of the date value divided by 100 to get the day. (Since Vega dates use zero-indexed months, we'll also have to subtract one from the month. Phew!)
This will turn into a Vega expression that we can pass into Altair's transform_calculate method, and that looks like this:
alt.expr.datetime(
alt.expr.floor(alt.datum.date / 10000), # year
alt.expr.floor(alt.datum.date % 10000 / 100) - 1, # (zero-based) month
alt.datum.date % 100 # day
)
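The same arithmetic can be checked in plain Python before trusting it inside a Vega expression. This is a small sketch with a hypothetical helper, split_date_int, that uses integer division where the Vega version uses floor.

```python
from datetime import date


def split_date_int(d):
    """Split an integer such as 20200410 into (year, month, day)."""
    year = d // 10000         # 20200410 // 10000 -> 2020
    month = d % 10000 // 100  # 20200410 % 10000 = 410, then 410 // 100 -> 4
    day = d % 100             # 20200410 % 100 -> 10
    return year, month, day


year, month, day = split_date_int(20200410)
print(date(year, month, day).isoformat())  # 2020-04-10
print(month - 1)  # 3, the zero-based month index the Vega expression passes
```

Python's date type uses one-based months, so the subtraction is only needed on the Vega side.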
def cases_for_state_folded(state, show_points=False):
    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']
    cleaned_date = alt.expr.datetime(alt.expr.floor(alt.datum.date / 10000),  # year
                                     alt.expr.floor(alt.datum.date % 10000 / 100) - 1,  # (zero-based) month
                                     alt.datum.date % 100)  # day
    chart = alt.Chart(df).\
        encode(alt.X("monthdate(cleandate):O", title="date"),
               alt.Y("cases:Q", scale=alt.Scale(type="log")),
               alt.Color("case type:N",
                         sort=alt.EncodingSortField(field="cases",
                                                    order="descending",
                                                    op="max")),
               tooltip=['yearmonthdate(cleandate)', 'state', 'case type:N', 'cases:Q']).\
        transform_filter(alt.datum.state == state).\
        transform_calculate(
            cleandate=cleaned_date
        ).\
        transform_fold(
            as_=["case type", "cases"],
            fold=case_types
        ).\
        transform_filter(alt.datum.cases > 0)
    return chart.mark_line() + chart.mark_point() if show_points else chart.mark_line()
cases_for_state_folded("WI")
We can also use Altair's selection support to make an interactive chart that lets us choose which state to plot cases for.
def interactive_cases_for_state():
    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']
    # Offer only states that have at least one positive case count.
    has_positive = (pd.to_numeric(cleaned["cases"], errors="coerce") > 0) & (cleaned["case type"] == "positive")
    input_dropdown = alt.binding_select(options=cleaned[has_positive]["state"].sort_values().unique())
    selection = alt.selection_single(fields=['state'], bind=input_dropdown, name='Choose', init={"state":"AK"})
    chart = alt.Chart(cleaned).\
        encode(alt.X("date:N"),
               alt.Y("cases", scale=alt.Scale(type="log")),
               alt.Color("case type",
                         sort=alt.EncodingSortField(field="cases",
                                                    order="descending",
                                                    op="max")),
               tooltip=['date', 'state', 'case type', 'cases']).\
        transform_filter(selection).\
        transform_filter(alt.datum.cases > 0).\
        transform_filter(alt.FieldOneOfPredicate("case type", case_types)).\
        add_selection(selection)
    return chart.mark_line()
interactive_cases_for_state()
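The densest part of that function is the expression that builds the dropdown options: the sorted, de-duplicated states that have at least one positive case count. Here is the same logic on an invented toy frame, as a sketch only. The frame, the has_positive name, and the final conversion to a plain Python list are assumptions made for illustration.

```python
import pandas as pd

# Toy long-form frame shaped like `cleaned`; all values are invented.
toy = pd.DataFrame({
    "state": ["BB", "AA", "AA", "CC", "BB"],
    "case type": ["positive", "positive", "death", "positive", "death"],
    "cases": [5, 3, 1, 0, None],
})

# Keep rows that report a strictly positive count of type "positive".
# Missing values become NaN, and NaN > 0 evaluates to False.
has_positive = (
    (pd.to_numeric(toy["cases"], errors="coerce") > 0)
    & (toy["case type"] == "positive")
)

# Sorted, de-duplicated state codes as a plain list.
options = sorted(toy.loc[has_positive, "state"].unique().tolist())
print(options)  # ['AA', 'BB']
```

State CC is left out because its only positive-type row has a count of zero.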
To plot case counts on a map, we'll need to integrate geographic data (the shapes of states as GeoJSON polygons) with our observations.
We'll pull down state shapes from a public data source that has both state and county data, using Altair's topo_feature function:
states = alt.topo_feature("https://vega.github.io/vega-datasets/data/us-10m.json", "states")
To plot total case counts per state, we'll make a choropleth in Altair and will need to join the case counts with the state shapes. The state shapes are keyed by FIPS numeric state codes, not by alphabetical state codes. We have the FIPS codes in the source data as fips, so we'll use Altair's transform_lookup function to indicate that we want to take the case count, case type, state, and date from a postprocessed data frame where the fips field matches the id field in our state collection.
ctrim = cleaned[(cleaned["case type"] == "positive") & (cleaned["date"] == cleaned["date"].max())].copy()
alt.Chart(
states
).mark_geoshape(
).encode(
color='cases:Q',
tooltip=['state:N', 'cases:Q', 'date:N']
).transform_lookup(
lookup='id',
from_=alt.LookupData(ctrim, 'fips', ['cases', 'case type', 'state', 'date'])
).project(
type='albersUsa'
).properties(
width=500, height=400
)
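As a rough analogy, the lookup above behaves like a left join from the shapes to the case counts: every shape is kept, and the requested fields are filled in where id equals fips. The sketch below shows that idea in Pandas with invented stand-in frames (shapes, counts, and joined are illustrative names); it is not how Altair performs the lookup internally.

```python
import pandas as pd

# Stand-in for the state shapes: only the numeric id matters for the join.
shapes = pd.DataFrame({"id": [1, 2, 3]})

# Stand-in for ctrim: invented counts keyed by a numeric fips code.
counts = pd.DataFrame({
    "fips": [1, 2],
    "state": ["AA", "BB"],
    "cases": [10, 20],
})

# Left join: every shape survives, and unmatched shapes get missing values.
joined = shapes.merge(counts, how="left", left_on="id", right_on="fips")
print(joined[["id", "state", "cases"]])
```

The shape with id 3 has no matching fips row, so its state and cases are missing. On the map, a shape like that has no case count to color by.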