california-coronavirus-data examples

By Ben Welsh

A demonstration of how to use Python to work with the Los Angeles Times' independent tally of coronavirus cases in California, published on GitHub at datadesk/california-coronavirus-data. To run this notebook immediately in the cloud, click the Binder launcher below.

Binder

In [1]:
%load_ext lab_black

Import Python tools

Our data analysis and plotting tools

In [2]:
import pandas as pd
import altair as alt

Customizations to the Altair theme

In [3]:
import altair_latimes as lat
In [4]:
alt.themes.register("latimes", lat.theme)
alt.themes.enable("latimes")
Out[4]:
ThemeRegistry.enable('latimes')
In [5]:
alt.data_transformers.disable_max_rows()
Out[5]:
DataTransformerRegistry.enable('default')

Import data

Read in the agency totals

In [6]:
agency_df = pd.read_csv("../latimes-agency-totals.csv", parse_dates=["date"])
In [7]:
agency_df.head()
Out[7]:
     agency   county  fips       date  confirmed_cases  deaths  recoveries did_not_update
0   Alameda  Alameda     1 2020-06-21             4686   117.0         NaN            NaN
1  Berkeley  Alameda     1 2020-06-21              119     1.0         NaN            NaN
2    Alpine   Alpine     3 2020-06-21                1     0.0         1.0           True
3    Amador   Amador     5 2020-06-21               13     0.0        10.0            NaN
4     Butte    Butte     7 2020-06-21               94     1.0        72.0           True
In [8]:
agency_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6293 entries, 0 to 6292
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   agency           6293 non-null   object        
 1   county           6293 non-null   object        
 2   fips             6293 non-null   int64         
 3   date             6293 non-null   datetime64[ns]
 4   confirmed_cases  6293 non-null   int64         
 5   deaths           6292 non-null   float64       
 6   recoveries       2095 non-null   float64       
 7   did_not_update   893 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 393.4+ KB
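Note the sparse columns: `recoveries` is reported by only some agencies, and `did_not_update` is filled in only when it is True. One way to normalize them before analysis, sketched here on a few synthetic rows mirroring this schema (the values are illustrative, not the real CSV):

```python
import pandas as pd

# Synthetic rows mirroring the agency-totals schema (illustrative values,
# not the real LAT data).
df = pd.DataFrame(
    {
        "agency": ["Alameda", "Berkeley", "Alpine"],
        "county": ["Alameda", "Alameda", "Alpine"],
        "confirmed_cases": [4686, 119, 1],
        "deaths": [117.0, 1.0, None],
        "did_not_update": [None, None, True],
    }
)

# The flag column is only set when True, so NaN means the agency did
# report; filling with False yields a clean boolean column.
df["did_not_update"] = df["did_not_update"].fillna(False).astype(bool)

# Treat a missing death count as zero so sums stay numeric.
df["deaths"] = df["deaths"].fillna(0)
```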

Aggregate data

By state

Lump all the agencies together and you get the statewide totals.

In [9]:
state_df = (
    agency_df.groupby(["date"])
    .agg({"confirmed_cases": "sum", "deaths": "sum"})
    .reset_index()
)
In [10]:
state_df.head()
Out[10]:
        date  confirmed_cases  deaths
0 2020-01-26                2     0.0
1 2020-01-27                3     0.0
2 2020-01-28                3     0.0
3 2020-01-29                4     0.0
4 2020-01-30                4     0.0
In [11]:
state_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             148 non-null    datetime64[ns]
 1   confirmed_cases  148 non-null    int64         
 2   deaths           148 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 3.6 KB
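These are cumulative totals; a day-over-day difference converts them into daily new cases. A quick sketch on synthetic totals (the `new_cases` column is an illustrative addition, not part of the dataset):

```python
import pandas as pd

# Synthetic cumulative statewide totals (illustrative values only).
state_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-26", "2020-01-27", "2020-01-28"]),
        "confirmed_cases": [2, 3, 3],
    }
)

# diff() gives the day-over-day change; seed the first day with its
# cumulative count, since there is no prior row to subtract.
state_df["new_cases"] = (
    state_df["confirmed_cases"].diff().fillna(state_df["confirmed_cases"]).astype(int)
)
```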

By county

Three cities — Berkeley, Long Beach and Pasadena — run independent public health departments. Calculating county-level totals requires grouping them with their local peers.

In [12]:
county_df = (
    agency_df.groupby(["date", "county"])
    .agg({"confirmed_cases": "sum", "deaths": "sum"})
    .reset_index()
)
In [13]:
county_df.head()
Out[13]:
        date        county  confirmed_cases  deaths
0 2020-01-26       Alameda                0     0.0
1 2020-01-26     Calaveras                0     0.0
2 2020-01-26  Contra Costa                0     0.0
3 2020-01-26      Humboldt                0     0.0
4 2020-01-26   Los Angeles                1     0.0
In [14]:
county_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5948 entries, 0 to 5947
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             5948 non-null   datetime64[ns]
 1   county           5948 non-null   object        
 2   confirmed_cases  5948 non-null   int64         
 3   deaths           5948 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 186.0+ KB
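To confirm the grouping behaves as intended, here is a sketch on synthetic rows where two city health departments (with made-up counts) fold into the Los Angeles County total:

```python
import pandas as pd

# Synthetic rows for a single date: Long Beach and Pasadena report
# separately from their county (counts are made up for illustration).
agency_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-06-21"] * 3),
        "agency": ["Los Angeles", "Long Beach", "Pasadena"],
        "county": ["Los Angeles"] * 3,
        "confirmed_cases": [80000, 3000, 1000],
        "deaths": [3000.0, 150.0, 100.0],
    }
)

# Grouping by county collapses the three agencies into one county row.
county_df = (
    agency_df.groupby(["date", "county"])
    .agg({"confirmed_cases": "sum", "deaths": "sum"})
    .reset_index()
)
```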

Chart the statewide totals over time

In [15]:
# Create a base chart with the common x-axis
chart = alt.Chart(state_df).encode(x=alt.X("date:T", title=None))

# Create the cases line
cases = chart.mark_line(color=lat.palette["default"]).encode(
    y=alt.Y("confirmed_cases:Q", title="Confirmed cases")
)

# Create the deaths line
deaths = chart.mark_line(color=lat.palette["schemes"]["ice-7"][3]).encode(
    y=alt.Y("deaths:Q", title="Deaths")
)

# Combine them into a single chart
(cases & deaths).properties(title="Statewide cumulative totals")
Out[15]:
[Altair chart: statewide cumulative totals, two vertically stacked line charts of confirmed cases and deaths over time]

Chart the county totals

First on a linear scale

In [16]:
# Create the base chart
chart = (
    alt.Chart(county_df)
    .mark_line()
    .encode(
        x=alt.X("date:T", title=None),
        color=alt.Color("county:N", title="County", legend=None),
    )
)

# The cases line
cases = chart.encode(y=alt.Y("confirmed_cases:Q", title="Confirmed cases"))

# The deaths line
deaths = chart.encode(y=alt.Y("deaths:Q", title="Deaths"))

# Combined into a chart
(cases & deaths).properties(title="Cumulative totals by county")
Out[16]:
[Altair chart: cumulative totals by county, one line per county for confirmed cases and deaths on a linear scale]

Again on a logarithmic scale

In [17]:
# Make a base chart
chart = (
    alt.Chart(county_df)
    .mark_line()
    .encode(
        x=alt.X("date:T", title=None),
        color=alt.Color("county:N", title="County", legend=None),
    )
)

# The cases lines
cases = chart.transform_filter(alt.datum.confirmed_cases > 0).encode(
    y=alt.Y("confirmed_cases:Q", scale=alt.Scale(type="log"), title="Confirmed cases"),
)

# The deaths lines
deaths = chart.transform_filter(alt.datum.deaths > 0).encode(
    y=alt.Y("deaths:Q", scale=alt.Scale(type="log"), title="Deaths"),
)

# Slapping them together
(cases & deaths).properties(title="Cumulative totals by county")
Out[17]:
[Altair chart: cumulative totals by county, one line per county for confirmed cases and deaths on a logarithmic scale]
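The `transform_filter` calls matter because a log scale is undefined at zero: the early rows where a county had no cases or deaths yet must be excluded before plotting. The same guard in plain pandas, on synthetic rows for illustration:

```python
import pandas as pd

# Synthetic county rows: one county has no deaths yet (illustrative values).
county_df = pd.DataFrame(
    {"county": ["Alpine", "Los Angeles"], "deaths": [0.0, 3000.0]}
)

# Drop the zero rows before plotting on a log scale, mirroring the
# alt.datum.deaths > 0 filter inside the chart.
plottable = county_df[county_df["deaths"] > 0]
```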