COVID-19 Data Collection and Analysis¶

In [1]:

from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

In [1]:

import re
import requests
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup
import pickle 
from typing import Optional

1. Data Collection¶

We are interested in the degree to which the SARS-CoV-2 virus has affected United States citizens (SARS-CoV-2 is the virus that causes the COVID-19 disease). The Centers for Disease Control and Prevention (CDC) provides relevant data from USAFacts.org that includes the number of confirmed COVID-19 cases on a per-county basis. At the bottom of the web page (https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/), a blue table provides a list of states, each of which has its own web page displaying the reported numbers of cases and deaths.

We automatically downloaded each state's data with Requests and then manipulated it with BeautifulSoup. Specifically, we first fetched the web page located at base_url and save the request's returned object (a respond object) to home_page. We then used the BeautifulSoup object to parse the home page as an HTML document in order to extract the link for every state. With these extracted URLs, we populated a state_urls dictionary by setting each key to be the state name and the value to be the full URL. To avoid download state web pages multiple times frequently, we iterated through the state_urls, make a web request for each URL, and save the contents out to a file on the hard drive.

In [7]:

# Every state's url begins with this prefix
base_url = 'https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/'
# Datasets will be saved to this directory
state_dir = "./drive/MyDrive/state_data/"

In [3]:

# Fetch web page and save the response object
home_page = requests.get(base_url)
print(home_page.status_code)
print(home_page.text[:10000])

200
<!doctype html><html lang="en"><head><title data-react-helmet="true">US COVID-19 cases and deaths by state | USAFacts</title><meta data-react-helmet="true" http-equiv="X-UA-Compatible" content="IE=edge" /><meta data-react-helmet="true" charSet="utf-8" /><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1" /><meta data-react-helmet="true" name="msapplication-TileColor" content="#E3007D" /><meta data-react-helmet="true" name="theme-color" content="#E3007D" /><meta data-react-helmet="true" name="keywords" content="coronavirus, covid-19, covid-cases, Cases and Deaths" /><meta data-react-helmet="true" name="twitter:site" content="@usafacts" /><meta data-react-helmet="true" name="twitter:title" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" name="twitter:description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" name="twitter:creator" content="@usafacts" /><meta data-react-helmet="true" name="twitter:image" content="https://staticweb.usafacts.org/media/images/Covid_Report_h5H7hpb.original.jpg" /><meta data-react-helmet="true" property="fb:admins" content="Facebook numberic ID" /><meta data-react-helmet="true" property="og:site_name" content="USAFacts" /><meta data-react-helmet="true" property="og:image:width" content="1600" /><meta data-react-helmet="true" property="og:image:height" content="900" /><meta data-react-helmet="true" property="og:locale" content="en" /><meta data-react-helmet="true" property="twitter:site" content="@usafacts" /><meta data-react-helmet="true" name="description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" name="robots" content="max-snippet:-1" /><meta data-react-helmet="true" name="twitter:card" content="summary_large_image" /><meta data-react-helmet="true" property="og:title" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" property="og:description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" property="og:type" content="article" /><meta data-react-helmet="true" property="og:url" content="/visualizations/coronavirus-covid-19-spread-map/" /><meta data-react-helmet="true" property="og:image" content="https://staticweb.usafacts.org/media/images/Covid_Report_h5H7hpb.original.jpg" /><meta data-react-helmet="true" property="og:image:alt" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" property="twitter:title" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" property="twitter:description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" property="twitter:image" content="https://staticweb.usafacts.org/media/images/Covid_Report_h5H7hpb.original.jpg" /><meta data-react-helmet="true" property="article:section" content="coronavirus" /><meta data-react-helmet="true" property="article:tag" content="coronavirus" /><meta data-react-helmet="true" property="article:published_time" content="2020-01-22T09:00:00-07:00" /><meta data-react-helmet="true" property="article:modified_time" content="2022-05-13T09:00:00-07:00" /><meta data-react-helmet="true" name="twitter:label0" content="How many COVID-19 cases are there in Alabama state?" /><meta data-react-helmet="true" name="twitter:data0" content="1304710" /><meta data-react-helmet="true" name="twitter:label1" content="How many COVID-19 cases are there in Alaska state?" /><meta data-react-helmet="true" name="twitter:data1" content="239856" /><meta data-react-helmet="true" name="twitter:label2" content="How many COVID-19 cases are there in Arizona state?" /><meta data-react-helmet="true" name="twitter:data2" content="2030926" /><link rel="preload" href="/frontend-issues/static/js/react-7dff463b.40bdbd99.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/react-e5bca7e4.2d1be3f8.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-utils-static-c6a584f7.ae4b8d8b.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-utils-static-130abb37.ebbb4cbc.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-utils-static-fbb8199d.78aaeced.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-graphs.36f60ce4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-85c5278b.6f7fd7ea.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-dbafc870.2d7ab14d.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-375540fe.3a60ebc4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-99c91fca.102e336b.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-1b90ef66.ad49e4d3.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-92770bb2.f5b2d191.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-1be980da.9c4722f9.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-83875ef7.98237c04.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-58265e03.bb85cfae.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-2878d4f2.f3d842bf.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6136.5c170882.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/2417.49a3285a.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/7889.e23ca963.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/9556.74319ef6.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/829.9bfff25a.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1904.5b3b450b.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1385.15511551.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/826.0bdc9dac.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4468.928b4475.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/7918.8172dc28.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5261.e72e3e01.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5124.08339503.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6498.bff30bf4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4571.66b942e1.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4013.11a0dc61.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/8544.8c5bd262.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5207.856b72f4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4657.2525f1b2.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1999.30b3cce7.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1426.970435fe.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6235.582ec870.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6844.1c4e2776.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6125.ff7e6068.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/7364.bff8689f.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/3056.9b8755ae.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/8216.c0657a8f.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5513.65bf5c29.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6732.d4d15d2c.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/9774.2c5aaa26.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/9348.5789882c.js"

In [4]:

# Create a BeautifulSoup Object 
bs_page = BeautifulSoup(home_page.content, "html.parser")

# Populate a dictionary with state name as keys and URL as values
state_urls = {}
for link in bs_page.findAll('a', attrs={'href': re.compile("/visualizations/coronavirus-covid-19-spread-map/state/")}):
    state_urls[link.text] = "https://usafacts.org" + link.get('href')

In [5]:

# A sanity check
if len(state_urls.keys()) != 51 or \
state_urls["District of Columbia"] != "https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/state/district-of-columbia":
    print("Incorrect")
else:
    print("Correct")

Correct

In [8]:

# Save each webpage to our local hard drive
for state, url in state_urls.items():
    
    state_webpage = requests.get(url).content
    
    f = open(state_dir + state, 'wb')
    f.write(state_webpage)
    f.close()
    
    # Leave a 2 second pause between requests
    sleep(2) 

2. Data Manipulation¶

We created a data object covid_data including 4 pieces of information 1) 7-day average case (integer) 2) 7-day average deaths (integer) 3) # of confirmed cases (total) (float if # of confirmed cases per 100k) 4) # of deaths (integer).

Since the population data per county has been removed from usafacts.org, we collected those population data from other sources. We ignored counties with missing population data or populations of 0.

In [10]:

with open('./drive/MyDrive/population.pkl', 'rb') as f:
    population_dict = pickle.load(f)

def get_pop(state: str, county: str) -> Optional[int]:
    try:
        return population_dict.get(state).get(county)
    except AttributeError:
        print('incorrect state name!')
        return None

def load_covid_data(state_info):
    covid_data = {}
    for (state, state_path) in state_info:
        covid_data[state] = []
        with open(state_path, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            counties = soup.find_all('a', href=re.compile('county/'))
            for c in counties:
                row = c.find_parent('tr')
                cols = [col.text.replace(',','') for col in row.find_all('td')]

                county_name = c.text
                pop = get_pop(state, county_name)
                if ((get_pop(state, county_name)) is None) or (pop == 0):
                    continue
                covid_data[state].append({'county_name': county_name,
                                          'population': pop,
                                          '7_day_avg_cases': float(cols[0]),
                                          '7_date_ave_deaths': float(cols[1]),
                                          'cases': int(cols[2]),
                                          'deaths': int(cols[3])})
    return covid_data

In [11]:

state_info = [(state, state_dir + state) for state in state_urls.keys()]
covid_data = load_covid_data(state_info)

3. Exploratory Data Analysis (EDA) I¶

We first observed the single-most extreme counties and states, then inspected all states, after having sorted the data by some features.

We computed

The single county (and the state to which it belongs) that has the lowest rate of COVID cases per 100k people.
The single county (and the state to which it belongs) that has the highest rate of COVID cases per 100k people.

In [12]:

def calculate_county_stats(covid_data):
    
    min_county_count = 999999
    min_county_name = ""
    max_county_count = -1
    max_county_name = ""
    
    # looks through every county in every state, while checking if we have a new low or high
    for state in covid_data.keys():
        for county in covid_data[state]:
            pop = county['population']
            if (pop is None) or (pop == 0):
                continue
            covid_rate = round(county['cases'] / (pop/100000),2)
            if covid_rate < min_county_count:
                min_county_count = covid_rate
                min_county_name = county['county_name'] + " (" + state + ")"
            if covid_rate > max_county_count:
                max_county_count = covid_rate
                max_county_name = county['county_name'] + " (" + state + ")"

    print(min_county_name + " has the lowest COVID cases per 100k: " + str(float(min_county_count)))
    print(max_county_name + " has the highest COVID cases per 100k: " + str(float(max_county_count)))                

calculate_county_stats(covid_data)

Hoonah-Angoon Census Area (Alaska) has the lowest COVID cases per 100k: 0.0
Loving County (Texas) has the highest COVID cases per 100k: 115976.33

We calculated

The state that has the lowest number of deaths
The state that has the highest number of deaths

In [13]:

def calculate_state_deaths(covid_data):
    
    min_state_deaths = 999999
    min_state_name = ""
    max_state_deaths = -1
    max_state_name = ""
    for state in covid_data.keys():
        cur_state_count = 0
        for county in covid_data[state]:
            cur_state_count += county['deaths']
            
        if cur_state_count < min_state_deaths:
            min_state_deaths = cur_state_count
            min_state_name = state
        if cur_state_count > max_state_deaths:
            max_state_deaths = cur_state_count
            max_state_name = state

    print(min_state_name + " has the fewest COVID deaths: " + str(min_state_deaths))
    print(max_state_name + " has the most COVID deaths: " + str(max_state_deaths))            

calculate_state_deaths(covid_data)

Vermont has the fewest COVID deaths: 646
California has the most COVID deaths: 89931

We calculated

The state that has the lowest rate of deaths based on its entire population
The state that has the highest rate of deaths based on its entire population

In [14]:

def calculate_state_deathrate(covid_data):
    
    min_state_death_rate = -1
    min_state_name = ""
    max_state_death_rate = 9999999
    max_state_name = ""
    
    for state in covid_data.keys():
        cur_state_deaths = 0
        cur_state_population = 0
        for county in covid_data[state]:
            pop = county['population']
            if (county['cases'] > 0) and (pop is not None):
                cur_state_population += pop
                cur_state_deaths += county['deaths']
                
        cur_state_deathrate = float(cur_state_population) / cur_state_deaths
        
        if cur_state_deathrate > min_state_death_rate:
            min_state_death_rate = cur_state_deathrate
            min_state_name = state
        if cur_state_deathrate < max_state_death_rate:
            max_state_death_rate = cur_state_deathrate
            max_state_name = state
            
    print(min_state_name + " has the lowest COVID death rate; 1 out of every " + str(round(min_state_death_rate)) + " people has died")
    print(max_state_name + " has the highest COVID death rate; 1 out of every " + str(round(max_state_death_rate)) + " people has died")

calculate_state_deathrate(covid_data)

Hawaii has the lowest COVID death rate; 1 out of every 996 people has died
Mississippi has the highest COVID death rate; 1 out of every 239 people has died

Complicated analysis requires a better data structure like pandas dataframe. We now convert the previous dictionary of lists of dictionaries to a pandas dataframe. Each row corresponds to a unique county. Five columns are county, state, # total covid cases (integer), # covid case per 100k (float), and # covid deaths (integer).

In [15]:

def convert_to_pandas(covid_data):
    
    covid_data_flipped = []
    for state, counties in covid_data.items():
        for county in counties: 
            pop = county['population']
            if (pop is None) or (pop == 0):
                continue
            cases = county['cases']
            cur_dict = {"county":county['county_name'], "state":state,
                            "# total covid cases": cases,
                            "# covid cases per 100k": cases/(pop/100000),
                           "# covid deaths": county['deaths']}
            covid_data_flipped.append(cur_dict)
    covid_df = pd.json_normalize(covid_data_flipped)
    return covid_df

covid_df = convert_to_pandas(covid_data)
print(covid_df.shape)
covid_df.head()

(3118, 5)

Out[15]:

	county	state	# total covid cases	# covid cases per 100k	# covid deaths
0	Autauga County	Alabama	15863	28393.205534	216
1	Baldwin County	Alabama	55862	25023.965883	681
2	Barbour County	Alabama	5681	23013.043831	98
3	Bibb County	Alabama	6457	28833.616147	105
4	Blount County	Alabama	15005	25948.535261	243

In [16]:

covid_df.to_csv('./combined_data/covid_df.csv', index=False)

We can use this dataframe to compute same quantities as done above more easily

the single county (and the state to which it belongs) that has the lowest rate of COVID cases per 100k people
the single county (and the state to which it belongs) that has the highest rate of COVID cases per 100k people
the state that has the lowest number of deaths
the state that has the highest number of deaths
The state that has the lowest rate of deaths based on its entire population
The state that has the highest rate of deaths based on its entire population

In [14]:

def calculate_county_stats2(covid_df):

    sorted_df = covid_df.sort_values(by=['# covid cases per 100k'])
    lowest = sorted_df.iloc[0]
    highest = sorted_df.iloc[-1]

    print(f"{lowest['county']} ({lowest['state']}) has the lowest rate of confirmed COVID cases per 100k: {lowest['# covid cases per 100k']:,.2f}")
    print(f"{highest['county']} ({highest['state']}) has the highest rate of confirmed COVID cases per 100k: {highest['# covid cases per 100k']:,.2f}")
    
calculate_county_stats2(covid_df)

Kalawao County (Hawaii) has the lowest rate of confirmed COVID cases per 100k: 0.00
Loving County (Texas) has the highest rate of confirmed COVID cases per 100k: 113,017.75

In [15]:

def calculate_state_deaths2(covid_df):
    
    state_deaths = covid_df.groupby('state').sum().sort_values(by=['# covid deaths'])
    lowest = state_deaths.iloc[0]
    highest = state_deaths.iloc[-1]

    print(lowest.name + " has the fewest COVID deaths: " + str(lowest['# covid deaths']))
    print(highest.name + " has the most COVID deaths: " + str(highest['# covid deaths']))

calculate_state_deaths2(covid_df)

Vermont has the fewest COVID deaths: 640.0
California has the most COVID deaths: 89667.0

In [17]:

def calculate_state_deathrate2(covid_df):
    
    covid_df2 = covid_df
    covid_df2['population'] = 100000*covid_df2['# total covid cases'] / covid_df2['# covid cases per 100k']
    covid_df2 = covid_df2.groupby('state').sum()
    covid_df2['death_rate'] = covid_df2['population'] / covid_df2['# covid deaths']
    covid_df2 = covid_df2.sort_values(by=['death_rate'])

    print(covid_df2.iloc[-1].name + " has the lowest COVID death rate; 1 out of every " + str(int(covid_df2.iloc[-1]['death_rate'])) + " people has died")
    print(covid_df2.iloc[0].name + " has the highest COVID death rate; 1 out of every " + str(int(covid_df2.iloc[0]['death_rate'])) + " people has died")

calculate_state_deathrate2(covid_df)

Hawaii has the lowest COVID death rate; 1 out of every 995 people has died
Mississippi has the highest COVID death rate; 1 out of every 238 people has died

Furthermore, considering that the data is messy and some are not reliable, we attempted to understand some of the uncertainty around COVID data. We consider that false negatives of deaths of COVID-19 is minimal. Every disease has a mortality rate and we can consider it's constant throughout all people in the US. Although some are at highe risk (e.g. older folks, people with pre-existing conditions, etc), we can imagine that this variance in the population to be fairly uniform throughout the USA. Therefore, if all counties were equal in their testing, we are supposed to see a consistent ratio between # people who died from COVID and # of people who tested positive for COVID, which is called 'case fatality rate'.

In [18]:

def add_death_stats(covid_df):
    
    # can add an infintesimal or fillna after the fact to handle nans from divide by 0.
    
    covid_df['population'] = 100000*covid_df['# total covid cases'] / (covid_df['# covid cases per 100k']+0.0001)
#     covid_df.fillna(0, inplace=True)
    covid_df["population"] = covid_df["population"].astype('int32')
    
    covid_df['# covid deaths per 100k'] = 100000*covid_df['# covid deaths'] / (covid_df['population']+0.0001)
#     covid_df.fillna(0, inplace=True)
    covid_df["# covid deaths per 100k"] = covid_df["# covid deaths per 100k"].astype('int32')
    
    covid_df['case_fatality_rate'] =  covid_df['# covid deaths'] / (covid_df['# total covid cases']+0.0001)
#     covid_df.fillna(0, inplace=True)
    covid_df = covid_df.sort_values(by=['case_fatality_rate'])

    return covid_df

covid_updated = add_death_stats(covid_df)
covid_updated

Out[18]:

	county	state	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate
1702	Loup County	Nebraska	87	13102.409639	0	663	0	0.000000
1696	Keya Paha County	Nebraska	118	14640.198511	0	805	0	0.000000
544	Kalawao County	Hawaii	0	0.000000	0	0	0	0.000000
87	Skagway Municipality	Alaska	30	2535.925613	0	1182	0	0.000000
269	Jackson County	Colorado	166	11925.287356	0	1391	0	0.000000
...	...	...	...	...	...	...	...	...
2715	Sabine County	Texas	1254	11895.276039	89	10541	844	0.070973
427	Dodge County	Georgia	2110	10240.232953	154	20604	747	0.072986
1752	Storey County	Nevada	133	3225.806452	11	4122	266	0.082707
538	Wilcox County	Georgia	828	9588.882455	71	8634	822	0.085749
444	Glascock County	Georgia	269	9054.190508	25	2970	841	0.092937

3118 rows × 8 columns

In [19]:

covid_updated.to_csv('./combined_data/covid_updated.csv', index = False)

From all those analyses above, we learned that states vary wildly in their death rate (e.g., The number of deaths in New Jersey or California is orders of magnitude higher than those in Hawaii or Alaska) and COVID testing. States also fluctuate a lot amongst their counties, as some counties with very bad statistics are within states with good statistics.

When it comes to data reliability, some states and counties are probably more proactive when it comes to testing, so they could have higher cases counts. Other counties might have a similar number of cases or higher, but they are just not being represented in the data due to lower testing. Deaths are thus harder to overlook, so states with lax testing policies may have inflated deaths per case metrics. Perhaps we could supplement the data with some measure of testing rates in the county or state.

4. Incorporate More Data¶

We are also interested in how COVID has impacted our world. We can better understand this by looking at how it relates to demographics, income, education, health, and politicala voting.

Our case_fatality_rate column can be viewed as an approximation of how effective and thorough COVID testing is for a given county. Our # covid deaths column can be viewed as an extreme indication of how severe COVID has impacted a given county. Our # covid cases per 100k column be viewed as middle-ground between the two aforementioned features. That is, it measures the impact of the disease and is influenced by the thoroughness of COVID testing.

Using these three informative features, we can inspect how impacted each county is, while correlating this with other features of each county, such as income-level, health metrics, demographics, etc.

In this project, we merged our COVID case data with 'election2020_by_county.csv' dataset. We only care about 15 columns which are hispanic, minority, female, unemployed, income, nodegree, bachelor, inactivity, obesity, desity, cancer, voter_turnout, voter_gap, trump, biden. We droppde fipscode and population columns.

A data description is as follows:

state: the state in which the county lies
fipscode: an ID to identify each county
county: the name of each county
population: total population
hispanic: percent of adults that are hispanic
minority: percent of adults that are nonwhite
female: percent of adults that are female
unemployed: unemployment rate, as a percent
income: median income
nodegree: percent of adults who have not completed high school
bachelor: percent of adults with a bachelor’s degree
inactive: percent of adults who do not exercise in their leisure time
obesity: percent of adults with BMI > 30
density: population density, persons per square mile of land
cancer: prevalence of cancer per 100,000 individuals
voter_turnout: percentage of voting age population that voted
voter_gap: percentage point gap in 2020 presidential voting: trump-briden

In [21]:

def merge_data(covid_updated, filepath):
    
    data2020 = pd.read_csv(filepath).drop(columns=['fipscode', 'population'])
    return pd.merge(covid_updated, data2020, on=['state', 'county'])

merged = merge_data(covid_updated, './drive/MyDrive/election2020_by_county.csv')
print(merged.shape)
merged.head()

(3044, 23)

Out[21]:

	county	state	# total covid cases	# covid cases per 100k	population	hispanic	minority	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
0	Loup County	Nebraska	87	13102.409639	663	0.0	0.4	...	6.2	14.4	33.0	30.7	1.3	NaN	-4.849885	65.0	81.5	16.5
1	Keya Paha County	Nebraska	118	14640.198511	805	1.1	2.2	...	8.0	15.8	30.2	29.6	7.9	256.3	8.000000	80.7	90.0	9.3
2	Jackson County	Colorado	166	11925.287356	1391	24.4	25.1	...	15.5	17.5	20.5	21.4	0.9	207.1	12.874251	58.1	77.9	19.8
3	Daggett County	Utah	37	3894.736842	949	3.3	5.1	...	12.4	19.3	20.3	26.1	14.5	185.3	-9.461967	62.4	80.2	17.8
4	Hinsdale County	Colorado	131	15975.609756	819	6.0	9.1	...	5.1	41.3	14.8	20.5	0.8	NaN	5.952381	15.6	55.9	40.3

5 rows × 23 columns

In [22]:

merged.to_csv('./combined_data/merged.csv', index = False)

Due to mismatching happened during merging, we have lost some rows.

In [19]:

print(len(covid_updated) - len(merged))

In [20]:

missing_counties = set()
merged_counties = set()
for index, row in merged.iterrows():
    merged_counties.add(row['county'].lower() + "_" + row['state'].lower())

missing_idxs = []
for index, row in covid_updated.iterrows():
    cur_county = row['county'].lower() + "_" + row['state'].lower()
    if cur_county not in merged_counties:
        # print("missing",cur_county)
        missing_idxs.append(index)
        missing_counties.add(cur_county)

covid_updated.loc[missing_idxs]

Out[20]:

	county	state	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate
87	Skagway Municipality	Alaska	30	2535.925613	0	1182	0	0.000000
90	Wrangell City and Borough	Alaska	71	2837.729816	0	2501	0	0.000000
75	Hoonah-Angoon Census Area	Alaska	0	0.000000	0	0	0	0.000000
544	Kalawao County	Hawaii	0	0.000000	0	0	0	0.000000
68	Aleutians West Census Area	Alaska	863	15317.713880	2	5633	35	0.002317
...	...	...	...	...	...	...	...	...
1127	East Feliciana Parish	Louisiana	7320	38254.507447	170	19134	888	0.023224
1122	Claiborne Parish	Louisiana	3280	20931.716656	77	15669	491	0.023476
1129	Franklin Parish	Louisiana	7347	36707.469398	178	20014	889	0.024228
1148	Red River Parish	Louisiana	2110	24994.077233	53	8441	627	0.025118
1115	Bienville Parish	Louisiana	3990	30133.675704	110	13240	830	0.027569

74 rows × 8 columns

In [21]:

# Remove rows with 0 deaths
merged = merged.loc[merged['# covid deaths'] != 0]

# Summary statistics
merged.describe()

Out[21]:

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	3.021000e+03	3021.000000	3021.000000	3.021000e+03	3021.000000	3021.000000	3021.000000	3021.000000	3021.000000	3021.000000	...	3021.000000	3021.000000	3021.000000	3021.000000	3021.000000	2979.000000	2984.000000	2983.000000	2983.000000	2983.000000
mean	2.546425e+04	24314.359415	302.067196	1.055975e+05	366.179742	0.015760	9.259484	22.694009	49.932903	5.495035	...	14.970275	20.066137	25.910824	30.947104	250.792122	228.507486	35.459084	32.702749	65.487026	32.784278
std	8.662408e+04	6015.982401	1006.517139	3.391059e+05	162.864254	0.008579	13.881480	19.920798	2.375417	1.969195	...	6.740886	8.846873	5.161410	4.467600	1741.445511	56.125890	13.868524	31.279827	15.699166	15.590596
min	3.200000e+01	3274.314819	1.000000	1.680000e+02	11.000000	0.000527	0.000000	0.000000	19.166215	1.800000	...	1.900000	2.600000	8.100000	11.800000	0.100000	46.200000	-168.323353	-90.000000	4.000000	3.100000
25%	2.560000e+03	20683.907225	41.000000	1.106500e+04	249.000000	0.010637	2.000000	6.900000	49.462074	4.100000	...	9.900000	14.000000	22.600000	28.300000	17.000000	193.300000	27.675757	15.050000	56.700000	20.800000
50%	6.401000e+03	24365.817878	96.000000	2.584300e+04	358.000000	0.014321	4.000000	15.200000	50.392410	5.300000	...	13.500000	17.900000	25.800000	31.100000	45.300000	230.300000	35.055497	39.100000	68.700000	29.600000
75%	1.705000e+04	27802.251591	229.000000	6.784200e+04	468.000000	0.018885	9.500000	33.900000	51.080252	6.500000	...	19.200000	23.600000	29.400000	33.700000	112.900000	265.100000	42.473125	56.800000	77.500000	41.650000
max	2.751220e+06	113017.751479	31736.000000	1.003911e+07	1229.000000	0.092937	99.200000	99.400000	58.100420	24.000000	...	53.300000	75.100000	41.400000	47.600000	69468.400000	458.300000	100.000000	93.100000	96.200000	94.000000

8 rows × 21 columns

5. Exploratory Data Analysis (EDA) II¶

We can partition any quantitative feature by using quantiles. With arbitrarily chosen minv and maxv, we can partition certain feature of interest multiple times and observe interesting relationships.

In [22]:

# Given minv and maxv, this function returns a subset of the dataframe that has feature values between minv and maxv inclusive.
def partition_df(df, column_name, minv, maxv):
    return df.loc[(merged[column_name] >= minv) & (merged[column_name] <= maxv)]

In [23]:

partition_df(merged, 'income', 21000, 31000).describe()

Out[23]:

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	...	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000	117.000000
mean	4634.606838	26336.198665	85.504274	16558.410256	520.239316	0.021598	8.474359	43.990598	49.624858	9.065812	...	26.895726	11.892308	31.676923	36.230769	84.973504	244.556410	44.983493	13.635043	56.181197	42.546154
std	5658.986674	7096.649003	87.830030	17899.159118	152.402917	0.010778	21.306839	32.372695	3.836346	2.228389	...	5.736489	3.710822	4.358590	5.410948	176.212760	47.754773	10.879040	45.685134	22.877081	22.815828
min	184.000000	7931.904161	3.000000	1536.000000	174.000000	0.004765	0.000000	0.400000	35.462777	4.400000	...	8.000000	5.800000	19.700000	21.000000	1.400000	99.200000	-10.411765	-71.600000	13.500000	9.300000
25%	2003.000000	22647.596516	43.000000	8110.000000	434.000000	0.015873	0.700000	5.800000	48.747893	7.700000	...	23.400000	9.200000	29.000000	32.600000	17.500000	215.200000	39.175362	-27.900000	35.100000	19.900000
50%	2889.000000	27116.874368	56.000000	11839.000000	515.000000	0.018861	1.600000	49.700000	50.691193	8.700000	...	26.900000	11.300000	32.000000	36.400000	36.800000	243.900000	44.536665	14.200000	54.900000	40.800000
75%	5247.000000	31600.407747	96.000000	18067.000000	599.000000	0.022789	3.400000	73.100000	51.953125	10.100000	...	29.800000	13.800000	35.000000	40.300000	74.100000	272.700000	51.138995	58.500000	78.900000	63.000000
max	40588.000000	39821.693908	589.000000	130624.000000	1064.000000	0.067568	99.200000	99.400000	56.526573	17.600000	...	53.300000	27.900000	41.300000	47.600000	1261.500000	380.000000	70.003709	80.400000	89.700000	85.100000

8 rows × 21 columns

In [24]:

def view_partitions(df, feature, n_partitions=3, cols=None):
    if cols is None:
        cols = df.columns
    start = 0
    for  i in range(n_partitions):
        stop = start + (1/n_partitions)
        display(partition_df(merged, feature,
                             merged[feature].quantile(start),
                             merged[feature].quantile(stop))[cols].describe())
        start = stop
view_partitions(merged, 'obesity')

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	1.033000e+03	1033.000000	1033.000000	1.033000e+03	1033.000000	1033.000000	1033.000000	1033.000000	1033.000000	1033.000000	...	1033.000000	1033.000000	1033.000000	1033.000000	1033.000000	1013.000000	1014.000000	1014.000000	1014.000000	1014.000000
mean	4.635578e+04	22771.754766	493.433688	1.961860e+05	313.152953	0.014420	14.784705	24.974831	49.784908	5.065247	...	13.033495	25.626718	22.030591	26.236883	365.470474	213.825469	32.293654	24.426331	61.280375	36.854043
std	1.397306e+05	6629.900988	1602.636109	5.424150e+05	172.554693	0.009160	17.944395	19.940681	2.537714	1.874395	...	6.728381	10.933433	4.181953	2.969353	1641.553548	61.119227	17.665726	36.403793	18.306117	18.108292
min	3.500000e+01	3274.314819	1.000000	1.680000e+02	11.000000	0.001057	0.000000	0.000000	31.955024	1.800000	...	1.900000	2.600000	8.100000	11.800000	0.100000	46.200000	-21.417069	-90.000000	4.000000	3.100000
25%	2.310000e+03	18617.709313	33.000000	1.072400e+04	180.000000	0.008413	3.100000	9.800000	49.321586	3.800000	...	8.300000	17.300000	19.200000	25.000000	11.000000	169.700000	21.126119	-2.200000	47.850000	21.900000
50%	7.027000e+03	22735.283123	93.000000	3.256400e+04	285.000000	0.012746	7.400000	18.600000	50.331365	4.800000	...	11.300000	23.300000	22.400000	27.100000	45.800000	212.600000	29.870988	27.500000	62.800000	35.200000
75%	3.332500e+04	26132.178794	335.000000	1.529390e+05	414.000000	0.017883	19.200000	35.200000	51.018405	5.900000	...	16.300000	32.200000	25.000000	28.400000	162.300000	251.600000	40.219742	54.300000	76.275000	49.975000
max	2.751220e+06	113017.751479	31736.000000	1.003911e+07	1229.000000	0.092937	95.300000	97.400000	58.100420	24.000000	...	46.800000	75.100000	38.800000	29.400000	32903.300000	445.400000	99.552895	93.100000	96.200000	94.000000

8 rows × 21 columns

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	1015.000000	1015.000000	1015.000000	1.015000e+03	1015.000000	1015.000000	1015.000000	1015.000000	1015.000000	1015.000000	...	1015.000000	1015.000000	1015.000000	1015.000000	1015.000000	1002.000000	1004.000000	1003.000000	1003.000000	1003.000000
mean	17413.966502	24372.422566	226.448276	7.111427e+04	373.295567	0.016179	7.553990	18.795961	49.890443	5.326305	...	14.513399	18.695074	26.218128	31.132217	205.667586	234.567764	34.878074	38.696510	68.491426	29.794915
std	35239.630504	5506.207523	453.917323	1.464475e+05	157.158042	0.008917	11.862685	17.305463	2.238655	1.872417	...	6.318933	6.224281	3.983797	1.018496	2212.807074	54.672491	12.708142	24.848158	12.469813	12.388141
min	32.000000	6911.447084	1.000000	4.620000e+02	27.000000	0.000809	0.000000	1.500000	19.166215	1.800000	...	3.800000	4.400000	15.800000	29.400000	0.100000	82.500000	-168.323353	-65.600000	16.900000	5.600000
25%	2331.500000	21097.270656	37.000000	1.039850e+04	262.000000	0.010903	2.000000	6.000000	49.403040	4.100000	...	9.700000	14.300000	23.300000	30.200000	16.700000	199.825000	28.279638	23.300000	60.750000	20.300000
50%	6279.000000	24563.536449	98.000000	2.640300e+04	359.000000	0.014236	3.500000	12.200000	50.314825	5.100000	...	12.900000	17.600000	26.100000	31.100000	42.900000	236.200000	34.537246	42.100000	70.100000	28.000000
75%	15557.000000	27653.785878	219.500000	6.302800e+04	468.000000	0.019337	7.300000	26.850000	50.970982	6.300000	...	18.200000	21.700000	28.800000	32.000000	98.600000	269.675000	40.807564	57.600000	78.050000	37.500000
max	402352.000000	61934.129379	7728.000000	1.584063e+06	1212.000000	0.085749	99.200000	99.400000	56.633907	21.800000	...	53.300000	44.600000	41.400000	32.800000	69468.400000	425.400000	100.000000	86.000000	92.000000	82.500000

8 rows × 21 columns

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	1036.000000	1036.000000	1036.000000	1.036000e+03	1036.000000	1036.000000	1036.000000	1036.000000	1036.000000	1036.000000	...	1036.000000	1036.000000	1036.000000	1036.000000	1036.000000	1026.000000	1029.000000	1029.000000	1029.000000	1029.000000
mean	11955.729730	25787.480672	178.888031	4.675872e+04	411.768340	0.016674	5.380309	23.938707	50.117482	6.053378	...	17.302220	15.871815	29.476834	35.479826	174.186969	237.092982	39.012722	35.537026	66.960641	31.423615
std	23933.119885	5361.167248	374.326997	9.641718e+04	141.464044	0.007432	8.434501	21.605540	2.295568	2.021584	...	6.487572	4.850790	4.223773	2.422581	1164.339521	48.778309	8.717395	29.339460	14.662722	14.686429
min	96.000000	9330.346798	1.000000	7.210000e+02	14.000000	0.000527	0.000000	0.400000	34.868478	1.800000	...	3.100000	5.800000	15.600000	32.800000	0.100000	61.800000	12.181303	-80.800000	8.800000	5.400000
25%	3003.250000	22698.614312	49.000000	1.222000e+04	321.750000	0.012394	1.600000	6.000000	49.643073	4.800000	...	12.275000	12.400000	26.500000	33.700000	22.950000	206.650000	33.091367	21.900000	60.000000	20.900000
50%	6047.000000	26021.133583	98.000000	2.313750e+04	411.000000	0.015669	2.800000	15.300000	50.510986	5.900000	...	16.400000	15.000000	29.500000	34.800000	46.350000	237.000000	39.270875	42.800000	70.300000	27.800000
75%	11853.500000	28860.094198	180.000000	4.566200e+04	494.000000	0.019305	5.400000	37.925000	51.268131	7.200000	...	22.200000	18.600000	32.400000	36.700000	102.325000	267.000000	44.732536	56.900000	77.500000	38.100000
max	410889.000000	72727.272727	7953.000000	1.749342e+06	996.000000	0.066603	91.800000	93.400000	56.526573	16.900000	...	40.500000	42.600000	41.300000	47.600000	35369.200000	458.300000	70.003709	87.900000	93.300000	89.600000

8 rows × 21 columns

In [25]:

view_partitions(merged, 'inactivity')

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	1.007000e+03	1007.000000	1007.000000	1.007000e+03	1007.000000	1007.000000	1007.000000	1007.000000	1007.000000	1007.000000	...	1007.000000	1007.000000	1007.000000	1007.000000	1007.000000	992.000000	982.000000	982.000000	982.000000	982.000000
mean	4.741475e+04	23140.636087	483.670308	2.030346e+05	271.937438	0.012093	12.365045	22.890566	49.707254	4.982423	...	11.348163	26.617577	20.354816	27.668818	461.912115	213.861895	31.293156	17.149898	57.540224	40.390326
std	1.367664e+05	6443.122064	1512.796366	5.351729e+05	139.972161	0.006343	15.732390	18.824834	2.499203	1.795947	...	5.740513	10.472965	2.836158	4.213967	2770.745246	61.473242	17.404163	32.866155	16.459077	16.417551
min	6.300000e+01	3274.314819	1.000000	1.680000e+02	11.000000	0.001057	0.000000	0.000000	19.166215	1.900000	...	1.900000	2.600000	8.100000	11.800000	0.100000	46.200000	-21.417069	-90.000000	4.000000	3.100000
25%	3.455000e+03	19408.061608	38.000000	1.509950e+04	173.000000	0.007854	3.000000	8.600000	49.336073	3.800000	...	7.700000	18.700000	18.700000	25.300000	12.250000	166.950000	21.203168	-5.675000	46.025000	28.400000
50%	1.100700e+04	22993.187599	120.000000	4.758000e+04	255.000000	0.011327	5.900000	16.200000	50.208257	4.700000	...	10.100000	24.800000	21.200000	28.100000	46.900000	211.450000	28.190647	19.750000	59.000000	39.200000
75%	3.888200e+04	26299.994779	379.000000	1.662710e+05	342.000000	0.014652	14.350000	31.900000	50.863073	5.800000	...	13.100000	32.450000	22.600000	30.600000	177.150000	254.500000	37.751744	40.775000	69.275000	51.800000
max	2.751220e+06	113017.751479	31736.000000	1.003911e+07	994.000000	0.081481	94.100000	94.800000	58.100420	24.000000	...	43.900000	75.100000	23.700000	39.700000	69468.400000	445.400000	99.552895	93.100000	96.200000	94.000000

8 rows × 21 columns

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	1028.000000	1028.000000	1028.000000	1.028000e+03	1028.000000	1028.000000	1028.000000	1028.000000	1028.000000	1028.000000	...	1028.000000	1028.000000	1028.000000	1028.000000	1028.000000	1009.000000	1017.000000	1016.000000	1016.000000	1016.000000
mean	19504.956226	23995.600128	270.469844	7.753021e+04	392.975681	0.017394	10.354864	22.804086	49.936301	5.255837	...	14.964105	18.713132	25.842802	31.003794	183.769747	229.305055	34.853118	37.605906	67.993012	30.387106
std	51624.583537	5760.949741	767.045525	1.961051e+05	165.631212	0.009503	15.641803	20.733552	2.292643	1.910234	...	6.407062	5.868454	1.219701	3.033814	1134.003874	54.053794	12.759806	27.366819	13.721102	13.655311
min	35.000000	6103.922159	1.000000	4.860000e+02	14.000000	0.000809	0.000000	0.200000	32.813627	1.800000	...	3.100000	4.400000	23.800000	22.400000	0.100000	70.000000	-168.323353	-76.800000	11.200000	3.900000
25%	2028.000000	20493.710413	34.000000	8.923250e+03	283.000000	0.011700	2.100000	6.400000	49.342871	4.100000	...	10.500000	14.500000	24.900000	28.700000	13.350000	194.300000	28.566943	22.500000	60.175000	20.300000
50%	5230.000000	24010.922870	87.000000	2.177150e+04	378.000000	0.015793	4.200000	14.500000	50.324591	5.000000	...	13.400000	17.900000	25.800000	31.000000	40.100000	230.700000	34.537328	41.200000	69.800000	28.450000
75%	14111.500000	27195.090081	202.250000	5.723600e+04	480.000000	0.020755	10.600000	35.025000	51.076514	6.200000	...	18.100000	21.600000	26.800000	33.100000	107.425000	263.200000	40.992547	58.100000	78.200000	38.025000
max	714053.000000	61934.129379	12823.000000	2.559902e+06	1229.000000	0.092937	99.200000	99.400000	56.633907	21.800000	...	53.300000	46.300000	28.000000	42.000000	32903.300000	433.900000	100.000000	92.000000	95.900000	88.000000

8 rows × 21 columns

	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	hispanic	minority	female	unemployed	...	nodegree	bachelor	inactivity	obesity	density	cancer	voter_turnout	voter_gap	trump	biden
count	1018.000000	1018.000000	1018.000000	1.018000e+03	1018.000000	1018.000000	1018.000000	1018.000000	1018.000000	1018.000000	...	1018.000000	1018.000000	1018.000000	1018.000000	1018.000000	1010.000000	1017.000000	1017.000000	1017.000000	1017.000000
mean	10061.785855	25805.500407	163.643418	3.817520e+04	433.435167	0.017819	4.975049	22.328193	50.173025	6.245678	...	18.565914	14.874361	31.541159	34.180157	105.106189	242.664257	40.216906	43.029695	70.764503	27.734808
std	27440.411056	5519.297680	481.803986	9.681484e+04	134.701725	0.008467	7.262961	20.043183	2.303834	1.958725	...	5.977301	4.333156	2.735776	3.406723	212.876177	48.768550	8.482994	27.054607	13.510946	13.552863
min	32.000000	6911.447084	1.000000	4.620000e+02	14.000000	0.000527	0.000000	0.400000	35.150327	1.800000	...	3.900000	5.800000	28.000000	23.800000	0.300000	94.400000	1.730104	-71.600000	13.500000	5.000000
25%	2675.000000	22639.004381	48.000000	1.110300e+04	353.000000	0.013173	1.500000	5.800000	49.796348	5.000000	...	14.200000	11.800000	29.300000	31.900000	24.525000	212.425000	34.466113	33.900000	65.900000	18.800000
50%	5405.000000	26231.941999	92.000000	2.070000e+04	431.000000	0.016078	2.600000	15.000000	50.616007	6.100000	...	18.300000	14.300000	31.000000	33.800000	46.450000	242.650000	40.067710	50.200000	74.200000	24.100000
75%	10497.250000	29331.867235	169.000000	4.047700e+04	521.000000	0.020233	5.100000	34.900000	51.302674	7.400000	...	22.600000	17.300000	33.375000	36.300000	93.525000	272.075000	45.901233	60.900000	79.800000	32.400000
max	661165.000000	53241.683639	11854.000000	2.253857e+06	996.000000	0.072477	63.500000	91.200000	56.633907	16.900000	...	40.500000	36.900000	41.400000	47.600000	2800.000000	458.300000	70.003709	88.300000	93.300000	85.100000

8 rows × 21 columns

In [26]:

view_partitions(merged, 'income',
                cols=['income','# total covid cases', '# covid cases per 100k', '# covid deaths',
                      'population', '# covid deaths per 100k', 'case_fatality_rate',
                      'obesity', 'inactivity', 'trump', 'biden'])

	income	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	obesity	inactivity	trump	biden
count	1007.000000	1007.000000	1007.000000	1007.000000	1.007000e+03	1007.000000	1007.000000	1007.000000	1007.000000	1002.000000	1002.000000
mean	35886.695134	8288.693148	25014.583343	142.903674	3.293902e+04	464.244290	0.020079	32.942304	29.309037	67.414072	31.157685
std	3799.138075	20595.465992	6509.586741	360.681418	8.483606e+04	157.840313	0.010388	4.450842	4.676727	15.976406	15.928395
min	21658.000000	35.000000	7186.858316	1.000000	4.030000e+02	48.000000	0.002333	17.400000	13.600000	11.200000	7.100000
25%	33575.000000	2237.500000	20856.173791	45.000000	9.733000e+03	367.500000	0.014051	30.100000	25.900000	58.600000	18.925000
50%	36566.000000	4471.000000	25428.263215	81.000000	1.778100e+04	458.000000	0.017502	32.900000	29.600000	72.200000	26.300000
75%	38891.000000	8407.500000	29159.082843	143.500000	3.193600e+04	551.500000	0.022740	35.900000	32.800000	79.600000	39.975000
max	41150.000000	402352.000000	56864.875543	7728.000000	1.584063e+06	1229.000000	0.092937	47.600000	41.400000	92.600000	88.000000

	income	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	obesity	inactivity	trump	biden
count	1007.000000	1.007000e+03	1007.000000	1007.000000	1.007000e+03	1007.000000	1007.000000	1007.000000	1007.000000	998.000000	998.000000
mean	45378.542205	2.117232e+04	24204.352233	261.272095	8.303669e+04	356.008937	0.015158	30.918272	25.678153	67.789178	30.421844
std	2620.626459	6.098195e+04	5334.500991	680.338463	2.086682e+05	139.711464	0.006599	3.699632	4.176813	13.072600	12.934935
min	41161.000000	3.200000e+01	6911.447084	1.000000	4.620000e+02	14.000000	0.000809	14.800000	11.200000	14.100000	5.000000
25%	43107.500000	2.489500e+03	20834.832707	36.000000	1.075750e+04	259.500000	0.010992	28.700000	23.000000	60.525000	21.000000
50%	45207.000000	6.889000e+03	24120.603015	100.000000	2.887900e+04	359.000000	0.014220	31.100000	25.700000	70.300000	27.900000
75%	47539.500000	1.574600e+04	27296.568873	229.000000	6.443850e+04	447.000000	0.018293	33.400000	28.300000	77.300000	37.600000
max	50134.000000	1.209302e+06	61745.301879	12823.000000	2.716939e+06	994.000000	0.048193	41.800000	37.800000	94.000000	85.000000

	income	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	obesity	inactivity	trump	biden
count	1007.000000	1.007000e+03	1007.000000	1007.000000	1.007000e+03	1007.000000	1007.000000	1007.000000	1007.000000	983.000000	983.000000
mean	60131.244290	4.693174e+04	23724.142668	502.025819	2.008166e+05	278.285998	0.012043	28.980735	22.745283	61.185453	36.840793
std	10901.921415	1.326955e+05	6079.872377	1543.128483	5.287700e+05	133.159375	0.006049	4.308677	4.360248	16.942304	16.875312
min	50154.000000	6.300000e+01	3274.314819	1.000000	1.680000e+02	11.000000	0.000527	11.800000	8.100000	4.000000	3.100000
25%	52596.000000	3.365500e+03	20339.539455	40.000000	1.461700e+04	182.500000	0.008142	26.800000	20.100000	50.700000	24.650000
50%	56270.000000	1.146300e+04	23930.162156	127.000000	4.890400e+04	270.000000	0.011547	29.400000	22.900000	63.100000	35.000000
75%	62830.500000	4.015200e+04	26817.374363	414.000000	1.692480e+05	351.000000	0.014857	32.100000	25.700000	73.400000	46.950000
max	125635.000000	2.751220e+06	113017.751479	31736.000000	1.003911e+07	1212.000000	0.081481	39.700000	37.800000	96.200000	94.000000

In [27]:

view_partitions(merged, 'trump',
                cols=['trump','income','# total covid cases', '# covid cases per 100k', '# covid deaths',
                      'population', '# covid deaths per 100k', 'case_fatality_rate',
                      'obesity', 'inactivity',])

	trump	income	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	obesity	inactivity
count	995.000000	995.000000	9.950000e+02	995.000000	995.000000	9.950000e+02	995.000000	995.000000	995.000000	995.000000
mean	47.241709	50299.426131	5.845852e+04	23489.186345	646.194975	2.454354e+05	305.234171	0.013401	29.725327	23.284322
std	11.340094	15572.223128	1.435519e+05	5632.102950	1677.245394	5.579268e+05	157.485499	0.007672	5.660119	5.411057
min	4.000000	21658.000000	1.840000e+02	7433.739051	2.000000	7.680000e+02	11.000000	0.001057	11.800000	8.100000
25%	40.250000	39014.500000	5.238500e+03	20108.638558	69.500000	2.243150e+04	190.500000	0.008494	26.200000	19.700000
50%	50.100000	48413.000000	1.591700e+04	23626.943005	169.000000	6.945000e+04	281.000000	0.012284	29.700000	23.000000
75%	56.700000	57390.000000	5.389150e+04	26781.198391	581.500000	2.343480e+05	389.000000	0.016087	33.200000	26.500000
max	61.000000	125635.000000	2.751220e+06	60447.419089	31736.000000	1.003911e+07	1064.000000	0.066603	47.600000	40.800000

	trump	income	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	obesity	inactivity
count	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000
mean	68.313831	46840.147264	12720.003980	24710.531247	174.843781	50517.506468	377.168159	0.016190	31.632537	26.274229
std	3.905655	9379.280941	17651.768209	5359.878850	224.233632	66151.583773	144.713271	0.008867	3.556402	4.149821
min	61.000000	25768.000000	85.000000	3274.314819	1.000000	403.000000	14.000000	0.000527	19.500000	14.100000
25%	65.200000	40422.000000	3191.000000	21552.770707	49.000000	13260.000000	279.000000	0.011377	29.400000	23.300000
50%	68.600000	45719.000000	7100.000000	25026.614998	106.000000	28529.000000	371.000000	0.014282	31.700000	26.000000
75%	71.700000	51839.000000	14953.000000	28180.451997	216.000000	60353.000000	456.000000	0.018604	34.000000	29.100000
max	74.500000	97936.000000	202690.000000	55122.917010	2964.000000	636234.000000	1229.000000	0.085749	42.300000	39.900000

	trump	income	# total covid cases	# covid cases per 100k	# covid deaths	population	# covid deaths per 100k	case_fatality_rate	obesity	inactivity
count	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000	1005.000000
mean	80.786965	43878.374129	5059.204975	24627.390904	85.427861	20120.834826	422.151244	0.017918	31.537114	28.320597
std	4.289503	9102.326701	6019.201974	6383.615594	103.655065	23113.800617	162.744774	0.008465	3.592986	4.534053
min	74.500000	23047.000000	32.000000	4916.885879	1.000000	168.000000	27.000000	0.001300	19.400000	13.600000
25%	77.500000	37417.000000	1364.000000	20633.467594	22.000000	5925.000000	327.000000	0.012937	28.900000	25.000000
50%	80.100000	42310.000000	3257.000000	24538.953707	54.000000	13287.000000	422.000000	0.016549	31.500000	27.900000
75%	83.600000	49184.000000	6730.000000	28499.655884	113.000000	25924.000000	525.000000	0.021352	33.900000	31.900000
max	96.200000	86354.000000	57505.000000	113017.751479	1435.000000	223233.000000	1212.000000	0.092937	43.200000	41.400000

We split the data into equal lower, middle, and upper quantiles based on first obeity and then inactivity. We can see that the the average death rates of counties in these partitions is positivly correlated with both of these features. This was expected as preexisting health conditions (obescity) and heath risks (inactivity) increase all cause mortality but also have a strong effect on how serious a covid infection can be. Finally we see that income has an even stronger relationship with the death rate, though here the correlation is a negative one. Obesity and inactivity are both negatively correlated with income as well. The relationship between voting for Trump and income is not a string one strong, though there is a positive correlation between Trump voting and obesity, inactivity, and covid death rate.

6. Data Weakness¶

We can tell from comparing the populations between the groups that this data is not treated granularly as would be ideal. Very small population counties that get weighted the same as very large population counties in regards to the mean. So rural areas get over represented in the averages nationwide. This also explains why the Trump vs Biden is so far skewed from the actual well known national average based on popular votes.

Also, the difference between income correlates more with population density than it might with an individual socio economic status. First, a higher income might not go as far towards standard of living in the city as it does in rural areas. Second, by using the average income over the whole county, income inequality in that county is not factored in. There could be many low income individuals living with many high income individuals in the same county.

7. Future Work¶

We have done data gathing, parsing, and exploring data. We can continue to predict some behavior of the data (e.g. how a particular county will respond to COVID on a weekly basis).

Alternatively, we could be interested in inference, whereby we are more concerned with trying to understand why and how a system behaves the way it does. We might wish to understand which factors most correlate and cause a certain event to happen. This could give us insights into where certain inequalities persist.