from google.colab import drive
drive.mount("/content/drive")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import re
import requests
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup
import pickle
from typing import Optional
We are interested in the degree to which the SARS-CoV-2 virus has affected United States citizens (SARS-CoV-2 is the virus that causes the COVID-19 disease). The Centers for Disease Control and Prevention (CDC) provides relevant data from USAFacts.org that includes the number of confirmed COVID-19 cases on a per-county basis. At the bottom of the web page (https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/), a blue table provides a list of states, each of which has its own web page displaying the reported numbers of cases and deaths.
We automatically downloaded each state's data with Requests and then manipulated it with BeautifulSoup. Specifically, we first fetched the web page located at base_url
and save the request's returned object (a respond object) to home_page
. We then used the BeautifulSoup object to parse the home page as an HTML document in order to extract the link for every state. With these extracted URLs, we populated a state_urls
dictionary by setting each key to be the state name and the value to be the full URL. To avoid download state web pages multiple times frequently, we iterated through the state_urls
, make a web request for each URL, and save the contents out to a file on the hard drive.
# Every state's url begins with this prefix
base_url = 'https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/'
# Datasets will be saved to this directory
state_dir = "./drive/MyDrive/state_data/"
# Fetch web page and save the response object
home_page = requests.get(base_url)
print(home_page.status_code)
print(home_page.text[:10000])
200 <!doctype html><html lang="en"><head><title data-react-helmet="true">US COVID-19 cases and deaths by state | USAFacts</title><meta data-react-helmet="true" http-equiv="X-UA-Compatible" content="IE=edge" /><meta data-react-helmet="true" charSet="utf-8" /><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1" /><meta data-react-helmet="true" name="msapplication-TileColor" content="#E3007D" /><meta data-react-helmet="true" name="theme-color" content="#E3007D" /><meta data-react-helmet="true" name="keywords" content="coronavirus, covid-19, covid-cases, Cases and Deaths" /><meta data-react-helmet="true" name="twitter:site" content="@usafacts" /><meta data-react-helmet="true" name="twitter:title" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" name="twitter:description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" name="twitter:creator" content="@usafacts" /><meta data-react-helmet="true" name="twitter:image" content="https://staticweb.usafacts.org/media/images/Covid_Report_h5H7hpb.original.jpg" /><meta data-react-helmet="true" property="fb:admins" content="Facebook numberic ID" /><meta data-react-helmet="true" property="og:site_name" content="USAFacts" /><meta data-react-helmet="true" property="og:image:width" content="1600" /><meta data-react-helmet="true" property="og:image:height" content="900" /><meta data-react-helmet="true" property="og:locale" content="en" /><meta data-react-helmet="true" property="twitter:site" content="@usafacts" /><meta data-react-helmet="true" name="description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" name="robots" content="max-snippet:-1" /><meta data-react-helmet="true" name="twitter:card" content="summary_large_image" /><meta data-react-helmet="true" property="og:title" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" property="og:description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" property="og:type" content="article" /><meta data-react-helmet="true" property="og:url" content="/visualizations/coronavirus-covid-19-spread-map/" /><meta data-react-helmet="true" property="og:image" content="https://staticweb.usafacts.org/media/images/Covid_Report_h5H7hpb.original.jpg" /><meta data-react-helmet="true" property="og:image:alt" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" property="twitter:title" content="US COVID-19 cases and deaths by state" /><meta data-react-helmet="true" property="twitter:description" content="How is the nation faring against COVID-19, both overall and at a state level? Get the answers here, with data on hotspots and infection rates. This map tracks the history of coronavirus cases in America, both recent and all-time." /><meta data-react-helmet="true" property="twitter:image" content="https://staticweb.usafacts.org/media/images/Covid_Report_h5H7hpb.original.jpg" /><meta data-react-helmet="true" property="article:section" content="coronavirus" /><meta data-react-helmet="true" property="article:tag" content="coronavirus" /><meta data-react-helmet="true" property="article:published_time" content="2020-01-22T09:00:00-07:00" /><meta data-react-helmet="true" property="article:modified_time" content="2022-05-13T09:00:00-07:00" /><meta data-react-helmet="true" name="twitter:label0" content="How many COVID-19 cases are there in Alabama state?" /><meta data-react-helmet="true" name="twitter:data0" content="1304710" /><meta data-react-helmet="true" name="twitter:label1" content="How many COVID-19 cases are there in Alaska state?" /><meta data-react-helmet="true" name="twitter:data1" content="239856" /><meta data-react-helmet="true" name="twitter:label2" content="How many COVID-19 cases are there in Arizona state?" /><meta data-react-helmet="true" name="twitter:data2" content="2030926" /><link rel="preload" href="/frontend-issues/static/js/react-7dff463b.40bdbd99.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/react-e5bca7e4.2d1be3f8.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-utils-static-c6a584f7.ae4b8d8b.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-utils-static-130abb37.ebbb4cbc.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-utils-static-fbb8199d.78aaeced.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-graphs.36f60ce4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-85c5278b.6f7fd7ea.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-dbafc870.2d7ab14d.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-375540fe.3a60ebc4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-99c91fca.102e336b.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-1b90ef66.ad49e4d3.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-92770bb2.f5b2d191.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-1be980da.9c4722f9.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-83875ef7.98237c04.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-58265e03.bb85cfae.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/usafacts-2878d4f2.f3d842bf.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6136.5c170882.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/2417.49a3285a.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/7889.e23ca963.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/9556.74319ef6.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/829.9bfff25a.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1904.5b3b450b.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1385.15511551.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/826.0bdc9dac.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4468.928b4475.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/7918.8172dc28.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5261.e72e3e01.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5124.08339503.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6498.bff30bf4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4571.66b942e1.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4013.11a0dc61.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/8544.8c5bd262.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5207.856b72f4.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/4657.2525f1b2.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1999.30b3cce7.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/1426.970435fe.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6235.582ec870.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6844.1c4e2776.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6125.ff7e6068.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/7364.bff8689f.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/3056.9b8755ae.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/8216.c0657a8f.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/5513.65bf5c29.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/6732.d4d15d2c.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/9774.2c5aaa26.js" as="script" type="application/javascript" /><link rel="preload" href="/frontend-issues/static/js/9348.5789882c.js"
# Create a BeautifulSoup Object
bs_page = BeautifulSoup(home_page.content, "html.parser")
# Populate a dictionary with state name as keys and URL as values
state_urls = {}
for link in bs_page.findAll('a', attrs={'href': re.compile("/visualizations/coronavirus-covid-19-spread-map/state/")}):
state_urls[link.text] = "https://usafacts.org" + link.get('href')
# A sanity check
if len(state_urls.keys()) != 51 or \
state_urls["District of Columbia"] != "https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/state/district-of-columbia":
print("Incorrect")
else:
print("Correct")
Correct
# Save each webpage to our local hard drive
for state, url in state_urls.items():
state_webpage = requests.get(url).content
f = open(state_dir + state, 'wb')
f.write(state_webpage)
f.close()
# Leave a 2 second pause between requests
sleep(2)
We created a data object covid_data
including 4 pieces of information 1) 7-day average case (integer) 2) 7-day average deaths (integer) 3) # of confirmed cases (total) (float if # of confirmed cases per 100k) 4) # of deaths (integer).
Since the population data per county has been removed from usafacts.org, we collected those population data from other sources. We ignored counties with missing population data or populations of 0.
with open('./drive/MyDrive/population.pkl', 'rb') as f:
population_dict = pickle.load(f)
def get_pop(state: str, county: str) -> Optional[int]:
try:
return population_dict.get(state).get(county)
except AttributeError:
print('incorrect state name!')
return None
def load_covid_data(state_info):
covid_data = {}
for (state, state_path) in state_info:
covid_data[state] = []
with open(state_path, 'r') as f:
soup = BeautifulSoup(f.read(), 'html.parser')
counties = soup.find_all('a', href=re.compile('county/'))
for c in counties:
row = c.find_parent('tr')
cols = [col.text.replace(',','') for col in row.find_all('td')]
county_name = c.text
pop = get_pop(state, county_name)
if ((get_pop(state, county_name)) is None) or (pop == 0):
continue
covid_data[state].append({'county_name': county_name,
'population': pop,
'7_day_avg_cases': float(cols[0]),
'7_date_ave_deaths': float(cols[1]),
'cases': int(cols[2]),
'deaths': int(cols[3])})
return covid_data
state_info = [(state, state_dir + state) for state in state_urls.keys()]
covid_data = load_covid_data(state_info)
We first observed the single-most extreme counties and states, then inspected all states, after having sorted the data by some features.
We computed
def calculate_county_stats(covid_data):
min_county_count = 999999
min_county_name = ""
max_county_count = -1
max_county_name = ""
# looks through every county in every state, while checking if we have a new low or high
for state in covid_data.keys():
for county in covid_data[state]:
pop = county['population']
if (pop is None) or (pop == 0):
continue
covid_rate = round(county['cases'] / (pop/100000),2)
if covid_rate < min_county_count:
min_county_count = covid_rate
min_county_name = county['county_name'] + " (" + state + ")"
if covid_rate > max_county_count:
max_county_count = covid_rate
max_county_name = county['county_name'] + " (" + state + ")"
print(min_county_name + " has the lowest COVID cases per 100k: " + str(float(min_county_count)))
print(max_county_name + " has the highest COVID cases per 100k: " + str(float(max_county_count)))
calculate_county_stats(covid_data)
Hoonah-Angoon Census Area (Alaska) has the lowest COVID cases per 100k: 0.0 Loving County (Texas) has the highest COVID cases per 100k: 115976.33
We calculated
def calculate_state_deaths(covid_data):
min_state_deaths = 999999
min_state_name = ""
max_state_deaths = -1
max_state_name = ""
for state in covid_data.keys():
cur_state_count = 0
for county in covid_data[state]:
cur_state_count += county['deaths']
if cur_state_count < min_state_deaths:
min_state_deaths = cur_state_count
min_state_name = state
if cur_state_count > max_state_deaths:
max_state_deaths = cur_state_count
max_state_name = state
print(min_state_name + " has the fewest COVID deaths: " + str(min_state_deaths))
print(max_state_name + " has the most COVID deaths: " + str(max_state_deaths))
calculate_state_deaths(covid_data)
Vermont has the fewest COVID deaths: 646 California has the most COVID deaths: 89931
We calculated
def calculate_state_deathrate(covid_data):
min_state_death_rate = -1
min_state_name = ""
max_state_death_rate = 9999999
max_state_name = ""
for state in covid_data.keys():
cur_state_deaths = 0
cur_state_population = 0
for county in covid_data[state]:
pop = county['population']
if (county['cases'] > 0) and (pop is not None):
cur_state_population += pop
cur_state_deaths += county['deaths']
cur_state_deathrate = float(cur_state_population) / cur_state_deaths
if cur_state_deathrate > min_state_death_rate:
min_state_death_rate = cur_state_deathrate
min_state_name = state
if cur_state_deathrate < max_state_death_rate:
max_state_death_rate = cur_state_deathrate
max_state_name = state
print(min_state_name + " has the lowest COVID death rate; 1 out of every " + str(round(min_state_death_rate)) + " people has died")
print(max_state_name + " has the highest COVID death rate; 1 out of every " + str(round(max_state_death_rate)) + " people has died")
calculate_state_deathrate(covid_data)
Hawaii has the lowest COVID death rate; 1 out of every 996 people has died Mississippi has the highest COVID death rate; 1 out of every 239 people has died
Complicated analysis requires a better data structure like pandas dataframe. We now convert the previous dictionary of lists of dictionaries to a pandas dataframe. Each row corresponds to a unique county. Five columns are county, state, # total covid cases (integer), # covid case per 100k (float), and # covid deaths (integer).
def convert_to_pandas(covid_data):
covid_data_flipped = []
for state, counties in covid_data.items():
for county in counties:
pop = county['population']
if (pop is None) or (pop == 0):
continue
cases = county['cases']
cur_dict = {"county":county['county_name'], "state":state,
"# total covid cases": cases,
"# covid cases per 100k": cases/(pop/100000),
"# covid deaths": county['deaths']}
covid_data_flipped.append(cur_dict)
covid_df = pd.json_normalize(covid_data_flipped)
return covid_df
covid_df = convert_to_pandas(covid_data)
print(covid_df.shape)
covid_df.head()
(3118, 5)
county | state | # total covid cases | # covid cases per 100k | # covid deaths | |
---|---|---|---|---|---|
0 | Autauga County | Alabama | 15863 | 28393.205534 | 216 |
1 | Baldwin County | Alabama | 55862 | 25023.965883 | 681 |
2 | Barbour County | Alabama | 5681 | 23013.043831 | 98 |
3 | Bibb County | Alabama | 6457 | 28833.616147 | 105 |
4 | Blount County | Alabama | 15005 | 25948.535261 | 243 |
covid_df.to_csv('./combined_data/covid_df.csv', index=False)
We can use this dataframe to compute same quantities as done above more easily
the single county (and the state to which it belongs) that has the lowest rate of COVID cases per 100k people
the single county (and the state to which it belongs) that has the highest rate of COVID cases per 100k people
the state that has the lowest number of deaths
the state that has the highest number of deaths
The state that has the lowest rate of deaths based on its entire population
The state that has the highest rate of deaths based on its entire population
def calculate_county_stats2(covid_df):
sorted_df = covid_df.sort_values(by=['# covid cases per 100k'])
lowest = sorted_df.iloc[0]
highest = sorted_df.iloc[-1]
print(f"{lowest['county']} ({lowest['state']}) has the lowest rate of confirmed COVID cases per 100k: {lowest['# covid cases per 100k']:,.2f}")
print(f"{highest['county']} ({highest['state']}) has the highest rate of confirmed COVID cases per 100k: {highest['# covid cases per 100k']:,.2f}")
calculate_county_stats2(covid_df)
Kalawao County (Hawaii) has the lowest rate of confirmed COVID cases per 100k: 0.00 Loving County (Texas) has the highest rate of confirmed COVID cases per 100k: 113,017.75
def calculate_state_deaths2(covid_df):
state_deaths = covid_df.groupby('state').sum().sort_values(by=['# covid deaths'])
lowest = state_deaths.iloc[0]
highest = state_deaths.iloc[-1]
print(lowest.name + " has the fewest COVID deaths: " + str(lowest['# covid deaths']))
print(highest.name + " has the most COVID deaths: " + str(highest['# covid deaths']))
calculate_state_deaths2(covid_df)
Vermont has the fewest COVID deaths: 640.0 California has the most COVID deaths: 89667.0
def calculate_state_deathrate2(covid_df):
covid_df2 = covid_df
covid_df2['population'] = 100000*covid_df2['# total covid cases'] / covid_df2['# covid cases per 100k']
covid_df2 = covid_df2.groupby('state').sum()
covid_df2['death_rate'] = covid_df2['population'] / covid_df2['# covid deaths']
covid_df2 = covid_df2.sort_values(by=['death_rate'])
print(covid_df2.iloc[-1].name + " has the lowest COVID death rate; 1 out of every " + str(int(covid_df2.iloc[-1]['death_rate'])) + " people has died")
print(covid_df2.iloc[0].name + " has the highest COVID death rate; 1 out of every " + str(int(covid_df2.iloc[0]['death_rate'])) + " people has died")
calculate_state_deathrate2(covid_df)
Hawaii has the lowest COVID death rate; 1 out of every 995 people has died Mississippi has the highest COVID death rate; 1 out of every 238 people has died
Furthermore, considering that the data is messy and some are not reliable, we attempted to understand some of the uncertainty around COVID data. We consider that false negatives of deaths of COVID-19 is minimal. Every disease has a mortality rate and we can consider it's constant throughout all people in the US. Although some are at highe risk (e.g. older folks, people with pre-existing conditions, etc), we can imagine that this variance in the population to be fairly uniform throughout the USA. Therefore, if all counties were equal in their testing, we are supposed to see a consistent ratio between # people who died from COVID and # of people who tested positive for COVID, which is called 'case fatality rate'.
def add_death_stats(covid_df):
# can add an infintesimal or fillna after the fact to handle nans from divide by 0.
covid_df['population'] = 100000*covid_df['# total covid cases'] / (covid_df['# covid cases per 100k']+0.0001)
# covid_df.fillna(0, inplace=True)
covid_df["population"] = covid_df["population"].astype('int32')
covid_df['# covid deaths per 100k'] = 100000*covid_df['# covid deaths'] / (covid_df['population']+0.0001)
# covid_df.fillna(0, inplace=True)
covid_df["# covid deaths per 100k"] = covid_df["# covid deaths per 100k"].astype('int32')
covid_df['case_fatality_rate'] = covid_df['# covid deaths'] / (covid_df['# total covid cases']+0.0001)
# covid_df.fillna(0, inplace=True)
covid_df = covid_df.sort_values(by=['case_fatality_rate'])
return covid_df
covid_updated = add_death_stats(covid_df)
covid_updated
county | state | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | |
---|---|---|---|---|---|---|---|---|
1702 | Loup County | Nebraska | 87 | 13102.409639 | 0 | 663 | 0 | 0.000000 |
1696 | Keya Paha County | Nebraska | 118 | 14640.198511 | 0 | 805 | 0 | 0.000000 |
544 | Kalawao County | Hawaii | 0 | 0.000000 | 0 | 0 | 0 | 0.000000 |
87 | Skagway Municipality | Alaska | 30 | 2535.925613 | 0 | 1182 | 0 | 0.000000 |
269 | Jackson County | Colorado | 166 | 11925.287356 | 0 | 1391 | 0 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2715 | Sabine County | Texas | 1254 | 11895.276039 | 89 | 10541 | 844 | 0.070973 |
427 | Dodge County | Georgia | 2110 | 10240.232953 | 154 | 20604 | 747 | 0.072986 |
1752 | Storey County | Nevada | 133 | 3225.806452 | 11 | 4122 | 266 | 0.082707 |
538 | Wilcox County | Georgia | 828 | 9588.882455 | 71 | 8634 | 822 | 0.085749 |
444 | Glascock County | Georgia | 269 | 9054.190508 | 25 | 2970 | 841 | 0.092937 |
3118 rows × 8 columns
covid_updated.to_csv('./combined_data/covid_updated.csv', index = False)
From all those analyses above, we learned that states vary wildly in their death rate (e.g., The number of deaths in New Jersey or California is orders of magnitude higher than those in Hawaii or Alaska) and COVID testing. States also fluctuate a lot amongst their counties, as some counties with very bad statistics are within states with good statistics.
When it comes to data reliability, some states and counties are probably more proactive when it comes to testing, so they could have higher cases counts. Other counties might have a similar number of cases or higher, but they are just not being represented in the data due to lower testing. Deaths are thus harder to overlook, so states with lax testing policies may have inflated deaths per case metrics. Perhaps we could supplement the data with some measure of testing rates in the county or state.
We are also interested in how COVID has impacted our world. We can better understand this by looking at how it relates to demographics, income, education, health, and politicala voting.
Our case_fatality_rate
column can be viewed as an approximation of how effective and thorough COVID testing is for a given county. Our # covid deaths
column can be viewed as an extreme indication of how severe COVID has impacted a given county. Our # covid cases per 100k
column be viewed as middle-ground between the two aforementioned features. That is, it measures the impact of the disease and is influenced by the thoroughness of COVID testing.
Using these three informative features, we can inspect how impacted each county is, while correlating this with other features of each county, such as income-level, health metrics, demographics, etc.
In this project, we merged our COVID case data with 'election2020_by_county.csv' dataset. We only care about 15 columns which are hispanic, minority, female, unemployed, income, nodegree, bachelor, inactivity, obesity, desity, cancer, voter_turnout, voter_gap, trump, biden. We droppde fipscode and population columns.
A data description is as follows:
def merge_data(covid_updated, filepath):
data2020 = pd.read_csv(filepath).drop(columns=['fipscode', 'population'])
return pd.merge(covid_updated, data2020, on=['state', 'county'])
merged = merge_data(covid_updated, './drive/MyDrive/election2020_by_county.csv')
print(merged.shape)
merged.head()
(3044, 23)
county | state | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Loup County | Nebraska | 87 | 13102.409639 | 0 | 663 | 0 | 0.0 | 0.0 | 0.4 | ... | 6.2 | 14.4 | 33.0 | 30.7 | 1.3 | NaN | -4.849885 | 65.0 | 81.5 | 16.5 |
1 | Keya Paha County | Nebraska | 118 | 14640.198511 | 0 | 805 | 0 | 0.0 | 1.1 | 2.2 | ... | 8.0 | 15.8 | 30.2 | 29.6 | 7.9 | 256.3 | 8.000000 | 80.7 | 90.0 | 9.3 |
2 | Jackson County | Colorado | 166 | 11925.287356 | 0 | 1391 | 0 | 0.0 | 24.4 | 25.1 | ... | 15.5 | 17.5 | 20.5 | 21.4 | 0.9 | 207.1 | 12.874251 | 58.1 | 77.9 | 19.8 |
3 | Daggett County | Utah | 37 | 3894.736842 | 0 | 949 | 0 | 0.0 | 3.3 | 5.1 | ... | 12.4 | 19.3 | 20.3 | 26.1 | 14.5 | 185.3 | -9.461967 | 62.4 | 80.2 | 17.8 |
4 | Hinsdale County | Colorado | 131 | 15975.609756 | 0 | 819 | 0 | 0.0 | 6.0 | 9.1 | ... | 5.1 | 41.3 | 14.8 | 20.5 | 0.8 | NaN | 5.952381 | 15.6 | 55.9 | 40.3 |
5 rows × 23 columns
merged.to_csv('./combined_data/merged.csv', index = False)
Due to mismatching happened during merging, we have lost some rows.
print(len(covid_updated) - len(merged))
74
missing_counties = set()
merged_counties = set()
for index, row in merged.iterrows():
merged_counties.add(row['county'].lower() + "_" + row['state'].lower())
missing_idxs = []
for index, row in covid_updated.iterrows():
cur_county = row['county'].lower() + "_" + row['state'].lower()
if cur_county not in merged_counties:
# print("missing",cur_county)
missing_idxs.append(index)
missing_counties.add(cur_county)
covid_updated.loc[missing_idxs]
county | state | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | |
---|---|---|---|---|---|---|---|---|
87 | Skagway Municipality | Alaska | 30 | 2535.925613 | 0 | 1182 | 0 | 0.000000 |
90 | Wrangell City and Borough | Alaska | 71 | 2837.729816 | 0 | 2501 | 0 | 0.000000 |
75 | Hoonah-Angoon Census Area | Alaska | 0 | 0.000000 | 0 | 0 | 0 | 0.000000 |
544 | Kalawao County | Hawaii | 0 | 0.000000 | 0 | 0 | 0 | 0.000000 |
68 | Aleutians West Census Area | Alaska | 863 | 15317.713880 | 2 | 5633 | 35 | 0.002317 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1127 | East Feliciana Parish | Louisiana | 7320 | 38254.507447 | 170 | 19134 | 888 | 0.023224 |
1122 | Claiborne Parish | Louisiana | 3280 | 20931.716656 | 77 | 15669 | 491 | 0.023476 |
1129 | Franklin Parish | Louisiana | 7347 | 36707.469398 | 178 | 20014 | 889 | 0.024228 |
1148 | Red River Parish | Louisiana | 2110 | 24994.077233 | 53 | 8441 | 627 | 0.025118 |
1115 | Bienville Parish | Louisiana | 3990 | 30133.675704 | 110 | 13240 | 830 | 0.027569 |
74 rows × 8 columns
# Remove rows with 0 deaths
merged = merged.loc[merged['# covid deaths'] != 0]
# Summary statistics
merged.describe()
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3.021000e+03 | 3021.000000 | 3021.000000 | 3.021000e+03 | 3021.000000 | 3021.000000 | 3021.000000 | 3021.000000 | 3021.000000 | 3021.000000 | ... | 3021.000000 | 3021.000000 | 3021.000000 | 3021.000000 | 3021.000000 | 2979.000000 | 2984.000000 | 2983.000000 | 2983.000000 | 2983.000000 |
mean | 2.546425e+04 | 24314.359415 | 302.067196 | 1.055975e+05 | 366.179742 | 0.015760 | 9.259484 | 22.694009 | 49.932903 | 5.495035 | ... | 14.970275 | 20.066137 | 25.910824 | 30.947104 | 250.792122 | 228.507486 | 35.459084 | 32.702749 | 65.487026 | 32.784278 |
std | 8.662408e+04 | 6015.982401 | 1006.517139 | 3.391059e+05 | 162.864254 | 0.008579 | 13.881480 | 19.920798 | 2.375417 | 1.969195 | ... | 6.740886 | 8.846873 | 5.161410 | 4.467600 | 1741.445511 | 56.125890 | 13.868524 | 31.279827 | 15.699166 | 15.590596 |
min | 3.200000e+01 | 3274.314819 | 1.000000 | 1.680000e+02 | 11.000000 | 0.000527 | 0.000000 | 0.000000 | 19.166215 | 1.800000 | ... | 1.900000 | 2.600000 | 8.100000 | 11.800000 | 0.100000 | 46.200000 | -168.323353 | -90.000000 | 4.000000 | 3.100000 |
25% | 2.560000e+03 | 20683.907225 | 41.000000 | 1.106500e+04 | 249.000000 | 0.010637 | 2.000000 | 6.900000 | 49.462074 | 4.100000 | ... | 9.900000 | 14.000000 | 22.600000 | 28.300000 | 17.000000 | 193.300000 | 27.675757 | 15.050000 | 56.700000 | 20.800000 |
50% | 6.401000e+03 | 24365.817878 | 96.000000 | 2.584300e+04 | 358.000000 | 0.014321 | 4.000000 | 15.200000 | 50.392410 | 5.300000 | ... | 13.500000 | 17.900000 | 25.800000 | 31.100000 | 45.300000 | 230.300000 | 35.055497 | 39.100000 | 68.700000 | 29.600000 |
75% | 1.705000e+04 | 27802.251591 | 229.000000 | 6.784200e+04 | 468.000000 | 0.018885 | 9.500000 | 33.900000 | 51.080252 | 6.500000 | ... | 19.200000 | 23.600000 | 29.400000 | 33.700000 | 112.900000 | 265.100000 | 42.473125 | 56.800000 | 77.500000 | 41.650000 |
max | 2.751220e+06 | 113017.751479 | 31736.000000 | 1.003911e+07 | 1229.000000 | 0.092937 | 99.200000 | 99.400000 | 58.100420 | 24.000000 | ... | 53.300000 | 75.100000 | 41.400000 | 47.600000 | 69468.400000 | 458.300000 | 100.000000 | 93.100000 | 96.200000 | 94.000000 |
8 rows × 21 columns
We can partition any quantitative feature by using quantiles. With arbitrarily chosen minv and maxv, we can partition certain feature of interest multiple times and observe interesting relationships.
# Given minv and maxv, this function returns a subset of the dataframe that has feature values between minv and maxv inclusive.
def partition_df(df, column_name, minv, maxv):
return df.loc[(merged[column_name] >= minv) & (merged[column_name] <= maxv)]
partition_df(merged, 'income', 21000, 31000).describe()
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | ... | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 | 117.000000 |
mean | 4634.606838 | 26336.198665 | 85.504274 | 16558.410256 | 520.239316 | 0.021598 | 8.474359 | 43.990598 | 49.624858 | 9.065812 | ... | 26.895726 | 11.892308 | 31.676923 | 36.230769 | 84.973504 | 244.556410 | 44.983493 | 13.635043 | 56.181197 | 42.546154 |
std | 5658.986674 | 7096.649003 | 87.830030 | 17899.159118 | 152.402917 | 0.010778 | 21.306839 | 32.372695 | 3.836346 | 2.228389 | ... | 5.736489 | 3.710822 | 4.358590 | 5.410948 | 176.212760 | 47.754773 | 10.879040 | 45.685134 | 22.877081 | 22.815828 |
min | 184.000000 | 7931.904161 | 3.000000 | 1536.000000 | 174.000000 | 0.004765 | 0.000000 | 0.400000 | 35.462777 | 4.400000 | ... | 8.000000 | 5.800000 | 19.700000 | 21.000000 | 1.400000 | 99.200000 | -10.411765 | -71.600000 | 13.500000 | 9.300000 |
25% | 2003.000000 | 22647.596516 | 43.000000 | 8110.000000 | 434.000000 | 0.015873 | 0.700000 | 5.800000 | 48.747893 | 7.700000 | ... | 23.400000 | 9.200000 | 29.000000 | 32.600000 | 17.500000 | 215.200000 | 39.175362 | -27.900000 | 35.100000 | 19.900000 |
50% | 2889.000000 | 27116.874368 | 56.000000 | 11839.000000 | 515.000000 | 0.018861 | 1.600000 | 49.700000 | 50.691193 | 8.700000 | ... | 26.900000 | 11.300000 | 32.000000 | 36.400000 | 36.800000 | 243.900000 | 44.536665 | 14.200000 | 54.900000 | 40.800000 |
75% | 5247.000000 | 31600.407747 | 96.000000 | 18067.000000 | 599.000000 | 0.022789 | 3.400000 | 73.100000 | 51.953125 | 10.100000 | ... | 29.800000 | 13.800000 | 35.000000 | 40.300000 | 74.100000 | 272.700000 | 51.138995 | 58.500000 | 78.900000 | 63.000000 |
max | 40588.000000 | 39821.693908 | 589.000000 | 130624.000000 | 1064.000000 | 0.067568 | 99.200000 | 99.400000 | 56.526573 | 17.600000 | ... | 53.300000 | 27.900000 | 41.300000 | 47.600000 | 1261.500000 | 380.000000 | 70.003709 | 80.400000 | 89.700000 | 85.100000 |
8 rows × 21 columns
def view_partitions(df, feature, n_partitions=3, cols=None):
if cols is None:
cols = df.columns
start = 0
for i in range(n_partitions):
stop = start + (1/n_partitions)
display(partition_df(merged, feature,
merged[feature].quantile(start),
merged[feature].quantile(stop))[cols].describe())
start = stop
view_partitions(merged, 'obesity')
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.033000e+03 | 1033.000000 | 1033.000000 | 1.033000e+03 | 1033.000000 | 1033.000000 | 1033.000000 | 1033.000000 | 1033.000000 | 1033.000000 | ... | 1033.000000 | 1033.000000 | 1033.000000 | 1033.000000 | 1033.000000 | 1013.000000 | 1014.000000 | 1014.000000 | 1014.000000 | 1014.000000 |
mean | 4.635578e+04 | 22771.754766 | 493.433688 | 1.961860e+05 | 313.152953 | 0.014420 | 14.784705 | 24.974831 | 49.784908 | 5.065247 | ... | 13.033495 | 25.626718 | 22.030591 | 26.236883 | 365.470474 | 213.825469 | 32.293654 | 24.426331 | 61.280375 | 36.854043 |
std | 1.397306e+05 | 6629.900988 | 1602.636109 | 5.424150e+05 | 172.554693 | 0.009160 | 17.944395 | 19.940681 | 2.537714 | 1.874395 | ... | 6.728381 | 10.933433 | 4.181953 | 2.969353 | 1641.553548 | 61.119227 | 17.665726 | 36.403793 | 18.306117 | 18.108292 |
min | 3.500000e+01 | 3274.314819 | 1.000000 | 1.680000e+02 | 11.000000 | 0.001057 | 0.000000 | 0.000000 | 31.955024 | 1.800000 | ... | 1.900000 | 2.600000 | 8.100000 | 11.800000 | 0.100000 | 46.200000 | -21.417069 | -90.000000 | 4.000000 | 3.100000 |
25% | 2.310000e+03 | 18617.709313 | 33.000000 | 1.072400e+04 | 180.000000 | 0.008413 | 3.100000 | 9.800000 | 49.321586 | 3.800000 | ... | 8.300000 | 17.300000 | 19.200000 | 25.000000 | 11.000000 | 169.700000 | 21.126119 | -2.200000 | 47.850000 | 21.900000 |
50% | 7.027000e+03 | 22735.283123 | 93.000000 | 3.256400e+04 | 285.000000 | 0.012746 | 7.400000 | 18.600000 | 50.331365 | 4.800000 | ... | 11.300000 | 23.300000 | 22.400000 | 27.100000 | 45.800000 | 212.600000 | 29.870988 | 27.500000 | 62.800000 | 35.200000 |
75% | 3.332500e+04 | 26132.178794 | 335.000000 | 1.529390e+05 | 414.000000 | 0.017883 | 19.200000 | 35.200000 | 51.018405 | 5.900000 | ... | 16.300000 | 32.200000 | 25.000000 | 28.400000 | 162.300000 | 251.600000 | 40.219742 | 54.300000 | 76.275000 | 49.975000 |
max | 2.751220e+06 | 113017.751479 | 31736.000000 | 1.003911e+07 | 1229.000000 | 0.092937 | 95.300000 | 97.400000 | 58.100420 | 24.000000 | ... | 46.800000 | 75.100000 | 38.800000 | 29.400000 | 32903.300000 | 445.400000 | 99.552895 | 93.100000 | 96.200000 | 94.000000 |
8 rows × 21 columns
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1015.000000 | 1015.000000 | 1015.000000 | 1.015000e+03 | 1015.000000 | 1015.000000 | 1015.000000 | 1015.000000 | 1015.000000 | 1015.000000 | ... | 1015.000000 | 1015.000000 | 1015.000000 | 1015.000000 | 1015.000000 | 1002.000000 | 1004.000000 | 1003.000000 | 1003.000000 | 1003.000000 |
mean | 17413.966502 | 24372.422566 | 226.448276 | 7.111427e+04 | 373.295567 | 0.016179 | 7.553990 | 18.795961 | 49.890443 | 5.326305 | ... | 14.513399 | 18.695074 | 26.218128 | 31.132217 | 205.667586 | 234.567764 | 34.878074 | 38.696510 | 68.491426 | 29.794915 |
std | 35239.630504 | 5506.207523 | 453.917323 | 1.464475e+05 | 157.158042 | 0.008917 | 11.862685 | 17.305463 | 2.238655 | 1.872417 | ... | 6.318933 | 6.224281 | 3.983797 | 1.018496 | 2212.807074 | 54.672491 | 12.708142 | 24.848158 | 12.469813 | 12.388141 |
min | 32.000000 | 6911.447084 | 1.000000 | 4.620000e+02 | 27.000000 | 0.000809 | 0.000000 | 1.500000 | 19.166215 | 1.800000 | ... | 3.800000 | 4.400000 | 15.800000 | 29.400000 | 0.100000 | 82.500000 | -168.323353 | -65.600000 | 16.900000 | 5.600000 |
25% | 2331.500000 | 21097.270656 | 37.000000 | 1.039850e+04 | 262.000000 | 0.010903 | 2.000000 | 6.000000 | 49.403040 | 4.100000 | ... | 9.700000 | 14.300000 | 23.300000 | 30.200000 | 16.700000 | 199.825000 | 28.279638 | 23.300000 | 60.750000 | 20.300000 |
50% | 6279.000000 | 24563.536449 | 98.000000 | 2.640300e+04 | 359.000000 | 0.014236 | 3.500000 | 12.200000 | 50.314825 | 5.100000 | ... | 12.900000 | 17.600000 | 26.100000 | 31.100000 | 42.900000 | 236.200000 | 34.537246 | 42.100000 | 70.100000 | 28.000000 |
75% | 15557.000000 | 27653.785878 | 219.500000 | 6.302800e+04 | 468.000000 | 0.019337 | 7.300000 | 26.850000 | 50.970982 | 6.300000 | ... | 18.200000 | 21.700000 | 28.800000 | 32.000000 | 98.600000 | 269.675000 | 40.807564 | 57.600000 | 78.050000 | 37.500000 |
max | 402352.000000 | 61934.129379 | 7728.000000 | 1.584063e+06 | 1212.000000 | 0.085749 | 99.200000 | 99.400000 | 56.633907 | 21.800000 | ... | 53.300000 | 44.600000 | 41.400000 | 32.800000 | 69468.400000 | 425.400000 | 100.000000 | 86.000000 | 92.000000 | 82.500000 |
8 rows × 21 columns
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1036.000000 | 1036.000000 | 1036.000000 | 1.036000e+03 | 1036.000000 | 1036.000000 | 1036.000000 | 1036.000000 | 1036.000000 | 1036.000000 | ... | 1036.000000 | 1036.000000 | 1036.000000 | 1036.000000 | 1036.000000 | 1026.000000 | 1029.000000 | 1029.000000 | 1029.000000 | 1029.000000 |
mean | 11955.729730 | 25787.480672 | 178.888031 | 4.675872e+04 | 411.768340 | 0.016674 | 5.380309 | 23.938707 | 50.117482 | 6.053378 | ... | 17.302220 | 15.871815 | 29.476834 | 35.479826 | 174.186969 | 237.092982 | 39.012722 | 35.537026 | 66.960641 | 31.423615 |
std | 23933.119885 | 5361.167248 | 374.326997 | 9.641718e+04 | 141.464044 | 0.007432 | 8.434501 | 21.605540 | 2.295568 | 2.021584 | ... | 6.487572 | 4.850790 | 4.223773 | 2.422581 | 1164.339521 | 48.778309 | 8.717395 | 29.339460 | 14.662722 | 14.686429 |
min | 96.000000 | 9330.346798 | 1.000000 | 7.210000e+02 | 14.000000 | 0.000527 | 0.000000 | 0.400000 | 34.868478 | 1.800000 | ... | 3.100000 | 5.800000 | 15.600000 | 32.800000 | 0.100000 | 61.800000 | 12.181303 | -80.800000 | 8.800000 | 5.400000 |
25% | 3003.250000 | 22698.614312 | 49.000000 | 1.222000e+04 | 321.750000 | 0.012394 | 1.600000 | 6.000000 | 49.643073 | 4.800000 | ... | 12.275000 | 12.400000 | 26.500000 | 33.700000 | 22.950000 | 206.650000 | 33.091367 | 21.900000 | 60.000000 | 20.900000 |
50% | 6047.000000 | 26021.133583 | 98.000000 | 2.313750e+04 | 411.000000 | 0.015669 | 2.800000 | 15.300000 | 50.510986 | 5.900000 | ... | 16.400000 | 15.000000 | 29.500000 | 34.800000 | 46.350000 | 237.000000 | 39.270875 | 42.800000 | 70.300000 | 27.800000 |
75% | 11853.500000 | 28860.094198 | 180.000000 | 4.566200e+04 | 494.000000 | 0.019305 | 5.400000 | 37.925000 | 51.268131 | 7.200000 | ... | 22.200000 | 18.600000 | 32.400000 | 36.700000 | 102.325000 | 267.000000 | 44.732536 | 56.900000 | 77.500000 | 38.100000 |
max | 410889.000000 | 72727.272727 | 7953.000000 | 1.749342e+06 | 996.000000 | 0.066603 | 91.800000 | 93.400000 | 56.526573 | 16.900000 | ... | 40.500000 | 42.600000 | 41.300000 | 47.600000 | 35369.200000 | 458.300000 | 70.003709 | 87.900000 | 93.300000 | 89.600000 |
8 rows × 21 columns
view_partitions(merged, 'inactivity')
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.007000e+03 | 1007.000000 | 1007.000000 | 1.007000e+03 | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | ... | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 992.000000 | 982.000000 | 982.000000 | 982.000000 | 982.000000 |
mean | 4.741475e+04 | 23140.636087 | 483.670308 | 2.030346e+05 | 271.937438 | 0.012093 | 12.365045 | 22.890566 | 49.707254 | 4.982423 | ... | 11.348163 | 26.617577 | 20.354816 | 27.668818 | 461.912115 | 213.861895 | 31.293156 | 17.149898 | 57.540224 | 40.390326 |
std | 1.367664e+05 | 6443.122064 | 1512.796366 | 5.351729e+05 | 139.972161 | 0.006343 | 15.732390 | 18.824834 | 2.499203 | 1.795947 | ... | 5.740513 | 10.472965 | 2.836158 | 4.213967 | 2770.745246 | 61.473242 | 17.404163 | 32.866155 | 16.459077 | 16.417551 |
min | 6.300000e+01 | 3274.314819 | 1.000000 | 1.680000e+02 | 11.000000 | 0.001057 | 0.000000 | 0.000000 | 19.166215 | 1.900000 | ... | 1.900000 | 2.600000 | 8.100000 | 11.800000 | 0.100000 | 46.200000 | -21.417069 | -90.000000 | 4.000000 | 3.100000 |
25% | 3.455000e+03 | 19408.061608 | 38.000000 | 1.509950e+04 | 173.000000 | 0.007854 | 3.000000 | 8.600000 | 49.336073 | 3.800000 | ... | 7.700000 | 18.700000 | 18.700000 | 25.300000 | 12.250000 | 166.950000 | 21.203168 | -5.675000 | 46.025000 | 28.400000 |
50% | 1.100700e+04 | 22993.187599 | 120.000000 | 4.758000e+04 | 255.000000 | 0.011327 | 5.900000 | 16.200000 | 50.208257 | 4.700000 | ... | 10.100000 | 24.800000 | 21.200000 | 28.100000 | 46.900000 | 211.450000 | 28.190647 | 19.750000 | 59.000000 | 39.200000 |
75% | 3.888200e+04 | 26299.994779 | 379.000000 | 1.662710e+05 | 342.000000 | 0.014652 | 14.350000 | 31.900000 | 50.863073 | 5.800000 | ... | 13.100000 | 32.450000 | 22.600000 | 30.600000 | 177.150000 | 254.500000 | 37.751744 | 40.775000 | 69.275000 | 51.800000 |
max | 2.751220e+06 | 113017.751479 | 31736.000000 | 1.003911e+07 | 994.000000 | 0.081481 | 94.100000 | 94.800000 | 58.100420 | 24.000000 | ... | 43.900000 | 75.100000 | 23.700000 | 39.700000 | 69468.400000 | 445.400000 | 99.552895 | 93.100000 | 96.200000 | 94.000000 |
8 rows × 21 columns
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1028.000000 | 1028.000000 | 1028.000000 | 1.028000e+03 | 1028.000000 | 1028.000000 | 1028.000000 | 1028.000000 | 1028.000000 | 1028.000000 | ... | 1028.000000 | 1028.000000 | 1028.000000 | 1028.000000 | 1028.000000 | 1009.000000 | 1017.000000 | 1016.000000 | 1016.000000 | 1016.000000 |
mean | 19504.956226 | 23995.600128 | 270.469844 | 7.753021e+04 | 392.975681 | 0.017394 | 10.354864 | 22.804086 | 49.936301 | 5.255837 | ... | 14.964105 | 18.713132 | 25.842802 | 31.003794 | 183.769747 | 229.305055 | 34.853118 | 37.605906 | 67.993012 | 30.387106 |
std | 51624.583537 | 5760.949741 | 767.045525 | 1.961051e+05 | 165.631212 | 0.009503 | 15.641803 | 20.733552 | 2.292643 | 1.910234 | ... | 6.407062 | 5.868454 | 1.219701 | 3.033814 | 1134.003874 | 54.053794 | 12.759806 | 27.366819 | 13.721102 | 13.655311 |
min | 35.000000 | 6103.922159 | 1.000000 | 4.860000e+02 | 14.000000 | 0.000809 | 0.000000 | 0.200000 | 32.813627 | 1.800000 | ... | 3.100000 | 4.400000 | 23.800000 | 22.400000 | 0.100000 | 70.000000 | -168.323353 | -76.800000 | 11.200000 | 3.900000 |
25% | 2028.000000 | 20493.710413 | 34.000000 | 8.923250e+03 | 283.000000 | 0.011700 | 2.100000 | 6.400000 | 49.342871 | 4.100000 | ... | 10.500000 | 14.500000 | 24.900000 | 28.700000 | 13.350000 | 194.300000 | 28.566943 | 22.500000 | 60.175000 | 20.300000 |
50% | 5230.000000 | 24010.922870 | 87.000000 | 2.177150e+04 | 378.000000 | 0.015793 | 4.200000 | 14.500000 | 50.324591 | 5.000000 | ... | 13.400000 | 17.900000 | 25.800000 | 31.000000 | 40.100000 | 230.700000 | 34.537328 | 41.200000 | 69.800000 | 28.450000 |
75% | 14111.500000 | 27195.090081 | 202.250000 | 5.723600e+04 | 480.000000 | 0.020755 | 10.600000 | 35.025000 | 51.076514 | 6.200000 | ... | 18.100000 | 21.600000 | 26.800000 | 33.100000 | 107.425000 | 263.200000 | 40.992547 | 58.100000 | 78.200000 | 38.025000 |
max | 714053.000000 | 61934.129379 | 12823.000000 | 2.559902e+06 | 1229.000000 | 0.092937 | 99.200000 | 99.400000 | 56.633907 | 21.800000 | ... | 53.300000 | 46.300000 | 28.000000 | 42.000000 | 32903.300000 | 433.900000 | 100.000000 | 92.000000 | 95.900000 | 88.000000 |
8 rows × 21 columns
# total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | hispanic | minority | female | unemployed | ... | nodegree | bachelor | inactivity | obesity | density | cancer | voter_turnout | voter_gap | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1018.000000 | 1018.000000 | 1018.000000 | 1.018000e+03 | 1018.000000 | 1018.000000 | 1018.000000 | 1018.000000 | 1018.000000 | 1018.000000 | ... | 1018.000000 | 1018.000000 | 1018.000000 | 1018.000000 | 1018.000000 | 1010.000000 | 1017.000000 | 1017.000000 | 1017.000000 | 1017.000000 |
mean | 10061.785855 | 25805.500407 | 163.643418 | 3.817520e+04 | 433.435167 | 0.017819 | 4.975049 | 22.328193 | 50.173025 | 6.245678 | ... | 18.565914 | 14.874361 | 31.541159 | 34.180157 | 105.106189 | 242.664257 | 40.216906 | 43.029695 | 70.764503 | 27.734808 |
std | 27440.411056 | 5519.297680 | 481.803986 | 9.681484e+04 | 134.701725 | 0.008467 | 7.262961 | 20.043183 | 2.303834 | 1.958725 | ... | 5.977301 | 4.333156 | 2.735776 | 3.406723 | 212.876177 | 48.768550 | 8.482994 | 27.054607 | 13.510946 | 13.552863 |
min | 32.000000 | 6911.447084 | 1.000000 | 4.620000e+02 | 14.000000 | 0.000527 | 0.000000 | 0.400000 | 35.150327 | 1.800000 | ... | 3.900000 | 5.800000 | 28.000000 | 23.800000 | 0.300000 | 94.400000 | 1.730104 | -71.600000 | 13.500000 | 5.000000 |
25% | 2675.000000 | 22639.004381 | 48.000000 | 1.110300e+04 | 353.000000 | 0.013173 | 1.500000 | 5.800000 | 49.796348 | 5.000000 | ... | 14.200000 | 11.800000 | 29.300000 | 31.900000 | 24.525000 | 212.425000 | 34.466113 | 33.900000 | 65.900000 | 18.800000 |
50% | 5405.000000 | 26231.941999 | 92.000000 | 2.070000e+04 | 431.000000 | 0.016078 | 2.600000 | 15.000000 | 50.616007 | 6.100000 | ... | 18.300000 | 14.300000 | 31.000000 | 33.800000 | 46.450000 | 242.650000 | 40.067710 | 50.200000 | 74.200000 | 24.100000 |
75% | 10497.250000 | 29331.867235 | 169.000000 | 4.047700e+04 | 521.000000 | 0.020233 | 5.100000 | 34.900000 | 51.302674 | 7.400000 | ... | 22.600000 | 17.300000 | 33.375000 | 36.300000 | 93.525000 | 272.075000 | 45.901233 | 60.900000 | 79.800000 | 32.400000 |
max | 661165.000000 | 53241.683639 | 11854.000000 | 2.253857e+06 | 996.000000 | 0.072477 | 63.500000 | 91.200000 | 56.633907 | 16.900000 | ... | 40.500000 | 36.900000 | 41.400000 | 47.600000 | 2800.000000 | 458.300000 | 70.003709 | 88.300000 | 93.300000 | 85.100000 |
8 rows × 21 columns
view_partitions(merged, 'income',
cols=['income','# total covid cases', '# covid cases per 100k', '# covid deaths',
'population', '# covid deaths per 100k', 'case_fatality_rate',
'obesity', 'inactivity', 'trump', 'biden'])
income | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | obesity | inactivity | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 1.007000e+03 | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 1002.000000 | 1002.000000 |
mean | 35886.695134 | 8288.693148 | 25014.583343 | 142.903674 | 3.293902e+04 | 464.244290 | 0.020079 | 32.942304 | 29.309037 | 67.414072 | 31.157685 |
std | 3799.138075 | 20595.465992 | 6509.586741 | 360.681418 | 8.483606e+04 | 157.840313 | 0.010388 | 4.450842 | 4.676727 | 15.976406 | 15.928395 |
min | 21658.000000 | 35.000000 | 7186.858316 | 1.000000 | 4.030000e+02 | 48.000000 | 0.002333 | 17.400000 | 13.600000 | 11.200000 | 7.100000 |
25% | 33575.000000 | 2237.500000 | 20856.173791 | 45.000000 | 9.733000e+03 | 367.500000 | 0.014051 | 30.100000 | 25.900000 | 58.600000 | 18.925000 |
50% | 36566.000000 | 4471.000000 | 25428.263215 | 81.000000 | 1.778100e+04 | 458.000000 | 0.017502 | 32.900000 | 29.600000 | 72.200000 | 26.300000 |
75% | 38891.000000 | 8407.500000 | 29159.082843 | 143.500000 | 3.193600e+04 | 551.500000 | 0.022740 | 35.900000 | 32.800000 | 79.600000 | 39.975000 |
max | 41150.000000 | 402352.000000 | 56864.875543 | 7728.000000 | 1.584063e+06 | 1229.000000 | 0.092937 | 47.600000 | 41.400000 | 92.600000 | 88.000000 |
income | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | obesity | inactivity | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1007.000000 | 1.007000e+03 | 1007.000000 | 1007.000000 | 1.007000e+03 | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 998.000000 | 998.000000 |
mean | 45378.542205 | 2.117232e+04 | 24204.352233 | 261.272095 | 8.303669e+04 | 356.008937 | 0.015158 | 30.918272 | 25.678153 | 67.789178 | 30.421844 |
std | 2620.626459 | 6.098195e+04 | 5334.500991 | 680.338463 | 2.086682e+05 | 139.711464 | 0.006599 | 3.699632 | 4.176813 | 13.072600 | 12.934935 |
min | 41161.000000 | 3.200000e+01 | 6911.447084 | 1.000000 | 4.620000e+02 | 14.000000 | 0.000809 | 14.800000 | 11.200000 | 14.100000 | 5.000000 |
25% | 43107.500000 | 2.489500e+03 | 20834.832707 | 36.000000 | 1.075750e+04 | 259.500000 | 0.010992 | 28.700000 | 23.000000 | 60.525000 | 21.000000 |
50% | 45207.000000 | 6.889000e+03 | 24120.603015 | 100.000000 | 2.887900e+04 | 359.000000 | 0.014220 | 31.100000 | 25.700000 | 70.300000 | 27.900000 |
75% | 47539.500000 | 1.574600e+04 | 27296.568873 | 229.000000 | 6.443850e+04 | 447.000000 | 0.018293 | 33.400000 | 28.300000 | 77.300000 | 37.600000 |
max | 50134.000000 | 1.209302e+06 | 61745.301879 | 12823.000000 | 2.716939e+06 | 994.000000 | 0.048193 | 41.800000 | 37.800000 | 94.000000 | 85.000000 |
income | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | obesity | inactivity | trump | biden | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1007.000000 | 1.007000e+03 | 1007.000000 | 1007.000000 | 1.007000e+03 | 1007.000000 | 1007.000000 | 1007.000000 | 1007.000000 | 983.000000 | 983.000000 |
mean | 60131.244290 | 4.693174e+04 | 23724.142668 | 502.025819 | 2.008166e+05 | 278.285998 | 0.012043 | 28.980735 | 22.745283 | 61.185453 | 36.840793 |
std | 10901.921415 | 1.326955e+05 | 6079.872377 | 1543.128483 | 5.287700e+05 | 133.159375 | 0.006049 | 4.308677 | 4.360248 | 16.942304 | 16.875312 |
min | 50154.000000 | 6.300000e+01 | 3274.314819 | 1.000000 | 1.680000e+02 | 11.000000 | 0.000527 | 11.800000 | 8.100000 | 4.000000 | 3.100000 |
25% | 52596.000000 | 3.365500e+03 | 20339.539455 | 40.000000 | 1.461700e+04 | 182.500000 | 0.008142 | 26.800000 | 20.100000 | 50.700000 | 24.650000 |
50% | 56270.000000 | 1.146300e+04 | 23930.162156 | 127.000000 | 4.890400e+04 | 270.000000 | 0.011547 | 29.400000 | 22.900000 | 63.100000 | 35.000000 |
75% | 62830.500000 | 4.015200e+04 | 26817.374363 | 414.000000 | 1.692480e+05 | 351.000000 | 0.014857 | 32.100000 | 25.700000 | 73.400000 | 46.950000 |
max | 125635.000000 | 2.751220e+06 | 113017.751479 | 31736.000000 | 1.003911e+07 | 1212.000000 | 0.081481 | 39.700000 | 37.800000 | 96.200000 | 94.000000 |
view_partitions(merged, 'trump',
cols=['trump','income','# total covid cases', '# covid cases per 100k', '# covid deaths',
'population', '# covid deaths per 100k', 'case_fatality_rate',
'obesity', 'inactivity',])
trump | income | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | obesity | inactivity | |
---|---|---|---|---|---|---|---|---|---|---|
count | 995.000000 | 995.000000 | 9.950000e+02 | 995.000000 | 995.000000 | 9.950000e+02 | 995.000000 | 995.000000 | 995.000000 | 995.000000 |
mean | 47.241709 | 50299.426131 | 5.845852e+04 | 23489.186345 | 646.194975 | 2.454354e+05 | 305.234171 | 0.013401 | 29.725327 | 23.284322 |
std | 11.340094 | 15572.223128 | 1.435519e+05 | 5632.102950 | 1677.245394 | 5.579268e+05 | 157.485499 | 0.007672 | 5.660119 | 5.411057 |
min | 4.000000 | 21658.000000 | 1.840000e+02 | 7433.739051 | 2.000000 | 7.680000e+02 | 11.000000 | 0.001057 | 11.800000 | 8.100000 |
25% | 40.250000 | 39014.500000 | 5.238500e+03 | 20108.638558 | 69.500000 | 2.243150e+04 | 190.500000 | 0.008494 | 26.200000 | 19.700000 |
50% | 50.100000 | 48413.000000 | 1.591700e+04 | 23626.943005 | 169.000000 | 6.945000e+04 | 281.000000 | 0.012284 | 29.700000 | 23.000000 |
75% | 56.700000 | 57390.000000 | 5.389150e+04 | 26781.198391 | 581.500000 | 2.343480e+05 | 389.000000 | 0.016087 | 33.200000 | 26.500000 |
max | 61.000000 | 125635.000000 | 2.751220e+06 | 60447.419089 | 31736.000000 | 1.003911e+07 | 1064.000000 | 0.066603 | 47.600000 | 40.800000 |
trump | income | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | obesity | inactivity | |
---|---|---|---|---|---|---|---|---|---|---|
count | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 |
mean | 68.313831 | 46840.147264 | 12720.003980 | 24710.531247 | 174.843781 | 50517.506468 | 377.168159 | 0.016190 | 31.632537 | 26.274229 |
std | 3.905655 | 9379.280941 | 17651.768209 | 5359.878850 | 224.233632 | 66151.583773 | 144.713271 | 0.008867 | 3.556402 | 4.149821 |
min | 61.000000 | 25768.000000 | 85.000000 | 3274.314819 | 1.000000 | 403.000000 | 14.000000 | 0.000527 | 19.500000 | 14.100000 |
25% | 65.200000 | 40422.000000 | 3191.000000 | 21552.770707 | 49.000000 | 13260.000000 | 279.000000 | 0.011377 | 29.400000 | 23.300000 |
50% | 68.600000 | 45719.000000 | 7100.000000 | 25026.614998 | 106.000000 | 28529.000000 | 371.000000 | 0.014282 | 31.700000 | 26.000000 |
75% | 71.700000 | 51839.000000 | 14953.000000 | 28180.451997 | 216.000000 | 60353.000000 | 456.000000 | 0.018604 | 34.000000 | 29.100000 |
max | 74.500000 | 97936.000000 | 202690.000000 | 55122.917010 | 2964.000000 | 636234.000000 | 1229.000000 | 0.085749 | 42.300000 | 39.900000 |
trump | income | # total covid cases | # covid cases per 100k | # covid deaths | population | # covid deaths per 100k | case_fatality_rate | obesity | inactivity | |
---|---|---|---|---|---|---|---|---|---|---|
count | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 | 1005.000000 |
mean | 80.786965 | 43878.374129 | 5059.204975 | 24627.390904 | 85.427861 | 20120.834826 | 422.151244 | 0.017918 | 31.537114 | 28.320597 |
std | 4.289503 | 9102.326701 | 6019.201974 | 6383.615594 | 103.655065 | 23113.800617 | 162.744774 | 0.008465 | 3.592986 | 4.534053 |
min | 74.500000 | 23047.000000 | 32.000000 | 4916.885879 | 1.000000 | 168.000000 | 27.000000 | 0.001300 | 19.400000 | 13.600000 |
25% | 77.500000 | 37417.000000 | 1364.000000 | 20633.467594 | 22.000000 | 5925.000000 | 327.000000 | 0.012937 | 28.900000 | 25.000000 |
50% | 80.100000 | 42310.000000 | 3257.000000 | 24538.953707 | 54.000000 | 13287.000000 | 422.000000 | 0.016549 | 31.500000 | 27.900000 |
75% | 83.600000 | 49184.000000 | 6730.000000 | 28499.655884 | 113.000000 | 25924.000000 | 525.000000 | 0.021352 | 33.900000 | 31.900000 |
max | 96.200000 | 86354.000000 | 57505.000000 | 113017.751479 | 1435.000000 | 223233.000000 | 1212.000000 | 0.092937 | 43.200000 | 41.400000 |
We split the data into equal lower, middle, and upper quantiles based on first obeity and then inactivity. We can see that the the average death rates of counties in these partitions is positivly correlated with both of these features. This was expected as preexisting health conditions (obescity) and heath risks (inactivity) increase all cause mortality but also have a strong effect on how serious a covid infection can be. Finally we see that income has an even stronger relationship with the death rate, though here the correlation is a negative one. Obesity and inactivity are both negatively correlated with income as well. The relationship between voting for Trump and income is not a string one strong, though there is a positive correlation between Trump voting and obesity, inactivity, and covid death rate.
We can tell from comparing the populations between the groups that this data is not treated granularly as would be ideal. Very small population counties that get weighted the same as very large population counties in regards to the mean. So rural areas get over represented in the averages nationwide. This also explains why the Trump vs Biden is so far skewed from the actual well known national average based on popular votes.
Also, the difference between income correlates more with population density than it might with an individual socio economic status. First, a higher income might not go as far towards standard of living in the city as it does in rural areas. Second, by using the average income over the whole county, income inequality in that county is not factored in. There could be many low income individuals living with many high income individuals in the same county.
We have done data gathing, parsing, and exploring data. We can continue to predict some behavior of the data (e.g. how a particular county will respond to COVID on a weekly basis).
Alternatively, we could be interested in inference, whereby we are more concerned with trying to understand why and how a system behaves the way it does. We might wish to understand which factors most correlate and cause a certain event to happen. This could give us insights into where certain inequalities persist.