Overview of today's topics:
To set this lecture up, a couple of years ago I downloaded the most popular data sets from 1) LA's covid dashboard, 2) the LA city data portal, and 3) the LA county data portal. This gives us a variety of real-world data sets that are relatively messy and require some cleaning and transformation prior to analysis.
import ast
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
Note from the Covid data source: "Crude and Adjusted Rates are Per 100,000 population (2018 Population Estimates). Adjusted Rate is age-adjusted by year 2000 US Standard Population. Adjusted rates account for differences in the distribution of age in the underlying population. Adjusted rates are useful for comparing rates across geographies (i.e. comparing the rate between cities that have different age distributions)."
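For instance, a crude rate is just the count divided by the population, scaled to per-100,000. A minimal sketch with made-up numbers:

```python
cases = 1500        # hypothetical case count for a city
population = 75000  # hypothetical 2018 population estimate
crude_rate = cases / population * 100_000  # cases per 100,000 residents
print(crude_rate)  # 2000.0
```

Age-adjustment goes a step further by reweighting each age group's rate to a standard population, so two cities' rates stay comparable even if one skews older.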
# load the data
df = pd.read_csv("../../data/LA_County_Covid19_CSA_case_death_table.csv")
df.shape
# what do you see in the raw data?
df
# check the data types: do we need to change/convert any?
df.dtypes
# drop the duplicate IDs and rename the place column to something meaningful
df = df.drop(columns=["Unnamed: 0"]).rename(columns={"geo_merge": "place_name"})
df
# clean up place names
df["place_name"] = (
df["place_name"]
.str.replace("City of ", "")
.str.replace("Unincorporated - ", "")
.str.replace("Los Angeles - ", "")
)
df.sort_values("place_name")
df_covid = df.copy()
# now it's your turn
# create a new column representing the proportion of cases that were fatal
# load the data
df = pd.read_csv("../../data/Top_County_Earners.csv")
df.shape
# what do you see in the raw data?
df
# check the data types: do we need to change/convert any?
df.dtypes
# why does the total earnings column name above look weird?
df.columns
# rename the total earnings column to something that won't trip you up
df = df.rename(columns={" Total Earnings": "Total Earnings"})
# convert the float columns to ints: a couple of ways you could do it...
# OPTION 1: use IndexSlice from last week's lecture
# (caveat: assignment via .loc can preserve the columns' original float dtype in
# some pandas versions, so check df.dtypes afterward)
slicer = pd.IndexSlice[:, "Base Earnings":"Total Compensation"]
df.loc[slicer] = df.loc[slicer].astype(int)
# OPTION 2: select columns where type is float64
float_cols = df.columns[df.dtypes == "float64"]
df[float_cols] = df[float_cols].astype(int)
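A third option (a sketch on a toy frame, not the earnings data) is select_dtypes, which avoids comparing dtypes by hand:

```python
import pandas as pd

df_demo = pd.DataFrame({"pay": [100.0, 200.0], "name": ["a", "b"]})
float_cols = df_demo.select_dtypes(include="float64").columns  # just the float columns
df_demo[float_cols] = df_demo[float_cols].astype(int)
print(df_demo.dtypes)
```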
# move year to end and employee name to beginning
cols = [df.columns[-1]] + df.columns[1:-1].to_list() + [df.columns[0]]
df = df.reindex(columns=cols)
df
# convert from USD to 1000s of USD
df["Total Compensation 1000s"] = df["Total Compensation"] / 1000
# improve the capitalization (note, only Series can do vectorized str methods)
slicer = pd.IndexSlice[:, "Employee Name":"Department"]
df.loc[slicer] = df.loc[slicer].apply(lambda col: col.str.title(), axis="rows")
df
Idea: you could use NLTK to classify male vs female names and examine average pay differences between the two groups.
df_earnings = df.copy()
# now it's your turn
# convert all the earnings/compensation columns from USD to Euros, using today's exchange rate
# load the data
df = pd.read_csv("../../data/Listing_of_Active_Businesses.csv")
df.shape
# what do you see in the raw data?
df
# check the data types: do we need to change/convert any?
df.dtypes
# you have to make a decision: NAICS should be int, but it contains nulls
# you could drop nulls then convert to int, or just leave it as float
# OR in recent versions of pandas, you could cast to type pd.Int64Dtype() which allows nulls
pd.isnull(df["NAICS"]).sum()
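A quick sketch of the nullable-integer option on a toy Series (made-up NAICS-like codes):

```python
import pandas as pd

s = pd.Series([541110.0, None, 722511.0])  # float64, because the null forces it
s_int = s.astype("Int64")                  # nullable integer dtype: nulls become <NA>
print(s_int)
```

This keeps the codes as true integers without dropping the rows that have nulls.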
# make sure end dates are all null, then drop that column
assert pd.isnull(df["LOCATION END DATE"]).all()
df = df.drop(columns=["LOCATION END DATE"])
# make the column names lower case and without spaces or hash signs
cols = df.columns.str.lower().str.replace(" ", "_").str.strip("_#")
df.columns = cols
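On a toy Index of hypothetical column names, the same chain behaves like this:

```python
import pandas as pd

cols = pd.Index(["LOCATION ACCOUNT #", "BUSINESS NAME"])
cleaned = cols.str.lower().str.replace(" ", "_").str.strip("_#")  # strip removes leading/trailing "_" and "#"
print(cleaned.to_list())  # ['location_account', 'business_name']
```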
# make sure account numbers are unique, then set as index and sort index
assert df["location_account"].is_unique
df = df.set_index("location_account").sort_index()
df
# convert the start date from strings to datetimes
df["location_start_date"] = pd.to_datetime(df["location_start_date"])
# improve the capitalization
slicer = pd.IndexSlice[:, "business_name":"mailing_city"]
df.loc[slicer] = df.loc[slicer].apply(lambda col: col.str.title(), axis="rows")
df
# what's going on with those location coordinates?
df["location"].iloc[0]
So the location column contains a mix of nulls and strings of coordinate tuples. Yikes. There are different ways to parse these coordinates out. Here's a relatively efficient option. First, some explanation:
- Create a mask that is True wherever location is not null, so we don't try to parse nulls.
- Take the non-null location strings and literal_eval them (this "runs" each string as Python code, rendering them as tuples), and capture the result as a Series called latlng.
- Create new lat and lng columns in df (only assigning values to them where the mask is True) by breaking out the tuples from the previous step into a DataFrame with two columns.
- Drop the original location column.
mask = pd.notnull(df["location"])
latlng = df.loc[mask, "location"].map(ast.literal_eval)
df.loc[mask, ["lat", "lng"]] = pd.DataFrame(
latlng.to_list(), index=latlng.index, columns=["lat", "lng"]
)
df = df.drop(columns=["location"])
df
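If literal_eval is unfamiliar, here is what it does to a single (made-up) coordinate string:

```python
import ast

# safely evaluates a string as a Python literal: no arbitrary code execution
coords = ast.literal_eval("(34.0522, -118.2437)")
print(type(coords), coords)  # <class 'tuple'> (34.0522, -118.2437)
```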
df_business = df.copy()
# now it's your turn
# create a new column containing only the 5-digit zip
# which zip codes appear the most in the data set?
Python data visualization tool landscape:
We will focus on seaborn in this class. It is the easiest of these tools for producing meaningful, aesthetically pleasing visuals. Seaborn generally makes smart decisions about color for you, but you can usually tweak the colors in your plot by passing a palette argument (the name of a colormap or a list of colors to use).
# configure seaborn's style for subsequent use
sns.set_style("whitegrid") # visual styles
sns.set_context("paper") # presets for scaling figure element sizes
# our cleaned data sets from earlier
print(df_business.shape)
print(df_covid.shape)
print(df_earnings.shape)
# quick descriptive stats for some variable
# but... looking across the whole population obscures between-group heterogeneity
df_earnings["Total Compensation 1000s"].describe()
# which departments have the most employees in the data set?
dept_counts = df_earnings["Department"].value_counts().head()
dept_counts
# recall grouping and summarizing from last week
# look at compensation distribution across the 5 largest departments
mask = df_earnings["Department"].isin(dept_counts.index)
df_earnings.loc[mask].groupby("Department")["Total Compensation 1000s"].describe().astype(int)
That's better... but it's still hard to pick out patterns and trends by just staring at a table full of numbers. Let's visualize it.
Box plots illustrate the data's distribution via the "five number summary": min, max, median, and the two quartiles (plus outliers). We will use seaborn for our visualization. In seaborn, you control what's considered an outlier by changing the whiskers' min/max with the whis
parameter; the convention is to flag points more than 1.5 × IQR beyond the quartiles as outliers. With a numeric x and a categorical y to group by (as below), you get a horizontal boxplot.
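The 1.5 × IQR convention can be computed by hand on toy data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is a clear outlier
q1, q3 = np.percentile(x, [25, 75])              # the two quartiles
iqr = q3 - q1                                    # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # whisker bounds
outliers = x[(x < lower) | (x > upper)]
print(outliers)  # [100]
```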
# visualize compensation distribution across the 5 largest departments
x = df_earnings.loc[mask, "Total Compensation 1000s"]
y = df_earnings.loc[mask, "Department"]
# fliersize changes the size of the outlier dots
# boxprops lets you set more configs with a dict, such as alpha (which means opacity)
ax = sns.boxplot(x=x, y=y, fliersize=0.3, boxprops={"alpha": 0.7})
# set the x-axis limit, the figure title, and x/y axis labels
ax.set_xlim(left=0)
ax.set_title("Total compensation by department")
ax.set_xlabel("Total compensation (USD, 1000s)")
ax.set_ylabel("")
# save figure to disk at 300 dpi and with a tight bounding box
ax.get_figure().savefig("boxplot-earnings.png", dpi=300, bbox_inches="tight")
Ideally, your xlabel would state what year the USD are in (e.g., "2017 inflation-adjusted USD") but the data source doesn't say clearly. My guess is that they are nominal dollars from the reported year.
What does this figure tell you? Which department had the highest total compensations? By what measure?
# what is this "ax" variable we created?
type(ax)
# every matplotlib axes is associated with a "figure" which is like a container
fig = ax.get_figure()
type(fig)
# manually change the plot's size/dimension by adjusting its figure's size
fig = ax.get_figure()
fig.set_size_inches(16, 4) # width, height in inches
fig
Histograms visualize the distribution of some variable by binning it then counting observations per bin. KDE plots are similar, but continuous and smooth.
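The binning step itself can be seen with numpy's histogram function on toy data:

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 8, 9])
counts, edges = np.histogram(data, bins=4)  # 4 equal-width bins spanning the data's range
print(counts)  # observations per bin: [3 3 0 2]
print(edges)   # bin boundaries
```

Plotting libraries like seaborn then draw one bar per bin with height equal to its count.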
# histplot visualizes the variable's distribution as a histogram and optionally a KDE
ax = sns.histplot(df_earnings["Total Compensation 1000s"].dropna(), kde=False, bins=30)
_ = ax.set_xlim(left=0)
You can compare multiple histograms to see how different groups overlap or differ by some measure.
# typical LASD employee earns more than the typical regional planner :(
df_earnings.groupby("Department")["Total Compensation 1000s"].median().sort_values(
ascending=False
).head(10)
# visually compare sheriff and social services dept subsets
mask = df_earnings["Department"].isin(["Public Social Services Dept", "Sheriff"])
ax = sns.histplot(
data=df_earnings.loc[mask], x="Total Compensation 1000s", hue="Department", bins=50, kde=False
)
ax.set_xlim(0, 400)
ax.set_xlabel("Total compensation (USD, 1000s)")
ax.set_title("Employee Compensation: LASD vs Social Services")
ax.get_figure().savefig("boxplot-hists.png", dpi=300, bbox_inches="tight")
Looks like a pretty big difference! But is it statistically significant?
# difference-in-means: compute difference, t-statistic, and p-value
group1 = df_earnings[df_earnings["Department"] == "Public Social Services Dept"][
"Total Compensation 1000s"
]
group2 = df_earnings[df_earnings["Department"] == "Sheriff"]["Total Compensation 1000s"]
t, p = stats.ttest_ind(group1, group2, equal_var=False, nan_policy="omit")
print(group1.mean() - group2.mean(), t, p)
Social service workers in LA county make, on average, $56k less than LASD employees and this difference is statistically significant (p<0.001).
Note also that you can divide your p-value by 2 to convert it from a two-tailed to a one-tailed hypothesis test, provided the observed difference is in the hypothesized direction.
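As a self-contained sketch on synthetic data (not the real earnings figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=60, scale=15, size=500)    # synthetic "group 1" compensation
b = rng.normal(loc=100, scale=40, size=500)   # synthetic "group 2" compensation
t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test: doesn't assume equal variances
print(t < 0, p < 0.001)  # large, significant difference
p_one_tailed = p / 2     # valid only if the difference matches the hypothesized direction
```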
# the big reveal... who (individually) had the highest earnings?
cols = ["Employee Name", "Position Title", "Department", "Total Compensation 1000s"]
df_earnings[cols].sort_values("Total Compensation 1000s", ascending=False).head(10)
# now it's your turn
# choose 3 departments and visualize their overtime earnings distributions with histograms
Histograms and box plots visualize univariate distributions: how a single variable's values are distributed. Scatter plots essentially visualize bivariate distributions so that we can see patterns and trends jointly between two variables.
df_covid
# use seaborn to scatter-plot two variables
ax = sns.scatterplot(x=df_covid["cases_final"], y=df_covid["deaths_final"])
ax.set_xlim(left=0)
ax.set_ylim(bottom=0)
ax.get_figure().set_size_inches(5, 5) # make it square
# show a pair plot of these places across these 3 variables
cols = ["cases_final", "deaths_final", "population"]
ax = sns.pairplot(df_covid[cols].dropna())
Do you see patterns in these scatter plots? Correlation tells us to what extent two variables are linearly related to one another. Pearson correlation coefficients range from -1 to 1, with 0 indicating no linear relationship, -1 indicating a perfect negative linear relationship, and 1 indicating a perfect positive linear relationship. If you are hypothesis-testing a correlation, make sure to report and interpret the p-value.
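As a sanity check on the definition, a perfectly linear relationship yields r = 1 (toy data):

```python
import numpy as np
from scipy import stats

x = np.arange(20.0)
y = 2 * x + 1          # perfect positive linear relationship
r, p = stats.pearsonr(x, y)
print(round(r, 3))  # 1.0
```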
# calculate correlation (and significance) between two variables
r, p = stats.pearsonr(x=df_covid["population"], y=df_covid["cases_final"])
print(round(r, 3), round(p, 3))
# a correlation matrix
correlations = df_covid[cols].corr()
correlations.round(2)
# visual correlation matrix via seaborn heatmap
# use vmin, vmax, center to set colorbar scale properly
ax = sns.heatmap(
correlations, vmin=-1, vmax=1, center=0, cmap="coolwarm", square=True, linewidths=1
)
# now it's your turn
# visualize a correlation matrix of the various compensation columns in the earnings dataframe
# from the visualization, pick two variables, then calculate their correlation coefficient and p-value
# regress one variable on another: a change in x is associated with what change in y?
m, b, r, p, se = stats.linregress(x=df_covid["population"], y=df_covid["cases_final"])
print(m, b, r, p, se)
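To see the slope's interpretation concretely, fit a made-up exact line:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3 * x + 2  # exact line: slope 3, intercept 2
m, b, r, p, se = stats.linregress(x=x, y=y)
# each 1-unit increase in x is associated with an m-unit increase in y
print(m, b)  # slope ≈ 3, intercept ≈ 2
```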
# a linear (regression) trend line + confidence interval
ax = sns.regplot(x=df_covid["population"], y=df_covid["cases_final"])
ax.get_figure().set_size_inches(5, 5)
# now it's your turn
# does logarithmic transformation improve the heteroskedasticity and linear fit?
Count plots let you count things across categories.
Bar plots let you estimate a measure of central tendency across categories.
# extract the two-digit sector code from each NAICS classification
sectors = df_business["naics"].dropna().astype(int).astype(str).str.slice(0, 2)
sectors
# count plot: like a histogram counting observations across categorical instead of continuous data
order = sectors.value_counts().index
ax = sns.countplot(x=sectors, order=order, alpha=0.9)
ax.set_xlabel("NAICS Sector")
ax.set_ylabel("Number of businesses")
ax.get_figure().savefig("countplot-naics.png", dpi=300, bbox_inches="tight")
NAICS sector 54 is "professional, scientific, and technical services" and sector 53 is "real estate and rental and leasing."
# bar plot: estimate mean total compensation per dept + 95% confidence interval
order = (
df_earnings.groupby("Department")["Total Compensation 1000s"]
.mean()
.sort_values(ascending=False)
.index
)
ax = sns.barplot(
x=df_earnings["Total Compensation 1000s"],
y=df_earnings["Department"],
estimator=np.mean,
errorbar=("ci", 95),
order=order,
alpha=0.9,
)
ax.set_xlabel("Mean Total Compensation (USD, 1000s)")
ax.set_ylabel("")
ax.get_figure().set_size_inches(4, 12)
# now it's your turn
# use the businesses dataframe to visualize a bar plot of mean start year
Line plots are most commonly used to visualize time series: how one or more variables change over time. We don't have time series data here, so we'll improvise with a bit of an artificial example.
# extract years from each start date then count their appearances
years = df_business["location_start_date"].dropna().dt.year.value_counts().sort_index()
years
# reindex so we're not missing any years
labels = range(years.index.min(), years.index.max() + 1)
years = years.reindex(labels).fillna(0).astype(int)
years
# line plot showing counts per start year over past 40 years
ax = sns.lineplot(data=years.loc[1980:2020])
# rotate the tick labels
ax.tick_params(axis="x", labelrotation=45)
ax.set_xlim(1980, 2020)
ax.set_ylim(bottom=0)
ax.set_xlabel("Year")
ax.set_ylabel("Count")
ax.set_title("Business Location Starts by Year")
ax.get_figure().savefig("lineplot-businesses.png", dpi=300, bbox_inches="tight")
# now it's your turn
# extract month + year from the original date column
# re-create the line plot to visualize location starts by month + year