Ryan Othniel Kearns | rkearns@montecarlodata.com
16 March 2023
Welcome to the Data Quality Bootcamp! In this demo, we demonstrate some best practices for data cleaning, and illustrate why they're important for data quality. We'll use a sample dataset containing information about exoplanets.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
import sys
sys.path.append("..")
Let's connect to our database. We're making use of sqlite3 and pandas in Python to fluidly display SQL query results in a notebook format, with easy visualizations. If you're not familiar with Python, don't worry about the scaffolding -- the SQL is the "meat" of this presentation.
import sqlite3
conn = sqlite3.connect("../data/dbs/strata.db")
cur = conn.cursor()
def setup_local_db():
    import random
    import uuid
    # Seed the table with randomized rows for each atmosphere type.
    atmosphere_counts = [("co2", 20), ("o2", 57), ("n2", 18), ("h2so4", 23)]
    for atmosphere, count in atmosphere_counts:
        for _ in range(count):
            cur.execute(
                "INSERT INTO exoplanets VALUES (?, ?, ?, ?, ?, ?, ?, ?);",
                (
                    str(uuid.uuid4()),
                    random.random() * 200,
                    random.random() * 15,
                    random.random() * 500,
                    random.random() * 150,
                    "2020-03-05",
                    random.random(),
                    atmosphere,
                ),
            )
setup_local_db()
We can take a look at our main dataset, called exoplanets.
sql = """--sql
SELECT * FROM exoplanets LIMIT 5;
"""
pd.read_sql_query(sql, conn)
What's the schema of this table?
sql = """--sql
-- note that this syntax is specific to SQLite... in Snowflake for example, we would use `DESCRIBE TABLE EXOPLANETS;`
PRAGMA TABLE_INFO(exoplanets);
"""
pd.read_sql_query(sql, conn)
Hmm, so some questions immediately pop up about this dataset. Let's go through them methodically to ensure clean data!
First: the notnull column in our table info says that none of the columns are non-null (that is, enforced to be non-null by the database). That seems bad, especially for a column named _id, which presumably identifies the planet each row represents in our system.
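For context, a NOT NULL constraint makes the database reject bad rows at write time rather than letting them accumulate. A minimal sketch with a throwaway in-memory table (the table and column names here just mirror our schema; they're illustrative, not part of the real database):

```python
import sqlite3

# Illustrative only: an in-memory table showing what a NOT NULL
# constraint buys us (the real exoplanets schema doesn't enforce one).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE exoplanets_demo (_id TEXT NOT NULL, distance REAL);")
demo.execute("INSERT INTO exoplanets_demo VALUES ('planet-1', 42.0);")

# A null _id is now rejected at write time instead of silently landing in the table.
try:
    demo.execute("INSERT INTO exoplanets_demo VALUES (NULL, 17.0);")
except sqlite3.IntegrityError as e:
    print(e)  # NOT NULL constraint failed: exoplanets_demo._id
```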
Are there actually null values present for every one of these columns?
sql = """--sql
SELECT COUNT(*) AS count_null_id
FROM exoplanets
WHERE _id IS NULL;
"""
pd.read_sql_query(sql, conn)
Phew! No null _ids! How about the other columns?
sql = """--sql
SELECT COUNT(*) AS count_null_distance
FROM exoplanets
WHERE distance IS NULL;
"""
pd.read_sql_query(sql, conn)
So we have some nulls for the distance field... how many as a share of the total?
sql = """--sql
SELECT
CAST(SUM(
CASE
WHEN distance IS NULL THEN 1
ELSE 0
END
) AS FLOAT) / COUNT(*) AS distance_null_rate
FROM exoplanets;
"""
pd.read_sql_query(sql, conn)
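The same null-rate calculation generalizes to every column at once on the pandas side, which is handy for a quick audit. A sketch using a toy stand-in table so it runs on its own; in the notebook you'd query `conn` and the real exoplanets table instead:

```python
import sqlite3
import pandas as pd

# Toy stand-in table, illustrative only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE exoplanets_demo (_id TEXT, distance REAL, atmosphere TEXT);")
demo.executemany(
    "INSERT INTO exoplanets_demo VALUES (?, ?, ?);",
    [("a", 10.0, "o2"), ("b", None, None), ("c", 30.0, None), ("d", 40.0, "co2")],
)

df = pd.read_sql_query("SELECT * FROM exoplanets_demo;", demo)

# Fraction of nulls in each column, sorted worst-first.
null_rates = df.isnull().mean().sort_values(ascending=False)
print(null_rates.to_dict())  # {'atmosphere': 0.5, 'distance': 0.25, '_id': 0.0}
```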
5% is a workable number of nulls for me. Let's get a sense of how the distances are distributed so we can decide what to do about this pesky share of null data.
sql = """--sql
SELECT distance
FROM exoplanets
WHERE distance IS NOT NULL;
"""
px.histogram(pd.read_sql_query(sql, conn))
Looks like distance can't be a negative value (that's good...) and is otherwise distributed log-normally. (For non-statisticians: this is what we'd expect from a natural phenomenon that can take on only positive values.)
Anyway! There are three things we can do with null values when data cleaning:
1. Drop the rows containing nulls.
2. Replace nulls with a summary statistic, like the mean or median.
3. Replace nulls with a default value, like 0.
Approach #1 would reduce the size of our dataset by 5% without affecting the distribution of distances.
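In SQL, approach #1 is just the `WHERE distance IS NOT NULL` filter we used for the histogram above; in pandas it's `dropna`. A sketch of the pandas version on a toy frame (illustrative values, not our real data):

```python
import pandas as pd

# Approach #1 on a toy frame: drop rows whose distance is null. The surviving
# values are untouched, so the shape of the distribution is preserved.
df = pd.DataFrame({"distance": [10.0, None, 30.0, None, 50.0]})
cleaned = df.dropna(subset=["distance"])
print(len(df), "->", len(cleaned))  # 5 -> 3
```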
Alternatively (#2), we could replace any null distance value with the average distance. That seems pretty sensible, right? This means we can use 100% of our data, though our log-normal distribution is now biased by the interpolation. Take a look:
sql = """--sql
SELECT
CASE
WHEN distance IS NULL THEN average_distance
ELSE distance
END AS distance_interpolated
FROM exoplanets
JOIN (SELECT AVG(distance) AS average_distance FROM exoplanets);
"""
px.histogram(pd.read_sql_query(sql, conn))
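A common alternative to the mean is the median, which is more robust to the skew of a log-normal distribution. SQLite has no built-in MEDIAN aggregate, but pandas can fill it in. A sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame: three observed distances plus two nulls.
df = pd.DataFrame({"distance": [10.0, None, 20.0, None, 90.0]})

# Fill nulls with the median of the observed values (20.0 here); unlike the
# mean, the median isn't dragged upward by the distribution's long tail.
df["distance_interpolated"] = df["distance"].fillna(df["distance"].median())
print(df["distance_interpolated"].tolist())  # [10.0, 20.0, 20.0, 20.0, 90.0]
```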
Option #3 is to replace null distances with another default value. For example, 0 might be a sensible default, depending on your use case. This approach also introduces bias into your data, though:
sql = """--sql
SELECT
CASE
WHEN distance IS NULL THEN 0
ELSE distance
END AS distance_interpolated
FROM exoplanets;
"""
px.histogram(pd.read_sql_query(sql, conn))
So above we did some interesting data cleaning on a numerical column, distance. But should we be worried about cleaning other kinds of data? As it turns out, yes -- the other type of data in our table is categorical data, for example the atmosphere column:
sql = """--sql
SELECT atmosphere
FROM exoplanets
WHERE atmosphere IS NOT NULL
LIMIT 5;
"""
pd.read_sql_query(sql, conn)
Looks like this column contains text representing a chemical compound -- probably the most common chemical in that planet's atmosphere.
Notice that atmosphere is null a large share of the time:
sql = """--sql
SELECT
CAST(SUM(
CASE
WHEN atmosphere IS NULL THEN 1
ELSE 0
END
) AS FLOAT) / COUNT(*) AS atmosphere_null_rate
FROM exoplanets;
"""
pd.read_sql_query(sql, conn)
Almost all of the time! What gives?
Let's look at the distinct categorical values to figure out why:
sql = """--sql
SELECT atmosphere, COUNT(*)
FROM exoplanets
GROUP BY 1;
"""
pd.read_sql_query(sql, conn)
So, first observation: there's no option for "no atmosphere" among these values. It's likely that null here records that the exoplanet doesn't have an atmosphere. If we know that to be the case (we shouldn't just assume it!), then we should replace these nulls with a default value, "no_atmosphere":
sql = """--sql
WITH atmosphere_default_none AS(
SELECT
CASE
WHEN atmosphere IS NULL THEN 'no_atmosphere'
ELSE atmosphere
END AS atmosphere
FROM exoplanets
)
SELECT atmosphere, COUNT(*)
FROM atmosphere_default_none
GROUP BY 1;
"""
pd.read_sql_query(sql, conn)
Second, we have a problem with our categorical values: some of them are lowercased and some are not. If we leave the data like this, we'll be ignoring something crucial -- namely, that O2 and o2 are the same chemical compound.
Let's fix it:
sql = """--sql
WITH atmosphere_default_none AS(
SELECT
CASE
WHEN atmosphere IS NULL THEN 'NO_ATMOSPHERE'
ELSE UPPER(atmosphere)
END AS atmosphere
FROM exoplanets
)
SELECT atmosphere, COUNT(*)
FROM atmosphere_default_none
GROUP BY 1;
"""
pd.read_sql_query(sql, conn)
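The same two fixes -- uppercase the known values, give nulls an explicit category -- can be done in one pass on the pandas side too, which is useful when the cleaning needs to live in Python rather than SQL. A sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame with mixed-case categories and nulls.
df = pd.DataFrame({"atmosphere": ["co2", "O2", None, "o2", None]})

# Uppercase the known values, then give nulls an explicit category.
cleaned = df["atmosphere"].str.upper().fillna("NO_ATMOSPHERE")
print(sorted(cleaned.value_counts().to_dict().items()))
# [('CO2', 1), ('NO_ATMOSPHERE', 2), ('O2', 2)]
```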