Ryan Othniel Kearns | rkearns@montecarlodata.com
16 March 2023
Welcome to the Data Quality Bootcamp! In this demo, we demonstrate some best practices for data cleaning, and illustrate why they're important for data quality. We'll use a sample dataset containing information about exoplanets.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
import sys
sys.path.append("..")
Let's connect to our database. We're making use of sqlite3 and pandas in Python to fluidly display SQL query results in a notebook format, with easy visualizations. If you're not familiar with Python, don't worry about the scaffolding -- the SQL is the "meat" of this presentation.
import sqlite3
conn = sqlite3.connect("../data/dbs/strata.db")
cur = conn.cursor()
def setup_local_db():
    import random
    import uuid
    # Seed the table with randomized rows for each atmosphere type.
    atmosphere_counts = [("co2", 20), ("o2", 57), ("n2", 18), ("h2so4", 23)]
    for atmosphere, count in atmosphere_counts:
        for _ in range(count):
            cur.execute(
                "INSERT INTO exoplanets VALUES (?, ?, ?, ?, ?, ?, ?, ?);",
                (
                    str(uuid.uuid4()),
                    random.random() * 200,
                    random.random() * 15,
                    random.random() * 500,
                    random.random() * 150,
                    "2020-03-05",
                    random.random(),
                    atmosphere,
                ),
            )
setup_local_db()
We can take a look at our main dataset, called exoplanets.
sql = """--sql
SELECT * FROM exoplanets LIMIT 5;
"""
pd.read_sql_query(sql, conn)
What's the schema of this table?
sql = """--sql
-- note that this syntax is specific to SQLite... in Snowflake for example, we would use `DESCRIBE TABLE EXOPLANETS;`
PRAGMA TABLE_INFO(exoplanets);
"""
pd.read_sql_query(sql, conn)
Hmm, so some questions immediately pop up about this dataset. Let's go through them methodically to ensure clean data!
First: the notnull column in our table info says that none of the columns are non-null (that is, enforced to be non-null by the database). That seems bad, especially for a column named _id, which presumably identifies the planet each row represents in our system.
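For context, a NOT NULL constraint makes the database reject bad rows at write time rather than letting them accumulate. A minimal sketch with a throwaway in-memory table (the table and column names here just mirror our schema; they're illustrative, not part of the real database):

```python
import sqlite3

# Illustrative only: an in-memory table showing what a NOT NULL
# constraint buys us (the real exoplanets schema doesn't enforce one).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE exoplanets_demo (_id TEXT NOT NULL, distance REAL);")
demo.execute("INSERT INTO exoplanets_demo VALUES ('planet-1', 42.0);")

# A null _id is now rejected at write time instead of silently landing in the table.
try:
    demo.execute("INSERT INTO exoplanets_demo VALUES (NULL, 17.0);")
except sqlite3.IntegrityError as e:
    print(e)  # NOT NULL constraint failed: exoplanets_demo._id
```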
Are there actually null values present for every one of these columns?
sql = """--sql
SELECT COUNT(*) AS count_null_id
FROM exoplanets
WHERE _id IS NULL;
"""
pd.read_sql_query(sql, conn)
Phew! No null _ids! How about the other columns?
sql = """--sql
SELECT COUNT(*) AS count_null_distance
FROM exoplanets
WHERE distance IS NULL;
"""
pd.read_sql_query(sql, conn)
So we have some nulls for the distance field... how many as a share of the total?
sql = """--sql
SELECT
CAST(SUM(
CASE
WHEN distance IS NULL THEN 1
ELSE 0
END
) AS FLOAT) / COUNT(*) AS distance_null_rate
FROM exoplanets;
"""
pd.read_sql_query(sql, conn)
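The same null-rate calculation generalizes to every column at once on the pandas side, which is handy for a quick audit. A sketch using a toy stand-in table so it runs on its own; in the notebook you'd query `conn` and the real exoplanets table instead:

```python
import sqlite3
import pandas as pd

# Toy stand-in table, illustrative only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE exoplanets_demo (_id TEXT, distance REAL, atmosphere TEXT);")
demo.executemany(
    "INSERT INTO exoplanets_demo VALUES (?, ?, ?);",
    [("a", 10.0, "o2"), ("b", None, None), ("c", 30.0, None), ("d", 40.0, "co2")],
)

df = pd.read_sql_query("SELECT * FROM exoplanets_demo;", demo)

# Fraction of nulls in each column, sorted worst-first.
null_rates = df.isnull().mean().sort_values(ascending=False)
print(null_rates.to_dict())  # {'atmosphere': 0.5, 'distance': 0.25, '_id': 0.0}
```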
5% is a workable number of nulls for me. Let's get a sense of how the distances are distributed so we can decide what to do about this pesky share of null data.
sql = """--sql
SELECT distance
FROM exoplanets
WHERE distance IS NOT NULL;
"""
px.histogram(pd.read_sql_query(sql, conn))
Looks like distance can't be a negative value (that's good...) and is otherwise distributed log-normally. (For non-statisticians: this is what we'd expect from a natural phenomenon that can take on only positive values.)
Anyway! There are three things we can do with null values when data cleaning:
1. Drop the rows containing nulls.
2. Replace nulls with a summary statistic, like the mean or median.
3. Replace nulls with a default value, like 0.
Approach #1 would reduce the size of our dataset by 5% without affecting the distribution of distances.
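In SQL, approach #1 is just the `WHERE distance IS NOT NULL` filter we used for the histogram above; in pandas it's `dropna`. A sketch of the pandas version on a toy frame (illustrative values, not our real data):

```python
import pandas as pd

# Approach #1 on a toy frame: drop rows whose distance is null. The surviving
# values are untouched, so the shape of the distribution is preserved.
df = pd.DataFrame({"distance": [10.0, None, 30.0, None, 50.0]})
cleaned = df.dropna(subset=["distance"])
print(len(df), "->", len(cleaned))  # 5 -> 3
```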
Alternatively (#2), we could replace any null distance value with the average distance. That seems pretty sensible, right? This means we can use 100% of our data, though our log-normal distribution is now biased by the interpolation. Take a look:
sql = """--sql
SELECT
CASE
WHEN distance IS NULL THEN average_distance
ELSE distance
END AS distance_interpolated
FROM exoplanets
JOIN (SELECT AVG(distance) AS average_distance FROM exoplanets);
"""
px.histogram(pd.read_sql_query(sql, conn))
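A common alternative to the mean is the median, which is more robust to the skew of a log-normal distribution. SQLite has no built-in MEDIAN aggregate, but pandas can fill it in. A sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame: three observed distances plus two nulls.
df = pd.DataFrame({"distance": [10.0, None, 20.0, None, 90.0]})

# Fill nulls with the median of the observed values (20.0 here); unlike the
# mean, the median isn't dragged upward by the distribution's long tail.
df["distance_interpolated"] = df["distance"].fillna(df["distance"].median())
print(df["distance_interpolated"].tolist())  # [10.0, 20.0, 20.0, 20.0, 90.0]
```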
Option #3 is to replace null distances with another default value. For example, 0 might be a sensible default, depending on your use case. This approach also introduces bias into your data, though:
sql = """--sql
SELECT
CASE
WHEN distance IS NULL THEN 0
ELSE distance
END AS distance_interpolated
FROM exoplanets;
"""
px.histogram(pd.read_sql_query(sql, conn))
So above we did some interesting data cleaning on a numerical column, distance. But should we be worried about cleaning other kinds of data? As it turns out, yes -- the other type of data in our table is categorical data, for example the atmosphere column:
sql = """--sql
SELECT atmosphere
FROM exoplanets
WHERE atmosphere IS NOT NULL
LIMIT 5;
"""
pd.read_sql_query(sql, conn)
Looks like this column contains text representing a chemical compound -- probably the most common chemical in that planet's atmosphere.
Notice that atmosphere is null a large share of the time:
sql = """--sql
SELECT
CAST(SUM(
CASE
WHEN atmosphere IS NULL THEN 1
ELSE 0
END
) AS FLOAT) / COUNT(*) AS atmosphere_null_rate
FROM exoplanets;
"""
pd.read_sql_query(sql, conn)
Almost all of the time! What gives?
Let's look at the distinct categorical values to figure out why:
sql = """--sql
SELECT atmosphere, COUNT(*)
FROM exoplanets
GROUP BY 1;
"""
pd.read_sql_query(sql, conn)
So, first observation: there's no option for "no atmosphere" among these values. It's likely that null here records that the exoplanet doesn't have an atmosphere. If we know that to be the case (we shouldn't just assume it!), then we should replace these nulls with a default value, "no_atmosphere":
sql = """--sql
WITH atmosphere_default_none AS(
SELECT
CASE
WHEN atmosphere IS NULL THEN 'no_atmosphere'
ELSE atmosphere
END AS atmosphere
FROM exoplanets
)
SELECT atmosphere, COUNT(*)
FROM atmosphere_default_none
GROUP BY 1;
"""
pd.read_sql_query(sql, conn)
Second, we have a problem with our categorical values: some of them are lowercased and some are not. If we leave the data like this, we'll be ignoring something crucial -- namely, that O2 and o2 are the same chemical compound.
Let's fix it:
sql = """--sql
WITH atmosphere_default_none AS(
SELECT
CASE
WHEN atmosphere IS NULL THEN 'NO_ATMOSPHERE'
ELSE UPPER(atmosphere)
END AS atmosphere
FROM exoplanets
)
SELECT atmosphere, COUNT(*)
FROM atmosphere_default_none
GROUP BY 1;
"""
pd.read_sql_query(sql, conn)
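The same two fixes -- uppercase the known values, give nulls an explicit category -- can be done in one pass on the pandas side too, which is useful when the cleaning needs to live in Python rather than SQL. A sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame with mixed-case categories and nulls.
df = pd.DataFrame({"atmosphere": ["co2", "O2", None, "o2", None]})

# Uppercase the known values, then give nulls an explicit category.
cleaned = df["atmosphere"].str.upper().fillna("NO_ATMOSPHERE")
print(sorted(cleaned.value_counts().to_dict().items()))
# [('CO2', 1), ('NO_ATMOSPHERE', 2), ('O2', 2)]
```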