Notebook

Website

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan.

Data Preparation for Machine Learning - Creating Labels¶

Python Setup¶

Back to Table of Contents

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with numpy, pandas, and psycopg2 from previous tutorials. Here we'll also be using scikit-learn to fit modeling.

In [ ]:

%pylab inline
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import time

In [ ]:

# and set our database connection parameters
db_name = "appliedda"
hostname = "10.10.2.10"

Creating Labels¶

Labels are the dependent variables, or Y variables, that we are trying to predict. In the machine learning framework, your labels are usually binary: true or false, often encoded as 1 or 0.

It is important to clearly and explicitly define the rows (aka observations) of your analysis to ensure you properly combine input datasets and populate the columns (aka features).

In this notebook, we define each row as an individual finishing a TANF spell. A spell could be participation in just one case or a series of multiple cases.

For this example, let's use January 1, 2008, as our "date of prediction" to simulate predicting return to TANF after 1 year of being off TANF. With this definition, we can consider the workforce participation of individuals who exited TANF in Q4 of 2006 as a "feature" (more on that later) in our prediction.

In [ ]:

# set database connections - use psycopg2 to more easily execute queries without returning data (eg for series of CREATE queries)
conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

In [ ]:

start_time = time.time()
sql = '''
CREATE TEMP TABLE cohort_20080101 AS
SELECT recptno, start_date, end_date
FROM il_dhs.ind_spells
WHERE end_date >= (('2008-01-01'::date - '1 year'::interval)-'3 months'::interval) AND 
        end_date < ('2008-01-01'::date - '1 year'::interval)
        AND benefit_type = 'tanf46';
COMMIT;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [ ]:

sql = '''
SELECT *
FROM cohort_20080101
'''
df = pd.read_sql(sql, conn)

In [ ]:

print('there are {} TANF spells that end in our selected study period'.format(df.shape[0]))

Outcome example: return to TANF in years 2 and 3 after 1 year of being off TANF¶

For our prediction problem we will focus on the ind_spells table, which has the start and end dates of individual level spells on three different benefit programs; TANF, SNAP, and cash assistance.

We defined our cohort above as those who exited the TANF program between 12 and 15 months prior to our prediction date. Now we will find those in our cohort who returned to TANF in the two years after our prediction date.

In [ ]:

start_time = time.time()
# only return the first spell in the event they returned more than
# once in the following 2 years.
sql = '''
CREATE TEMP TABLE cohort_returned_20080101 AS
SELECT DISTINCT ON (recptno) recptno, start_date, end_date
FROM il_dhs.ind_spells
WHERE start_date >= '2008-01-01'::date
    AND start_date < ('2008-01-01'::date + '2 year'::interval)
    AND benefit_type = 'tanf46'
    AND recptno IN (SELECT recptno FROM cohort_20080101)
ORDER BY recptno, start_date;
COMMIT;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [ ]:

# Load the  Labels into Python Pandas 
sql = '''
SELECT *
FROM cohort_returned_20080101
'''
df = pd.read_sql(sql, conn)
print('of our study cohort, {} returned to TANF'.format(df.shape[0]))

We will now create a label variable that is set to 0 if the individual does not return to TANF in the following two years after our prediction date and 1 if the individual does have another TANF spell beginning within our time horizon (2 years after the prediction date)

In [ ]:

# create label table
sql = """
CREATE TEMP TABLE label_20080101 AS
SELECT a.recptno, a.start_date, a.end_date, 
    CASE WHEN b.recptno IS NULL THEN 0 ELSE 1 END as label
FROM cohort_20080101 a
LEFT JOIN cohort_returned_20080101 b
ON a.recptno = b.recptno;
commit;
"""
cursor.execute(sql)

df = pd.read_sql("SELECT * FROM label_20080101", conn)

In [ ]:

df.shape

In [ ]:

pd.crosstab(index = df['label'], columns =  'count')

In [ ]:

# or use .value_counts(normalize=True) to show ratio
df['label'].value_counts(normalize=True)

Repeating the Label Creation Process for Prediction date¶

We will need at least one (but preferably many) training and test sets for our machine learning analysis. We will put the above steps into a function with parameters for easier reuse.

In [ ]:

conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

Writing a Function to Create Labels¶

In the above, the SQL queries were all hard coded. In ths section, we demonstrate how to use functions with parameters for the choices we made to define our observations (rows) and label (outcome variable). The complete list of parameters is given in parentheses after the def generate_labels statement. Some parameters are given a default value (like months_back=3), others (like preddate) are not.

Paramaters of generate_labels() function

preddate: date of prediction, note that this should be the first day of the quarter
months_off: months off of TANF before prediction date
months_back: months before "date of prediction - months off TANF" to define cohort
months_ahead: time horizon ahead of date of prediction to consider
schema: Your team schema, where the label table will be written. The default value is set to myschema, which you define in the cell above the function.
overwrite: Whether you want the function to overwrite tables that already exist. Before writing a table, the function will check whether this table exists, and by default will not overwrite existing tables.

In [ ]:

myschema = 'ada_tanf'

In [ ]:

def generate_labels(preddate, months_off=12, months_back=3, months_ahead=24, schema=myschema, overwrite=False):
    
    #database connection
    conn = psycopg2.connect(database=db_name, host = hostname) 
    cursor = conn.cursor()
    
    # set variables based on prediction date
    tbl_suffix = preddate.replace('-', '') #remove dashes
   
    # create full set of queries to create labels
    sql = """
    -- create the our study cohort for this prediction date
    CREATE TEMP TABLE cohort_{tbl_suffix} AS
    SELECT recptno, start_date, end_date
    FROM il_dhs.ind_spells
    WHERE end_date >= (('{pred_date}'::date - '{months_off} months'::interval)-'{months_back} months'::interval) 
        AND end_date < ('{pred_date}'::date - '{months_off} months'::interval)
            AND benefit_type = 'tanf46';
    COMMIT;
    
    -- find how many in our cohort returned to TANF
    CREATE TEMP TABLE cohort_returned_{tbl_suffix} AS
    SELECT DISTINCT ON (recptno) recptno, start_date, end_date
    FROM il_dhs.ind_spells
    WHERE start_date >= '{pred_date}'::date
        AND start_date < ('{pred_date}'::date + '{months_ahead} months'::interval)
        AND benefit_type = 'tanf46'
        AND recptno IN (SELECT recptno FROM cohort_{tbl_suffix})
    ORDER BY recptno, start_date;
    COMMIT;
    
    -- create the label table for this prediction date in our schema
    -- first DROP to handle the overwrite case
    DROP TABLE IF EXISTS {schema}.labels_{tbl_suffix};
    CREATE TABLE {schema}.labels_{tbl_suffix} AS
    SELECT a.recptno, a.start_date, a.end_date, 
        CASE WHEN b.recptno IS NULL THEN 0 ELSE 1 END as label
    FROM cohort_{tbl_suffix} a
    LEFT JOIN cohort_returned_{tbl_suffix} b
    ON a.recptno = b.recptno;
    commit;
    
    -- change owner of table to schema group
    ALTER TABLE {schema}.labels_{tbl_suffix} OWNER TO {schema}_admin;
    """.format(tbl_suffix=tbl_suffix, pred_date=preddate, months_off=months_off,
               months_back=months_back, months_ahead=months_ahead,
               schema=schema)
    
    
    # Let's check if the table already exists:
    # This query will return an empty table (with no rows) if the table does not exist
    cursor.execute('''
    SELECT * FROM pg_tables 
    WHERE tablename = 'labels_{tbl_suffix}' 
    AND schemaname = '{schema}';
    '''.format(tbl_suffix=tbl_suffix, schema=schema))
    
    # Let's write table if it does not exist (or if overwrite = True)
    if not(cursor.rowcount) or overwrite:
        print("Creating table")
        cursor.execute(sql)
    else:
        print("Table already exists")

    cursor.close()
    
    # Load table into pandas dataframe
    sql = '''
    SELECT * FROM {schema}.labels_{tbl_suffix}
    '''.format(tbl_suffix=tbl_suffix, schema=schema)
    df = pd.read_sql(sql, conn)  
    
    return df

Let's test the function with a couple different paramaters:

In [ ]:

start_time = time.time()

# Insert your team schema here:
myschema = 'ada_tanf'

# Set parameters:
preddate = '2008-01-01' # "date of prediction"
months_off=12 # months off of TANF before prediction date
months_back=3 # months before "date of prediction - months off TANF" to define cohort
months_ahead=24 # time horizon ahead of date of prediction to consider

# create labels and return DataFrame
# note: when functions have defaults only need to set parameters that change
df_test1 = generate_labels(preddate)
print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))
pd.crosstab(index = df_test1['label'], columns =  'count')

In [ ]:

start_time = time.time()
# Set parameters:
preddate = '2009-01-01' # "date of prediction"
months_off=12 # months off of TANF before prediction date
months_back=3 # months before "date of prediction - months off TANF" to define cohort
months_ahead=24 # time horizon ahead of date of prediction to consider

# create labels and return DataFrame
# note: when functions have defaults only need to set parameters that change
df_test2 = generate_labels(preddate)
print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))
pd.crosstab(index = df_test2['label'], columns =  'count')

In [ ]:

start_time = time.time()
# Set parameters:
preddate = '2010-01-01' # "date of prediction"
months_off=12 # months off of TANF before prediction date
months_back=3 # months before "date of prediction - months off TANF" to define cohort
months_ahead=24 # time horizon ahead of date of prediction to consider

# create labels and return DataFrame
df_test3 = generate_labels(preddate)
print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))
pd.crosstab(index = df_test3['label'], columns =  'count')

In [ ]:

# here's an easy way to compare proportions
df_test1['label'].value_counts(normalize=True)

In [ ]:

df_test2['label'].value_counts(normalize=True)

In [ ]:

df_test3['label'].value_counts(normalize=True)