Health and Lifestyle Survey Questions Tutorial

In this tutorial, we showcase how the Protodash explainer algorithm from AI Explainability 360 Toolkit implemented through the ProtodashExplainer class could be used to summarize the National Health and Nutrition Examination Survey (NHANES) datasets (Study 1) available through the Center for Disease Control and Prevention (CDC). Moreover, we also show how the algorithm could be used to distill interesting relationships between different facets of life (i.e. early childhood and income), which were found by scientists (Study 2) through decades of rigorous experimentation. This study shows that in using Protodash, one can potentially uncover such insights cheaply, which could then be reaffirmed through rigorous experimentation.

Data from this survey is typically used in epidemiological studies and health science research, which helps develop public health policy, direct and design health programs and services, and expand health knowledge. Thus, the impact of understanding these datasets and the relationships that may exist between them are far reaching for a social scientist.

Introduction to Center for Disease Control and Prevention (CDC) datasets

The NHANES CDC questionnaire datasets are surveys conducted by the organization involving thousands of civilians about various facets of their daily lives. There are 44 questionnaires that collect data about income, occupation, health, early childhood and many other behavioral and lifestyle aspects of individuals living in the US. These questionnaires are thus a rich source of information indicative of the quality of life of many civilians.

This tutorial presents two studies. We first see how a CDC questionaire answered by thousands of individuals could be summarized by looking at answers given by a few prototypical users. Next, an interesting endeavor is to uncover relationships between different aspects of life by analyzing data across the different CDC questionnaires. In the second study, we do exactly that with the help of the Protodash explainer algorithm. We show how the algorithm is able to uncover an interesting insight known only through decades of experimentation, solely from the questionnaire datasets. This by no means suggests the method as a substitute for rigorous experimentation, but showcases it as an avenue for obtaining interesting insights at low cost, which could inspire further indepth studies. The manner in which this is accomplished is by finding prototypical individuals for each of the questionnaires and then evaluating how well they represent the income questionnaire (w.r.t. the method's objective function). The more representative these prototypes are, the more that questionnaire is indicative/representative of income.

For this use case, we are selecting prototypes from specific questionnaires. Hence, the group we want to explain is the dataset itself, which — in this case — are the questionnaires. We are not training an AI model. Rather, we are trying to summarize each questionnaire, which was filled by thousands of people, by selecting a few representative individuals for each of them.

The rest of the tutorial is organized as follows:
Explore Income questionaire
Study 1: Summarize Income Questionnaire using Prototypes
Study 2: Find Questionnaire/s most representative of Income

Protodash: Fast Interpretable Prototype Selection

We now provide a brief overview of the method. The method takes as input a datapoint (or group of datapoints) that we want to explain with respect to instances in a training set belonging to the same feature space. The method then tries to minimize the maximum mean discrepancy (MMD metric) between the datapoints we want to explain and a prespecified number of instances from the training set that it will select. In other words, it will try to select training instances that have the same distribution as the datapoints we want to explain. The method does greedy selection and has quality guarantees with it also returning importance weights for the chosen prototypical training instances indicative of how similar/representative they are.

Why Protodash?

Before we showcase the two studies, we provide some motivation for using this method. The method is able to select in a deterministic fashion examples from a dataset, which we term as prototypes that represent the different segments in a dataset. For example, if we take people that answered the income questionnaire, we might find that there are three categories of people: i) those that are high earners, ii) those that are middle class and iii) those that don't earn much or are unemployed and receive unemployment benefits. Protodash would be able to find these segments by pointing to specific individuals that lie in these categories. Looking at the objective function value of Protodash, one would also be able to say that three segments is the right number here as adding one more segment may not improve the objective value by much.

Compared with other methods such as k-medoids, it has the advantage that it is deterministic and does not have randomizations as in, say, k-medoids clustering, where the centers a typically randomly initialized. So the solutions are repeatable and it picks prototypes that are representative as well as diverse, which may not be the case with standard distance metrics such as euclidean distance. Diversity is important in practical settings (viz. income example above) where we want to capture all the different segments/modes in the dataset, not missing any of the key behaviors.

Another benefit of the method is that, since it performs distribution matching between the user/users in question and those available in the training set, it could, in principle, also be applied in non-iid settings such as for time series data. Other approaches which find similar profiles using standard distance measures (viz. euclidean, cosine) do not have this property. Additionally, we can also highlight important features for the different prototypes that made them similar to the user/users in question.

Import Statements

Import relevant libraries, datasets and Protodash explainer algorithm.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder

from aix360.algorithms.protodash import ProtodashExplainer
from aix360.datasets.cdc_dataset import CDCDataset
Using TensorFlow backend.

Load CDC dataset

In [2]:
nhanes = CDCDataset()
nhanes_files = nhanes.get_csv_file_names()
(nhanesinfo, _, _) = nhanes._cdc_files_info()

Explore Income questionnaire

Now let us explore the income questionnaire dataset and find out the types of responses received in the survey. Each column in the dataset corresponds to a question and each row denotes the answers given by a respondent to those questions. Both column names and answers by respondents are encoded. For example, 'SEQN' denotes the sequence number assigned to a respondent and 'IND235' corresponds to a question about monthly family income. As seen below, in most cases a value of 1 implies "Yes" to the question, while a value of 2 implies "No." More details about the income questionaire and how questions and answers are encoded can be seen here

Column Description Values and Meaning
SEQN Respondent sequence number
INQ020 Income from wages/salaries 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ012 Income from self employment 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ030 Income from Social Security or RR 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ060 Income from other disability pension 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ080 Income from retirement/survivor pension 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ090 Income from Supplemental Security Income 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ132 Income from state/county cash assistance 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ140 Income from interest/dividends or rental 1->Yes, 2->No, 7->Refused, 9->Don't know
INQ150 Income from other sources 1->Yes, 2->No, 7->Refused, 9->Don't know
IND235 Monthly family income 1-12->Increasing income brackets, 77->Refused, 99->Don't know
INDFMMPI Family monthly poverty level index 0-5->Higher value more affluent
INDFMMPC Family monthly poverty level category 1-3->Increasing INDFMMPI brackets, 7->Refused, 9->Don't know
INQ244 Family has savings more than $5000 1->Yes, 2->No, 7->Refused, 9->Don't know
IND247 Total savings/cash assets for the family 1-6->Increasing savings brackets, 77->Refused, 99->Don't know
In [3]:
# replace encoded column names by the associated question text. 
df_inc = nhanes.get_csv_file('INQ_H.csv')
df_inc.columns[0]
dict_inc = {
'SEQN': 'Respondent sequence number', 
'INQ020': 'Income from wages/salaries',
'INQ012': 'Income from self employment',
'INQ030':'Income from Social Security or RR',
'INQ060':  'Income from other disability pension', 
'INQ080':  'Income from retirement/survivor pension',
'INQ090':  'Income from Supplemental Security Income',
'INQ132':  'Income from state/county cash assistance', 
'INQ140':  'Income from interest/dividends or rental', 
'INQ150':  'Income from other sources',
'IND235':  'Monthly family income',
'INDFMMPI':  'Family monthly poverty level index', 
'INDFMMPC':  'Family monthly poverty level category',
'INQ244':  'Family has savings more than $5000',
'IND247':  'Total savings/cash assets for the family'
}
qlist = []
for i in range(len(df_inc.columns)):
    qlist.append(dict_inc[df_inc.columns[i]])
df_inc.columns = qlist
print("Answers given by some respondents to the income questionnaire:")
df_inc.head(5).transpose()
Answers given by some respondents to the income questionnaire:
Out[3]:
0 1 2 3 4
Respondent sequence number 73557.00 73558.00 73559.00 73560.00 73561.0
Income from wages/salaries 2.00 1.00 2.00 1.00 2.0
Income from self employment 2.00 1.00 2.00 2.00 2.0
Income from Social Security or RR 1.00 1.00 1.00 2.00 1.0
Income from other disability pension 2.00 2.00 2.00 2.00 2.0
Income from retirement/survivor pension 2.00 2.00 1.00 2.00 2.0
Income from Supplemental Security Income 2.00 2.00 2.00 2.00 2.0
Income from state/county cash assistance 2.00 2.00 2.00 2.00 2.0
Income from interest/dividends or rental 2.00 1.00 1.00 2.00 2.0
Income from other sources 2.00 2.00 2.00 1.00 2.0
Monthly family income 4.00 5.00 10.00 9.00 11.0
Family monthly poverty level index 0.86 0.92 4.37 2.52 5.0
Family monthly poverty level category 1.00 1.00 3.00 3.00 3.0
Family has savings more than $5000 9.00 1.00 NaN NaN NaN
Total savings/cash assets for the family NaN NaN NaN NaN NaN

Now, to get more of a feel for the dataset, let us look at the distribution of responses for two questions related to family financial status.

In [4]:
print("Number of respondents to Income questionnaire:", df_inc.shape[0])
print("Distribution of answers to \'monthly family income\' and \'Family savings\' questions:")

fig, axes = plt.subplots(1, 2, figsize=(10,5))
fig.subplots_adjust(wspace=0.5)
hist1 = df_inc['Monthly family income'].value_counts().plot(kind='bar', ax=axes[0])
hist2 = df_inc['Family has savings more than $5000'].value_counts().plot(kind='bar', ax=axes[1])
plt.show()
Number of respondents to Income questionnaire: 10175
Distribution of answers to 'monthly family income' and 'Family savings' questions:

Observe that the majority of individuals responded with a "12" o the question related to monthly family income, which means their income is above USD 8400 as explained here. Similarly, to the question of whether the family has savings more than USD 5000, the majority of individuals responded with a "2", which means "No".

Study 1: Summarize Income Questionnaire using Prototypes

We just explored the income dataset and looked at the distribution of answers for a couple of questions. Now, consider a social scientist who would like to quickly obtain a summary report of this dataset in terms of types of people that span this dataset. Is it possible to summarize this dataset by looking at answers given by a few representative/prototypical respondents?

We now show how the Protodash algorithm can be used to obtain a few prototypical respondents (about 10 in this example) that span the diverse set of individuals answering the income questionnaire making it easy for the social scientist to summarize the dataset.

In [5]:
# convert pandas dataframe to numpy
data = df_inc.to_numpy()

#sort the rows by sequence numbers in 1st column 
idx = np.argsort(data[:, 0])  
data = data[idx, :]

# replace nan's (missing values) with 0's
original = data
original[np.isnan(original)] = 0

# delete 1st column (sequence numbers)
original = original[:, 1:]

# one hot encode all features as they are categorical
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(original)

explainer = ProtodashExplainer()

# call Protodash explainer
# S contains indices of the selected prototypes
# W contains importance weights associated with the selected prototypes 
(W, S, _) = explainer.explain(onehot_encoded, onehot_encoded, m=10) 

# sort the order of prototypes in set S
idx = np.argsort(S)
S = S[idx]
W = W[idx]
In [6]:
# Display the prototypes along with their computed weights
inc_prototypes = df_inc.iloc[S, :].copy()
# Compute normalized importance weights for prototypes
inc_prototypes["Weights of Prototypes"] = np.around(W/np.sum(W), 2) 
inc_prototypes.transpose()
Out[6]:
8 132 690 1475 2449 2912 3899 5077 6895 7475
Respondent sequence number 73565.00 73689.00 74247.00 75032.00 76006.00 76469.00 77456.00 78634.00 80452.00 81032.00
Income from wages/salaries 1.00 1.00 2.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00
Income from self employment 2.00 2.00 2.00 2.00 1.00 2.00 2.00 2.00 2.00 2.00
Income from Social Security or RR 2.00 2.00 2.00 2.00 2.00 2.00 1.00 2.00 2.00 1.00
Income from other disability pension 2.00 2.00 2.00 2.00 2.00 1.00 2.00 2.00 2.00 2.00
Income from retirement/survivor pension 2.00 2.00 2.00 2.00 2.00 1.00 2.00 1.00 2.00 2.00
Income from Supplemental Security Income 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 1.00 2.00
Income from state/county cash assistance 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 1.00 2.00
Income from interest/dividends or rental 2.00 1.00 2.00 1.00 2.00 2.00 2.00 2.00 2.00 1.00
Income from other sources 2.00 2.00 2.00 2.00 1.00 2.00 2.00 2.00 2.00 2.00
Monthly family income 12.00 11.00 8.00 4.00 1.00 6.00 7.00 2.00 5.00 3.00
Family monthly poverty level index 5.00 4.30 3.05 1.65 0.00 1.32 2.71 0.44 0.86 1.08
Family monthly poverty level category 3.00 3.00 3.00 2.00 1.00 2.00 3.00 1.00 1.00 1.00
Family has savings more than $5000 NaN NaN NaN 2.00 2.00 1.00 NaN 2.00 2.00 2.00
Total savings/cash assets for the family NaN NaN NaN 3.00 1.00 NaN NaN 1.00 1.00 1.00
Weights of Prototypes 0.18 0.07 0.09 0.06 0.15 0.07 0.12 0.07 0.09 0.10

Explanation:

The 10 people shown above (i.e. 10 prototypes) are representative of the income questionnaire according to Protodash. Firstly, in the distribution plot for family finance related questions, we saw that there roughly were five times as many people not having savings in excess of $5000 compared with others. Our prototypes also have a similar spread, which is reassuring. Also, for monthly family income, we get a more even spread over the more commonly occurring categories. This is a kind of spot check to see if our prototypes actually match the distribution of values in the dataset.

Looking at the other questions in the questionnaire and the corresponding answers given by the prototypical people above, the social scientist realizes that most people are employed (3rd question) and work for an organization earning through salary/wages (1st two questions). Most of them are also young (5th question) and fit to work (4th question). However, they don't seem to have much savings (last question). The insights that the social scientist acquired from studying the prototypes could also be conveyed to the appropriate government authorities that affect future public policy decisions.

Study 2: Find Questionnaire/s that are most representative of Income

We now move on to our second study, where we want to see how the remaining 39 questionnaires represent or relate to income. This will provide us with an idea of which lifestyle factors are likely to affect income the most. To do this we compute prototypes for each of the questionnaires and evaluate how well they represent the income questionnaire relative to our objective function.

Compute prototypes for all questionaires

This step uses Protodash explainer to compute 10 prototypes for each of the questionaires and saves these for further evaluation.

In [7]:
# Iterate through all questionnaire datasets and find 10 prototypes for each.

prototypes = {}

for i in range(len(nhanes_files)):
    
    f = nhanes_files[i]
    
    print("processing ", f)
    
    # read data to pandas dataframe
    df = nhanes.get_csv_file(f)
    
    # convert data to numpy
    data = df.to_numpy()

    #sort the rows by sequence numbers in 1st column 
    idx = np.argsort(data[:, 0])
    data = data[idx, :]

    # replace nan's with 0's.
    original = data
    original[np.isnan(original)] = 0

    # delete 1st column (contains sequence numbers)
    original = original[:, 1:]

    # one hot encode all features as they are categorical
    onehot_encoder = OneHotEncoder(sparse=False)
    onehot_encoded = onehot_encoder.fit_transform(original)

    explainer = ProtodashExplainer()

    # call Protodash explainer
    # S contains indices of the selected prototypes
    # W contains importance weights associated with the selected prototypes 

    (W, S, _) = explainer.explain(onehot_encoded, onehot_encoded, m=10) 

    prototypes[f]={}
    prototypes[f]['W']= W
    prototypes[f]['S']= S
    prototypes[f]['data'] = data
    prototypes[f]['original'] = original
processing  ACQ_H.csv
processing  ALQ_H.csv
processing  BPQ_H.csv
processing  CDQ_H.csv
processing  CFQ_H.csv
processing  CBQ_H.csv
processing  CKQ_H.csv
processing  HSQ_H.csv
processing  DEQ_H.csv
processing  DIQ_H.csv
processing  DBQ_H.csv
processing  DLQ_H.csv
processing  DUQ_H.csv
processing  ECQ_H.csv
processing  FSQ_H.csv
processing  HIQ_H.csv
processing  HEQ_H.csv
processing  HUQ_H.csv
processing  HOQ_H.csv
processing  IMQ_H.csv
processing  INQ_H.csv
processing  KIQ_U_H.csv
processing  MCQ_H.csv
processing  DPQ_H.csv
processing  OCQ_H.csv
processing  OHQ_H.csv
processing  OSQ_H.csv
processing  PAQ_H.csv
processing  PFQ_H.csv
processing  RXQASA_H.csv
processing  RHQ_H.csv
processing  SXQ_H.csv
processing  SLQ_H.csv
processing  SMQFAM_H.csv
processing  SMQRTU_H.csv
processing  SMQSHS_H.csv
processing  CSQ_H.csv
processing  VTQ_H.csv
processing  WHQ_H.csv
processing  WHQMEC_H.csv

Evaluate the set of prototypical respondents from various questionaires using their income questionaire.

Now that we have the prototypes for each of the questionnaires we evaluate how well the prototypes of each questionaire represent the Income questionnaire based on the objective function that Protodash uses. We see below a ranked list of different questionnaires with their objective function values in ascending order. The higher a questionaire appears in the list, the better its prototypes represent the income questionaire. The values on the right indicate our objective value where lower value is better.

In [8]:
#load income dataset INQ_H and its prototypes
X = prototypes['INQ_H.csv']['original']
Xdata = prototypes['INQ_H.csv']['data']

# Iterate through all questionnaires and evaluate how well their prototypes represent the income dataset. 
objs = []
for i in range(len(nhanes_files)):
        
    #load a dataset, its prototypes & weights

    f = nhanes_files[i]
    
    Ydata = prototypes[f]['data']
    S = prototypes[f]['S']
    W = prototypes[f]['W']
    
    
    # sort the order of prototypes in set S
    idx = np.argsort(S)
    S = S[idx]
    W = W[idx]
    
    # access corresponding prototypes in X. 
    XS = X[np.isin(Xdata[:, 0], Ydata[S, 0]), :]
    
    #print(Ydata[S, 0])
    #print(Xdata[np.isin(Xdata[:, 0], Ydata[S, 0]), 0])   
    
    temp = np.dot(XS, np.transpose(X))    
    u = np.sum(temp, axis=1)/temp.shape[1]
    
    K = np.dot(XS, XS.T)
    
    # evaluate prototypes on income based on our objective function with dot product as similarity measure
    obj = 0.5 * np.dot(np.dot(W.T, K), W) - np.dot(W.T, u)
    objs.append(obj)    
    

# sort the objectives (ascending order) 
index = np.argsort(np.array(objs))

# load the results in a dataframe to print
evalresult = []
for i in range(0,len(index)):    
    evalresult.append([ nhanesinfo[index[i]], objs[index[i]] ])
    
    
df_evalresult = pd.DataFrame.from_records(evalresult)
df_evalresult.columns = ['Questionaire', 'Prototypes representative of Income']
df_evalresult
Out[8]:
Questionaire Prototypes representative of Income
0 Early Childhood -96.119374
1 Physical Functioning -95.584090
2 Acculturation -95.355652
3 Disability -93.984344
4 Physical Activity -93.945023
5 Smoking - Secondhand Smoke Exposure -93.193000
6 Cognitive Functioning -93.050538
7 Sleep Disorders -92.330593
8 Diabetes -91.381703
9 Taste & Smell -90.633708
10 Smoking - Recent Tobacco Use -88.301894
11 Preventive Aspirin Use -88.292714
12 Kidney Conditions - Urology -87.817560
13 Cardiovascular Health -85.699530
14 Mental Health - Depression Screener -85.634763
15 Volatile Toxicant (Subsample) -85.580458
16 Smoking - Household Smokers -85.514068
17 Alcohol Use -84.735015
18 Dermatology -84.350459
19 Consumer Behavior -84.151348
20 Food Security -82.769481
21 Immunization -82.186591
22 Housing Characteristics -81.655651
23 Drug Use -81.187136
24 Occupation -81.005558
25 Oral Health -78.920883
26 Weight History - Youth -77.454574
27 Income -76.364734
28 Diet Behavior & Nutrition -75.799336
29 Weight History -75.136793
30 Blood Pressure & Cholesterol -74.227314
31 Reproductive Health -73.813445
32 Osteoporosis -71.564194
33 Creatine Kinase -67.288085
34 Hepatitis -67.152639
35 Medical Conditions -65.222360
36 Current Health Status -44.587781
37 Hospital Utilization & Access to Care 10.916130
38 Sexual Behavior 53.155880
39 Health Insurance 146.419436

Insight from Protodash

Looking at the table above, what is interesting is that early childhood represents income the most. The early childhood questionnaire has information about the environment that the child was born and raised in. This is consistent with a long term study (https://www.theatlantic.com/business/archive/2016/07/social-mobility-america/491240/) which talks about significant decrease in social mobility in recent times, stressing the fact that your childhood impacts how monetarily successful you are likely to be. It is interesting that our method was able to uncover this relationship with access to just these survey questionnaires. Other such insights could be obtained and ones that a social scientist or policy maker finds interesting could potentially spawn long-term studies like the one just mentioned.