You can use this notebook to try out StickyLand!
To launch StickyLand, click the note icon in the toolbar above.
# Install dependencies
%pip install numpy pandas matplotlib scikit-learn
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from collections import Counter
%config InlineBackend.figure_format = 'retina'
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    sep=', ',
    engine='python',
    header=None
)
column_names = [
    'Age', 'WorkClass', 'fnlwgt', 'Education', 'EducationNum',
    'MaritalStatus', 'Occupation', 'Relationship', 'Race', 'Gender',
    'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry', 'Income'
]
df.columns = [n.lower() for n in column_names]
df.shape
(32561, 15)
The Adult dataset has 14 features. The output variable is binary (income > 50k).
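The loading steps above can be tried offline on a small sample. The sketch below is only an illustration: it parses two rows copied from the table in this notebook out of an in-memory string instead of the UCI URL, and passes the column names through `names=` rather than assigning `df.columns` afterwards. The names `sample`, `sample_df`, and `feature_names` are made up for this example.

```python
from io import StringIO

import pandas as pd

# Two rows copied from the head of the dataset shown in this notebook.
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

column_names = [
    'Age', 'WorkClass', 'fnlwgt', 'Education', 'EducationNum',
    'MaritalStatus', 'Occupation', 'Relationship', 'Race', 'Gender',
    'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry', 'Income'
]

# Same separator and engine as the notebook; names= sets the header directly.
sample_df = pd.read_csv(
    StringIO(sample),
    sep=', ',
    engine='python',
    header=None,
    names=[n.lower() for n in column_names],
)

# Every column except the income label counts as a feature.
feature_names = [c for c in sample_df.columns if c != 'income']
print(sample_df.shape, len(feature_names))
```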
df.head()
| | age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | gender | capitalgain | capitalloss | hoursperweek | nativecountry | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
sub_df = df[df['age'] < 20]
sub_df.head()
| | age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | gender | capitalgain | capitalloss | hoursperweek | nativecountry | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26 | 19 | Private | 168294 | HS-grad | 9 | Never-married | Craft-repair | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
37 | 19 | Private | 544091 | HS-grad | 9 | Married-AF-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 25 | United-States | <=50K |
51 | 18 | Private | 226956 | HS-grad | 9 | Never-married | Other-service | Own-child | White | Female | 0 | 0 | 30 | ? | <=50K |
70 | 19 | Private | 101509 | Some-college | 10 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 32 | United-States | <=50K |
78 | 18 | Private | 309634 | 11th | 7 | Never-married | Other-service | Own-child | White | Female | 0 | 0 | 22 | United-States | <=50K |
age vs. income
Also supports $\LaTeX$!
def overlay_hist(df, c):
    """
    Plot two overlaid histograms of column `c`, one per income group.
    """
    num_unique = len(df[c].unique())

    if df[c].dtype == 'object':
        # Categorical feature: count each category separately per group
        counter_1 = Counter(df[c][df['target'] == 1])
        counter_2 = Counter(df[c][df['target'] != 1])

        bar_names = []
        bar_densities_1 = []
        bar_densities_2 = []

        for f in counter_1:
            bar_names.append(f)
            bar_densities_1.append(counter_1[f] / df.shape[0])
            bar_densities_2.append(counter_2[f] / df.shape[0])

        # Categories that only appear in the second group
        for f in counter_2:
            if f not in counter_1:
                bar_names.append(f)
                bar_densities_1.append(counter_1[f] / df.shape[0])
                bar_densities_2.append(counter_2[f] / df.shape[0])

        # Label the columns so the bar legend matches the histogram legend
        count_df = pd.DataFrame(np.c_[bar_densities_2, bar_densities_1],
                                index=bar_names, columns=['<=50k', '>50k'])
        ax = count_df.plot.bar(alpha=0.5)
        ax.set_title(c)
        ax.figure.autofmt_xdate(rotation=45)
    else:
        # Numeric feature: overlay two normalized histograms
        plt.hist(df[c][df['target'] == 1], alpha=0.5, density=True, label='>50k', bins=50)
        plt.hist(df[c][df['target'] != 1], alpha=0.5, density=True, label='<=50k', bins=50)
        plt.title(c)
        plt.legend(loc='upper right')

    print('Num of unique values: ', num_unique)
    plt.show()
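The categorical branch of overlay_hist relies on `Counter` returning 0 for keys it has never seen, so a category that appears in only one income group does not raise a `KeyError`. The sketch below reproduces just that counting step with the standard library on a handful of made-up rows; `values`, `target`, and the resulting dictionaries are hypothetical names, and the union of categories is taken with a set rather than the two loops used above.

```python
from collections import Counter

# Hypothetical toy column and labels (1 means income > 50k).
values = ['Private', 'Private', 'State-gov', 'Private', 'Never-worked']
target = [1, 0, 0, 0, 0]

n_rows = len(values)
counter_pos = Counter(v for v, t in zip(values, target) if t == 1)
counter_neg = Counter(v for v, t in zip(values, target) if t != 1)

# Every category seen in either group, in a stable order.
names = sorted(set(counter_pos) | set(counter_neg))

# Dividing by the total row count (not the group size) mirrors overlay_hist:
# the bars of both groups together sum to 1.
dens_pos = {name: counter_pos[name] / n_rows for name in names}
dens_neg = {name: counter_neg[name] / n_rows for name in names}

for name in names:
    print(name, dens_pos[name], dens_neg[name])
```

Because both groups are divided by the same total, the bar heights show each group's share of all rows, so the smaller group always looks shorter. Dividing by the group size instead would compare the shapes of the two distributions.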
Transform the target variable income into a binary variable.
df['target'] = [0 if l else 1 for l in (df['income'] == '<=50K')]
new_df = df.copy()
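The target encoding can also be written without a Python-level loop. This is a minimal sketch on a hypothetical three-value Series, assuming the label strings are exactly `'<=50K'` and `'>50K'` as in the table above. It compares the list comprehension used in this notebook with a vectorized comparison followed by `astype(int)`.

```python
import pandas as pd

# Hypothetical income labels in the same format as the dataset.
income = pd.Series(['<=50K', '>50K', '<=50K'])

# The notebook's approach: 0 for '<=50K', 1 otherwise.
target_loop = [0 if l else 1 for l in (income == '<=50K')]

# Vectorized equivalent: the boolean mask is cast straight to integers.
target_vec = (income != '<=50K').astype(int)

print(target_loop, target_vec.tolist())
```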
In this section, we delete or transform some features before training the binary classifier.
interested_feature = 'maritalstatus'
overlay_hist(df, interested_feature)
Num of unique values: 7
The distribution difference between these two groups on age is quite significant.
overlay_hist(df, 'workclass')
Num of unique values: 9
overlay_hist(df, 'fnlwgt')
Num of unique values: 21648
fnlwgt stands for "final weight". It assigns a weight to each sample so that people with similar demographic characteristics have similar weights. This feature is not useful for this model.
del new_df['fnlwgt']
overlay_hist(df, 'education')
Num of unique values: 16
overlay_hist(df, 'educationnum')
Num of unique values: 16
overlay_hist(df, 'maritalstatus')
Num of unique values: 7
overlay_hist(df, 'occupation')
Num of unique values: 15
overlay_hist(df, 'relationship')
Num of unique values: 6
overlay_hist(df, 'race')
Num of unique values: 5
overlay_hist(df, 'gender')
Num of unique values: 2
overlay_hist(df, 'capitalgain')
Num of unique values: 119
overlay_hist(df, 'capitalloss')
Num of unique values: 92
The two features capitalgain and capitalloss have many 0 values. This makes sense, because the census defines capital gain/loss as the profit/loss from asset sales (stocks or real estate), and not everyone has a capital gain or loss to report. We can convert these two variables into binary features has_capitalgain and has_capitalloss.
new_df['has_capitalgain'] = [int(t) for t in df['capitalgain'] != 0]
new_df['has_capitalloss'] = [int(t) for t in df['capitalloss'] != 0]
del new_df['capitalgain']
del new_df['capitalloss']
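To check how sparse such columns are before binarizing them, one option is to take the mean of a boolean mask. The sketch below uses a hypothetical four-row frame (`toy`), not the real data, and builds both indicator columns in one step with `add_prefix`. This is an alternative to the two list comprehensions above and produces the same 0/1 values.

```python
import pandas as pd

# Hypothetical values; most entries are zero, as in the real columns.
toy = pd.DataFrame({
    'capitalgain': [2174, 0, 0, 0],
    'capitalloss': [0, 0, 1902, 0],
})

# Share of zero entries per column (mean of a boolean mask).
zero_share = (toy == 0).mean()

# 1 where the original value is non-zero, 0 otherwise.
flags = (toy != 0).astype(int).add_prefix('has_')

print(zero_share)
print(flags)
```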
overlay_hist(df, 'hoursperweek')
Num of unique values: 94
Working 40 hours a week is typical in the dataset. Interestingly, people who earn more tend to work longer hours.
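A claim like this can also be checked numerically rather than only by eye. The sketch below shows one way to do so, comparing the mean of hoursperweek per target group with `groupby`. The five rows are invented for illustration, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows: hours worked and the binary target (1 means > 50k).
toy = pd.DataFrame({
    'hoursperweek': [40, 50, 40, 60, 35],
    'target': [0, 1, 0, 1, 0],
})

# Mean weekly hours within each income group.
mean_hours = toy.groupby('target')['hoursperweek'].mean()
print(mean_hours)
```

On the real frame, the same call would be applied to df once the target column exists.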
overlay_hist(df, 'nativecountry')
Num of unique values: 42
Most people in the dataset are native to the US. We can encode this feature as another binary variable, from_usa, to decrease the number of levels.
new_df['from_usa'] = [int(t) for t in df['nativecountry'] == 'United-States']
del new_df['nativecountry']
overlay_hist(df, 'income')
Num of unique values: 2
The plot shows that this dataset is quite imbalanced: most samples fall in the <=50K class.
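The degree of imbalance can be expressed as class shares with `value_counts(normalize=True)`. The sketch below runs on a hypothetical four-label Series; the 75/25 split is a property of this toy data, not a measurement of the Adult dataset.

```python
import pandas as pd

# Hypothetical labels: three low-income rows and one high-income row.
income = pd.Series(['<=50K', '<=50K', '<=50K', '>50K'])

# Fraction of rows in each class, largest first.
shares = income.value_counts(normalize=True)
print(shares)
```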
new_df.head()
| | age | workclass | education | educationnum | maritalstatus | occupation | relationship | race | gender | hoursperweek | income | target | has_capitalgain | has_capitalloss | from_usa |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 40 | <=50K | 0 | 1 | 0 | 1 |
1 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 13 | <=50K | 0 | 0 | 0 | 1 |
2 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 40 | <=50K | 0 | 0 | 0 | 1 |
3 | 53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 40 | <=50K | 0 | 0 | 0 | 1 |
4 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 40 | <=50K | 0 | 0 | 0 | 0 |
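The notebook imports `train_test_split` but does not call it in this section. As a sketch of how the cleaned frame might be split while keeping the imbalanced target in proportion, the example below passes `stratify=` on a hypothetical eight-row frame. The frame `toy`, the 50/50 split, and `random_state=0` are assumptions chosen for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: six negative rows and two positive rows.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37, 49, 52],
    'target': [0, 0, 0, 0, 0, 0, 1, 1],
})

X = toy.drop(columns='target')
y = toy['target']

# stratify=y keeps the class ratio the same in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```

Without `stratify`, a small or unlucky split could leave one half with no positive rows at all.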
# Install dependencies
%pip install imageio imgaug
import imageio
import numpy as np
import imgaug as ia
import imgaug.augmenters as iaa
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
counter = 0

def load_random_image(dataset):
    global counter
    images = [
        'https://i.imgur.com/xnrNBo3.png',
        'https://i.imgur.com/Ch4p4ds.png',
        'https://i.imgur.com/DUSjJ5U.png',
        'https://i.imgur.com/pfM32N4.png'
    ]
    # image = imageio.imread(np.random.choice(images))
    image = imageio.imread(images[counter % 4])
    counter += 1
    image = image[:, :, :3]
    s = 250
    aug = iaa.size.Resize([s, s])
    image = aug(image=image)
    return image
dataset = 25
def load_random_image(dataset):
    # image = imageio.imread('https://i.imgur.com/Ch4p4ds.png')
    image = imageio.imread('https://i.imgur.com/DUSjJ5U.png')
    image = image[:, :, :3]
    s = 250
    aug = iaa.size.Resize([s, s])
    image = aug(image=image)
    return image
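The slice `image[:, :, :3]` in the loaders keeps the first three channels, which drops the alpha channel when the PNG is stored as RGBA. The sketch below shows the effect on a small synthetic array instead of a downloaded image; the array `rgba` and its 4x6 size are hypothetical.

```python
import numpy as np

# Hypothetical 4x6 RGBA image: height, width, 4 channels.
rgba = np.zeros((4, 6, 4), dtype=np.uint8)
rgba[..., 0] = 200   # red channel
rgba[..., 3] = 255   # fully opaque alpha channel

# Keep only the first three channels (R, G, B).
rgb = rgba[:, :, :3]

print(rgba.shape, rgb.shape)
```

For an image that already has three channels the slice changes nothing. A 2-D grayscale array would raise an `IndexError`, so the loaders assume color input.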
def rotate(image):
    """Rotate the image"""
    aug = iaa.Affine(rotate=(-10, -9))
    image_aug = aug(image=image)
    return image_aug

def add_noise(image):
    """Add random noise on the image"""
    aug = iaa.CoarseDropout(0.02, size_percent=0.5)
    image_aug = aug(image=image)
    return image_aug

def corrupt(image):
    """Corrupt the image"""
    aug = iaa.MultiplyHueAndSaturation(mul_hue=4)
    image = aug(image=image)
    aug = iaa.MultiplyHueAndSaturation(mul_hue=4)
    image = aug(image=image)
    return image
image = load_random_image(dataset)
plt.imshow(image);
image = rotate(image)
plt.imshow(image);
image = add_noise(image)
plt.imshow(image);
image = corrupt(image)
plt.imshow(image);