mlcourse.ai – Open Machine Learning Course

Author: Archit Rungta

Tutorial

Imputing missing data with fancyimpute

Hi folks!

Often in real world applications of data analysis, we run into the problem of missing data. This can happen due to a multitude of reasons such as:

  • The data was compiled from different sources/times
  • Corrupted during storage
  • Certain fields were optional
  • etc.

This notebook has the following sections:

  1. Introduction
  2. The Problem
  3. KNN Imputation
  4. Comparison And Application
  5. Summary
  6. Further Reading

In this tutorial, we look at the problem of missing data in data analytics. Then, we categorize the different types of missing data and briefly discuss the specific issue presented by each specific type. Finally, we look at various methods of handling data imputation and compare their accuracy on a real-world dataset with logistic regression. We also look at the validity of a commonly held assumption about imputation techniques.

Introduction

Broadly, missing data is classified into 3 categories.

  • Missing Completely At Random (MCAR)

    Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random

  • Missing At Random (MAR)

    Missing at random (MAR) occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information

  • Missing Not At Random (MNAR)

    Missing not at random (MNAR) (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR

Data compilation from different sources is an example of MAR while data corruption is an example of MCAR. MNAR is not a problem we can fix with imputation because this is non-ignorable non-response. The only thing we can do about MNAR is to gather more information from different sources or ignore it all-together. As such we are not going to talk about MNAR anymore in this tutorial.

All of the techniques that follow are applicable only for MCAR. However, in real world scenarios, MAR is more common. As such, we will treat MAR as MCAR only which gives a reasonably good approximation in practice.

The Problem

Let's start with a toy example,

\begin{align} \ y & = \sin(x) x\, \text{for $|x|<=6$} \end{align}
In [ ]:
import matplotlib.pyplot as plt  # plots
import numpy as np  # vectors and matrices
import pandas as pd  # tables and data manipulations
import seaborn as sns  # more plots

%matplotlib inline
In [ ]:
x = np.linspace(-6, 6)
y = np.asarray([x1 * np.sin(x1) for x1 in x])
plt.scatter(x, y)

Let's delete some points on random to get an MCAR dataset

In [ ]:
missing_fraction = 0.3
indices = np.random.randint(1, len(x) - 1, size=int((1 - missing_fraction) * len(x)))
x_mcar = x[indices]
y_mcar = y[indices]
In [ ]:
plt.scatter(x_mcar, y_mcar)

Throughout this tutorial, we will use MSE as an indicator of how good an imputation technique is when we have the original dataset and accuracy on predictions when we don't

In [ ]:
from sklearn.metrics import mean_squared_error as mse

Let's try the easiest methods first:

  • Mean
  • Median
In [ ]:
y_pred_mean = np.array(y)
for ind in list(set(np.linspace(0, len(x) - 1)) - set(indices)):
    y_pred_mean[int(ind)] = np.mean(y_mcar)
plt.scatter(x, y_pred_mean)
mse(y_pred_mean, y)
In [ ]:
y_pred_median = np.array(y)
for ind in list(set(np.linspace(0, len(x) - 1)) - set(indices)):
    y_pred_median[int(ind)] = np.median(y_mcar)
plt.scatter(x, y_pred_median)
mse(y_pred_median, y)

Well, this seems like pretty awful. Let's see what fancyimpute has to offer. Note: You need TensorFlow

In [ ]:
!pip install fancyimpute
In [ ]:
import fancyimpute
In [ ]:
y_pred_knn = np.concatenate(
    (np.array(y).reshape(-1, 1), np.array(x).reshape(-1, 1)), axis=1
)
for ind in indices:
    y_pred_knn[int(ind)] = [float("NaN"), y_pred_knn[int(ind)][1]]
y_pred_knn = fancyimpute.KNN(k=3).fit_transform(y_pred_knn)
In [ ]:
y_pred_knn_2 = [x[0] for x in y_pred_knn]
In [ ]:
plt.scatter(x, y_pred_knn_2)
mse(y_pred_knn_2, y)

As we can see, fancyimpute has performed much better than mean or median methods on this toy dataset.

Next up, we get some in-depth understanding of how the KNN algorithm for fancyimpute works and apply it to some real datasets.

KNN Imputation

In pattern recognition, the k-nearest neighbors algorithm is a non-parametric method used for classification and regression

The assumption behind using KNN for missing values is that a point value can be approximated by the values of the points that are closest to it, based on other variables.

The fancyimpute KNN algorithm works by calculating the k nearest neighbors which have the missing features available and then weights them based on Euclidean distance from the target row. The missing value is then calculated as a weighted mean from these neighboring rows.

Below is an implementation for k = 2. Because we know our data is sorted we can code this much more efficiently. However, this isn't a general implementation. We also ignore the possibility that both of the closest neighbors can be on the same side to reduce the complexity of the code.

In [ ]:
y_cust = np.array(y)
for ind in indices:
    low1 = ind - 1
    while low1 in indices:
        low1 = low1 - 1
    high1 = ind + 1
    while high1 in indices:
        high1 = high1 + 1
    d1 = 1 / (ind - low1)
    d2 = 1 / (high1 - ind)
    y_cust[ind] = (d1 * y_cust[low1] + d2 * y_cust[high1]) / (d1 + d2)
In [ ]:
plt.scatter(x, y_cust)
mse(y_cust, y)

Comparison and Application

We will use the Pima Indians Diabetes database for our example use case. This is an example of a MAR dataset but we will treat it as MCAR to make the best out of what we have. You can download the data from - https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

In [ ]:
df = pd.read_csv("pima-indians-diabetes.csv", header=None)
In [ ]:
df.head()
  1. Number of times pregnant
  2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1)

Clearly, a person cannot have triceps sking fold thickness as 0 mm. This is a missing value and we need to replace 0 with NaN to let our algorithms know that it's a missing value.

By reading the descriptions we can be sure that columns 1,2,3,4,5,6 and 7 cannot have zero values. As such, we will mark 0s as missing.

Also, imputing functions work better with scaled features so we will use MinMaxScaler to scale every feature between 0 to 1.

In [ ]:
(df[[1, 2, 3, 4, 5, 6, 7]] == 0).sum()
In [ ]:
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(
    data=MinMaxScaler().fit_transform(df.values), columns=df.columns, index=df.index
)
df[[1, 2, 3, 4, 5, 6, 7]] = df[[1, 2, 3, 4, 5, 6, 7]].replace(0, float("NaN"))
In [ ]:
df.head()

fancyimpute offers many different forms of imputation methods, however, we are only comparing the four mentioned below. You can read about all of these at https://pypi.org/project/fancyimpute/

Now, we will compare Logistic Regression using four different imputation methods:

  • KNN
  • Mean
  • IterativeImputer
  • SoftImpute

We will first construct the dataframe for the bottom three because for KNN we need to find the optimum value of the hyperparameter.

In [ ]:
df_mean = pd.DataFrame(
    data=fancyimpute.SimpleFill().fit_transform(df.values),
    columns=df.columns,
    index=df.index,
)
df_iterative = pd.DataFrame(
    data=fancyimpute.IterativeImputer().fit_transform(df.values),
    columns=df.columns,
    index=df.index,
)
df_soft = pd.DataFrame(
    data=fancyimpute.SoftImpute().fit_transform(df.values),
    columns=df.columns,
    index=df.index,
)
In [ ]:
from sklearn.linear_model import LogisticRegression

logisticRegr = LogisticRegression()
validation_split = 0.8
input_columns = [0, 1, 2, 3, 4, 5, 6, 7]
In [ ]:
logisticRegr.fit(
    df_mean[: int(len(df) * validation_split)][input_columns],
    df[: int(len(df) * validation_split)][8].values,
)
mean_score = logisticRegr.score(
    df_mean[int(len(df) * validation_split) :][input_columns],
    df[int(len(df) * validation_split) :][8].values,
)
mean_score
In [ ]:
logisticRegr = LogisticRegression()

logisticRegr.fit(
    df_iterative[: int(len(df) * validation_split)][input_columns],
    df[: int(len(df) * validation_split)][8].values,
)
iter_score = logisticRegr.score(
    df_iterative[int(len(df) * validation_split) :][input_columns],
    df[int(len(df) * validation_split) :][8].values,
)
iter_score
In [ ]:
logisticRegr = LogisticRegression()

logisticRegr.fit(
    df_soft[: int(len(df) * validation_split)][input_columns],
    df[: int(len(df) * validation_split)][8].values,
)
soft_score = logisticRegr.score(
    df_soft[int(len(df) * validation_split) :][input_columns],
    df[int(len(df) * validation_split) :][8].values,
)
soft_score
In [ ]:
results_knn = []

for k in range(2, 30):
    df_knn = pd.DataFrame(
        data=fancyimpute.KNN(k=k).fit_transform(df.values),
        columns=df.columns,
        index=df.index,
    )
    logisticRegr.fit(
        df_knn[: int(len(df) * validation_split)][input_columns],
        df[: int(len(df) * validation_split)][8].values,
    )
    results_knn.append(
        logisticRegr.score(
            df_knn[int(len(df) * validation_split) :][input_columns],
            df[int(len(df) * validation_split) :][8].values,
        )
    )
In [ ]:
plt.plot(results_knn)

Summarising the results:

  • Mean Imputation - 75.97%
  • Iterative Imputer - 77.27%
  • Soft Imputer - 77.27%
  • KNN Imputation - 80.52%

It is often claimed that mean imputation is just as good as the fancier methods such as KNN when used in conjunction with more complicated models. To test it, we build a simple neural network and train it with mean imputed data and compare results with KNN imputed data.

In [ ]:
!pip install keras
In [ ]:
from keras.layers import Dense, Dropout
from keras.models import Sequential

model = Sequential()
model.add(Dense(10, activation="relu", input_dim=8))

model.add(Dense(10, activation="relu"))

model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.fit(
    df_mean[input_columns], df[8], batch_size=32, epochs=400, validation_split=0.2
)
In [ ]:
df_knn = pd.DataFrame(
    data=fancyimpute.KNN(k=8).fit_transform(df.values),
    columns=df.columns,
    index=df.index,
)
model = Sequential()
model.add(Dense(10, activation="relu", input_dim=8))

model.add(Dense(10, activation="relu"))

model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.fit(df_knn[input_columns], df[8], batch_size=32, epochs=400, validation_split=0.2)
In [ ]:
model.summary()

As evident from this highly unscientific test, the common wisdom that mean imputation is just as good is not necessarily true. Even with this overkill of a model, KNN imputed data performs significantly better than mean imputed data(0.8701 - epoch 396 vs 0.7987 - epoch 324 in this run)

Summary

Missing data is broadly classified into three categories: MCAR, MAR and MNAR. We show the abysmal performance of mean imputation and median imputation with a toy example. Next, we create an intuitive understanding of KNN imputation and write sample code for its implementation.

Finally, we apply the techniques to Pima Indian Diabetes set and use four different imputation strategies. We show the superiority of KNN imputation technique over other imputation strategies for both logistic regression and neural networks, discrediting a common belief about imputation techniques.

Further Reading