Hi folks!
Often in real world applications of data analysis, we run into the problem of missing data. This can happen due to a multitude of reasons such as:
This notebook has the following sections:
In this tutorial, we look at the problem of missing data in data analytics. Then, we categorize the different types of missing data and briefly discuss the specific issue presented by each specific type. Finally, we look at various methods of handling data imputation and compare their accuracy on a real-world dataset with logistic regression. We also look at the validity of a commonly held assumption about imputation techniques.
Broadly, missing data is classified into 3 categories.
Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random
Missing at random (MAR) occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information
Missing not at random (MNAR) (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR
Data compilation from different sources is an example of MAR while data corruption is an example of MCAR. MNAR is not a problem we can fix with imputation because this is non-ignorable non-response. The only thing we can do about MNAR is to gather more information from different sources or ignore it all-together. As such we are not going to talk about MNAR anymore in this tutorial.
All of the techniques that follow are applicable only for MCAR. However, in real world scenarios, MAR is more common. As such, we will treat MAR as MCAR only which gives a reasonably good approximation in practice.
Let's start with a toy example,
\begin{align} \ y & = \sin(x) x\, \text{for $|x|<=6$} \end{align}import matplotlib.pyplot as plt # plots
import numpy as np # vectors and matrices
import pandas as pd # tables and data manipulations
import seaborn as sns # more plots
%matplotlib inline
x = np.linspace(-6, 6)
y = np.asarray([x1 * np.sin(x1) for x1 in x])
plt.scatter(x, y)
Let's delete some points on random to get an MCAR dataset
missing_fraction = 0.3
indices = np.random.randint(1, len(x) - 1, size=int((1 - missing_fraction) * len(x)))
x_mcar = x[indices]
y_mcar = y[indices]
plt.scatter(x_mcar, y_mcar)
Throughout this tutorial, we will use MSE as an indicator of how good an imputation technique is when we have the original dataset and accuracy on predictions when we don't
from sklearn.metrics import mean_squared_error as mse
Let's try the easiest methods first:
y_pred_mean = np.array(y)
for ind in list(set(np.linspace(0, len(x) - 1)) - set(indices)):
y_pred_mean[int(ind)] = np.mean(y_mcar)
plt.scatter(x, y_pred_mean)
mse(y_pred_mean, y)
y_pred_median = np.array(y)
for ind in list(set(np.linspace(0, len(x) - 1)) - set(indices)):
y_pred_median[int(ind)] = np.median(y_mcar)
plt.scatter(x, y_pred_median)
mse(y_pred_median, y)
Well, this seems like pretty awful. Let's see what fancyimpute has to offer. Note: You need TensorFlow
!pip install fancyimpute
import fancyimpute
y_pred_knn = np.concatenate(
(np.array(y).reshape(-1, 1), np.array(x).reshape(-1, 1)), axis=1
)
for ind in indices:
y_pred_knn[int(ind)] = [float("NaN"), y_pred_knn[int(ind)][1]]
y_pred_knn = fancyimpute.KNN(k=3).fit_transform(y_pred_knn)
y_pred_knn_2 = [x[0] for x in y_pred_knn]
plt.scatter(x, y_pred_knn_2)
mse(y_pred_knn_2, y)
As we can see, fancyimpute has performed much better than mean or median methods on this toy dataset.
Next up, we get some in-depth understanding of how the KNN algorithm for fancyimpute works and apply it to some real datasets.
In pattern recognition, the k-nearest neighbors algorithm is a non-parametric method used for classification and regression
The assumption behind using KNN for missing values is that a point value can be approximated by the values of the points that are closest to it, based on other variables.
The fancyimpute KNN algorithm works by calculating the k nearest neighbors which have the missing features available and then weights them based on Euclidean distance from the target row. The missing value is then calculated as a weighted mean from these neighboring rows.
Below is an implementation for k = 2. Because we know our data is sorted we can code this much more efficiently. However, this isn't a general implementation. We also ignore the possibility that both of the closest neighbors can be on the same side to reduce the complexity of the code.
y_cust = np.array(y)
for ind in indices:
low1 = ind - 1
while low1 in indices:
low1 = low1 - 1
high1 = ind + 1
while high1 in indices:
high1 = high1 + 1
d1 = 1 / (ind - low1)
d2 = 1 / (high1 - ind)
y_cust[ind] = (d1 * y_cust[low1] + d2 * y_cust[high1]) / (d1 + d2)
plt.scatter(x, y_cust)
mse(y_cust, y)
We will use the Pima Indians Diabetes database for our example use case. This is an example of a MAR dataset but we will treat it as MCAR to make the best out of what we have. You can download the data from - https://www.kaggle.com/kumargh/pimaindiansdiabetescsv
df = pd.read_csv("pima-indians-diabetes.csv", header=None)
df.head()
Clearly, a person cannot have triceps sking fold thickness as 0 mm. This is a missing value and we need to replace 0 with NaN to let our algorithms know that it's a missing value.
By reading the descriptions we can be sure that columns 1,2,3,4,5,6 and 7 cannot have zero values. As such, we will mark 0s as missing.
Also, imputing functions work better with scaled features so we will use MinMaxScaler to scale every feature between 0 to 1.
(df[[1, 2, 3, 4, 5, 6, 7]] == 0).sum()
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame(
data=MinMaxScaler().fit_transform(df.values), columns=df.columns, index=df.index
)
df[[1, 2, 3, 4, 5, 6, 7]] = df[[1, 2, 3, 4, 5, 6, 7]].replace(0, float("NaN"))
df.head()
fancyimpute offers many different forms of imputation methods, however, we are only comparing the four mentioned below. You can read about all of these at https://pypi.org/project/fancyimpute/
Now, we will compare Logistic Regression using four different imputation methods:
We will first construct the dataframe for the bottom three because for KNN we need to find the optimum value of the hyperparameter.
df_mean = pd.DataFrame(
data=fancyimpute.SimpleFill().fit_transform(df.values),
columns=df.columns,
index=df.index,
)
df_iterative = pd.DataFrame(
data=fancyimpute.IterativeImputer().fit_transform(df.values),
columns=df.columns,
index=df.index,
)
df_soft = pd.DataFrame(
data=fancyimpute.SoftImpute().fit_transform(df.values),
columns=df.columns,
index=df.index,
)
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
validation_split = 0.8
input_columns = [0, 1, 2, 3, 4, 5, 6, 7]
logisticRegr.fit(
df_mean[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
mean_score = logisticRegr.score(
df_mean[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
mean_score
logisticRegr = LogisticRegression()
logisticRegr.fit(
df_iterative[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
iter_score = logisticRegr.score(
df_iterative[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
iter_score
logisticRegr = LogisticRegression()
logisticRegr.fit(
df_soft[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
soft_score = logisticRegr.score(
df_soft[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
soft_score
results_knn = []
for k in range(2, 30):
df_knn = pd.DataFrame(
data=fancyimpute.KNN(k=k).fit_transform(df.values),
columns=df.columns,
index=df.index,
)
logisticRegr.fit(
df_knn[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
results_knn.append(
logisticRegr.score(
df_knn[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
)
plt.plot(results_knn)
Summarising the results:
It is often claimed that mean imputation is just as good as the fancier methods such as KNN when used in conjunction with more complicated models. To test it, we build a simple neural network and train it with mean imputed data and compare results with KNN imputed data.
!pip install keras
from keras.layers import Dense, Dropout
from keras.models import Sequential
model = Sequential()
model.add(Dense(10, activation="relu", input_dim=8))
model.add(Dense(10, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.fit(
df_mean[input_columns], df[8], batch_size=32, epochs=400, validation_split=0.2
)
df_knn = pd.DataFrame(
data=fancyimpute.KNN(k=8).fit_transform(df.values),
columns=df.columns,
index=df.index,
)
model = Sequential()
model.add(Dense(10, activation="relu", input_dim=8))
model.add(Dense(10, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.fit(df_knn[input_columns], df[8], batch_size=32, epochs=400, validation_split=0.2)
model.summary()
As evident from this highly unscientific test, the common wisdom that mean imputation is just as good is not necessarily true. Even with this overkill of a model, KNN imputed data performs significantly better than mean imputed data(0.8701 - epoch 396 vs 0.7987 - epoch 324 in this run)
Missing data is broadly classified into three categories: MCAR, MAR and MNAR. We show the abysmal performance of mean imputation and median imputation with a toy example. Next, we create an intuitive understanding of KNN imputation and write sample code for its implementation.
Finally, we apply the techniques to Pima Indian Diabetes set and use four different imputation strategies. We show the superiority of KNN imputation technique over other imputation strategies for both logistic regression and neural networks, discrediting a common belief about imputation techniques.