Hi folks!

Often in real world applications of data analysis, we run into the problem of missing data. This can happen due to a multitude of reasons such as:

- The data was compiled from different sources/times
- Corrupted during storage
- Certain fields were optional
- etc.

This notebook has the following sections:

- Introduction
- The Problem
- KNN Imputation
- Comparison And Application
- Summary
- Further Reading

Broadly, missing data is classified into 3 categories.

- Missing Completely At Random (MCAR)

Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random

- Missing At Random (MAR)

Missing at random (MAR) occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information

- Missing Not At Random (MNAR)

Missing not at random (MNAR) (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR

Data compilation from different sources is an example of MAR while data corruption is an example of MCAR. MNAR is not a problem we can fix with imputation because this is **non-ignorable non-response.** The only thing we can do about MNAR is to gather more information from different sources or ignore it all-together. As such we are not going to talk about MNAR anymore in this tutorial.

All of the techniques that follow are applicable only for MCAR. However, in real world scenarios, MAR is more common. As such, we will treat MAR as MCAR only which gives a reasonably good approximation in practice.

Let's start with a toy example,

\begin{align} \ y & = \sin(x) x\, \text{for $|x|<=6$} \end{align}In [ ]:

```
import matplotlib.pyplot as plt # plots
import numpy as np # vectors and matrices
import pandas as pd # tables and data manipulations
import seaborn as sns # more plots
%matplotlib inline
```

In [ ]:

```
x = np.linspace(-6, 6)
y = np.asarray([x1 * np.sin(x1) for x1 in x])
plt.scatter(x, y)
```

Let's delete some points on random to get an MCAR dataset

In [ ]:

```
missing_fraction = 0.3
indices = np.random.randint(1, len(x) - 1, size=int((1 - missing_fraction) * len(x)))
x_mcar = x[indices]
y_mcar = y[indices]
```

In [ ]:

```
plt.scatter(x_mcar, y_mcar)
```

In [ ]:

```
from sklearn.metrics import mean_squared_error as mse
```

Let's try the easiest methods first:

- Mean
- Median

In [ ]:

```
y_pred_mean = np.array(y)
for ind in list(set(np.linspace(0, len(x) - 1)) - set(indices)):
y_pred_mean[int(ind)] = np.mean(y_mcar)
plt.scatter(x, y_pred_mean)
mse(y_pred_mean, y)
```

In [ ]:

```
y_pred_median = np.array(y)
for ind in list(set(np.linspace(0, len(x) - 1)) - set(indices)):
y_pred_median[int(ind)] = np.median(y_mcar)
plt.scatter(x, y_pred_median)
mse(y_pred_median, y)
```

**Note: You need TensorFlow**

In [ ]:

```
!pip install fancyimpute
```

In [ ]:

```
import fancyimpute
```

In [ ]:

```
y_pred_knn = np.concatenate(
(np.array(y).reshape(-1, 1), np.array(x).reshape(-1, 1)), axis=1
)
for ind in indices:
y_pred_knn[int(ind)] = [float("NaN"), y_pred_knn[int(ind)][1]]
y_pred_knn = fancyimpute.KNN(k=3).fit_transform(y_pred_knn)
```

In [ ]:

```
y_pred_knn_2 = [x[0] for x in y_pred_knn]
```

In [ ]:

```
plt.scatter(x, y_pred_knn_2)
mse(y_pred_knn_2, y)
```

As we can see, fancyimpute has performed much better than mean or median methods on this toy dataset.

Next up, we get some in-depth understanding of how the KNN algorithm for fancyimpute works and apply it to some real datasets.

The assumption behind using KNN for missing values is that a point value can be approximated by the values of the points that are closest to it, based on other variables.

The fancyimpute KNN algorithm works by calculating the k nearest neighbors which have the missing features available and then weights them based on Euclidean distance from the target row. The missing value is then calculated as a weighted mean from these neighboring rows.

Below is an implementation for k = 2. Because we know our data is sorted we can code this much more efficiently. However, this isn't a general implementation. We also ignore the possibility that both of the closest neighbors can be on the same side to reduce the complexity of the code.

In [ ]:

```
y_cust = np.array(y)
for ind in indices:
low1 = ind - 1
while low1 in indices:
low1 = low1 - 1
high1 = ind + 1
while high1 in indices:
high1 = high1 + 1
d1 = 1 / (ind - low1)
d2 = 1 / (high1 - ind)
y_cust[ind] = (d1 * y_cust[low1] + d2 * y_cust[high1]) / (d1 + d2)
```

In [ ]:

```
plt.scatter(x, y_cust)
mse(y_cust, y)
```

In [ ]:

```
df = pd.read_csv("pima-indians-diabetes.csv", header=None)
```

In [ ]:

```
df.head()
```

- Number of times pregnant
- Plasma glucose concentration a 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (0 or 1)

Clearly, a person cannot have triceps sking fold thickness as 0 mm. This is a missing value and we need to replace 0 with NaN to let our algorithms know that it's a missing value.

By reading the descriptions we can be sure that columns 1,2,3,4,5,6 and 7 cannot have zero values. As such, we will mark 0s as missing.

Also, imputing functions work better with scaled features so we will use MinMaxScaler to scale every feature between 0 to 1.

In [ ]:

```
(df[[1, 2, 3, 4, 5, 6, 7]] == 0).sum()
```

In [ ]:

```
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame(
data=MinMaxScaler().fit_transform(df.values), columns=df.columns, index=df.index
)
df[[1, 2, 3, 4, 5, 6, 7]] = df[[1, 2, 3, 4, 5, 6, 7]].replace(0, float("NaN"))
```

In [ ]:

```
df.head()
```

Now, we will compare Logistic Regression using four different imputation methods:

- KNN
- Mean
- IterativeImputer
- SoftImpute

In [ ]:

```
df_mean = pd.DataFrame(
data=fancyimpute.SimpleFill().fit_transform(df.values),
columns=df.columns,
index=df.index,
)
df_iterative = pd.DataFrame(
data=fancyimpute.IterativeImputer().fit_transform(df.values),
columns=df.columns,
index=df.index,
)
df_soft = pd.DataFrame(
data=fancyimpute.SoftImpute().fit_transform(df.values),
columns=df.columns,
index=df.index,
)
```

In [ ]:

```
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
validation_split = 0.8
input_columns = [0, 1, 2, 3, 4, 5, 6, 7]
```

In [ ]:

```
logisticRegr.fit(
df_mean[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
mean_score = logisticRegr.score(
df_mean[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
mean_score
```

In [ ]:

```
logisticRegr = LogisticRegression()
logisticRegr.fit(
df_iterative[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
iter_score = logisticRegr.score(
df_iterative[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
iter_score
```

In [ ]:

```
logisticRegr = LogisticRegression()
logisticRegr.fit(
df_soft[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
soft_score = logisticRegr.score(
df_soft[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
soft_score
```

In [ ]:

```
results_knn = []
for k in range(2, 30):
df_knn = pd.DataFrame(
data=fancyimpute.KNN(k=k).fit_transform(df.values),
columns=df.columns,
index=df.index,
)
logisticRegr.fit(
df_knn[: int(len(df) * validation_split)][input_columns],
df[: int(len(df) * validation_split)][8].values,
)
results_knn.append(
logisticRegr.score(
df_knn[int(len(df) * validation_split) :][input_columns],
df[int(len(df) * validation_split) :][8].values,
)
)
```

In [ ]:

```
plt.plot(results_knn)
```

Summarising the results:

- Mean Imputation - 75.97%
- Iterative Imputer - 77.27%
- Soft Imputer - 77.27%
- KNN Imputation - 80.52%

In [ ]:

```
!pip install keras
```

In [ ]:

```
from keras.layers import Dense, Dropout
from keras.models import Sequential
model = Sequential()
model.add(Dense(10, activation="relu", input_dim=8))
model.add(Dense(10, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.fit(
df_mean[input_columns], df[8], batch_size=32, epochs=400, validation_split=0.2
)
```

In [ ]:

```
df_knn = pd.DataFrame(
data=fancyimpute.KNN(k=8).fit_transform(df.values),
columns=df.columns,
index=df.index,
)
model = Sequential()
model.add(Dense(10, activation="relu", input_dim=8))
model.add(Dense(10, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.fit(df_knn[input_columns], df[8], batch_size=32, epochs=400, validation_split=0.2)
```

In [ ]:

```
model.summary()
```

Missing data is broadly classified into three categories: MCAR, MAR and MNAR. We show the abysmal performance of mean imputation and median imputation with a toy example. Next, we create an intuitive understanding of KNN imputation and write sample code for its implementation.

Finally, we apply the techniques to Pima Indian Diabetes set and use four different imputation strategies. We show the superiority of KNN imputation technique over other imputation strategies for both logistic regression and neural networks, discrediting a common belief about imputation techniques.