Include your code in the relevant cells below. Subparts labeled as questions (Q1.1, Q1.2, etc.) should have their answers filled in or plots placed prominently, as appropriate.
On this and future homeworks, depending on the data size and your hardware configuration, experiments may take too long if you use the complete dataset. This can be challenging, as you may need to run multiple experiments. If an experiment takes too much time, start with a smaller sample that allows you to run your code within a reasonable time. Once you complete all tasks, before the final submission, you can allow longer run times and rerun your code with the complete set. However, if this still takes too much time or causes your computer to freeze, it is OK to submit experiments using a sample size that is feasible for your setting (indicate it clearly in your submission). Grading of the homework will not be affected by this type of variation in the design of your experiments.
You can switch between 2D image data and 1D vector data using the NumPy functions flatten() and reshape().
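For example (a minimal sketch; img here is a hypothetical all-zero image):
import numpy as np
img = np.zeros((28, 28))          ## a 2D image
vec = img.flatten()               ## 1D vector with 28*28 = 784 values
img_2d = vec.reshape(28, 28)      ## back to a 2D image
print(vec.shape, img_2d.shape)    ## (784,) (28, 28)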
Q1.1: What is the number of features in the training dataset: ___
Q1.2: What is the number of samples in the training dataset: ___
Q1.3: What is the number of features in the testing dataset: ___
Q1.4: What is the number of samples in the testing dataset: ___
Q1.5: What is the dimensionality of each data sample: ___
import keras
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from keras.datasets import mnist
(X_tr, Y_tr), (X_te, Y_te) = mnist.load_data()
print('X_tr: ' + str(X_tr.shape))
print('Y_tr: ' + str(Y_tr.shape))
print('X_te: ' + str(X_te.shape))
print('Y_te: ' + str(Y_te.shape))
X_tr: (60000, 28, 28)
Y_tr: (60000,)
X_te: (10000, 28, 28)
Y_te: (10000,)
print('Unique labels: ' + str(np.unique(Y_tr, return_counts=True)))
Unique labels: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8), array([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]))
Q2.1: Visualize an example image from each category: ___
## Read categories
labels = np.unique(Y_tr)
## Create 2x5 subplot
fig, ax = plt.subplots(nrows = 2, ncols = 5)
## Loop to plot each category
for i, tmpl in enumerate(labels):
    ## Select a random image with the current label
    indAll = np.where(Y_tr == tmpl)[0]
    indSel = indAll[np.random.randint(indAll.shape[0])]
    selImg = X_tr[indSel, :, :]
    ## Convert loop index to 2D index for the 2x5 plot
    a1, a2 = np.unravel_index(i, [2, 5])
    ## Plot image
    ax[a1, a2].imshow(selImg)
    ax[a1, a2].set_title('Label ' + str(tmpl))
## Show figure
plt.show()
Q3.1: What is the distribution of each label in the initial train data (i.e. percentage of each label): ___
Q3.2: What is the distribution of each label in the reduced train data: ___
## Randomly sample data
def sample_data(X, Y, p):
    ## Shuffle array indices
    num_sample = X.shape[0]
    ind_shuf = np.random.permutation(num_sample)
    ## Select p percent of the shuffled indices
    num_sel = int(num_sample / 100 * p)
    ind_sel = ind_shuf[0:num_sel]
    ## Select data
    X_out = X[ind_sel, :]
    Y_out = Y[ind_sel]
    return X_out, Y_out
X_tr_sel1, Y_tr_sel1 = sample_data(X_tr, Y_tr, 1)
X_te_sel1, Y_te_sel1 = sample_data(X_te, Y_te, 1)
## Find number of samples for each label
_, c1 = np.unique(Y_tr, return_counts=True)
_, c2 = np.unique(Y_tr_sel1, return_counts=True)
## Find percentage of samples for each label
p1 = (100*c1 / Y_tr.shape[0]).reshape(1,-1)
p2 = (100*c2 / Y_tr_sel1.shape[0]).reshape(1,-1)
print(p1)
print(p2)
## or
p12 = np.concatenate((p1, p2), axis=0)
df = pd.DataFrame(data = p12, columns = labels, index = ['X_tr','X_tr_sel1'])
df.plot.bar()
plt.show()
[[ 9.87166667 11.23666667  9.93       10.21833333  9.73666667  9.035       9.86333333 10.44166667  9.75166667  9.915     ]]
[[ 9.         11.33333333 10.66666667 12.          9.66666667  9.83333333  9.          8.83333333  8.         11.66666667]]
Q4.1: What is the distribution of each label in the initial train data (i.e. percentage of each label): ___
Q4.2: What is the distribution of each label in the reduced train data: ___
Q4.3: What are your comments on / interpretation of the comparison between the results for S3 and S4: ___
## Randomly sample data
def sample_data(X, Y, p, is_shuffle = False):
    ## Shuffle array indices (or keep the original order)
    num_sample = X.shape[0]
    if is_shuffle:
        ind_shuf = np.random.permutation(num_sample)
    else:
        ind_shuf = np.arange(0, num_sample)
    ## Select p percent of the indices
    num_sel = int(num_sample / 100 * p)
    ind_sel = ind_shuf[0:num_sel]
    ## Select data
    X_out = X[ind_sel, :]
    Y_out = Y[ind_sel]
    return X_out, Y_out
X_tr_sel2, Y_tr_sel2 = sample_data(X_tr, Y_tr, 1)
X_te_sel2, Y_te_sel2 = sample_data(X_te, Y_te, 1)
## Find number of samples for each label
_, c1 = np.unique(Y_tr, return_counts=True)
_, c2 = np.unique(Y_tr_sel2, return_counts=True)
## Find percentage of samples for each label
p1 = (100*c1 / Y_tr.shape[0]).reshape(1,-1)
p2 = (100*c2 / Y_tr_sel2.shape[0]).reshape(1,-1)
print(p1)
print(p2)
## or
p12 = np.concatenate((p1, p2), axis=0)
df = pd.DataFrame(data = p12, columns = labels, index = ['X_tr','X_tr_sel2'])
df.plot.bar()
plt.show()
[[ 9.87166667 11.23666667  9.93       10.21833333  9.73666667  9.035       9.86333333 10.44166667  9.75166667  9.915     ]]
[[ 9.66666667 13.16666667 10.66666667  9.83333333  9.83333333  8.5         9.         10.33333333  8.16666667 10.83333333]]
Q5.1: Plot the 2D mean and std images for category 3 in training and testing sets: ___
Q5.2: Plot the 2D mean and std images for the category you selected in training and testing sets: ___
Q5.3: Comment on the differences between the mean and std images from the training and testing datasets: ___
## Select images
X = X_tr_sel2
Y = Y_tr_sel2
indAll = np.where(Y == 7)[0]
imgAll = X[indAll, :, :]
print('Num sel: ' + str(indAll.shape))
print('Num img mat: ' + str(imgAll.shape))
Num sel: (62,)
Num img mat: (62, 28, 28)
## Calculate avg img
img_mean = np.mean(imgAll, axis = 0)
img_mean.shape
(28, 28)
## Show mean img
plt.imshow(img_mean)
plt.show()
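The questions above also ask for the std image; a minimal sketch following the same pattern (using imgAll and img_mean from above):
## Calculate and show the std img next to the mean img
img_std = np.std(imgAll, axis = 0)
fig, ax = plt.subplots(nrows = 1, ncols = 2)
ax[0].imshow(img_mean)
ax[0].set_title('Mean')
ax[1].imshow(img_std)
ax[1].set_title('Std')
plt.show()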
Hint: You can use the Euclidean distance as your similarity metric. Given that an image i is represented with a flattened feature vector v_i, and a second image j with v_j, the distance between these two images can be calculated as the vector norm of their difference, ||v_i - v_j||.
Q6.1: What is the index of most dissimilar image in category 3: ___
Q6.2: Plot the most dissimilar category 3 image in 2D: ___
Q6.3: Plot the most similar category 3 image in 2D: ___
## Find pixelwise "distance" of each image to the mean image
vec_mean = img_mean.flatten()
arr_d = np.zeros(indAll.shape[0])
for i, ind_sel in enumerate(indAll):
    img_sel = X[ind_sel, :, :].flatten()
    d_sel = np.sqrt(np.dot(vec_mean - img_sel, vec_mean - img_sel))
    #d_sel = np.linalg.norm(img_sel - vec_mean)
    #d_sel = np.sqrt(np.square(img_sel - vec_mean).sum())
    arr_d[i] = d_sel
ind_similar = indAll[arr_d.argmin()]
ind_dissimilar = indAll[arr_d.argmax()]
## Show similar / dissimilar images
fig, ax = plt.subplots(nrows = 1, ncols = 3)
ax[0].imshow(img_mean)
ax[1].imshow(X[ind_similar,:,:])
ax[2].imshow(X[ind_dissimilar,:,:])
plt.show()
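As a cross-check, the same distances can also be computed without the loop (a sketch using imgAll and vec_mean from above):
## Vectorized alternative: flatten all selected images, take row-wise norms
vec_all = imgAll.reshape(imgAll.shape[0], -1)
arr_d2 = np.linalg.norm(vec_all - vec_mean, axis = 1)
print(np.allclose(arr_d, arr_d2))   ## should print True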
Q7.1: What is the index of most dis-similar category 3 image: ___
Q7.2: What is the index of most similar category 3 image: ___
Q7.3: Did the answer change after binarization? How do you interpret this finding: ___
## Select images
X = X_tr_sel2
Y = Y_tr_sel2
## Binarize images
X = (X>128).astype(int)
indAll = np.where(Y == 7)[0]
imgAll = X[indAll, :, :]
print('Num sel: ' + str(indAll.shape))
print('Num img mat: ' + str(imgAll.shape))
## Calculate avg img
img_mean = np.mean(imgAll, axis = 0)
img_mean.shape
## Find pixelwise "distance" of each image to the mean image
vec_mean = img_mean.flatten()
arr_d = np.zeros(indAll.shape[0])
for i, ind_sel in enumerate(indAll):
    img_sel = X[ind_sel, :, :].flatten()
    d_sel = np.sqrt(np.dot(vec_mean - img_sel, vec_mean - img_sel))
    #d_sel = np.linalg.norm(img_sel - vec_mean)
    #d_sel = np.sqrt(np.square(img_sel - vec_mean).sum())
    arr_d[i] = d_sel
ind_similar = indAll[arr_d.argmin()]
ind_dissimilar = indAll[arr_d.argmax()]
## Show similar / dissimilar images
fig, ax = plt.subplots(nrows = 1, ncols = 3)
ax[0].imshow(img_mean)
ax[1].imshow(X[ind_similar,:,:])
ax[2].imshow(X[ind_dissimilar,:,:])
plt.show()
Num sel: (62,)
Num img mat: (62, 28, 28)
Q8.1: What is the prediction accuracy using the model trained on Set1: ___
Q8.2: What is the prediction accuracy using the model trained on Set2: ___
from sklearn.model_selection import train_test_split
## Select data
X = X_tr_sel2
Y = Y_tr_sel2
indSel = np.where((Y == 3) | (Y == 9))[0]
X = X[indSel, :, :]
Y = Y[indSel]
## Replace labels
Y[Y == 3] = 0
Y[Y == 9] = 1
## Flatten images
X = X.reshape(X.shape[0], -1)
## Create train test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn import svm
# Create the svm classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
# Train the model
clf.fit(X_train, y_train)
# Predict the label
y_pred = clf.predict(X_test)
from sklearn import metrics
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.86
# Train the model
clf.fit(X_test, y_test)
# Predict the label
y_pred = clf.predict(X_train)
print('Accuracy: ', metrics.accuracy_score(y_train, y_pred))
Accuracy: 0.9459459459459459
Q9.1: What is the prediction accuracy using the model trained on the training set: ___
Q9.2: What is the prediction accuracy using the model trained on the testing set: ___
Q10.1: For k=4 what is the label that was predicted with lowest accuracy: ___
Q10.2: For k=20 what is the label that was predicted with lowest accuracy: ___
Q10.3: What is the label pair that was confused most often (i.e. class A is labeled as B, and vice versa): ___
Q10.4: Visualize 5 mislabeled samples with their actual and predicted labels: ___
from sklearn.model_selection import train_test_split
## Select data
X = X_tr_sel2
Y = Y_tr_sel2
indSel = np.where(np.isin(Y, [1, 3, 5, 7, 9]))[0]
X = X[indSel, :, :]
Y = Y[indSel]
## Flatten images
X = X.reshape(X.shape[0], -1)
## Create train test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
arr_k = np.arange(4, 40, 4)
mat_pred = np.zeros([arr_k.shape[0], y_test.shape[0]]) ## Matrix to keep predictions from each experiment
acc_pred = np.zeros([arr_k.shape[0], 1]) ## Array to keep accuracy from each experiment
for i, k in enumerate(arr_k):
    classifier = KNeighborsClassifier(n_neighbors = k)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    mat_pred[i, :] = y_pred
    acc_pred[i] = metrics.accuracy_score(y_test, y_pred)
print(acc_pred)
[[0.79527559]
 [0.77952756]
 [0.73228346]
 [0.7480315 ]
 [0.71653543]
 [0.71653543]
 [0.70866142]
 [0.69291339]
 [0.68503937]]
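To see how accuracy changes with k, the stored values can be plotted (a minimal sketch using arr_k and acc_pred from above):
## Plot accuracy as a function of k
plt.plot(arr_k, acc_pred, marker = 'o')
plt.xlabel('k (number of neighbors)')
plt.ylabel('Accuracy')
plt.show()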
result = confusion_matrix(y_test, mat_pred[0])
print("Confusion Matrix:")
print(result)
Confusion Matrix:
[[30  0  0  1  1]
 [ 4 17  0  2  1]
 [ 1  5 15  0  0]
 [ 5  0  0 14  3]
 [ 0  0  0  3 25]]
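For Q10.3, the most often confused pair can be read off the confusion matrix programmatically (a sketch; it assumes the rows/columns follow the sorted label order used by confusion_matrix):
## Find the label pair confused most often (A predicted as B, plus B as A)
lab = np.unique(y_test)
conf_sym = result + result.T       ## symmetric confusion counts
np.fill_diagonal(conf_sym, 0)      ## ignore correct predictions
i1, i2 = np.unravel_index(conf_sym.argmax(), conf_sym.shape)
print('Most confused pair: ' + str(lab[i1]) + ' / ' + str(lab[i2]))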
mat_pred_bin = mat_pred.copy()
for i in np.arange(0, mat_pred.shape[0]):
    mat_pred_bin[i, :] = (mat_pred[i, :] == y_test).astype(int)
mat_pred_sum = mat_pred_bin.sum(axis=0)
indsel = mat_pred_sum.argmin()
print('Label / Pred : ' + str(y_test[indsel]) + ' ' + str(mat_pred[:, indsel]))
Label / Pred : 3 [7. 7. 7. 7. 7. 7. 7. 7. 7.]
plt.imshow(X_test[indsel,:].reshape([28,28]))
plt.show()
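A minimal sketch for Q10.4, assuming we take the k=4 predictions stored in mat_pred[0] and that at least 5 samples were mislabeled:
## Visualize 5 mislabeled samples with their actual (A) and predicted (P) labels
y_pred_k4 = mat_pred[0, :]
ind_mis = np.where(y_pred_k4 != y_test)[0][0:5]
fig, ax = plt.subplots(nrows = 1, ncols = 5)
for i, ind in enumerate(ind_mis):
    ax[i].imshow(X_test[ind, :].reshape([28, 28]))
    ax[i].set_title('A:' + str(y_test[ind]) + ' P:' + str(int(y_pred_k4[ind])))
plt.show()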
We describe each image using a reduced set of features (compared to the n=784 initial features, one per pixel value) as follows:
Binarize the image (background=0, foreground=1)
For each image row i, find n_i, the sum of 1's in the row (28 features)
For each image column j, find n_j, the sum of 1's in the column (28 features)
Concatenate these features into a feature vector of 56 features
Repeat classification experiments in S9 using this reduced feature set.
Q11.1: What is the prediction accuracy using the model trained using the train data: ___
Q11.2: What is the prediction accuracy using the model trained using the test data: ___
X = X_tr_sel2
X = (X > 0).astype(int)
X.shape
(600, 28, 28)
X0 = X[0,:,:]
plt.imshow(X0)
plt.show()
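## Column sums: number of foreground pixels in each image column (28 features)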
f1 = X0.sum(axis=0)
print(f1)
[ 0 0 0 0 2 2 3 5 8 9 10 12 14 15 15 13 13 14 10 8 5 3 3 2 0 0 0 0]
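## Row sums: number of foreground pixels in each image row (28 features)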
f2 = X0.sum(axis=1)
print(f2)
[ 0 0 0 0 0 12 16 16 11 9 5 4 4 6 6 6 5 4 7 8 9 10 10 10 8 0 0 0]
f12 = np.concatenate((f1,f2))
f12.shape
(56,)
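The same 56 features can be extracted for all images at once (a sketch, with X as the binarized (600, 28, 28) stack from above):
## Row-wise and column-wise sums for the whole image stack
F = np.concatenate((X.sum(axis = 1), X.sum(axis = 2)), axis = 1)
print(F.shape)   ## (600, 56)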
Example for a 6 x 6 image:
Img:
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 0 0 0
Extracted features (left, right, top, bottom):
left:   0 3 3 3 3 0
right:  0 2 2 2 2 0
top:    0 0 0 1 0 0
bottom: 0 0 0 1 0 0
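One way to reproduce these features (a minimal sketch; it assumes each feature is the offset of the first foreground pixel scanning from the given side, with argmax returning 0 for empty rows/columns, which matches the example above):
## Profile features of a binary image: first-foreground offsets from each side
B = np.array([[0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0]])
f_left = B.argmax(axis = 1)              ## scan rows from the left
f_right = B[:, ::-1].argmax(axis = 1)    ## scan rows from the right
f_top = B.argmax(axis = 0)               ## scan columns from the top
f_bottom = B[::-1, :].argmax(axis = 0)   ## scan columns from the bottom
print(np.concatenate((f_left, f_right, f_top, f_bottom)))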
Repeat classification experiments in S9 using this reduced feature set.
Q12.1: What is the prediction accuracy using the model trained using the train data: ___
Q12.2: What is the prediction accuracy using the model trained using the test data: ___