Include your code in the relevant cells below. Subparts labeled as questions (Q1.1, Q1.2, etc.) should have their answers filled in or plots placed prominently, as appropriate.
On this and future homeworks, depending on the data size and your hardware configuration, experiments may take too long if you use the complete dataset. This can be challenging, as you may need to run multiple experiments. If an experiment takes too much time, start with a smaller sample that allows you to run your code within a reasonable time. Once you complete all tasks, before the final submission, you can allow longer run times and rerun your code with the complete set. However, if this still takes too much time or causes your computer to freeze, it is OK to submit experiments using a sample size that is feasible for your setting (indicate it clearly in your submission). Grading of the homework will not be affected by this type of variation in the design of your experiments.
You can switch between 2D image data and 1D vector data using the NumPy functions flatten() and reshape().
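For example (a minimal sketch; img here is a hypothetical all-zero image):
import numpy as np
img = np.zeros((28, 28))          ## a 2D image
vec = img.flatten()               ## 1D vector with 28*28 = 784 values
img_2d = vec.reshape(28, 28)      ## back to a 2D image
print(vec.shape, img_2d.shape)    ## (784,) (28, 28)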
Q1.1: What is the number of features in the training dataset: ___
Q1.2: What is the number of samples in the training dataset: ___
Q1.3: What is the number of features in the testing dataset: ___
Q1.4: What is the number of samples in the testing dataset: ___
Q1.5: What is the dimensionality of each data sample: ___
import keras
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from keras.datasets import mnist
(X_tr, Y_tr), (X_te, Y_te) = mnist.load_data()
print('X_tr: ' + str(X_tr.shape))
print('Y_tr: ' + str(Y_tr.shape))
print('X_te: ' + str(X_te.shape))
print('Y_te: ' + str(Y_te.shape))
X_tr: (60000, 28, 28)
Y_tr: (60000,)
X_te: (10000, 28, 28)
Y_te: (10000,)
print('Unique labels: ' + str(np.unique(Y_tr, return_counts=True)))
Unique labels: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8), array([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]))
Q2.1: Visualize an example image from each category: ___
## Read categories
labels = np.unique(Y_tr)
## Create 2x5 subplot
fig, ax = plt.subplots(nrows = 2, ncols = 5)
## Loop to plot each category
for i, tmpl in enumerate(labels):
    ## Select a random image with the current label
    indAll = np.where(Y_tr == tmpl)[0]
    indSel = indAll[np.random.randint(indAll.shape[0])]
    selImg = X_tr[indSel, :, :]
    ## Convert loop index to 2D index for the 2x5 plot
    a1, a2 = np.unravel_index(i, [2, 5])
    ## Plot image
    ax[a1, a2].imshow(selImg)
    ax[a1, a2].set_title('Label ' + str(tmpl))
## Show figure
plt.show()
Q3.1: What is the distribution of each label in the initial train data (i.e. percentage of each label): ___
Q3.2: What is the distribution of each label in the reduced train data: ___
## Randomly sample data
def sample_data(X, Y, p):
    ## Shuffle array indices
    num_sample = X.shape[0]
    ind_shuf = np.random.permutation(num_sample)
    ## Select p percent of the shuffled indices
    num_sel = int(num_sample / 100 * p)
    ind_sel = ind_shuf[0:num_sel]
    ## Select data
    X_out = X[ind_sel, :]
    Y_out = Y[ind_sel]
    return X_out, Y_out
X_tr_sel1, Y_tr_sel1 = sample_data(X_tr, Y_tr, 1)
X_te_sel1, Y_te_sel1 = sample_data(X_te, Y_te, 1)
## Find number of samples for each label
_, c1 = np.unique(Y_tr, return_counts=True)
_, c2 = np.unique(Y_tr_sel1, return_counts=True)
## Find percentage of samples for each label
p1 = (100*c1 / Y_tr.shape[0]).reshape(1,-1)
p2 = (100*c2 / Y_tr_sel1.shape[0]).reshape(1,-1)
print(p1)
print(p2)
## or
p12 = np.concatenate((p1, p2), axis=0)
df = pd.DataFrame(data = p12, columns = labels, index = ['X_tr','X_tr_sel1'])
df.plot.bar()
plt.show()
[[ 9.87166667 11.23666667  9.93       10.21833333  9.73666667  9.035       9.86333333 10.44166667  9.75166667  9.915     ]]
[[ 9.         11.33333333 10.66666667 12.          9.66666667  9.83333333  9.          8.83333333  8.         11.66666667]]
Q4.1: What is the distribution of each label in the initial train data (i.e. percentage of each label): ___
Q4.2: What is the distribution of each label in the reduced train data: ___
Q4.3: What are your comments on / interpretation of the comparison between the results for S3 and S4: ___
## Randomly sample data
def sample_data(X, Y, p, is_shuffle = False):
    ## Shuffle array indices (or keep the original order)
    num_sample = X.shape[0]
    if is_shuffle:
        ind_shuf = np.random.permutation(num_sample)
    else:
        ind_shuf = np.arange(0, num_sample)
    ## Select p percent of the indices
    num_sel = int(num_sample / 100 * p)
    ind_sel = ind_shuf[0:num_sel]
    ## Select data
    X_out = X[ind_sel, :]
    Y_out = Y[ind_sel]
    return X_out, Y_out
X_tr_sel2, Y_tr_sel2 = sample_data(X_tr, Y_tr, 1)
X_te_sel2, Y_te_sel2 = sample_data(X_te, Y_te, 1)
## Find number of samples for each label
_, c1 = np.unique(Y_tr, return_counts=True)
_, c2 = np.unique(Y_tr_sel2, return_counts=True)
## Find percentage of samples for each label
p1 = (100*c1 / Y_tr.shape[0]).reshape(1,-1)
p2 = (100*c2 / Y_tr_sel2.shape[0]).reshape(1,-1)
print(p1)
print(p2)
## or
p12 = np.concatenate((p1, p2), axis=0)
df = pd.DataFrame(data = p12, columns = labels, index = ['X_tr','X_tr_sel2'])
df.plot.bar()
plt.show()
[[ 9.87166667 11.23666667  9.93       10.21833333  9.73666667  9.035       9.86333333 10.44166667  9.75166667  9.915     ]]
[[ 9.66666667 13.16666667 10.66666667  9.83333333  9.83333333  8.5         9.         10.33333333  8.16666667 10.83333333]]
Q5.1: Plot the 2D mean and std images for category 3 in training and testing sets: ___
Q5.2: Plot the 2D mean and std images for the category you selected in training and testing sets: ___
Q5.3: Comment on the differences between the mean and std images from the training and testing datasets: ___
## Select images
X = X_tr_sel2
Y = Y_tr_sel2
indAll = np.where(Y == 7)[0]
imgAll = X[indAll, :, :]
print('Num sel: ' + str(indAll.shape))
print('Num img mat: ' + str(imgAll.shape))
Num sel: (62,)
Num img mat: (62, 28, 28)
## Calculate avg img
img_mean = np.mean(imgAll, axis = 0)
img_mean.shape
(28, 28)
## Show mean img
plt.imshow(img_mean)
plt.show()
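The questions above also ask for the std image; a minimal sketch following the same pattern (using imgAll and img_mean from above):
## Calculate and show the std img next to the mean img
img_std = np.std(imgAll, axis = 0)
fig, ax = plt.subplots(nrows = 1, ncols = 2)
ax[0].imshow(img_mean)
ax[0].set_title('Mean')
ax[1].imshow(img_std)
ax[1].set_title('Std')
plt.show()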
Hint: You can use the Euclidean distance as your similarity metric. Given that an image i is represented with a flattened feature vector v_i, and a second image j with v_j, the distance between these two images can be calculated as the vector norm of their difference, ||v_i - v_j||.
Q6.1: What is the index of most dissimilar image in category 3: ___
Q6.2: Plot the most dissimilar category 3 image in 2D: ___
Q6.3: Plot the most similar category 3 image in 2D: ___
## Find pixelwise "distance" of each image to the mean image
vec_mean = img_mean.flatten()
arr_d = np.zeros(indAll.shape[0])
for i, ind_sel in enumerate(indAll):
    img_sel = X[ind_sel, :, :].flatten()
    d_sel = np.sqrt(np.dot(vec_mean - img_sel, vec_mean - img_sel))
    #d_sel = np.linalg.norm(img_sel - vec_mean)
    #d_sel = np.sqrt(np.square(img_sel - vec_mean).sum())
    arr_d[i] = d_sel
ind_similar = indAll[arr_d.argmin()]
ind_dissimilar = indAll[arr_d.argmax()]
## Show similar / dissimilar images
fig, ax = plt.subplots(nrows = 1, ncols = 3)
ax[0].imshow(img_mean)
ax[1].imshow(X[ind_similar,:,:])
ax[2].imshow(X[ind_dissimilar,:,:])
plt.show()
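As a cross-check, the same distances can also be computed without the loop (a sketch using imgAll and vec_mean from above):
## Vectorized alternative: flatten all selected images, take row-wise norms
vec_all = imgAll.reshape(imgAll.shape[0], -1)
arr_d2 = np.linalg.norm(vec_all - vec_mean, axis = 1)
print(np.allclose(arr_d, arr_d2))   ## should print True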
Q7.1: What is the index of most dis-similar category 3 image: ___
Q7.2: What is the index of most similar category 3 image: ___
Q7.3: Did the answer change after binarization? How do you interpret this finding: ___
## Select images
X = X_tr_sel2
Y = Y_tr_sel2
## Binarize images
X = (X>128).astype(int)
indAll = np.where(Y == 7)[0]
imgAll = X[indAll, :, :]
print('Num sel: ' + str(indAll.shape))
print('Num img mat: ' + str(imgAll.shape))
## Calculate avg img
img_mean = np.mean(imgAll, axis = 0)
img_mean.shape
## Find pixelwise "distance" of each image to the mean image
vec_mean = img_mean.flatten()
arr_d = np.zeros(indAll.shape[0])
for i, ind_sel in enumerate(indAll):
    img_sel = X[ind_sel, :, :].flatten()
    d_sel = np.sqrt(np.dot(vec_mean - img_sel, vec_mean - img_sel))
    #d_sel = np.linalg.norm(img_sel - vec_mean)
    #d_sel = np.sqrt(np.square(img_sel - vec_mean).sum())
    arr_d[i] = d_sel
ind_similar = indAll[arr_d.argmin()]
ind_dissimilar = indAll[arr_d.argmax()]
## Show similar / dissimilar images
fig, ax = plt.subplots(nrows = 1, ncols = 3)
ax[0].imshow(img_mean)
ax[1].imshow(X[ind_similar,:,:])
ax[2].imshow(X[ind_dissimilar,:,:])
plt.show()
Num sel: (62,)
Num img mat: (62, 28, 28)
Q8.1: What is the prediction accuracy using the model trained on Set1: ___
Q8.2: What is the prediction accuracy using the model trained on Set2: ___
from sklearn.model_selection import train_test_split
## Select data
X = X_tr_sel2
Y = Y_tr_sel2
indSel = np.where((Y == 3) | (Y == 9))[0]
X = X[indSel, :, :]
Y = Y[indSel]
## Replace labels
Y[Y == 3] = 0
Y[Y == 9] = 1
## Flatten images
X = X.reshape(X.shape[0], -1)
## Create train test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn import svm
# Create the svm classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
# Train the model
clf.fit(X_train, y_train)
# Predict the label
y_pred = clf.predict(X_test)
from sklearn import metrics
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.86
# Train the model
clf.fit(X_test, y_test)
# Predict the label
y_pred = clf.predict(X_train)
print('Accuracy: ', metrics.accuracy_score(y_train, y_pred))
Accuracy: 0.9459459459459459
Q9.1: What is the prediction accuracy using the model trained on the training set: ___
Q9.2: What is the prediction accuracy using the model trained on the testing set: ___
Q10.1: For k=4 what is the label that was predicted with lowest accuracy: ___
Q10.2: For k=20 what is the label that was predicted with lowest accuracy: ___
Q10.3: What is the label pair that was confused most often (i.e. class A is labeled as B, and vice versa): ___
Q10.4: Visualize 5 mislabeled samples with their actual and predicted labels: ___
from sklearn.model_selection import train_test_split
## Select data
X = X_tr_sel2
Y = Y_tr_sel2
indSel = np.where(np.isin(Y, [1, 3, 5, 7, 9]))[0]
X = X[indSel, :, :]
Y = Y[indSel]
## Flatten images
X = X.reshape(X.shape[0], -1)
## Create train test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
arr_k = np.arange(4, 40, 4)
mat_pred = np.zeros([arr_k.shape[0], y_test.shape[0]]) ## Matrix to keep predictions from each experiment
acc_pred = np.zeros([arr_k.shape[0], 1]) ## Array to keep accuracy from each experiment
for i, k in enumerate(arr_k):
    classifier = KNeighborsClassifier(n_neighbors = k)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    mat_pred[i, :] = y_pred
    acc_pred[i] = metrics.accuracy_score(y_test, y_pred)
print(acc_pred)
[[0.79527559]
 [0.77952756]
 [0.73228346]
 [0.7480315 ]
 [0.71653543]
 [0.71653543]
 [0.70866142]
 [0.69291339]
 [0.68503937]]
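To see how accuracy changes with k, the stored values can be plotted (a minimal sketch using arr_k and acc_pred from above):
## Plot accuracy as a function of k
plt.plot(arr_k, acc_pred, marker = 'o')
plt.xlabel('k (number of neighbors)')
plt.ylabel('Accuracy')
plt.show()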
result = confusion_matrix(y_test, mat_pred[0])
print("Confusion Matrix:")
print(result)
Confusion Matrix:
[[30  0  0  1  1]
 [ 4 17  0  2  1]
 [ 1  5 15  0  0]
 [ 5  0  0 14  3]
 [ 0  0  0  3 25]]
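For Q10.3, the most often confused pair can be read off the confusion matrix programmatically (a sketch; it assumes the rows/columns follow the sorted label order used by confusion_matrix):
## Find the label pair confused most often (A predicted as B, plus B as A)
lab = np.unique(y_test)
conf_sym = result + result.T       ## symmetric confusion counts
np.fill_diagonal(conf_sym, 0)      ## ignore correct predictions
i1, i2 = np.unravel_index(conf_sym.argmax(), conf_sym.shape)
print('Most confused pair: ' + str(lab[i1]) + ' / ' + str(lab[i2]))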
mat_pred_bin = mat_pred.copy()
for i in np.arange(0, mat_pred.shape[0]):
    mat_pred_bin[i, :] = (mat_pred[i, :] == y_test).astype(int)
mat_pred_sum = mat_pred_bin.sum(axis=0)
indsel = mat_pred_sum.argmin()
print('Label / Pred : ' + str(y_test[indsel]) + ' ' + str(mat_pred[:, indsel]))
Label / Pred : 3 [7. 7. 7. 7. 7. 7. 7. 7. 7.]
plt.imshow(X_test[indsel,:].reshape([28,28]))
plt.show()
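A minimal sketch for Q10.4, assuming we take the k=4 predictions stored in mat_pred[0] and that at least 5 samples were mislabeled:
## Visualize 5 mislabeled samples with their actual (A) and predicted (P) labels
y_pred_k4 = mat_pred[0, :]
ind_mis = np.where(y_pred_k4 != y_test)[0][0:5]
fig, ax = plt.subplots(nrows = 1, ncols = 5)
for i, ind in enumerate(ind_mis):
    ax[i].imshow(X_test[ind, :].reshape([28, 28]))
    ax[i].set_title('A:' + str(y_test[ind]) + ' P:' + str(int(y_pred_k4[ind])))
plt.show()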
We describe each image using a reduced set of features (compared to the n=784 initial features, one per pixel value) as follows:
Binarize the image (background=0, foreground=1)
For each image row i, find n_i, the sum of 1's in the row (28 features)
For each image column j, find n_j, the sum of 1's in the column (28 features)
Concatenate these features into a feature vector of 56 features
Repeat classification experiments in S9 using this reduced feature set.
Q11.1: What is the prediction accuracy using the model trained using the train data: ___
Q11.2: What is the prediction accuracy using the model trained using the test data: ___
X = X_tr_sel2
X = (X > 0).astype(int)
X.shape
(600, 28, 28)
X0 = X[0,:,:]
plt.imshow(X0)
plt.show()
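## Column sums: number of foreground pixels in each image column (28 features)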
f1 = X0.sum(axis=0)
print(f1)
[ 0 0 0 0 2 2 3 5 8 9 10 12 14 15 15 13 13 14 10 8 5 3 3 2 0 0 0 0]
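## Row sums: number of foreground pixels in each image row (28 features)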
f2 = X0.sum(axis=1)
print(f2)
[ 0 0 0 0 0 12 16 16 11 9 5 4 4 6 6 6 5 4 7 8 9 10 10 10 8 0 0 0]
f12 = np.concatenate((f1,f2))
f12.shape
(56,)
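The same 56 features can be extracted for all images at once (a sketch, with X as the binarized (600, 28, 28) stack from above):
## Row-wise and column-wise sums for the whole image stack
F = np.concatenate((X.sum(axis = 1), X.sum(axis = 2)), axis = 1)
print(F.shape)   ## (600, 56)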
Example for a 6 x 6 image:
Img:
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 0 0 0
Extracted features (left, right, top, bottom):
left:   0 3 3 3 3 0
right:  0 2 2 2 2 0
top:    0 0 0 1 0 0
bottom: 0 0 0 1 0 0
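One way to reproduce these features (a minimal sketch; it assumes each feature is the offset of the first foreground pixel scanning from the given side, with argmax returning 0 for empty rows/columns, which matches the example above):
## Profile features of a binary image: first-foreground offsets from each side
B = np.array([[0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0]])
f_left = B.argmax(axis = 1)              ## scan rows from the left
f_right = B[:, ::-1].argmax(axis = 1)    ## scan rows from the right
f_top = B.argmax(axis = 0)               ## scan columns from the top
f_bottom = B[::-1, :].argmax(axis = 0)   ## scan columns from the bottom
print(np.concatenate((f_left, f_right, f_top, f_bottom)))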
Repeat classification experiments in S9 using this reduced feature set.
Q12.1: What is the prediction accuracy using the model trained using the train data: ___
Q12.2: What is the prediction accuracy using the model trained using the test data: ___