Include your code in the relevant cells below. Subparts labeled as questions (Q1.1, Q1.2, etc.) should have their answers filled in or plots placed prominently, as appropriate.
On this and future homeworks, depending on the data size and your hardware configuration, experiments may take too long if you use the complete dataset, which is a problem when you need to run multiple experiments. If an experiment takes too much time, start with a smaller sample that lets your code run within a reasonable time. Once you complete all tasks, before the final submission, you can allow longer run times and run your code on the complete set. If this still takes too much time or causes your computer to freeze, it is OK to submit experiments using a sample size that is feasible for your setting (indicate it clearly in your submission). Your grade will not be affected by this type of variation in the design of your experiments.
You can switch between 2D image data and 1D vector data using the numpy functions flatten() and reshape().
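As a minimal sketch of that round trip, the snippet below flattens a 28 x 28 array into a 784-element vector and restores it. The array `image_2d` is a hypothetical placeholder filled with dummy values, not data from the homework dataset.

```python
import numpy as np

# Hypothetical 28 x 28 "image" with dummy values; real images come from the dataset.
image_2d = np.arange(28 * 28).reshape(28, 28)

# 2D image -> 1D feature vector (flatten returns a copy in row-major order).
vector_1d = image_2d.flatten()

# 1D feature vector -> 2D image, e.g. before plotting it.
restored_2d = vector_1d.reshape(28, 28)

print(vector_1d.shape, restored_2d.shape)
```

Because both calls use row-major order by default, the restored array is identical to the original.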
Q1.1: What is the number of features in the training dataset: ___
Q1.2: What is the number of samples in the training dataset: ___
Q1.3: What is the number of features in the testing dataset: ___
Q1.4: What is the number of samples in the testing dataset: ___
Q1.5: What is the dimensionality of each data sample: ___
Q2.1: Visualize the example image: ___
Q3.1: What is the distribution of each label in the initial train data (i.e. percentage of each label): ___
Q3.2: What is the distribution of each label in the reduced train data: ___
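One possible way to compute a label distribution as percentages is sketched below. The helper name `label_percentages` and the toy array `y_example` are illustrative assumptions; substitute your own label vector.

```python
import numpy as np

def label_percentages(labels):
    """Return a dict mapping each label to the percentage of samples carrying it."""
    values, counts = np.unique(labels, return_counts=True)
    percentages = 100.0 * counts / counts.sum()
    return {int(v): float(p) for v, p in zip(values, percentages)}

# Toy label vector (hypothetical), standing in for the training labels.
y_example = np.array([0, 1, 1, 3, 3, 3, 3, 9])
dist = label_percentages(y_example)
print(dist)
```

Running the same helper on the initial and the reduced label vectors gives two dictionaries that can be compared label by label.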
Q4.1: What is the distribution of each label in the initial train data (i.e. percentage of each label): ___
Q4.2: What is the distribution of each label in the reduced train data: ___
Q4.3: What are your comments/interpretation on the comparison of the results for S3 and S4: ___
Q5.1: Plot the 2D mean and std images for category 3 in training and testing sets: ___
Q5.2: Plot the 2D mean and std images for the category you selected in training and testing sets: ___
Q5.3: Comment on the differences between the mean and std images from the training and testing datasets: ___
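The mean and std images for one category can be computed pixel by pixel, as in the sketch below. It assumes the data is stored as an array `X` of flattened 784-element rows with a matching label vector `y`; the names `mean_std_images`, `X_demo`, and `y_demo` are hypothetical, and the demo data is random.

```python
import numpy as np

def mean_std_images(X, y, category, shape=(28, 28)):
    """Pixel-wise mean and std over all samples of one category, as 2D images."""
    samples = X[y == category]
    mean_img = samples.mean(axis=0).reshape(shape)
    std_img = samples.std(axis=0).reshape(shape)
    return mean_img, std_img

# Random stand-in data (assumption): 50 flattened images with labels 0..9.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(50, 784)).astype(float)
y_demo = rng.integers(0, 10, size=50)
y_demo[:5] = 3  # make sure category 3 is present in the toy data

mean_img, std_img = mean_std_images(X_demo, y_demo, category=3)
print(mean_img.shape, std_img.shape)
```

The two returned arrays are already 2D, so they can be passed directly to an image plotting routine of your choice.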
Hint: You can use the Euclidean distance as your (dis)similarity metric. Given that an image i is represented by a flattened feature vector v_i, and a second image j by v_j, the distance between the two images is the vector norm of their difference, || v_i - v_j ||. A smaller distance means the images are more similar.
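The hint can be sketched with `np.linalg.norm` as below. What the images are compared against (here called `reference`) is an assumption made for illustration, as are the function name and the toy 2-element vectors.

```python
import numpy as np

def euclidean_distances(vectors, reference):
    """Euclidean distance between each row of `vectors` and `reference`."""
    return np.linalg.norm(vectors - reference, axis=1)

# Toy flattened "images" (hypothetical); real vectors would have 784 entries.
vectors = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
reference = np.array([0.0, 0.0])

d = euclidean_distances(vectors, reference)
most_dissimilar = int(np.argmax(d))  # index with the largest distance
most_similar = int(np.argmin(d))     # index with the smallest distance
print(d, most_dissimilar, most_similar)
```

If the reference image is itself one of the rows, it will have distance 0 and should be excluded before taking the minimum.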
Q6.1: What is the index of most dissimilar image in category 3: ___
Q6.2: Plot the most dissimilar category 3 image in 2D: ___
Q6.3: Plot the most similar category 3 image in 2D: ___
Q7.1: What is the index of most dissimilar category 3 image: ___
Q7.2: What is the index of most similar category 3 image: ___
Q7.3: Did the answer change after binarization? How do you interpret this finding?: ___
Q8.1: What is the prediction accuracy using the model trained on Set1: ___
Q8.2: What is the prediction accuracy using the model trained on Set2: ___
Q9.1: What is the prediction accuracy using the model trained on the training set: ___
Q9.2: What is the prediction accuracy using the model trained on the testing set: ___
Q10.1: For k=4 what is the label that was predicted with lowest accuracy: ___
Q10.2: For k=20 what is the label that was predicted with lowest accuracy: ___
Q10.3: What is the label pair that was confused most often (i.e. class A is labeled as B, and vice versa): ___
Q10.4: Visualize 5 mislabeled samples with their actual and predicted labels
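Per-label accuracy and the most confused label pair can both be read off a confusion matrix. The sketch below builds one with plain NumPy; the function names and the toy label vectors are assumptions, and it presumes integer labels 0..n_classes-1 with every class present in `y_true`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[a, b] = number of samples with true label a predicted as b."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def per_label_accuracy(cm):
    """Fraction of each true label predicted correctly (assumes no empty rows)."""
    return cm.diagonal() / cm.sum(axis=1)

def most_confused_pair(cm):
    """Label pair (a, b) with the most errors, counting a->b plus b->a."""
    sym = cm + cm.T
    np.fill_diagonal(sym, 0)  # ignore correct predictions
    a, b = np.unravel_index(np.argmax(sym), sym.shape)
    return int(min(a, b)), int(max(a, b))

# Toy labels (hypothetical), standing in for true and predicted test labels.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 0, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = per_label_accuracy(cm)
pair = most_confused_pair(cm)
print(cm, acc, pair)
```

Indices where `y_true != y_pred` (for example via `np.flatnonzero`) give candidate mislabeled samples to visualize.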
We describe each image using a reduced set of features (compared to the n=784 initial features, one per pixel value) as follows:
Binarize the image (background=0, foreground=1)
For each image row i, find n_i, the sum of 1's in the row (28 features)
For each image column j, find n_j, the sum of 1's in the column (28 features)
Concatenate these features into a feature vector of 56 features
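The four steps above can be sketched as follows. The binarization threshold (`threshold=0`, i.e. any nonzero pixel counts as foreground) is an assumption, since the steps do not fix one; the function name and the toy image are also hypothetical.

```python
import numpy as np

def row_column_features(images, threshold=0):
    """Map images of shape (n, 28, 28) to (n, 56) row-sum and column-sum features."""
    # Step 1: binarize (background=0, foreground=1); threshold is an assumption.
    binary = (images > threshold).astype(int)
    # Step 2: number of 1's in each row (sum over the column axis).
    row_sums = binary.sum(axis=2)
    # Step 3: number of 1's in each column (sum over the row axis).
    col_sums = binary.sum(axis=1)
    # Step 4: concatenate into one 56-element vector per image.
    return np.concatenate([row_sums, col_sums], axis=1)

# Toy image (hypothetical): a short horizontal stroke in row 5, columns 10..12.
toy = np.zeros((1, 28, 28))
toy[0, 5, 10:13] = 255

features = row_column_features(toy)
print(features.shape)
```

In the resulting vector, entries 0 to 27 are the row sums and entries 28 to 55 are the column sums.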
Repeat classification experiments in S9 using this reduced feature set.
Q11.1: What is the prediction accuracy using the model trained using the train data: ___
Q11.2: What is the prediction accuracy using the model trained using the test data: ___
Example for a 6 x 6 image:
Img:
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 0 0 0
Extracted features (left, right, top, bottom):
left: 0 3 3 3 3 0
right: 0 2 2 2 2 0
top: 0 0 0 1 0 0
bottom: 0 0 0 1 0 0