#!/usr/bin/env python
# coding: utf-8

# # Developing a State-of-the-Art Interpreter Model for Sign Language Communication
# Contributors: Shubham Khandale, Allen Lau, Sumaiya Uddin
#
# Source Code: https://github.com/DataScienceAndEngineering/machine-learning-dse-i210-final-project-signlanguageclassification.git
#
# Project Workspace Setup: Run /src/data/make_dataset.py to download the necessary data to execute this notebook.
#
# Table of Contents:
# 1. [Abstract](#abstract)
# 2. [Introduction](#introduction)
# 3. [Background](#background)
# 4. [Data](#data)
# 5. [Methods](#methods)
# 6. [Evaluation](#evaluation)
# 7. [Conclusion](#conclusion)
# 8. [Attribution](#attribution)
# 9. [Bibliography](#bibliography)
# 10. [Appendix](#appendix)

# # Libraries

# In[12]:

# importing libraries
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
from matplotlib.colors import ListedColormap
from scipy.stats import uniform
from IPython import display
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array
from tqdm import tqdm
from skimage.feature import hog
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.metrics import roc_curve, auc, matthews_corrcoef, cohen_kappa_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.manifold import TSNE
from sklearn.svm import SVC
import xgboost as xgb
from sklearn import model_selection

# ## Abstract
# Clear and effective communication is a vital component of society. However, for individuals who rely on sign language, interacting with those who are unfamiliar with this mode of communication can be a difficult task. A model capable of receiving a video stream from a camera and accurately classifying the signed letters can therefore prove to be an invaluable tool. This technology can be utilized in various settings, including hospitals, schools, and government offices, to facilitate seamless communication and eliminate potential communication barriers.
#
# This notebook outlines the process of identifying the best models and features to accurately and quickly classify hand signs in a live setting, where the chosen models are divided into non-deep learning and deep learning approaches. The best non-deep learning approach is a Stacking Ensemble Classifier, consisting of Logistic Regression, Support Vector Machine, Random Forest, and XGBoost estimators and a Logistic Regression meta-estimator. The features used to train this ensemble are the 23 components resulting from Linear Discriminant Analysis on the images' 784-pixel feature arrays and the first 30 principal components of the derived histogram of oriented gradients (HOG) features. The best deep learning approach is a convolutional neural network trained on 224 x 224 images with hand landmarks plotted onto the images.
# ## Introduction
# The main form of communication for the deaf and hard of hearing population is sign language. However, language barriers prevent the deaf and hearing communities from communicating with one another. This communication divide can be closed by sign language recognition, which enables automated translation of sign language into written or spoken language.
#
# The problem of sign language alphabet recognition can be formulated as a machine learning problem. The objective is to create a system that can identify hand gestures for every letter of the alphabet and correctly assign them to that letter. The intricacy and variety of sign language movements, as well as the requirement that the system be robust to changes in background, lighting, and hand orientation, make this a difficult task. An effective method for interpreting sign language can greatly improve mobility and communication for the deaf and hard of hearing population, enabling them to interact with hearing people more effectively.
#
# This report will outline the procedures and steps taken to tackle this classification problem. The high-level steps taken to create a sign language interpreter model are as follows: data loading, exploratory data analysis, feature extraction, dimensionality reduction, modeling with hyper-parameter tuning (with Naive Bayes, Logistic Regression, Random Forest, Support Vector Machines, XGBoost, Stacking Ensemble Classifier, and Convolutional Neural Networks), and evaluation. For each model, the classification report (depicting accuracy, precision, recall, f1-score, and support), the Matthews correlation coefficient (MCC), and Cohen's Kappa score are used to determine model performance. As seen in the Evaluation section, the best performing models identified are the Stacking Ensemble Classifier and the Convolutional Neural Network.
#
# Due to considerations like the speed of predictions and the accuracy in a live environment, the Sign Language Interpreter Application, called SignLingo, will leverage the Convolutional Neural Network as its core model. At a high level, the interpreter will utilize the computer or phone's camera, detect a person's hand, extract the hand, send it as an input into the trained model, and output the predicted label with a confidence score.
#
# ## Background
#
# Machine learning has driven a great deal of research and development in the field of sign language recognition and interpretation, and the difficulties associated with recognizing and interpreting sign language have been the subject of numerous studies.
#
# One example of a rule-based framework is the American Sign Language (ASL) recognition system developed in 1998 [1]. This system used a glove with sensors to capture hand movements and recognized signs based on predefined rules. While it achieved a recognition accuracy of 98%, it was limited in its ability to recognize signs performed by different users with varying hand sizes and shapes.
#
# Although there are countless examples of impactful sign language interpretation models, there is still room for improvement. Sign language recognition systems are needed for various applications, such as sign language translation systems, sign language learning platforms, and communication aids for the deaf and hard of hearing.
# In order for people who are deaf or hard of hearing and the general public to communicate effectively, these applications need to recognize sign language accurately and in real time.
#
# Despite the progress made in sign language recognition and interpretation, difficulties still exist: for example, variations in signing styles, lighting conditions, background clutter, and occlusions. In the field of applied machine learning, these issues need to be addressed, and the accuracy and robustness of sign language recognition systems need to be improved.

# ## Data
# The Sign Language MNIST dataset ([Kaggle](https://www.kaggle.com/datasets/datamunge/sign-language-mnist)) will be used for developing the Sign Language Interpreter model. It is structured as a CSV file with rows containing flattened images of pixel intensity values and the associated letter label. The American Sign Language letter database of hand gestures represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion and will not be explored in this project). The training data (27,455 instances) and test data (7,172 instances) are around half the size of the standard MNIST but otherwise identical, with a header row of label, pixel1, pixel2, ... pixel784 representing a single 28x28 pixel image with grayscale values ranging from 0-255.

# In[8]:

# function for loading the sign language mnist dataset pickle file, generated from /src/data/make_dataset.py
def load_sign_mnist(path):
    # load the pickle file and return the data
    with open(path, 'rb') as f:
        data = pickle.load(f)
    return data

# function to return a dictionary mapping numeric labels to letters
def get_label_dict(y):
    # letters
    letters = ['A','B','C','D','E','F','G','H','I','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y']
    # numbers
    y = pd.Series(y, dtype=int)
    numbers = sorted(list(y.unique()))
    # dictionary of labels
    return dict(zip(numbers, letters))

# function to find the indices given a label
def find_indices(data, label):
    # check if data is a numpy array
    if type(data) == np.ndarray:
        # return indices
        return np.where(data == label)
    # check if data is a pandas series
    elif type(data) == pd.Series:
        # return indices
        return data[data == label].index
    # else not supported in this function
    else:
        raise Exception('Not supported data type for this function.')

# load dataset
X_train, y_train, X_test, y_test = load_sign_mnist('../data/external/sign_mnist.pkl')
# get labels dictionary
label_dict = get_label_dict(y_train)
# list of letters
letters = ['A','B','C','D','E','F','G','H','I','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y']

# In[ ]:

def visualize_data(x, y, labels_dict, title):
    # visualization of dataset
    fig, ax = plt.subplots(4, 6)
    ax = ax.ravel()
    pos = 0
    # loop through each label in the dataset
    for label in range(0, 26):
        # skip labels not included in the dataset (J and Z)
        if label in [9, 25]:
            continue
        # find first index of label
        idx = find_indices(y, label)[0][0]
        # display first found image
        ax[pos].imshow(x[idx], cmap='gray')
        # set x label as dataset label
        ax[pos].set(xlabel=labels_dict[label])
        # do not show ticks
        ax[pos].set_xticks([])
        ax[pos].set_yticks([])
        # increment for subplotting
        pos += 1
    plt.suptitle(title)
    plt.close()
    return fig

# visualizing examples from the data set
fig = visualize_data(X_train, y_train, label_dict, 'Figure 1: Sign Language Dataset')
fig

# Fig. 1 displays example images for each letter in the dataset. As described above, each image is a grayscale, 28x28 image.
# The labels for this classification problem include the letters A - Z, excluding J and Z since these signs are motioned. The images consist of 784 pixel intensity values ranging from 0 - 255, where 0 is black and 255 is white; these pixel values are the features used for machine learning.
#
# The letters A, E, M, N, and S are similar. It is expected that models may have difficulty differentiating these signs. In contrast, letters like L, O, and Y are very different from the other classes; therefore, it is expected that the models will perform better when classifying these letters.

# ### Exploratory Data Analysis
# The average image, created by taking the average value of each pixel across all images in the dataset, is plotted below. Additionally, the variance image, created by taking the variance of values for each pixel across all images, is plotted.

# In[ ]:

def mean_var_image(x, title):
    # create subplot
    fig, ax = plt.subplots(1, 2)
    ax = ax.ravel()
    # reshaping arrays and finding the mean and variance of each pixel
    x = x.reshape(x.shape[0], -1)
    mean_img = np.mean(x, axis=0)
    var_img = np.var(x, axis=0)
    # plotting mean image
    ax[0].imshow(mean_img.reshape(28, 28), cmap='gray')
    ax[0].set_title('Mean Image')
    # plotting variance image
    ax[1].imshow(var_img.reshape(28, 28), cmap='gray')
    ax[1].set_title('Variance Image')
    plt.suptitle(title, y=0.85)
    plt.tight_layout()
    plt.close()
    return fig

# Plotting mean and variance image
mean_var_image(X_train, 'Figure 2: Mean and Variance Images')

# Figure 2 displays the Mean Image, which illustrates that the hands are, on average, centered with a small border on all sides. The background is generally white; however, there are some differences in the far corners of the images. The Variance Image shows that the backgrounds of the images are not consistently the same.

# In[ ]:

# Function to plot the average image for each class in a dataset.
def plot_mean_images(X, y, label_dict, title):
    # Grouping the training data by label and calculating the mean of each pixel across all observations with the same label
    mean_images = []
    for label in np.unique(y):
        label_images = X[y == label]
        mean_image = np.mean(label_images, axis=0)
        mean_images.append(mean_image)
    # Plotting the average image for each class
    fig, ax = plt.subplots(4, 6)
    ax = ax.ravel()
    pos = 0
    # iterate over the actual labels so that the letters line up after label 9 (J) is skipped
    for label, mean_image in zip(np.unique(y), mean_images):
        ax[pos].imshow(mean_image, cmap='gray')
        ax[pos].set_title(label_dict[label])  # Retrieve the corresponding letter using the label
        ax[pos].set_xticks([])
        ax[pos].set_yticks([])
        pos += 1
    plt.suptitle(title)
    plt.tight_layout()
    plt.show()

# Create the label dictionary
label_dict = get_label_dict(y_train)

# Call the function to plot the mean images
plot_mean_images(X_train, y_train, label_dict, 'Figure 3: Mean Images vs. Letters')

# Figure 3 illustrates the mean image for each letter, showing the average hand positions for each class. Letters that have extended fingers are blurrier around the fingers, showing that there is more variability in the finger positions. In contrast, letters like A and E do not have extended fingers and show less variability (blurriness). Letters with less variability may be more prone to overfitting, since their training images are more similar to one another.
#
# Next, the total counts of the individual labels are plotted to determine if there are any class imbalances that need to be addressed before modeling. The following code is used to plot Figure 4, which displays the distribution of train and test labels in the dataset.
# This plot illustrates that there are no class imbalances within the dataset, so there is no need to rebalance classes or resample.

# In[ ]:

# function for plotting the label distribution
def label_distr(X_train, y_train, X_test, y_test, title):
    train = pd.concat([pd.DataFrame(X_train.reshape(X_train.shape[0], -1)), pd.DataFrame(y_train, columns=['label'])], axis=1)
    test = pd.concat([pd.DataFrame(X_test.reshape(X_test.shape[0], -1)), pd.DataFrame(y_test, columns=['label'])], axis=1)
    fig, ax = plt.subplots(figsize=(12, 6))
    # Grouping the train and test sets by label and counting the number of observations for each label
    train_counts = train.groupby('label').size()
    test_counts = test.groupby('label').size()
    # Custom colors
    train_color = 'purple'
    test_color = 'pink'
    # Plotting the bar chart for the train & test sets
    letters = ['A','B','C','D','E','F','G','H','I','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y']
    ax.bar(letters, train_counts, color=train_color, alpha=0.5, label='Train')
    ax.bar(letters, test_counts, color=test_color, alpha=0.5, label='Test')
    # Adding legend and labels
    ax.legend()
    ax.set_xlabel('Labels')
    ax.set_ylabel('Counts')
    ax.set_title(title)
    plt.show()

label_distr(X_train, y_train, X_test, y_test, 'Figure 4: Distribution of Labels in Train and Test Sets')

# The histograms of pixel intensities are plotted below in Figure 5. It is observed that the majority of the label distributions are unimodal, left-skewed distributions. This indicates that the majority of the images have more white (or brighter) pixels than black (or darker) pixels. Additionally, most distributions have a spike in frequency at 255, which is due to the fact that most of the backgrounds in the dataset are white.

# In[ ]:

def label_histograms(X, y, label_dict):
    # creating a dataframe with the 784 pixel columns followed by a 'label' column
    data = pd.concat([pd.DataFrame(X.reshape(X.shape[0], -1)), pd.DataFrame(y.astype(int), columns=['label'])], axis=1)
    # finding unique labels, sorted in ascending order
    unique_labels = sorted(data['label'].unique())
    # finding the number of unique labels
    num_labels = len(unique_labels)
    fig, axes = plt.subplots(4, 6, figsize=(15, 5))
    subplot_index = 0
    axes = axes.ravel()
    # plotting histograms
    for i in unique_labels:
        if i == 9:
            continue
        label_data = data[data['label'] == i]
        pixel_values = label_data.iloc[:, :-1].values  # pixel columns only, excluding the trailing label column
        axes[subplot_index].hist(pixel_values.flatten(), bins=256, color='#B371C7')
        axes[subplot_index].set_title(label_dict[i])
        subplot_index += 1
    plt.suptitle('Figure 5: Pixel Intensity Distribution vs Letter')
    fig.text(0.5, 0, 'Pixel Intensity', ha='center')
    fig.text(0, 0.5, 'Frequency', va='center', rotation='vertical')
    plt.tight_layout()
    plt.show()

label_histograms(X_train, y_train, label_dict)

# ## Methods
# The first step taken to train the sign language classification model is reshaping the train and test images so that they are flattened arrays of 784 pixels per image. This step is required so that the data can be used with scikit-learn's estimators. The resultant shapes for the data are seen below.
# In[ ]:

# Reshape the data to (num_samples, 784)
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

# Print the shapes of the data
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')

# Second, normalization is performed so that techniques sensitive to the scale of the features are not affected negatively, whether through bias towards features with high-magnitude scales or through slower and more difficult convergence. Normalization of the images is performed on both the train and test data by dividing by 255.

# In[ ]:

# normalized data
X_train_norm = X_train / 255
X_test_norm = X_test / 255

# Using the preprocessed data, initial baseline modeling was performed using Naive Bayes and Logistic Regression. The procedure for initial baseline modeling utilized Randomized Grid Search Cross Validation to identify the hyperparameters that produced the best performing models. All models in this investigation are evaluated using the classification report (depicting accuracy, precision, recall, f1-score, and support), the Matthews correlation coefficient (MCC), and Cohen's Kappa score, which are computed using the evaluate_model function below.

# In[5]:

def evaluate_model(y_true, y_pred, labels):
    # Accuracy
    accuracy = accuracy_score(y_true, y_pred)
    print(f"Accuracy: {accuracy}")

    # Classification report
    print("Classification report:")
    report = classification_report(y_true, y_pred, target_names=labels, output_dict=True)
    print(classification_report(y_true, y_pred, target_names=labels))

    # Classification report bar graph
    precision = [report[label]['precision'] for label in labels]
    recall = [report[label]['recall'] for label in labels]
    f1_score = [report[label]['f1-score'] for label in labels]
    x = np.arange(len(labels))
    width = 0.3

    # Define custom sequential colormap
    sequence = ['#F7E8F6', '#F1C6E7', '#E5B0EA', '#BD83CE', '#B371C7']
    divergence = ['#f8df81', '#f6aa90', '#f6b4bf', '#B371C7', '#badfda']
    cmap = ListedColormap(sequence)

    fig, ax = plt.subplots(figsize=(12, 8))
    rects1 = ax.bar(x - width, precision, width, label='Precision', color=divergence[4])
    rects2 = ax.bar(x, recall, width, label='Recall', color=divergence[2])
    rects3 = ax.bar(x + width, f1_score, width, label='F1-Score', color=divergence[3])
    ax.set_xlabel('Letters')
    ax.set_ylabel('Score')
    ax.set_title('Classification Report')
    ax.set_xticks(x)
    ax.set_xticklabels(labels)
    ax.legend()
    plt.tight_layout()
    plt.show()

    # Matthews Correlation Coefficient (MCC)
    mcc = matthews_corrcoef(y_true, y_pred)
    print(f"MCC: {mcc}")

    # Cohen's Kappa
    kappa = cohen_kappa_score(y_true, y_pred)
    print(f"Cohen's Kappa: {kappa}")

    # Confusion Matrix
    cm = ConfusionMatrixDisplay(confusion_matrix(y_true, y_pred), display_labels=labels)
    fig, ax = plt.subplots(figsize=(16, 14))  # set figure size
    cm.plot(cmap=cmap, ax=ax)  # set color map and axis
    plt.title("Confusion Matrix")
    plt.show()
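# For reference, the two agreement metrics reported by evaluate_model have simple closed forms. Cohen's Kappa compares the observed accuracy $p_o$ with the accuracy $p_e$ expected by chance: $\kappa = \frac{p_o - p_e}{1 - p_e}$. For the multiclass case, scikit-learn computes the MCC from the confusion matrix as $\mathrm{MCC} = \frac{c \cdot s - \sum_k p_k t_k}{\sqrt{(s^2 - \sum_k p_k^2)(s^2 - \sum_k t_k^2)}}$, where $c$ is the number of correctly classified samples, $s$ is the total number of samples, $p_k$ is the number of times class $k$ was predicted, and $t_k$ is the number of times class $k$ truly occurs. Both metrics equal 1 for perfect agreement and are near 0 for chance-level predictions.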
# #### Initial Model Results
#
# For initial baseline modeling, Naive Bayes and Logistic Regression were trained on the ~27k train images. Randomized Grid Search CV is used to find the best hyperparameters for Logistic Regression. The results for Naive Bayes are seen in Appendix A, and for Logistic Regression in Appendix B.
#
# The Logistic Regression model achieved a high training accuracy (1.00) and a relatively higher test accuracy (0.67). While the test accuracy is better than the Naive Bayes model's, there is still a large performance gap between the train and test accuracies, indicating a high level of overfitting.
#
# In summary, the Naive Bayes and Logistic Regression models both exhibit strong indications of overfitting due to the large gap between train and test accuracies.
#
# #### Addressing Overfitting
# To address the overfitting issues exhibited by the initial model results, three methods were used: data augmentation, dimension reduction, and regularization.
#
# Data augmentation: Data augmentation addresses overfitting by increasing the number of images available for training, using transformations like rotation, scaling, and translation to turn the original images into augmented ones. This not only increases the number of training images, but also increases the variability (noise) in the images, allowing the model to generalize better.
#
# Dimension reduction: LDA, SVD, t-SNE, PCA, and feature extraction approaches (HOG) were used to extract useful information from the dataset while reducing the dimensionality of the data before training. This helps remove irrelevant or redundant features and focuses on the most informative ones, reducing the risk of overfitting. Moreover, it results in a less complex model, which further reduces the likelihood of overfitting.
#
# Regularization: L1 or L2 regularization techniques were used to add penalty terms on the model weights during the training process, which discourages complex, overfitted models. Tuning the regularization parameters helps find the balance between fitting the training data well and generalizing to new data.
#
# ### Data Augmentation Methodology
#
# First, data preprocessing, like reshaping the image arrays, is performed so that the data is compatible with the Keras ImageDataGenerator. A channel dimension is added to each image so that its shape becomes (height, width, channels); the generator's random_transform method expects single images of this rank-3 shape.

# In[ ]:

# define image resolution
res = (28, 28)

# Reshape the data to images, adding a channel dimension
X_train = X_train.reshape(X_train.shape[0], res[0], res[1], 1)
X_test = X_test.reshape(X_test.shape[0], res[0], res[1], 1)

# An ImageDataGenerator object is created with specific parameters for data augmentation. The parameters include rotation_range, zoom_range, width_shift_range, height_shift_range, shear_range, brightness_range, and fill_mode. These parameters define the range and type of transformations that will be applied to the images during data augmentation, such as rotation, zooming, shifting, shearing, adjusting brightness, and filling missing pixels.

# In[ ]:

# Creating an ImageDataGenerator object with data augmentation parameters
datagen = ImageDataGenerator(
    rotation_range=10,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
    brightness_range=[0.5, 1.5],
    fill_mode='nearest')

# Data augmentation is applied to the train and test sets separately by looping over each image in the original sets.
# For each image, three random transformations are generated using the ImageDataGenerator object defined earlier, and the transformed images are added to the augmented sets along with their corresponding labels.
# Finally, the augmented data is converted to numpy arrays, resulting in X_train_aug, y_train_aug, X_test_aug, and y_test_aug, which contain the augmented images and labels for both the training and test sets.

# In[ ]:

# Apply data augmentation to the training set
X_train_augmented = []
y_train_augmented = []
for i in range(X_train.shape[0]):
    img = X_train[i]
    label = y_train[i]
    # generate three random transformations per original image
    for j in range(3):
        x_augmented = datagen.random_transform(img)
        X_train_augmented.append(x_augmented)
        y_train_augmented.append(label)

# Apply data augmentation to the test set
X_test_augmented = []
y_test_augmented = []
for i in range(X_test.shape[0]):
    img = X_test[i]
    label = y_test[i]
    for j in range(3):
        x_augmented = datagen.random_transform(img)
        X_test_augmented.append(x_augmented)
        y_test_augmented.append(label)

# Convert the augmented data to numpy arrays
X_train_aug = np.array(X_train_augmented)
y_train_aug = np.array(y_train_augmented)
X_test_aug = np.array(X_test_augmented)
y_test_aug = np.array(y_test_augmented)

# The arrays containing the augmented data and the original data are reshaped to have dimensions (number of samples, 28, 28) to match the image dimensions.

# In[ ]:

# reshape arrays
X_train_aug = X_train_aug.reshape(X_train_aug.shape[0], 28, 28)
X_test_aug = X_test_aug.reshape(X_test_aug.shape[0], 28, 28)
X_train = X_train.reshape(X_train.shape[0], 28, 28)
X_test = X_test.reshape(X_test.shape[0], 28, 28)

# The augmented data is combined with the original data by concatenating the arrays along the first axis (row-wise), resulting in combined training and test sets with increased sample sizes. As seen from the shapes, data augmentation increases the train dataset size from ~27k to ~110k images (three augmented copies plus the original per image) and the test size from ~7k to ~29k. This drastically increases the amount of data that can be trained on, thus reducing the likelihood of overfitting.

# In[ ]:

# Concatenate the arrays along the first axis (i.e., row-wise)
X_train_com = np.concatenate((X_train_aug, X_train), axis=0)
y_train_com = np.concatenate((y_train_aug, y_train), axis=0)
X_test_com = np.concatenate((X_test_aug, X_test), axis=0)
y_test_com = np.concatenate((y_test_aug, y_test), axis=0)

# Print the shapes of the combined arrays
print("Shape of combined train array:", X_train_com.shape)
print("Shape of combined test array:", X_test_com.shape)

# #### EDA for Augmented Data

# In[ ]:

# visualizing examples from the augmented data set
fig = visualize_data(X_train_com, y_train_com, label_dict, 'Figure 6: Augmented Sign Language Dataset')
fig

# Fig. 6 displays example images representing each letter in the augmented dataset. These images are grayscale, with dimensions of 28x28 pixels. The augmentation process introduces variations such as changes in brightness, darkness, and rotations, resulting in a diverse set of images for each letter.

# In[ ]:

# Plotting mean and variance image
mean_var_image(X_train_com, 'Figure 7: Augmented Mean and Variance Images')

# Figure 7 displays the Mean Image of the augmented dataset, indicating that the hands are centered within the images with a small border around them. The background generally appears white, although slight variations can be observed in the far corners of the images. The Variance Image reveals that the background of the augmented images is not consistently uniform, exhibiting some degree of variation. It is noted that the mean and variance images for the augmented dataset are noisier, reflecting the ~82k additional augmented images.
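# As an illustrative check (a minimal sketch using the datagen defined above, not part of the original pipeline), a single original image can be compared side by side with one of its random transformations:

# In[ ]:

# Illustrative sketch: compare an original image with one random augmentation.
# The original (non-augmented) images sit at the end of the combined array.
img = X_train_com[-1].reshape(28, 28, 1)
augmented = datagen.random_transform(img)
fig, ax = plt.subplots(1, 2)
ax[0].imshow(img.squeeze(), cmap='gray')
ax[0].set_title('Original')
ax[1].imshow(augmented.squeeze(), cmap='gray')
ax[1].set_title('Augmented')
plt.show()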
# In[ ]:

# Create the label dictionary
label_dict = get_label_dict(y_train_com)

# Call the function to plot the mean images
plot_mean_images(X_train_com, y_train_com, label_dict, 'Figure 8: Augmented Mean Images vs. Letters')

# Figure 8 depicts the mean image for each letter in the augmented dataset, representing the average hand positions for each class. Letters with extended fingers exhibit blurriness around the fingers, indicating higher variability in finger positions. Again, it is noted that the augmented mean images for each letter are noisier than the corresponding letters in the original dataset.

# In[ ]:

# Label distribution for augmented data
label_distr(X_train_com, y_train_com, X_test_com, y_test_com, 'Figure 9: Augmented Distribution of Labels in Train and Test Sets')

# Figure 9 shows the distribution of image counts for each label after augmentation. It is noted that there are still no class imbalances in the dataset.

# ### Dimensionality Reduction
# #### LDA
#
# Dimensionality reduction is an important step in the data science pipeline because it reduces the complexity of the model by reducing the number of input features. This results in a decreased likelihood of the model overfitting on the training data. Additionally, removing noise and unimportant or redundant features can lead to better performing models. Lastly, reducing dimensionality decreases the computational and memory requirements to train and use the model.
#
# Linear Discriminant Analysis (LDA) is a linear supervised learning algorithm used for classification tasks that projects the data to a lower dimensionality, maximizing the separation between classes. This is achieved by finding the directions in the feature space that best separate the different classes of the data while minimizing the variance of the data within each class.
#
# The below code outlines the process of setting up the data so that it can be an input into scikit-learn's LinearDiscriminantAnalysis() class.

# In[ ]:

# renaming variables
X_train = X_train_com
y_train = y_train_com
X_test = X_test_com
y_test = y_test_com

# reshaping
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

# Normalizing the data
X_train_norm = X_train / 255
X_test_norm = X_test / 255

# define sklearn LDA object
lda = LinearDiscriminantAnalysis()
# fit on training data
lda.fit(X_train_norm, y_train)

# The explained variance ratio of the LDA components (linear discriminants) indicates how much information is retained at each component. As a result, the cumulative explained variance can help determine how many components to keep for dimensionality reduction.

# In[ ]:

# getting explained variance ratio from the lda model
evr = lda.explained_variance_ratio_
components = range(1, len(evr) + 1)

# plotting scree plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(x=components, height=evr, label='Explained Variance')
plt.plot(components, np.cumsum(evr), marker='.', color='orange', label='Cumulative Explained Variance')
plt.axhline(y=.95, color='r', linestyle='--', label='0.95 Explained Variance')
plt.xticks(range(1, len(evr) + 1))
plt.title('Figure 10: LDA Explained Variance')
plt.xlabel('Component')
plt.ylabel('Explained Variance')
plt.legend(fontsize=9)

# Looking at the plot above, there is an elbow at around components 3 - 5; however, these would only account for about 0.4 - 0.55 of the cumulative explained variance.
# As a result, for the purposes of modeling, all components resulting from the LDA computation will be used. Since LDA produces at most (number of classes - 1) components, the 24 classes yield 23 linear discriminants.
#
# This results in a dimensionality reduction from 784 to 23.

# In[ ]:

# transform the training and testing data with the fitted LDA model
X_train_lda = lda.transform(X_train_norm)
X_test_lda = lda.transform(X_test_norm)

# In[ ]:

fig, ax = plt.subplots(figsize=(8, 8))
ax = sns.scatterplot(x=X_train_lda[:, 0], y=X_train_lda[:, 1], hue=y_train, palette='pastel')
handler, _ = ax.get_legend_handles_labels()
plt.legend(handler, letters, bbox_to_anchor=(1, 1))
plt.title('Figure 11: 2D Embedding of Sign Language Images')
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Linear Discriminant 2')

# Plotting linear discriminants 1 and 2 from the LDA computation, it is observed that they do reasonably well at separating certain letters from others. For example, X and Y are separated well from the other letters using the first two linear discriminants. The other letters likely require more components to achieve a clearer separation between the classes.

# #### HOG
# The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientations in localized portions of an image.
#
# The code below defines the parameters for HOG feature extraction, including the number of orientations, pixels per cell, and cells per block. The extract_features function is then defined to extract HOG features from a single image using these parameters. Finally, the function is applied to all images in the training and test sets, resulting in X_train_features and X_test_features arrays that contain the extracted HOG features for each image.
#
# In summary, this code performs HOG feature extraction on the normalized images in the training and test sets using predefined parameters, producing arrays of HOG features for further analysis and modeling.

# In[ ]:

# Define the HOG parameters
orientations = 9
pixels_per_cell = (2, 2)
cells_per_block = (1, 1)

# Function to extract HOG features from a single image
def extract_features(img):
    features = hog(img, orientations=orientations,
                   pixels_per_cell=pixels_per_cell,
                   cells_per_block=cells_per_block,
                   visualize=False, transform_sqrt=True,
                   feature_vector=True, block_norm='L2-Hys')
    return features

# In[ ]:

# Apply the extract_features function to all images in X_train and X_test
X_train_features = np.array([extract_features(img.reshape((28, 28))) for img in X_train_norm])
X_test_features = np.array([extract_features(img.reshape((28, 28))) for img in X_test_norm])

print(f"X_train_features Shape: {X_train_features.shape}")
print(f"X_test_features Shape: {X_test_features.shape}")
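# As an illustrative aside (a minimal sketch, assuming the same HOG parameters defined above), the descriptor can also be rendered for a single training image to sanity-check what the gradients capture; hog_image below is the rendered gradient map returned by scikit-image when visualize=True.

# In[ ]:

# Illustrative sketch: visualize the HOG representation of one example image
example = X_train_norm[0].reshape((28, 28))
_, hog_image = hog(example, orientations=orientations,
                   pixels_per_cell=pixels_per_cell,
                   cells_per_block=cells_per_block,
                   visualize=True, transform_sqrt=True, block_norm='L2-Hys')
fig, ax = plt.subplots(1, 2)
ax[0].imshow(example, cmap='gray')
ax[0].set_title('Original')
ax[1].imshow(hog_image, cmap='gray')
ax[1].set_title('HOG')
plt.show()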
# Because the HOG feature engineering method creates feature arrays with 1764 dimensions, a dimensionality reduction technique is needed to limit the number of features being trained on and reduce the likelihood of overfitting. The code below performs Principal Component Analysis (PCA) on the extracted HOG features from the train and test sets. It reduces the dimensionality of the feature vectors to 30 components and transforms the data accordingly. The shapes of hog_test_pca and hog_train_pca are then printed to show the new dimensions of the transformed feature vectors.

# In[ ]:

# perform PCA on the HOG features
pca = PCA(n_components=30)
# fit and transform on training data
hog_train_pca = pca.fit_transform(X_train_features)
# transform on testing data
hog_test_pca = pca.transform(X_test_features)

# printing shapes
print(hog_test_pca.shape)
print(hog_train_pca.shape)

# In[ ]:

# getting explained variance ratio from the pca model
evr = pca.explained_variance_ratio_
components = range(1, len(evr) + 1)

# plotting scree plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(x=components, height=evr, label='Explained Variance')
plt.plot(components, np.cumsum(evr), marker='.', color='orange', label='Cumulative Explained Variance')
plt.axhline(y=.95, color='r', linestyle='--', label='0.95 Explained Variance')
plt.xticks(range(1, len(evr) + 1))
plt.title('Figure 12: PCA Explained Variance (HOG)')
plt.xlabel('Component')
plt.ylabel('Explained Variance')
plt.legend(fontsize=9)

# The figure above shows the explained variance ratio and cumulative explained variance for each principal component in the PCA analysis. It helps visualize the amount of variance explained by each component and the cumulative variance explained as more components are considered. The red dashed line represents the threshold of 0.95 explained variance, indicating the number of components needed to capture at least 95% of the total variance. However, since the explained variance increases slowly, it would require almost all components to reach the 95% mark. As a result, only the first 30 components are used, a number determined from an accuracy vs. number of PCA components graph using the Logistic Regression baseline model. 30 is chosen as the point where the test accuracy gains begin to level off. The code and figure illustrating this are shown below.

# In[ ]:

train_acc = []
test_acc = []
for num_components in range(6, 60, 3):
    pca = PCA(n_components=num_components)
    hog_train_pca = pca.fit_transform(X_train_features)
    hog_test_pca = pca.transform(X_test_features)
    lr = LogisticRegression(C=3.4647045830997407, max_iter=3171, penalty="l2",
                            solver="liblinear", warm_start=False)
    lr.fit(hog_train_pca, y_train)
    y_pred_lr_train = lr.predict(hog_train_pca)
    y_pred_lr_test = lr.predict(hog_test_pca)
    train_acc.append(accuracy_score(y_train, y_pred_lr_train))
    test_acc.append(accuracy_score(y_test, y_pred_lr_test))

# In[ ]:

plt.plot(range(6, 60, 3), train_acc, label='Train')
plt.plot(range(6, 60, 3), test_acc, label='Test')
plt.title('Figure 13: Logistic Regression Accuracy vs. Number of PCA Components')
plt.xlabel('Number of PCA Components')
plt.ylabel('Accuracy')
plt.legend()

# Next, the scatter plot in Figure 14 shows a 2D embedding of the HOG features after performing PCA on the training data. Each point represents a labeled data point, and its position on the plot is determined by the values of the first and second principal components.

# In[ ]:

# Plotting scatter plot for hog_train_pca
fig, ax = plt.subplots(figsize=(8, 8))
ax = sns.scatterplot(x=hog_train_pca[:, 0], y=hog_train_pca[:, 1], hue=y_train, palette='pastel', alpha=0.6)
handler, _ = ax.get_legend_handles_labels()
plt.legend(handler, letters, bbox_to_anchor=(1, 1))
plt.title('Figure 14: 2D Embedding For PCA (HOG Train)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

# The last step of feature engineering and dimensionality reduction is to combine the LDA components and the PCA components of the HOG features, so that they can be used together for model training.
# The shapes of the combined datasets are seen below.

# In[ ]:

# concatenate PCA and LDA features
X_train_combined = np.concatenate((hog_train_pca, X_train_lda), axis=1)
X_test_combined = np.concatenate((hog_test_pca, X_test_lda), axis=1)

print(f"X_train Shape: {X_train_combined.shape}")
print(f"X_test Shape: {X_test_combined.shape}")

# redefining
X_train = X_train_combined
X_test = X_test_combined

# After addressing the overfitting issue, two machine learning models were employed as baselines, namely Naive Bayes and Logistic Regression, to establish a performance benchmark. Additionally, more advanced models such as Random Forest, Support Vector Machines (SVM), and XGBoost were utilized, along with a stacking ensemble technique. This comprehensive approach aimed to leverage the strengths of each model, resulting in improved predictive accuracy and robustness for the given task. The Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, XGBoost, and Stacking Ensemble Classifier models are located in Appendices C, D, E, F, G, and H, respectively, where Randomized Grid Search CV is used to find the best hyperparameters for this dataset.
#
# The evaluation for each model is also shown in the aforementioned appendix sections. In the Evaluation section, plots to determine the best performing models are shown. Only the code for the best performing non-deep learning and deep learning models is displayed in this methodology section. To see the code for the other models, refer to the appendix sections.
#
# For non-deep learning models, the best performing model is the Stacking Ensemble Classifier. The code to define the structure and train this model is as follows.

# #### Best Performing Non-Deep Learning Model: Stacking Ensemble Classifier
# The below code describes the process of training the stacking ensemble classifier. This ensemble method is composed of the best performing models found using Randomized Grid Search CV for Support Vector Machine, XGBoost, Logistic Regression, and Random Forest. The meta-estimator is defined as a Logistic Regression model.
# In[ ]:

# defining estimators
all_estimators = [
    ('svm', SVC(kernel='poly', gamma='auto', C=.1, probability=True)),
    ('xgb', xgb.XGBClassifier(subsample=0.4, reg_lambda=2.25, reg_alpha=2, min_child_weight=30, max_depth=8, learning_rate=0.001, gamma=0, colsample_bytree=0.4)),
    ('lr', LogisticRegression(C=0.22564631610840102, max_iter=2391, penalty="l2", solver='newton-cg', warm_start=False)),
    ('rf', RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=5, max_features=5, max_depth=5, random_state=42))
]

# training stacking classifier
all_stack = StackingClassifier(estimators=all_estimators, final_estimator=LogisticRegression(max_iter=3000))
all_stack.fit(X_train, y_train)

# predictions
y_pred_train = all_stack.predict(X_train)
y_pred_test = all_stack.predict(X_test)

# In[ ]:

# stacking model
all_stack

# #### Best Performing Deep Learning Model: CNN

# In[ ]:

import tensorflow as tf
from tensorflow.keras import layers, models

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=False,
    preprocessing_function=tf.keras.applications.resnet50.preprocess_input)

train_generator = train_datagen.flow_from_directory(
    directory='/content/drive/MyDrive/Data/',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical', shuffle=False)

val_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    preprocessing_function=tf.keras.applications.resnet50.preprocess_input)

val_generator = val_datagen.flow_from_directory(
    directory='/content/drive/MyDrive/TestData/TestData/',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical', shuffle=False)

test_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    preprocessing_function=tf.keras.applications.resnet50.preprocess_input)

test_generator = test_datagen.flow_from_directory(
    directory='/content/drive/MyDrive/TestData/TestData/',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical', shuffle=False)

# CNN Model
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(256, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dropout(0.2))  # Add dropout with a rate of 0.2
model.add(layers.Dense(24, activation='softmax'))

model.summary()

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples/train_generator.batch_size,
    epochs=5,
    validation_data=val_generator,
    validation_steps=val_generator.samples/val_generator.batch_size)

# This model is a convolutional neural network (CNN) architecture used for image classification tasks.
#
# - Conv2D layer with 32 filters and a kernel size of (3, 3): This layer performs the convolution operation on the input image with 32 filters, each of which detects different features in the image. The activation function used is ReLU (Rectified Linear Unit), which introduces non-linearity to the model.
#
# - MaxPooling2D layer with a pool size of (2, 2): This layer reduces the spatial dimensions (width and height) of the input by selecting the maximum value in each 2x2 region.
# It helps in reducing the computational complexity and provides a form of translation invariance.
#
# - Conv2D layer with 64 filters and a kernel size of (3, 3): This layer performs another convolution operation on the previous layer's output with 64 filters, extracting more complex features from the image.
#
# - MaxPooling2D layer: Similar to the previous max pooling layer, it reduces the spatial dimensions further.
#
# - Conv2D layer with 128 filters and a kernel size of (3, 3): This layer continues the pattern of extracting higher-level features with 128 filters.
#
# - MaxPooling2D layer: Reduces the spatial dimensions again.
#
# - Conv2D layer with 256 filters and a kernel size of (3, 3): This layer further increases the number of filters, capturing more abstract features in the image.
#
# - MaxPooling2D layer: Continues the downsampling process.
#
# - Flatten layer: This layer converts the 2D output of the previous layer into a 1D feature vector, preparing it for input to a fully connected (dense) layer.
#
# - Dense layer with 512 units: A fully connected layer that receives the flattened feature vector as input. It applies the ReLU activation function to introduce non-linearity.
#
# - Dropout layer with a dropout rate of 0.2: Dropout is a regularization technique that randomly sets a fraction of input units to 0 during training. It helps prevent overfitting by reducing the reliance on specific neurons and encourages the network to learn more robust features.
#
# - Dense layer with 24 units: This is the final output layer of the network, consisting of 24 units corresponding to the number of classes in the classification task. The activation function used is softmax, which converts the final layer's outputs into probabilities representing the likelihood of each class (a minimal example of turning these probabilities into a predicted label and confidence score is sketched below).
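# As a hedged illustration (not part of the training pipeline above), the cell below sketches how the softmax output can be mapped to a predicted letter and a confidence score, as is done in the live interpreter; img_batch is an assumed, preprocessed batch of one 224x224 RGB image.

# In[ ]:

# Illustrative sketch: convert softmax probabilities into a label + confidence.
# `img_batch` is assumed to have shape (1, 224, 224, 3) and to be preprocessed
# the same way as the generator images.
probs = model.predict(img_batch)[0]  # one softmax probability per class
class_names = list(train_generator.class_indices.keys())
predicted_label = class_names[int(np.argmax(probs))]
confidence = float(np.max(probs))
print(f"Predicted: {predicted_label} ({confidence:.2%})")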
# In[ ]:

# Showing the training epochs, as the model was trained on Kaggle
display.Image("./figures/CNN_epochs.png")

# ## Evaluation
# To find the best performing non-deep learning model, KFold cross validation accuracy scores are computed for each model, and a box plot of the accuracies across the KFold validation datasets is shown below.

# In[ ]:

from sklearn.preprocessing import LabelEncoder

# define models for stacking classifier
all_estimators = [
    ('svm', SVC(kernel='poly', gamma='auto', C=.1, probability=True)),
    ('xgb', xgb.XGBClassifier(subsample=0.4, reg_lambda=2.25, reg_alpha=2, min_child_weight=30, max_depth=8, learning_rate=0.001, gamma=0, colsample_bytree=0.4)),
    ('lr', LogisticRegression(C=0.22564631610840102, max_iter=2391, penalty="l2", solver='newton-cg', warm_start=False)),
    ('rf', RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=5, max_features=5, max_depth=5, random_state=42))
]

# define models for KFold CV accuracy looping
models = {'Naive Bayes': GaussianNB(),
          'Logistic Regression': LogisticRegression(C=0.22564631610840102, max_iter=2391, penalty="l2", solver='newton-cg', warm_start=False),
          'SVM': SVC(kernel='poly', gamma='auto', C=.1, probability=True),
          'Random Forest Classifier': RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=5, max_features=5, max_depth=5, random_state=42),
          'XGBoost': xgb.XGBClassifier(subsample=0.4, reg_lambda=2.25, reg_alpha=2, min_child_weight=30, max_depth=8, learning_rate=0.001, gamma=0, colsample_bytree=0.4),
          'Stacking Classifier': StackingClassifier(estimators=all_estimators, final_estimator=LogisticRegression(max_iter=3000))
          }

# lists to store results and names
results = []
names = []

# loop to calculate accuracies
for name, model in models.items():
    kfold = model_selection.KFold(n_splits=5, random_state=99, shuffle=True)
    if name != 'XGBoost':
        cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    else:
        # XGBoost requires labels encoded as 0..n_classes-1
        le = LabelEncoder()
        le.fit(y_train)
        y_train_encoded = le.transform(y_train)
        cv_results = model_selection.cross_val_score(model, X_train, y_train_encoded, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print(f'{name}: {cv_results.mean()}, {cv_results.std()}')

fig = plt.figure()
fig.suptitle('Figure 15: KFold CV Accuracy Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.xticks(rotation=45, ha='right')
plt.show()

# In[ ]:

# display image due to computation time
display.Image("./figures/Figure15.png")

# In Figure 15, it is observed that the stacking classifier has the highest average cross validation accuracy at 0.835, with a standard deviation of 0.00598. This indicates that the stacking ensemble classifier is the best performing non-deep learning model. This is exemplified by the train and test accuracy scores, where the Stacking Ensemble Model has the highest train and test accuracies among the non-deep learning models.

# In[ ]:

# display image due to computation time
display.Image("./figures/Figure16.png")

# #### Stacking Ensemble Model Evaluation
# To see the performance of the stacking ensemble in more detail, the predictions for the train and test datasets are computed, and the evaluate_model() function is executed below.

# In[ ]:

# predict on train and test data using the stacking ensemble
y_pred_train = all_stack.predict(X_train)
y_pred_test = all_stack.predict(X_test)

# In[ ]:

# evaluate model
evaluate_model(y_train, y_pred_train, letters)
evaluate_model(y_test, y_pred_test, letters)

# ##### Evaluation Summary
# - Accuracy: train accuracy is 96% and test accuracy is 86%
# - Precision: A-Y labels range from 0.89-1.00 for train and 0.55-0.98 for test. C, O & P have higher precision for the test set. Higher values indicate a lower false positive rate
# - Recall: The recall ranges from 0.90-1.00 for train and 0.64-0.99 for test; P & L have higher recall.
# Higher values indicate a lower false negative rate
# - Support: For test, the values range from 576-1992, indicating varying class frequency
# - The Matthews Correlation Coefficient (MCC) is 0.9567 for train, indicating a high level of agreement between predicted and true labels. For test, it is approximately 0.852.
# - Letters R and S had lower accuracy and showed lower performance in terms of correctly identifying instances of their respective classes
# - Cohen's Kappa is a statistical measure of inter-rater agreement for categorical classifications, and it is high for this model
#
# #### Deep Learning Model Evaluation
# ##### Deep Learning Train Data Evaluation

# In[ ]:

# Evaluate the model on the train set
train_loss, train_accuracy = model.evaluate(train_generator)

# Generate predictions on the train set
predictions = model.predict(train_generator)
predicted_labels = tf.argmax(predictions, axis=1)

# Get the true labels
true_labels = train_generator.classes

# Get the class names
class_names = list(train_generator.class_indices.keys())

evaluate_model(true_labels, predicted_labels.numpy(), class_names)

# In[ ]:

# Showing the train accuracy, as the model was trained on Kaggle
display.Image("./figures/accTrain.png")

# In the given classification report, the model achieved an accuracy of 0.9412, meaning it correctly predicted the class labels for 94.12% of the samples.
#
# - Precision: Most classes have high precision scores of 1.00, indicating that the model had a very low rate of false positives for those classes. However, classes 'M', 'N', 'R', 'S', 'U', and 'V' have lower precision scores, suggesting some difficulty in accurately predicting those classes.
#
# - Recall: The majority of classes have recall scores of 1.00, indicating that the model successfully identified the majority of positive instances for those classes. However, classes 'M' and 'S' have lower recall scores, indicating some difficulty in correctly capturing all positive instances for those classes.
#
# - F1-score: The F1-scores for most classes are high, with a value of 1.00, suggesting a balanced performance in terms of precision and recall. However, classes 'M', 'N', 'R', and 'S' have lower F1-scores, indicating a trade-off between precision and recall for those classes.
#
# - Support: The support column shows the number of samples in each class in the dataset.
#
# - Weighted avg: The weighted average precision, recall, and F1-score are also around 0.94, taking into account the support (i.e., number of samples) for each class.
#
# Overall, the model achieved high accuracy and performed well for most classes. However, it faced challenges in accurately predicting classes 'M', 'N', 'R', 'S', 'U', and 'V'. Improvements may be needed to enhance the model's performance on these specific classes.
# In[ ]:

# Showing the train classification report, as the model was trained on Kaggle
display.Image("./figures/ClassificationReport_train.png")

# In[ ]:

# Showing the train confusion matrix, as the model was trained on Kaggle
display.Image("./figures/ConfusionMatrix_train.png")

# ##### Deep Learning Test Data Evaluation

# In[ ]:

# Deep Learning Evaluation - testing data
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(val_generator)

# Generate predictions on the test set
predictions = model.predict(val_generator)
predicted_labels = tf.argmax(predictions, axis=1)

# Get the true labels
true_labels = val_generator.classes

# Get the class names
class_names = list(val_generator.class_indices.keys())

evaluate_model(true_labels, predicted_labels.numpy(), class_names)

# In[ ]:

# Showing the test accuracy, as the model was trained on Kaggle
display.Image("./figures/accTest.png")

# The accuracy of the model is 0.9321, which means it predicted the correct class for 93.21% of the samples.
#
# - Precision measures the proportion of correctly predicted positive instances out of the total instances predicted as positive. Recall, on the other hand, measures the proportion of correctly predicted positive instances out of the total actual positive instances. The F1-score is the harmonic mean of precision and recall, providing a balanced measure of the model's performance.
#
# - In this classification report, most classes have perfect precision and recall scores of 1.00, indicating high performance. However, some classes like 'M', 'N', 'R', and 'S' have lower scores, suggesting difficulties in accurately predicting those classes.
#
# - The support column indicates the number of samples in each class, which can vary across classes.
#
# - The macro average provides the average performance across all classes, treating each class equally. The weighted average considers the support for each class, providing a weighted measure that accounts for class imbalance.
#
# In summary, the model achieved high accuracy and performed well for most classes, but struggled with certain classes. Further analysis and improvements may be needed to enhance the model's performance on those specific classes.

# In[ ]:

display.Image("./figures/ClassificationReportTesting.png")

# In[ ]:

display.Image("./figures/ConfusionMatrixTesting.png")

# ## Conclusion
# For the purposes of this investigation, the best performing models for sign language classification are split between the best non-deep learning and deep learning models. The best non-deep learning model is the Stacking Ensemble Classifier, which consists of Logistic Regression, Support Vector Machine, Random Forest, and XGBoost estimators; the meta-estimator used is Logistic Regression. This model was trained on the ~110k augmented dataset, where the features used are the 23 LDA components and the 30 PCA components derived from the HOG features. The best performing deep learning model is a Convolutional Neural Network that consists of four convolutional layers (with 32, 64, 128, and 256 filters), a pooling layer after each convolutional layer to downsample the feature maps, and a final dense layer of 24 units, representing the number of classes in the classification task. It is important to note that these models performed well on the train and test datasets; however, for live testing, where the input images are extracted from live camera video feeds, the performance of the models decreased.
# As a result, the Convolutional Neural Network was trained on 224 x 224 images with the hand landmarks plotted onto the images. Adding these hand features and increasing the size of the training images allows the model to better locate the finger positions and more accurately predict the correct labels.
#
# As seen in the evaluation metrics, these models perform much better than a random classifier for this problem, which would have an accuracy of about 0.04 (one of 24 classes). As a result, we are able to conclude that we have created high performing and suitable models for a Sign Language Interpreter Model.
#
# Using the CNN, the Sign Language Interpreter Model, named SignLingo, was created and can be run using the sign_language_interpreter.py script. The CNN model was chosen because it is the highest performing model on both the dataset and the live environment. This script utilizes Python libraries like OpenCV and MediaPipe to access the computer or phone's camera, detect a person's hand, extract the hand, send it as an input into the trained CNN, and output the predicted label with a confidence score (a minimal sketch of such a capture loop is shown below).
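# The following is a minimal, illustrative sketch of such a capture loop, not the actual sign_language_interpreter.py script: it assumes a trained `model` and a `class_names` list, and omits the exact cropping and preprocessing used by SignLingo.

# In[ ]:

# Illustrative sketch of the live capture loop (assumptions: `model`, `class_names`)
import mediapipe as mp

mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)  # open the default camera
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                # draw the detected hand landmarks onto the frame
                mp.solutions.drawing_utils.draw_landmarks(
                    frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
            # here the hand region would be cropped, resized to 224 x 224,
            # and passed to model.predict to obtain a label and confidence
        cv2.imshow('SignLingo', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):  # press q to quit
            break
cap.release()
cv2.destroyAllWindows()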
Furthermore, he actively participated in generating the dataset using the cvzone Python library and constructing a customized CNN model. His expertise and dedication were instrumental in ensuring the project aligns with the team's objectives and achieves its goals.

# In[13]:

# Displaying the GitHub contribution graph for the repository
display.Image("./figures/github_contributions.png")


# ## Bibliography
#
# [1] https://www.sciencedirect.com/science/article/abs/pii/S0957417405003040
#
# ## Appendix

# #### A. Naive Bayes Initial Results

# The below code is used to train a Gaussian Naive Bayes model on the normalized training data. Once the model is trained, the evaluate_model() function is used to output the evaluation metrics described in the Methods section above.

# In[ ]:

# Defining Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train_norm, y_train)

# applying NB on normalized train data
y_pred_train = gnb.predict(X_train_norm)

# applying NB on normalized test data
y_pred_test = gnb.predict(X_test_norm)


# In[ ]:

# Evaluating on train set
evaluate_model(y_train, y_pred_train, letters)

# Evaluating on test set
evaluate_model(y_test, y_pred_test, letters)

# The results indicate that the Naive Bayes classifier is showing signs of overfitting: the training accuracy (0.46) is noticeably higher than the test accuracy (0.39), suggesting poor generalization to unseen data. The low absolute accuracy on both sets also suggests that raw pixel values alone are weak features for this classifier.

# #### B. Logistic Regression Initial Results
#
# The below code describes the process of training the best Logistic Regression model, where RandomizedSearchCV is used for hyperparameter tuning. The best parameters found are as follows:
#
# Best hyperparameters: {'C': 3.4647045830997407, 'max_iter': 3171, 'penalty': 'l2', 'solver': 'liblinear', 'warm_start': False}

# In[ ]:

lr = LogisticRegression(C=3.4647045830997407, max_iter=3171, penalty="l2", solver="liblinear", warm_start=False)
lr.fit(X_train_norm, y_train)

# applying Logistic Regression on train
y_pred_lr_train = lr.predict(X_train_norm)

# applying Logistic Regression on test
y_pred_lr_test = lr.predict(X_test_norm)


# In[ ]:

evaluate_model(y_train, y_pred_lr_train, letters)
evaluate_model(y_test, y_pred_lr_test, letters)

# As seen in the train vs test evaluation, the Logistic Regression model on the 27k training images results in overfitting. The train accuracy is 100%, while the test accuracy is 67%. This indicates that techniques like data augmentation, regularization, and dimensionality reduction are needed to reduce the likelihood of overfitting.

# In[ ]:

# Normalizing the data
#X_train_norm = X_train/255
#X_test_norm = X_test/255

# Print the shapes of the augmented data
print(f'X_train shape: {X_train_norm.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test_norm.shape}')
print(f'y_test shape: {y_test.shape}')


# #### C. Naive Bayes
#
# The below code is used to train a Gaussian Naive Bayes model on the combined-feature training data. Once the model is trained, the evaluate_model() function is used to output the evaluation metrics described in the Methods section above.
# In[ ]:

# Defining Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# applying NB on the combined-feature train data
y_pred_train = gnb.predict(X_train)

# applying NB on the combined-feature test data
y_pred_test = gnb.predict(X_test)


# In[ ]:

# Evaluating on train set
evaluate_model(y_train, y_pred_train, letters)

# Evaluating on test set
evaluate_model(y_test, y_pred_test, letters)

# ##### Evaluation Summary
# - Accuracy: 69% for train and 57% for test
# - Precision: A-Y labels range from 0.45-0.90 for train and 0.15-0.91 for test. P and C have higher precision on the test set. Higher values indicate a lower false positive rate
# - Recall: ranges from 0.54-0.87 for train and 0.30-0.85 for test; P and H have higher recall. Higher values indicate a lower false negative rate
# - Support: for test, the values range from 576-1992, indicating varying class frequency
# - The Matthews Correlation Coefficient (MCC) is 0.6778 for train, indicating a moderate level of agreement between predicted and true labels. For test, it is approximately 0.548
# - Letters R and S had lower accuracy and showed lower performance in terms of correctly identifying instances of their respective classes
# - Cohen's Kappa, a statistical measure of inter-rater agreement for categorical classifications, is moderate for this model
#
# #### D. Logistic Regression
#
# The below code describes the process of training the best Logistic Regression model, where RandomizedSearchCV is used for hyperparameter tuning. The best parameters found are as follows:
#
# Best hyperparameters: {'C': 0.22564631610840102, 'max_iter': 2391, 'penalty': 'l2', 'solver': 'newton-cg', 'warm_start': False}

# In[ ]:

lr = LogisticRegression(C=0.22564631610840102, max_iter=2391, penalty="l2", solver='newton-cg', warm_start=False)
lr.fit(X_train, y_train)

# predicting with Logistic Regression on train data
y_pred_lr_train = lr.predict(X_train)

# predicting with Logistic Regression on test data
y_pred_lr_test = lr.predict(X_test)


# In[ ]:

evaluate_model(y_train, y_pred_lr_train, letters)
evaluate_model(y_test, y_pred_lr_test, letters)

# ##### Evaluation Summary
#
# - Accuracy: 77% for train and 65% for test
# - Precision: A-Y labels range from 0.66-0.89 for train and 0.29-0.89 for test. P and C have higher precision on the test set. Higher values indicate a lower false positive rate
# - Recall: ranges from 0.63-0.90 for train and 0.42-0.89 for test; P, C, and H have higher recall. Higher values indicate a lower false negative rate
# - Support: for test, the values range from 576-1992, indicating varying class frequency
# - The Matthews Correlation Coefficient (MCC) is 0.761 for train, indicating fairly strong agreement between predicted and true labels. For test, it is approximately 0.636, a moderate level
# - Letters R and S had lower accuracy and showed lower performance in terms of correctly identifying instances of their respective classes
#
# #### E. Random Forest
#
# The below code describes the process of training the best Random Forest model, where RandomizedSearchCV is used for hyperparameter tuning (an illustrative search sketch is shown in Appendix G below).
The best parameters found are as follows:
#
# Best hyperparameters: {'n_estimators': 20, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_features': 5, 'max_depth': 5}

# In[ ]:

# defining the Random Forest classifier with the best parameters
rfc = RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=5, max_features=5, max_depth=5, random_state=42)
rfc.fit(X_train, y_train)

# predicting with Random Forest on train data
y_pred_train = rfc.predict(X_train)

# predicting with Random Forest on test data
y_pred_test = rfc.predict(X_test)


# In[ ]:

evaluate_model(y_train, y_pred_train, letters)
evaluate_model(y_test, y_pred_test, letters)

# ##### Evaluation Summary
# - Accuracy: 54% for train and 42% for test
# - Precision: A-Y labels range from 0.27-0.74 for train and 0.12-0.77 for test. P, B, and C have higher precision on the test set. Higher values indicate a lower false positive rate
# - Recall: ranges from 0.54-0.87 for train and 0.30-0.85 for test; P and H have higher recall. Higher values indicate a lower false negative rate
# - Support: for test, the values range from 576-1992, indicating varying class frequency
# - The Matthews Correlation Coefficient (MCC) is 0.518 for train, indicating a somewhat lower level of agreement between predicted and true labels. For test, it is approximately 0.402
# - Letters R and S had lower accuracy and showed lower performance in terms of correctly identifying instances of their respective classes

# #### F. Support Vector Machine

# The below code describes the process of training the best Support Vector Machine, where RandomizedSearchCV is used for hyperparameter tuning. The best parameters found are as follows:
#
# Best hyperparameters: {'kernel': 'poly', 'gamma': 'scale', 'C': 0.1}

# In[ ]:

# define SVM with the best parameters
svm = SVC(kernel='poly', gamma='scale', C=0.1, probability=True)

# fit on training data
svm.fit(X_train, y_train)

# predict on training data
y_pred_train = svm.predict(X_train)

# predict on testing data
y_pred_test = svm.predict(X_test)


# In[10]:

evaluate_model(y_train, y_pred_train, letters)
evaluate_model(y_test, y_pred_test, letters)

# ##### Evaluation Summary
# - Accuracy: 94% for train and 82% for test
# - Precision: A-Y labels range from 0.81-1.00 for train and 0.44-1.00 for test. P and C have higher precision on the test set. Higher values indicate a lower false positive rate
# - Recall: ranges from 0.82-0.99 for train and 0.74-0.96 for test; L, Q, and P have higher recall. Higher values indicate a lower false negative rate
# - Support: for test, the values range from 576-1992, indicating varying class frequency
# - The Matthews Correlation Coefficient (MCC) is 0.9334 for train, indicating a high level of agreement between predicted and true labels. For test, it is approximately 0.8159
# - Letters R and S had lower accuracy and showed lower performance in terms of correctly identifying instances of their respective classes
# - Cohen's Kappa, a statistical measure of inter-rater agreement for categorical classifications, is high for this model
#
# #### G. XGBoost

# The below code describes the process of training the best XGBoost model, where RandomizedSearchCV is used for hyperparameter tuning.
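# The notebook records only the best parameters from the search; as an illustration, a RandomizedSearchCV setup for XGBoost may look similar to the following minimal sketch. The parameter distributions here are assumptions, not the exact ranges used in this project.

# In[ ]:

# Hypothetical RandomizedSearchCV setup for XGBoost (illustrative only)
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

param_dist = {
    'subsample': uniform(0.3, 0.6),        # sampled from [0.3, 0.9]
    'reg_lambda': uniform(0, 3),           # L2 regularization strength
    'reg_alpha': randint(0, 4),            # L1 regularization strength
    'min_child_weight': randint(1, 40),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.001, 0.1),
    'gamma': uniform(0, 1),
    'colsample_bytree': uniform(0.3, 0.6),
}

xgb_search = RandomizedSearchCV(
    xgb.XGBClassifier(),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
)
# xgb_search.fit(X_train, y_train_encoded)
# print(xgb_search.best_params_)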
The best parameters found are as follows:
#
# Best hyperparameters: {'subsample': 0.4, 'reg_lambda': 2.25, 'reg_alpha': 2, 'min_child_weight': 30, 'max_depth': 8, 'learning_rate': 0.001, 'gamma': 0, 'colsample_bytree': 0.4}

# In[ ]:

# encoding the string labels into integers for XGBoost
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y_train)
y_train_encoded = le.transform(y_train)
y_test_encoded = le.transform(y_test)

# Create an XGBoost classifier object with the best parameters
xgb_model = xgb.XGBClassifier(subsample=0.4, reg_lambda=2.25, reg_alpha=2, min_child_weight=30, max_depth=8, learning_rate=0.001, gamma=0, colsample_bytree=0.4)

# fitting on train
xgb_model.fit(X_train, y_train_encoded)


# In[12]:

# loading saved predictions due to long training time
with open('/content/drive/Shareddrives/SignLanguageData/XGBoost_Predictions.pkl', 'rb') as f:
    y_train_encoded, y_pred_train, y_test_encoded, y_pred_test = pickle.load(f)


# In[13]:

# evaluation
evaluate_model(y_train_encoded, y_pred_train, letters)
evaluate_model(y_test_encoded, y_pred_test, letters)

# ##### Evaluation Summary
# - Accuracy: 76% for train and 60% for test
# - Precision: A-Y labels range from 0.61-0.92 for train and 0.15-0.91 for test. P and C have higher precision on the test set. Higher values indicate a lower false positive rate
# - Recall: ranges from 0.58-0.85 for train and 0.25-0.87 for test; P and Q have higher recall. Higher values indicate a lower false negative rate
# - Support: for test, the values range from 576-1992, indicating varying class frequency
# - The Matthews Correlation Coefficient (MCC) is 0.7489 for train, indicating a moderate level of agreement between predicted and true labels. For test, it is approximately 0.579
# - Letters R, S, and V had lower accuracy and showed lower performance in terms of correctly identifying instances of their respective classes
# - Cohen's Kappa, a statistical measure of inter-rater agreement for categorical classifications, is moderate for this model
#
# #### I. Stacking Ensemble

# The below code describes the process of training the stacking ensemble classifier. This ensemble method is composed of the best performing models found via hyperparameter tuning for Support Vector Machine, XGBoost, Logistic Regression, and Random Forest. The meta-estimator is defined as a Logistic Regression model, which is trained on the base estimators' predictions to produce the final classification.
# In[ ]:

# defining the base estimators with their tuned hyperparameters
all_estimators = [
    ('svm', SVC(kernel='poly', gamma='scale', C=0.1, probability=True)),
    ('xgb', xgb.XGBClassifier(subsample=0.4, reg_lambda=2.25, reg_alpha=2, min_child_weight=30, max_depth=8, learning_rate=0.001, gamma=0, colsample_bytree=0.4)),
    ('lr', LogisticRegression(C=0.22564631610840102, max_iter=2391, penalty="l2", solver='newton-cg', warm_start=False)),
    ('rf', RandomForestClassifier(n_estimators=20, min_samples_split=10, min_samples_leaf=5, max_features=5, max_depth=5, random_state=42))
]

# training the stacking classifier
all_stack = StackingClassifier(estimators=all_estimators, final_estimator=LogisticRegression(max_iter=3000))
all_stack.fit(X_train, y_train)

# predictions
y_pred_train = all_stack.predict(X_train)
y_pred_test = all_stack.predict(X_test)

# evaluate model
evaluate_model(y_train, y_pred_train, letters)
evaluate_model(y_test, y_pred_test, letters)


# In[5]:

# displaying saved evaluation figures due to long computation time
print('Stacking Ensemble Train Evaluation')
display.Image("./figures/stack_eval_train1.png")


# In[6]:

display.Image("./figures/stack_eval_train2.png")


# In[7]:

display.Image("./figures/stack_eval_train3.png")


# In[8]:

print('Stacking Ensemble Test Evaluation')
display.Image("./figures/stack_eval_test1.png")


# In[10]:

display.Image("./figures/stack_eval_test2.png")


# In[11]:

display.Image("./figures/stack_eval_test3.png")

# ##### Evaluation Summary
# - Accuracy: 96% for train and 86% for test
# - Precision: A-Y labels range from 0.89-1.00 for train and 0.55-0.98 for test. C, O, and P have higher precision on the test set. Higher values indicate a lower false positive rate
# - Recall: ranges from 0.90-1.00 for train and 0.64-0.99 for test; P and L have higher recall. Higher values indicate a lower false negative rate
# - Support: for test, the values range from 576-1992, indicating varying class frequency
# - The Matthews Correlation Coefficient (MCC) is 0.9567 for train, indicating a high level of agreement between predicted and true labels. For test, it is approximately 0.852
# - Letters R and S had lower accuracy and showed lower performance in terms of correctly identifying instances of their respective classes
# - Cohen's Kappa, a statistical measure of inter-rater agreement for categorical classifications, is high for this model
#
# #### J. Deep Learning Data Generation 224x224x3
#
# HandDetector is provided by cvzone, a Python package designed for hand landmark identification. Built on top of OpenCV and MediaPipe, cvzone makes it straightforward to find and follow hand landmarks in images and video feeds, using computer vision and machine learning methods to precisely locate key features of the human hand such as the fingers, knuckles, and palm center.
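# As a minimal, illustrative sketch (assuming the cvzone 1.5+ API), a single frame can be processed as follows to expose the bounding box and the 21 hand landmarks; the full data-generation loop used in this project appears in the next cell.

# In[ ]:

# Minimal HandDetector usage sketch (illustrative only)
import cv2
from cvzone.HandTrackingModule import HandDetector

cap = cv2.VideoCapture(0)
detector = HandDetector(maxHands=1)

success, frame = cap.read()
if success:
    # findHands returns the detected hands and the frame with landmarks drawn
    hands, frame = detector.findHands(frame)
    if hands:
        lm_list = hands[0]['lmList']   # 21 landmark points of the hand
        x, y, w, h = hands[0]['bbox']  # bounding box around the hand
        print(f'{len(lm_list)} landmarks, bbox = {(x, y, w, h)}')

cap.release()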
# In[ ]:

# Deep Learning Data Generation
import os
import cv2
from cvzone.HandTrackingModule import HandDetector
import numpy as np
import math
import time

cap = cv2.VideoCapture(0)
detector = HandDetector(maxHands=1)
offset = 20
imgSize = 224
counter = 0

label = "Y"
folder = "../data/external/Data/" + label + "/"

if not os.path.exists(folder):
    os.makedirs(folder)
    print(f"Directory '{folder}' created successfully")
else:
    print(f"Directory '{folder}' already exists")

while True:
    success, img = cap.read()
    hands, img = detector.findHands(img)
    if hands:
        hand = hands[0]
        x, y, w, h = hand['bbox']

        # white 224 x 224 canvas onto which the resized hand crop is pasted
        imgWhite = np.ones((imgSize, imgSize, 3), np.uint8) * 255
        imgCrop = img[y - offset:y + h + offset, x - offset:x + w + offset]

        # resize the crop so its longer side fills the canvas, preserving aspect ratio
        aspectRatio = h / w
        if aspectRatio > 1:
            k = imgSize / h
            wCal = math.ceil(k * w)
            imgResize = cv2.resize(imgCrop, (wCal, imgSize))
            wGap = math.ceil((imgSize - wCal) / 2)
            imgWhite[:, wGap:wCal + wGap] = imgResize
        else:
            k = imgSize / w
            hCal = math.ceil(k * h)
            imgResize = cv2.resize(imgCrop, (imgSize, hCal))
            hGap = math.ceil((imgSize - hCal) / 2)
            imgWhite[hGap:hCal + hGap, :] = imgResize

        cv2.imshow("ImageCrop", imgCrop)
        cv2.imshow("ImageWhite", imgWhite)

    cv2.imshow("Image", img)
    key = cv2.waitKey(1)
    if key == ord("s"):
        # press "s" to save the current canvas as a training image
        counter += 1
        cv2.imwrite(f'{folder}/Image_{time.time()}.jpg', imgWhite)
        print(counter)
    if key == ord("q"):
        # added so the capture loop can be exited cleanly
        break

cap.release()
cv2.destroyAllWindows()


# #### K. Feature Extraction Using CNN
#
# The provided code focuses on extracting hierarchical and increasingly abstract features from the input images through multiple convolutional and pooling layers. These layers effectively perform feature extraction by learning and capturing important patterns and information from the images. The subsequent layers (not shown in the given code) would typically involve fully connected layers and an output layer for the final classification or regression task.
# In[ ]:

# importing the required libraries
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout, BatchNormalization
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from keras.callbacks import ReduceLROnPlateau

# 100k shuffled data
with open('../data/external/combined_augmented_data_v3.pkl', 'rb') as f:
    X_train, y_train, X_test, y_test = pickle.load(f)

# Normalize the data
x_train = X_train / 255
x_test = X_test / 255

# printing the shape of the data
print(f'X_train shape: {x_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')

# Reshape X to be a 4D tensor for use in Keras
X = x_train.reshape((x_train.shape[0], x_train.shape[1], x_train.shape[2], 1))
y = y_train.reshape((y_train.shape[0], 1))

# number of labels
num_classes = len(np.unique(y_train))

# CNN model used as a convolutional feature extractor (no dense classification head)
model = Sequential()
model.add(Conv2D(75, (3, 3), strides=1, padding='same', activation='relu', input_shape=(28, 28, 1)))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding='same'))
model.add(Conv2D(50, (3, 3), strides=1, padding='same', activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding='same'))
model.add(Conv2D(25, (3, 3), strides=1, padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding='same'))
model.add(Flatten())
model.summary()

# Extract features from X using the model
features = model.predict(X)

# saving the extracted features as a numpy array
with open('../data/external/extracted_features_on_v3.npy', 'wb') as f:
    np.save(f, features)


# #### L. CNN Model for 28x28 Images
#
# The provided code represents a CNN model for image classification. It consists of multiple convolutional and pooling layers for feature extraction, followed by fully connected layers for high-level feature representation and classification. The model is trained using the Adam optimizer with categorical cross-entropy loss and evaluated on accuracy.

# In[ ]:

# importing the required libraries
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout, BatchNormalization
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from keras.callbacks import ReduceLROnPlateau
from sklearn.preprocessing import LabelBinarizer

# opening the pickle file of augmented data
with open('../data/external/combined_augmented_data_v2.pkl', 'rb') as f:
    X_train, y_train, X_test, y_test = pickle.load(f)

# LabelBinarizer converts categorical labels into binary (one-hot) vectors, the format the categorical cross-entropy loss expects.
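# Illustrative example (hypothetical labels, not from this dataset):
#   LabelBinarizer().fit_transform(['A', 'B', 'D'])
#   -> array([[1, 0, 0],
#             [0, 1, 0],
#             [0, 0, 1]])
# i.e., each label becomes a one-hot row matching the softmax output used below.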
label_binarizer = LabelBinarizer()
y_train = label_binarizer.fit_transform(y_train)
y_test = label_binarizer.transform(y_test)

# Normalize the data
x_train = X_train / 255
x_test = X_test / 255

# reduce the learning rate when validation accuracy plateaus
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', patience=2, verbose=1, factor=0.5, min_lr=0.00001)

# CNN Model
model = Sequential()
model.add(Conv2D(75, (3, 3), strides=1, padding='same', activation='relu', input_shape=(28, 28, 1)))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding='same'))
model.add(Conv2D(50, (3, 3), strides=1, padding='same', activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding='same'))
model.add(Conv2D(25, (3, 3), strides=1, padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2), strides=2, padding='same'))
model.add(Flatten())
model.add(Dense(units=512, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(units=24, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

history = model.fit(x_train, y_train, batch_size=128, epochs=2, validation_data=(x_test, y_test), callbacks=[learning_rate_reduction], verbose=1)

print("Accuracy of the model is - ", model.evaluate(x_test, y_test)[1] * 100, "%")

# save the model to an h5 file
model.save('../data/external/finalModel.h5')


# #### M. SVD
#
# Singular Value Decomposition (SVD) is a matrix factorization technique that is commonly used for dimensionality reduction and feature extraction. It decomposes a matrix A into three matrices, A = UΣVᵀ, where U and V are orthogonal matrices and Σ is a diagonal matrix containing the singular values.
#
# In the code below, TruncatedSVD is used to perform SVD with a specified number of components (n_components). The training data is first fitted and transformed through the pipeline (pipe.fit_transform(X_train_norm)) to learn the SVD model and obtain the reduced-dimensional representations. The test data is then transformed using the learned SVD model (pipe.transform(X_test_norm)).

# In[ ]:

n_components = 10
svd = TruncatedSVD(n_components, n_iter=7, random_state=42)

# Build the pipeline
pipe = Pipeline([('reducer', svd)])

# Fit the pipeline to the normalized training data and transform it
X_train_svd = pipe.fit_transform(X_train_norm)
X_test_svd = pipe.transform(X_test_norm)

# The explained variance ratio of the SVD components indicates how much information is retained at each component. As a result, the cumulative explained variance can help determine how many components to keep for dimensionality reduction.
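# To make the connection between singular values and explained variance concrete, here is a small self-contained sketch on synthetic data (an illustration, not part of the original pipeline): on centered data, the variance captured by each component is proportional to its squared singular value.

# In[ ]:

# Illustrative sketch on synthetic data (not the sign-language images)
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(100, 20))
A_centered = A - A.mean(axis=0)          # center columns so squared singular values correspond to variance
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)          # variance explained per component
print(var_ratio[:3], np.cumsum(var_ratio)[:3])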
# In[ ]:

# calculate the explained variance ratio for each component
explained_variance_ratio = svd.explained_variance_ratio_
explained_variance_ratio


# In[ ]:

# Getting the explained variance ratio from the fitted SVD model
evr = svd.explained_variance_ratio_
components = range(1, len(evr) + 1)

# Plotting the scree plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(x=components, height=evr, label='Explained Variance')
plt.plot(components, np.cumsum(evr), marker='.', color='orange', label='Cumulative Explained Variance')
plt.axhline(y=.95, color='r', linestyle='--', label='0.95 Explained Variance')
plt.xticks(range(1, len(evr) + 1))
plt.title('SVD: Explained Variance')
plt.xlabel('Component')
plt.ylabel('Explained Variance')
plt.legend(fontsize=9)

# Show the plot
plt.show()


# In[ ]:

fig, ax = plt.subplots(figsize=(8, 8))
ax = sns.scatterplot(x=X_train_svd[:, 0], y=X_train_svd[:, 1], hue=y_train, palette='pastel')
handler, _ = ax.get_legend_handles_labels()
plt.legend(handler, letters, bbox_to_anchor=(1, 1))
plt.title('2D Embedding of Sign Language Images')
plt.xlabel('Singular Vector 1')
plt.ylabel('Singular Vector 2')

# The scree plot above shows an elbow at component 2, and the scatter plot shows that the first two singular vectors do not separate certain letters from others particularly well.

# #### N. TSNE
#
# t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data in a lower-dimensional space. It aims to preserve the local structure and relationships between data points while creating a compressed representation.
#
# The below code outlines the process of setting up the data so that it can be an input into t-SNE using Scikit-Learn.

# In[ ]:

# Initialize the t-SNE object
tsne = TSNE(n_components=2, random_state=0)

# Apply t-SNE to the normalized training data
tsne_res = tsne.fit_transform(X_train_norm)


# In[ ]:

# Plot the t-SNE results (the palette is sized to the 24 letter classes)
fig, ax = plt.subplots(figsize=(10, 10))
ax = sns.scatterplot(x=tsne_res[:, 0], y=tsne_res[:, 1], hue=y_train, palette=sns.hls_palette(24), legend='full')
handler, _ = ax.get_legend_handles_labels()
plt.legend(handler, letters, bbox_to_anchor=(1, 1))

# title
plt.title('2D Embedding of Sign Language Images')

# x-axis label
plt.xlabel('TSNE Component 1')

# y-axis label
plt.ylabel('TSNE Component 2')

# Looking at the plot above, the t-SNE embedding also does not separate certain letters from others particularly well.