#!/usr/bin/env python
# coding: utf-8

# # Spam Filter using Naive Bayes
# 
# Building a spam filter from scratch using the Naive Bayes algorithm

# In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (but naive) independence assumptions between the features. In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event.
# 
# The aim of this project is to build a spam filter from scratch using the Naive Bayes algorithm. Leveraging Bayes' theorem, the model predicts the probability that a given text is spam or non-spam, based on the training data it is exposed to.
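# 
# For reference, Bayes' theorem in its general form, for two events $A$ and $B$:
# 
# $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$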
# 
# The dataset consists of text messages and a target variable specifying whether each message is spam or not. The dataset has been picked up from the UCI Machine Learning Repository.
# 
# The columns in the dataset are not named, hence the names given: *Label* and *SMS*.
# 
# -> Label - the target variable: spam or ham (non-spam)
# 
# -> SMS - the message text
# 
# It is required to explore the dataset before building the model, so as to remove any discrepancies in the text.

# In[184]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from itertools import chain
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

# In[143]:

df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
df.columns = ['Label', 'SMS']
print(df.shape)
df.head(5)

# Whenever building a classifier, it is important to identify the proportion of the target classes and know beforehand whether the classifier is going to be exposed to imbalanced data. In this case, the proportion of *spam* to *ham* (non-spam) is important for the classifier. A `countplot` captures this statistic well.

# In[144]:

get_ipython().run_line_magic('matplotlib', 'inline')
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12, 8))
sns.countplot(x='Label', data=df)
plt.ylabel('')
plt.yticks([])
df.Label.value_counts(normalize=True)

# The dataset is highly imbalanced, i.e. the target variable does not have equal proportions of its classes. There are multiple ways to address this problem. Most commonly, the dataset can be *under-sampled*, i.e. random samples of the majority class are drawn so that it matches the minority class in size, or it can be *over-sampled*, i.e. the minority class points are increased in number through interpolation to compete with the majority class. For this project, neither technique is used: the classifier is a probabilistic model, probabilities are relative, and the class priors make up for the imbalance.
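# As a side note, a minimal sketch of random under-sampling with pandas is shown below. It is not applied anywhere in this project; the `random_state` and the balancing approach are illustrative assumptions only.

# In[ ]:

# Hypothetical illustration: draw as many random 'ham' rows as there are 'spam' rows,
# producing a balanced subset. Not used in the rest of this notebook.
spam_rows = df[df.Label == 'spam']
ham_rows = df[df.Label == 'ham'].sample(n=len(spam_rows), random_state=1)
balanced_df = pd.concat([spam_rows, ham_rows]).sample(frac=1, random_state=1).reset_index(drop=True)
balanced_df.Label.value_counts(normalize=True)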
# 
# It is very important to test the model to gain validation and feedback on it. The dataset here is split 80:20: the train data will be 80% and the test data 20% of the entire population, all selected at random. The sampling is done using the `sample` function of `pandas`. An easier method would be to use the built-in `train_test_split()` function of the `sklearn.model_selection` module (a short sketch is shown further below).

# In[145]:

randomized = df.sample(frac=1, random_state=1)
train_size = round(len(randomized) * 0.8)
train = randomized[:train_size].reset_index(drop=True)
test = randomized[train_size:].reset_index(drop=True)
print(train.shape)
print(test.shape)

# A sample of a population has to be representative of the population, otherwise the results obtained can be faulty or skewed. It is therefore important to check this criterion before moving forward with the project.

# In[72]:

train.Label.value_counts(normalize=True)

# In[73]:

test.Label.value_counts(normalize=True)

# Comparing the splits, the class proportions of the two splits as well as of the entire population (found above) are very close to each other. This is a good enough measure to show that the split samples are good representatives of the population for this dataset.
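# For comparison, a minimal sketch of the same 80:20 split using `train_test_split()` is shown below. It is not used further in this notebook; the `stratify` argument (an addition here, not part of the original split) keeps the class proportions equal across the two partitions.

# In[ ]:

from sklearn.model_selection import train_test_split

# Equivalent split done with scikit-learn; stratifying on the label preserves
# the spam/ham ratio in both partitions.
train_alt, test_alt = train_test_split(df, test_size=0.2, random_state=1, stratify=df.Label)
print(train_alt.Label.value_counts(normalize=True))
print(test_alt.Label.value_counts(normalize=True))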
# 
# The general idea for this classifier is that, for every word in the input, the probability of that word appearing in spam messages and in non-spam messages is calculated. These probabilities are then compared. If, cumulatively, the probabilities indicate that having this word or collection of words is strongly associated with spam (non-spam) messages, then the input is classified as spam (non-spam).
# 
# The above logic calls for cleaning the text so that only useful terms are kept, removing stopwords, punctuation and anything extra in the texts.

# In[74]:

train.SMS.head(10)

# In[146]:

train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train['SMS'] = train['SMS'].str.lower()
train.SMS.head(10)

# The normalized messages are used to create the `Term Document Matrix`. This is a mapping of every word to its message and the number of times the word occurs in that message. Think of it as a matrix with the vocabulary of all the texts as row indices and every text id as column indices. A value n (> 0) at index (row, col) means the word 'row' appears n times in the text 'col'.
# 
# The first step in building the `Term Document Matrix` is to create the vocabulary of the texts. The vocabulary is the set of all unique words across all the texts.

# In[104]:

messages = train.SMS.str.split()
words = list(chain(*messages))
vocabulary = pd.Series(words).unique()
len(vocabulary)

# In the train set, there are 7783 unique words for our vocabulary, retrieved from the messages.
# Using the vocabulary, the `Term Document Matrix` is derived. For ease of use, a dictionary representing the same concept is made.

# In[110]:

no_of_messages = len(train.SMS)
word_counts = {word: [0] * no_of_messages for word in vocabulary}

for index, mssge in enumerate(messages):
    for word in mssge:
        word_counts[word][index] += 1

# The above method builds a dictionary with every term in the vocabulary as a key and, for each term, a vector identifying which messages the term appears in and how many times, analogous to the `Term Document Matrix`.
# 
# The method above is a from-scratch way of doing this that does not take stop words into account. The same result can also be achieved via the `CountVectorizer` class from the `sklearn.feature_extraction.text` module, where the removal of stopwords is also handled.

# In[147]:

token = RegexpTokenizer(r'[a-z0-9A-Z]+')
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(1, 1),
    tokenizer=token.tokenize
)
counts = vectorizer.fit_transform(train['SMS'])
print(len(vectorizer.get_feature_names()))
vocabulary = vectorizer.get_feature_names()
word_counts = dict(zip(vectorizer.get_feature_names(), np.transpose(counts.toarray())))

# The vocabulary contains 7508 terms after removing the stopwords as well. Since the vocabulary is quite long, it is impossible to draw inferences from it directly. One solution is to create a `WordCloud` to get a gist of the data. The `WordCloud`s will be different for the *spam* messages and the *ham* (non-spam) messages, giving a sense of the kind of words present in both types of messages.
# 
# The `WordCloud` is present in the wordcloud package, installable with:
# 
#     pip install wordcloud
#     conda install wordcloud
# 
# The `WordCloud` class creates an image object, which is viewed using the `imshow()` function of `matplotlib`. The width, height and other parameters are set on the WordCloud object. There is a provision for removing stopwords as well: the `stopwords` parameter can be supplied with a collection of stopwords, here `stopwords.words('english')` from `nltk.corpus` (the wordcloud package also ships its own `STOPWORDS` set).
# 
# The first `WordCloud` is of the text messages labeled as *spam*.

# In[133]:

text = ""
for message in train[train.Label == 'spam'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "

# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200, height=800, stopwords=stopwords.words('english'), background_color='white').generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

# From the `WordCloud`, words like free, call, now, txt, tone, reply, mobile and text appear frequently in the *spam* messages of the training set. This is very intuitive, as these sorts of words do appear in the spam messages generally received.
# 
# The next word cloud is for the *ham* (non-spam) label.

# In[112]:

text = ""
for message in train[train.Label == 'ham'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "

# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200, height=800, stopwords=stopwords.words('english'), background_color='white').generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

# There are distinct differences in the variety and kind of words appearing for the two classes/labels. The words appearing in *ham* (non-spam) messages are generally common words, as the `WordCloud` shows: love, come, know, got, home, need, dear etc. The words in *spam* messages convey very direct actions/objects, like the ones mentioned earlier.

# For model purposes, the dictionary is converted to a DataFrame and appended to the initial dataset. This way the representation of each message becomes easy: every row is an individual text, and every column after the *Label* and *SMS* columns represents a word from the vocabulary.

# In[148]:

vector_space = pd.DataFrame(word_counts)
train = pd.concat([train, vector_space], axis=1)
train.head(3)

# Before the model building starts, the stopwords have to be removed from the *SMS* column permanently. The stopwords are usually present in all messages and are not relevant when distinguishing between *spam* and *ham* (non-spam) messages.

# The train dataset is ready for model building. The Naive Bayes algorithm's formula requires the following terms; precalculating them once saves time during model training:
# 
# * P(spam) - probability of spam
# * P(ham) - probability of non-spam
# * Nspam - number of words in spam messages
# * Nham - number of words in non-spam messages
# * Nvocab - number of words in vocabulary
# * Alpha - Laplace smoothing constant
# 
# Alpha is the Laplace smoothing constant. It is required because certain words of an input text may appear only in spam messages and not in non-spam messages; when calculating P(word|non-spam), this gives a 0. The inference is based on the formula:
# 
# $$\hat{y} = \underset{y}{\operatorname{argmax}} \; P(y) \prod_{i=1}^{n} P(w_i \mid y)$$
# 
# One 0 in the equation and the entire message is given a 0 probability. This is really a shortcoming of the limited vocabulary and limited texts in the dataset. To overcome this problem, the Laplace smoothing constant is applied so that a 0 never enters the equation.

# In[149]:

alpha = 1
P_spam = train.Label.value_counts(normalize=True)['spam']
P_ham = train.Label.value_counts(normalize=True)['ham']
N_spam = len(list(chain(*messages[train.Label == 'spam'])))
N_ham = len(list(chain(*messages[train.Label == 'ham'])))
N_vocab = len(vocabulary)

# The parameters
# 
# * P(wi|spam)
# * P(wi|ham)
# 
# where wi stands for each word in the vocabulary, give the probability of a word appearing in a message given the message label. With Laplace smoothing they are calculated as:
# 
# $$P(w_i \mid spam) = \frac{N_{w_i \mid spam} + \alpha}{N_{spam} + \alpha \cdot N_{vocab}} \qquad P(w_i \mid ham) = \frac{N_{w_i \mid ham} + \alpha}{N_{ham} + \alpha \cdot N_{vocab}}$$
# 
# These calculations are static, like the ones done previously; hence calculating them once avoids repeated work during the training phase.
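# As a quick illustration of the smoothing with made-up numbers (not taken from this dataset), consider a word that never occurs in ham messages.

# In[ ]:

# Hypothetical counts, purely for illustration.
N_word_in_ham = 0        # the word never appears in ham messages
N_ham_demo = 15000       # total words in ham messages (made up)
N_vocab_demo = 7500      # vocabulary size (made up)
alpha_demo = 1

# Without smoothing the probability would be 0 and would zero out the whole product;
# with smoothing it becomes a small but non-zero value.
p_unsmoothed = N_word_in_ham / N_ham_demo
p_smoothed = (N_word_in_ham + alpha_demo) / (N_ham_demo + alpha_demo * N_vocab_demo)
print(p_unsmoothed, p_smoothed)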
# 
# Two separate dictionaries, for *spam* and *ham* (non-spam), are maintained, mapping each word to its probability of appearing under the respective label.

# In[150]:

P_word_given_spam = {word: 0 for word in vocabulary}
P_word_given_ham = {word: 0 for word in vocabulary}

Spam_messages = train[train.Label == 'spam']
Ham_messages = train[train.Label == 'ham']

for word in vocabulary:
    N_word_given_spam = Spam_messages[word].sum()
    N_word_given_ham = Ham_messages[word].sum()
    P_word_given_spam[word] = (N_word_given_spam + alpha) / (N_spam + (alpha * N_vocab))
    P_word_given_ham[word] = (N_word_given_ham + alpha) / (N_ham + (alpha * N_vocab))

# All static components required to build the model are now calculated; this saves time during training.
# The next phase is to build the model. The model takes an input message and calculates the probabilities:
# 
# * P(spam|w1...wn)
# * P(ham|w1...wn)
# 
# Comparing the two: if the former is greater than the latter, the model classifies the message as *spam*; if they are equal, the model requires human help to classify the message; otherwise the message is classified as *ham* (non-spam).

# In[151]:

def classify(message, verbose=0):
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    spam = 1
    ham = 1
    for word in message:
        if word in P_word_given_spam.keys():
            spam *= P_word_given_spam[word]
        if word in P_word_given_ham.keys():
            ham *= P_word_given_ham[word]

    P_spam_given_message = P_spam * spam
    P_ham_given_message = P_ham * ham

    if verbose:
        print("P(spam|message) = ", P_spam_given_message)
        print("P(ham|message) = ", P_ham_given_message)

    if P_spam_given_message > P_ham_given_message:
        if verbose:
            print('Label: spam')
        return 'spam'
    elif P_spam_given_message < P_ham_given_message:
        if verbose:
            print('Label: ham')
        return 'ham'
    else:
        if verbose:
            print('Human assistance needed, equal probabilities')
        return 'Not classified'

# The function, as specified before, accepts an input message and classifies it. The verbose parameter produces printed output at every step of the function; it is useful for debugging or for understanding how the function works.
# 
# The function is tested on the following inputs as a preliminary check:
# 
# Spam - 'WINNER!! This is the secret code to unlock the money: C3421.'
# 
# Ham (non-spam) - "Sounds good, Tom, then see u there"

# In[154]:

classify('WINNER!! This is the secret code to unlock the money: C3421.', verbose=1)

# In[155]:

classify('Sounds good, Tom, then see u there', verbose=1)

# The model looks good enough from the preliminary tests. As the final part of the project, the model has to be tested on the test set and inferences drawn from the predictions.
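# As an aside before moving to the test set: multiplying many small probabilities can underflow to 0.0 for very long messages. A common remedy, not used in the classifier above, is to compare log-probabilities instead. A minimal sketch, assuming the same precomputed dictionaries, is shown below.

# In[ ]:

import math

def classify_log(message):
    # Same cleaning steps as classify(), but probabilities are summed in log space
    # instead of multiplied, which avoids numerical underflow.
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(P_spam)
    log_ham = math.log(P_ham)
    for word in words:
        if word in P_word_given_spam:
            log_spam += math.log(P_word_given_spam[word])
        if word in P_word_given_ham:
            log_ham += math.log(P_word_given_ham[word])
    return 'spam' if log_spam > log_ham else 'ham'

classify_log('WINNER!! This is the secret code to unlock the money: C3421.')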
# The predictions are stored in a new column, *predicted*, to compare with the actual *Label* column.

# In[158]:

test['predicted'] = test.SMS.apply(classify, verbose=0)  # try verbose=1 to see the output of the model for every message
test.head(5)

# In[159]:

test.predicted.value_counts(normalize=True)

# Out of the entire test sample, the model classified about 85% as *ham* and about 14% as *spam*. The model seems to have done fairly well, as these proportions are close to the sample's class proportions. This alone, however, says nothing about misclassified labels.
# 
# To be clearer about the effectiveness of the model, accuracy is a good metric: it describes the proportion of labels correctly classified by the model. The accuracy is simply the number of matches between the `Label` and the `predicted` columns divided by the total number of units in the test sample.

# In[160]:

accuracy = sum(test.Label == test.predicted) / len(test)
accuracy

# The model has achieved a phenomenal accuracy of about 98% on the test set. It may seem too good to be true, but this is partly a result of the small number of data points used for training and the limited variance in the vocabulary of the population.
# 
# The accuracy can be calculated on the train set as well.

# In[162]:

train['predicted'] = train.SMS.apply(classify, verbose=0)
accuracy = sum(train.Label == train.predicted) / len(train)
accuracy

# In[163]:

train.predicted.value_counts(normalize=True)

# Accuracies give a general idea about the model, but not the entire picture. A confusion matrix gives information about the true positives and true negatives (correct predictions) and the false positives and false negatives (incorrect predictions) of the model. From these, important metrics to report are the *Precision*, *Recall* and the *F1 score*.
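# As a complement to the report below, the raw confusion matrix can be inspected directly. This short sketch uses `confusion_matrix` from `sklearn.metrics`, which is not part of the original notebook.

# In[ ]:

from sklearn.metrics import confusion_matrix

# Rows are the true labels ('ham', 'spam'), columns the predicted labels;
# any 'Not classified' predictions are simply left out of the matrix.
print(confusion_matrix(test.Label, test.predicted, labels=['ham', 'spam']))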
# 
# The `sklearn.metrics` module provides the built-in function `classification_report`, which is generated from the true labels and the predicted labels.

# In[186]:

print(classification_report(test.Label, test.predicted))

# The model manages to get about 99% accuracy on the train set, with similarly good precision and recall on the test set.
# Since the accuracy is not 100% for either the test or the train set, certain messages were misclassified. It is good practice to review the misclassified data, as the errors could be due to erroneous data or, in this case, uncleaned text.
# 
# The test messages were cleaned inside the `classify()` function, so to review the misclassified messages they have to be cleaned externally.

# In[183]:

def clean(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    return message

test.SMS = test.SMS.apply(clean)
missclassified = test[test.Label != test.predicted]
missclassified

# The easier way to visualize texts is a `WordCloud`. Below, `WordCloud`s are generated for both classes, using only the texts that were misclassified.
# 
# The first one is for the *ham* (non-spam) messages that were classified as *spam*.

# In[177]:

text = ""
for message in missclassified[missclassified.Label == 'ham'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "

# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200, height=800, stopwords=stopwords.words('english'), background_color='white').generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

# The `WordCloud` shows the most frequent words in the misclassified *ham* (non-spam) messages. Words like call, phone, connected, messages, executive and mobile are similar to words usually found in *spam* messages. In short, the model misclassified those *ham* (non-spam) messages that had a more formal tone or in some way mimicked *spam* messages.
# 
# The next plot is for the misclassified *spam* label.

# In[179]:

text = ""
for message in missclassified[missclassified.Label == 'spam'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "

# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200, height=800, stopwords=stopwords.words('english'), background_color='white').generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

# The plot shows a lot of words that are not part of the vocabulary, such as u4, luv, knickers etc. Because these words do not appear in the vocabulary, the *ham* words outweighed the *spam* words: the message is essentially spam, but the words that could actually contribute to classifying it as spam are not present in the vocabulary, so the in-vocabulary words pushed the classification towards *ham* (non-spam).

# The misclassified messages expose the main problem of this model: it is oblivious to the context hidden in the messages. It only considers each word individually and checks whether that word leans towards spam or not, whereas the individual words often do not convey the whole story.
# In conclusion, a successful spam-filter model was created using the Naive Bayes algorithm. The model assumes independence between the words in a message (given its label) and is oblivious to the context of the message.

# Based only on this assumption, the model performed very well for this dataset, with the following (approximate) accuracies:
# 
# * Training accuracy - 98.94%
# * Testing accuracy - 98.02%
# 
# The model achieves good precision and recall scores. However, it is important to note that *spam* messages have a lower recall, which can be attributed to the imbalance in the classes of the target variable.
# 
# Overall, the model has learnt the training data well and has performed well on the test set for the text messages in the dataset used.