Spam Filter using Naive Bayes

Building a spam filter from scratch using the Naive Bayes algorithm

In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (but naive) independence assumptions between the features. In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event.

The aim of this project is to build a spam filter from scratch using the Naive Bayes algorithm. Leveraging Bayes' theorem, the model predicts the probability that a given text is spam or non-spam, based on the training data it is exposed to.
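For reference, Bayes' theorem in its basic form is:

P(A|B) = P(B|A) * P(A) / P(B)

In this project, A is the class of a message (spam or ham) and B is the message itself, so the theorem gives the probability of a class given the observed words.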

The dataset consists of text messages and a target variable specifying whether each text message was spam or not. The dataset is taken from the UCI Machine Learning Repository.

The columns in the dataset are not named, hence the names given - Label and SMS.

-> Label - target variable spam and ham (non-spam)
-> SMS - the message

The dataset should be explored before building the model, so that any discrepancies in the text can be removed.

In [184]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from itertools import chain
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator
from PIL import Image
In [143]:
df = pd.read_csv('SMSSpamCollection',sep='\t',header=None)
df.columns = ['Label','SMS']
print(df.shape)
df.head(5)
(5572, 2)
Out[143]:
Label SMS
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

Whenever building a classifier, it is important to identify the proportion of the targets and know beforehand whether the classifier is going to be exposed to imbalanced data. In this case, the proportion of spam to ham (non-spam) messages is what matters. A countplot captures this statistic well.

In [144]:
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.countplot(df.Label)
plt.ylabel('')
plt.yticks([])
df.Label.value_counts(normalize=True)
Out[144]:
ham     0.865937
spam    0.134063
Name: Label, dtype: float64

The dataset is highly imbalanced, i.e. the classes of the target variable are not in equal proportion. There are multiple ways to handle this problem. Most commonly, the dataset can be under-sampled, i.e. random samples of the majority class are drawn to match the size of the minority class, or over-sampled, i.e. the minority class points are increased in number via interpolation to compete with the majority class. For this project, neither technique is used: since the classifier is a probabilistic model and probability is relative, it makes up for the imbalance. An under-sampling sketch is shown below for illustration.
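For illustration only (neither technique is used in this project), a minimal under-sampling sketch with pandas; spam_only, ham_sample and balanced are hypothetical names:

spam_only = df[df.Label == 'spam']
# draw as many random ham messages as there are spam messages
ham_sample = df[df.Label == 'ham'].sample(n=len(spam_only), random_state=1)
# concatenate and shuffle; both classes are now at 50%
balanced = pd.concat([spam_only, ham_sample]).sample(frac=1, random_state=1)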

It is very important to test the model to validate it and get feedback. The dataset here is split 80:20: the train data is 80% and the test data 20% of the entire population, all randomly selected. The sampling is done using the sample function of pandas. An easier method would be the built-in train_test_split() function of the sklearn.model_selection module (a sketch follows the split below).

In [145]:
randomized = df.sample(frac=1,random_state=1)
train_size = round(len(randomized) * 0.8)
train = randomized[:train_size].reset_index(drop=True)
test = randomized[train_size:].reset_index(drop=True)
print(train.shape)
print(test.shape)
(4458, 2)
(1114, 2)
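For reference, a minimal sketch of the sklearn alternative mentioned above; the stratify argument used here is an extra (not part of the original approach) that keeps the class proportions equal across the two splits:

from sklearn.model_selection import train_test_split

# an 80:20 split, stratified so both splits keep the class proportions
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1, stratify=df['Label'])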

A sample of a population has to be representative of the population, otherwise the results obtained can be faulty or skewed. Thus it becomes very important to check this criterion before moving forward with the project.

In [72]:
train.Label.value_counts(normalize=True)
Out[72]:
ham     0.86541
spam    0.13459
Name: Label, dtype: float64
In [73]:
test.Label.value_counts(normalize=True)
Out[73]:
ham     0.868043
spam    0.131957
Name: Label, dtype: float64

Comparing the splits, the class proportions of the two splits as well as of the entire population (found above) are very close to each other. This is a good enough measure to show that the split samples are representative of the population for this dataset.

The general idea for this classifier is that, for every word in the input, the probability of that word appearing in spam messages and in non-spam messages is calculated. These probabilities are then compared. If they cumulatively indicate that this word or collection of words is strongly associated with spam (non-spam) messages, the input is classified as spam (non-spam).

The above logic calls for cleaning the text so as to keep only the useful terms and remove stopwords, punctuation and anything extra in the texts.

In [74]:
train.SMS.head(10)
Out[74]:
0                         Yep, by the pretty sculpture
1        Yes, princess. Are you going to make me moan?
2                           Welp apparently he retired
3                                              Havent.
4    I forgot 2 ask ü all smth.. There's a card on ...
5    Ok i thk i got it. Then u wan me 2 come now or...
6    I want kfc its Tuesday. Only buy 2 meals ONLY ...
7                           No dear i was sleeping :-P
8                            Ok pa. Nothing problem:-)
9                      Ill be there on  <#>  ok.
Name: SMS, dtype: object
In [146]:
train.SMS = train['SMS'].str.replace(r'\W',' ',regex=True)
train.SMS = train['SMS'].str.lower()
train.SMS.head(10)
Out[146]:
0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
5    ok i thk i got it  then u wan me 2 come now or...
6    i want kfc its tuesday  only buy 2 meals only ...
7                           no dear i was sleeping   p
8                            ok pa  nothing problem   
9                      ill be there on   lt   gt   ok 
Name: SMS, dtype: object

The normalized messages are used to create the Term Document Matrix. This is a mapping of every word to the messages it occurs in and the number of times it occurs. Think of it as a matrix with the vocabulary of all the texts as row indices and every text id as column indices. A value n (> 0) at index (row,col) means the word 'row' appears n times in the text 'col'.
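For instance, for the two toy messages 'free money now' (text 0) and 'call me now' (text 1), the matrix would be:

         0   1
free     1   0
money    1   0
now      1   1
call     0   1
me       0   1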

The first step to building the Term Document Matrix is to create the vocabulary of the texts. The vocabulary is a set of all unique words in all the texts.

In [104]:
messages = train.SMS.str.split()
words = list(chain(*messages))
vocabulary = pd.Series(words).unique()
len(vocabulary)
Out[104]:
7783

In the train set, there are 7783 unique words in our vocabulary, all retrieved from the messages.
Using the vocabulary, the Term Document Matrix is derived. For ease of use, a dictionary representing the same concept is made.

In [110]:
no_of_messages = len(train.SMS)
word_counts = {word: [0] * no_of_messages for word in vocabulary}

for index,mssge in enumerate(messages):
    for word in mssge:
        word_counts[word][index] += 1

The above method builds a dictionary with every term in the vocabulary as key and, for each term, a vector identifying which messages the term appears in and how many times - analogous to the Term Document Matrix.

The afore-mentioned method is a way of doing this from scratch, but it doesn't take the stop words into account. The same can be achieved via the CountVectorizer class in the sklearn.feature_extraction.text module, which also handles the removal of stopwords.

In [147]:
token = RegexpTokenizer(r'[a-z0-9A-Z]+')
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(1,1),
    tokenizer=token.tokenize
)
counts = vectorizer.fit_transform(train['SMS'])
vocabulary = vectorizer.get_feature_names()
print(len(vocabulary))

word_counts = dict(zip(vocabulary,np.transpose(counts.toarray())))
7508

The vocabulary contains 7508 terms once the stopwords are removed as well. Since the vocabulary is quite long, it is impossible to draw inferences from it directly. One solution is to create a WordCloud to get a gist of the data. The WordClouds will be different for the spam messages and the ham (non-spam) messages, giving a sense of the kind of words present in both types of messages.

The WordCloud class is present in the wordcloud package, installable via either:

pip install wordcloud
conda install wordcloud

The WordCloud function creates an image object, which is viewed using the imshow() function of matplotlib. The width, height and other parameters are set on the wordcloud object. There is a provision for removing stopwords as well: the stopwords parameter can be supplied with the STOPWORDS set shipped with the wordcloud package or, as done here, with the stopword list from nltk.corpus.

The first WordCloud is of the text messages, labeled as spam.

In [133]:
text = ""

for message in train[train.Label == 'spam'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "
    
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
Out[133]:
(-0.5, 1199.5, 799.5, -0.5)

From the WordCloud, words like free, call, now, txt, tone, reply, mobile and text appear frequently in the spam messages of the training set. This is very intuitive, as in fact these sorts of words do appear in the spam messages generally received.

The next word cloud is for the ham (non-spam) label.

In [112]:
text = ""

for message in train[train.Label == 'ham'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "
    
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
Out[112]:
(-0.5, 1199.5, 799.5, -0.5)

There are distinct differences in the variety and kind of words appearing for the two classes/labels. The words appearing in ham (non-spam) messages are generally common words, as the WordCloud shows - love, come, know, got, home, need, dear etc. The ones in spam messages state very direct actions/objects, like the words mentioned above.

For modelling purposes, the dictionary is converted to a DataFrame and appended to the initial dataset. This way the representation of each message becomes easy: every row is an individual text and every column after the Label and SMS columns represents a word from the vocabulary.

In [148]:
vector_space = pd.DataFrame(word_counts)
train = pd.concat([train,vector_space],axis=1)
train.head(3)
Out[148]:
Label SMS 0 00 000 000pes 008704050406 0089 01223585334 02 ... zealand zebra zed zeros zhong zindgi zoe zogtorius zouk zyada
0 ham yep by the pretty sculpture 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 ham yes princess are you going to make me moan 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 ham welp apparently he retired 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 7510 columns

Before the model building starts, the stopwords have to be removed from the SMS column permanently. Stopwords are usually present in all messages and are not relevant when distinguishing between spam and ham (non-spam) messages; a sketch of this step is shown below.
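A minimal sketch of that step, reusing the nltk stopword list imported above and assuming the messages are already lowercased and stripped of punctuation (stop_set is a hypothetical name):

stop_set = set(stopwords.words('english'))
# keep only the words that are not stopwords
train.SMS = train.SMS.apply(lambda m: ' '.join(w for w in m.split() if w not in stop_set))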

The train dataset is ready for model building. The Naive Bayes formula requires the following terms; precalculating them once saves time during model training:

  • P(spam) - probability of spam
  • P(ham) - probability of non-spam
  • Nspam - number of words in spam messages
  • Nham - number of words in non-spam messages
  • Nvocab - number of words in vocabulary
  • Alpha - Laplace smoothing constant

Alpha is the Laplace smoothing constant. It is required because certain words of an input text may appear only in spam messages and not in non-spam messages (or vice versa). When calculating P(word|non-spam) for such a word, the result is 0. The inference is based on the following formula:

y_pred = argmax_y P(y) * P(w1|y) * P(w2|y) * … * P(wn|y)

A single 0 in the product gives the entire message a probability of 0. This is really a shortcoming of the limited vocabulary and limited texts in the dataset. To overcome the problem, the Laplace smoothing constant is applied so that a 0 never enters the equation.
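Concretely, the smoothed per-word probabilities computed later are:

P(wi|spam) = (N_wi,spam + alpha) / (N_spam + alpha * N_vocab)
P(wi|ham) = (N_wi,ham + alpha) / (N_ham + alpha * N_vocab)

where N_wi,spam is the number of times the word wi occurs across all spam messages. With alpha > 0, no factor in the product can ever be exactly 0.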

In [149]:
alpha = 1
P_spam = train.Label.value_counts(normalize=True)['spam']
P_ham = train.Label.value_counts(normalize=True)['ham']
N_spam = len(list(chain(*messages[train.Label == 'spam'])))
N_ham = len(list(chain(*messages[train.Label == 'ham'])))
N_vocab = len(vocabulary)

The parameters

  • P(wi|spam)
  • P(wi|ham)

where wi stands for each word in the vocabulary, give the probability of a word appearing in a message given its label. These calculations are static, like the ones done previously; hence, calculating them once avoids redundant work during the training phase.

Two separate dictionaries, for spam and ham (non-spam) are maintained, mapping word to its probability of appearing in the respective label.

In [150]:
P_word_given_spam = {word: 0 for word in vocabulary}
P_word_given_ham = {word: 0 for word in vocabulary}

Spam_messages = train[train.Label == 'spam']
Ham_messages = train[train.Label == 'ham']

for word in vocabulary:
    
    N_word_given_spam = Spam_messages[word].sum()
    N_word_given_ham = Ham_messages[word].sum()
    
    P_word_given_spam[word] = (
        (N_word_given_spam + alpha) / (N_spam + (alpha * N_vocab))
    )
    
    P_word_given_ham[word] = (
        (N_word_given_ham + alpha) / (N_ham + (alpha * N_vocab))
    )

All static components required to build the model are calculated; this saves time during training.
The next phase is to build the model. The model takes an input message and calculates the probabilities:

  • P(spam|w1…wn)
  • P(ham|w1…wn)

Comparing the two, if the former is greater than the latter, the model classifies the message as spam; if they are equal, the model requires human help to classify the message; otherwise the message is classified as ham (non-spam).

In [151]:
def classify(message,verbose):
    
    message = re.sub(r'\W',' ',message)
    message = message.lower()
    message = message.split()
    
    spam = 1
    ham = 1
    
    for word in message:
        if word in P_word_given_spam:
            spam *= P_word_given_spam[word]
        if word in P_word_given_ham:
            ham *= P_word_given_ham[word]
        
    P_spam_given_message = P_spam * spam
    P_ham_given_message = P_ham * ham
    
    if verbose:
        print("P(spam|message) = ",P_spam_given_message)
        print("P(ham|message) = ",P_ham_given_message)
    
    if P_spam_given_message > P_ham_given_message:
        if verbose:
            print('Label: spam')
        return 'spam'
    elif P_spam_given_message < P_ham_given_message:
        if verbose:
            print('Label: ham')    
        return 'ham'
    else:
        if verbose:
            print('Human assistance needed, equal probabilities')
        return 'Not classified'

The function, as specified before, accepts an input message and classifies it. The verbose parameter prints output at every step of the function; it is useful for debugging or understanding how the function works.
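One practical caveat, not addressed in the original function: multiplying many probabilities smaller than 1 can underflow to 0.0 for very long messages. A minimal sketch of the standard remedy, summing log-probabilities instead (classify_log is a hypothetical name, not part of the project code):

import math

def classify_log(message):
    # same preprocessing as classify(), then compare sums of logs
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(P_spam)
    log_ham = math.log(P_ham)
    for word in words:
        if word in P_word_given_spam:
            log_spam += math.log(P_word_given_spam[word])
        if word in P_word_given_ham:
            log_ham += math.log(P_word_given_ham[word])
    # log is monotonic, so the comparison yields the same label
    return 'spam' if log_spam > log_ham else 'ham'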

The function is tested on the following inputs as a preliminary test:

Spam - 'WINNER!! This is the secret code to unlock the money: C3421.'
Ham (non-spam) - "Sounds good, Tom, then see u there"
In [154]:
classify('WINNER!! This is the secret code to unlock the money: C3421.',verbose=1)
P(spam|message) =  6.814909426971065e-15
P(ham|message) =  1.772963190712378e-17
Label: spam
Out[154]:
'spam'
In [155]:
classify('Sounds good, Tom, then see u there',verbose=1)
P(spam|message) =  1.5820325455468539e-15
P(ham|message) =  2.7059850698247674e-13
Label: ham
Out[155]:
'ham'

The model looks good enough from the preliminary tests. As the final part of the project, the model has to be tested on the test set and inferences have to be drawn from the predictions.
The predictions are stored in a new column, predicted, to compare with the actual Label column.

In [158]:
test['predicted'] = test.SMS.apply(classify,verbose=0) # set verbose=1 to see the model's intermediate output
test.head(5)
Out[158]:
Label SMS predicted
0 ham Later i guess. I needa do mcat study too. ham
1 ham But i haf enuff space got like 4 mb... ham
2 spam Had your mobile 10 mths? Update to latest Oran... spam
3 ham All sounds good. Fingers . Makes it difficult ... ham
4 ham All done, all handed in. Don't know if mega sh... ham
In [159]:
test.predicted.value_counts(normalize=True)
Out[159]:
ham     0.857271
spam    0.142729
Name: predicted, dtype: float64

Out of the entire test sample, the model classified about 86% as ham and about 14% as spam. The model seems to have done pretty well, as these proportions are close to the sample's class proportions. But this doesn't account for misclassified labels.

To be clearer about the effectiveness of the model, accuracy is a good metric: it describes the proportion of labels correctly classified by the model. The accuracy is simply the number of matches between the Label and predicted columns divided by the total number of units in the test sample.

In [160]:
accuracy = sum(test.Label == test.predicted)/len(test)
accuracy
Out[160]:
0.9802513464991023

The model has achieved a phenomenal accuracy of about 98% on the test set. It obviously seems too good to be true, but this is a result of the small number of training data points and the limited variance in the vocabulary of the population.

The accuracy can be calculated on the train set as well.

In [162]:
train['predicted'] = train.SMS.apply(classify,verbose=0)
accuracy = sum(train.Label == train.predicted)/len(train)
accuracy
Out[162]:
0.9894571556751907
In [163]:
train.predicted.value_counts(normalize=True)
Out[163]:
ham     0.856662
spam    0.143338
Name: predicted, dtype: float64

Accuracies give a general idea about the model, but not the entire picture. A confusion matrix gives information about the true positives and true negatives (correct predictions) and the false positives and false negatives (incorrect predictions) of the model. From these, the important metrics to report are the Precision, Recall and F1 score; a sketch for printing the raw confusion matrix is shown below.
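For reference, the raw confusion matrix can be printed with sklearn's confusion_matrix; rows are true labels and columns are predictions:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(test.Label, test.predicted, labels=['ham', 'spam']))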

The sklearn.metrics module provides the inbuilt function classification_report which is generated based on the true labels and the predicted labels.

In [186]:
print(classification_report(test.Label,test.predicted))
              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       967
        spam       0.89      0.97      0.93       147

    accuracy                           0.98      1114
   macro avg       0.94      0.97      0.96      1114
weighted avg       0.98      0.98      0.98      1114

The model achieves about 99% accuracy on the train set and about 98% on the test set, with equally good precision and recall.
Since neither is 100%, certain messages were misclassified. It is good practice to review the misclassified data, as misclassification could be due to erroneous data or, in this case, uncleaned text.

The test messages were cleaned inside the classify() function, so to review the misclassified messages they have to be cleaned externally.

In [183]:
def clean(message):
    
    message = re.sub(r'\W',' ',message)
    message = message.lower()
    
    return message

test.SMS = test.SMS.apply(clean)
missclassified = test[test.Label != test.predicted]
missclassified
Out[183]:
Label SMS predicted
9 ham i liked the new mobile spam
18 ham and picking them up from various points spam
114 spam not heard from u4 a while call me now am here... ham
130 ham how was txting and driving spam
131 ham they have a thread on the wishlist section of ... spam
152 ham unlimited texts limited minutes spam
159 ham 26th of july spam
284 ham nokia phone is lovly spam
302 ham no calls messages missed calls spam
313 ham speaking of does he have any cash yet spam
319 ham we have sent jd for customer service cum accou... spam
363 spam email alertfrom jeri stewartsize 2kbsubject ... ham
398 ham hasn t that been the pattern recently crap wee... spam
492 ham madam regret disturbance might receive a refer... spam
660 ham any chance you might have had with me evaporat... spam
754 ham i am getting threats from your sales executive... spam
824 ham these won t do have to move on to morphine spam
876 spam rct thnq adrian for u text rgds vatian ham
885 spam 2 2 146tf150p ham
923 ham how much would it cost to hire a hitman spam
953 spam hello we need some posh birds and chaps to us... ham
1073 ham i career tel have added u as a contact on in... spam

An easier method to visualize texts is a WordCloud. Below, WordClouds are generated for both classes, only for the texts that were misclassified.

The first one is for the ham (non-spam) messages that were classified as spam.

In [177]:
text = ""

for message in missclassified[missclassified.Label == 'ham'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "
    
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
Out[177]:
(-0.5, 1199.5, 799.5, -0.5)