A Spam Filter with Naive Bayes

Introduction

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the The UCI Machine Learning Repository here. You can also download the dataset directly from this link. The data collection process is described in more details on this page, where you can also find some of the authors' papers.

Project Goal

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

In [1]:
# import all 'libraries' required for this project.
import pandas as pd
import numpy as np
import random
from numpy.random import seed, randint
from IPython.display import HTML
from IPython.display import display, Markdown

messages_df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
messages_df.columns=['Label', 'SMS']
print(messages_df.head(), '\n')
print(messages_df.info(), '\n')
print(messages_df['Label'].value_counts(), '\n')

print('The total number of messages are:', len(messages_df), '\n')

ham = messages_df[messages_df['Label'] == 'ham']
spam = messages_df[messages_df['Label'] == 'spam']

percent_ham = (100*len(ham))/len(messages_df)
percent_spam = 100 - percent_ham
print('The percentage of messages classified as ham are', "{:.2f}%".format(percent_ham), '\n')
print('The percentage of messages classified as spam are', "{:.2f}%".format(percent_spam))
  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro... 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
None 

ham     4825
spam     747
Name: Label, dtype: int64 

The total number of messages are: 5572 

The percentage of messages classified as ham are 86.59% 

The percentage of messages classified as spam are 13.41%

Training Phase

In [2]:
messages_df = messages_df.sample(frac=1, random_state=1)
print(messages_df.head())

percentage=round(len(messages_df)/100*80) 
training_df = messages_df.head(percentage)  
test_df = messages_df.iloc[percentage:len(messages_df),:]

print(len(training_df), '\n')
print(len(test_df), '\n')

display(training_df.head())
display(test_df.head())

hamtr = training_df[training_df['Label'] == 'ham']
spamtr = training_df[training_df['Label'] == 'spam']
p_training_ham = (100*len(hamtr))/len(training_df)
p_training_spam = 100 - p_training_ham

display(Markdown('<h2>Training Data</h2>'))
print("{:.2f}%".format(p_training_ham), 'ham    ', "{:.2f}%".format(p_training_spam), 'spam', '\n')

hamt = test_df[test_df['Label'] == 'ham']
spamt = test_df[test_df['Label'] == 'spam']
p_test_ham = (100*len(hamt))/len(test_df)
p_test_spam = 100 - p_test_ham

display(Markdown('<h2>Test Data</h2>'))
print("{:.2f}%".format(p_test_ham), 'ham    ', "{:.2f}%".format(p_test_spam), 'spam', '\n')

training_df = training_df.reset_index()
display(training_df.head())
test_df = test_df.reset_index()
display(test_df.head())
     Label                                                SMS
1078   ham                       Yep, by the pretty sculpture
4028   ham      Yes, princess. Are you going to make me moan?
958    ham                         Welp apparently he retired
4642   ham                                            Havent.
4674   ham  I forgot 2 ask ü all smth.. There's a card on ...
4458 

1114 

Label SMS
1078 ham Yep, by the pretty sculpture
4028 ham Yes, princess. Are you going to make me moan?
958 ham Welp apparently he retired
4642 ham Havent.
4674 ham I forgot 2 ask ü all smth.. There's a card on ...
Label SMS
2131 ham Later i guess. I needa do mcat study too.
3418 ham But i haf enuff space got like 4 mb...
3424 spam Had your mobile 10 mths? Update to latest Oran...
1538 ham All sounds good. Fingers . Makes it difficult ...
5393 ham All done, all handed in. Don't know if mega sh...

Training Data

86.54% ham     13.46% spam 

Test Data

86.80% ham     13.20% spam 

index Label SMS
0 1078 ham Yep, by the pretty sculpture
1 4028 ham Yes, princess. Are you going to make me moan?
2 958 ham Welp apparently he retired
3 4642 ham Havent.
4 4674 ham I forgot 2 ask ü all smth.. There's a card on ...
index Label SMS
0 2131 ham Later i guess. I needa do mcat study too.
1 3418 ham But i haf enuff space got like 4 mb...
2 3424 spam Had your mobile 10 mths? Update to latest Oran...
3 1538 ham All sounds good. Fingers . Makes it difficult ...
4 5393 ham All done, all handed in. Don't know if mega sh...
In [3]:
import string 

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
# Apply to the DF series
training_df['SMS_Cleaned'] = training_df['SMS'].apply(remove_punctuations)
test_df['SMS_Cleaned'] = test_df['SMS'].apply(remove_punctuations)
display(Markdown('<h2>Training Data</h2>'))
display(training_df.head())
display(Markdown('<h2>Test Data</h2>'))
display(test_df.head())
display(Markdown('<h2>Training Data</h2>'))
training_df['SMS_Cleaned'] = training_df['SMS_Cleaned'].str.lower()
display(training_df.head())
test_df['SMS_Cleaned'] = test_df['SMS_Cleaned'].str.lower()
display(Markdown('<h2>Test Data</h2>'))
display(test_df.head())

Training Data

index Label SMS SMS_Cleaned
0 1078 ham Yep, by the pretty sculpture Yep by the pretty sculpture
1 4028 ham Yes, princess. Are you going to make me moan? Yes princess Are you going to make me moan
2 958 ham Welp apparently he retired Welp apparently he retired
3 4642 ham Havent. Havent
4 4674 ham I forgot 2 ask ü all smth.. There's a card on ... I forgot 2 ask ü all smth Theres a card on da ...

Test Data

index Label SMS SMS_Cleaned
0 2131 ham Later i guess. I needa do mcat study too. Later i guess I needa do mcat study too
1 3418 ham But i haf enuff space got like 4 mb... But i haf enuff space got like 4 mb
2 3424 spam Had your mobile 10 mths? Update to latest Oran... Had your mobile 10 mths Update to latest Orang...
3 1538 ham All sounds good. Fingers . Makes it difficult ... All sounds good Fingers Makes it difficult to...
4 5393 ham All done, all handed in. Don't know if mega sh... All done all handed in Dont know if mega shop ...

Training Data

index Label SMS SMS_Cleaned
0 1078 ham Yep, by the pretty sculpture yep by the pretty sculpture
1 4028 ham Yes, princess. Are you going to make me moan? yes princess are you going to make me moan
2 958 ham Welp apparently he retired welp apparently he retired
3 4642 ham Havent. havent
4 4674 ham I forgot 2 ask ü all smth.. There's a card on ... i forgot 2 ask ü all smth theres a card on da ...

Test Data

index Label SMS SMS_Cleaned
0 2131 ham Later i guess. I needa do mcat study too. later i guess i needa do mcat study too
1 3418 ham But i haf enuff space got like 4 mb... but i haf enuff space got like 4 mb
2 3424 spam Had your mobile 10 mths? Update to latest Oran... had your mobile 10 mths update to latest orang...
3 1538 ham All sounds good. Fingers . Makes it difficult ... all sounds good fingers makes it difficult to...
4 5393 ham All done, all handed in. Don't know if mega sh... all done all handed in dont know if mega shop ...
In [4]:
# create an empty list to store each unique work for each meassage.
vocabulary = []

training_df['SMS_Cleaned'] = training_df['SMS_Cleaned'].str.split()
display(training_df.head())

for sms in training_df['SMS_Cleaned']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = set(vocabulary)
print(len(vocabulary))
vocabulary = list(vocabulary)
print(len(vocabulary))
index Label SMS SMS_Cleaned
0 1078 ham Yep, by the pretty sculpture [yep, by, the, pretty, sculpture]
1 4028 ham Yes, princess. Are you going to make me moan? [yes, princess, are, you, going, to, make, me,...
2 958 ham Welp apparently he retired [welp, apparently, he, retired]
3 4642 ham Havent. [havent]
4 4674 ham I forgot 2 ask ü all smth.. There's a card on ... [i, forgot, 2, ask, ü, all, smth, theres, a, c...
8515
8515
In [5]:
word_counts_per_sms = {unique_word: [0] * len(training_df['SMS_Cleaned']) for unique_word in vocabulary}

for index, sms in enumerate(training_df['SMS_Cleaned']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts = pd.DataFrame(word_counts_per_sms)
print(len(word_counts))
pd.set_option("display.max_columns", 10)
display(word_counts.head())
4458
totes rs opener jaykwon father ... but wishes gigolo hurried httpwwwetlpcoukexpressoffer
0 0 0 0 0 0 ... 0 0 0 0 0
1 0 0 0 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 0 0 0 0 0
3 0 0 0 0 0 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0

5 rows × 8515 columns

In [6]:
training_df_clean = pd.concat([training_df, word_counts], axis=1)
training_df_clean.head()
Out[6]:
index Label SMS SMS_Cleaned totes ... but wishes gigolo hurried httpwwwetlpcoukexpressoffer
0 1078 ham Yep, by the pretty sculpture [yep, by, the, pretty, sculpture] 0 ... 0 0 0 0 0
1 4028 ham Yes, princess. Are you going to make me moan? [yes, princess, are, you, going, to, make, me,... 0 ... 0 0 0 0 0
2 958 ham Welp apparently he retired [welp, apparently, he, retired] 0 ... 0 0 0 0 0
3 4642 ham Havent. [havent] 0 ... 0 0 0 0 0
4 4674 ham I forgot 2 ask ü all smth.. There's a card on ... [i, forgot, 2, ask, ü, all, smth, theres, a, c... 0 ... 0 0 0 0 0

5 rows × 8519 columns

In [7]:
# Isolating spam and ham messages first
spam_messages = training_df_clean[training_df_clean['Label'] == 'spam']
ham_messages = training_df_clean[training_df_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_df_clean)
p_ham = len(ham_messages) / len(training_df_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1
In [8]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
In [9]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub('\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')
In [10]:
classify('You\'ve been chosen a winner. Click on the number below to access your money.')
P(Spam|message): 3.670601065218682e-50
P(Ham|message): 1.0688398108249164e-53
Label: Spam
In [11]:
classify('Hey Bill, let\'s celebrate your lottery win')
P(Spam|message): 3.560359302507492e-30
P(Ham|message): 8.160585293334815e-29
Label: Ham

Test the Spam Filter

In [12]:
def classify_test_df(message):    
    '''
    message: a string
    '''
    
    message = re.sub('\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'
In [13]:
pd.reset_option("display.max_columns")
test_df['predicted'] = test_df['SMS_Cleaned'].apply(classify_test_df)
test_df.head()
Out[13]:
index Label SMS SMS_Cleaned predicted
0 2131 ham Later i guess. I needa do mcat study too. later i guess i needa do mcat study too ham
1 3418 ham But i haf enuff space got like 4 mb... but i haf enuff space got like 4 mb ham
2 3424 spam Had your mobile 10 mths? Update to latest Oran... had your mobile 10 mths update to latest orang... spam
3 1538 ham All sounds good. Fingers . Makes it difficult ... all sounds good fingers makes it difficult to... ham
4 5393 ham All done, all handed in. Don't know if mega sh... all done all handed in dont know if mega shop ... ham
In [14]:
correct = 0
total = test_df.shape[0]
    
for row in test_df.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)
Correct: 1094
Incorrect: 20
Accuracy: 0.9820466786355476

Conclusion

The spam filter developed by using Multinomial Naive Bayes in this case is very reliable at 98.2%.