Building a spam filter using naive bayes algorithm

Introduction

Here in this guided project, we are going to build a spam filter using the practical side of naive bayes algorithm. The dataset for this project is put together by Tiago A. Almeida and José María Gómez Hidalgo and is available fof download from here.

Goals

The computer does the following to classify messages as spam or not spam:

  • Learns how humans classify messages
  • Uses that knowledge to estimate probability for new messages being spam or not
  • Classifies a new message based on probability values
    • If the probability for spam is greater, then the message is a spam
    • Otherwise it is a non-spam

Import the libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import operator
from wordcloud import WordCloud, STOPWORDS
In [2]:
# Open the SMSSpamCollection file using the read_csv() function from the pandas package.
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# Explore the dataset a little.
print(f'Number of SMS messages: {sms.shape[0]:,}')
print(f'Number of missing values in the dataframe: {sms.isnull().sum().sum()}\n')

def pretty_print_table(df, substring):
    print(f'Spam vs. ham {substring}, %')
    spam_ham_pct = round(df['Label'].value_counts(normalize=True)*100, 0)
    print(spam_ham_pct.to_markdown(tablefmt='pretty', headers=['Label', '%']))
    
pretty_print_table(df=sms, substring='(non-spam)')
Number of SMS messages: 5,572
Number of missing values in the dataframe: 0

Spam vs. ham (non-spam), %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+

By applying the pretty_print_table() function we are able to get the probabilities of the spam and ham in a tabulated format. Here it 87% ham and 13% spam.

In [3]:
# Start by randomizing the entire dataset by using the DataFrame.sample() method.
sms_randomized = sms.sample(frac=1, random_state=1)

# Split the randomized dataset into a training and a test set.
training_data = sms_randomized[:4458].reset_index(drop=True)
test_data = sms_randomized[4458:].reset_index(drop=True)

# Find the percentage of spam and ham in both the training and the test set.
pretty_print_table(df=training_data, substring=('in the training set'))
print('\n')
pretty_print_table(df=test_data, substring=('in the test set'))
Spam vs. ham in the training set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+


Spam vs. ham in the test set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+

We are drawing a random sample of the spam and ham messages and segregating the dataset in train and test data with 75% of train_data and 25% of test_data.

In [4]:
# Remove all the punctuation from the SMS column. You can use the regex '\W' to detect any character that is not from a-z, A-Z or 0-9.
training_data['SMS'] = training_data['SMS'].str.replace('\W', ' ').str.lower()
training_data.head()

# Create a vocabulary for the messages in the training set. 
training_data['SMS'] = training_data['SMS'].str.split()
training_data.head()
Out[4]:
Label SMS
0 ham [yep, by, the, pretty, sculpture]
1 ham [yes, princess, are, you, going, to, make, me,...
2 ham [welp, apparently, he, retired]
3 ham [havent]
4 ham [i, forgot, 2, ask, ü, all, smth, there, s, a,...

Using the series.str.replace() function in conjunction with the \W expression we are able to remove the punctuations in the training_data and subsequently we use the series.str.lower() function and series.str.split() function to convert to lowercase and split the sentences.

In [5]:
# Initiate an empty list named vocabulary
vocabulary = []
for sms in training_data['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
print(f'Number of unique words in the vocabulary of the training set: {len(vocabulary):,}')

# Run the code you see above to get the word_counts_per_sms dictionary
word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
# Transform word_counts_per_sms into a DataFrame using pd.DataFrame()
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
word_counts_per_sms.head(3)

# Concatenate the DataFrame we just built above with the DataFrame containing the training set
training_set_final = pd.concat([training_data, word_counts_per_sms], axis=1)
training_set_final.head(3)
Number of unique words in the vocabulary of the training set: 7,783
Out[5]:
Label SMS made 08081263000 knock eat off 08002986030 misss continue ... picsfree1 kent welcomes magical steyn an sounds alert breath edition
0 ham [yep, by, the, pretty, sculpture] 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 ham [yes, princess, are, you, going, to, make, me,... 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 ham [welp, apparently, he, retired] 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 7785 columns

  • Using the for loop, we find the number of unique words in the list of vocabulary and subsequently identify the word counts per sms and create it into a dataframe.
  • Next, we use the pd.concat() function to merge the dataframes 'training_data' and 'word_counts_per_sms'.
In [6]:
# Calculate P(Spam) and P(Ham).
training_spam = training_set_final[training_set_final['Label']=='spam']
training_ham = training_set_final[training_set_final['Label']=='ham']

# Creating a dictionary of words from all spam messages with their frequencies
spam_dict = {}
for sms in training_spam['SMS']:
    for word in sms:
        if word not in spam_dict:
            spam_dict[word]=0
        spam_dict[word]+=1
        
# Sorting the dictionary in descending order of word frequencies 
sorted_spam_dict = dict(sorted(spam_dict.items(), key=operator.itemgetter(1), reverse=True))
  • In the above step, we segregate the spam and ham messages using the boolean filter and arrange the spam messages in a dictionary.
  • Next, we randomize the order of the dictionary using the sorted() function such that the most frequent word appears first.

Building a wordcloud

To build a wordcloud we are:

  • considering only the most frequent words,
  • selecting only the most "suspicious" of them (i.e. non-conservative approach),
  • gathering exactly 100 words to visualize.
In [7]:
selected = ['call', 'free', 'stop', 'mobile', 'text', 'claim', 'www', 
            'prize', 'send', 'cash', 'nokia', 'win', 'urgent', 'service',
            'contact', 'com', 'msg', 'chat', 'guaranteed', 'customer', 
            'awarded', 'sms', 'ringtone', 'video', 'rate', 'latest', 
            'award', 'code', 'camera', 'chance', 'apply', 'valid', 'selected',
            'offer', 'tones', 'collection', 'mob', 'network', 'attempt', 
            'bonus', 'delivery', 'weekly', 'club', 'http', 'help', 'dating',
            'vouchers', 'poly', 'auction', 'ltd', 'pounds', 'special',
            'services', 'games', 'await', 'double', 'unsubscribe', 'hot',
            'price', 'sexy', 'camcorder', 'content', 'top', 'calls', 
            'account', 'private', 'winner', 'savamob', 'offers', 'pobox',
            'gift', 'net', 'quiz', 'expires', 'freemsg', 'play', 'ipod',
            'last', 'order', 'anytime', 'congratulations', 'caller', 'points',
            'identifier', 'voucher', 'statement', 'operator', 'real', 
            'mobiles', 'important', 'join', 'rental', 'valued', 'congrats',
            'final', 'enjoy', 'unlimited', 'tv', 'charged', 'sex']

# Extracting only the 100 most frequent spam words with their frequencies
filtered_sorted_spam_dict = {}
for word in selected:
    if word in sorted_spam_dict:
        filtered_sorted_spam_dict[word]=sorted_spam_dict[word] 

print(f'The number of the most popular spam words selected: {len(filtered_sorted_spam_dict)}')
The number of the most popular spam words selected: 100

We select the first 100 words in the order of their frequencies and append them to the dictionary filtered_sorted_spam.

In [8]:
# Creating a word cloud
fig = plt.subplots(figsize=(12,10)) 
wordcloud = WordCloud(width=1000, height=700,
                      background_color='white', 
                      random_state=1).generate_from_frequencies(filtered_sorted_spam_dict)
plt.title('The most frequent words in spam messages\n', fontsize=29)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The 100 most frequent words of spam messages reveal the following patterns :

  • encouraging people to do further actions (call, text, send, contact, help, claim, apply),
  • promising them something alluring (free, offer, prize, award, special, unlimited, guaranteed, gift, bonus),
  • urging them (urgent, attempt, stop),
  • having sexual context (dating, hot, sexy, sex),
  • inviting to visit some web resources (www, http, com, net),
  • advertising various digital devices and products (nokia, ringtone, video, camcorder, ipod, camera, games).

Calculating Constants

We make the classification using Naive Bayes algorithm based on the probabilities of two equations.

image.png

To calculate the probabilities of Spam and Ham above:

image-2.png

where:

  • Nwi|Spam — the number of times the word wi occurs in spam messages,
  • Nwi|Ham — the number of times the word wi occurs in ham messages,
  • NSpam — total number of words in spam messages,
  • NHam — total number of words in ham messages,
  • NVocabulary — total number of unique words in the vocabulary,
  • α — a smoothing parameter.
In [9]:
# Calculate P(Spam) and P(Ham)
p_spam = training_set_final['Label'].value_counts()['spam']/len(training_set_final)
p_ham = training_set_final['Label'].value_counts()['ham']/len(training_set_final)

n_spam = 0
n_ham = 0
for i in range(len(training_set_final)):
    row = list(training_set_final.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam+=row[j]
        else:
            n_ham+=row[j]
            
n_vocabulary = len(vocabulary)
alpha = 1

print(f'p_spam: {p_spam:.2f}\n'
      f'p_ham: {p_ham:.2f}\n'
      f'n_spam: {n_spam:,}\n'
      f'n_ham: {n_ham:,}\n'
      f'n_vocabulary: {n_vocabulary:,}\n'
      f'alpha: {alpha}')
p_spam: 0.13
p_ham: 0.87
n_spam: 15,190
n_ham: 57,237
n_vocabulary: 7,783
alpha: 1
  • Using series.value_counts()/len(dataframe) formula we find the probabilities of the spam and ham respectively.
  • We also find the number of words in spam and ham messages as n_spam and n_ham respectively.
  • The n_vocabulary talks about the number of unique words in the vocabulary which is 7783.

Calculating parameters

For the 7,783 words in the vocabulary a total of 15,566 probabilities, we calculate the P(wi|Spam) and P(wi|Ham) for each word.

image.png

In [10]:
# Initialize two dictionaries, where each key-value pair is a unique word (from our vocabulary) represented as a string, and the value is 0
p_wi_spam = {}
p_wi_ham = {}

for word in vocabulary:
    p_wi_spam[word] = (training_spam[word].sum()+alpha)/(n_spam+alpha*n_vocabulary)
    p_wi_ham[word] = (training_ham[word].sum()+alpha)/(n_ham+alpha*n_vocabulary)

We apply the smoothing parameter alpha to this dictionary and append the respective probabilities to the same using a for loop.

Classifying a new message

Create a spam filter which:

  • Takes in a new message as input
  • Calculates P(Spam|message) and P(Ham|message) using the formulas

image.png

  • Compares both values and:

    • if P(Ham|message) > P(Spam|message), then the message is classified as ham,
    • if P(Ham|message) < P(Spam|message), then the message is classified as spam,
    • if P(Ham|message) = P(Spam|message), then the algorithm may request human help.
In [11]:
# Copy the classify() function you see above and write the code needed for calculating
def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_wi_spam:
            p_spam_given_message*=p_wi_spam[word]
        if word in p_wi_ham:
            p_ham_given_message*=p_wi_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal proabilities, have a human classify this!'

# Use the classify() function to classify two new messages.
print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
print(classify('Sounds good, Tom, then see u there'))
spam
ham
  • We create a function which takes in a message and returns whether it is a spam or ham message based on the words present in it.
  • We also test it on two messages 'WINNER!! This is the secret code to unlock the money: C3421.' and 'Sounds good, Tom, then see u there'.

Measuring the Spam Filter's Accuracy

Testing the spam filter on the test set of 1114 messages.

In [12]:
test_data['Predicted'] = test_data['SMS'].apply(classify)
test_data.head()

# Measure the accuracy of the spam filter.
correct = 0
total = len(test_data)       
for row in test_data.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct+=1
accuracy = correct/total*100

print(f'The accuracy of the spam filter: {accuracy:.2f}%')
The accuracy of the spam filter: 98.74%

We have successfully used our spam filter to get more than the expected 80% accuracy using the multinomial naive bayes algorithm.

Wrongly Classified Messages

Here is the list of the wrongly classified messages.

In [13]:
false_spam = test_data[(test_data['Predicted']=='spam')&(test_data['Label']=='ham')].reset_index(drop=True)
false_ham = test_data[(test_data['Predicted']=='ham')&(test_data['Label']=='spam')].reset_index(drop=True)
unclear = test_data[test_data['Predicted']=='needs human classification'].reset_index(drop=True)

print('Total number of wrongly classified messages: ', len(false_spam)+len(false_ham)+len(unclear))
print('_________________________________________________________________________\n')
print('FALSE SPAM MESSAGES:')
for row in false_spam.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('FALSE HAM MESSAGES:')
for row in false_ham.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('UNCLEAR MESSAGES:')
for row in unclear.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________')
Total number of wrongly classified messages:  13
_________________________________________________________________________

FALSE SPAM MESSAGES:
1.  Unlimited texts. Limited minutes.
2.  26th OF JULY
3.  Nokia phone is lovly..
4.  No calls..messages..missed calls
5.  We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us
_________________________________________________________________________

FALSE HAM MESSAGES:
1.  Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
2.  More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
3.  Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50
4.  Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text
5.  0A$NETWORKS allow companies to bill for SMS, so they are responsible for their "suppliers", just as a shop has to give a guarantee on what they sell. B. G.
6.  RCT' THNQ Adrian for U text. Rgds Vatian
7.  2/2 146tf150p
8.  Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r
_________________________________________________________________________

UNCLEAR MESSAGES:
_________________________________________________________________________

From the total number of 13 wrongly classified messages, we can see that.

  • 5 of these are classified as spam messages due to the presence of words such as texts, Nokia, calls and customer.
  • 8 of these are classified as ham messages due to the presence of words seen in non-spam messages.

The slgorithm however had worked perfectly with other messages giving it an overal accuracy of 98.74%

Experiment: Making the Algorithm Case-Sensitive

By making the following adjustments we can try tuning the spam filter accuracy the following ways.

  • increasing the training set size (decreasing, however, the one of the test set) to train the algorithm more,
  • selecting alpha different than 1,
  • preserving punctuation (the word win and win!!! can appear in rather different contexts),
  • making the algorithm sensitive to letter case (for example, SECRET seems to be more spam-like than secret).
In [14]:
training_set_exp = sms_randomized[:4458].reset_index(drop=True)
test_set_exp = sms_randomized[4458:].reset_index(drop=True)
training_set_exp['SMS'] = training_set_exp['SMS'].str.replace('\W', ' ')

vocabulary_exp = []
for sms in training_set_exp['SMS']:
    for word in sms:
        vocabulary_exp.append(word)
vocabulary_exp = list(set(vocabulary_exp))

word_counts_per_sms_exp = {unique_word: [0] * len(training_set_exp['SMS']) for unique_word in vocabulary_exp}
for index, sms in enumerate(training_set_exp['SMS']):
    for word in sms:
        word_counts_per_sms_exp[word][index]+=1
        
word_counts_exp = pd.DataFrame(word_counts_per_sms_exp)

training_set_final_exp = pd.concat([training_set_exp, word_counts_exp], axis=1)
    
spam_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='spam']
ham_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='ham']

p_spam_exp = training_set_final_exp['Label'].value_counts()['spam']/len(training_set_final_exp)
p_ham_exp = training_set_final_exp['Label'].value_counts()['ham']/len(training_set_final_exp)

n_spam_exp = 0
n_ham_exp = 0
for i in range(len(training_set_final_exp)):
    row = list(training_set_final_exp.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam_exp+=row[j]
        else:
            n_ham_exp+=row[j]
            
n_vocabulary_exp = len(vocabulary_exp)
alpha = 1

p_wi_spam_exp = {}
p_wi_ham_exp = {}
for word in vocabulary_exp:
    p_wi_spam_exp[word] = (spam_sms_exp[word].sum()+alpha)/(n_spam_exp+alpha*n_vocabulary_exp)
    p_wi_ham_exp[word] = (ham_sms_exp[word].sum()+alpha)/(n_ham_exp+alpha*n_vocabulary_exp)
    
def classify_exp(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message_exp = p_spam_exp
    p_ham_given_message_exp = p_ham_exp
    for word in message:
        if word in p_wi_spam_exp:
             p_spam_given_message_exp*=p_wi_spam_exp[word]
        if word in p_wi_ham_exp:
            p_ham_given_message_exp*=p_wi_ham_exp[word]

    if p_ham_given_message_exp > p_spam_given_message_exp:
        return 'ham'
    elif p_ham_given_message_exp < p_spam_given_message_exp:
        return 'spam'
    else:
        return 'Equal proabilities, have a human classify this!'
    
test_set_exp['Predicted'] = test_set_exp['SMS'].apply(classify_exp)

correct_exp = 0
total_exp = len(test_set_exp)

for row in test_set_exp.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct_exp+=1
accuracy_exp = correct_exp/total_exp*100
print(f'The accuracy of the spam filter: {accuracy_exp:.2f}%')
The accuracy of the spam filter: 85.10%

Conclusions

Through this project, we have managed to achieve in creating a spam filter using multinomial naive bayes algorithm.

Apart from that we also have achieved learning:

  • How to assign probabilities based on certain conditions using conditional probability rules.
  • How to assign probabilities to events based on whether their relationship of statistical independence or not with other events.
  • How to assign probabilities to events using Bayes theorem.

Also, the effieiency achieved using Naive Bayes algorithm in this project is much better than the expected efficiency.