Building a spam filter using naive bayes algorithm

Introduction

Here in this guided project, we are going to build a spam filter using the practical side of naive bayes algorithm. The dataset for this project is put together by Tiago A. Almeida and José María Gómez Hidalgo and is available fof download from here.

Goals

The computer does the following to classify messages as spam or not spam:

  • Learns how humans classify messages
  • Uses that knowledge to estimate probability for new messages being spam or not
  • Classifies a new message based on probability values
    • If the probability for spam is greater, then the message is a spam
    • Otherwise it is a non-spam

Import the libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import operator
from wordcloud import WordCloud, STOPWORDS
In [2]:
# Open the SMSSpamCollection file using the read_csv() function from the pandas package.

sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# Explore the dataset a little.

print(f'Number of SMS messages: {sms.shape[0]:,}')
print(f'Number of missing values in the dataframe: {sms.isnull().sum().sum()}\n')

def pretty_print_table(df, substring):
    print(f'Spam vs. ham {substring}, %')
    spam_ham_pct = round(df['Label'].value_counts(normalize=True)*100, 0)
    print(spam_ham_pct.to_markdown(tablefmt='pretty', headers=['Label', '%']))
    
pretty_print_table(df=sms, substring='(non-spam)')
Number of SMS messages: 5,572
Number of missing values in the dataframe: 0

Spam vs. ham (non-spam), %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+
In [3]:
# Start by randomizing the entire dataset by using the DataFrame.sample() method.

sms_randomized = sms.sample(frac=1, random_state=1)

# Split the randomized dataset into a training and a test set.

training_data = sms_randomized[:4458].reset_index(drop=True)
test_data = sms_randomized[4458:].reset_index(drop=True)

# Find the percentage of spam and ham in both the training and the test set.

pretty_print_table(df=training_data, substring=('in the training set'))
print('\n')
pretty_print_table(df=test_data, substring=('in the test set'))
Spam vs. ham in the training set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+


Spam vs. ham in the test set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+
In [4]:
# Remove all the punctuation from the SMS column. You can use the regex '\W' to detect any character that is not from a-z, A-Z or 0-9.

training_data['SMS'] = training_data['SMS'].str.replace('\W', ' ').str.lower()
training_data.head()
Out[4]:
Label SMS
0 ham yep by the pretty sculpture
1 ham yes princess are you going to make me moan
2 ham welp apparently he retired
3 ham havent
4 ham i forgot 2 ask ü all smth there s a card on ...
In [5]:
# Create a vocabulary for the messages in the training set. 

training_data['SMS'] = training_data['SMS'].str.split()
training_data.head()
Out[5]:
Label SMS
0 ham [yep, by, the, pretty, sculpture]
1 ham [yes, princess, are, you, going, to, make, me,...
2 ham [welp, apparently, he, retired]
3 ham [havent]
4 ham [i, forgot, 2, ask, ü, all, smth, there, s, a,...
In [6]:
# Initiate an empty list named vocabulary

vocabulary = []
for sms in training_data['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
print(f'Number of unique words in the vocabulary of the training set: {len(vocabulary):,}')
Number of unique words in the vocabulary of the training set: 7,783
In [7]:
# Run the code you see above to get the word_counts_per_sms dictionary

word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
In [8]:
# Transform word_counts_per_sms into a DataFrame using pd.DataFrame()

word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
word_counts_per_sms.head(3)
Out[8]:
____ kuch atm patients hourish 08718727870150ppm hoody bthere wipe ... fancied jog haha scream or2optout ana waiting shortcode 97n7qp batchlor
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 7783 columns

In [9]:
# Concatenate the DataFrame we just built above with the DataFrame containing the training set

training_set_final = pd.concat([training_data, word_counts_per_sms], axis=1)
training_set_final.head(3)
Out[9]:
Label SMS ____ kuch atm patients hourish 08718727870150ppm hoody ... fancied jog haha scream or2optout ana waiting shortcode 97n7qp batchlor
0 ham [yep, by, the, pretty, sculpture] 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 ham [yes, princess, are, you, going, to, make, me,... 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 ham [welp, apparently, he, retired] 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 7785 columns

In [10]:
# Calculate P(Spam) and P(Ham).

training_spam = training_set_final[training_set_final['Label']=='spam']
training_ham = training_set_final[training_set_final['Label']=='ham']

# Creating a dictionary of words from all spam messages with their frequencies

spam_dict = {}
for sms in training_spam['SMS']:
    for word in sms:
        if word not in spam_dict:
            spam_dict[word]=0
        spam_dict[word]+=1
        
# Sorting the dictionary in descending order of word frequencies 


sorted_spam_dict = dict(sorted(spam_dict.items(), key=operator.itemgetter(1), reverse=True))

Building a wordcloud

To build a wordcloud we are:

  • considering only the most frequent words,
  • selecting only the most "suspicious" of them (i.e. non-conservative approach),
  • gathering exactly 100 words to visualize.
In [11]:
selected = ['call', 'free', 'stop', 'mobile', 'text', 'claim', 'www', 
            'prize', 'send', 'cash', 'nokia', 'win', 'urgent', 'service',
            'contact', 'com', 'msg', 'chat', 'guaranteed', 'customer', 
            'awarded', 'sms', 'ringtone', 'video', 'rate', 'latest', 
            'award', 'code', 'camera', 'chance', 'apply', 'valid', 'selected',
            'offer', 'tones', 'collection', 'mob', 'network', 'attempt', 
            'bonus', 'delivery', 'weekly', 'club', 'http', 'help', 'dating',
            'vouchers', 'poly', 'auction', 'ltd', 'pounds', 'special',
            'services', 'games', 'await', 'double', 'unsubscribe', 'hot',
            'price', 'sexy', 'camcorder', 'content', 'top', 'calls', 
            'account', 'private', 'winner', 'savamob', 'offers', 'pobox',
            'gift', 'net', 'quiz', 'expires', 'freemsg', 'play', 'ipod',
            'last', 'order', 'anytime', 'congratulations', 'caller', 'points',
            'identifier', 'voucher', 'statement', 'operator', 'real', 
            'mobiles', 'important', 'join', 'rental', 'valued', 'congrats',
            'final', 'enjoy', 'unlimited', 'tv', 'charged', 'sex']

# Extracting only the 100 most frequent spam words with their frequencies

filtered_sorted_spam_dict = {}
for word in selected:
    if word in sorted_spam_dict:
        filtered_sorted_spam_dict[word]=sorted_spam_dict[word] 

print(f'The number of the most popular spam words selected: {len(filtered_sorted_spam_dict)}')
The number of the most popular spam words selected: 100
In [12]:
# Creating a word cloud
fig = plt.subplots(figsize=(12,10)) 
wordcloud = WordCloud(width=1000, height=700,
                      background_color='white', 
                      random_state=1).generate_from_frequencies(filtered_sorted_spam_dict)
plt.title('The most frequent words in spam messages\n', fontsize=29)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The 100 most frequent words of spam messages reveal the following patterns :

  • encouraging people to do further actions (call, text, send, contact, help, claim, apply),
  • promising them something alluring (free, offer, prize, award, special, unlimited, guaranteed, gift, bonus),
  • urging them (urgent, attempt, stop),
  • having sexual context (dating, hot, sexy, sex),
  • inviting to visit some web resources (www, http, com, net),
  • advertising various digital devices and products (nokia, ringtone, video, camcorder, ipod, camera, games).

Calculating Constants

We make the classification using Naive Bayes algorithm based on the probabilities of two equations.

image.png

To calculate the probabilities of Spam and Ham above:

image-2.png

where:

  • Nwi|Spam — the number of times the word wi occurs in spam messages,
  • Nwi|Ham — the number of times the word wi occurs in ham messages,
  • NSpam — total number of words in spam messages,
  • NHam — total number of words in ham messages,
  • NVocabulary — total number of unique words in the vocabulary,
  • α — a smoothing parameter.
In [13]:
# Calculate P(Spam) and P(Ham)

p_spam = training_set_final['Label'].value_counts()['spam']/len(training_set_final)
p_ham = training_set_final['Label'].value_counts()['ham']/len(training_set_final)

n_spam = 0
n_ham = 0
for i in range(len(training_set_final)):
    row = list(training_set_final.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam+=row[j]
        else:
            n_ham+=row[j]
            
n_vocabulary = len(vocabulary)
alpha = 1

print(f'p_spam: {p_spam:.2f}\n'
      f'p_ham: {p_ham:.2f}\n'
      f'n_spam: {n_spam:,}\n'
      f'n_ham: {n_ham:,}\n'
      f'n_vocabulary: {n_vocabulary:,}\n'
      f'alpha: {alpha}')
p_spam: 0.13
p_ham: 0.87
n_spam: 15,190
n_ham: 57,237
n_vocabulary: 7,783
alpha: 1

Calculating parameters

For the 7,783 words in the vocabulary a total of 15,566 probabilities, we calculate the P(wi|Spam) and P(wi|Ham) for each word.

image.png

In [14]:
# Initialize two dictionaries, where each key-value pair is a unique word (from our vocabulary) represented as a string, and the value is 0

p_wi_spam = {}
p_wi_ham = {}

for word in vocabulary:
    p_wi_spam[word] = (training_spam[word].sum()+alpha)/(n_spam+alpha*n_vocabulary)
    p_wi_ham[word] = (training_ham[word].sum()+alpha)/(n_ham+alpha*n_vocabulary)

Classifying a new message

Create a spam filter which:

  • Takes in a new message as input
  • Calculates P(Spam|message) and P(Ham|message) using the formulas

image.png

  • Compares both values and:

    • if P(Ham|message) > P(Spam|message), then the message is classified as ham,
    • if P(Ham|message) < P(Spam|message), then the message is classified as spam,
    • if P(Ham|message) = P(Spam|message), then the algorithm may request human help.
In [15]:
# Copy the classify() function you see above and write the code needed for calculating

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_wi_spam:
            p_spam_given_message*=p_wi_spam[word]
        if word in p_wi_ham:
            p_ham_given_message*=p_wi_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal proabilities, have a human classify this!'
In [16]:
# Use the classify() function to classify two new messages.

print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
print(classify("Sounds good, Tom, then see u there"))
spam
ham

Measuring the Spam Filter's Accuracy

Testing the spam filter on the test set of 1114 messages.

In [17]:
test_data['Predicted'] = test_data['SMS'].apply(classify)
test_data.head()
Out[17]:
Label SMS Predicted
0 ham Later i guess. I needa do mcat study too. ham
1 ham But i haf enuff space got like 4 mb... ham
2 spam Had your mobile 10 mths? Update to latest Oran... spam
3 ham All sounds good. Fingers . Makes it difficult ... ham
4 ham All done, all handed in. Don't know if mega sh... ham
In [18]:
# Measure the accuracy of the spam filter.

correct = 0
total = len(test_data)       
for row in test_data.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct+=1
accuracy = correct/total*100

print(f'The accuracy of the spam filter: {accuracy:.2f}%')
The accuracy of the spam filter: 98.74%

We have successfully used our spam filter to get more than the expected 80% accuracy using the multinomial naive bayes algorithm.

Wrongly Classified Messages

Here is the list of the wrongly classified messages.

In [19]:
false_spam = test_data[(test_data['Predicted']=='spam')&(test_data['Label']=='ham')].reset_index(drop=True)
false_ham = test_data[(test_data['Predicted']=='ham')&(test_data['Label']=='spam')].reset_index(drop=True)
unclear = test_data[test_data['Predicted']=='needs human classification'].reset_index(drop=True)

print('Total number of wrongly classified messages: ', len(false_spam)+len(false_ham)+len(unclear))
print('_________________________________________________________________________\n')
print('FALSE SPAM MESSAGES:')
for row in false_spam.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('FALSE HAM MESSAGES:')
for row in false_ham.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('UNCLEAR MESSAGES:')
for row in unclear.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________')
Total number of wrongly classified messages:  13
_________________________________________________________________________

FALSE SPAM MESSAGES:
1.  Unlimited texts. Limited minutes.
2.  26th OF JULY
3.  Nokia phone is lovly..
4.  No calls..messages..missed calls
5.  We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us
_________________________________________________________________________

FALSE HAM MESSAGES:
1.  Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
2.  More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
3.  Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50
4.  Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text
5.  0A$NETWORKS allow companies to bill for SMS, so they are responsible for their "suppliers", just as a shop has to give a guarantee on what they sell. B. G.
6.  RCT' THNQ Adrian for U text. Rgds Vatian
7.  2/2 146tf150p
8.  Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r
_________________________________________________________________________

UNCLEAR MESSAGES:
_________________________________________________________________________

From the total number of 13 wrongly classified messages, we can see that.

  • 5 of these are classified as spam messages due to the presence of words such as texts, Nokia, calls and customer.
  • 8 of these are classified as ham messages due to the presence of words seen in non-spam messages.

The slgorithm however had worked perfectly with other messages giving it an overal accuracy of 98.74%

Experiment: Making the Algorithm Case-Sensitive

By making the following adjustments we can try tuning the spam filter accuracy the following ways.

  • increasing the training set size (decreasing, however, the one of the test set) to train the algorithm more,
  • selecting alpha different than 1,
  • preserving punctuation (the word win and win!!! can appear in rather different contexts),
  • making the algorithm sensitive to letter case (for example, SECRET seems to be more spam-like than secret).
In [20]:
training_set_exp = sms_randomized[:4458].reset_index(drop=True)
test_set_exp = sms_randomized[4458:].reset_index(drop=True)
training_set_exp['SMS'] = training_set_exp['SMS'].str.replace('\W', ' ')

vocabulary_exp = []
for sms in training_set_exp['SMS']:
    for word in sms:
        vocabulary_exp.append(word)
vocabulary_exp = list(set(vocabulary_exp))

word_counts_per_sms_exp = {unique_word: [0] * len(training_set_exp['SMS']) for unique_word in vocabulary_exp}
for index, sms in enumerate(training_set_exp['SMS']):
    for word in sms:
        word_counts_per_sms_exp[word][index]+=1
        
word_counts_exp = pd.DataFrame(word_counts_per_sms_exp)

training_set_final_exp = pd.concat([training_set_exp, word_counts_exp], axis=1)
    
spam_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='spam']
ham_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='ham']

p_spam_exp = training_set_final_exp['Label'].value_counts()['spam']/len(training_set_final_exp)
p_ham_exp = training_set_final_exp['Label'].value_counts()['ham']/len(training_set_final_exp)

n_spam_exp = 0
n_ham_exp = 0
for i in range(len(training_set_final_exp)):
    row = list(training_set_final_exp.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam_exp+=row[j]
        else:
            n_ham_exp+=row[j]
            
n_vocabulary_exp = len(vocabulary_exp)
alpha = 1

p_wi_spam_exp = {}
p_wi_ham_exp = {}
for word in vocabulary_exp:
    p_wi_spam_exp[word] = (spam_sms_exp[word].sum()+alpha)/(n_spam_exp+alpha*n_vocabulary_exp)
    p_wi_ham_exp[word] = (ham_sms_exp[word].sum()+alpha)/(n_ham_exp+alpha*n_vocabulary_exp)
    
def classify_exp(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message_exp = p_spam_exp
    p_ham_given_message_exp = p_ham_exp
    for word in message:
        if word in p_wi_spam_exp:
             p_spam_given_message_exp*=p_wi_spam_exp[word]
        if word in p_wi_ham_exp:
            p_ham_given_message_exp*=p_wi_ham_exp[word]

    if p_ham_given_message_exp > p_spam_given_message_exp:
        return 'ham'
    elif p_ham_given_message_exp < p_spam_given_message_exp:
        return 'spam'
    else:
        return 'Equal proabilities, have a human classify this!'
    
test_set_exp['Predicted'] = test_set_exp['SMS'].apply(classify_exp)

correct_exp = 0
total_exp = len(test_set_exp)

for row in test_set_exp.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct_exp+=1
accuracy_exp = correct_exp/total_exp*100
print(f'The accuracy of the spam filter: {accuracy_exp:.2f}%')
The accuracy of the spam filter: 85.10%

Conclusions

Through this project, we have managed to achieve in creating a spam filter using multinomial naive bayes algorithm.

Apart from that we also have achieved learning:

  • How to assign probabilities based on certain conditions using conditional probability rules.
  • How to assign probabilities to events based on whether their relationship of statistical independence or not with other events.
  • How to assign probabilities to events using Bayes theorem.

Also, the effieiency achieved using Naive Bayes algorithm in this project is much better than the expected efficiency.