Building a spam filter using naive bayes algorithm¶

Introduction¶

Here in this guided project, we are going to build a spam filter using the practical side of naive bayes algorithm. The dataset for this project is put together by Tiago A. Almeida and José María Gómez Hidalgo and is available fof download from here.

Goals¶

The computer does the following to classify messages as spam or not spam:

Learns how humans classify messages
Uses that knowledge to estimate probability for new messages being spam or not
Classifies a new message based on probability values
- If the probability for spam is greater, then the message is a spam
- Otherwise it is a non-spam

Import the libraries¶

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import re
import operator
from wordcloud import WordCloud, STOPWORDS

In [2]:

# Open the SMSSpamCollection file using the read_csv() function from the pandas package.

sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# Explore the dataset a little.

print(f'Number of SMS messages: {sms.shape[0]:,}')
print(f'Number of missing values in the dataframe: {sms.isnull().sum().sum()}\n')

def pretty_print_table(df, substring):
    print(f'Spam vs. ham {substring}, %')
    spam_ham_pct = round(df['Label'].value_counts(normalize=True)*100, 0)
    print(spam_ham_pct.to_markdown(tablefmt='pretty', headers=['Label', '%']))
    
pretty_print_table(df=sms, substring='(non-spam)')

Number of SMS messages: 5,572
Number of missing values in the dataframe: 0

Spam vs. ham (non-spam), %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+

In [3]:

# Start by randomizing the entire dataset by using the DataFrame.sample() method.

sms_randomized = sms.sample(frac=1, random_state=1)

# Split the randomized dataset into a training and a test set.

training_data = sms_randomized[:4458].reset_index(drop=True)
test_data = sms_randomized[4458:].reset_index(drop=True)

# Find the percentage of spam and ham in both the training and the test set.

pretty_print_table(df=training_data, substring=('in the training set'))
print('\n')
pretty_print_table(df=test_data, substring=('in the test set'))

Spam vs. ham in the training set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+


Spam vs. ham in the test set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+

In [4]:

# Remove all the punctuation from the SMS column. You can use the regex '\W' to detect any character that is not from a-z, A-Z or 0-9.

training_data['SMS'] = training_data['SMS'].str.replace('\W', ' ').str.lower()
training_data.head()

Out[4]:

	Label	SMS
0	ham	yep by the pretty sculpture
1	ham	yes princess are you going to make me moan
2	ham	welp apparently he retired
3	ham	havent
4	ham	i forgot 2 ask ü all smth there s a card on ...

In [5]:

# Create a vocabulary for the messages in the training set. 

training_data['SMS'] = training_data['SMS'].str.split()
training_data.head()

Out[5]:

	Label	SMS
0	ham	[yep, by, the, pretty, sculpture]
1	ham	[yes, princess, are, you, going, to, make, me,...
2	ham	[welp, apparently, he, retired]
3	ham	[havent]
4	ham	[i, forgot, 2, ask, ü, all, smth, there, s, a,...

In [6]:

# Initiate an empty list named vocabulary

vocabulary = []
for sms in training_data['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
print(f'Number of unique words in the vocabulary of the training set: {len(vocabulary):,}')

Number of unique words in the vocabulary of the training set: 7,783

In [7]:

# Run the code you see above to get the word_counts_per_sms dictionary

word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [8]:

# Transform word_counts_per_sms into a DataFrame using pd.DataFrame()

word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
word_counts_per_sms.head(3)

Out[8]:

	____	kuch	atm	patients	hourish	08718727870150ppm	鈥	hoody	bthere	wipe	...	fancied	jog	haha	scream	or2optout	ana	waiting	shortcode	97n7qp	batchlor
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

3 rows × 7783 columns

In [9]:

# Concatenate the DataFrame we just built above with the DataFrame containing the training set

training_set_final = pd.concat([training_data, word_counts_per_sms], axis=1)
training_set_final.head(3)

Out[9]:

	Label	SMS	...
0	ham	[yep, by, the, pretty, sculpture]	...
1	ham	[yes, princess, are, you, going, to, make, me,...	...
2	ham	[welp, apparently, he, retired]	...

3 rows × 7785 columns

In [10]:

# Calculate P(Spam) and P(Ham).

training_spam = training_set_final[training_set_final['Label']=='spam']
training_ham = training_set_final[training_set_final['Label']=='ham']

# Creating a dictionary of words from all spam messages with their frequencies

spam_dict = {}
for sms in training_spam['SMS']:
    for word in sms:
        if word not in spam_dict:
            spam_dict[word]=0
        spam_dict[word]+=1
        
# Sorting the dictionary in descending order of word frequencies 


sorted_spam_dict = dict(sorted(spam_dict.items(), key=operator.itemgetter(1), reverse=True))

Building a wordcloud¶

To build a wordcloud we are:

considering only the most frequent words,
selecting only the most "suspicious" of them (i.e. non-conservative approach),
gathering exactly 100 words to visualize.

In [11]:

selected = ['call', 'free', 'stop', 'mobile', 'text', 'claim', 'www', 
            'prize', 'send', 'cash', 'nokia', 'win', 'urgent', 'service',
            'contact', 'com', 'msg', 'chat', 'guaranteed', 'customer', 
            'awarded', 'sms', 'ringtone', 'video', 'rate', 'latest', 
            'award', 'code', 'camera', 'chance', 'apply', 'valid', 'selected',
            'offer', 'tones', 'collection', 'mob', 'network', 'attempt', 
            'bonus', 'delivery', 'weekly', 'club', 'http', 'help', 'dating',
            'vouchers', 'poly', 'auction', 'ltd', 'pounds', 'special',
            'services', 'games', 'await', 'double', 'unsubscribe', 'hot',
            'price', 'sexy', 'camcorder', 'content', 'top', 'calls', 
            'account', 'private', 'winner', 'savamob', 'offers', 'pobox',
            'gift', 'net', 'quiz', 'expires', 'freemsg', 'play', 'ipod',
            'last', 'order', 'anytime', 'congratulations', 'caller', 'points',
            'identifier', 'voucher', 'statement', 'operator', 'real', 
            'mobiles', 'important', 'join', 'rental', 'valued', 'congrats',
            'final', 'enjoy', 'unlimited', 'tv', 'charged', 'sex']

# Extracting only the 100 most frequent spam words with their frequencies

filtered_sorted_spam_dict = {}
for word in selected:
    if word in sorted_spam_dict:
        filtered_sorted_spam_dict[word]=sorted_spam_dict[word] 

print(f'The number of the most popular spam words selected: {len(filtered_sorted_spam_dict)}')

The number of the most popular spam words selected: 100

In [12]:

# Creating a word cloud
fig = plt.subplots(figsize=(12,10)) 
wordcloud = WordCloud(width=1000, height=700,
                      background_color='white', 
                      random_state=1).generate_from_frequencies(filtered_sorted_spam_dict)
plt.title('The most frequent words in spam messages\n', fontsize=29)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The 100 most frequent words of spam messages reveal the following patterns :

encouraging people to do further actions (call, text, send, contact, help, claim, apply),
promising them something alluring (free, offer, prize, award, special, unlimited, guaranteed, gift, bonus),
urging them (urgent, attempt, stop),
having sexual context (dating, hot, sexy, sex),
inviting to visit some web resources (www, http, com, net),
advertising various digital devices and products (nokia, ringtone, video, camcorder, ipod, camera, games).

Calculating Constants¶

We make the classification using Naive Bayes algorithm based on the probabilities of two equations.

To calculate the probabilities of Spam and Ham above:

where:

Nwi|Spam — the number of times the word wi occurs in spam messages,
Nwi|Ham — the number of times the word wi occurs in ham messages,
NSpam — total number of words in spam messages,
NHam — total number of words in ham messages,
NVocabulary — total number of unique words in the vocabulary,
α — a smoothing parameter.

In [13]:

# Calculate P(Spam) and P(Ham)

p_spam = training_set_final['Label'].value_counts()['spam']/len(training_set_final)
p_ham = training_set_final['Label'].value_counts()['ham']/len(training_set_final)

n_spam = 0
n_ham = 0
for i in range(len(training_set_final)):
    row = list(training_set_final.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam+=row[j]
        else:
            n_ham+=row[j]
            
n_vocabulary = len(vocabulary)
alpha = 1

print(f'p_spam: {p_spam:.2f}\n'
      f'p_ham: {p_ham:.2f}\n'
      f'n_spam: {n_spam:,}\n'
      f'n_ham: {n_ham:,}\n'
      f'n_vocabulary: {n_vocabulary:,}\n'
      f'alpha: {alpha}')

p_spam: 0.13
p_ham: 0.87
n_spam: 15,190
n_ham: 57,237
n_vocabulary: 7,783
alpha: 1

Calculating parameters¶

For the 7,783 words in the vocabulary a total of 15,566 probabilities, we calculate the P(wi|Spam) and P(wi|Ham) for each word.

In [14]:

# Initialize two dictionaries, where each key-value pair is a unique word (from our vocabulary) represented as a string, and the value is 0

p_wi_spam = {}
p_wi_ham = {}

for word in vocabulary:
    p_wi_spam[word] = (training_spam[word].sum()+alpha)/(n_spam+alpha*n_vocabulary)
    p_wi_ham[word] = (training_ham[word].sum()+alpha)/(n_ham+alpha*n_vocabulary)

Classifying a new message¶

Create a spam filter which:

Takes in a new message as input
Calculates P(Spam|message) and P(Ham|message) using the formulas

Compares both values and:
- if P(Ham|message) > P(Spam|message), then the message is classified as ham,
- if P(Ham|message) < P(Spam|message), then the message is classified as spam,
- if P(Ham|message) = P(Spam|message), then the algorithm may request human help.

In [15]:

# Copy the classify() function you see above and write the code needed for calculating

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_wi_spam:
            p_spam_given_message*=p_wi_spam[word]
        if word in p_wi_ham:
            p_ham_given_message*=p_wi_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal proabilities, have a human classify this!'

In [16]:

# Use the classify() function to classify two new messages.

print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
print(classify("Sounds good, Tom, then see u there"))

spam
ham

Measuring the Spam Filter's Accuracy¶

Testing the spam filter on the test set of 1114 messages.

In [17]:

test_data['Predicted'] = test_data['SMS'].apply(classify)
test_data.head()

Out[17]:

	Label	SMS	Predicted
0	ham	Later i guess. I needa do mcat study too.	ham
1	ham	But i haf enuff space got like 4 mb...	ham
2	spam	Had your mobile 10 mths? Update to latest Oran...	spam
3	ham	All sounds good. Fingers . Makes it difficult ...	ham
4	ham	All done, all handed in. Don't know if mega sh...	ham

In [18]:

# Measure the accuracy of the spam filter.

correct = 0
total = len(test_data)       
for row in test_data.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct+=1
accuracy = correct/total*100

print(f'The accuracy of the spam filter: {accuracy:.2f}%')

The accuracy of the spam filter: 98.74%

We have successfully used our spam filter to get more than the expected 80% accuracy using the multinomial naive bayes algorithm.

Wrongly Classified Messages¶

Here is the list of the wrongly classified messages.

In [19]:

false_spam = test_data[(test_data['Predicted']=='spam')&(test_data['Label']=='ham')].reset_index(drop=True)
false_ham = test_data[(test_data['Predicted']=='ham')&(test_data['Label']=='spam')].reset_index(drop=True)
unclear = test_data[test_data['Predicted']=='needs human classification'].reset_index(drop=True)

print('Total number of wrongly classified messages: ', len(false_spam)+len(false_ham)+len(unclear))
print('_________________________________________________________________________\n')
print('FALSE SPAM MESSAGES:')
for row in false_spam.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('FALSE HAM MESSAGES:')
for row in false_ham.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('UNCLEAR MESSAGES:')
for row in unclear.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________')

Total number of wrongly classified messages: 13
_________________________________________________________________________

FALSE SPAM MESSAGES:
1. Unlimited texts. Limited minutes.
2. 26th OF JULY
3. Nokia phone is lovly..
4. No calls..messages..missed calls
5. We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us
_________________________________________________________________________

FALSE HAM MESSAGES:
1. Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
2. More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
3. Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50
4. Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text
5. 0A$NETWORKS allow companies to bill for SMS, so they are responsible for their "suppliers", just as a shop has to give a guarantee on what they sell. B. G.
6. RCT' THNQ Adrian for U text. Rgds Vatian
7. 2/2 146tf150p
8. Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r
_________________________________________________________________________

UNCLEAR MESSAGES:
_________________________________________________________________________

From the total number of 13 wrongly classified messages, we can see that.

5 of these are classified as spam messages due to the presence of words such as texts, Nokia, calls and customer.
8 of these are classified as ham messages due to the presence of words seen in non-spam messages.

The slgorithm however had worked perfectly with other messages giving it an overal accuracy of 98.74%

Experiment: Making the Algorithm Case-Sensitive¶

By making the following adjustments we can try tuning the spam filter accuracy the following ways.

increasing the training set size (decreasing, however, the one of the test set) to train the algorithm more,
selecting alpha different than 1,
preserving punctuation (the word win and win!!! can appear in rather different contexts),
making the algorithm sensitive to letter case (for example, SECRET seems to be more spam-like than secret).

In [20]:

training_set_exp = sms_randomized[:4458].reset_index(drop=True)
test_set_exp = sms_randomized[4458:].reset_index(drop=True)
training_set_exp['SMS'] = training_set_exp['SMS'].str.replace('\W', ' ')

vocabulary_exp = []
for sms in training_set_exp['SMS']:
    for word in sms:
        vocabulary_exp.append(word)
vocabulary_exp = list(set(vocabulary_exp))

word_counts_per_sms_exp = {unique_word: [0] * len(training_set_exp['SMS']) for unique_word in vocabulary_exp}
for index, sms in enumerate(training_set_exp['SMS']):
    for word in sms:
        word_counts_per_sms_exp[word][index]+=1
        
word_counts_exp = pd.DataFrame(word_counts_per_sms_exp)

training_set_final_exp = pd.concat([training_set_exp, word_counts_exp], axis=1)
    
spam_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='spam']
ham_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='ham']

p_spam_exp = training_set_final_exp['Label'].value_counts()['spam']/len(training_set_final_exp)
p_ham_exp = training_set_final_exp['Label'].value_counts()['ham']/len(training_set_final_exp)

n_spam_exp = 0
n_ham_exp = 0
for i in range(len(training_set_final_exp)):
    row = list(training_set_final_exp.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam_exp+=row[j]
        else:
            n_ham_exp+=row[j]
            
n_vocabulary_exp = len(vocabulary_exp)
alpha = 1

p_wi_spam_exp = {}
p_wi_ham_exp = {}
for word in vocabulary_exp:
    p_wi_spam_exp[word] = (spam_sms_exp[word].sum()+alpha)/(n_spam_exp+alpha*n_vocabulary_exp)
    p_wi_ham_exp[word] = (ham_sms_exp[word].sum()+alpha)/(n_ham_exp+alpha*n_vocabulary_exp)
    
def classify_exp(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message_exp = p_spam_exp
    p_ham_given_message_exp = p_ham_exp
    for word in message:
        if word in p_wi_spam_exp:
             p_spam_given_message_exp*=p_wi_spam_exp[word]
        if word in p_wi_ham_exp:
            p_ham_given_message_exp*=p_wi_ham_exp[word]

    if p_ham_given_message_exp > p_spam_given_message_exp:
        return 'ham'
    elif p_ham_given_message_exp < p_spam_given_message_exp:
        return 'spam'
    else:
        return 'Equal proabilities, have a human classify this!'
    
test_set_exp['Predicted'] = test_set_exp['SMS'].apply(classify_exp)

correct_exp = 0
total_exp = len(test_set_exp)

for row in test_set_exp.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct_exp+=1
accuracy_exp = correct_exp/total_exp*100
print(f'The accuracy of the spam filter: {accuracy_exp:.2f}%')

The accuracy of the spam filter: 85.10%

Conclusions¶

Through this project, we have managed to achieve in creating a spam filter using multinomial naive bayes algorithm.

Apart from that we also have achieved learning:

How to assign probabilities based on certain conditions using conditional probability rules.
How to assign probabilities to events based on whether their relationship of statistical independence or not with other events.
How to assign probabilities to events using Bayes theorem.

Also, the effieiency achieved using Naive Bayes algorithm in this project is much better than the expected efficiency.

	____	kuch	atm	patients	hourish	08718727870150ppm	鈥	hoody	bthere	wipe	...	fancied	jog	haha	scream	or2optout	ana	waiting	shortcode	97n7qp	batchlor
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

	____	kuch	atm	patients	hourish	08718727870150ppm	鈥	hoody	bthere	wipe	...	fancied	jog	haha	scream	or2optout	ana	waiting	shortcode	97n7qp	batchlor
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

	____	kuch	atm	patients	hourish	08718727870150ppm	鈥	hoody	bthere	wipe	...	fancied	jog	haha	scream	or2optout	ana	waiting	shortcode	97n7qp	batchlor
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0