The aim of this project is to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm and a dataset of 5,572 SMS messages that are already classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from this repository.
To classify messages as spam or non-spam, the computer:

- learns how humans classify messages,
- uses that knowledge to estimate the probabilities of a new message being spam or non-spam,
- assigns the label with the higher probability to the new message.
In other words, our task for this project is to "teach" the computer how to classify messages.
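To make this idea concrete, below is a minimal sketch of the multinomial Naive Bayes decision rule the filter relies on. It is illustrative only: the parameters (class priors and per-word conditional probabilities) are placeholders to be estimated from the training data later, and the names are not from the final implementation.

def classify_sketch(message_words, p_spam, p_ham,
                    p_word_given_spam, p_word_given_ham):
    '''Illustrative Naive Bayes rule: P(class | message) is proportional
    to P(class) times the product of P(word | class) over message words.
    '''
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message_words:
        if word in p_word_given_spam:
            p_spam_given_message *= p_word_given_spam[word]
        if word in p_word_given_ham:
            p_ham_given_message *= p_word_given_ham[word]
    # The class with the higher (unnormalized) posterior wins
    return 'spam' if p_spam_given_message > p_ham_given_message else 'ham'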
We created a highly accurate spam filter, reaching an accuracy of 98.74%, which is almost 20 percentage points higher than our initial goal of 80%. The few incorrectly classified messages revealed some features in common. An attempt to increase the accuracy even further by making the algorithm sensitive to letter case had just the opposite effect, rendering the spam filter 13.5% less accurate.
import pandas as pd
import matplotlib.pyplot as plt
import re
import operator
from wordcloud import WordCloud, STOPWORDS
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
print(f'Number of SMS messages: {sms.shape[0]:,}')
print(f'Number of missing values in the dataframe: {sms.isnull().sum().sum()}\n')
def pretty_print_table(df, substring):
    '''Pretty-prints a table of the result of the `value_counts` method (in %
    and rounded) on the `Label` column of an input dataframe. Prints the title
    of the table with an input substring incorporated.
    '''
    print(f'Spam vs. ham {substring}, %')
    spam_ham_pct = round(df['Label'].value_counts(normalize=True)*100, 0)
    print(spam_ham_pct.to_markdown(tablefmt='pretty', headers=['Label', '%']))
# Pretty-printing % of spam and ham messages
pretty_print_table(df=sms, substring='(non-spam)')
# Plotting % of spam and ham messages
spam_pct = round(sms['Label'].value_counts(normalize=True)*100, 0)
fig, ax = plt.subplots(figsize=(8,2))
spam_pct.plot.barh(color='slateblue')
ax.set_title('Spam vs. ham, %', fontsize=25)
ax.set_xlabel(None)
ax.tick_params(axis='both', labelsize=16, left=False)
for side in ['top', 'right', 'left']:
    ax.spines[side].set_visible(False)
plt.show()
sms.head()
Number of SMS messages: 5,572
Number of missing values in the dataframe: 0

Spam vs. ham (non-spam), %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+
| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
To start with, we have to set aside a portion of the dataset that we'll use at the end to test how well our spam filter classifies new messages. Hence, we have to split our dataset into 2 parts:

- a training set (80% of the data), used to "teach" the computer how to classify messages,
- a test set (the remaining 20%), used to check the accuracy of the finished filter on new messages.
Let's set a goal of creating a spam filter that classifies new messages with an accuracy greater than 80%. First, we're going to randomize the dataset to ensure that spam and ham messages are distributed evenly throughout it.
sms_randomized = sms.sample(frac=1, random_state=1)
# Creating a training set (80%) and a test set (20%)
training_set = sms_randomized[:4458].reset_index(drop=True)
test_set = sms_randomized[4458:].reset_index(drop=True)
# Finding the % of spam and ham in both sets
pretty_print_table(df=training_set, substring='in the training set')
print('\n')
pretty_print_table(df=test_set, substring='in the test set')
Spam vs. ham in the training set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+

Spam vs. ham in the test set, %
+-------+------+
| Label |  %   |
+-------+------+
|  ham  | 87.0 |
| spam  | 13.0 |
+-------+------+
We see that the percentages of spam/ham messages in each set are representative of those in the whole dataset.
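As a side note, near-identical proportions could also be guaranteed by construction with a stratified split. The scikit-learn sketch below is only an alternative shown for comparison, not the approach used in this project:

from sklearn.model_selection import train_test_split

# An alternative (not used here): a split stratified by 'Label' preserves
# the spam/ham ratio in both sets by construction
train_df, test_df = train_test_split(sms, test_size=0.2,
                                     stratify=sms['Label'], random_state=1)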
Let's perform some data cleaning to bring the training set into a format that will allow us to easily extract all the necessary information. This format implies a table with the following features:

- the `Label` column (spam/ham),
- the `SMS` column,
- a series of new columns, each representing a unique word from the vocabulary, with the number of times that word occurs in each message.

First, we'll remove the punctuation and bring all the words to lower case:
# Removing punctuation and making all the words lower case
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
training_set.head()
| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
Next, we'll create a list of all the unique words that occur in the messages of our training set.
training_set['SMS'] = training_set['SMS'].str.split()
training_set.head(3)
| | Label | SMS |
|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... |
| 2 | ham | [welp, apparently, he, retired] |
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
print(f'Number of unique words in the vocabulary of the training set: {len(vocabulary):,}')
Number of unique words in the vocabulary of the training set: 7,783
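As a side note, the same vocabulary can be built more concisely with a set comprehension. The snippet below is an equivalent alternative to the loop above, shown only for comparison (`vocabulary_alt` is an illustrative name):

# Equivalent alternative: collect the unique words in a single pass
vocabulary_alt = list({word for message in training_set['SMS'] for word in message})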
The last step of data cleaning uses the vocabulary to apply the final transformation to our training set, counting the occurrences of each vocabulary word in every message:
# Creating a dictionary where each key is a unique word from the vocabulary,
# and each value is a list of the frequencies of that word in each message
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head(3)
| | gari | eye | gifted | chip | ken | alone | uhhhhrmm | 45pm | dentists | bbq | ... | charity | 600 | names | filthy | tessy | cutter | 10 | basically | teenager | workand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

3 rows × 7783 columns
training_set_final = pd.concat([training_set, word_counts], axis=1)
training_set_final.head(3)
| | Label | SMS | gari | eye | gifted | chip | ken | alone | uhhhhrmm | 45pm | ... | charity | 600 | names | filthy | tessy | cutter | 10 | basically | teenager | workand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

3 rows × 7785 columns
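To see why this format is convenient, here is a brief preview of how the Naive Bayes quantities can be read directly off `training_set_final`. This is only a sketch of the later training stage: `alpha = 1` assumes Laplace smoothing, and the word 'free' is just an example (assuming it occurs in the vocabulary):

# A preview of the Naive Bayes quantities this table makes easy to compute
alpha = 1  # Laplace smoothing (an assumption for this sketch)
spam_messages = training_set_final[training_set_final['Label'] == 'spam']
p_spam = len(spam_messages) / len(training_set_final)  # prior P(Spam)
n_spam = spam_messages['SMS'].apply(len).sum()         # total words in spam
n_vocabulary = len(vocabulary)
# P('free' | Spam): per-word counts are just column sums of the table
p_free_given_spam = ((spam_messages['free'].sum() + alpha) /
                     (n_spam + alpha * n_vocabulary))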
Before moving forward with creating the spam filter, let's figure out which words are the most popular in spam messages. This information will help us obtain more insights later, when testing the finished filter.
spam_sms = training_set_final[training_set_final['Label']=='spam']
ham_sms = training_set_final[training_set_final['Label']=='ham']
# Creating a dictionary of words from all spam messages with their frequencies
spam_dict = {}
for sms in spam_sms['SMS']:
    for word in sms:
        if word not in spam_dict:
            spam_dict[word] = 0
        spam_dict[word] += 1
# Sorting the dictionary in descending order of word frequencies
sorted_spam_dict = dict(sorted(spam_dict.items(), key=operator.itemgetter(1), reverse=True))
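Before the manual selection described next, it can be handy to peek at the top of the sorted dictionary, for instance:

from itertools import islice

# A quick look at the 10 most frequent spam words and their counts
print(dict(islice(sorted_spam_dict.items(), 10)))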
Now that we have a dictionary of all the words used in spam messages, sorted in descending order of their frequencies, we can extract the 100 most meaningful ones and display them on a word cloud according to their frequency. To obtain an insightful visualization, we have to do a significant amount of manual work: excluding auxiliary words (and, of, inside, etc.), neutral ones (day, please, box), numbers (1, 16, 50), and words merged with numbers (150ppm, 12hrs, 2lands). The last category actually looks interesting and definitely spam-like, but showing these words on a word cloud would make them unreadable and, hence, useless.

Of course, this manual part of the work involves some degree of subjectivity, and different people can perceive "neutral" words differently; this is a common issue when creating word clouds. Anyway, our approach here is to go through the sorted dictionary in descending order of frequencies and hand-pick the 100 most frequent meaningful spam words, skipping the categories listed above:
selected = ['call', 'free', 'stop', 'mobile', 'text', 'claim', 'www',
'prize', 'send', 'cash', 'nokia', 'win', 'urgent', 'service',
'contact', 'com', 'msg', 'chat', 'guaranteed', 'customer',
'awarded', 'sms', 'ringtone', 'video', 'rate', 'latest',
'award', 'code', 'camera', 'chance', 'apply', 'valid', 'selected',
'offer', 'tones', 'collection', 'mob', 'network', 'attempt',
'bonus', 'delivery', 'weekly', 'club', 'http', 'help', 'dating',
'vouchers', 'poly', 'auction', 'ltd', 'pounds', 'special',
'services', 'games', 'await', 'double', 'unsubscribe', 'hot',
'price', 'sexy', 'camcorder', 'content', 'top', 'calls',
'account', 'private', 'winner', 'savamob', 'offers', 'pobox',
'gift', 'net', 'quiz', 'expires', 'freemsg', 'play', 'ipod',
'last', 'order', 'anytime', 'congratulations', 'caller', 'points',
'identifier', 'voucher', 'statement', 'operator', 'real',
'mobiles', 'important', 'join', 'rental', 'valued', 'congrats',
'final', 'enjoy', 'unlimited', 'tv', 'charged', 'sex']
# Extracting only the 100 most frequent spam words with their frequencies
filtered_sorted_spam_dict = {}
for word in selected:
    if word in sorted_spam_dict:
        filtered_sorted_spam_dict[word] = sorted_spam_dict[word]
print(f'The number of the most popular spam words selected: {len(filtered_sorted_spam_dict)}')
The number of the most popular spam words selected: 100
# Creating a word cloud
fig, ax = plt.subplots(figsize=(12, 10))
wordcloud = WordCloud(width=1000, height=700,
                      background_color='white',
                      random_state=1).generate_from_frequencies(filtered_sorted_spam_dict)
plt.title('The most frequent words in spam messages\n', fontsize=29)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()