SPAM Classifier

Steps
Read in data
Feature Engineering
-- Simple Bins
-- TFIDF
-- NLP
Sparse Representation
Training
-- Naive Bayes
-- SGD

Reading and Preprocessing the Data

The data used in this module is from the CSDMC2010 SPAM corpus. To follow along with your own data, or to modify the examples or data, first do the following in a Python-compatible environment:

Download and unzip the data.
Run 'ExtractContent.py' to extract the subject and body from each email file. Note: to make your SPAM classifier even better, you can modify this Python script to use more than just the subject and body information.
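For readers who want a feel for what the extraction step involves, here is a minimal sketch using Python's standard `email` package. It is not the contents of 'ExtractContent.py'. The function name `extract_subject_and_body` and the sample message are hypothetical, and the sketch assumes each email file holds one raw RFC 822 style message.

```python
import email
from email import policy


def extract_subject_and_body(raw_bytes):
    """Return (subject, body_text) for one raw email message.

    Hypothetical helper for illustration; the corpus script may differ.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    subject = str(msg["subject"] or "")

    # Prefer a plain-text part; fall back to HTML if no plain part exists.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return subject, ""

    try:
        body = part.get_content()
    except (LookupError, UnicodeError):
        # Spam often declares unknown or wrong charsets; decode leniently.
        payload = part.get_payload(decode=True) or b""
        body = payload.decode("utf-8", errors="replace")
    return subject, body


# Made-up sample message, not taken from the corpus.
SAMPLE_EMAIL = (
    b"Subject: Hello there\r\n"
    b"From: a@example.com\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Buy now!\r\n"
)

subject, body = extract_subject_and_body(SAMPLE_EMAIL)
print(subject)
print(body)
```

To use more than the subject and body, as suggested above, the same `msg` object exposes other headers (for example `msg["from"]`), which could be returned alongside the two fields shown here.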
