In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam.
The feature vectors generated in this notebook are composed of simple summaries of the text data. We begin by loading in the data produced by the generator notebook.
import pandas as pd
df = pd.read_parquet("data/training.parquet")
To illustrate the computation of feature vectors, we compute them for a sample of three documents from the data loaded above.
import numpy as np
np.random.seed(0xc0fee)
df_samp = df.sample(3)
pd.set_option('display.max_colwidth', None)  # ensures that all the text is visible (-1 is deprecated in recent pandas)
df_samp
The summaries we will compute for each document are: the amount of punctuation; the number of words; the average, minimum, maximum, and 10th/90th-quantile word lengths; the number of capitalised words; and the number of stop words.
To begin, we count the number of pieces of punctuation in each piece of text. We will remove the punctuation from the text as it is counted. This will make computing the later summaries a little simpler.
import re
def strip_punct(doc):
    """
    takes in a document _doc_ and
    returns a tuple of the punctuation-free
    _doc_ and the count of punctuation in _doc_
    """
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc)
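It is worth noting that re.subn behaves like re.sub but also returns the number of substitutions made, which is exactly the punctuation count we want. A quick sanity check (strip_punct is redefined here so the snippet stands alone):

```python
import re

def strip_punct(doc):
    """Return (punctuation-free doc, punctuation count)."""
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc)

text, count = strip_punct("Hello, world!")
print(text, count)  # Hello world 2
```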
df_samp["text_str"] = df_samp["text"].apply(strip_punct)
df_samp
We will store the count of punctuation in a new summaries vector:
df_summaries = pd.DataFrame({'num_punct': df_samp["text_str"].apply(lambda x: x[1])})
df_summaries
df_samp.reset_index(inplace=True)
# note: level_0 and index coincide for the legitimate documents, but not for the spam -
# for spam, index = level_0 mod 20,000
df_samp
Many of the summaries we will compute require us to consider each word in the text, one by one. To avoid splitting the text multiple times, we split it once, then apply each function to the resulting words.
To do this, we "explode" the text into words, so that each word occupies a row of the data frame and retains the associated "level_0", "index" and "label".
rows = []
_ = df_samp.apply(lambda row: [rows.append([ row['level_0'], row['index'], row['label'], word])
for word in row.text_str[0].split()], axis=1)
df_samp_explode = pd.DataFrame(rows, columns=df_samp.columns[0:4])
df_samp_explode
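As an aside, recent pandas versions (0.25+) provide DataFrame.explode, which can replace the manual row-building above; a minimal sketch on toy data (the column names here are illustrative, not the notebook's):

```python
import pandas as pd

toy = pd.DataFrame({"label": ["legitimate", "spam"],
                    "text": ["a fine book", "buy now"]})
toy["word"] = toy["text"].str.split()      # list of words per document
exploded = toy.explode("word")             # one row per word, index repeated per document
print(exploded[["label", "word"]])
```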
Column level_0 contains the index we want to aggregate any calculations over. Computing the number of words in each document is now simply a matter of counting the rows for each value of level_0.
df_summaries["num_words"] = df_samp_explode['level_0'].value_counts()
df_summaries
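Here value_counts simply tallies how many rows share each level_0 value, equivalent to groupby('level_0').size(); on a toy Series:

```python
import pandas as pd

ids = pd.Series([0, 0, 0, 1, 1])   # document id for each word row
counts = ids.value_counts()        # rows per document id
print(counts.to_dict())            # {0: 3, 1: 2}
```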
Many of the remaining summaries require word length to be computed. To save us from recomputing this every time, we will add a column containing this information to our 'exploded' data frame:
df_samp_explode["word_len"] = df_samp_explode["text"].apply(len)
df_samp_explode.sample(10)
In the next cell we compute the average word length as well as the minimum and maximum, for each document.
df_summaries["av_wl"] = df_samp_explode.groupby('level_0')['word_len'].mean() #average word length
df_summaries["max_wl"] = df_samp_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_samp_explode.groupby('level_0')['word_len'].min() #min word length
We can also compute quantiles of the word length:
df_summaries["10_quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.1) #10th quantile word length
df_summaries["90_quantile"]= df_samp_explode.groupby('level_0')['word_len'].quantile(0.9) #90th quantile word length
df_summaries
As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. For each document we will compute the number of capitalised words and the number of stop words.
# str.islower returns True if all cased characters are lowercase (and there is at least one), else False.
# nb: str.isupper only returns True if all cased characters are upper case.
def caps(word):
    return not word.islower()
df_samp_explode["upper_case"] = df_samp_explode['text'].apply(caps)
df_summaries["upper_case"] = df_samp_explode.groupby('level_0')['upper_case'].sum()
df_summaries
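One caveat worth noting: str.islower returns False for strings containing no cased characters at all, so purely numeric or symbolic tokens are counted as capitalised by caps:

```python
def caps(word):
    return not word.islower()

print(caps("Hello"))  # True
print(caps("hello"))  # False
print(caps("2019"))   # True - no cased characters, so islower() is False
```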
Stop words are commonly used words which are usually considered to be unrelated to the document topic. Examples include 'in', 'the', 'at' and 'otherwise'.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the old stop_words module was removed in sklearn 0.24
def isstopword(word):
    return word in ENGLISH_STOP_WORDS
df_samp_explode["stop_words"] = df_samp_explode['text'].apply(isstopword)
df_samp_explode.sample(10)
df_summaries["stop_words"] = df_samp_explode.groupby('level_0')['stop_words'].sum()
df_summaries
Now that we've illustrated how to compute the summaries on a subsample of our data, we will go ahead and compute the summaries for each of the texts in the full dataset. In order to minimise clutter in this notebook, we have introduced a helper class, SimpleSummaries, in the mlworkflows.featuressimple module.
df.reset_index(inplace=True)
from mlworkflows import featuressimple
simple_summary = featuressimple.SimpleSummaries()
summaries = simple_summary.transform(df["text"])
from sklearn.pipeline import Pipeline
feat_pipeline = Pipeline([
('features',simple_summary)
])
from mlworkflows import util
util.serialize_to(feat_pipeline, "feature_pipeline.sav")
features = pd.concat([df[["index", "label"]],
pd.DataFrame(summaries)], axis=1)
features
features.columns = features.columns.astype(str)  # parquet requires string column names
As in earlier notebooks, we use PCA to project the space of summaries to 2 dimensions, which we can then plot.
import sklearn.decomposition
DIMENSIONS = 2
pca = sklearn.decomposition.PCA(DIMENSIONS)
pca_summaries = pca.fit_transform(features.iloc[:, 2:])
from mlworkflows import plot
pca_summaries_plot_data = pd.concat([df, pd.DataFrame(pca_summaries, columns=["x", "y"])], axis=1)
plot.plot_points(pca_summaries_plot_data, x="x", y="y", color="label")
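It can also be worth checking how much of the variance in the summaries the two principal components actually capture, via explained_variance_ratio_. A sketch on random stand-in data (the real values come from the pca object fitted on features above):

```python
import numpy as np
import sklearn.decomposition

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))      # stand-in for the summary matrix
pca = sklearn.decomposition.PCA(2)
projected = pca.fit_transform(X)
print(projected.shape)                       # (100, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of total variance retained
```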
features.to_parquet("data/features.parquet")
Now that we have a feature engineering approach, the next step is to train a model. Again, you have two choices for your next step: click here for a model based on logistic regression, or click here for a model based on ensembles of decision trees.