In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam.
The feature vectors generated in this notebook are composed of simple summaries of the text data. We begin by loading in the data produced by the generator notebook.
import pandas as pd
df = pd.read_parquet("data/training.parquet")
To illustrate the computation of feature vectors, we compute them for a sample of three documents from the data loaded above.
import numpy as np
np.random.seed(0xc0fee)
df_samp = df.sample(3)
pd.set_option('display.max_colwidth', None)  # ensures that all the text is visible (-1 is deprecated in recent pandas)
df_samp
The summaries we will compute for each document are: the amount of punctuation; the number of words; the average, minimum, maximum, and 10th/90th-quantile word lengths; the number of capitalised words; and the number of stop words.
To begin, we count the number of pieces of punctuation in each piece of text. We will remove the punctuation from the text as it is counted. This will make computing the later summaries a little simpler.
import re
def strip_punct(doc):
    """
    takes in a document _doc_ and
    returns a tuple of the punctuation-free
    _doc_ and the count of punctuation in _doc_
    """
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc)
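It is worth noting that re.subn behaves like re.sub but also returns the number of substitutions made, which is exactly the punctuation count we want. A quick sanity check (strip_punct is redefined here so the snippet stands alone):

```python
import re

def strip_punct(doc):
    """Return (punctuation-free doc, punctuation count)."""
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc)

text, count = strip_punct("Hello, world!")
print(text, count)  # Hello world 2
```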
df_samp["text_str"] = df_samp["text"].apply(strip_punct)
df_samp
We will store the count of punctuation in a new summaries vector:
df_summaries = pd.DataFrame({'num_punct': df_samp["text_str"].apply(lambda x: x[1])})
df_summaries
df_samp.reset_index(inplace=True)
# note: level_0 and index coincide for the legitimate documents, but not for the spam -
# for spam, index = level_0 mod 20,000
df_samp
Many of the summaries we will compute require us to consider each word in the text, one by one. To avoid splitting the text multiple times, we split it once, then apply each function to the resulting words.
To do this, we "explode" the text into words, so that each word occupies a row of the data frame and retains the associated "level_0", "index" and "label".
rows = []
_ = df_samp.apply(lambda row: [rows.append([ row['level_0'], row['index'], row['label'], word])
for word in row.text_str[0].split()], axis=1)
df_samp_explode = pd.DataFrame(rows, columns=df_samp.columns[0:4])
df_samp_explode
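As an aside, recent pandas versions (0.25+) provide DataFrame.explode, which can replace the manual row-building above; a minimal sketch on toy data (the column names here are illustrative, not the notebook's):

```python
import pandas as pd

toy = pd.DataFrame({"label": ["legitimate", "spam"],
                    "text": ["a fine book", "buy now"]})
toy["word"] = toy["text"].str.split()      # list of words per document
exploded = toy.explode("word")             # one row per word, index repeated per document
print(exploded[["label", "word"]])
```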
Column level_0 contains the index we want to aggregate any calculations over. Computing the number of words in each document is now simply a matter of counting the rows for each value of level_0.
df_summaries["num_words"] = df_samp_explode['level_0'].value_counts()
df_summaries
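Here value_counts simply tallies how many rows share each level_0 value, equivalent to groupby('level_0').size(); on a toy Series:

```python
import pandas as pd

ids = pd.Series([0, 0, 0, 1, 1])   # document id for each word row
counts = ids.value_counts()        # rows per document id
print(counts.to_dict())            # {0: 3, 1: 2}
```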
Many of the remaining summaries require word length to be computed. To save us from recomputing this every time, we will add a column containing this information to our 'exploded' data frame:
df_samp_explode["word_len"] = df_samp_explode["text"].apply(len)
df_samp_explode.sample(10)
In the next cell we compute the average word length as well as the minimum and maximum, for each document.
df_summaries["av_wl"] = df_samp_explode.groupby('level_0')['word_len'].mean() #average word length
df_summaries["max_wl"] = df_samp_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_samp_explode.groupby('level_0')['word_len'].min() #min word length
We can also compute quantiles of the word length:
df_summaries["10_quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.1) #10th quantile word length
df_summaries["90_quantile"]= df_samp_explode.groupby('level_0')['word_len'].quantile(0.9) #90th quantile word length
df_summaries
As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. For each document we will compute the number of capitalised words and the number of stop words.
# str.islower returns True if all cased characters are lowercase (and there is at least one), else False.
# nb: str.isupper only returns True if all cased characters are upper case.
def caps(word):
    return not word.islower()
df_samp_explode["upper_case"] = df_samp_explode['text'].apply(caps)
df_summaries["upper_case"] = df_samp_explode.groupby('level_0')['upper_case'].sum()
df_summaries
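One caveat worth noting: str.islower returns False for strings containing no cased characters at all, so purely numeric or symbolic tokens are counted as capitalised by caps:

```python
def caps(word):
    return not word.islower()

print(caps("Hello"))  # True
print(caps("hello"))  # False
print(caps("2019"))   # True - no cased characters, so islower() is False
```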
Stop words are commonly used words which are usually considered to be unrelated to the document topic. Examples include 'in', 'the', 'at' and 'otherwise'.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the old stop_words module was removed in sklearn 0.24
def isstopword(word):
    return word in ENGLISH_STOP_WORDS
df_samp_explode["stop_words"] = df_samp_explode['text'].apply(isstopword)
df_samp_explode.sample(10)
df_summaries["stop_words"] = df_samp_explode.groupby('level_0')['stop_words'].sum()
df_summaries
Now that we've illustrated how to compute the summaries on a subsample of our data, we will go ahead and compute the summaries for each of the texts in the full dataset. In order to minimise clutter in this notebook, we have introduced a helper class, SimpleSummaries, in the mlworkflows.featuressimple module.
df.reset_index(inplace=True)
from mlworkflows import featuressimple
simple_summary = featuressimple.SimpleSummaries()
summaries = simple_summary.transform(df["text"])
from sklearn.pipeline import Pipeline
feat_pipeline = Pipeline([
('features',simple_summary)
])
from mlworkflows import util
util.serialize_to(feat_pipeline, "feature_pipeline.sav")
features = pd.concat([df[["index", "label"]],
pd.DataFrame(summaries)], axis=1)
features
features.columns = features.columns.astype(str)  # parquet requires string column names
As in earlier notebooks, we use PCA to project the space of summaries to 2 dimensions, which we can then plot.
import sklearn.decomposition
DIMENSIONS = 2
pca = sklearn.decomposition.PCA(DIMENSIONS)
pca_summaries = pca.fit_transform(features.iloc[:, 2:])
from mlworkflows import plot
pca_summaries_plot_data = pd.concat([df, pd.DataFrame(pca_summaries, columns=["x", "y"])], axis=1)
plot.plot_points(pca_summaries_plot_data, x="x", y="y", color="label")
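It can also be worth checking how much of the variance in the summaries the two principal components actually capture, via explained_variance_ratio_. A sketch on random stand-in data (the real values come from the pca object fitted on features above):

```python
import numpy as np
import sklearn.decomposition

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))      # stand-in for the summary matrix
pca = sklearn.decomposition.PCA(2)
projected = pca.fit_transform(X)
print(projected.shape)                       # (100, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of total variance retained
```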
features.to_parquet("data/features.parquet")
Now that we have a feature engineering approach, the next step is to train a model. Again, you have two choices for your next step: click here for a model based on logistic regression, or click here for a model based on ensembles of decision trees.