Working with custom preprocessing methods

In this notebook, we will see an example of how to define and use our own custom preprocessing methods in PySS3, and also how to tell the Live Test Tool to use them as well.

Let's begin by importing the needed modules:

In [ ]:
%matplotlib inline
import re

from pyss3 import SS3
from pyss3.util import Dataset, Evaluation, span
from pyss3.server import Live_Test

from sklearn.metrics import accuracy_score
from nltk.stem import SnowballStemmer

To keep things simple, the preprocessing will consist of just applying simple stemming to the documents using the Snowball Stemmer. So, we will define a simple function, called my_preprocessing, to carry out that task for us.

In [ ]:
stemmer = SnowballStemmer('english')

def stem(match):
    return stemmer.stem(match.group(0))

def my_preprocessing(text):
    # replace each word (\w+) with its stemmed version
    return re.sub(r"\w+", stem, text)
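To see the same `re.sub` callback pattern in isolation, here is a minimal stdlib-only sketch in which a tiny, hypothetical suffix-stripper stands in for NLTK's SnowballStemmer (so it runs without extra dependencies; the real stemmer is more sophisticated):

```python
import re

def toy_stem(match):
    # `match` is a re.Match object; match.group(0) is the matched word
    word = match.group(0)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_preprocessing(text):
    # re.sub calls toy_stem once per \w+ match and substitutes
    # whatever string the callback returns
    return re.sub(r"\w+", toy_stem, text)

print(toy_preprocessing("wasting your films"))  # -> "wast your film"
```

The key point is that when the second argument to `re.sub` is a function, it is invoked for each match, which is exactly how `my_preprocessing` applies the stemmer word by word.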

We will use the same dataset used in the Sentiment Analysis on Movie Reviews tutorial. Let's load the training and test sets.

In [ ]:
x_train, y_train = Dataset.load_from_files("datasets/movie_review/train")
x_test, y_test = Dataset.load_from_files("datasets/movie_review/test")

Now that the dataset has been loaded, we will use our my_preprocessing function to preprocess all the training and test documents, like so:

In [ ]:
# Note: A better option would be to preprocess all the documents
# only once and store them to disk. Then, later, we could
# load our preprocessed version of the dataset. But we're keeping
# things simple :)
x_train_prep = [my_preprocessing(doc) for doc in x_train]
x_test_prep = [my_preprocessing(doc) for doc in x_test]

Let's train our model using the preprocessed training documents we have stored in x_train_prep. In addition, we need to use the prep argument to tell our classifier to disable the default preprocessing (prep=False).

In [ ]:
# In the "Hyperparameter Optimization" section at the bottom,
# it is shown how we obtained these hyperparameter values: s=.44, l=.48, p=.5
clf = SS3(s=.44, l=.48, p=.5)

# Let the training begin!
clf.train(x_train_prep, y_train, n_grams=3, prep=False)

Let's check if our classifier performs well at classifying the test documents.

In [ ]:
# Here we're also disabling default preprocessing
# since ``x_test_prep`` is already preprocessed
# by our custom function
y_pred = clf.predict(x_test_prep, prep=False)

accuracy = accuracy_score(y_pred, y_test)

print("Accuracy was:", accuracy)

Not bad! Note that better performance could perhaps be obtained by performing hyperparameter optimization on our new preprocessed dataset, since the hyperparameter values we used (s=0.44, l=0.48, p=0.5) were selected using the default preprocessing (but we're keeping this notebook as simple as possible).

OK, suppose we now want to visualize what our classifier is learning and how it carries out the classification process. We could just use the Live Test as usual, but this time with our preprocessed test documents (x_test_prep), again disabling the default preprocessing (prep=False), as follows:

In [ ]:
# note we are using the preprocessed documents here (`x_test_prep`)
Live_Test.run(clf, x_test_prep, y_test, prep=False)

# Press Esc and then the I key twice to stop it
# * Remember that the Live Test will only work if you're running
#   this notebook, locally, on your computer :(

The visualization isn't bad; however, documents are displayed as they are in x_test_prep, that is, preprocessed (stemmed), as shown below:

It would be very nice to have the Live Test display the documents as they originally were, that is, to display the true raw documents. This can be accomplished by running the Live Test with the original documents stored in x_test and using the prep_func argument to specify the function we want applied when classifying; in our case, prep_func=my_preprocessing, as follows:

In [ ]:
Live_Test.run(clf, x_test, y_test, prep_func=my_preprocessing)

Now the documents are displayed in their original format. We can then interactively select individual words (or n-grams) and see, at the bottom, their preprocessed version (that is, the actual token that SS3 used to represent the word or n-gram). For instance, as shown below, when the user selects the 3-gram "wasting your time", the bottom displays "$wast\rightarrow your\rightarrow time$", indicating the true value used by the classifier (arrows indicate "transitions", i.e., "going from one word to the other"). The same happens, for instance, with the word "watching" (and "behaving"), indicating that it was converted to "$watch$" (and "$behav$") by our custom preprocessing (my_preprocessing).

... and... that's it for now, well done! :D

Hyperparameter Optimization

In [ ]:
clf = SS3(name="movie-reviews")

# to speed up the process, we won't use 3-grams but single words
# (i.e. we won't use the n_grams=3 argument)
clf.train(x_train_prep, y_train, prep=False)
In [ ]:
best_s, best_l, best_p, best_a = Evaluation.grid_search(
    clf, x_test_prep, y_test,
    s=span(0.2, 0.8, 6),
    l=span(0.1, 2, 6),
    p=span(0.5, 2, 6),
    a=[0, .1, .2],
    prep=False,  # <- do not forget to disable default preprocessing
    tag="grid search (test)"
)

print("The hyperparameter values that obtained the best Accuracy are:")
print("Smoothness(s):", best_s)
print("Significance(l):", best_l)
print("Sanction(p):", best_p)
print("Alpha(a):", best_a)

In [ ]:
clf.set_hyperparameters(0.44, 0.48, 0.5, 0.0)
y_pred = clf.predict(x_test_prep, prep=False)

accuracy = accuracy_score(y_pred, y_test)
print("Accuracy was:", accuracy)

The best accuracy with the obtained hyperparameters is 0.828. Now let's train a 3-grams version using the same hyperparameters:

In [ ]:
clf = SS3(0.44, 0.48, 0.5, 0.0, name="movie-reviews")

clf.train(x_train_prep, y_train, n_grams=3, prep=False)
In [ ]:
y_pred = clf.predict(x_test_prep, prep=False)

accuracy = accuracy_score(y_pred, y_test)
print("Accuracy was:", accuracy)

The accuracy improved! It went from 0.828 to 0.853 :)