Let's start by implementing a canonical text classification example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the train and test splits of the 20 newsgroups dataset
twenty_train = fetch_20newsgroups(subset='train')
twenty_test = fetch_20newsgroups(subset='test')

# Turn the raw text documents into sparse TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(twenty_train.data)
y_train = twenty_train.target

# Fit a multinomial naive Bayes classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Vectorize the test documents with the vocabulary fitted on the training set
X_test = vectorizer.transform(twenty_test.data)
y_test = twenty_test.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
Let's now decompose what we just did to understand and customize each step.
twenty_train = fetch_20newsgroups(subset='train')
twenty_test = fetch_20newsgroups(subset='test')

# The list of the 20 newsgroup names (the classification targets)
target_names = twenty_train.target_names
target_names

# The targets are encoded as integer indices into target_names
twenty_train.target
twenty_train.target.shape
twenty_test.target.shape

# The raw documents are stored as a list of unicode strings
len(twenty_train.data)
type(twenty_train.data[0])
def display_sample(i):
    """Print the class name and text content of training document i."""
    print("Class name: " + target_names[twenty_train.target[i]])
    print("Text content:\n")
    print(twenty_train.data[i])

display_sample(0)
display_sample(1)
Let's compute the (uncompressed, in-memory) size of the training and test sets in MB, assuming an 8-bit encoding (in this case, all characters can be encoded using the latin-1 charset).
def text_size(text, charset='iso-8859-1'):
    # One byte per character in an 8-bit charset; 1e-6 converts bytes to MB
    return len(text.encode(charset)) * 1e-6

train_size_mb = sum(text_size(text) for text in twenty_train.data)
test_size_mb = sum(text_size(text) for text in twenty_test.data)

print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
Exercise: customize the preprocessing, tokenization or analysis stage of the vectorizer.

Hint: the TfidfVectorizer class can accept Python functions to customize the preprocessor, tokenizer or analyzer stages of the vectorizer. Don't forget to use the IPython ? suffix operator on any Python class or method to read its docstring, or even the ?? operator to read the source code.
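As a minimal sketch of what such a customization could look like, here is a hypothetical regexp-based tokenizer passed to the vectorizer (the tokenizer itself is an illustrative assumption, not the exercise solution):

import re

from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(text):
    # Illustrative tokenizer: lowercase the text, keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

custom_vectorizer = TfidfVectorizer(tokenizer=simple_tokenizer)
X_custom = custom_vectorizer.fit_transform(twenty_train.data)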
# TODO: use a subset of the data to make this fast to compute
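A minimal sketch of such subsampling (the subset size of 1000 documents is an arbitrary assumption):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Keep only the first 1000 documents to iterate quickly
small_texts = twenty_train.data[:1000]
small_target = twenty_train.target[:1000]

small_vectorizer = TfidfVectorizer()
X_small = small_vectorizer.fit_transform(small_texts)
small_classifier = MultinomialNB().fit(X_small, small_target)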
# TODO: Print the confusion matrix
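As a possible sketch for this step, reusing the classifier and test matrix fitted above:

from sklearn.metrics import confusion_matrix

predictions = classifier.predict(X_test)
# Row i, column j counts test documents of true class i predicted as class j
print(confusion_matrix(y_test, predictions))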
# TODO: display the most important features for each linear model
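A sketch of how this could look, swapping in an SGDClassifier for illustration since it exposes one row of linear weights per class via coef_ (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

import numpy as np
from sklearn.linear_model import SGDClassifier

linear_clf = SGDClassifier(random_state=42).fit(X_train, y_train)
feature_names = np.asarray(vectorizer.get_feature_names_out())

# For each class, show the 10 features with the largest positive weights
for i, name in enumerate(target_names):
    top10 = np.argsort(linear_clf.coef_[i])[-10:]
    print("{0}: {1}".format(name, " ".join(feature_names[top10])))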
# TODO: analyze misclassifications by ranking them by how strongly they violate the decision function / probability thresholds
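A possible sketch, ranking the wrongly classified test documents by the (misplaced) probability the naive Bayes model assigned to its prediction:

import numpy as np

probas = classifier.predict_proba(X_test)
predictions = probas.argmax(axis=1)
wrong = np.where(predictions != y_test)[0]

# Sort the misclassified documents by descending predicted probability:
# the most "confidently wrong" documents come first
confidence = probas[wrong, predictions[wrong]]
most_confident_errors = wrong[np.argsort(confidence)[::-1]]

for i in most_confident_errors[:5]:
    print("Predicted {0} (p={1:.2f}) but true class is {2}".format(
        target_names[predictions[i]],
        probas[i, predictions[i]],
        target_names[y_test[i]]))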
# TODO: semi-supervised learning / active learning
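As an illustrative sketch of uncertainty-based active learning (the initial pool of 100 labeled documents, the batch size of 50 and the 5 rounds are all arbitrary assumptions):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Start from a small labeled pool; the rest is treated as unlabeled
rng = np.random.RandomState(42)
n_samples = X_train.shape[0]
labeled = rng.choice(n_samples, size=100, replace=False)
unlabeled = np.setdiff1d(np.arange(n_samples), labeled)

for round_id in range(5):
    model = MultinomialNB().fit(X_train[labeled], y_train[labeled])
    # Query the 50 unlabeled documents the model is least certain about
    probas = model.predict_proba(X_train[unlabeled])
    uncertainty = 1.0 - probas.max(axis=1)
    query = unlabeled[np.argsort(uncertainty)[-50:]]
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
    print("Round {0}: test score {1:.3f}".format(
        round_id, model.score(X_test, y_test)))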
# TODO: K-Means features from large unlabeled datasets
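A sketch of deriving such features with MiniBatchKMeans (the choice of 50 clusters is an arbitrary assumption): transform() maps each document to its distances to the cluster centers, which can serve as a dense feature representation.

from sklearn.cluster import MiniBatchKMeans

# Fit on the (unlabeled) text vectors; no target labels are needed
kmeans = MiniBatchKMeans(n_clusters=50, random_state=42)
X_train_km = kmeans.fit_transform(X_train)
X_test_km = kmeans.transform(X_test)
print(X_train_km.shape)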
# TODO: quick presentation of minibatch kmeans + NMF + display important cluster terms + word clouds
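And a sketch of the NMF part, printing the most important terms of each extracted topic (10 topics and 8 terms per topic are arbitrary choices; the word clouds are left out):

import numpy as np
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, random_state=42)
nmf.fit(X_train)
feature_names = np.asarray(vectorizer.get_feature_names_out())

# For each topic, print the terms with the largest weights
for topic_idx, topic in enumerate(nmf.components_):
    top_terms = feature_names[np.argsort(topic)[-8:][::-1]]
    print("Topic #{0}: {1}".format(topic_idx, " ".join(top_terms)))

The same argsort trick applied to kmeans.cluster_centers_ displays the most important terms of each K-Means cluster.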