Let's start by implementing a canonical text classification example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the train and test splits of the 20 newsgroups dataset
twenty_train = fetch_20newsgroups(subset='train')
twenty_test = fetch_20newsgroups(subset='test')

# Turn the raw text documents into sparse TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(twenty_train.data)
y_train = twenty_train.target

# Fit a multinomial naive Bayes classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Vectorize the test documents with the vocabulary fitted on the training set
X_test = vectorizer.transform(twenty_test.data)
y_test = twenty_test.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
Let's now decompose what we just did to understand and customize each step.
twenty_train = fetch_20newsgroups(subset='train')
twenty_test = fetch_20newsgroups(subset='test')

# The list of the 20 newsgroup names (the classification targets)
target_names = twenty_train.target_names
target_names

# The targets are encoded as integer indices into target_names
twenty_train.target
twenty_train.target.shape
twenty_test.target.shape

# The raw documents are stored as a list of unicode strings
len(twenty_train.data)
type(twenty_train.data[0])
def display_sample(i):
    """Print the class name and text content of training document i."""
    print("Class name: " + target_names[twenty_train.target[i]])
    print("Text content:\n")
    print(twenty_train.data[i])

display_sample(0)
display_sample(1)
Let's compute the (uncompressed, in-memory) size of the training and test sets in MB, assuming an 8-bit encoding (in this case, all characters can be encoded using the latin-1 charset).
def text_size(text, charset='iso-8859-1'):
    # One byte per character in an 8-bit charset; 1e-6 converts bytes to MB
    return len(text.encode(charset)) * 1e-6

train_size_mb = sum(text_size(text) for text in twenty_train.data)
test_size_mb = sum(text_size(text) for text in twenty_test.data)

print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
Exercise: customize the preprocessing, tokenization or analysis stage of the vectorizer.

Hint: the TfidfVectorizer class can accept Python functions to customize the preprocessor, tokenizer or analyzer stages of the vectorizer. Don't forget to use the IPython ? suffix operator on any Python class or method to read its docstring, or even the ?? operator to read the source code.
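As a minimal sketch of what such a customization could look like, here is a hypothetical regexp-based tokenizer passed to the vectorizer (the tokenizer itself is an illustrative assumption, not the exercise solution):

import re

from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(text):
    # Illustrative tokenizer: lowercase the text, keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

custom_vectorizer = TfidfVectorizer(tokenizer=simple_tokenizer)
X_custom = custom_vectorizer.fit_transform(twenty_train.data)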
# TODO: use a subset of the data to make this fast to compute
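A minimal sketch of such subsampling (the subset size of 1000 documents is an arbitrary assumption):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Keep only the first 1000 documents to iterate quickly
small_texts = twenty_train.data[:1000]
small_target = twenty_train.target[:1000]

small_vectorizer = TfidfVectorizer()
X_small = small_vectorizer.fit_transform(small_texts)
small_classifier = MultinomialNB().fit(X_small, small_target)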
# TODO: Print the confusion matrix
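As a possible sketch for this step, reusing the classifier and test matrix fitted above:

from sklearn.metrics import confusion_matrix

predictions = classifier.predict(X_test)
# Row i, column j counts test documents of true class i predicted as class j
print(confusion_matrix(y_test, predictions))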
# TODO: display the most important features for each linear model
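A sketch of how this could look, swapping in an SGDClassifier for illustration since it exposes one row of linear weights per class via coef_ (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

import numpy as np
from sklearn.linear_model import SGDClassifier

linear_clf = SGDClassifier(random_state=42).fit(X_train, y_train)
feature_names = np.asarray(vectorizer.get_feature_names_out())

# For each class, show the 10 features with the largest positive weights
for i, name in enumerate(target_names):
    top10 = np.argsort(linear_clf.coef_[i])[-10:]
    print("{0}: {1}".format(name, " ".join(feature_names[top10])))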
# TODO: analyze misclassifications by ranking them by how strongly they violate the decision function / probability thresholds
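A possible sketch, ranking the wrongly classified test documents by the (misplaced) probability the naive Bayes model assigned to its prediction:

import numpy as np

probas = classifier.predict_proba(X_test)
predictions = probas.argmax(axis=1)
wrong = np.where(predictions != y_test)[0]

# Sort the misclassified documents by descending predicted probability:
# the most "confidently wrong" documents come first
confidence = probas[wrong, predictions[wrong]]
most_confident_errors = wrong[np.argsort(confidence)[::-1]]

for i in most_confident_errors[:5]:
    print("Predicted {0} (p={1:.2f}) but true class is {2}".format(
        target_names[predictions[i]],
        probas[i, predictions[i]],
        target_names[y_test[i]]))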
# TODO: semi-supervised learning / active learning
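As an illustrative sketch of uncertainty-based active learning (the initial pool of 100 labeled documents, the batch size of 50 and the 5 rounds are all arbitrary assumptions):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Start from a small labeled pool; the rest is treated as unlabeled
rng = np.random.RandomState(42)
n_samples = X_train.shape[0]
labeled = rng.choice(n_samples, size=100, replace=False)
unlabeled = np.setdiff1d(np.arange(n_samples), labeled)

for round_id in range(5):
    model = MultinomialNB().fit(X_train[labeled], y_train[labeled])
    # Query the 50 unlabeled documents the model is least certain about
    probas = model.predict_proba(X_train[unlabeled])
    uncertainty = 1.0 - probas.max(axis=1)
    query = unlabeled[np.argsort(uncertainty)[-50:]]
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
    print("Round {0}: test score {1:.3f}".format(
        round_id, model.score(X_test, y_test)))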
# TODO: K-Means features from large unlabeled datasets
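A sketch of deriving such features with MiniBatchKMeans (the choice of 50 clusters is an arbitrary assumption): transform() maps each document to its distances to the cluster centers, which can serve as a dense feature representation.

from sklearn.cluster import MiniBatchKMeans

# Fit on the (unlabeled) text vectors; no target labels are needed
kmeans = MiniBatchKMeans(n_clusters=50, random_state=42)
X_train_km = kmeans.fit_transform(X_train)
X_test_km = kmeans.transform(X_test)
print(X_train_km.shape)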
# TODO: quick presentation of minibatch kmeans + NMF + display important cluster terms + word clouds
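And a sketch of the NMF part, printing the most important terms of each extracted topic (10 topics and 8 terms per topic are arbitrary choices; the word clouds are left out):

import numpy as np
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, random_state=42)
nmf.fit(X_train)
feature_names = np.asarray(vectorizer.get_feature_names_out())

# For each topic, print the terms with the largest weights
for topic_idx, topic in enumerate(nmf.components_):
    top_terms = feature_names[np.argsort(topic)[-8:][::-1]]
    print("Topic #{0}: {1}".format(topic_idx, " ".join(top_terms)))

The same argsort trick applied to kmeans.cluster_centers_ displays the most important terms of each K-Means cluster.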