The objective of this site is to show how to use the Multinomial Naive Bayes method to classify news stories into predefined categories.
The News Aggregator Data Set comes from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/News+Aggregator
This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. News categories in this dataset are labelled: b (business), t (science and technology), e (entertainment), and m (health).
Using the Multinomial Naive Bayes method, we will try to predict the category (business, entertainment, etc.) of a news story given only its headline.
Let's begin by importing the pandas (Python Data Analysis Library) module. The import statement is the most common way to gain access to the code in another module.
import pandas as pd
This way we can refer to pandas by its alias 'pd'. Let's import the news aggregator data via pandas:
news = pd.read_csv("uci-news-aggregator.csv")
The head function gives us the first 5 rows of the DataFrame (or the first 5 items of a column):

news.head()
   ID                                              TITLE  \
0   1  Fed official says weak data caused by weather,...
1   2  Fed's Charles Plosser sees high bar for change...
2   3  US open: Stocks fall after Fed official hints ...
3   4  Fed risks falling 'behind the curve', Charles ...
4   5  Fed's Plosser: Nasty Weather Has Curbed Job Gr...

                                                 URL          PUBLISHER  \
0  http://www.latimes.com/business/money/la-fi-mo...  Los Angeles Times
1  http://www.livemint.com/Politics/H2EvwJSK2VE6O...           Livemint
2  http://www.ifamagazine.com/news/us-open-stocks...       IFA Magazine
3  http://www.ifamagazine.com/news/fed-risks-fall...       IFA Magazine
4  http://www.moneynews.com/Economy/federal-reser...          Moneynews

  CATEGORY                          STORY             HOSTNAME      TIMESTAMP
0        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM      www.latimes.com  1394470370698
1        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM     www.livemint.com  1394470371207
2        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM  www.ifamagazine.com  1394470371550
3        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM  www.ifamagazine.com  1394470371793
4        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM    www.moneynews.com  1394470372027
We want to predict the category of a news article based only on its title. The LabelEncoder class allows us to encode labels with values between 0 and n_classes-1.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
print(y[:5])
[0 0 0 0 0]
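As a quick sanity check (a minimal sketch on toy labels, not part of the original workflow), LabelEncoder stores the sorted unique labels in its classes_ attribute, and inverse_transform maps the integer codes back to the original strings:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the CATEGORY column
sample = ['b', 't', 'e', 'b', 'm']
enc = LabelEncoder()
codes = enc.fit_transform(sample)
print(enc.classes_)                  # ['b' 'e' 'm' 't'] -- sorted unique labels
print(codes)                         # [0 3 1 0 2]
print(enc.inverse_transform(codes))  # back to the original strings
```

Note that the integer assigned to each category follows the alphabetical order of classes_, not the order of appearance in the data.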
categories = news['CATEGORY']
titles = news['TITLE']
N = len(titles)
print('Number of news', N)
Number of news 422419
labels = list(set(categories))
print('possible categories', labels)
possible categories ['t', 'm', 'e', 'b']
for l in labels:
    print('number of', l, 'news', len(news.loc[news['CATEGORY'] == l]))
number of t news 108344
number of m news 45639
number of e news 152469
number of b news 115967
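The same per-class counts can be obtained in a single call with pandas' value_counts. A sketch on a toy stand-in for the CATEGORY column (the real counts come from the loaded DataFrame):

```python
import pandas as pd

# Toy stand-in for the CATEGORY column
cats = pd.Series(['b', 'b', 't', 'e', 'e', 'e', 'm'])
print(cats.value_counts())  # counts per label, sorted by frequency
```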
Categories are literal labels, but it is better for machine learning algorithms to work with numbers, so we will encode them using LabelEncoder, which encodes labels with values between 0 and n_classes-1.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
ncategories = encoder.fit_transform(categories)
Now we should split our data into two sets: a training set, used to fit the classifier, and a test set, used to evaluate it.
Samples should first be shuffled and then split into a pair of training and test sets. Make sure you permute (shuffle) your training data before fitting the model.
from sklearn.utils import shuffle

Ntrain = int(N * 0.7)
titles, ncategories = shuffle(titles, ncategories, random_state=0)
X_train = titles[:Ntrain]
print('X_train.shape', X_train.shape)
y_train = ncategories[:Ntrain]
print('y_train.shape', y_train.shape)
X_test = titles[Ntrain:]
print('X_test.shape', X_test.shape)
y_test = ncategories[Ntrain:]
print('y_test.shape', y_test.shape)
X_train.shape (295693,)
y_train.shape (295693,)
X_test.shape (126726,)
y_test.shape (126726,)
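The shuffle-then-slice steps above can also be done in one call with scikit-learn's train_test_split, which shuffles by default. A sketch on toy data (on the real data you would pass titles and ncategories instead):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the titles and encoded categories
toy_titles = ['headline %d' % i for i in range(10)]
toy_labels = list(range(10))

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_titles, toy_labels, test_size=0.3, random_state=0)
print(len(X_tr), len(X_te))  # 7 3
```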
In order to make the training process easier, scikit-learn provides a Pipeline class that behaves like a compound classifier. The first step is to tokenize the news titles and count the occurrences of each word, using the CountVectorizer class. The second step transforms those counts into a tf-idf representation using the TfidfTransformer class. The last step creates the Naive Bayes classifier.
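To see what the first two steps produce before wiring them into a pipeline, here is a toy example on an illustrative two-headline corpus: a sparse document-term count matrix, then the same matrix re-weighted by tf-idf:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Illustrative two-headline corpus (9 distinct words)
corpus = ['Fed official says weak data', 'Stocks fall after Fed hints']
counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)
print(counts.shape)  # (2, 9): one row per document, one column per word
print(tfidf.shape)   # same shape, but entries are tf-idf weights
```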
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
print('Training...')
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
Now we proceed to fit the Naive Bayes classifier to the training set.
text_clf = text_clf.fit(X_train, y_train)
Now we can proceed to apply the classifier to the test set and compute the predicted values.
print('Predicting...')
predicted = text_clf.predict(X_test)
The sklearn.metrics module includes score functions, performance metrics, and pairwise metrics and distance computations. accuracy_score computes the subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of true labels.
from sklearn import metrics

print('accuracy_score', metrics.accuracy_score(y_test, predicted))
print('Reporting...')
accuracy_score 0.92380411281
Reporting...
Let's build a text report showing the main classification metrics, with precision, recall and F1-score for each class in the test data. Note that target_names must list the classes in the order used by the encoder (encoder.classes_, which is sorted alphabetically), not the arbitrary order of the labels list built from a set.

print(metrics.classification_report(y_test, predicted, target_names=encoder.classes_))

             precision    recall  f1-score   support

          b       0.90      0.91      0.90     34729
          e       0.95      0.97      0.96     45625
          m       0.97      0.85      0.90     13709
          t       0.90      0.90      0.90     32663

avg / total       0.92      0.92      0.92    126726
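The fitted pipeline can classify any new headline the same way, and encoder.inverse_transform maps the predicted code back to a category letter. Since the full model is not available here, this is a self-contained sketch on a made-up four-headline training set (the labels 0 = business, 1 = entertainment are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training set: 0 = business, 1 = entertainment (illustrative)
X = ['stocks fall on weak data', 'fed raises interest rates',
     'new movie tops box office', 'actor wins big award']
y = [0, 0, 1, 1]

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
toy_clf.fit(X, y)
print(toy_clf.predict(['movie wins award']))  # [1]
```

With the real model, text_clf.predict(['some new headline']) returns the encoded category, which encoder.inverse_transform turns back into b, e, m or t.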
Have you heard about cross-validation? What about k-fold cross-validation? You can try it now just by repeating the previous steps on different train/test splits (don't forget the shuffling) and averaging the results. Let's try it! Have a nice day!
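As a starting point, scikit-learn's cross_val_score runs k-fold cross-validation on a pipeline directly, handling the splitting and the averaging for us. A sketch on a toy corpus (on the real data you would pass titles and ncategories, and a larger n_splits):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus; with the real data, use titles and ncategories instead
X = ['stocks fall', 'fed raises rates', 'movie tops box office',
     'actor wins award', 'weak jobs data', 'new album released']
y = [0, 0, 1, 1, 0, 1]

clf = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])

# shuffle=True applies the same "shuffle before fitting" advice to each fold
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean())  # average accuracy across the 3 folds
```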