News Categorization using Multinomial Naive Bayes

The objective of this site is to show how to use the Multinomial Naive Bayes method to classify news stories into a set of predefined classes.

The News Aggregator Data Set comes from the UCI Machine Learning Repository.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

This specific dataset can be found in the UCI ML Repository at this URL: http://archive.ics.uci.edu/ml/datasets/News+Aggregator

This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. News categories in this dataset are labelled:

b: business;
t: science and technology;
e: entertainment; and
m: health.

Using the Multinomial Naive Bayes method, we will try to predict the category (business, entertainment, etc.) of a news article given only its headline.

Let's begin by importing pandas (Python Data Analysis Library). The import statement is the most common way to gain access to the code in another module.

In [3]:

import pandas as pd

This way we can refer to pandas by its alias 'pd'. Let's load the News Aggregator data with pandas.

In [4]:

news = pd.read_csv("uci-news-aggregator.csv")

The head function returns the first 5 rows of a DataFrame (or the first 5 items of a column).

In [5]:

print(news.head())

   ID                                              TITLE  \
0   1  Fed official says weak data caused by weather,...   
1   2  Fed's Charles Plosser sees high bar for change...   
2   3  US open: Stocks fall after Fed official hints ...   
3   4  Fed risks falling 'behind the curve', Charles ...   
4   5  Fed's Plosser: Nasty Weather Has Curbed Job Gr...   

                                                 URL          PUBLISHER  \
0  http://www.latimes.com/business/money/la-fi-mo...  Los Angeles Times   
1  http://www.livemint.com/Politics/H2EvwJSK2VE6O...           Livemint   
2  http://www.ifamagazine.com/news/us-open-stocks...       IFA Magazine   
3  http://www.ifamagazine.com/news/fed-risks-fall...       IFA Magazine   
4  http://www.moneynews.com/Economy/federal-reser...          Moneynews   

  CATEGORY                          STORY             HOSTNAME      TIMESTAMP  
0        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM      www.latimes.com  1394470370698  
1        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM     www.livemint.com  1394470371207  
2        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM  www.ifamagazine.com  1394470371550  
3        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM  www.ifamagazine.com  1394470371793  
4        b  ddUyU0VZz0BRneMioxUPQVP6sIxvM    www.moneynews.com  1394470372027

We want to predict the category of a news article based only on its title. The LabelEncoder class encodes labels as integers between 0 and n_classes-1.

In [6]:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
print(y[:5])

[0 0 0 0 0]

In [7]:

categories = news['CATEGORY']
titles = news['TITLE']
N = len(titles)
print('Number of news',N)

Number of news 422419

In [8]:

labels = list(set(categories))
print('possible categories',labels)

possible categories ['t', 'm', 'e', 'b']

In [9]:

for l in labels:
    print('number of ',l,' news',len(news.loc[news['CATEGORY'] == l]))

number of  t  news 108344
number of  m  news 45639
number of  e  news 152469
number of  b  news 115967

Categories are literal labels, but it is more convenient for machine learning algorithms to work with numbers, so we will encode them using LabelEncoder, which encodes labels with values between 0 and n_classes-1.

In [10]:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
ncategories = encoder.fit_transform(categories)
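To see which integer stands for which category, it can help to inspect the fitted encoder. The sketch below uses a small made-up list (toy_categories and toy_encoder are hypothetical names); it relies on the fact that LabelEncoder stores the distinct labels in sorted order in its classes_ attribute, and that inverse_transform maps codes back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels using the same four category letters.
toy_categories = ["t", "b", "e", "m", "b"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_categories)

# classes_ holds the distinct labels in sorted order;
# the position of a label in classes_ is its integer code.
print(list(toy_encoder.classes_))

# Integer codes for the original labels.
print(list(toy_codes))

# inverse_transform maps integer codes back to the original labels.
decoded = toy_encoder.inverse_transform(toy_codes)
print(list(decoded))
```

With the four letters used in this dataset, the sorted order is b, e, m, t, so 0 stands for business and 3 for science and technology.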

Now we should split our data into two sets:

a training set (70%) used to discover potentially predictive relationships, and
a test set (30%) used to evaluate whether the discovered relationships hold and to assess the strength and utility of a predictive relationship.

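Samples should first be shuffled and then split into a train set and a test set. Make sure you shuffle (permute) the data before splitting: the first rows shown above all belong to the same category, so an unshuffled split could give the two sets different class proportions.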

In [11]:

from sklearn.utils import shuffle

Ntrain = int(N * 0.7)
# Shuffle titles and labels together so each title keeps its own label
titles, ncategories = shuffle(titles, ncategories, random_state=0)

In [12]:

X_train = titles[:Ntrain]
print('X_train.shape',X_train.shape)
y_train = ncategories[:Ntrain]
print('y_train.shape',y_train.shape)
X_test = titles[Ntrain:]
print('X_test.shape',X_test.shape)
y_test = ncategories[Ntrain:]
print('y_test.shape',y_test.shape)

X_train.shape (295693,)
y_train.shape (295693,)
X_test.shape (126726,)
y_test.shape (126726,)
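As an alternative to shuffling and slicing by hand, scikit-learn offers train_test_split, which does both in one call. This is only a sketch on made-up data (toy_titles and toy_labels are hypothetical), not the approach used in this notebook; the stratify argument is an optional extra that keeps the class proportions similar in both sets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 headlines, 4 classes with 5 headlines each.
toy_titles = [f"headline {i}" for i in range(20)]
toy_labels = [i % 4 for i in range(20)]

# Shuffle and split in one step: 70% train, 30% test.
# stratify keeps the class proportions roughly equal in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_titles, toy_labels,
    test_size=0.3, shuffle=True, stratify=toy_labels, random_state=0,
)

print("train size", len(X_tr))
print("test size", len(X_te))
```

Fixing random_state makes the split reproducible, in the same way as the random_state=0 passed to shuffle above.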

To make the training process easier, scikit-learn provides a Pipeline class that behaves like a compound classifier. The first step tokenizes the news titles and counts the number of occurrences of each word; for that, we will use the CountVectorizer class. The second step transforms the counts into a tf-idf representation using the TfidfTransformer class. The last step is the Naive Bayes classifier.

In [13]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [14]:

print('Training...')

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

Training...
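To get a feel for what the first two pipeline steps produce, here is a minimal sketch that runs them by hand on three made-up headlines (toy_titles is hypothetical). It assumes the default settings of both classes: lowercasing and word tokens of at least two characters for CountVectorizer, and L2-normalized rows for TfidfTransformer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy headlines.
toy_titles = ["Fed raises rates", "Fed holds rates steady", "New phone released"]

# Step 1: build the vocabulary and count word occurrences per headline.
vect = CountVectorizer()
counts = vect.fit_transform(toy_titles)  # sparse matrix, one row per headline

# vocabulary_ maps each word to its column; columns follow sorted word order.
print(sorted(vect.vocabulary_))
print(counts.toarray())

# Step 2: reweight the counts with tf-idf, so that words appearing in many
# headlines weigh less than words specific to a few headlines.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

Inside the Pipeline, the output of each step is passed to the next one automatically, so the classifier only ever sees the tf-idf matrix.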

Now we proceed to fit the Naive Bayes classifier to the training set.

In [15]:

text_clf = text_clf.fit(X_train, y_train)

Now we can apply the classifier to the test set and compute the predicted values.

In [16]:

print('Predicting...')
predicted = text_clf.predict(X_test)

Predicting...
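The predicted values are integer codes. The sketch below shows, on a tiny made-up training set, how such codes could be turned back into category letters and how to look at the class probabilities behind each prediction. All toy_* names and the headlines are hypothetical, and with so little data the predictions themselves are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy training data (not taken from the real dataset).
toy_titles = [
    "Stocks fall as Fed hints at rate rise",
    "Bank profits beat market forecasts",
    "New smartphone unveiled at tech show",
    "Scientists launch satellite into orbit",
    "Actor wins award for new film",
    "Singer announces world tour dates",
    "Doctors warn of flu outbreak",
    "Study links diet to heart health",
]
toy_categories = ["b", "b", "t", "t", "e", "e", "m", "m"]

toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_categories)

# Same three steps as the pipeline used in this notebook.
toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_clf.fit(toy_titles, toy_y)

new_titles = ["Fed keeps rates on hold", "Flu vaccine study published"]
codes = toy_clf.predict(new_titles)              # integer codes
letters = toy_encoder.inverse_transform(codes)   # back to category letters
probs = toy_clf.predict_proba(new_titles)        # one probability per class

for title, letter, row in zip(new_titles, letters, probs):
    print(title, "->", letter, row.round(2))
```

The columns of predict_proba follow the order of the encoded classes, that is, the order of toy_encoder.classes_.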

The sklearn.metrics module includes score functions, performance metrics, and pairwise metrics and distance computations. Its accuracy_score function returns the fraction of samples whose predicted label matches the true label (in multilabel classification it computes subset accuracy, where the whole set of predicted labels for a sample must match the true set).

In [17]:

from sklearn import metrics

print('accuracy_score',metrics.accuracy_score(y_test,predicted))
print('Reporting...')

accuracy_score 0.92380411281
Reporting...
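To make the accuracy figure concrete, here is a small worked example on made-up labels (y_true_toy and y_pred_toy are hypothetical). It computes the accuracy by hand, compares it with accuracy_score, and prints a confusion matrix, which shows where the errors fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted codes for 8 samples and 4 classes.
y_true_toy = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_toy = [0, 1, 1, 1, 2, 2, 3, 0]

# Accuracy by hand: number of matching labels divided by number of samples.
n_correct = sum(t == p for t, p in zip(y_true_toy, y_pred_toy))
manual_accuracy = n_correct / len(y_true_toy)

# The same value from scikit-learn.
sk_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(manual_accuracy, sk_accuracy)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
```

Here 6 of the 8 predictions are correct, so both values are 0.75; the off-diagonal entries of the matrix show which class each wrong prediction was confused with.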

Let's build a text report showing the main classification metrics (precision, recall, and F1-score) for each category in the test data.

In [18]:

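# LabelEncoder assigns codes in sorted label order, so take the row names
# from encoder.classes_ rather than from the unordered `labels` list.
print(metrics.classification_report(y_test, predicted, target_names=encoder.classes_))

             precision    recall  f1-score   support

          b       0.90      0.91      0.90     34729
          e       0.95      0.97      0.96     45625
          m       0.97      0.85      0.90     13709
          t       0.90      0.90      0.90     32663

avg / total       0.92      0.92      0.92    126726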
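The columns of the report can be reproduced by hand. The following sketch uses made-up labels (y_true_toy and y_pred_toy are hypothetical) and computes precision, recall, and F1-score for one class from its true positives, false positives, and false negatives, then checks the result against precision_recall_fscore_support.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted codes for 8 samples and 4 classes.
y_true_toy = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_toy = [0, 1, 1, 1, 2, 2, 3, 0]

cls = 1  # the class we compute the metrics for
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted as cls
fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted as cls
fn = sum(t == cls and p != cls for t, p in pairs)  # cls samples that were missed

precision = tp / (tp + fp)  # of everything predicted as cls, how much is right
recall = tp / (tp + fn)     # of all true cls samples, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# Per-class values from scikit-learn (average=None keeps one value per class).
p_all, r_all, f_all, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, average=None
)
print(p_all[cls], r_all[cls], f_all[cls], support[cls])
```

The support column of the report is simply the number of true samples of each class in the test set.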

Have you heard about cross-validation (see https://en.wikipedia.org/wiki/Cross-validation_(statistics))? What about k-fold cross-validation? You can try it now just by repeating the previous steps (don't forget the shuffle part) and averaging the results. Let's try it! Have a nice day!
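One possible way to try k-fold cross-validation without repeating the steps by hand is scikit-learn's cross_val_score. The sketch below is an assumption about how one might do it, not part of the original notebook: it builds a small made-up corpus (toy_words, toy_titles and toy_y are hypothetical), so the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 4 categories with 6 typical words each.
toy_words = {
    0: ["stocks", "bank", "market", "profits", "rates", "economy"],
    1: ["film", "actor", "music", "singer", "award", "show"],
    2: ["flu", "doctors", "health", "vaccine", "diet", "hospital"],
    3: ["phone", "satellite", "software", "robot", "internet", "chip"],
}
toy_titles, toy_y = [], []
for label, words in toy_words.items():
    for i in range(len(words)):
        # Pair each word with the next one (wrapping around) so that
        # headlines of the same class share some vocabulary.
        toy_titles.append(f"{words[i]} {words[(i + 1) % len(words)]} news")
        toy_y.append(label)

# Same three steps as the pipeline used in this notebook.
toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# 3 folds: each fold is used once as the test set while the other two
# are used for training. shuffle=True takes care of "the shuffle part".
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(toy_clf, toy_titles, toy_y, cv=cv, scoring="accuracy")

print("accuracy per fold", scores)
print("mean accuracy", scores.mean())
```

On the real data, the same call with text_clf, titles and ncategories would refit the whole pipeline once per fold, so the vocabulary and tf-idf weights are learned from the training folds only.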