Classifying News Headlines and Explaining the Result

The data comes from Kaggle's News Aggregator Dataset.

In [1]:
import pandas as pd

I sampled 10% of the data to speed up the analysis.

In [2]:
news = pd.read_csv('data/uci-news-aggregator.csv').sample(frac=0.1)
In [3]:
len(news)
Out[3]:
42242
In [4]:
news.head(3)
Out[4]:
ID TITLE URL PUBLISHER CATEGORY STORY HOSTNAME TIMESTAMP
58434 58435 Russell Crowe Sings Johnny Cash on 'The Tonigh... http://screencrush.com/russell-crowe-johnny-cash/ ScreenCrush e dxzxHQTC1v6cP7MdjlKbJkMlfYwLM screencrush.com 1396019111324
244967 245413 HP cuts more jobs than expected http://www.digitaljournal.com/business/busines... DigitalJournal.com b de8PjvC03vbwIdMC0hkfXZTLVY0sM www.digitaljournal.com 1400928726875
314969 315429 NTSB faults pilots in last year's Asiana flight http://ktar.com/23/1744462/NTSB-faults-pilots-... KTAR.com b deigsQuEj4RZW3M_TqkzwLBT_oUTM ktar.com 1403705331596
In [5]:
from sklearn.preprocessing import LabelEncoder
In [6]:
encoder = LabelEncoder()
In [7]:
X = news['TITLE']
y = encoder.fit_transform(news['CATEGORY'])
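
LabelEncoder assigns an integer to each of the four category codes (in this dataset: b = business, e = entertainment, m = health, t = science and technology). As a quick check, the mapping can be inspected; classes_ is sorted alphabetically, so the codes map to 0..3 in that order:

In [ ]:
list(encoder.classes_)  # expected: ['b', 'e', 'm', 't']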
In [8]:
from sklearn.model_selection import train_test_split
In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

We count the number of occurrences of each word in a headline and use the counts as our features (a bag-of-words representation).

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
In [13]:
vectorizer = CountVectorizer(min_df=3)
In [25]:
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

train_vectors
Out[25]:
<31681x9925 sparse matrix of type '<class 'numpy.int64'>'
	with 267231 stored elements in Compressed Sparse Row format>
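
To make the vectorizer's behavior concrete, here is a toy illustration (not part of the original analysis): each distinct word becomes a column, and each document becomes a row of counts. In the real vectorizer above, min_df=3 additionally drops words that appear in fewer than 3 headlines.

In [ ]:
toy = CountVectorizer()
# columns follow toy.vocabulary_: {'cat': 0, 'dog': 1, 'on': 2, 'sat': 3, 'the': 4}
toy.fit_transform(['the cat sat', 'the cat sat on the dog']).toarray()
# expected:
# array([[1, 0, 0, 1, 1],
#        [1, 1, 1, 1, 2]])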

We use a random forest for classification.

In [14]:
from sklearn.ensemble import RandomForestClassifier
In [15]:
rf = RandomForestClassifier(n_estimators=20)
rf.fit(train_vectors, y_train)
Out[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
In [16]:
from sklearn.metrics import accuracy_score
In [17]:
pred = rf.predict(test_vectors)
accuracy_score(y_test, pred)
Out[17]:
0.85048764321560455

85% accuracy is not a bad score.
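
A single accuracy number can hide per-class differences, so it is worth also checking precision and recall per category (a quick extra check using scikit-learn's classification_report, not in the original run):

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred, target_names=list(encoder.classes_)))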

Explaining the result

We'll use lime to explain the model.

To use lime, we need to construct a pipeline that chains vectorization and classification, so the explainer can feed raw text in and get class probabilities out.

In [19]:
from sklearn.pipeline import make_pipeline
In [26]:
c = make_pipeline(vectorizer, rf)
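
As an optional sanity check, the pipeline's predictions on raw text should match the two-step predictions above:

In [ ]:
# the pipeline applies vectorizer.transform and then rf.predict internally
assert (c.predict(X_test) == rf.predict(test_vectors)).all()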
In [27]:
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=list(encoder.classes_))

We take an example headline from the test data.

In [28]:
example = X_test.sample(1).iloc[0]
example
Out[28]:
'Scientific Games to buy Bally Tech'
In [30]:
c.predict_proba([example])
Out[30]:
array([[ 0.95,  0.  ,  0.  ,  0.05]])
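
The columns of predict_proba follow the order of encoder.classes_, so the model is about 95% confident this headline is business news. Pairing the two up makes that explicit (a small convenience, not in the original notebook):

In [ ]:
dict(zip(encoder.classes_, c.predict_proba([example])[0]))
# e.g. {'b': 0.95, 'e': 0.0, 'm': 0.0, 't': 0.05}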
In [32]:
exp = explainer.explain_instance(example, c.predict_proba, top_labels=1)
In [33]:
exp.show_in_notebook()

Above is the explanation of the classification generated by lime: it highlights the words that contributed most toward (or against) the predicted class, along with their weights.
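
If the interactive widget does not render, the same weights can be read as plain text; assuming lime's Explanation API (available_labels and as_list), something like:

In [ ]:
label = exp.available_labels()[0]  # the single top label we asked for
exp.as_list(label=label)           # list of (word, weight) pairs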

dreamgonfly@gmail.com