e-Mail: lukasz.augustyniak@pwr.edu.pl
Twitter: @luk_augustyniak
LinkedIn: Łukasz Augustyniak
GitHub: laugustyniak
IPython Notebook view: SAS2015 Notebook
The IPython Notebook is an interactive computational environment in which you can combine code execution, rich text, mathematics, plots and rich media. Just like you see it now :)
sas2015 = 'Welcome at Sentiment Symposium'
print sas2015
Welcome at Sentiment Symposium
sas2015 + ' 2015'
'Welcome at Sentiment Symposium 2015'
Python interpreter with pre-installed libraries - Anaconda - is a completely free Python distribution (including for commercial use and redistribution). It includes over 195 of the most popular Python packages for science, math, engineering, and data analysis.
scikit-learn - Machine Learning in Python
pandas - Python Data Analysis Library
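A quick sanity check that these packages are available in your environment (a minimal sketch; the exact versions printed will differ between installations):
import sklearn
import pandas
import numpy
import matplotlib

print 'scikit-learn: %s' % sklearn.__version__
print 'pandas: %s' % pandas.__version__
print 'numpy: %s' % numpy.__version__
print 'matplotlib: %s' % matplotlib.__version__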
import pandas as pd
All of the preprocessing and model creation can be done with the scikit-learn library.
from os import path
notebook_path = 'C:/Users/Dell/Documents/GitHub/Presentations/sas2015/'
approximately 6,000 tweets annotated as negative, neutral, or positive
DataFrame - a tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It is the primary pandas data structure.
data = pd.read_csv(path.join(notebook_path, 'data', 'SemEval-2014.csv'), index_col=0)
data
 | sentiment | document |
---|---|---|
0 | 3 | Gas by my house hit $3.39!!!! I'm going to Cha... |
1 | 1 | Theo Walcott is still shit, watch Rafa and Joh... |
2 | 1 | its not that I'm a GSP fan, i just hate Nick D... |
3 | 1 | Iranian general says Israel's Iron Dome can't ... |
4 | 3 | with J Davlar 11th. Main rivals are team Polan... |
5 | 1 | Talking about ACT's && SAT's, deciding where I... |
6 | 2 | Why is "Happy Valentines Day" trending? It's o... |
7 | 1 | They may have a SuperBowl in Dallas, but Dalla... |
8 | 2 | Im bringing the monster load of candy tomorrow... |
9 | 2 | Apple software, retail chiefs out in overhaul:... |
10 | 2 | #Livewire Nadal confirmed for Mexican Open in ... |
11 | 2 | #Iran US delisting MKO from global terrorists ... |
12 | 2 | Expect light-moderate rains over E. Visayas; C... |
13 | 3 | One ticket left for the @49ers game tomorrow! ... |
14 | 2 | Game 1 of the NLCS and a rematch of the NFC Ch... |
15 | 3 | Never start working on your dreams and goals t... |
16 | 2 | BLACK FRIDAY Huge Saving Aerial View of a City... |
17 | 3 | YES we all know INDIO vs CV is tomorrow the BE... |
18 | 2 | Mohamed Morsi, Egypt's Muslim Brotherhood pres... |
19 | 2 | C'mon Avila! You just got tagged out by a guy ... |
20 | 2 | At the first Grammy Awards, held on 4 May 1959... |
21 | 3 | Good morning Thursday. "Life is fragile. We're... |
22 | 3 | #Twitition Mcfly come back to Argentina but th... |
23 | 1 | My teachers call themselves givng us candy....... |
24 | 2 | #Broncos Peyton Manning named AFC Offensive Pl... |
25 | 2 | @TooZany is bringing out Kendrick Lamar the 6t... |
26 | 2 | Andre's Wigan Warning - #COYS Official Site Wi... |
27 | 2 | When my professor passes out candy and says "a... |
28 | 2 | How are they going to act in new york with the... |
29 | 1 | Homegrown talent missing on Signing Day: Throu... |
... | ... | ... |
6235 | 3 | Yay !!!!RT @kellymonaco1: Excited to interview... |
6236 | 3 | @TomFelton Hope you win tonight Tom! Your US a... |
6237 | 3 | If I'm reading the Twitter Trend list correctl... |
6238 | 3 | Colts game tonight! Yay! |
6239 | 2 | Is this on tv again? "@kugrlover: Most of Reds... |
6240 | 3 | What will have a better TV rating...#Cardinals... |
6241 | 3 | Yes. I'm ready for HS, college & pro. Bring it... |
6242 | 3 | Trying to leave, I'm only 10 minutes late (so ... |
6243 | 1 | I fail to see why the Rams are playing TWICE o... |
6244 | 2 | On the night Hank Williams came to town. |
6245 | 3 | There won't be just a Party in the USA tonight... |
6246 | 3 | After that, I'll start plugging mine and @joes... |
6247 | 3 | Man was that Jets and Cowboys game awesome or ... |
6248 | 3 | MNF tonight! Let's go Sexy Rexy! |
6249 | 2 | Just checking in to see if I had a nightmare l... |
6250 | 1 | Lmao RT @HeatherNoel13: Curtis painter looks l... |
6251 | 2 | Monday Night Football #TeamTexans all day & to... |
6252 | 2 | @PierreGarcon85 come out and watch THOSE GUYS ... |
6253 | 3 | Huge thanks to those of you who came out to my... |
6254 | 1 | #Londonriots is trending 3rd worldwide ..... T... |
6255 | 3 | I had a fun day on terra nova. Followed by a h... |
6256 | 1 | Today, we found out that Rob Henry tore his AC... |
6257 | 3 | Monday Night Football - Gary Neville did well ... |
6258 | 3 | Happy birthday, Hank Williams. In honor if the... |
6259 | 3 | New cast of DWTS tba at 8pm tonight!! So excit... |
6260 | 1 | @stoney16 @JeffMossDSR I'd recommend just turn... |
6261 | 3 | RT @MNFootNg It's monday and Monday Night Foot... |
6262 | 3 | All I know is the road for that Lomardi start ... |
6263 | 2 | All Blue and White fam, we r meeting at Golden... |
6264 | 1 | I'm pisseeedddd that I missed Kid Cudi's show ... |
6265 rows × 2 columns
%matplotlib inline
data.sentiment.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x17d9f278>
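The same class distribution can also be inspected numerically (a small sketch; the 1/2/3 labels appear to correspond to negative/neutral/positive):
print data.sentiment.value_counts()  # number of tweets per class label (1, 2, 3)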
docs = data['document']
y = data['sentiment'] # standard name for labels/classes variable
docs[0]
"Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)"
y[0]
3
'I like new Note IV.' -> [0, 1, 1, 1, 1, 0, 0]
'I was disappointed by new Samsung phone.' -> [1, 0, 0, 1, 0, 1, 1]
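As a minimal sketch of how such vectors can be produced (assuming unigram counts with English stop words removed, which should reproduce the vectors above):
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['I like new Note IV.', 'I was disappointed by new Samsung phone.']
toy_vect = CountVectorizer(lowercase=True, stop_words='english')
toy_X = toy_vect.fit_transform(toy_docs)
print toy_vect.get_feature_names()  # the vocabulary, i.e. the columns of the matrix
print toy_X.toarray()               # one row of counts per document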
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english')
X = count_vect.fit_transform(docs)
print '#features=%s for #documents=%s' % (X.shape[1], X.shape[0])
#features=74011 for #documents=6265
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression
def sentiment_classification(X, y, n_folds=10, classifier=None):
    """
    Sentiment classification with cross-validation - supervised method
    :type X: ndarray feature matrix for classification
    :type y: list or ndarray of classes
    :type n_folds: int # of folds for CV
    :type classifier: classifier which we train and use to predict sentiment
    :return: measures: accuracy, precision, recall, f1
    """
    results = {'acc': [], 'prec': [], 'rec': [], 'f1': [], 'cm': []}
    kf = cross_validation.StratifiedKFold(y, n_folds=n_folds, shuffle=True)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        ######################## Most important part ##########################
        clf = classifier.fit(X_train, y_train)  # train the classifier
        predicted = clf.predict(X_test)  # predict labels for the test fold
        #######################################################################
        results['acc'].append(metrics.accuracy_score(y_test, predicted))
        results['prec'].append(metrics.precision_score(y_test, predicted, average='weighted'))
        results['rec'].append(metrics.recall_score(y_test, predicted, average='weighted'))
        results['f1'].append(metrics.f1_score(y_test, predicted, average='weighted'))
        results['cm'].append(metrics.confusion_matrix(y_test, predicted))
    return results
results = sentiment_classification(X, y, n_folds=4, classifier=LogisticRegression())
import numpy as np
print 'Accuracy: %s' % np.mean(results['acc'])
print 'F1-measure: %s' % np.mean(results['f1'])
Accuracy: 0.67119115848
F1-measure: 0.649526332256
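The function also stores a confusion matrix for every fold in results['cm']; a small sketch, assuming the results object from the call above, that sums them across folds:
cm_total = np.sum(results['cm'], axis=0)  # element-wise sum of the per-fold confusion matrices
print cm_total  # rows: true class (1, 2, 3), columns: predicted class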
In Python, pickle is the standard mechanism for object serialization; pickling is the common term among Python programmers for serialization (and unpickling for deserialization).
classifier = LogisticRegression()
clf = classifier.fit(X, y) # trained
clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0)
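Serializing the trained classifier with the standard pickle module could look like this (a minimal sketch using the clf fitted above; the notebook itself uses joblib below, which handles large NumPy arrays more efficiently):
import pickle

# save the trained classifier ...
with open('sentiment-classifier.pickle', 'wb') as f:
    pickle.dump(clf, f)

# ... and load it back later
with open('sentiment-classifier.pickle', 'rb') as f:
    clf_restored = pickle.load(f)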
from sklearn.externals import joblib
fn_clf = 'sentiment-classifier.pkl'
joblib.dump(clf, fn_clf)
['sentiment-classifier.pkl', 'sentiment-classifier.pkl_01.npy', 'sentiment-classifier.pkl_02.npy', 'sentiment-classifier.pkl_03.npy']
clf_loaded = joblib.load(fn_clf)
print 'predictions => %s' % clf_loaded.predict(X)
print 'classifier: %s' % clf_loaded
predictions => [3 1 1 ..., 3 2 1]
classifier: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0)
# load data
data = pd.read_csv('C:/Users/Dell/Documents/GitHub/Presentations/sas2015/data/SemEval-2014.csv', index_col=0)
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english')
X = count_vect.fit_transform(data.document)
results = sentiment_classification(X, y, n_folds=4, classifier=LogisticRegression())
# save classifier
joblib.dump(clf, 'sentiment-classifier.pkl')
['sentiment-classifier.pkl', 'sentiment-classifier.pkl_01.npy', 'sentiment-classifier.pkl_02.npy', 'sentiment-classifier.pkl_03.npy']
print 'Accuracy: %s' % np.mean(results['acc'])
print 'F1-measure: %s' % np.mean(results['f1'])
Accuracy: 0.667995763517
F1-measure: 0.645418740216
When building the vocabulary, ignore terms that have a document frequency (DF) strictly lower than the given threshold. This value is also called cut-off in the literature.
If float, the parameter represents a proportion of documents; if integer, absolute counts.
min_df=2
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', min_df=min_df)
X = count_vect.fit_transform(docs)
print '#features=%s for #documents=%s' % (X.shape[1], X.shape[0])
#features=11871 for #documents=6265
Build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
max_features=1000
count_vect = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', max_features=max_features)
X = count_vect.fit_transform(docs)
print '#features=%s for #documents=%s' % (X.shape[1], X.shape[0])
#features=1000 for #documents=6265
min_words = [1, 2, 5, 10, 100, 1000]
features_counts = []
for m in min_words:
    docs_fitted = CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', min_df=m).fit_transform(docs)
    print '#features=%s for #documents=%s (min_df=%s)' % (docs_fitted.shape[1], docs_fitted.shape[0], m)
    features_counts.append((m, docs_fitted.shape[1]))
#features=74011 for #documents=6265 (min_df=1)
#features=11871 for #documents=6265 (min_df=2)
#features=3735 for #documents=6265 (min_df=5)
#features=1630 for #documents=6265 (min_df=10)
#features=69 for #documents=6265 (min_df=100)
#features=2 for #documents=6265 (min_df=1000)
import matplotlib.pyplot as plt
plt.bar(range(len(features_counts)), [x[1] for x in features_counts], align='center')
plt.xticks(range(len(features_counts)), [x[0] for x in features_counts])
plt.xlabel('min_df')
plt.ylabel('#features')
plt.show()
X
<6265x1000 sparse matrix of type '<type 'numpy.int64'>' with 44424 stored elements in Compressed Sparse Row format>
%timeit sentiment_classification(X, y, n_folds=4, classifier=LogisticRegression())
1 loops, best of 3: 904 ms per loop
X_array = X.toarray()
%timeit sentiment_classification(X_array, y, n_folds=4, classifier=LogisticRegression())
1 loops, best of 3: 904 ms per loop
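The training times are practically the same here, but the memory footprint is not; a quick sketch (using the X and X_array from above) comparing the two representations:
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes  # CSR stores only the non-zero entries
dense_bytes = X_array.nbytes                                       # dense array stores every cell
print 'sparse CSR: %s bytes, dense ndarray: %s bytes' % (sparse_bytes, dense_bytes)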