Copyright (c) 2015, Taposh Dutta Roy All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the Taposh Dutta Roy nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL TAPOSH ROY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011)
Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
There are many different approaches to reducing words to a common base form:
The Porter stemmer uses suffix stripping; it does not address prefixes.
In Porter's own words (2001):
"There are two main reasons for creating Snowball. One is the lack of readily available stemming algorithms for languages other than English. The other is the consciousness of a certain failure on my part in promoting exact implementations of the stemming algorithm described in (Porter 1980), which has come to be called the Porter stemming algorithm."
Lemmatisation (or lemmatization), in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.
In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word.
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
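The contrast can be illustrated with a toy sketch: a naive suffix-stripping rule (far cruder than the real Porter algorithm) against a tiny hand-made lemma lookup table standing in for a dictionary such as WordNet. The suffix rules and table entries here are invented for illustration only.

```python
# Toy illustration only: a naive suffix stripper vs. a tiny lemma lookup.
# The real Porter algorithm applies ordered rule phases with extra
# conditions; this sketch merely strips a few common English suffixes.

def naive_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:len(word) - len(suffix)]
    return word

# A lemmatiser needs vocabulary knowledge; this hand-made table stands
# in for a real dictionary.
LEMMA_TABLE = {"walked": "walk", "walks": "walk", "walking": "walk",
               "better": "good", "ran": "run"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for w in ["walking", "walked", "ran", "better"]:
    print(w, "->", naive_stem(w), "/", toy_lemmatize(w))
```

Note how the context-free stemmer handles the regular forms ("walking" -> "walk") but leaves irregular forms like "ran" and "better" untouched, while the dictionary-backed lemmatiser maps them to their lemmas.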
labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.
testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one.
unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review.
sampleSubmission - A comma-delimited sample submission file in the correct format.
##############################################################################
# Taposh Dutta Roy
# Sentiment Analysis
##############################################################################
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection as cross_validation  # sklearn.cross_validation was removed in newer scikit-learn releases
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
## Stemming functionality
class stemmerUtility(object):
    """Stemming functionality"""
    @staticmethod
    def stemPorter(review_text):
        porter = PorterStemmer()
        preprocessed_docs = []
        for doc in review_text:
            final_doc = []
            for word in doc:
                final_doc.append(porter.stem(word))
                # final_doc.append(wordnet.lemmatize(word))  # note that lemmatize() can also take a part of speech as an argument
            preprocessed_docs.append(final_doc)
        return preprocessed_docs
## Originally provided by Google
## Modified by Taposh
class KaggleWord2VecUtility(object):
    """KaggleWord2VecUtility is a utility class for processing raw HTML text into segments for further learning"""

    @staticmethod
    def review_to_wordlist(review, remove_stopwords=False):
        # 1. Remove HTML
        review_text = BeautifulSoup(review, "html.parser").get_text()
        # 2. Remove non-letters
        review_text = re.sub("[^a-zA-Z]", " ", review_text)
        # 2.1 Remove single letters left standing on their own
        review_text = re.sub(r"(?<!\S)[a-zA-Z](?!\S)", " ", review_text)
        # 3. Convert words to lower case and split them
        words = review_text.lower().split()
        # 3.1 Keep only words longer than two characters
        newwords = []
        for word in words:
            if len(word) > 2:
                newwords.append(word)
        # 4. Optionally remove stop words (false by default)
        if remove_stopwords:
            stops = set(stopwords.words("english"))
            newwords = [w for w in newwords if w not in stops]
        # 5. Return a list of words
        return newwords
    # Define a function to split a review into parsed sentences
    @staticmethod
    def review_to_sentences(review, tokenizer, remove_stopwords=False):
        # Function to split a review into parsed sentences. Returns a
        # list of sentences, where each sentence is a list of words.
        #
        # 1. Use the NLTK tokenizer to split the paragraph into sentences
        # (in Python 3 the review is already a str, so no .decode('utf8') is needed)
        raw_sentences = tokenizer.tokenize(review.strip())
        #
        # 2. Loop over each sentence
        sentences = []
        for raw_sentence in raw_sentences:
            # If a sentence is empty, skip it
            if len(raw_sentence) > 0:
                # Otherwise, call review_to_wordlist to get a list of words
                sentences.append(KaggleWord2VecUtility.review_to_wordlist(raw_sentence, remove_stopwords))
        #
        # Return the list of sentences (each sentence is a list of words,
        # so this returns a list of lists)
        return sentences
train = pd.read_csv("/Users/taposh/workspace/mlearning/nlp/sentiment/bow/labeledTrainData.tsv", header=0,delimiter="\t", quoting=3)
test = pd.read_csv("/Users/taposh/workspace/mlearning/nlp/sentiment/bow/testData.tsv", header=0, delimiter="\t",quoting=3)
y = train["sentiment"]
print("Cleaning and parsing movie reviews...\n")
traindata = []
for i in range(0, len(train["review"])):
    traindata.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i], False)))
testdata = []
for i in range(0, len(test["review"])):
    testdata.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i], False)))
print('vectorizing... ')
tfv = TfidfVectorizer(min_df=2, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
                      stop_words='english')
X_all = traindata + testdata
lentrain = len(traindata)
print("fitting pipeline... ")
tfv.fit(X_all)
X_all = tfv.transform(X_all)
# RF transform 1st column to numbers
#X_all[:,0] = LabelEncoder().fit_transform(X_all[:,0])
#for Logit
X = X_all[:lentrain]
X_test = X_all[lentrain:]
model = LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=14, fit_intercept=True,
                           intercept_scaling=1, class_weight=None, random_state=None,
                           solver='liblinear')  # dual=True requires the liblinear solver
#http://nbviewer.ipython.org/gist/rjweiss/7577004
#model = RandomForestRegressor(n_estimators=150, min_samples_split=1)
#model.fit(X, y)
print("25 Fold CV Score: ", np.mean(cross_validation.cross_val_score(model, X, y, cv=25, scoring='roc_auc')))
print("Retrain on all training data, predicting test labels...\n")
model.fit(X,y)
result = model.predict_proba(X_test)[:,1]
#result = model.predict(X_test)
print(result)
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
# Use pandas to write the comma-separated output file
# (quoting=3 is csv.QUOTE_NONE; an escapechar is needed because the ids retain their quotes)
output.to_csv('/Users/taposh/workspace/mlearning/nlp/sentiment/bow/Bag_of_Words_model_v17.csv',
              index=False, quoting=3, escapechar="\\", encoding='utf-8')
print("Wrote results to csv file")
Cleaning and parsing movie reviews...

vectorizing...
fitting pipeline...
25 Fold CV Score: 0.964641955994
Retrain on all training data, predicting test labels...

[ 0.98821375  0.02077675  0.5737732  ...,  0.38864424  0.96238561  0.690408  ]
Wrote results to csv file
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
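That simplest ranking function can be sketched in a few lines of Python; the toy corpus and query below are invented for illustration, and a base-10 logarithm is used for the IDF term.

```python
import math

# Toy corpus: each document is a list of lowercase terms.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]

def tf(term, doc):
    # Term frequency, normalized by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency (base-10 log); 0 if the term never occurs.
    n_containing = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / n_containing) if n_containing else 0.0

def score(query, doc, docs):
    # Simplest ranking function: sum of tf-idf over the query terms.
    return sum(tf(t, doc) * idf(t, docs) for t in query)

query = ["cat", "dog"]
ranked = sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
print(ranked[0])  # the document containing both query terms ranks first
```

Here the third document matches both query terms, so its summed tf-idf score is highest and it ranks first.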
Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log(Total number of documents / Number of documents with term t in it).
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf), using a base-10 logarithm, is calculated as log(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. Source: http://www.tfidf.com/
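The arithmetic of that worked example can be checked directly (using the base-10 logarithm the example's result of 4 implies):

```python
import math

# Worked example from the text: "cat" appears 3 times in a 100-word
# document; the corpus has 10,000,000 documents, 1,000 of which contain "cat".
tf = 3 / 100                          # term frequency = 0.03
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency = 4.0
tfidf = tf * idf                      # tf-idf weight = 0.12

print(tf, idf, tfidf)
```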