Week 4, Assignment 1, SVM, logistic regression and some sentiments

Given date: Monday Sept. 30

Due date: Monday October 21


image credit: Analytics Vidhya

Now that we have accumulated knowledge on a couple of machine learning algorithms, it is time to consider some interesting application. The Assignment is on 25pts + 2 bonus questions that are on 4 pts each and which comes on top of your total grade, meaning you can score up to 33/25.

1. (6pts) SVM and outliers

Exercise 1a (3pts). Linear SVM

Consider the data given in the files 'ex1_class1.mat' and 'ex1_class2.mat'.

  • Load the data using the lines below.
  • Plot the points from each class, using a different color for each class
  • Then learn a linear classifier for that dataset and plot the line corresponding to the separating plane on top of the data set (you can use meshgrid but I also want you to plot the line corresponding to the boundary, there are several approaches and you can check the Scikit documentation if you don't know which to use).
In [ ]:
import scipy.io as sio
data1 = sio.loadmat('ex1_class1.mat')
data2 = sio.loadmat('ex1_class2.mat')

data1 = data1['ex1_class1']
data2 = data2['ex1_class1']

# put your code here 

Exercise 1b (3pts). Outliers

For this exercise, we will use the data stored in 'ex1b_class1.mat' and 'ex1b_class2.mat'. Load those those datasets with the lines below.

  • Plot the data using 'scatter( )'.
  • Then use the LinearSVC classifier from scikit learn with various values of the parameter 'C'. Try a few values between .1 and 100 (including 1) and plot the boundary between classes for each value. What do you see? In particular, study the evolution of the orientation/position of the separating plane with respect to the outlier. What conclusion can you draw regarding the effect of the parameter $C$ ? What is the role of $C$?
In [ ]:
import scipy.io as sio
data1 = sio.loadmat('ex2_class1.mat')
data2 = sio.loadmat('ex2_class2.mat')

data1 = data1['ex2_class1']
data2 = data2['ex2_class2']

from sklearn.svm import LinearSVC

# put your code here 

2. (6pts) Comparing kernels

  • The data for this exercise is stored in the variables 'ex2_class1.mat' and 'ex2_class2.mat'. Start by loading the data, using the lines below.
  • Plot the data using the function scatter( ) and a different color for each class.
  • We will consider several kernels. Start with a linear SVC. Plot the classification boundary on top of the dataset.

  • The svm class and the SVC classifier from scikit-learn enable you to define your own kernel matrix. Using this feature, define the following three kernel matrices

    • Simple scalar product on polynomial features $\kappa(\boldsymbol x_i, \boldsymbol x_j) = \langle \phi(\boldsymbol x_i), \phi(\boldsymbol x_j)\rangle$ for features up to degrees $d=2$, $d = 4$ and $d=10$. Note that this means that for any data point $(x_1, x_2)$ from your dataset, you must generate the monomials $\phi(\boldsymbol x) = (1, x_1, x_2, x_1^2, x_2^2, x_1x_2)$ (i.e don't forget the cross products $x_i*x_j$). You might want to use the function 'PolynomialFeatures' to get that done.

    • Gaussian kernel $\kappa(\boldsymbol x_i, \boldsymbol x_j) = exp(-\frac{\|\boldsymbol x_ i - \boldsymbol x_j \|^2}{2\sigma^2})$. Note that you will have to tune the $\sigma$ in order to fit the data finely enough to reveal the shape of each cluster.

    • Polynomial kernel $\kappa(\boldsymbol x_i, \boldsymbol x_j) = (\boldsymbol x_i^T\boldsymbol x_j + 1)^d$ for degrees $d = 2$, $d=4$ and $d=6$.

In each of the frameworks mentioned above, plot the resulting decision boundary.

In [ ]:
import scipy.io as sio

data1 = data1['ex1b_class1']
data2 = data2['ex1b_class2']

# put your code here

3. (5pts) Cross validation

Now that you are a little more familiar with the meaning of the parameter $C$, we will go back to cross validation and try to implement it properly on our kernel SVC.The data for this exercise is stored in the variables 'ex3_class1.mat' and 'ex3_class2.math'.

  • Load the variables by using the lines below.
  • Then learn a support vector classifier with RBF kernels for various values of $\gamma$ and $C$. As an example, you could take $C = 0.1, 1, 10$ and then $\gamma = 0.1, 1, 10$. Keep in mind that to do proper cross validation, you need to try out every possible pair $(C, \gamma)$ so you may want to try a few more values but don't be too greedy.

For each value $(C, \gamma)$, proceed as follows (K-fold cross validation, see section 7.10.1 in Hastie Tibshirani Friedman if you want more info)

  • Split the training set into K equal size (i.e N/K) batches.
  • Train the kernel SVC with every batch except the $K^{th}$ and evaluate the obtained model on the $K^{th}$ batch that you left aside.
  • Repeat the line above, for each and every batch.
  • Compute the error as $$ E_{\text{average}}(C, \gamma) = \frac{1}{(N/K)}\sum_{i=1}^{N/K} E_k$$ where $E_k$ is the error that you obtained when taking the batch $K$ as your validation batch.

  • Plot the surface $E(C, \gamma)$.

In [ ]:
import scipy.io as sio

data1 = data1['ex3_class1']
data2 = data2['ex3_class2']

# put your code here 

4. (8pts) Some Natural Language Processing


image credit: hackerearth.com

Exercise 4a

In this exercise, you will get to use a support vector classifier to discriminate between positive and negative movie reviews.

More specifically, you will get to first code your own term frequency–inverse document frequency (TF-IDF) statistic from the document.

The data for this exercise is stored in the file 'trainingSimpler.txt'. Start by loading this corpus.

We will need to do some pre-processing. In particular we will use functions from the natural language toolkit (NLTK).

  1. We now need a way to encode the sentences in order to be able to compare them to one another. To do this, we will split each sentence into a list of words. This can be done using the class 'TweetTokenizer' from NLTK. An instance of that class is initialized below. Once you have read a line (document) from the corpus, you can split that document into words by using a call to the function tokenize from TweetTokenizer.

  2. As a first step, you should just read each sentence from the file (corpus) and store the target/sentiment (1 vs 0) which appears at the end of each line in a separate numpy vector (called 'targets' below). Save the vector as a numpy array of integers. Keep all the other words and store them in a temporary variable

  3. As a second step, we will remove from each sentence the so-called "stop words" as well as the punctuation which do not carry any valuable information regarding the sentiments. To prune the sentence we will use two lists. The first one from the corpus module of nltk and the second one from the second one (encoding the punctuation) from the string module. To prune the lines (or documents), remove all the words that appear in the lists 'stopwords.words('english')' or 'string.punctuation'.

  4. Two dictionnaries are initialized below. We will use the first one, 'wordfreq' to keep track of the words and their number of occurence in the whole corpus. This first dictionnary should thus be of the form wordfreq[word] = num_occurence. In the loop, you should therefore add a line that for each word in the current list of words extracted from the current line in the file, either adds that word to the dictionnary with 1 as its occurence, or add 1 to the previous number of occurence for that word if the word already belongs to the dictionnary.

  5. We will also use the loop to build a second dictionnary, called 'corpus' which is simply a list of the pruned sentences. This dictionnary should be as follows corpus[k] = 'pruned list of words' corresponding to $k^{th}$ sentence in the file.

N.B While you are reading the file, line by line, also keep track of the number of lines and store that number in a variable num_lines or num_sentences. We will use this variable later.

In [26]:
import sklearn
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import nltk 
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
tknzr = TweetTokenizer()

from nltk.corpus import stopwords
import string

f = open('trainingSimpler.txt')
wordfreq = {}
corpus = {}
ll = 0
targets = []

for line in f:
    line = line.rstrip('\n')
    # put your code here
num_sentences = ll 
In [14]:

Exercise 4b As a second step, we will use the 'wordfreq' to retain as features, the 200 most frequent words from the total corpus.

One approach is to use the functions .values( ) and .keys( ) to respectively access the values and the words from 'wordfreq'. Once you have the list of values, you can just use 'argsort' to return indices of the sorted values. [Be careful that arsort sorts in ascending order]

Then build a new list of words by taking the words wordfreq.keys( )[Indice1] to wordfreq.keys( )[Indice200].

You should finish this step with a list of the 200 most frequent words from the corpus

In [27]:
word_freq_values = wordfreq.values()

# put your code here

most_freq = # complete here

Execise 4c. We will now define both the TF (term frequency) and IDF (inverse document frequency) values.

  • The Inverse document frequency (IDF) value is a measure of how much value any given word provides. The IDF index is computed as
\begin{align} \log\left(\frac{\text{Total number of docs}}{\text{num. of docs in which word appears}}\right) \end{align}
  • The term frequency (TF) is usually simply given by the number of times a particular word appears in a document.

The idea of TF-IDF is then to weight the term frequency by the Inverse document frequency. This gives a set of feature vectors that can be used to classify the documents.

  • The TF-IDF statistic is then computed as the product of the TF and IDF values.

a). TF index To compute the TF index, we will again use a dictionnary, called 'word_tf_values' below. We will loop over all of the 200 most frequent words, then for each of the most frequent words, we will loop over all the (pruned) sentences from your variable 'corpus' and for each pruned sentence, we will loop over all its words and simply count the number of times the current 'most frequent word' appears in the sentence. For each sentence and each most frequent word, store the value in a variable 'doc_freq'.

  • Once you have read the whole line (or document), store the value 'doc_freq/len(corpus[line])' (that is the relative frequency of occurence of the current most_frequent word in the sentence) in a variable word_tf

  • Repeat this for each sentence and store the variables word_tf in a vector of length 'number of lines'.

  • After you have looped over all the sentences for the current 'most frequent word', store the vector encoding all the word_tf indices for each sentence in the dictionnary at the entry corresponding to the most frequent word. I.e. word_tf_values[most frequent word] = tf_vector

You should end up with a dictionnary of size '200 x number of documents/sentences'

In [28]:
word_tf_values = {}

kk = 0
for token in most_freq:
    sent_tf_vector = []
    ll = 0
    for line in range(0,num_sentences):
        doc_freq = 0
        for word in corpus[line]:
            # put your code here

b) IDF value For the IDF values, the principle is even simpler. We only need to compute the ratio between the total number of documents in the corpus and the number of documents in which the current 'most frequent' word appears.

Again, we will store the result in a dictionnary 'word_idf_values' whose keys are the most frequent words and whose values will be the IDF index.

Run over all the most frequent words. For each of the 200 most frequent words, run over all the lines from the corpus. Then if the most frequent word appears in the line, increment a counting variable (for example 'doc_containing_word') by 1 (you can just use the condition 'if token in corpus[line]:' for line that runs from 0 to the total number of lines in the file)

Once you have run over all the lines for the current most frequent word, add the entry in the dictionnary as

word_idf_values[token] = float(np.log(len(corpus[line]))/float((1 + doc_containing_word)))

In [29]:
word_idf_values = {}
for token in most_freq:
    # put your code here

c)TF-IDF Finally we compute the TF-IDF feature vectors for each sentence. The TF-IDF representation of each sentence can be viewed as a feature vector whose entries give us an indication of the weight of any of the most frequent word in that sentence as well as the importance of each word in terms of the meaning of the sentence.

To build the TF-IDF representation of each sentence, we will use two variables :

'tfidf_values' and 'tfidf_sentences'

Loop over the list of most frequent words. For each most frequent word, loop over the list of documents (or sentences from the txt file). Store the product of the tf index by the idf index for the current word and sentence in a variable (e.g. 'tf_idf_score'). Append those values to teh variable 'tfidf_sentences' through the loop on sentences, thus obtaining a vector of size 'num_sentences'.

Then append those vectors to the variable 'tfidf_values', through the loop on the most frequent words, thus obtaining a '200' by 'num_setences' array.

You should end up with a 2D numpy array of size 'num_sentences' by 'num_most_frequent_words' that encodes each sentences as a 1 by 200 feature vector.

In [30]:
tfidf_values = []
for token in most_freq:
    tfidf_sentences = []
    for tf_sentence in range(0,num_sentences):
        tf_idf_score = word_tf_values[token][tf_sentence] * word_idf_values[token]

d) Support vector classifier Now that we have our feature representation, we can train our classifier. Initialize a SVC classifier from scikit learn. Start by taking a RBF kernel with C=1000, gamma= 1. Then train it on your data. Make sure to transpose your TF-IDF array for it to be 'num_samples' x 'num_features'. Then test if on your validation set.

In [ ]:
from sklearn.svm import SVC

Bonus I (4pts): Improving your classifier. The 'trainingSimpler.txt' corpus on which you applied you NLP classifier so far can be considered as relatively simple as the sentences are relatively different and always contain clear positive or negative statements. It is however not necessarily the case. An example of a more advanced dataset is given by the pair ('trainingMICH.txt';'testdataShortMICH.txt') which encode a variety of Sentiment Classification sentences from the University of Michigan.

Try to apply your TF-IDF statistic on this more difficult dataset (note that we use a shorter version 'testdataShortMICH.txt' than the one that is provided on Kaggle). How does it work?

What could you do to improve the efficiency of your classifier ? (hint: think of the redundant words that do not really bring any sentimental value to the sentences)

Bonus II (4pts): Train a logistic regression classifier on your TF-IDF vectors and apply it to the validation set.