All the IPython Notebooks in the Python Natural Language Processing lecture series by Dr. Milaan Parmar are available @ GitHub
The scoring method used above represents each word in the vector by its raw count in the document. But what does a high word count actually signify?
Does it mean the word is important for retrieving information about the document? The answer is no. If a word occurs many times in one document but also appears in many other documents in our dataset, it may simply be a frequent word, not a relevant or meaningful one. One approach is to rescale word frequencies by how often they appear across all documents, which is exactly what TF-IDF does.
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
The term frequency (TF) of a word in a document is the number of times the word occurs in that document divided by the total number of words in the document:
TF(i,j) = n(i,j) / Σ n(k,j)
n(i,j) = number of times the i-th word occurs in document j
Σ n(k,j) = total number of words in document j (the sum runs over all words k in the document)
The inverse document frequency (IDF) of a word across a set of documents is the logarithm of the total number of documents in the dataset divided by the number of documents in which the word occurs:
IDF(i) = 1 + log(N / dN)
N = total number of documents in the dataset
dN = number of documents in which the i-th word occurs
NOTE: The 1 added in the above formula ensures that terms occurring in every document, whose log term is zero, don't get suppressed entirely.
The TF-IDF score is simply the product of the two: TF-IDF(i,j) = TF(i,j) × IDF(i)
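Before turning to scikit-learn, here is a minimal from-scratch sketch of the formulas above, applied to the two example sentences used below. Note that it follows the unsmoothed 1 + log(N/dN) definition given here; scikit-learn's default additionally smooths and L2-normalizes, so its numbers will differ.
import math

docs = ["the car is driven on the road".split(),
        "the truck is driven on the highway".split()]
N = len(docs)  # total number of documents

def tf(word, doc):
    # term frequency: count of the word in the document / total words in it
    return doc.count(word) / len(doc)

def idf(word):
    # 1 + log(N / dN), where dN = number of documents containing the word
    dN = sum(1 for doc in docs if word in doc)
    return 1 + math.log(N / dN)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

print(idf("the"))            # occurs in both documents -> 1 + log(2/2) = 1.0
print(idf("car"))            # occurs in one document   -> 1 + log(2/1) ≈ 1.693
print(tfidf("car", docs[0])) # (1/7) * 1.693 ≈ 0.242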
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The car is driven on the road.","The truck is driven on the highway"]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
CountVectorizer()
# summarize
print(vectorizer.vocabulary_)
{'the': 6, 'car': 0, 'is': 3, 'driven': 1, 'on': 4, 'road': 5, 'truck': 7, 'highway': 2}
# encode document
newvector = vectorizer.transform(text)
# summarize encoded vector
print(newvector.toarray())
[[1 1 0 1 1 1 2 0]
 [0 1 1 1 1 0 2 1]]
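Once fitted, the same vectorizer can encode documents it has never seen; words outside the learned vocabulary are simply ignored. A quick sketch (the sentence below is an illustrative assumption, not from the original notebook):
# encode an unseen document with the fitted vocabulary
unseen = vectorizer.transform(["The bike is driven on the street"])
print(unseen.toarray())
# [[0 1 0 1 1 0 2 0]] -- "bike" and "street" are out-of-vocabulary and dropped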
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The car is driven on the road.","The truck is driven on the highway"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
TfidfVectorizer()
# Focus on the IDF values
print(vectorizer.idf_)
[1.40546511 1.         1.40546511 1.         1.         1.40546511
 1.         1.40546511]
Words that occur in both documents ("the", "is", "driven", "on") get the minimum IDF of 1.0, while words unique to one document score higher. (The values differ from the 1 + log(N/dN) formula above because scikit-learn smooths by default: idf = 1 + log((1 + N) / (1 + dN)).)
# summarize
print(vectorizer.vocabulary_)
{'the': 6, 'car': 0, 'is': 3, 'driven': 1, 'on': 4, 'road': 5, 'truck': 7, 'highway': 2}
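So far the vectorizer has only been fitted; to get the actual TF-IDF vectors (L2-normalized by default), transform the corpus. A small sketch:
# encode the documents; each row is an L2-normalized TF-IDF vector
vectors = vectorizer.transform(text)
print(vectors.shape)      # (2, 8): two documents, eight vocabulary terms
print(vectors.toarray())  # term weights scaled by the IDF values above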
# import nltk
# nltk.download('popular')
import nltk
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
the world have come and invaded us, captured our lands, conquered our minds.
From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
I see four milestones in my career"""
# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
ps = PorterStemmer()            # stemmer (not used below; lemmatization is applied instead)
wordnet = WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
corpus = []
stop_words = set(stopwords.words('english'))  # build the stopword set once
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
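A quick sanity check before vectorizing: print the cleaned corpus (the exact output depends on the installed NLTK stopword and WordNet data):
for sentence in corpus:
    print(sentence)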
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
X
array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.57735027, 0.        , 0.        ],
       [0.        , 0.        , 0.31622777, 0.        , 0.31622777,
        0.31622777, 0.        , 0.        , 0.31622777, 0.        ,
        0.31622777, 0.31622777, 0.        , 0.31622777, 0.        ,
        0.        , 0.31622777, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.31622777, 0.31622777],
       [0.30151134, 0.30151134, 0.        , 0.30151134, 0.        ,
        0.        , 0.30151134, 0.30151134, 0.        , 0.        ,
        0.        , 0.        , 0.30151134, 0.        , 0.30151134,
        0.30151134, 0.        , 0.30151134, 0.30151134, 0.        ,
        0.30151134, 0.        , 0.        , 0.        ]])
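Each column of X corresponds to one vocabulary term. To see which word each column represents, the fitted vocabulary can be inverted; a small sketch:
# map each column index of X back to its vocabulary term
for term, index in sorted(cv.vocabulary_.items(), key=lambda item: item[1]):
    print(index, term)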