All the IPython Notebooks in the Python Natural Language Processing lecture series by Dr. Milaan Parmar are available @ GitHub
The scoring method used above represents each word in the vector by its raw count in the document. But what does a high word count actually signify?
Does it mean the word is important for retrieving information about the document? The answer is no. If a word occurs many times in one document but also appears in many other documents in our dataset, it may simply be a frequent word, not a relevant or meaningful one. One approach is to rescale word frequencies by how often they appear across all documents, which is exactly what TF-IDF does.
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
The term frequency (TF) of a word in a document is the number of times the word occurs in that document divided by the total number of words in the document:
TF(i,j) = n(i,j) / Σ n(k,j)
n(i,j) = number of times the i-th word occurs in document j
Σ n(k,j) = total number of words in document j (the sum runs over all words k in the document)
The inverse document frequency (IDF) of a word across a set of documents is the logarithm of the total number of documents in the dataset divided by the number of documents in which the word occurs:
IDF(i) = 1 + log(N / dN)
N = total number of documents in the dataset
dN = number of documents in which the i-th word occurs
NOTE: The 1 added in the above formula ensures that terms occurring in every document, whose log term is zero, don't get suppressed entirely.
The TF-IDF score is simply the product of the two: TF-IDF(i,j) = TF(i,j) × IDF(i)
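Before turning to scikit-learn, here is a minimal from-scratch sketch of the formulas above, applied to the two example sentences used below. Note that it follows the unsmoothed 1 + log(N/dN) definition given here; scikit-learn's default additionally smooths and L2-normalizes, so its numbers will differ.
import math

docs = ["the car is driven on the road".split(),
        "the truck is driven on the highway".split()]
N = len(docs)  # total number of documents

def tf(word, doc):
    # term frequency: count of the word in the document / total words in it
    return doc.count(word) / len(doc)

def idf(word):
    # 1 + log(N / dN), where dN = number of documents containing the word
    dN = sum(1 for doc in docs if word in doc)
    return 1 + math.log(N / dN)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

print(idf("the"))            # occurs in both documents -> 1 + log(2/2) = 1.0
print(idf("car"))            # occurs in one document   -> 1 + log(2/1) ≈ 1.693
print(tfidf("car", docs[0])) # (1/7) * 1.693 ≈ 0.242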
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The car is driven on the road.","The truck is driven on the highway"]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
CountVectorizer()
# summarize
print(vectorizer.vocabulary_)
{'the': 6, 'car': 0, 'is': 3, 'driven': 1, 'on': 4, 'road': 5, 'truck': 7, 'highway': 2}
# encode document
newvector = vectorizer.transform(text)
# summarize encoded vector
print(newvector.toarray())
[[1 1 0 1 1 1 2 0]
 [0 1 1 1 1 0 2 1]]
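Once fitted, the same vectorizer can encode documents it has never seen; words outside the learned vocabulary are simply ignored. A quick sketch (the sentence below is an illustrative assumption, not from the original notebook):
# encode an unseen document with the fitted vocabulary
unseen = vectorizer.transform(["The bike is driven on the street"])
print(unseen.toarray())
# [[0 1 0 1 1 0 2 0]] -- "bike" and "street" are out-of-vocabulary and dropped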
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The car is driven on the road.","The truck is driven on the highway"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
TfidfVectorizer()
# Focus on the IDF values
print(vectorizer.idf_)
[1.40546511 1.         1.40546511 1.         1.         1.40546511
 1.         1.40546511]
Words that occur in both documents ("the", "is", "driven", "on") get the minimum IDF of 1.0, while words unique to one document score higher. (The values differ from the 1 + log(N/dN) formula above because scikit-learn smooths by default: idf = 1 + log((1 + N) / (1 + dN)).)
# summarize
print(vectorizer.vocabulary_)
{'the': 6, 'car': 0, 'is': 3, 'driven': 1, 'on': 4, 'road': 5, 'truck': 7, 'highway': 2}
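So far the vectorizer has only been fitted; to get the actual TF-IDF vectors (L2-normalized by default), transform the corpus. A small sketch:
# encode the documents; each row is an L2-normalized TF-IDF vector
vectors = vectorizer.transform(text)
print(vectors.shape)      # (2, 8): two documents, eight vocabulary terms
print(vectors.toarray())  # term weights scaled by the IDF values above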
# import nltk
# nltk.download('popular')
import nltk
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
the world have come and invaded us, captured our lands, conquered our minds.
From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
I see four milestones in my career"""
# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
ps = PorterStemmer()            # stemmer (not used below; lemmatization is applied instead)
wordnet = WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
corpus = []
stop_words = set(stopwords.words('english'))  # build the stopword set once
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
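A quick sanity check before vectorizing: print the cleaned corpus (the exact output depends on the installed NLTK stopword and WordNet data):
for sentence in corpus:
    print(sentence)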
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
X
array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.57735027, 0.        , 0.        ],
       [0.        , 0.        , 0.31622777, 0.        , 0.31622777,
        0.31622777, 0.        , 0.        , 0.31622777, 0.        ,
        0.31622777, 0.31622777, 0.        , 0.31622777, 0.        ,
        0.        , 0.31622777, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.31622777, 0.31622777],
       [0.30151134, 0.30151134, 0.        , 0.30151134, 0.        ,
        0.        , 0.30151134, 0.30151134, 0.        , 0.        ,
        0.        , 0.        , 0.30151134, 0.        , 0.30151134,
        0.30151134, 0.        , 0.30151134, 0.30151134, 0.        ,
        0.30151134, 0.        , 0.        , 0.        ]])
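Each column of X corresponds to one vocabulary term. To see which word each column represents, the fitted vocabulary can be inverted; a small sketch:
# map each column index of X back to its vocabulary term
for term, index in sorted(cv.vocabulary_.items(), key=lambda item: item[1]):
    print(index, term)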