Text Cleaning
Basic Text Preprocessing
Advanced Preprocessing
Text Pre-Processing on Tweets Dataset
import sys
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q --upgrade numpy pandas scikit-learn
!{sys.executable} -m pip install -q --upgrade nltk spacy gensim wordcloud textblob contractions clean-text unicode
Sometimes words and digits appear combined in the text, like game57 or game5ts7, which creates a problem for machines to understand. Hence, we need to remove such tokens that mix words and digits.
For this and many other tasks we normally use Regular Expressions.
Watch my two videos on regular expressions:
The re.sub(pattern, replacement_string, str) method returns the string obtained by replacing the occurrences of pattern in str with the replacement_string. If the pattern isn't found, the string is returned unchanged.
import re
mystr = "This is abc32 a abc32xyz string containing 32abc words 32 having digits"
re.sub(r'\w*\d\w*', '', mystr)
'This is a string containing words having digits'
import re
mystr = " This is a string with lots of extra spaces in beteween words ."
re.sub(' +', ' ', mystr)
' This is a string with lots of extra spaces in beteween words .'
mystr = "This is\na string\nwith lots of new\nline characters."
print("Original String:\n", mystr)
print("Preprocessed String:", re.sub('\n', ' ', mystr))
Original String:
 This is
a string
with lots of new
line characters.
Preprocessed String: This is a string with lots of new line characters.
import re
mystr = "<html> <head> An empty head. </head><body><p> This is so simple and fun. </p> </body> </html>"
print("Original String: ", mystr)
print("Preprocessed String: ", re.sub('<.*?>', '', mystr))
Original String:  <html> <head> An empty head. </head><body><p> This is so simple and fun. </p> </body> </html>
Preprocessed String:   An empty head.  This is so simple and fun.
import re
mystr = "Good youTube lectures by Arif are available at http://www.youtube.com/c/LearnWithArif/playlists"
re.sub(r'https?://\S+|www\.\S+', '', mystr)
'Good youTube lectures by Arif are available at '
We can use the string.punctuation constant to identify punctuation characters and replace each of them in the text with an empty string.
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
- Check for other constants like string.whitespace, string.printable, string.ascii_letters, string.digits as well (a quick look at them follows below).
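A small sketch showing what these constants hold; the values in the comments are from Python's standard string module.
import string
print(repr(string.whitespace))      # ' \t\n\r\x0b\x0c'
print(string.ascii_letters)         # abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.digits)                # 0123456789
print(len(string.printable))        # 100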
mystr = 'A {text} ^having$ "lot" of #s and [puncutations]!.;%..'
mystr
'A {text} ^having$ "lot" of #s and [puncutations]!.;%..'
newstr = ''.join([ch for ch in mystr if ch not in string.punctuation])
newstr
'A text having lot of s and puncutations'
mystr = "This IS GREAT series of Lectures by Arif at the Deaprtment of DS"
mystr.lower()
'this is great series of lectures by arif at the deaprtment of ds'
We can use the contractions module or create our own dictionary to expand contractions.
import sys
!{sys.executable} -m pip install -q contractions
import contractions
print(contractions.fix("you're")) # you are
print(contractions.fix("ain't")) # am not / are not / is not / has not / have not
print(contractions.fix("you'll")) #you shall / you will
print(contractions.fix("wouldn't've")) #"wouldn't've": "would not have",
you are
are not
you will
would not have
mystr = '''I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear.
It's awesome to meet new friends. We've been waiting for this day for so long.'''
mystr
"I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. \nIt's awesome to meet new friends. We've been waiting for this day for so long."
# use loop
mylist = []
for word in mystr.split(sep=' '):
    mylist.append(contractions.fix(word))
newstring = ' '.join(mylist)
print(newstring)
I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.
# use list comprehension and join the words of list on space
expanded_string = ' '.join([contractions.fix(word) for word in mystr.split()])
expanded_string
'I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.'
Some commonly used abbreviated chat words that are used on social media these days are:
To pre-process any text containing such abbreviations we can search for an online dictionary, or can create a dictionary of our own
dict_chatwords = {
'ack': 'acknowledge',
'omg': 'oh my God',
'aisi': 'as i see it',
'bi5': 'back in 5 minutes',
'lmk': 'let me know',
'gn' : 'good night',
'fyi': 'for your information',
'asap': 'as soon as possible',
'yolo': 'you only live once',
'rofl': 'rolling on floor laughing',
'nvm': 'never mind',
'ofc': 'of course',
'blv' : 'boulevard',
'cir' : 'circle',
'hwy' : 'highway',
'ln' : 'lane',
'pt' : 'point',
'rd' : 'road',
'sq' : 'square',
'st' : 'street'
}
mystr = "omg this is aisi I ack your work and will be bi5"
mystr
'omg this is aisi I ack your work and will be bi5'
# dict.items() method returns all the key-value pairs of a dict as a two object tuple
# dict.keys() method returns all the keys of a dict object
# dict.values() method returns all the values of a dict object
mylist = []
for word in mystr.split(sep=' '):
    if word in dict_chatwords.keys():
        mylist.append(dict_chatwords[word])
    else:
        mylist.append(word)
newstring = ' '.join(mylist)
print(newstring)
oh my God this is as i see it I acknowledge your work and will be back in 5 minutes
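The same replacement can be written more compactly with dict.get(), which falls back to the original word when it is not in the dictionary (a minimal sketch, equivalent to the loop above):
# dict.get(word, word) returns the expansion if present, else the word unchanged
newstring = ' '.join(dict_chatwords.get(word, word) for word in mystr.split())
print(newstring)   # oh my God this is as i see it I acknowledge your work and will be back in 5 minutes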
We can remove emojis using regular expressions, the emoji module, or the clean-text library (a clean-text sketch is shown after the emoji-module examples below).
mystr = "These emojis needs to be removed, there is a huge list...😃😬😂😅😇😉😊😜😎🤗🙄🤔😡😤😭🤠🤡🤫💩😈👻🙌👍✌️👌🙏"
mystr
'These emojis needs to be removed, there is a huge list...😃😬😂😅😇😉😊😜😎🤗🙄🤔😡😤😭🤠🤡🤫💩😈👻🙌👍✌️👌🙏'
import re
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # code range for emoticons
u"\U0001F300-\U0001F5FF" # code range for symbols & pictographs
u"\U0001F680-\U0001F6FF" # code range for transport & map symbols
u"\U0001F1E0-\U0001F1FF" # code range for flags (iOS)
u"\U00002700-\U000027BF" # code range for Dingbats
u"\U00002500-\U00002BEF" # code range for chinese char
u"\U00002702-\U000027B0"
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u"\U00010000-\U0010ffff"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u200d"
u"\u23cf"
u"\u23e9"
u"\u231a"
u"\ufe0f"
u"\u3030"
"]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', mystr)) # no emoji
These emojis needs to be removed, there is a huge list...
import sys
!{sys.executable} -m pip install -q emoji
import emoji
mystr = "This is 👍"
emoji.demojize(mystr)
'This is :thumbs_up:'
mystr = "I am 🤔"
emoji.demojize(mystr)
'I am :thinking_face:'
mystr = "This is 👍"
emoji.replace_emoji(mystr, replace='positive')
'This is positive'
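The clean-text library mentioned earlier can also strip emojis (along with other noise) in a single call. A minimal sketch, assuming a clean-text version that supports the no_emoji flag (0.6.0 or later):
from cleantext import clean
mystr = "This is 👍"
# no_emoji=True drops emojis; lower=False keeps the original casing
clean(mystr, no_emoji=True, lower=False)   # expected: 'This is'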
Non-word errors: the user intends to type reading but types reeding. These are easy to detect as they do not exist in the language dictionary and can be corrected using algorithms like shortest weighted edit distance and highest noisy channel probability.
Real-word errors: the user intends to type great but types greet, or two but types too. These are harder to detect and correct, because the typed word is itself a valid dictionary word.
import sys
!{sys.executable} -m pip install -q textblob
import textblob
textblob.__version__
'0.17.1'
from textblob import TextBlob
mystr = "I am reeding thiss gret boook on deta sciance suject, which is a greet curse"
blob = TextBlob(mystr)
type(blob)
textblob.blob.TextBlob
print(dir(blob))
['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cmpkey', '_compare', '_create_sentence_objects', '_strkey', 'analyzer', 'classifier', 'classify', 'correct', 'detect_language', 'ends_with', 'endswith', 'find', 'format', 'index', 'join', 'json', 'lower', 'ngrams', 'noun_phrases', 'np_counts', 'np_extractor', 'parse', 'parser', 'polarity', 'pos_tagger', 'pos_tags', 'raw', 'raw_sentences', 'replace', 'rfind', 'rindex', 'sentences', 'sentiment', 'sentiment_assessments', 'serialized', 'split', 'starts_with', 'startswith', 'string', 'strip', 'stripped', 'subjectivity', 'tags', 'title', 'to_json', 'tokenize', 'tokenizer', 'tokens', 'translate', 'translator', 'upper', 'word_counts', 'words']
blob.correct().string
'I am reading this great book on data science subject, which is a greet curse'
- The non-word errors like reeding, thiss, gret, boook, deta, sciance and suject have been corrected by the blob.correct() method.
- However, the real-word errors like greet and curse are not corrected.

Let us try to understand how the TextBlob correct() method does this.
# The word attribute of textblob object returns list of words in the text
blob.words
WordList(['I', 'am', 'reeding', 'thiss', 'gret', 'boook', 'on', 'deta', 'sciance', 'suject', 'which', 'is', 'a', 'greet', 'curse'])
# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions
# 'reeding'
blob.words[2].spellcheck()
[('reading', 0.7651006711409396), ('feeding', 0.10067114093959731), ('heeding', 0.053691275167785234), ('rending', 0.026845637583892617), ('breeding', 0.026845637583892617), ('receding', 0.013422818791946308), ('reeling', 0.006711409395973154), ('needing', 0.006711409395973154)]
# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions
# 'boook'
blob.words[5].spellcheck()
[('book', 0.946969696969697), ('brook', 0.05303030303030303)]
# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions
# 'greet'
blob.words[13].spellcheck()
[('greet', 1.0)]
Tokenization is not as trivial as splitting on whitespace: a tokenizer has to decide whether symbols such as ( “ $ Rs Dr km ) , . ! ” - -- / ... are part of a word or separate tokens. For example, in L.A.! the exclamation mark (!) is separated as its own token, while L.A. is not split.
Tokenization using the string.split() Method
- We can tokenize a string using the mystr.split() method, which returns a list of strings.
- The mystr.split() method splits a string into a list of strings at every occurrence of whitespace by default and discards empty strings from the result.
- You can pass a separator such as sep='i' to the split method to split at that specific character instead (see the example after the first demo below).
mystr="Learning is fun with Arif"
print(mystr.split())
['Learning', 'is', 'fun', 'with', 'Arif']
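As mentioned above, passing a sep argument makes split() cut at that specific character instead of whitespace (a small sketch):
mystr = "Learning is fun with Arif"
print(mystr.split(sep='i'))    # ['Learn', 'ng ', 's fun w', 'th Ar', 'f']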
mystr="This example is great!"
print(mystr.split())
['This', 'example', 'is', 'great!']
Observe the output: the exclamation symbol has become part of the token great, which is wrong.
Tokenization using the re.split() Method
- The re.split() method splits the source string at every occurrence of the pattern, returning a list containing the resulting substrings.
import re
mystr="This example is great!"
pattern = re.compile(r'\W+')
pattern.split(mystr)
['This', 'example', 'is', 'great', '']
- The exclamation symbol is not part of the token great, but what if I need that symbol as a separate token?
- Moreover, you need to write different regular expressions for different scenarios
NLTK provides several tokenizers:
- nltk.tokenize.sent_tokenize(str) for sentence tokenization
- nltk.tokenize.word_tokenize(str) for word tokenization
- nltk.tokenize.treebank.TreebankWordTokenizer for Penn Treebank style word tokenization (a sketch follows the word_tokenize examples below)
import sys
!{sys.executable} -m pip install -q nltk
import nltk
nltk.__version__
'3.7'
from nltk.tokenize import word_tokenize, sent_tokenize
mystr="This example is great!"
print(word_tokenize(mystr))
['This', 'example', 'is', 'great', '!']
Observe the output: this time the exclamation symbol is kept as a separate token.
mystr="You should do your Ph.D in A.I!"
print(word_tokenize(mystr))
['You', 'should', 'do', 'your', 'Ph.D', 'in', 'A.I', '!']
mystr="You should've sent me an email at arif@pucit.edu.pk or vist http://www/arifbutt.me"
print(word_tokenize(mystr))
['You', 'should', "'ve", 'sent', 'me', 'an', 'email', 'at', 'arif', '@', 'pucit.edu.pk', 'or', 'vist', 'http', ':', '//www/arifbutt.me']
mystr="Here's an example worth $100. I am 384400km away from earth's moon!"
print(word_tokenize(mystr))
['Here', "'s", 'an', 'example', 'worth', '$', '100', '.', 'I', 'am', '384400km', 'away', 'from', 'earth', "'s", 'moon', '!']
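The other two NLTK tokenizers listed above work in a similar way. A minimal sketch of sent_tokenize and TreebankWordTokenizer (sent_tokenize may require downloading the punkt data first via nltk.download('punkt')):
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

mystr = "This example is great! Learning is fun with Arif."
print(sent_tokenize(mystr))                       # ['This example is great!', 'Learning is fun with Arif.']
print(TreebankWordTokenizer().tokenize(mystr))    # punctuation is split off as separate tokens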
spaCy (https://spacy.io/) is an open-source Natural Language Processing library, released in 2015, designed to handle NLP tasks with efficient, state-of-the-art algorithms.
spaCy supports tokenization in many languages (over 65). Besides importing spacy, you have to load the appropriate language model using the spacy.load() method, and before that make sure you have downloaded the model on your system.
spaCy will isolate punctuation that does not form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.
Download spacy model for English language:
- en for English, fr for French, zh for Chinese
- sm for small, md for medium, lg for large and trf for transformer
For details read spaCy101: https://spacy.io/usage/spacy-101
import sys
!{sys.executable} -m pip install -q spacy
import spacy
spacy.__version__
/Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
'3.4.1'
Download spacy model for English language
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import sys
!{sys.executable} -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.4.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
...
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Example 1:
# import spacy and load the language library
import spacy
nlp = spacy.load('en_core_web_lg')
mystr="'A 7km Uber cab ride from Gulberg to Joher Town will cost you $20"
doc = nlp(mystr)
for token in doc:
    print(token, end=' , ')
' , A , 7 , km , Uber , cab , ride , from , Gulberg , to , Joher , Town , will , cost , you , $ , 20 ,
Note that spaCy has successfully separated the distance unit km from the number, which nltk failed to do (it kept 384400km as a single token).
Example 2:
# import spacy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')
mystr="You should've sent me an email at arif@pucit.edu.pk or vist http://www/arifbutt.me"
doc = nlp(mystr)
for token in doc:
    print(token, end=' , ')
You , should , 've , sent , me , an , email , at , arif@pucit.edu.pk , or , vist , http://www , / , arifbutt.me ,
- Note that spacy has kept the email as a single token, while nltk separated it.
- However, spacy also failed to properly tokenize the URL :(
Additional Token Attributes: Once the string is passed to the nlp() method of spaCy, the tokens of the resulting doc object have many other associated attributes besides the token text:

Tag | Description |
---|---|
.text | The original word text |
.lemma_ | The base form of the word |
.pos_ | The simple part-of-speech tag |
.tag_ | The detailed part-of-speech tag |
.shape_ | The word shape – capitalization, punctuation, digits |
.is_alpha, .is_ascii, .is_digit | Token text consists of alphabetic characters, ASCII characters, digits |
.is_lower, .is_upper, .is_title | Token text is in lowercase, uppercase, titlecase |
.is_punct, .is_space, .is_stop | Token is punctuation, whitespace, stopword |
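A small sketch showing a few of these attributes on the tokens of a doc (using the en_core_web_sm model downloaded above; the sample sentence is just an illustration):
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple isn't looking at buying U.K. startups.")
for token in doc:
    # text, lemma, coarse POS tag, and a couple of boolean flags
    print(token.text, token.lemma_, token.pos_, token.is_punct, token.is_stop)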
An n-gram is a contiguous sequence of n tokens from a text. N-grams often carry meaning that the individual words do not (e.g., the bigram good food carries more meaning than just good and food when observed independently).
import nltk
mystr = "Allama Iqbal was a visionary philosopher and politician. Thank you"
tokens = nltk.tokenize.word_tokenize(mystr)
bgs = nltk.bigrams(tokens)
print(bgs)
for grams in bgs:
    print(grams)
<generator object bigrams at 0x7fd61bb02f10>
('Allama', 'Iqbal')
('Iqbal', 'was')
('was', 'a')
('a', 'visionary')
('visionary', 'philosopher')
('philosopher', 'and')
('and', 'politician')
('politician', '.')
('.', 'Thank')
('Thank', 'you')
The formula to calculate the count of n-grams in a piece of text is X - N + 1, where X is the number of tokens in the text and N is the number of words in the n-gram. For the eleven tokens above with N = 2:

\begin{equation} \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 2 + 1 \hspace{0.5cm} = \hspace{0.5cm} 10 \end{equation}
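We can verify this formula directly on the eleven tokens above (a quick check):
# For bigrams, N = 2, so we expect 11 - 2 + 1 = 10 bigrams
print(len(tokens))                       # 11
print(len(list(nltk.bigrams(tokens))))   # 10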
tgs = nltk.trigrams(tokens)
for grams in tgs:
    print(grams)
('Allama', 'Iqbal', 'was') ('Iqbal', 'was', 'a') ('was', 'a', 'visionary') ('a', 'visionary', 'philosopher') ('visionary', 'philosopher', 'and') ('philosopher', 'and', 'politician') ('and', 'politician', '.') ('politician', '.', 'Thank') ('.', 'Thank', 'you')
ngrams = nltk.ngrams(tokens, 4)
for grams in ngrams:
    print(grams)
('Allama', 'Iqbal', 'was', 'a') ('Iqbal', 'was', 'a', 'visionary') ('was', 'a', 'visionary', 'philosopher') ('a', 'visionary', 'philosopher', 'and') ('visionary', 'philosopher', 'and', 'politician') ('philosopher', 'and', 'politician', '.') ('and', 'politician', '.', 'Thank') ('politician', '.', 'Thank', 'you')
To use the NLTK stopword list, you first need to download it using nltk.download():
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import nltk
nltk.download("stopwords")
# nltk.download()
[nltk_data] Downloading package stopwords to /Users/arif/nltk_data... [nltk_data] Package stopwords is already up-to-date!
True
After the download completes, you can load the stopwords package from nltk.corpus and use it to load the stop words:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
{'has', 'here', 'doesn', "hasn't", 'mustn', 'further', "shan't", 'for', "needn't", 'not', 'than', 'am', 'isn', 'our', 'been', 'with', 'through', 'now', 'ourselves', 'themselves', 'these', 'from', 'its', "that'll", 'how', 'until', 'who', 'both', 'couldn', 'then', "you've", 'ma', 'wasn', 'of', 'same', "doesn't", 'don', "it's", 'in', 've', 'very', 'himself', 'again', 'on', 'them', 'there', 'because', "you're", 'wouldn', 'some', 'too', 'hadn', 'the', 'just', 'are', "hadn't", 'to', 'had', 'when', 'needn', 'other', 'hers', 'be', "shouldn't", 'mightn', "won't", 'whom', 'own', 'should', 'after', 'yours', 'being', 'as', 'nor', 'down', 'more', 'before', "mustn't", 'it', "wouldn't", 'will', 'were', "don't", "weren't", 'myself', 'we', 'yourself', 'doing', 're', 'few', 'aren', "haven't", 'weren', 'he', 'by', 'at', 'didn', "mightn't", 'him', 'was', "didn't", "you'll", 'why', 'against', 'any', 'you', "she's", 'her', 'does', "isn't", 'can', 'those', 'herself', 'll', 'so', 'she', 'an', 'ain', "couldn't", 'yourselves', 'shouldn', 'd', 'off', 'no', "wasn't", "you'd", 'ours', 'once', 't', 'where', 'over', 'shan', 'under', 'all', 'about', 'do', 'itself', 'only', 'most', 'o', 'have', 'did', 'if', 'while', 'during', 'y', 'what', 'that', 'out', 'below', 'm', 'my', 'me', 'they', 'or', 'up', 'haven', 'your', 'such', 'hasn', 'into', 'won', 'but', 'and', 'a', 'this', "aren't", 's', 'their', 'theirs', 'having', 'which', 'i', 'is', 'above', 'between', 'his', "should've", 'each'}
def remove_stopwords(text):
    new_text = list()
    for word in text.split():
        if word not in stopwords.words('english'):
            new_text.append(word)
    return " ".join(new_text)
Removing Stopwords from Text of an Email
import nltk
from nltk.corpus import stopwords
mystr="Your Google account has been compromised. \
Your account will be closed. Immediately click this link to update your account"
remove_stopwords(mystr)
'Your Google account compromised. Your account closed. Immediately click link update account'
Removing Stopwords for a Sentiment Analysis Application
mystr="This movie is not good"
remove_stopwords(mystr)
'This movie good'
- For sentiment analysis purposes, the overall meaning of the resulting sentence now appears positive, which is not at all the reality. So either do not remove stopwords while doing sentiment analysis, or handle the negation before removing stopwords (a sketch follows below).
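One simple way to handle negation, sketched below, is to keep negation words out of the stopword list before filtering; the set of negation words chosen here is only illustrative:
from nltk.corpus import stopwords

negations = {'not', 'no', 'nor', "don't", "isn't", "wasn't", "couldn't"}   # illustrative subset, extend as needed
sentiment_stop_words = set(stopwords.words('english')) - negations

def remove_stopwords_keep_negation(text):
    # drop stopwords but keep negation words so that the sentiment is not flipped
    return " ".join(word for word in text.split() if word not in sentiment_stop_words)

remove_stopwords_keep_negation("This movie is not good")   # 'This movie not good'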
For details read spaCy101: https://spacy.io/usage/spacy-101
import spacy
nlp = spacy.load('en_core_web_sm')
# returns a set of around 326 English stopwords built into spaCy
print(len(nlp.Defaults.stop_words))
print(nlp.Defaults.stop_words)
326 {'n‘t', 'whereas', 'yet', "'d", 'than', 'anyone', 'am', 'still', 'with', 'afterwards', 'anywhere', 'these', 'hence', 'hereupon', 'namely', 'else', 'get', 'using', "n't", 'fifty', 'five', 'same', 'please', "'re", '’ll', 'herein', 'since', 'empty', 'there', 'move', 'the', 'forty', 'hers', 'although', 'yours', 'third', 'though', 'sometimes', 'were', 'six', 'could', 'yourself', 'ever', 'him', 'against', 'seem', 'herself', 'so', 'every', '‘re', 'somehow', 'where', 'also', 'amount', 'do', 'most', 'have', 'us', 'whenever', 'otherwise', 'never', 'former', 'next', 'out', 'become', 'formerly', 'or', 'make', 'into', 'but', 'beforehand', 'perhaps', 'each', 'has', 'bottom', 'ca', 'latterly', 'eight', "'ve", 'further', 'through', 'many', 'from', 'wherever', 'until', 'both', 'whereafter', 'must', 'then', 'however', 'of', 'mine', 'onto', 'anyway', 'on', 'back', 'cannot', 'ten', 'some', 'too', 'regarding', 'name', 'just', '‘ll', 'are', 'when', 'other', 'three', 'be', 'would', 'towards', 'noone', 'whence', 'as', 'being', 'behind', 'down', 'more', 'anyhow', 'before', '‘m', 'mostly', 'various', 'everywhere', "'s", 'beyond', 'we', 'take', 're', 'few', 'becoming', 'full', 'he', 'by', 'at', 'without', 'unless', 'none', 'any', 'does', 'her', 'done', 'nothing', 'whereupon', 'she', 'almost', 'used', '’s', 'side', 'off', 'no', 'whose', 'besides', 'seems', 'under', 'several', 'always', 'sometime', 'thereafter', "'m", 'such', 'and', 'a', 'is', 'really', 'someone', 'fifteen', 'for', 'not', "'ll", 'been', 'ourselves', 'themselves', 'eleven', 'how', 'might', 'who', 'thence', 'twenty', 'seemed', 'whole', 'least', 'in', '‘ve', 'well', 'together', 'them', 'twelve', 'may', 'because', 'nowhere', 'thru', 'became', 'even', 'among', 'elsewhere', 'whom', 'own', 'after', 'enough', 'alone', 'it', 'was', 'whoever', 'quite', 'becomes', 'due', 'moreover', 'others', 'an', 'per', 'except', 'call', 'once', 'about', 'around', 'go', 'n’t', 'anything', 'hereafter', '’d', 'often', 'serious', 'up', 'show', 'amongst', 'between', 'his', 'here', 'either', 'our', 'nevertheless', 'now', 'one', 'its', 'see', 'first', 'thereby', 'upon', 'via', 'much', 'put', 'thus', 'very', 'therein', 'himself', 'four', 'again', 'say', 'neither', 'along', '’re', 'beside', 'something', 'less', 'part', 'to', 'had', 'last', 'two', 'should', 'everyone', 'nor', 'within', 'will', 'hereby', 'made', 'myself', '‘d', 'doing', 'sixty', 'everything', 'throughout', 'another', 'hundred', 'why', 'toward', 'you', 'can', 'those', 'whither', '’m', 'across', 'somewhere', 'thereupon', 'yourselves', 'whether', 'meanwhile', '‘s', 'give', 'already', 'seeming', 'ours', 'rather', 'over', 'front', 'keep', 'nobody', 'all', 'itself', 'only', 'did', 'if', 'while', 'during', 'what', 'that', 'below', 'top', 'my', 'me', 'they', 'wherein', 'latter', 'your', 'nine', 'this', 'indeed', 'whereby', 'their', '’ve', 'therefore', 'which', 'i', 'above', 'whatever'}
def remove_stopwords_spacy(text):
    new_text = list()
    for word in text.split():
        if word not in nlp.Defaults.stop_words:
            new_text.append(word)
    return " ".join(new_text)
mystr="This is a sample text and we need to remove stopwords from it"
remove_stopwords_spacy(mystr)
'This sample text need remove stopwords'
Add a stop word to the existing list of spaCy:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('aka')
# Set the stop_word tag on the lexeme
nlp.vocab['aka'].is_stop = True
nlp.vocab['aka'].is_stop
True
len(nlp.Defaults.stop_words)
327
To remove a stop word: alternatively, you may decide that a word (here, the 'aka' we just added) should no longer be considered a stop word.
nlp.vocab['aka'].is_stop
True
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('aka')
# Remove the stop_word tag from the lexeme
nlp.vocab['aka'].is_stop = False
nlp.vocab['aka'].is_stop
False
len(nlp.Defaults.stop_words)
326
import pandas as pd
df = pd.read_csv("./datasets/imdb-dataset.csv")
df.head()
 | review | sentiment |
---|---|---|
0 | One of the other reviewers has mentioned that ... | positive |
1 | A wonderful little production. <br /><br />The... | positive |
2 | I thought this was a wonderful way to spend ti... | positive |
3 | Basically there's a family where a little boy ... | negative |
4 | Petter Mattei's "Love in the Time of Money" is... | positive |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
# Check the count of positive and negative reviews to ensure that the dataset is balanced
df['sentiment'].value_counts()
positive 25000 negative 25000 Name: sentiment, dtype: int64
import seaborn as sns
sns.catplot(x ='sentiment', kind='count', data = df);
Reduce the records from 50K to 1K for quick processing
# save 1000 rows in a new dataframe
temp_df = df.iloc[0:1000,:]
temp_df.shape
(1000, 2)
# check out the count of positive and negative reviews
temp_df['sentiment'].value_counts()
positive 501 negative 499 Name: sentiment, dtype: int64
# save the dataframe to a new csv file
temp_df.to_csv('datasets/imdb-dataset-1000.csv', index=False)
Read the Dataset:
import pandas as pd
df = pd.read_csv("./datasets/imdb-dataset-1000.csv")
df
 | review | sentiment |
---|---|---|
0 | One of the other reviewers has mentioned that ... | positive |
1 | A wonderful little production. <br /><br />The... | positive |
2 | I thought this was a wonderful way to spend ti... | positive |
3 | Basically there's a family where a little boy ... | negative |
4 | Petter Mattei's "Love in the Time of Money" is... | positive |
... | ... | ... |
995 | Nothing is sacred. Just ask Ernie Fosselius. T... | positive |
996 | I hated it. I hate self-aware pretentious inan... | negative |
997 | I usually try to be professional and construct... | negative |
998 | If you like me is going to see this in a film ... | negative |
999 | This is like a zoology textbook, given that it... | negative |
1000 rows × 2 columns
df.review[0]
"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side."
import re
import string
import contractions
from textblob import TextBlob
def text_cleaning(mystr):
    mystr = mystr.lower()                                   # case folding
    mystr = re.sub(r'\w*\d\w*', '', mystr)                  # remove words containing digits
    mystr = re.sub('\n', ' ', mystr)                        # replace new line characters with space
    mystr = re.sub('[‘’“”…]', '', mystr)                    # remove curly quotes and ellipsis
    mystr = re.sub(r'<.*?>', '', mystr)                     # remove html tags
    mystr = re.sub(r'https?://\S+|www\.\S+', '', mystr)     # remove URLs
    mystr = ''.join([c for c in mystr if c not in string.punctuation])    # remove punctuation
    mystr = ' '.join([contractions.fix(word) for word in mystr.split()])  # expand contractions
    return mystr
df['r_cleaned'] = df['review'].apply(lambda x : text_cleaning(x))
df.head()
 | review | sentiment | r_cleaned |
---|---|---|---|
0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... |
1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... |
2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... |
3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... |
4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... |
from nltk.tokenize import word_tokenize
df['r_tokenized'] = df['r_cleaned'].apply(lambda x: word_tokenize(x))
df.head()
 | review | sentiment | r_cleaned | r_tokenized |
---|---|---|---|---|
0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, of, the, other, reviewers, has, mentione... |
1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [a, wonderful, little, production, the, filmin... |
2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [i, thought, this, was, a, wonderful, way, to,... |
3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... | [basically, there, is, a, family, where, a, li... |
4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | [petter, matteis, love, in, the, time, of, mon... |
import nltk
stop_words = nltk.corpus.stopwords.words('english')
def remove_stopwords(tokenized_text):
    new_words = [word for word in tokenized_text if word not in stop_words]
    return new_words
df['r_no_sw'] = df['r_tokenized'].apply(lambda token: remove_stopwords(token))
df.head()
 | review | sentiment | r_cleaned | r_tokenized | r_no_sw |
---|---|---|---|---|---|
0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, of, the, other, reviewers, has, mentione... | [one, reviewers, mentioned, watching, oz, epis... |
1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [a, wonderful, little, production, the, filmin... | [wonderful, little, production, filming, techn... |
2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [i, thought, this, was, a, wonderful, way, to,... | [thought, wonderful, way, spend, time, hot, su... |
3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... | [basically, there, is, a, family, where, a, li... | [basically, family, little, boy, jake, thinks,... |
4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | [petter, matteis, love, in, the, time, of, mon... | [petter, matteis, love, time, money, visually,... |
# join the tokens of pre-processed text
df['processed_reviews'] = df['r_no_sw'].apply(lambda x: ' '.join(x))
new_df = pd.concat([df['sentiment'], df['processed_reviews']], axis=1)
# save the resulting dataframe to a new csv file
new_df.to_csv('datasets/processed_imdb_reviews.csv', index=False)
new_df.head()
 | sentiment | processed_reviews |
---|---|---|
0 | positive | one reviewers mentioned watching oz episode ho... |
1 | positive | wonderful little production filming technique ... |
2 | positive | thought wonderful way spend time hot summer we... |
3 | negative | basically family little boy jake thinks zombie... |
4 | positive | petter matteis love time money visually stunni... |
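As a quick sanity check, the saved file can be read back to confirm the pre-processed reviews were written as expected:
check_df = pd.read_csv('datasets/processed_imdb_reviews.csv')
print(check_df.shape)    # (1000, 2)
check_df.head()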