Text Cleaning
Basic Text Preprocessing
Advanced Preprocessing
Text Pre-Processing on Tweets Dataset
import sys
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q --upgrade numpy pandas scikit-learn
!{sys.executable} -m pip install -q --upgrade nltk spacy gensim wordcloud textblob contractions clean-text unicode
Sometimes words and digits appear combined in the text, like game57 or game5ts7, which creates a problem for machines to understand. Hence, we need to remove such tokens that mix words and digits.
For this and many other tasks we normally use Regular Expressions.
Watch my two videos on regular expressions:
The re.sub(pattern, replacement_string, str) method returns the string obtained by replacing the occurrences of pattern in str with the replacement_string. If the pattern isn't found, the string is returned unchanged.
import re
mystr = "This is abc32 a abc32xyz string containing 32abc words 32 having digits"
re.sub(r'\w*\d\w*', '', mystr)
'This is a string containing words having digits'
import re
mystr = " This is a string with lots of extra spaces in beteween words ."
re.sub(' +', ' ', mystr)
' This is a string with lots of extra spaces in beteween words .'
mystr = "This is\na string\nwith lots of new\nline characters."
print("Original String:\n", mystr)
print("Preprocessed String:", re.sub('\n', ' ', mystr))
Original String:
 This is
a string
with lots of new
line characters.
Preprocessed String: This is a string with lots of new line characters.
import re
mystr = "<html> <head> An empty head. </head><body><p> This is so simple and fun. </p> </body> </html>"
print("Original String: ", mystr)
print("Preprocessed String: ", re.sub('<.*?>', '', mystr))
Original String:  <html> <head> An empty head. </head><body><p> This is so simple and fun. </p> </body> </html>
Preprocessed String:   An empty head.  This is so simple and fun.
import re
mystr = "Good youTube lectures by Arif are available at http://www.youtube.com/c/LearnWithArif/playlists"
re.sub(r'https?://\S+|www\.\S+', '', mystr)
'Good youTube lectures by Arif are available at '
We can use the string.punctuation constant to identify punctuation characters and replace each of them in the text with an empty string.
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
- Check for other constants like string.whitespace, string.printable, string.ascii_letters, string.digits as well (a quick look at them follows below).
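A small sketch showing what these constants hold; the values in the comments are from Python's standard string module.
import string
print(repr(string.whitespace))      # ' \t\n\r\x0b\x0c'
print(string.ascii_letters)         # abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.digits)                # 0123456789
print(len(string.printable))        # 100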
mystr = 'A {text} ^having$ "lot" of #s and [puncutations]!.;%..'
mystr
'A {text} ^having$ "lot" of #s and [puncutations]!.;%..'
newstr = ''.join([ch for ch in mystr if ch not in string.punctuation])
newstr
'A text having lot of s and puncutations'
mystr = "This IS GREAT series of Lectures by Arif at the Deaprtment of DS"
mystr.lower()
'this is great series of lectures by arif at the deaprtment of ds'
We can use the contractions module or create our own dictionary to expand contractions.
import sys
!{sys.executable} -m pip install -q contractions
import contractions
print(contractions.fix("you're")) # you are
print(contractions.fix("ain't")) # am not / are not / is not / has not / have not
print(contractions.fix("you'll")) #you shall / you will
print(contractions.fix("wouldn't've")) #"wouldn't've": "would not have",
you are
are not
you will
would not have
mystr = '''I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear.
It's awesome to meet new friends. We've been waiting for this day for so long.'''
mystr
"I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. \nIt's awesome to meet new friends. We've been waiting for this day for so long."
# use loop
mylist = []
for word in mystr.split(sep=' '):
    mylist.append(contractions.fix(word))
newstring = ' '.join(mylist)
print(newstring)
I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.
# use list comprehension and join the words of list on space
expanded_string = ' '.join([contractions.fix(word) for word in mystr.split()])
expanded_string
'I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.'
Some commonly used abbreviated chat words that are used on social media these days are:
To pre-process any text containing such abbreviations we can search for an online dictionary, or can create a dictionary of our own
dict_chatwords = {
'ack': 'acknowledge',
'omg': 'oh my God',
'aisi': 'as i see it',
'bi5': 'back in 5 minutes',
'lmk': 'let me know',
'gn' : 'good night',
'fyi': 'for your information',
'asap': 'as soon as possible',
'yolo': 'you only live once',
'rofl': 'rolling on floor laughing',
'nvm': 'never mind',
'ofc': 'of course',
'blv' : 'boulevard',
'cir' : 'circle',
'hwy' : 'highway',
'ln' : 'lane',
'pt' : 'point',
'rd' : 'road',
'sq' : 'square',
'st' : 'street'
}
mystr = "omg this is aisi I ack your work and will be bi5"
mystr
'omg this is aisi I ack your work and will be bi5'
# dict.items() method returns all the key-value pairs of a dict as a two object tuple
# dict.keys() method returns all the keys of a dict object
# dict.values() method returns all the values of a dict object
mylist = []
for word in mystr.split(sep=' '):
    if word in dict_chatwords.keys():
        mylist.append(dict_chatwords[word])
    else:
        mylist.append(word)
newstring = ' '.join(mylist)
print(newstring)
oh my God this is as i see it I acknowledge your work and will be back in 5 minutes
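The same replacement can be written more compactly with dict.get(), which falls back to the original word when it is not in the dictionary (a minimal sketch, equivalent to the loop above):
# dict.get(word, word) returns the expansion if present, else the word unchanged
newstring = ' '.join(dict_chatwords.get(word, word) for word in mystr.split())
print(newstring)   # oh my God this is as i see it I acknowledge your work and will be back in 5 minutes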
We can remove emojis using regular expressions, the emoji module, or the clean-text library (a clean-text sketch is shown after the emoji-module examples below).
mystr = "These emojis needs to be removed, there is a huge list...😃😬😂😅😇😉😊😜😎🤗🙄🤔😡😤😭🤠🤡🤫💩😈👻🙌👍✌️👌🙏"
mystr
'These emojis needs to be removed, there is a huge list...😃😬😂😅😇😉😊😜😎🤗🙄🤔😡😤😭🤠🤡🤫💩😈👻🙌👍✌️👌🙏'
import re
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # code range for emoticons
u"\U0001F300-\U0001F5FF" # code range for symbols & pictographs
u"\U0001F680-\U0001F6FF" # code range for transport & map symbols
u"\U0001F1E0-\U0001F1FF" # code range for flags (iOS)
u"\U00002700-\U000027BF" # code range for Dingbats
u"\U00002500-\U00002BEF" # code range for chinese char
u"\U00002702-\U000027B0"
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u"\U00010000-\U0010ffff"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u200d"
u"\u23cf"
u"\u23e9"
u"\u231a"
u"\ufe0f"
u"\u3030"
"]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', mystr)) # no emoji
These emojis needs to be removed, there is a huge list...
import sys
!{sys.executable} -m pip install -q emoji
import emoji
mystr = "This is 👍"
emoji.demojize(mystr)
'This is :thumbs_up:'
mystr = "I am 🤔"
emoji.demojize(mystr)
'I am :thinking_face:'
mystr = "This is 👍"
emoji.replace_emoji(mystr, replace='positive')
'This is positive'
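The clean-text library mentioned earlier can also strip emojis (along with other noise) in a single call. A minimal sketch, assuming a clean-text version that supports the no_emoji flag (0.6.0 or later):
from cleantext import clean
mystr = "This is 👍"
# no_emoji=True drops emojis; lower=False keeps the original casing
clean(mystr, no_emoji=True, lower=False)   # expected: 'This is'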
Non-word errors: the user intends to type reading but types reeding. These are easy to detect as they do not exist in the language dictionary and can be corrected using algorithms like shortest weighted edit distance and highest noisy channel probability.
Real-word errors: the user intends to type great but types greet, or two but types too. These are harder to detect and correct, because the typed word is itself a valid dictionary word.
import sys
!{sys.executable} -m pip install -q textblob
import textblob
textblob.__version__
'0.17.1'
from textblob import TextBlob
mystr = "I am reeding thiss gret boook on deta sciance suject, which is a greet curse"
blob = TextBlob(mystr)
type(blob)
textblob.blob.TextBlob
print(dir(blob))
['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cmpkey', '_compare', '_create_sentence_objects', '_strkey', 'analyzer', 'classifier', 'classify', 'correct', 'detect_language', 'ends_with', 'endswith', 'find', 'format', 'index', 'join', 'json', 'lower', 'ngrams', 'noun_phrases', 'np_counts', 'np_extractor', 'parse', 'parser', 'polarity', 'pos_tagger', 'pos_tags', 'raw', 'raw_sentences', 'replace', 'rfind', 'rindex', 'sentences', 'sentiment', 'sentiment_assessments', 'serialized', 'split', 'starts_with', 'startswith', 'string', 'strip', 'stripped', 'subjectivity', 'tags', 'title', 'to_json', 'tokenize', 'tokenizer', 'tokens', 'translate', 'translator', 'upper', 'word_counts', 'words']
blob.correct().string
'I am reading this great book on data science subject, which is a greet curse'
- The non-word errors like reeding, thiss, gret, boook, deta, sciance and suject have been corrected by the blob.correct() method.
- However, the real-word errors like greet and curse are not corrected.

Let us try to understand how the TextBlob correct() method does this.
# The word attribute of textblob object returns list of words in the text
blob.words
WordList(['I', 'am', 'reeding', 'thiss', 'gret', 'boook', 'on', 'deta', 'sciance', 'suject', 'which', 'is', 'a', 'greet', 'curse'])
# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions
# 'reeding'
blob.words[2].spellcheck()
[('reading', 0.7651006711409396), ('feeding', 0.10067114093959731), ('heeding', 0.053691275167785234), ('rending', 0.026845637583892617), ('breeding', 0.026845637583892617), ('receding', 0.013422818791946308), ('reeling', 0.006711409395973154), ('needing', 0.006711409395973154)]
# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions
# 'boook'
blob.words[5].spellcheck()
[('book', 0.946969696969697), ('brook', 0.05303030303030303)]
# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions
# 'greet'
blob.words[13].spellcheck()
[('greet', 1.0)]
Tokenization is not as trivial as splitting on whitespace: a tokenizer has to decide whether symbols such as ( “ $ Rs Dr km ) , . ! ” - -- / ... are part of a word or separate tokens. For example, in L.A.! the exclamation mark (!) is separated as its own token, while L.A. is not split.
Tokenization using the string.split() Method
- We can tokenize a string using the mystr.split() method, which returns a list of strings.
- The mystr.split() method splits a string into a list of strings at every occurrence of whitespace by default and discards empty strings from the result.
- You can pass a separator such as sep='i' to the split method to split at that specific character instead (see the example after the first demo below).
mystr="Learning is fun with Arif"
print(mystr.split())
['Learning', 'is', 'fun', 'with', 'Arif']
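As mentioned above, passing a sep argument makes split() cut at that specific character instead of whitespace (a small sketch):
mystr = "Learning is fun with Arif"
print(mystr.split(sep='i'))    # ['Learn', 'ng ', 's fun w', 'th Ar', 'f']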
mystr="This example is great!"
print(mystr.split())
['This', 'example', 'is', 'great!']
Observe the output: the exclamation symbol has become part of the token great, which is wrong.
Tokenization using the re.split() Method
- The re.split() method splits the source string at every occurrence of the pattern, returning a list containing the resulting substrings.
import re
mystr="This example is great!"
pattern = re.compile(r'\W+')
pattern.split(mystr)
['This', 'example', 'is', 'great', '']
- The exclamation symbol is not part of the token great, but what if I need that symbol as a separate token?
- Moreover, you need to write different regular expressions for different scenarios
NLTK provides several tokenizers:
- nltk.tokenize.sent_tokenize(str) for sentence tokenization
- nltk.tokenize.word_tokenize(str) for word tokenization
- nltk.tokenize.treebank.TreebankWordTokenizer for Penn Treebank style word tokenization (a sketch follows the word_tokenize examples below)
import sys
!{sys.executable} -m pip install -q nltk
import nltk
nltk.__version__
'3.7'
from nltk.tokenize import word_tokenize, sent_tokenize
mystr="This example is great!"
print(word_tokenize(mystr))
['This', 'example', 'is', 'great', '!']
Observe the output: this time the exclamation symbol is kept as a separate token.
mystr="You should do your Ph.D in A.I!"
print(word_tokenize(mystr))
['You', 'should', 'do', 'your', 'Ph.D', 'in', 'A.I', '!']
mystr="You should've sent me an email at arif@pucit.edu.pk or vist http://www/arifbutt.me"
print(word_tokenize(mystr))
['You', 'should', "'ve", 'sent', 'me', 'an', 'email', 'at', 'arif', '@', 'pucit.edu.pk', 'or', 'vist', 'http', ':', '//www/arifbutt.me']
mystr="Here's an example worth $100. I am 384400km away from earth's moon!"
print(word_tokenize(mystr))
['Here', "'s", 'an', 'example', 'worth', '$', '100', '.', 'I', 'am', '384400km', 'away', 'from', 'earth', "'s", 'moon', '!']
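The other two NLTK tokenizers listed above work in a similar way. A minimal sketch of sent_tokenize and TreebankWordTokenizer (sent_tokenize may require downloading the punkt data first via nltk.download('punkt')):
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

mystr = "This example is great! Learning is fun with Arif."
print(sent_tokenize(mystr))                       # ['This example is great!', 'Learning is fun with Arif.']
print(TreebankWordTokenizer().tokenize(mystr))    # punctuation is split off as separate tokens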
spaCy (https://spacy.io/) is an open-source Natural Language Processing library, released in 2015, designed to handle NLP tasks with efficient, state-of-the-art algorithms.
spaCy supports tokenization in many languages (over 65). Besides importing spacy, you have to load the appropriate language model using the spacy.load() method, and before that make sure you have downloaded the model on your system.
spaCy will isolate punctuation that does not form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.
Download spacy model for English language:
- en for English, fr for French, zh for Chinese
- sm for small, md for medium, lg for large and trf for transformer
For details read spaCy101: https://spacy.io/usage/spacy-101
import sys
!{sys.executable} -m pip install -q spacy
import spacy
spacy.__version__
/Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
'3.4.1'
Download spacy model for English language
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import sys
!{sys.executable} -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.4.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
...
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Example 1:
# import spacy and load the language library
import spacy
nlp = spacy.load('en_core_web_lg')
mystr="'A 7km Uber cab ride from Gulberg to Joher Town will cost you $20"
doc = nlp(mystr)
for token in doc:
    print(token, end=' , ')
' , A , 7 , km , Uber , cab , ride , from , Gulberg , to , Joher , Town , will , cost , you , $ , 20 ,
Note that spaCy has successfully separated the distance unit km from the number, which nltk failed to do (it kept 384400km as a single token).
Example 2:
# import spacy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')
mystr="You should've sent me an email at arif@pucit.edu.pk or vist http://www/arifbutt.me"
doc = nlp(mystr)
for token in doc:
    print(token, end=' , ')
You , should , 've , sent , me , an , email , at , arif@pucit.edu.pk , or , vist , http://www , / , arifbutt.me ,
- Note that spacy has kept the email as a single token, while nltk separated it.
- However, spacy also failed to properly tokenize the URL :(
Additional Token Attributes: Once the string is passed to the nlp() method of spaCy, the tokens of the resulting doc object have many other associated attributes besides the token text:

Tag | Description |
---|---|
.text | The original word text |
.lemma_ | The base form of the word |
.pos_ | The simple part-of-speech tag |
.tag_ | The detailed part-of-speech tag |
.shape_ | The word shape – capitalization, punctuation, digits |
.is_alpha, .is_ascii, .is_digit | Token text consists of alphabetic characters, ASCII characters, digits |
.is_lower, .is_upper, .is_title | Token text is in lowercase, uppercase, titlecase |
.is_punct, .is_space, .is_stop | Token is punctuation, whitespace, stopword |
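A small sketch showing a few of these attributes on the tokens of a doc (using the en_core_web_sm model downloaded above; the sample sentence is just an illustration):
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple isn't looking at buying U.K. startups.")
for token in doc:
    # text, lemma, coarse POS tag, and a couple of boolean flags
    print(token.text, token.lemma_, token.pos_, token.is_punct, token.is_stop)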
An n-gram is a contiguous sequence of n tokens from a text. N-grams often carry meaning that the individual words do not (e.g., the bigram good food carries more meaning than just good and food when observed independently).
import nltk
mystr = "Allama Iqbal was a visionary philosopher and politician. Thank you"
tokens = nltk.tokenize.word_tokenize(mystr)
bgs = nltk.bigrams(tokens)
print(bgs)
for grams in bgs:
    print(grams)
<generator object bigrams at 0x7fd61bb02f10>
('Allama', 'Iqbal')
('Iqbal', 'was')
('was', 'a')
('a', 'visionary')
('visionary', 'philosopher')
('philosopher', 'and')
('and', 'politician')
('politician', '.')
('.', 'Thank')
('Thank', 'you')
The formula to calculate the count of n-grams in a piece of text is X - N + 1, where X is the number of tokens in the text and N is the number of words in the n-gram. For the eleven tokens above with N = 2:

\begin{equation} \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 2 + 1 \hspace{0.5cm} = \hspace{0.5cm} 10 \end{equation}
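We can verify this formula directly on the eleven tokens above (a quick check):
# For bigrams, N = 2, so we expect 11 - 2 + 1 = 10 bigrams
print(len(tokens))                       # 11
print(len(list(nltk.bigrams(tokens))))   # 10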
tgs = nltk.trigrams(tokens)
for grams in tgs:
    print(grams)
('Allama', 'Iqbal', 'was') ('Iqbal', 'was', 'a') ('was', 'a', 'visionary') ('a', 'visionary', 'philosopher') ('visionary', 'philosopher', 'and') ('philosopher', 'and', 'politician') ('and', 'politician', '.') ('politician', '.', 'Thank') ('.', 'Thank', 'you')
ngrams = nltk.ngrams(tokens, 4)
for grams in ngrams:
    print(grams)
('Allama', 'Iqbal', 'was', 'a') ('Iqbal', 'was', 'a', 'visionary') ('was', 'a', 'visionary', 'philosopher') ('a', 'visionary', 'philosopher', 'and') ('visionary', 'philosopher', 'and', 'politician') ('philosopher', 'and', 'politician', '.') ('and', 'politician', '.', 'Thank') ('politician', '.', 'Thank', 'you')
To use the NLTK stopword list, you first need to download it using nltk.download():
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import nltk
nltk.download("stopwords")
# nltk.download()
[nltk_data] Downloading package stopwords to /Users/arif/nltk_data... [nltk_data] Package stopwords is already up-to-date!
True
After the download completes, you can load the stopwords package from nltk.corpus and use it to load the stop words:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
{'has', 'here', 'doesn', "hasn't", 'mustn', 'further', "shan't", 'for', "needn't", 'not', 'than', 'am', 'isn', 'our', 'been', 'with', 'through', 'now', 'ourselves', 'themselves', 'these', 'from', 'its', "that'll", 'how', 'until', 'who', 'both', 'couldn', 'then', "you've", 'ma', 'wasn', 'of', 'same', "doesn't", 'don', "it's", 'in', 've', 'very', 'himself', 'again', 'on', 'them', 'there', 'because', "you're", 'wouldn', 'some', 'too', 'hadn', 'the', 'just', 'are', "hadn't", 'to', 'had', 'when', 'needn', 'other', 'hers', 'be', "shouldn't", 'mightn', "won't", 'whom', 'own', 'should', 'after', 'yours', 'being', 'as', 'nor', 'down', 'more', 'before', "mustn't", 'it', "wouldn't", 'will', 'were', "don't", "weren't", 'myself', 'we', 'yourself', 'doing', 're', 'few', 'aren', "haven't", 'weren', 'he', 'by', 'at', 'didn', "mightn't", 'him', 'was', "didn't", "you'll", 'why', 'against', 'any', 'you', "she's", 'her', 'does', "isn't", 'can', 'those', 'herself', 'll', 'so', 'she', 'an', 'ain', "couldn't", 'yourselves', 'shouldn', 'd', 'off', 'no', "wasn't", "you'd", 'ours', 'once', 't', 'where', 'over', 'shan', 'under', 'all', 'about', 'do', 'itself', 'only', 'most', 'o', 'have', 'did', 'if', 'while', 'during', 'y', 'what', 'that', 'out', 'below', 'm', 'my', 'me', 'they', 'or', 'up', 'haven', 'your', 'such', 'hasn', 'into', 'won', 'but', 'and', 'a', 'this', "aren't", 's', 'their', 'theirs', 'having', 'which', 'i', 'is', 'above', 'between', 'his', "should've", 'each'}
def remove_stopwords(text):
    new_text = list()
    for word in text.split():
        if word not in stopwords.words('english'):
            new_text.append(word)
    return " ".join(new_text)
Removing Stopwords from Text of an Email
import nltk
from nltk.corpus import stopwords
mystr="Your Google account has been compromised. \
Your account will be closed. Immediately click this link to update your account"
remove_stopwords(mystr)
'Your Google account compromised. Your account closed. Immediately click link update account'
Removing Stopwords for a Sentiment Analysis Application
mystr="This movie is not good"
remove_stopwords(mystr)
'This movie good'
- For sentiment analysis purposes, the overall meaning of the resulting sentence now appears positive, which is not at all the reality. So either do not remove stopwords while doing sentiment analysis, or handle the negation before removing stopwords (a sketch follows below).
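One simple way to handle negation, sketched below, is to keep negation words out of the stopword list before filtering; the set of negation words chosen here is only illustrative:
from nltk.corpus import stopwords

negations = {'not', 'no', 'nor', "don't", "isn't", "wasn't", "couldn't"}   # illustrative subset, extend as needed
sentiment_stop_words = set(stopwords.words('english')) - negations

def remove_stopwords_keep_negation(text):
    # drop stopwords but keep negation words so that the sentiment is not flipped
    return " ".join(word for word in text.split() if word not in sentiment_stop_words)

remove_stopwords_keep_negation("This movie is not good")   # 'This movie not good'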
For details read spaCy101: https://spacy.io/usage/spacy-101
import spacy
nlp = spacy.load('en_core_web_sm')
# returns a set of around 326 English stopwords built into spaCy
print(len(nlp.Defaults.stop_words))
print(nlp.Defaults.stop_words)
326 {'n‘t', 'whereas', 'yet', "'d", 'than', 'anyone', 'am', 'still', 'with', 'afterwards', 'anywhere', 'these', 'hence', 'hereupon', 'namely', 'else', 'get', 'using', "n't", 'fifty', 'five', 'same', 'please', "'re", '’ll', 'herein', 'since', 'empty', 'there', 'move', 'the', 'forty', 'hers', 'although', 'yours', 'third', 'though', 'sometimes', 'were', 'six', 'could', 'yourself', 'ever', 'him', 'against', 'seem', 'herself', 'so', 'every', '‘re', 'somehow', 'where', 'also', 'amount', 'do', 'most', 'have', 'us', 'whenever', 'otherwise', 'never', 'former', 'next', 'out', 'become', 'formerly', 'or', 'make', 'into', 'but', 'beforehand', 'perhaps', 'each', 'has', 'bottom', 'ca', 'latterly', 'eight', "'ve", 'further', 'through', 'many', 'from', 'wherever', 'until', 'both', 'whereafter', 'must', 'then', 'however', 'of', 'mine', 'onto', 'anyway', 'on', 'back', 'cannot', 'ten', 'some', 'too', 'regarding', 'name', 'just', '‘ll', 'are', 'when', 'other', 'three', 'be', 'would', 'towards', 'noone', 'whence', 'as', 'being', 'behind', 'down', 'more', 'anyhow', 'before', '‘m', 'mostly', 'various', 'everywhere', "'s", 'beyond', 'we', 'take', 're', 'few', 'becoming', 'full', 'he', 'by', 'at', 'without', 'unless', 'none', 'any', 'does', 'her', 'done', 'nothing', 'whereupon', 'she', 'almost', 'used', '’s', 'side', 'off', 'no', 'whose', 'besides', 'seems', 'under', 'several', 'always', 'sometime', 'thereafter', "'m", 'such', 'and', 'a', 'is', 'really', 'someone', 'fifteen', 'for', 'not', "'ll", 'been', 'ourselves', 'themselves', 'eleven', 'how', 'might', 'who', 'thence', 'twenty', 'seemed', 'whole', 'least', 'in', '‘ve', 'well', 'together', 'them', 'twelve', 'may', 'because', 'nowhere', 'thru', 'became', 'even', 'among', 'elsewhere', 'whom', 'own', 'after', 'enough', 'alone', 'it', 'was', 'whoever', 'quite', 'becomes', 'due', 'moreover', 'others', 'an', 'per', 'except', 'call', 'once', 'about', 'around', 'go', 'n’t', 'anything', 'hereafter', '’d', 'often', 'serious', 'up', 'show', 'amongst', 'between', 'his', 'here', 'either', 'our', 'nevertheless', 'now', 'one', 'its', 'see', 'first', 'thereby', 'upon', 'via', 'much', 'put', 'thus', 'very', 'therein', 'himself', 'four', 'again', 'say', 'neither', 'along', '’re', 'beside', 'something', 'less', 'part', 'to', 'had', 'last', 'two', 'should', 'everyone', 'nor', 'within', 'will', 'hereby', 'made', 'myself', '‘d', 'doing', 'sixty', 'everything', 'throughout', 'another', 'hundred', 'why', 'toward', 'you', 'can', 'those', 'whither', '’m', 'across', 'somewhere', 'thereupon', 'yourselves', 'whether', 'meanwhile', '‘s', 'give', 'already', 'seeming', 'ours', 'rather', 'over', 'front', 'keep', 'nobody', 'all', 'itself', 'only', 'did', 'if', 'while', 'during', 'what', 'that', 'below', 'top', 'my', 'me', 'they', 'wherein', 'latter', 'your', 'nine', 'this', 'indeed', 'whereby', 'their', '’ve', 'therefore', 'which', 'i', 'above', 'whatever'}
def remove_stopwords_spacy(text):
    new_text = list()
    for word in text.split():
        if word not in nlp.Defaults.stop_words:
            new_text.append(word)
    return " ".join(new_text)
mystr="This is a sample text and we need to remove stopwords from it"
remove_stopwords_spacy(mystr)
'This sample text need remove stopwords'
Add a stop word to the existing list of spaCy:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('aka')
# Set the stop_word tag on the lexeme
nlp.vocab['aka'].is_stop = True
nlp.vocab['aka'].is_stop
True
len(nlp.Defaults.stop_words)
327
To remove a stop word: alternatively, you may decide that a word (here, the 'aka' we just added) should no longer be considered a stop word.
nlp.vocab['aka'].is_stop
True
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('aka')
# Remove the stop_word tag from the lexeme
nlp.vocab['aka'].is_stop = False
nlp.vocab['aka'].is_stop
False
len(nlp.Defaults.stop_words)
326
import pandas as pd
df = pd.read_csv("./datasets/imdb-dataset.csv")
df.head()
 | review | sentiment |
---|---|---|
0 | One of the other reviewers has mentioned that ... | positive |
1 | A wonderful little production. <br /><br />The... | positive |
2 | I thought this was a wonderful way to spend ti... | positive |
3 | Basically there's a family where a little boy ... | negative |
4 | Petter Mattei's "Love in the Time of Money" is... | positive |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
# Check the count of positive and negative reviews to ensure that the dataset is balanced
df['sentiment'].value_counts()
positive 25000 negative 25000 Name: sentiment, dtype: int64
import seaborn as sns
sns.catplot(x ='sentiment', kind='count', data = df);
Reduce the records from 50K to 1K for quick processing
# save 1000 rows in a new dataframe
temp_df = df.iloc[0:1000,:]
temp_df.shape
(1000, 2)
# check out the count of positive and negative reviews
temp_df['sentiment'].value_counts()
positive 501 negative 499 Name: sentiment, dtype: int64
# save the dataframe to a new csv file
temp_df.to_csv('datasets/imdb-dataset-1000.csv', index=False)
Read the Dataset:
import pandas as pd
df = pd.read_csv("./datasets/imdb-dataset-1000.csv")
df
 | review | sentiment |
---|---|---|
0 | One of the other reviewers has mentioned that ... | positive |
1 | A wonderful little production. <br /><br />The... | positive |
2 | I thought this was a wonderful way to spend ti... | positive |
3 | Basically there's a family where a little boy ... | negative |
4 | Petter Mattei's "Love in the Time of Money" is... | positive |
... | ... | ... |
995 | Nothing is sacred. Just ask Ernie Fosselius. T... | positive |
996 | I hated it. I hate self-aware pretentious inan... | negative |
997 | I usually try to be professional and construct... | negative |
998 | If you like me is going to see this in a film ... | negative |
999 | This is like a zoology textbook, given that it... | negative |
1000 rows × 2 columns
df.review[0]
"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side."
import re
import string
import contractions
from textblob import TextBlob
def text_cleaning(mystr):
    mystr = mystr.lower()                                   # case folding
    mystr = re.sub(r'\w*\d\w*', '', mystr)                  # remove words containing digits
    mystr = re.sub('\n', ' ', mystr)                        # replace new line characters with space
    mystr = re.sub('[‘’“”…]', '', mystr)                    # remove curly quotes and ellipsis
    mystr = re.sub(r'<.*?>', '', mystr)                     # remove html tags
    mystr = re.sub(r'https?://\S+|www\.\S+', '', mystr)     # remove URLs
    mystr = ''.join([c for c in mystr if c not in string.punctuation])    # remove punctuation
    mystr = ' '.join([contractions.fix(word) for word in mystr.split()])  # expand contractions
    return mystr
df['r_cleaned'] = df['review'].apply(lambda x : text_cleaning(x))
df.head()
 | review | sentiment | r_cleaned |
---|---|---|---|
0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... |
1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... |
2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... |
3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... |
4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... |
from nltk.tokenize import word_tokenize
df['r_tokenized'] = df['r_cleaned'].apply(lambda x: word_tokenize(x))
df.head()
 | review | sentiment | r_cleaned | r_tokenized |
---|---|---|---|---|
0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, of, the, other, reviewers, has, mentione... |
1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [a, wonderful, little, production, the, filmin... |
2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [i, thought, this, was, a, wonderful, way, to,... |
3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... | [basically, there, is, a, family, where, a, li... |
4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | [petter, matteis, love, in, the, time, of, mon... |
import nltk
stop_words = nltk.corpus.stopwords.words('english')
def remove_stopwords(tokenized_text):
    new_words = [word for word in tokenized_text if word not in stop_words]
    return new_words
df['r_no_sw'] = df['r_tokenized'].apply(lambda token: remove_stopwords(token))
df.head()
 | review | sentiment | r_cleaned | r_tokenized | r_no_sw |
---|---|---|---|---|---|
0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, of, the, other, reviewers, has, mentione... | [one, reviewers, mentioned, watching, oz, epis... |
1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [a, wonderful, little, production, the, filmin... | [wonderful, little, production, filming, techn... |
2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [i, thought, this, was, a, wonderful, way, to,... | [thought, wonderful, way, spend, time, hot, su... |
3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... | [basically, there, is, a, family, where, a, li... | [basically, family, little, boy, jake, thinks,... |
4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | [petter, matteis, love, in, the, time, of, mon... | [petter, matteis, love, time, money, visually,... |
# join the tokens of pre-processed text
df['processed_reviews'] = df['r_no_sw'].apply(lambda x: ' '.join(x))
new_df = pd.concat([df['sentiment'], df['processed_reviews']], axis=1)
# save the resulting dataframe to a new csv file
new_df.to_csv('datasets/processed_imdb_reviews.csv', index=False)
new_df.head()
 | sentiment | processed_reviews |
---|---|---|
0 | positive | one reviewers mentioned watching oz episode ho... |
1 | positive | wonderful little production filming technique ... |
2 | positive | thought wonderful way spend time hot summer we... |
3 | negative | basically family little boy jake thinks zombie... |
4 | positive | petter matteis love time money visually stunni... |
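As a quick sanity check, the saved file can be read back to confirm the pre-processed reviews were written as expected:
check_df = pd.read_csv('datasets/processed_imdb_reviews.csv')
print(check_df.shape)    # (1000, 2)
check_df.head()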