Information Retrieval Lab: Text Analysis

The text analysis step extracts from a document’s text the keys by which it can be looked up in the index. Two kinds of keys are distinguished:

  • Index terms (terms, for short)
  • Features

In this notebook, we will first implement tokenization, stopword removal, and simple stemming in pure Python, to give you some more practice with what you have learned in the first two notebooks. Then, we will explore the text analysis capabilities of the spacy package, which will be used in future exercises.

Let's start off with this small document collection:

In [1]:
documents = [
    'This is the first document.',
    'This is the second document.',
    'And the third one.',
    'Is this the first document?',
    'Another document to test if tokens are stemmed correctly.'
]

Tokenization

Tokenization turns a text string into a sequence of tokens. For now, we choose a very simple tokenization strategy: separate the input string at punctuation characters and whitespace.

Exercise: Implement a method that turns a given string into lowercase words, according to the tokenization strategy defined above.

In [2]:
import re

def tokenize(document: str) -> list:
    """
    Tokenizes a string into runs of word characters and single punctuation/special characters.
    Punctuation and special characters are retained as separate tokens.
    :param document: a string representation of a sentence
    :return: a list of lowercase string tokens representing the sentence, including punctuation
    """
    tokens = re.findall(r"\w+|[^\w\s]", document, re.UNICODE)
    tokens = list(map(str.lower, tokens))
    return tokens

tokenized_documents = list(map(tokenize, documents))
tokenized_documents
Out[2]:
[['this', 'is', 'the', 'first', 'document', '.'],
 ['this', 'is', 'the', 'second', 'document', '.'],
 ['and', 'the', 'third', 'one', '.'],
 ['is', 'this', 'the', 'first', 'document', '?'],
 ['another',
  'document',
  'to',
  'test',
  'if',
  'tokens',
  'are',
  'stemmed',
  'correctly',
  '.']]

Question: Why is this simple tokenization scheme insufficient? What are its shortcomings, and where might it fail?
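
As a hint, consider what the tokenizer does to contractions, abbreviations, and hyphenated compounds (the example sentence below is our own, not part of the collection):

tokenize("Don't split U.S.A. or state-of-the-art!")
# yields ['don', "'", 't', 'split', 'u', '.', 's', '.', 'a', '.',
#         'or', 'state', '-', 'of', '-', 'the', '-', 'art', '!']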

Stopping

Stopping, also called stop word removal, discards a selection of words from the set of index terms of a document. Commonly, stopwords include function words (for example, "the", "of", "to", "for") and highly frequent words in the document collection or language (for example, “Wikipedia” appears on every Wikipedia page).

You can download a list of English stopwords from the course page. We put it in the ../data directory for now.

In [3]:
with open("../data/stopwords.txt", "r") as stopword_file:
    stopwords = stopword_file.readlines()
    stopwords = list(map(str.strip, stopwords))
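
If you do not have the file at hand, a small hand-written fallback list (our own choice, not the course-supplied one) is enough to follow along with this notebook:

# fallback, used only if ../data/stopwords.txt is not available
stopwords = ["a", "an", "and", "are", "if", "is", "the", "this", "to"]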

Exercise: Implement a function that removes stopwords from the token list produced by the tokenization step. Use the stopword list loaded before.

In [4]:
def remove_stopwords(tokens: list, stopwords: list) -> list:
    """
    Removes stopwords from list of tokens, according to a supplied list of words to remove
    :param tokens: a list of string tokens
    :param stopwords: a list of words to remove
    :return: a list of string tokens with stopwords removed
    """
    return list(filter(lambda t: t not in stopwords, tokens))

stopped_documents = list(map(lambda d: remove_stopwords(d, stopwords), tokenized_documents))
stopped_documents
Out[4]:
[['first', 'document'],
 ['second', 'document'],
 ['third', 'one'],
 ['first', 'document'],
 ['another', 'document', 'test', 'tokens', 'stemmed', 'correctly']]

Question: Take a look at the supplied stop word list. Are there words you would add, and why? Why do you think removing stopwords is beneficial to retrieve information from a corpus of documents?
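
One way to make the second part of this question concrete is to count how often each token occurs across the tokenized collection; function words and punctuation dominate the counts while carrying little meaning. A quick sketch using the tokenized_documents from above:

from collections import Counter

# count token occurrences across all tokenized documents
token_counts = Counter(token for doc in tokenized_documents for token in doc)
token_counts.most_common(5)
# highly frequent but uninformative tokens such as 'the' and '.' top the list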

Stemming

Stemming aims at reducing inflected index terms to a common stem. For example, “statistics”, “statistic”, and “statistical” refer to basically the same abstract concept and should be mapped to one single term, likely "statistic".

The upside of stemming/lemmatization is an increased chance of finding a document even when it uses inflected or derived word forms that differ from those in the query.

Stemming is a complex subject, but a very basic approach is suffix stripping: simply removing common (English) inflection suffixes.

Exercise: Write a simple stemming function that applies the following three rules:

  • if the word ends in 'ed', remove the 'ed'
  • if the word ends in 'ing', remove the 'ing'
  • if the word ends in 'ly', remove the 'ly'
In [5]:
def stem(tokens: list) -> list:
    """
    Stems each token in a list of tokens by applying rule-based suffix stemming.
    :param tokens: a list of string tokens
    :return: a list of stemmed string tokens
    """
    res = []
    for token in tokens:
        if token[-2:] == "ed":
            res.append(token[:-2])
        elif token[-3:] == "ing":
            res.append(token[:-3])
        elif token[-2:] == "ly":
            res.append(token[:-2])
        else:
            res.append(token)
    return res

stemmed_documents = list(map(lambda d: stem(d), stopped_documents))
stemmed_documents
Out[5]:
[['first', 'document'],
 ['second', 'document'],
 ['third', 'one'],
 ['first', 'document'],
 ['another', 'document', 'test', 'tokens', 'stemm', 'correct']]

Question: Think of some examples where this simple stemming scheme does not work. Can you extend the rule set above to also capture these cases? What other ways of stemming, besides prefix/suffix removal, can you come up with?
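
For comparison, established rule-based stemmers such as the Porter stemmer cover far more suffix patterns than our three rules. A small sketch using NLTK (assuming the nltk package is installed; it is not needed anywhere else in this lab):

from nltk.stem import PorterStemmer

porter = PorterStemmer()
[porter.stem(word) for word in ["statistics", "statistical", "stemmed", "correctly", "running"]]
# 'statistics' and 'statistical' are mapped to the same stem here, which our three rules cannot do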

Wrapper Function

Now that we have our basic three components of text preprocessing, we can wrap them into a single function, which takes a document as (string) input, and returns the list of preprocessed terms, ready for indexing.

In [6]:
def preprocess(document):
    """
    Converts a string into a list of segmented, stopped, and stemmed tokens.
    :param document: the input string
    :return: a list of processed tokens
    """
    tokens = tokenize(document)
    stopped = remove_stopwords(tokens, stopwords)
    stemmed = stem(stopped)
    return stemmed
In [7]:
list(map(preprocess, documents))
Out[7]:
[['first', 'document'],
 ['second', 'document'],
 ['third', 'one'],
 ['first', 'document'],
 ['another', 'document', 'test', 'tokens', 'stemm', 'correct']]

Advanced Text Analysis

The three simple techniques implemented above are quite limited: words are not always separable at whitespace, our stopword list is rather small, and the stemming technique we implemented recognizes only a very limited set of possible word inflections.

A widely used tool for advanced text analysis is the spacy package. Go ahead and install the package (and the en_core_web_sm language model) by following the installation instructions for your environment.
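
Depending on your environment, installing the package and the language model from within a notebook typically looks something like this (the exact commands may differ, for example in conda-based setups):

!pip install spacy
!python -m spacy download en_core_web_sm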

In [8]:
import spacy
nlp = spacy.load("en_core_web_sm")

The surface API of spacy is really simple: we can call the nlp model defined above on any string, and spacy will automatically extract different kinds of information from it.

In [9]:
doc = nlp("The quick brown fox jumps over the lazy dog.")
In [10]:
# this is a helper function to nicely visualize the annotated information right in your notebook
from spacy import displacy

# Show the dependency parsing result
displacy.render(doc, style="dep", jupyter=True)
[displacy renders the dependency parse as an inline graphic: The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog./NOUN, connected by arcs labeled det, amod, amod, nsubj, prep, det, amod, and pobj.]

You can access the annotations made by spacy by accessing the attributes of individual tokens in your analyzed document. For a full list of the available information, refer to the spacy API docs.

Exercise: Use spacy to print the text, lemma, POS tag, and whether it is a stopword or not for each token in the document from above.

In [11]:
for token in doc:
    print(token.text, token.lemma_, token.is_stop, token.pos_)
The the True DET
quick quick False ADJ
brown brown False ADJ
fox fox False NOUN
jumps jump False VERB
over over True ADP
the the True DET
lazy lazy False ADJ
dog dog False NOUN
. . False PUNCT

Exercise: Implement the wrapper function from before, but this time, use spacy to analyze the text.

In [12]:
def preprocess_spacy(document):
    """
    Converts a string into a list of tokenized, stopped, and lemmatized tokens using spacy.
    :param document: the input string
    :return: a list of processed tokens
    """
    return [token.lemma_ for token in nlp(document) if not (token.is_stop or token.is_punct)]
In [13]:
list(map(preprocess_spacy, documents))
Out[13]:
[['document'],
 ['second', 'document'],
 [],
 ['document'],
 ['document', 'test', 'token', 'stem', 'correctly']]

Question: Compare the results of both implementations. Try out different sentences - can you spot significant differences between both token lists?
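
As a starting point, run both pipelines on the same input and compare how stopping and stemming vs. lemmatization differ (the example sentence below is our own):

sentence = "The studies were running surprisingly quickly."
print(preprocess(sentence))
print(preprocess_spacy(sentence))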