The text analysis step extracts from a document's text the keys by which it can be looked up in the index.
In this notebook, we will first implement tokenization, stopword removal, and simple stemming in pure Python, to give you some more practice on what you have learned in the first two notebooks. Then, we will explore the text analysis capabilities of the spacy
package, which will be used in future exercises.
Let's start off with this small document collection:
documents = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one, with a comma.',
    'Is this the first document?',
]
Tokenization turns a text string into a sequence of tokens. For now, we choose a very simple tokenization strategy: separate the input string at punctuation characters and whitespace.
Exercise: Implement a method that turns a given string into lowercase words, according to the tokenization strategy defined above.
import re
def tokenize(document):
"""
Tokenizes a string by splitting it at whitespace and punctuation characters (one or more)
:param document: a string representation of a sentence
:return: a list of lowercase string tokens, with punctuation stripped
"""
# your code here
return []
Apply your tokenize function to the small document collection from above.
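One possible implementation (a sketch; other regex-based approaches work equally well) splits on runs of non-word characters:

```python
import re

def tokenize(document):
    """Split a string at whitespace and punctuation, returning lowercase tokens."""
    # \W+ matches one or more non-word characters (whitespace and punctuation);
    # empty strings from leading/trailing delimiters are filtered out
    return [token for token in re.split(r"\W+", document.lower()) if token]
```

Applied to the collection, this gives one token list per document, for example via `tokenized_documents = list(map(tokenize, documents))`.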
Question: why is this simple tokenization scheme insufficient? What are its shortcomings and where might it fail?
Stopping, also called stopword removal, discards a selection of words from the set of index terms of a document. Commonly, stopwords include function words (for example, "the", "of", "to", "for") and highly frequent words in the document collection or language (for example, "Wikipedia" appears on every Wikipedia page).
You can download a list of English stopwords from the course page. We put it in a ../data
directory for now.
with open("../data/stopwords.txt", "r") as stopword_file:
stopwords = stopword_file.readlines()
stopwords = list(map(str.strip, stopwords))
Exercise: Implement a function that removes stopwords from the token list produced by the tokenization step. Use the stopword list loaded before.
def remove_stopwords(tokens, stopwords):
"""
Removes stopwords from list of tokens, according to a supplied list of words to remove
:param tokens: a list of string tokens
:param stopwords: a list of words to remove
:return: a list of string tokens with the stopwords removed
"""
# your code here
return []
Apply your stopword removal function to the tokenized document collection.
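A possible solution (a sketch) keeps every token that does not appear in the stopword list; converting the list to a set first makes each membership test constant-time:

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens with all words from the stopwords collection removed."""
    stopword_set = set(stopwords)  # set membership tests are O(1)
    return [token for token in tokens if token not in stopword_set]
```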
Question: Take a look at the supplied stopword list. Are there words you would add, and why? Why do you think removing stopwords is beneficial when retrieving information from a corpus of documents?
Stemming aims at reducing inflected index terms to a common stem. For example, “statistics”, “statistic”, and “statistical” refer to basically the same abstract concept and should be mapped to one single term, likely "statistic".
The upside of stemming / lemmatization is an increased chance of finding a document whose grammar or derived word forms differ from those used in a query.
Stemming is a complex subject, but a very basic approach is suffix stripping: simply removing common (english) inflection suffixes.
Exercise: Write a simple stemming function that applies the following three rules:
def stem(tokens):
"""
Stems each token in a list of tokens by applying rule-based suffix stemming.
:param tokens: a list of string tokens
:return: a list of stemmed string tokens
"""
# your code here
return []
Apply your stemming function to the document collection.
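A sketch of a rule-based suffix stripper is shown below. The suffix set here (-ing, -ed, -s, with a minimum remaining stem length) is an illustrative assumption; the three rules specified in the exercise may differ:

```python
# Hypothetical suffix rules for illustration; the exercise may specify different ones.
SUFFIXES = ("ing", "ed", "s")

def stem(tokens):
    """Strip at most one matching suffix from each token."""
    stemmed = []
    for token in tokens:
        for suffix in SUFFIXES:
            # only strip if a reasonably long stem (more than 2 characters) remains
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break  # apply at most one rule per token
        stemmed.append(token)
    return stemmed
```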
stemmed_documents = list(map(stem, stopped_documents))
stemmed_documents
Question: Think of some examples where this simple stemming scheme does not work. Can you extend the rule set above to capture these cases as well? What other ways of stemming, besides prefix/suffix removal, can you come up with?
Now that we have our basic three components of text preprocessing, we can wrap them into a single function, which takes a document as (string) input, and returns the list of preprocessed terms, ready for indexing.
def preprocess(document):
"""
Converts a string into a list of segmented, stopped, and stemmed tokens.
:param document: the input string
:return: a list of processed tokens
"""
# your code here
return []
Apply the complete preprocessing to the original document collection.
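The combined pipeline might look like the sketch below. In the notebook you would simply chain the tokenize, remove_stopwords, and stem functions with the loaded stopword list; here, minimal stand-ins (with a tiny illustrative stopword subset and the hypothetical suffix rules) are inlined so the sketch is self-contained:

```python
import re

# tiny illustrative stopword subset; the notebook uses the list loaded from stopwords.txt
STOPWORDS = {"this", "is", "the", "and", "a", "with", "one"}
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix rules, as above

def preprocess(document):
    """Tokenize, stop, and stem a document string."""
    tokens = [t for t in re.split(r"\W+", document.lower()) if t]
    tokens = [t for t in tokens if t not in STOPWORDS]
    stemmed = []
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed
```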
The three simple techniques implemented above are quite limited: words are not always separable at whitespace, our stopword list is rather small, and the stemming rules only cover a very limited set of possible word inflections.
A widely used tool for advanced text analysis is the spacy package. Go ahead and install the package (and the en_core_web_sm language model) by following the installation instructions for your environment.
import spacy
nlp = spacy.load("en_core_web_sm")
The surface API of spacy is really simple: we can call the nlp model defined above on any string, and spacy will automatically extract different kinds of information for you.
doc = nlp("The quick brown fox jumps over the lazy dog.")
# this is a helper function to nicely visualize the annotated information right in your notebook
from spacy import displacy
# Show the dependency parsing result
displacy.render(doc, style="dep", jupyter=True)
You can access the annotations made by spacy through the attributes of individual tokens in your analyzed document. For a full list of the available information, refer to the spacy API docs.
Exercise: Use spacy to print the text, lemma, POS tag, and whether it's a stopword or not for each token in the document from above (doc).
Exercise: Implement the wrapper function from before, but this time, use spacy to analyze the text.
def preprocess_spacy(document):
"""
Converts a string into a list of segmented, stopped, and stemmed tokens.
:param document: the input string
:return: a list of processed tokens
"""
# your code here
return []
Apply the spacy-based preprocessing function to the document collection.
Question: Compare the results of both implementations. Try out sentences other than the ones in the small text collection. Can you spot significant differences between the two token lists?