%%capture
%load_ext autoreload
%autoreload 2
%cd ..
import statnlpbook.tokenization as tok
Before a program can process natural language, we need to identify the words that constitute a string of characters. This, in fact, can be seen as a crucial transformation step to improve the input representation of language in the structured prediction recipe.
By default, text on a computer is represented through String values. These values store a sequence of characters (nowadays mostly in UTF-8 format). The first step of an NLP pipeline is therefore to split the text into smaller units corresponding to the words of the language we are considering. In the context of NLP we often refer to these units as tokens, and the process of extracting these units is called tokenisation. Tokenisation is considered boring by most, but it is hard to overemphasise its importance: it is the first step in a long pipeline of NLP processors, and if we get this step wrong, all further steps will suffer.
In Python, a simple way to tokenise a text is via the split
method, which divides a text wherever a particular substring is found. In the code below this substring is simply the whitespace character, and this seems like a reasonable starting point for an English tokenisation approach.
text = "Mr. Bob Dobolina is thinkin' of a master plan." + \
"\nWhy doesn't he quit?"
text.split(" ")
['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan.\nWhy', "doesn't", 'he', 'quit?']
Python allows users to construct tokenisers using regular expressions that either define the character sequence patterns at which to split the text, or define what constitutes a token. In general, regular expressions are a powerful tool NLP practitioners can use when working with text, and they also come in handy when you work with command-line tools such as grep. In the code below we use the simple pattern \s,
which matches any whitespace character, to define where to split.
import re
gap = re.compile(r'\s')
gap.split(text)
['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan.', 'Why', "doesn't", 'he', 'quit?']
One shortcoming of this tokenisation is its treatment of punctuation, because it considers "plan." as a single token whereas ideally we would prefer "plan" and "." to be distinct tokens. It is easier to address this problem if we define what a token is, instead of what constitutes a gap. Below we define tokens as sequences of alphanumeric characters or individual punctuation characters.
token = re.compile(r'\w+|[.?:]')
token.findall(text)
['Mr', '.', 'Bob', 'Dobolina', 'is', 'thinkin', 'of', 'a', 'master', 'plan', '.', 'Why', 'doesn', 't', 'he', 'quit', '?']
This still isn't perfect, as "Mr." is split into two tokens when it should be a single token. Moreover, we have actually lost the apostrophes in "thinkin'" and "doesn't". Both issues are fixed below, although we now fail to break up the contraction "doesn't".
token = re.compile(r"Mr\.|[\w']+|[.?]")
tokens = token.findall(text)
tokens
['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan', '.', 'Why', "doesn't", 'he', 'quit', '?']
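If we also wanted to break up "doesn't" into "does" and "n't", in the style of Penn Treebank tokenisation, one option is to peel the clitic off with a lookahead. The pattern below (and the name contraction) is only a sketch of this idea, not a complete contraction handler:
contraction = re.compile(r"Mr\.|\w+(?=n't)|n't|[\w']+|[.?]")
contraction.findall(text)
['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan', '.', 'Why', 'does', "n't", 'he', 'quit', '?']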
For most English domains, powerful and robust tokenisers can be built using the simple pattern-matching approach shown above. However, in languages such as Japanese, words are not separated by whitespace, and this makes tokenisation substantially more challenging. Try, for example, to find a good generic regular expression pattern to tokenise the following sentence.
jap = "彼は音楽を聞くのが大好きです"
re.compile('彼|は|く|音楽|を|聞くの|が|大好き|です').findall(jap)
['彼', 'は', '音楽', 'を', '聞くの', 'が', '大好き', 'です']
Even for certain English domains, such as biomedical papers, tokenisation is non-trivial (see an analysis of why here).
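As a quick, made-up illustration of the kind of problem that arises there, consider how the pattern from above treats a decimal number and a hyphenated entity name (the example sentence is ours):
bio = "The patient's pH was 7.4 after IL-2 treatment."
token.findall(bio)
['The', "patient's", 'pH', 'was', '7', '.', '4', 'after', 'IL', '2', 'treatment', '.']
The measurement "7.4" is torn apart, and the hyphen in "IL-2" is silently dropped.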
When tokenisation is more challenging and difficult to capture in a few rules, a machine-learning-based approach can be useful. In a nutshell, we can treat the tokenisation problem as a character classification problem, or, if needed, as a sequential labelling problem.
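To make this framing concrete, here is a toy sketch in which a hand-written function stands in for the character-level classifier; in a real system this decision would be learned from annotated data rather than hard-coded (both function names below are ours):
def is_token_start(s, i):
    # Stand-in "classifier": decides whether a new token starts at position i.
    # A learned tokeniser would replace these rules with a trained model.
    if s[i].isspace():
        return False
    return i == 0 or s[i - 1].isspace() or s[i].isalnum() != s[i - 1].isalnum()

def classify_tokenise(s):
    # Collect the predicted token start positions and cut the string at them.
    starts = [i for i in range(len(s)) if is_token_start(s, i)]
    return [s[start:end].strip() for start, end in zip(starts, starts[1:] + [len(s)])]

classify_tokenise("Mr. Bob Dobolina is thinkin' of a master plan.")
['Mr', '.', 'Bob', 'Dobolina', 'is', 'thinkin', "'", 'of', 'a', 'master', 'plan', '.']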
Many NLP tools work on a sentence-by-sentence basis. The next preprocessing step is hence to segment streams of tokens into sentences. In most cases this is straightforward after tokenisation, because we only need to split sentences at sentence-ending punctuation tokens.
However, keep in mind that, just like tokenisation, sentence segmentation is language-specific: not all languages use punctuation to denote sentence boundaries, and even when they do, not all segmentation decisions are trivial (can you think of examples?).
tok.sentence_segment(re.compile(r'\.'), tokens)
[['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan', '.'], ['Why', "doesn't", 'he', 'quit', '?']]
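The sentence_segment helper used above can be implemented in a few lines; the sketch below is a simplified re-implementation for illustration, not necessarily the book's actual code. It starts a new sentence after every token that matches the given end-of-sentence pattern:
def simple_sentence_segment(match_regex, tokens):
    # Append tokens to the current sentence; after a token that matches the
    # end-of-sentence pattern, start a new (initially empty) sentence.
    sentences = [[]]
    for t in tokens:
        sentences[-1].append(t)
        if match_regex.match(t):
            sentences.append([])
    # Drop a trailing empty sentence if the text ended on a boundary token.
    if not sentences[-1]:
        sentences.pop()
    return sentences

simple_sentence_segment(re.compile(r'\.'), tokens)
This reproduces the segmentation above; note that match anchors at the start of the token, which is why "Mr." does not trigger a sentence break here.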