%%capture
%load_ext autoreload
%autoreload 2
%cd ..
import statnlpbook.tokenization as tok
In Python you can tokenize a text via split:
text = """Mr. Bob Dobolina is thinkin' of a master plan.
Why doesn't he quit?"""
text.split(" ")
['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan.\nWhy', "doesn't", 'he', 'quit?']
Why is this suboptimal? Splitting only on the space character leaves "plan.\nWhy" as a single token, and punctuation stays attached to the words ("plan.", "quit?").
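Note that calling split with no argument splits on any run of whitespace, including newlines, which at least fixes the merged "plan.\nWhy" token (the punctuation problem remains):

```python
text = """Mr. Bob Dobolina is thinkin' of a master plan.
Why doesn't he quit?"""

# split() with no argument splits on any whitespace run, including "\n"
print(text.split())
```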
Python allows users to construct tokenizers using regular expressions that define the patterns at which to split tokens.
A regular expression is a compact definition of a set of (character) sequences.
Examples:
"Mr."
: set containing only "Mr."
" |\n|!!!"
: set containing the sequences " "
, "\n"
and "!!!"
"[abc]"
: set containing only the characters a
, b
and c
"\s"
: set of all whitespace characters"1+"
: set of all sequences of at least one "1"
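To make the set interpretation concrete, here is a small sketch checking a few of these patterns with fullmatch (a string is "in" the set exactly when the pattern fully matches it):

```python
import re

# "[abc]" matches exactly one of the characters a, b, c
assert re.fullmatch('[abc]', 'b')
assert re.fullmatch('[abc]', 'd') is None

# "1+" matches one or more "1"s, but not the empty string
assert re.fullmatch('1+', '111')
assert re.fullmatch('1+', '') is None

# r"\s" matches a single whitespace character, such as "\n"
assert re.fullmatch(r'\s', '\n')
```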
import re
re.compile(r'\s').split(text)
['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan.', 'Why', "doesn't", 'he', 'quit?']
Problems: punctuation still sticks to the preceding word ("plan.", "quit?"), so "plan" and "plan." would be treated as distinct words.
Let us use findall instead:
re.compile(r'\w+|[.?]').findall(text)
['Mr', '.', 'Bob', 'Dobolina', 'is', 'thinkin', 'of', 'a', 'master', 'plan', '.', 'Why', 'doesn', 't', 'he', 'quit', '?']
Problems: the abbreviation "Mr." is split into "Mr" and ".", and the contraction "doesn't" is broken into "doesn" and "t".
Both are fixed below ...
re.compile(r'Mr.|[\w\']+|[.?]').findall(text)
['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan', '.', 'Why', "doesn't", 'he', 'quit', '?']
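One caveat not addressed above: the dot in the "Mr." alternative is unescaped, so it matches any character, including a space; escaping it as r'Mr\.' restricts it to the literal abbreviation:

```python
import re

# unescaped: "." matches any character, so "Mr " (with the space) is consumed
print(re.compile(r'Mr.|[\w\']+|[.?]').findall("Mr Bob"))   # ['Mr ', 'Bob']

# escaped: "Mr\." only matches the literal "Mr.", so "Mr" is left to [\w']+
print(re.compile(r'Mr\.|[\w\']+|[.?]').findall("Mr Bob"))  # ['Mr', 'Bob']
```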
jap = "今日もしないといけない。"  # "(I) have to do (it) today as well."
Try lexicon-based tokenization ...
re.compile('もし|今日|も|しない|と|けない').findall(jap)
['今日', 'もし', 'と', 'けない']
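The result drops characters because findall tries the alternatives in order and silently skips anything no alternative matches: もし ("if") fires before も and しない get a chance, and the い before けない is discarded. Reordering the lexicon entries (and using いけない rather than けない, a choice made here only for illustration) recovers the intended segmentation:

```python
import re

jap = "今日もしないといけない。"
# alternatives are tried left to right at each position, so the
# single-character word も must not be shadowed by a longer entry
# (like もし) that mis-segments the text
print(re.compile('今日|しない|と|いけない|も').findall(jap))
```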
Equally complex for certain English domains (e.g. bio-medical text).
bio = """We developed a nanocarrier system of herceptin-conjugated nanoparticles
of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin
prodrug ..."""
re.compile(r'\s').split(bio)[:15]
['We', 'developed', 'a', 'nanocarrier', 'system', 'of', 'herceptin-conjugated', 'nanoparticles', 'of', 'd-alpha-tocopheryl-co-poly(ethylene', 'glycol)', '1000', 'succinate', '(TPGS)-cisplatin', 'prodrug']
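A rough way to break up such terms (a sketch only; the right granularity is task-dependent) is to emit every non-alphanumeric, non-space character as its own token:

```python
import re

bio_term = "d-alpha-tocopheryl-co-poly(ethylene glycol)"
# \w+ grabs alphanumeric runs; [^\w\s] emits each remaining
# punctuation character (hyphens, parentheses) as a separate token
print(re.findall(r"\w+|[^\w\s]", bio_term))
```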
Solution: Treat tokenization as a statistical NLP problem (and as structured prediction)!
tokens = re.compile(r'Mr.|[\w\']+|[.?]').findall(text)
# try different regular expressions
tok.sentence_segment(re.compile(r'\.'), tokens)
[['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master', 'plan', '.'], ['Why', "doesn't", 'he', 'quit', '?']]
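For reference, sentence_segment can be sketched roughly as follows (a hypothetical reimplementation consistent with the output above, not the statnlpbook library's actual code): start a new sentence after every token that fully matches the end-of-sentence pattern.

```python
import re

def sentence_segment(match_regex, tokens):
    """Group tokens into sentences, ending a sentence whenever
    a token fully matches the given end-of-sentence pattern."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if match_regex.fullmatch(token):
            sentences.append(current)
            current = []
    if current:  # any trailing tokens form a final sentence
        sentences.append(current)
    return sentences

print(sentence_segment(re.compile(r'\.'), ['Mr.', 'plan', '.', 'Why', '?']))
```

Note that "Mr." does not end a sentence here, because fullmatch requires the whole token to be a single period.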
What do you do with transcribed speech?