In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # CPU
os.environ['DISABLE_V2_BEHAVIOR'] = '1' # disable V2 Behavior - required for NER in TF2 right now

ShallowNLP Tutorial

The ShallowNLP module in ktrain is a small collection of text-analytic utilities to help analyze text data in English, Chinese, Russian, and other languages. All methods in ShallowNLP run on a normal laptop CPU - no GPU is required - making it well-suited to those with minimal computational resources.

Let's begin by importing the shallownlp module.

In [2]:
from ktrain.text import shallownlp as snlp
Using DISABLE_V2_BEHAVIOR with TensorFlow
using Keras version: 2.2.4-tf

SECTION 1: Ready-to-Use Named Entity Recognition

ShallowNLP includes pre-trained Named Entity Recognition (NER) for English, Chinese, and Russian.

English NER

Extracting entities from:

Xuetao Cao was head of the Chinese Academy of Medical Sciences and is the current president of Nankai University.

In [3]:
ner = snlp.NER('en')
text = """
Xuetao Cao was head of the Chinese Academy of Medical Sciences and is 
the current president of Nankai University.
"""
ner.predict(text)
Out[3]:
[('Xuetao Cao', 'PER'),
 ('Chinese Academy of Medical Sciences', 'ORG'),
 ('Nankai University', 'ORG')]

The ner.predict method automatically merges tokens by entity. To see the unmerged results, set merge_tokens=False:

In [4]:
ner.predict(text, merge_tokens=False)
Out[4]:
[('Xuetao', 'B-PER'),
 ('Cao', 'I-PER'),
 ('was', 'O'),
 ('head', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Chinese', 'B-ORG'),
 ('Academy', 'I-ORG'),
 ('of', 'I-ORG'),
 ('Medical', 'I-ORG'),
 ('Sciences', 'I-ORG'),
 ('and', 'O'),
 ('is', 'O'),
 ('the', 'O'),
 ('current', 'O'),
 ('president', 'O'),
 ('of', 'O'),
 ('Nankai', 'B-ORG'),
 ('University', 'I-ORG'),
 ('.', 'O')]

The ner.predict method typically operates on single sentences, as in the example above. For multi-sentence documents, sentences can be extracted with snlp.sent_tokenize:

In [5]:
document = """Paul Newman is a great actor.  Tommy Wiseau is not."""
sents = []
for idx, sent in enumerate(snlp.sent_tokenize(document)):
    sents.append(sent)
    print('sentence #%d: %s' % (idx+1, sent))
sentence #1: Paul Newman is a great actor .
sentence #2: Tommy Wiseau is not .
In [6]:
ner.predict(sents[0])
Out[6]:
[('Paul Newman', 'PER')]
In [7]:
ner.predict(sents[1])
Out[7]:
[('Tommy Wiseau', 'PER')]
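
Putting the two steps together, NER can be applied to an entire document by sentence-tokenizing it first. A minimal sketch using only the functions shown above (each ner.predict call returns a list of (entity, tag) pairs):

# sketch: run NER over every sentence of a document and collect the entities
document = """Paul Newman is a great actor.  Tommy Wiseau is not."""
entities = []
for sent in snlp.sent_tokenize(document):
    entities.extend(ner.predict(sent))  # append this sentence's (entity, tag) pairs
print(entities)  # expected: [('Paul Newman', 'PER'), ('Tommy Wiseau', 'PER')]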

Chinese NER

Extracting entities from the Chinese translation of:

Xuetao Cao was head of the Chinese Academy of Medical Sciences and is the current president of Nankai University.

In [8]:
ner = snlp.NER('zh')
ner.predict('曹雪涛曾任中国医学科学院院长,现任南开大学校长。')
Out[8]:
[('曹雪涛', 'PER'), ('中国医学科学院', 'ORG'), ('南开大学', 'ORG')]

Discovered entities with English translations:

  • 曹雪涛 = Cao Xuetao (PER)
  • 中国医学科学院 = Chinese Academy of Medical Sciences (ORG)
  • 南开大学 = Nankai University (ORG)

The snlp.sent_tokenize function can also be used with Chinese documents:

In [9]:
document = """这是关于史密斯博士的第一句话。第二句话是关于琼斯先生的。"""  # This is the first sentence about Dr. Smith. The second sentence is about Mr. Jones.
for idx, sent in enumerate(snlp.sent_tokenize(document)):
    print('sentence #%d: %s' % (idx+1, sent))
sentence #1: 这是关于史密斯博士的第一句话。
sentence #2: 第二句话是关于琼斯先生的。

Russian NER

Extracting entities from the Russian translation of:

Katerina Tikhonova, the youngest daughter of Russian President Vladimir Putin, was appointed head of a new artificial intelligence institute at Moscow State University.

In [10]:
ner = snlp.NER('ru')
russian_sentence = """Катерина Тихонова, младшая дочь президента России Владимира Путина, 
была назначена руководителем нового института искусственного интеллекта в МГУ."""
ner.predict(russian_sentence)
Out[10]:
[('Катерина Тихонова', 'PER'),
 ('России', 'LOC'),
 ('Владимира Путина', 'PER'),
 ('МГУ', 'ORG')]

Discovered entities with English translations:

  • Катерина Тихонова = Katerina Tikhonova (PER)
  • России = Russia (LOC)
  • Владимира Путина = Vladimir Putin (PER)
  • МГУ = Moscow State University (ORG)

SECTION 2: Text Classification

ShallowNLP makes it easy to build a text classifier with minimal computational resources. ShallowNLP includes the following sklearn-based text classification models: a non-neural version of NBSVM, Logistic Regression, and Linear SVM with SGD training (SGDClassifier). Logistic regression is the default classifier. For these examples, we will use NBSVM.
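
The model type is selected with the ctype argument when fitting. A minimal sketch of choosing among the three models - the 'nbsvm' and 'logreg' strings appear in this tutorial, while the 'sgdclassifier' string for the SGD-trained linear SVM is an assumption:

# sketch: selecting a model type via ctype (x_train/y_train as loaded below)
clf = snlp.Classifier().fit(x_train, y_train)                 # logistic regression (the default)
clf = snlp.Classifier().fit(x_train, y_train, ctype='nbsvm')  # non-neural NBSVM
clf = snlp.Classifier().fit(x_train, y_train, ctype='sgdclassifier')  # assumed ctype for the SGD-trained linear SVM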

A classifier can be trained with minimal effort for both English and Chinese.

English Text Classification

We'll use the IMDb movie review dataset (available at https://ai.stanford.edu/~amaas/data/sentiment/) to build a sentiment analysis model for English.
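
The load_texts_from_folder method used in the next cell appears to expect one subfolder of plain-text files per class label - the label names it returns below ('neg' and 'pos') match the IMDb subfolder names. The assumed layout:

aclImdb/
├── train/
│   ├── neg/   # one .txt file per negative review
│   └── pos/   # one .txt file per positive review
└── test/
    ├── neg/
    └── pos/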

In [11]:
datadir = r'/home/amaiya/data/aclImdb'
(x_train,  y_train, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')
(x_test,  y_test, _) = snlp.Classifier.load_texts_from_folder(datadir+'/test', shuffle=False)
print('label names: %s' % (label_names))
clf = snlp.Classifier().fit(x_train, y_train, ctype='nbsvm')
print('validation accuracy: %s%%' % (round(clf.evaluate(x_test, y_test)*100, 2)))
pos_text = 'I loved this movie because it was hilarious.'
neg_text = 'I hated this movie because it was boring.'
print('prediction for "%s": %s (pos)' % (pos_text, clf.predict(pos_text)))
print('prediction for "%s": %s (neg)' % (neg_text, clf.predict(neg_text)))
label names: ['neg', 'pos']
validation accuracy: 92.03%
prediction for "I loved this movie because it was hilarious.": 1 (pos)
prediction for "I hated this movie because it was boring.": 0 (neg)

Chinese Text Classification

We'll use the ChnSentiCorp hotel review dataset to build a sentiment analysis model for Chinese.

In [12]:
datadir = '/home/amaiya/data/ChnSentiCorp_htl_ba_6000'
(texts,  labels, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')
print('label names: %s' % (label_names))
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.1, random_state=42)
clf = snlp.Classifier().fit(x_train, y_train, ctype='nbsvm')
print('validation accuracy: %s%%' % (round(clf.evaluate(x_test, y_test)*100, 2)))
pos_text = '我喜欢这家酒店,因为它很干净。'  # I loved this hotel because it was very clean.
neg_text = '我讨厌这家酒店,因为它很吵。'  # I hated this hotel because it was noisy.
print('prediction for "%s": %s' % (pos_text, clf.predict(pos_text)))
print('prediction for "%s": %s' % (neg_text, clf.predict(neg_text)))
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
detected encoding: GB18030
Decoding with GB18030 failed 1st attempt - using GB18030 with skips
skipped 118 lines (0.3%) due to character decoding errors
label names: ['neg', 'pos']
Loading model cost 0.640 seconds.
Prefix dict has been built succesfully.
validation accuracy: 91.55%
prediction for "我喜欢这家酒店,因为它很干净。": 1
prediction for "我讨厌这家酒店,因为它很吵。": 0

Tuning Hyperparameters of a Text Classifier

The hyperparameters of a particular classifier can be tuned using the grid_search method. Let's tune the C hyperparameter of a Logistic Regression model to find the best value for this dataset.

In [14]:
# setup data
datadir = r'/home/amaiya/data/aclImdb'
(x_train,  y_train, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')
(x_test,  y_test, _) = snlp.Classifier.load_texts_from_folder(datadir+'/test', shuffle=False)

# initialize a model to optimize
clf = snlp.Classifier()
clf.create_model('logreg', x_train)

# create parameter space for values of C
parameters = {'clf__C': (1e0, 1e-1, 1e-2)}

# tune
clf.grid_search(parameters, x_train[:5000], y_train[:5000], n_jobs=-1)
clf__C: 1.0

It looks like a value of 1.0 is best. We can then re-create the model with this hyperparameter value and proceed to train normally:

clf.create_model('logreg', x_train, hp_dict={'C':1.0})
clf.fit(x_train, y_train)
clf.evaluate(x_test, y_test)

SECTION 3: Examples of Searching Text

Here we will show some simple searches over multi-language documents.

In [15]:
document1 = """
Hello there,

Hope this email finds you well.

Are you available to talk about our meeting?

If so, let us plan to schedule the meeting
at the Hefei National Laboratory for Physical Sciences at the Microscale.

As I always say: живи сегодня надейся на завтра

Sincerely,
John Doe
合肥微尺度国家物理科学实验室
"""

document2 = """
This is a random document with Arabic about our meeting.

عش اليوم الأمل ليوم غد

Bye for now.
"""

docs = [document1, document2]

(The Russian phrase живи сегодня надейся на завтра and the Arabic phrase عش اليوم الأمل ليوم غد both mean, roughly, "live today, hope for tomorrow.")

Searching English

The search function returns a list of documents that match the query. Each entry shows:

  1. the ID of the document
  2. the query (multiple queries can be supplied in a list, if desired)
  3. the number of word hits in the document
In [16]:
snlp.search(['physical sciences', 'meeting', 'Arabic'], docs, keys=['doc1', 'doc2'])
Out[16]:
[('doc1', 'physical sciences', 1),
 ('doc1', 'meeting', 2),
 ('doc2', 'meeting', 1),
 ('doc2', 'Arabic', 1)]

Searching Chinese

The search function returns a list of documents that match the query. Each entry shows:

  1. the ID of the document
  2. the query
  3. the number of word hits in the document
In [17]:
snlp.search('合肥微尺度国家物理科学实验室', docs, keys=['doc1', 'doc2'])
Out[17]:
[('doc1', '合肥微尺度国家物理科学实验室', 7)]

For Chinese, the query is first segmented into words, and the number of word hits is the number of those words that appear in the document. Seven of the words in the string 合肥微尺度国家物理科学实验室 were found in doc1.
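
The jieba startup messages in the previous section indicate that jieba performs this word segmentation, so the hit count can be reproduced by segmenting the query directly. A sketch - the exact token list depends on jieba's dictionary, and the count of seven is inferred from the search result above rather than guaranteed:

import jieba
words = list(jieba.cut('合肥微尺度国家物理科学实验室'))
print(words)       # the segmented query words
print(len(words))  # expected: 7, matching the hit count above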

Other Searches

The search function can also be used for other languages.

Arabic

In [18]:
for result in snlp.search('عش اليوم الأمل ليوم غد', docs, keys=['doc1', 'doc2']):
    print("doc id:%s"% (result[0]))
    print('query:%s' % (result[1]))
    print('# of matches in document:%s' % (result[2]))
doc id:doc2
query:عش اليوم الأمل ليوم غد
# of matches in document:1

Russian

In [19]:
snlp.search('сегодня надейся на завтра', docs, keys=['doc1', 'doc2'])
Out[19]:
[('doc1', 'сегодня надейся на завтра', 1)]

Extracting Chinese, Russian, or Arabic from Mixed-Language Documents

In [20]:
snlp.find_chinese(document1)
Out[20]:
['合肥微尺度国家物理科学实验室']
In [21]:
snlp.find_russian(document1)
Out[21]:
['живи', 'сегодня', 'надейся', 'на', 'завтра']
In [22]:
snlp.find_arabic(document2)
Out[22]:
['عش', 'اليوم', 'الأمل', 'ليوم', 'غد']