%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # CPU
os.environ['DISABLE_V2_BEHAVIOR'] = '1' # disable V2 Behavior - required for NER in TF2 right now
The ShallowNLP module in ktrain is a small collection of text-analytic utilities to help analyze text data in English, Chinese, Russian, and other languages. All methods in ShallowNLP are for use on a normal laptop CPU - no GPUs are required. Thus, it is well-suited to those with minimal computational resources and no GPU access.
Let's begin by importing the shallownlp module.
from ktrain.text import shallownlp as snlp
Using DISABLE_V2_BEHAVIOR with TensorFlow using Keras version: 2.2.4-tf
ner = snlp.NER('en')
text = """
Xuetao Cao was head of the Chinese Academy of Medical Sciences and is
the current president of Nankai University.
"""
ner.predict(text)
[('Xuetao Cao', 'PER'), ('Chinese Academy of Medical Sciences', 'ORG'), ('Nankai University', 'ORG')]
The ner.predict method automatically merges tokens by entity. To see the unmerged results, set merge_tokens=False:
ner.predict(text, merge_tokens=False)
[('Xuetao', 'B-PER'), ('Cao', 'I-PER'), ('was', 'O'), ('head', 'O'), ('of', 'O'), ('the', 'O'), ('Chinese', 'B-ORG'), ('Academy', 'I-ORG'), ('of', 'I-ORG'), ('Medical', 'I-ORG'), ('Sciences', 'I-ORG'), ('and', 'O'), ('is', 'O'), ('the', 'O'), ('current', 'O'), ('president', 'O'), ('of', 'O'), ('Nankai', 'B-ORG'), ('University', 'I-ORG'), ('.', 'O')]
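The merging step above can be sketched in plain Python: walk the BIO-tagged tokens, start a new entity on a B- tag, extend it on an I- tag, and flush it on O. This is an illustration of the idea only, not ktrain's internal implementation; the merge_bio helper is hypothetical.

```python
def merge_bio(tokens):
    # merge (word, BIO-tag) pairs into (entity, type) pairs
    entities, current, current_type = [], [], None
    for word, tag in tokens:
        if tag.startswith('B-'):
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [word], tag[2:]
        elif tag.startswith('I-') and current:
            current.append(word)
        else:  # an 'O' tag closes any open entity
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((' '.join(current), current_type))
    return entities

tokens = [('Xuetao', 'B-PER'), ('Cao', 'I-PER'), ('was', 'O'),
          ('head', 'O'), ('of', 'O'), ('the', 'O'),
          ('Chinese', 'B-ORG'), ('Academy', 'I-ORG'), ('of', 'I-ORG'),
          ('Medical', 'I-ORG'), ('Sciences', 'I-ORG')]
print(merge_bio(tokens))
# → [('Xuetao Cao', 'PER'), ('Chinese Academy of Medical Sciences', 'ORG')]
```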
The ner.predict method typically operates on single sentences, as in the example above. For multi-sentence documents, sentences can be extracted with snlp.sent_tokenize:
document = """Paul Newman is a great actor. Tommy Wiseau is not."""
sents = []
for idx, sent in enumerate(snlp.sent_tokenize(document)):
    sents.append(sent)
    print('sentence #%d: %s' % (idx+1, sent))
sentence #1: Paul Newman is a great actor .
sentence #2: Tommy Wiseau is not .
ner.predict(sents[0])
('Paul Newman', 'PER')
ner.predict(sents[1])
('Tommy Wiseau', 'PER')
Extracting entities from the Chinese translation of:
Xuetao Cao was head of the Chinese Academy of Medical Sciences and is the current president of Nankai University.
ner = snlp.NER('zh')
ner.predict('曹雪涛曾任中国医学科学院院长,现任南开大学校长。')
[('曹雪涛', 'PER'), ('中国医学科学院', 'ORG'), ('南开大学', 'ORG')]
Discovered entities with English translations:
曹雪涛 (Xuetao Cao) - PER
中国医学科学院 (Chinese Academy of Medical Sciences) - ORG
南开大学 (Nankai University) - ORG
The snlp.sent_tokenize function can also be used with Chinese documents:
document = """这是关于史密斯博士的第一句话。第二句话是关于琼斯先生的。"""
for idx, sent in enumerate(snlp.sent_tokenize(document)):
    print('sentence #%d: %s' % (idx+1, sent))
sentence #1: 这是关于史密斯博士的第一句话。
sentence #2: 第二句话是关于琼斯先生的。
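A Chinese sentence splitter like this can be sketched with a regex over CJK sentence-final punctuation. This is only an illustration of the idea, not snlp's implementation; the split_zh helper is hypothetical.

```python
import re

def split_zh(document):
    # split on CJK end-of-sentence marks (。！？), keeping the delimiter
    parts = re.findall(r'[^。！？]+[。！？]?', document)
    return [p for p in parts if p.strip()]

print(split_zh("这是关于史密斯博士的第一句话。第二句话是关于琼斯先生的。"))
# → ['这是关于史密斯博士的第一句话。', '第二句话是关于琼斯先生的。']
```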
Extracting entities from the Russian translation of:
Katerina Tikhonova, the youngest daughter of Russian President Vladimir Putin, was appointed head of a new artificial intelligence institute at Moscow State University.
ner = snlp.NER('ru')
russian_sentence = """Катерина Тихонова, младшая дочь президента России Владимира Путина,
была назначена руководителем нового института искусственного интеллекта в МГУ."""
ner.predict(russian_sentence)
[('Катерина Тихонова', 'PER'), ('России', 'LOC'), ('Владимира Путина', 'PER'), ('МГУ', 'ORG')]
Discovered entities with English translations:
Катерина Тихонова (Katerina Tikhonova) - PER
России (Russia) - LOC
Владимира Путина (Vladimir Putin) - PER
МГУ (Moscow State University) - ORG
ShallowNLP makes it easy to build a text classifier with minimal computational resources. ShallowNLP includes the following sklearn-based text classification models: a non-neural version of NBSVM, Logistic Regression, and Linear SVM with SGD training (SGDClassifier). Logistic regression is the default classifier. For these examples, we will use NBSVM.
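The non-neural NBSVM mentioned above combines Naive Bayes log-count-ratio features with a linear classifier. The core idea can be sketched directly with sklearn on toy data; this is an illustration of the technique under stated assumptions, not ktrain's implementation, and the tiny dataset is invented for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good great fun", "bad awful boring", "great film", "boring film"]
labels = np.array([1, 0, 1, 0])

vec = CountVectorizer(binary=True)
X = vec.fit_transform(texts).toarray()

# Naive Bayes log-count ratios (with add-one smoothing)
p = X[labels == 1].sum(0) + 1.0
q = X[labels == 0].sum(0) + 1.0
r = np.log((p / p.sum()) / (q / q.sum()))

# scale features by r, then fit a linear model (the "SVM" part of NBSVM)
clf = LogisticRegression().fit(X * r, labels)
print(clf.predict(vec.transform(["great fun"]).toarray() * r))  # → [1]
```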
A classifier can be trained with minimal effort for both English and Chinese.
We'll use the IMDb movie review dataset available here to build a sentiment analysis model for English.
datadir = r'/home/amaiya/data/aclImdb'
(x_train, y_train, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')
(x_test, y_test, _) = snlp.Classifier.load_texts_from_folder(datadir+'/test', shuffle=False)
print('label names: %s' % (label_names))
clf = snlp.Classifier().fit(x_train, y_train, ctype='nbsvm')
print('validation accuracy: %s%%' % (round(clf.evaluate(x_test, y_test)*100, 2)))
pos_text = 'I loved this movie because it was hilarious.'
neg_text = 'I hated this movie because it was boring.'
print('prediction for "%s": %s (pos)' % (pos_text, clf.predict(pos_text)))
print('prediction for "%s": %s (neg)' % (neg_text, clf.predict(neg_text)))
label names: ['neg', 'pos']
validation accuracy: 92.03%
prediction for "I loved this movie because it was hilarious.": 1 (pos)
prediction for "I hated this movie because it was boring.": 0 (neg)
We'll use the hotel review dataset available here to build a sentiment analysis model for Chinese.
datadir = '/home/amaiya/data/ChnSentiCorp_htl_ba_6000'
(texts, labels, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')
print('label names: %s' % (label_names))
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.1, random_state=42)
clf = snlp.Classifier().fit(x_train, y_train, ctype='nbsvm')
print('validation accuracy: %s%%' % (round(clf.evaluate(x_test, y_test)*100, 2)))
pos_text = '我喜欢这家酒店,因为它很干净。' # I loved this hotel because it was very clean.
neg_text = '我讨厌这家酒店,因为它很吵。' # I hated this hotel because it was noisy.
print('prediction for "%s": %s' % (pos_text, clf.predict(pos_text)))
print('prediction for "%s": %s' % (neg_text, clf.predict(neg_text)))
detected encoding: GB18030
Decoding with GB18030 failed 1st attempt - using GB18030 with skips
skipped 118 lines (0.3%) due to character decoding errors
label names: ['neg', 'pos']
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.640 seconds.
Prefix dict has been built succesfully.
validation accuracy: 91.55%
prediction for "我喜欢这家酒店,因为它很干净。": 1
prediction for "我讨厌这家酒店,因为它很吵。": 0
The hyperparameters of a particular classifier can be tuned using the grid_search method. Let's tune the C hyperparameter of a Logistic Regression model to find the best value for this dataset.
# setup data
datadir = r'/home/amaiya/data/aclImdb'
(x_train, y_train, label_names) = snlp.Classifier.load_texts_from_folder(datadir+'/train')
(x_test, y_test, _) = snlp.Classifier.load_texts_from_folder(datadir+'/test', shuffle=False)
# initialize a model to optimize
clf = snlp.Classifier()
clf.create_model('logreg', x_train)
# create parameter space for values of C
parameters = {'clf__C': (1e0, 1e-1, 1e-2)}
# tune
clf.grid_search(parameters, x_train[:5000], y_train[:5000], n_jobs=-1)
clf__C: 1.0
It looks like a value of 1.0 is best. We can then re-create the model with this hyperparameter value and proceed to train normally:
clf.create_model('logreg', x_train, hp_dict={'C':1.0})
clf.fit(x_train, y_train)
clf.evaluate(x_test, y_test)
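The same tuning step can be reproduced standalone with sklearn's GridSearchCV over a text pipeline, which is presumably what grid_search wraps. This is a sketch on invented toy data, not ktrain's code; the pipeline step names are assumptions for the example.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# toy dataset invented for illustration
texts = ["good movie", "great film", "bad movie", "awful film"] * 10
labels = [1, 1, 0, 0] * 10

pipe = Pipeline([("vect", TfidfVectorizer()),
                 ("clf", LogisticRegression())])

# same parameter space as above: clf__C addresses C on the "clf" step
parameters = {"clf__C": (1e0, 1e-1, 1e-2)}
gs = GridSearchCV(pipe, parameters, cv=3, n_jobs=-1).fit(texts, labels)
print(gs.best_params_)
```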
Here we will show some simple searches over multi-language documents.
document1 = """
Hello there,
Hope this email finds you well.
Are you available to talk about our meeting?
If so, let us plan to schedule the meeting
at the Hefei National Laboratory for Physical Sciences at the Microscale.
As I always say: живи сегодня надейся на завтра
Sincerely,
John Doe
合肥微尺度国家物理科学实验室
"""
document2 = """
This is a random document with Arabic about our meeting.
عش اليوم الأمل ليوم غد
Bye for now.
"""
docs = [document1, document2]
The search function returns a list of documents that match the query. Each entry shows the document key, the matched query term, and the number of hits:
snlp.search(['physical sciences', 'meeting', 'Arabic'], docs, keys=['doc1', 'doc2'])
[('doc1', 'physical sciences', 1), ('doc1', 'meeting', 2), ('doc2', 'meeting', 1), ('doc2', 'Arabic', 1)]
The search function can also be used with Chinese queries:
snlp.search('合肥微尺度国家物理科学实验室', docs, keys=['doc1', 'doc2'])
[('doc1', '合肥微尺度国家物理科学实验室', 7)]
For Chinese, the number of word hits is the number of words in the query that appear in the document. Seven of the words in the string 合肥微尺度国家物理科学实验室 were found in doc1.
for result in snlp.search('عش اليوم الأمل ليوم غد', docs, keys=['doc1', 'doc2']):
    print("doc id:%s" % (result[0]))
    print('query:%s' % (result[1]))
    print('# of matches in document:%s' % (result[2]))
doc id:doc2
query:عش اليوم الأمل ليوم غد
# of matches in document:1
snlp.search('сегодня надейся на завтра', docs, keys=['doc1', 'doc2'])
[('doc1', 'сегодня надейся на завтра', 1)]
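A simple-minded version of this kind of search is just counting query occurrences per document. The sketch below mirrors the (key, query, hits) result format from the examples above; simple_search is a hypothetical helper, not snlp's implementation.

```python
import re

def simple_search(queries, docs, keys):
    # count occurrences of each query string in each document
    if isinstance(queries, str):
        queries = [queries]
    results = []
    for key, doc in zip(keys, docs):
        for q in queries:
            n = len(re.findall(re.escape(q), doc, flags=re.IGNORECASE))
            if n:
                results.append((key, q, n))
    return results

docs = ["We will meet at the meeting room for the meeting.",
        "No relevant terms here."]
print(simple_search(['meeting', 'room'], docs, keys=['doc1', 'doc2']))
# → [('doc1', 'meeting', 2), ('doc1', 'room', 1)]
```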
snlp.find_chinese(document1)
['合肥微尺度国家物理科学实验室']
snlp.find_russian(document1)
['живи', 'сегодня', 'надейся', 'на', 'завтра']
snlp.find_arabic(document2)
['عش', 'اليوم', 'الأمل', 'ليوم', 'غد']
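Script-specific extraction like find_chinese, find_russian, and find_arabic can be approximated with Unicode-block regexes. The ranges below cover the basic CJK, Cyrillic, and Arabic blocks only; this is a sketch of the idea, not ktrain's implementation.

```python
import re

# basic Unicode blocks for each script (extended blocks omitted)
CHINESE = re.compile(r'[\u4e00-\u9fff]+')
RUSSIAN = re.compile(r'[\u0400-\u04ff]+')
ARABIC  = re.compile(r'[\u0600-\u06ff]+')

text = "Hello 你好 привет مرحبا world"
print(CHINESE.findall(text))  # → ['你好']
print(RUSSIAN.findall(text))  # → ['привет']
print(ARABIC.findall(text))   # → ['مرحبا']
```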