%reload_ext autoreload
%autoreload 2
from ktrain.text.kw import KeywordExtractor
from ktrain.text.textextractor import TextExtractor
For our test document, let's download the ktrain ArXiv paper and use the TextExtractor
module to extract text.
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q
text = TextExtractor().extract('/tmp/downloaded_paper.pdf')
print(f"# of words in downloaded paper: {len(text.split())}")
# of words in downloaded paper: 4551
Let's first use `ngrams` as the candidate generator, which is comparatively fast:
kwe = KeywordExtractor()
%%time
kwe.extract_keywords(text, candidate_generator='ngrams')
CPU times: user 396 ms, sys: 19.8 ms, total: 416 ms Wall time: 415 ms
[('machine learning', 0.10548523206751055), ('step', 0.06751054852320675), ('learning rate', 0.046413502109704644), ('arxiv preprint', 0.046413502109704644), ('text classification', 0.03375527426160337), ('augmented machine', 0.02531645569620253), ('open-domain question-answering', 0.02531645569620253), ('augmented machine learning', 0.02531645569620253), ('bert', 0.02109704641350211), ('low-code library', 0.02109704641350211)]
If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the expense of a longer running time (about 1 s versus 0.4 s in the runs shown here).
%%time
kwe.extract_keywords(text, candidate_generator='noun_phrases')
CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s Wall time: 1.04 s
[('machine learning', 0.0784313725490196), ('text classification', 0.049019607843137254), ('image classification', 0.049019607843137254), ('exact answers', 0.0392156862745098), ('augmented machine learning', 0.0392156862745098), ('graph data', 0.029411764705882353), ('node classification', 0.029411764705882353), ('entity recognition', 0.029411764705882353), ('code example', 0.029411764705882353), ('index documents', 0.029411764705882353)]
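To see how much the two candidate generators agree, we can compare the two result lists directly. The snippet below is a small sketch: the lists are copied from the outputs above with scores rounded to four decimals, and `shared_keyphrases` is a hypothetical helper written for this comparison, not part of ktrain.

```python
# Results copied from the two runs above (scores rounded to four decimals).
ngram_keywords = [
    ("machine learning", 0.1055), ("step", 0.0675), ("learning rate", 0.0464),
    ("arxiv preprint", 0.0464), ("text classification", 0.0338),
    ("augmented machine", 0.0253), ("open-domain question-answering", 0.0253),
    ("augmented machine learning", 0.0253), ("bert", 0.0211),
    ("low-code library", 0.0211),
]
noun_phrase_keywords = [
    ("machine learning", 0.0784), ("text classification", 0.0490),
    ("image classification", 0.0490), ("exact answers", 0.0392),
    ("augmented machine learning", 0.0392), ("graph data", 0.0294),
    ("node classification", 0.0294), ("entity recognition", 0.0294),
    ("code example", 0.0294), ("index documents", 0.0294),
]


def shared_keyphrases(first, second):
    """Return the keyphrases found in both lists, in the order of `first`.

    Hypothetical helper for this comparison; not a ktrain function.
    """
    in_second = {phrase for phrase, _ in second}
    return [phrase for phrase, _ in first if phrase in in_second]


shared = shared_keyphrases(ngram_keywords, noun_phrase_keywords)
print(shared)
# ['machine learning', 'text classification', 'augmented machine learning']
```

Only three of the ten keyphrases overlap. The `ngrams` run also returns generic terms such as "step" and "arxiv preprint", which the `noun_phrases` run does not.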
The `extract_keywords` method has many other parameters to control the output. For instance, you can control the number of words in keyphrases with the `ngram_range` parameter. Here, we extract 3-word keyphrases:
kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))
[('augmented machine learning', 0.07017543859649122), ('a. s. maiya', 0.05263157894736842), ('optimal learning rate', 0.03508771929824561), ('natural language questions', 0.03508771929824561), ('support text data', 0.017543859649122806), ('learning rate schedules', 0.017543859649122806), ('machine learning model', 0.017543859649122806), ('unsupervised topic modeling', 0.017543859649122806), ('large text corpus', 0.017543859649122806), ('social media accounts', 0.017543859649122806)]
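To build intuition for what an `ngram_range` of `(min_n, max_n)` selects, here is a minimal pure-Python sketch of n-gram candidate generation followed by relative-frequency scoring. It is an illustration only and not ktrain's implementation: the helper names `ngram_candidates` and `score_by_frequency` are hypothetical, whitespace tokenization is a simplification, and scoring by relative frequency is an assumption made for this example.

```python
from collections import Counter


def ngram_candidates(tokens, ngram_range=(1, 3)):
    """Yield every contiguous n-gram whose length lies in ngram_range (inclusive)."""
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def score_by_frequency(tokens, ngram_range=(1, 3), top_n=1):
    """Score each candidate by its share of all candidates (an assumed scheme)."""
    counts = Counter(ngram_candidates(tokens, ngram_range))
    total = sum(counts.values())
    return [(phrase, count / total) for phrase, count in counts.most_common(top_n)]


# Naive whitespace tokenization, for illustration only.
tokens = "optimal learning rate and learning rate schedules".split()

# ngram_range=(2, 2) keeps only 2-word candidates.
top_bigram = score_by_frequency(tokens, ngram_range=(2, 2))
print(top_bigram)  # 'learning rate' occurs in 2 of the 6 bigrams

# ngram_range=(3, 3) keeps only 3-word candidates.
trigrams = list(ngram_candidates(tokens, ngram_range=(3, 3)))
print(trigrams)
```

Setting both ends of the range to the same value, as in the 3-word example above, restricts candidates to exactly that length.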
text = """
监督学习是学习一个函数的机器学习任务
根据样本输入-输出对将输入映射到输出。他推导出一个
函数来自由一组训练示例组成的标记训练数据。
在监督学习中,每个示例都是由一个输入对象组成的对
(通常是一个向量)和一个期望的输出值(也称为监控信号)。
监督学习算法分析训练数据并产生推断函数,
可用于映射新示例。最佳方案将允许
算法来正确确定不可见实例的类标签。这需要
学习算法从训练数据泛化到新情况
“合理”的方式(见归纳偏差)。
"""
kwe = KeywordExtractor(lang='zh')
kwe.extract_keywords(text)
[('监督 学习', 0.06), ('训练 数据', 0.06), ('学习 算法', 0.04), ('机器 学习', 0.02), ('学习 任务', 0.02), ('样本 输入', 0.02), ('输入 输出', 0.02), ('输入 映射', 0.02), ('自由 一组', 0.02), ('一组 训练', 0.02)]
text = """L'apprentissage supervisé est la tâche d'apprentissage automatique consistant à apprendre une fonction qui
mappe une entrée à une sortie sur la base d'exemples de paires entrée-sortie. Il en déduit une
fonction à partir de données d'entraînement étiquetées constituées d'un ensemble d'exemples d'entraînement.
En apprentissage supervisé, chaque exemple est une paire composée d'un objet d'entrée
(généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision).
Un algorithme d'apprentissage supervisé analyse les données d'apprentissage et produit une fonction inférée,
qui peut être utilisé pour cartographier de nouveaux exemples. Un scénario optimal permettra
algorithme pour déterminer correctement les étiquettes de classe pour les instances invisibles. Cela nécessite
l'algorithme d'apprentissage pour généraliser à partir des données d'entraînement à des situations inédites dans un
manière « raisonnable » (voir biais inductif)."""
kwe = KeywordExtractor(lang='fr')
kwe.extract_keywords(text)
[("données d'entraînement", 0.0392156862745098), ("l'apprentissage supervisé", 0.0196078431372549), ("tâche d'apprentissage", 0.0196078431372549), ("d'apprentissage automatique", 0.0196078431372549), ('automatique consistant', 0.0196078431372549), ("base d'exemples", 0.0196078431372549), ('paires entrée-sortie', 0.0196078431372549), ("d'entraînement étiquetées", 0.0196078431372549), ('étiquetées constituées', 0.0196078431372549), ("constituées d'un", 0.0196078431372549)]
The following languages are supported:
from ktrain.text.kw.core import SUPPORTED_LANGS
for k, v in SUPPORTED_LANGS.items():
    print(k, v)
en english
ar arabic
az azerbaijani
da danish
nl dutch
fi finnish
fr french
de german
el greek
hu hungarian
id indonesian
it italian
kk kazakh
ne nepali
no norwegian
pt portuguese
ro romanian
ru russian
sl slovene
es spanish
sv swedish
tg tajik
tr turkish
zh chinese
The `KeywordExtractor` is already fast. With parallelization, keyphrase extraction can easily scale to a large number of documents.
text = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
docs = [text] * 10000
kwe = KeywordExtractor()
We can process these 10,000 documents using 8 processors in under 10 seconds:
%%time
from joblib import Parallel, delayed
results = Parallel(n_jobs=8)(delayed(kwe.extract_keywords)(doc) for doc in docs)
CPU times: user 2.19 s, sys: 225 ms, total: 2.42 s Wall time: 9.51 s
print(f'# of results is {len(results)}')
results[0]
# of results is 10000
[('supervised learning', 0.07317073170731707), ('training data', 0.07317073170731707), ('learning algorithm', 0.04878048780487805), ('machine learning', 0.024390243902439025), ('learning task', 0.024390243902439025), ('output based', 0.024390243902439025), ('example input-output', 0.024390243902439025), ('input-output pairs', 0.024390243902439025), ('labeled training', 0.024390243902439025), ('data consisting', 0.024390243902439025)]
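Once every document has its own keyphrase list, a common next step is to summarize the whole corpus. The sketch below averages each keyphrase's score over all documents, counting a document as 0 when the phrase is absent. `aggregate_keyphrases` is a hypothetical helper, not part of ktrain. It assumes only that `results` is a list of `(phrase, score)` lists like the one shown above, and it uses a tiny made-up `toy_results` so it runs without ktrain.

```python
from collections import defaultdict


def aggregate_keyphrases(results, top_n=5):
    """Average each keyphrase's score across documents (absent phrases count as 0).

    Hypothetical helper; `results` is a list with one [(phrase, score), ...]
    list per document.
    """
    totals = defaultdict(float)
    for doc_keywords in results:
        for phrase, score in doc_keywords:
            totals[phrase] += score
    n_docs = len(results)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(phrase, total / n_docs) for phrase, total in ranked[:top_n]]


# Made-up stand-in for `results`; the scores are illustrative, not ktrain output.
toy_results = [
    [("supervised learning", 0.5), ("training data", 0.25)],
    [("training data", 0.5), ("learning algorithm", 0.25)],
]

corpus_keyphrases = aggregate_keyphrases(toy_results, top_n=2)
print(corpus_keyphrases)
# [('training data', 0.375), ('supervised learning', 0.25)]
```

With the real `results` from the parallel run, `aggregate_keyphrases(results)` would return the same ranking as `results[0]`, because all 10,000 documents are identical.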