In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"; 

Text Extraction

As of v0.28.x, ktrain includes a TextExtractor class that allows you to easily extract text from various file formats such as PDFs and MS Word documents.

In [2]:
!wget https://aclanthology.org/N19-1423.pdf -O /tmp/bert_paper.pdf
--2021-10-13 14:43:00--  https://aclanthology.org/N19-1423.pdf
Resolving aclanthology.org (aclanthology.org)... 174.138.37.75
Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 786279 (768K) [application/pdf]
Saving to: ‘/tmp/bert_paper.pdf’

/tmp/bert_paper.pdf 100%[===================>] 767.85K  --.-KB/s    in 0.08s   

2021-10-13 14:43:00 (9.90 MB/s) - ‘/tmp/bert_paper.pdf’ saved [786279/786279]

In [3]:
from ktrain.text import TextExtractor
te = TextExtractor()

Extract text into a single string variable:

In [4]:
rawtext = te.extract('/tmp/bert_paper.pdf')
print(rawtext[:1000])
BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin

Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract
We introduce a new language representation model called BERT, which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from
unlabeled text by jointly conditioning on both
left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer
to create state-of-the-art models for a wide
range of tasks, such as question answering and
language inference, without substantial taskspecific architecture modifications.
BERT is conceptually simple and empirically
powerful. It obtains new state-of-the-art results on eleven natural language proces

Extract text and split into sentences:

In [5]:
sentences = te.extract('/tmp/bert_paper.pdf', return_format='sentences')
print(sentences[:5])
['BERT : Pre training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin', 'Ming Wei Chang Kenton Lee Kristina Toutanova Google AI Language { jacobdevlin , mingweichang , kentonl , kristout } @google.com', 'Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers .', 'Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers .', 'As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications .']

Extract text and split into paragraphs:

In [6]:
paragraphs = te.extract('/tmp/bert_paper.pdf', return_format='paragraphs')
print("%s paragraphs" % (len(paragraphs)))
print('Third paragraph from paper is:\n')
print(paragraphs[2])
495 paragraphs
Third paragraph from paper is:

Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers . Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers . As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications . BERT is conceptually simple and empirically powerful . It obtains new state of the art results on eleven natural language processing tasks , including pushing the GLUE score to 80.5 % ( 7.7 % point absolute improvement ) , Multi NLI accuracy to 86.7 % ( 4.6 % absolute improvement ) , SQu AD v1.1 question answering Test F1 to 93.2 ( 1.5 point absolute improvement ) and SQu AD v2.0 Test F1 to 83.1 ( 5.1 point absolute improvement ) .

You can also feed TextExtractor raw strings to split them into lists of sentences or paragraphs:

In [7]:
two_sentences = 'This is the first sentence.  This is the second sentence.'
te.extract(text=two_sentences, return_format='sentences')
Out[7]:
['This is the first sentence .', 'This is the second sentence .']
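
Splitting a raw string into paragraphs works the same way; just pass return_format='paragraphs'. Below is a minimal sketch with made-up text; the exact paragraph boundaries depend on how TextExtractor segments the input:

In [ ]:
# split a multi-paragraph string into a list of paragraphs (sample text is made up)
two_paragraphs = 'First paragraph, sentence one. First paragraph, sentence two.\n\nSecond paragraph.'
te.extract(text=two_paragraphs, return_format='paragraphs')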
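The TextExtractor is not limited to PDFs. Extracting text from an MS Word document follows the same pattern shown above for PDFs (a minimal sketch; the .docx path below is hypothetical and assumed to exist):

In [ ]:
# extract the full text of a Word document as a single string (hypothetical path)
docx_text = te.extract('/tmp/report.docx')

# or split the document directly into sentences
docx_sentences = te.extract('/tmp/report.docx', return_format='sentences')
print(docx_sentences[:3])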