%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
As of v0.28.x, ktrain includes the TextExtractor
class, which allows you to easily extract text from various file formats such as PDFs and MS Word documents.
!wget https://aclanthology.org/N19-1423.pdf -O /tmp/bert_paper.pdf
--2021-10-13 14:43:00--  https://aclanthology.org/N19-1423.pdf
Resolving aclanthology.org (aclanthology.org)... 174.138.37.75
Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 786279 (768K) [application/pdf]
Saving to: ‘/tmp/bert_paper.pdf’

/tmp/bert_paper.pdf 100%[===================>] 767.85K  --.-KB/s    in 0.08s

2021-10-13 14:43:00 (9.90 MB/s) - ‘/tmp/bert_paper.pdf’ saved [786279/786279]
from ktrain.text import TextExtractor
te = TextExtractor()
rawtext = te.extract('/tmp/bert_paper.pdf')
print(rawtext[:1000])
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com Abstract We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language proces
sentences = te.extract('/tmp/bert_paper.pdf', return_format='sentences')
print(sentences[:5])
['BERT : Pre training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin', 'Ming Wei Chang Kenton Lee Kristina Toutanova Google AI Language { jacobdevlin , mingweichang , kentonl , kristout } @google.com', 'Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers .', 'Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers .', 'As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications .']
paragraphs = te.extract('/tmp/bert_paper.pdf', return_format='paragraphs')
print("%s paragraphs" % (len(paragraphs)))
print('Third paragraph from paper is:\n')
print(paragraphs[2])
495 paragraphs
Third paragraph from paper is:

Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers . Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers . As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications . BERT is conceptually simple and empirically powerful . It obtains new state of the art results on eleven natural language processing tasks , including pushing the GLUE score to 80.5 % ( 7.7 % point absolute improvement ) , Multi NLI accuracy to 86.7 % ( 4.6 % absolute improvement ) , SQu AD v1.1 question answering Test F1 to 93.2 ( 1.5 point absolute improvement ) and SQu AD v2.0 Test F1 to 83.1 ( 5.1 point absolute improvement ) .
You can also feed the TextExtractor
raw strings to simply split them into lists of sentences or paragraphs:
two_sentences = 'This is the first sentence. This is the second sentence.'
te.extract(text=two_sentences, return_format='sentences')
['This is the first sentence .', 'This is the second sentence .']