This notebook shows how to use the PDF converter to create an input dataframe for the cdQA pipeline from a directory of PDF files.
*Note:* To run this notebook you will need access to a GPU. If you are using Colab, install cdQA first by executing `!pip install cdqa` in a cell.
import os
import pandas as pd
from ast import literal_eval
from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model
# Download model
download_model(model='bert-squad_1.1', dir='./models')
Downloading trained model...
# Download pdf files from BNP Paribas public news
def download_pdf():
    import os
    import wget

    directory = './data/pdf/'
    models_url = [
        'https://invest.bnpparibas.com/documents/1q19-pr-12648',
        'https://invest.bnpparibas.com/documents/4q18-pr-18000',
        'https://invest.bnpparibas.com/documents/4q17-pr'
    ]

    print('\nDownloading PDF files...')

    # Create the target directory on first run
    if not os.path.exists(directory):
        os.makedirs(directory)
    for url in models_url:
        wget.download(url=url, out=directory)

download_pdf()
Downloading PDF files...
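Before converting, it can help to confirm that the download step left PDF files in the directory that `pdf_converter` will read. The following is a minimal sketch using only the standard library; the helper name `list_pdfs` is hypothetical and not part of cdQA, and it assumes the files were saved with a `.pdf` extension, as the titles in the converted dataframe suggest.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return the sorted names of the .pdf files found directly in `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        # Nothing was downloaded yet (or the path is wrong)
        return []
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == '.pdf'
    )


# Same directory as the download cell; prints [] if it does not exist
print(list_pdfs('./data/pdf/'))
```

An empty list here would point to a failed download rather than a converter problem.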
df = pdf_converter(directory_path='./data/pdf/')
df.head()
2019-07-20 15:43:22,713 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /var/folders/fy/3wb1p_ms5r3g97jm4y93pqd40000gn/T/tika-server.jar.
2019-07-20 15:43:34,191 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /var/folders/fy/3wb1p_ms5r3g97jm4y93pqd40000gn/T/tika-server.jar.md5.
2019-07-20 15:43:34,617 [MainThread ] [WARNI] Failed to see startup log message; retrying...
| | title | paragraphs |
|---|---|---|
| 0 | 4q17-pr.pdf | [GOOD START OF THE 2020 PLAN * COST OF RISK... |
| 1 | 4q18-pr2.pdf | [SIGNIFICANT PROGRESS IN THE DIGITAL TRANSFORM... |
| 2 | 1q19-pr-12648.pdf | [The business of BNP Paribas was up this quart... |
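The dataframe has one row per PDF, with the file name in `title` and what appears to be a list of paragraph strings in `paragraphs`. A quick sanity check is to count the paragraphs extracted from each document. The sketch below works on plain dictionaries so that it runs without pandas; `paragraph_counts` is a hypothetical helper, and the sample records are made up for illustration rather than taken from the converter output.

```python
def paragraph_counts(records):
    """Map each document title to the number of paragraphs it contains."""
    return {rec['title']: len(rec['paragraphs']) for rec in records}


# Made-up records mimicking the two columns of the dataframe
sample = [
    {'title': 'a.pdf', 'paragraphs': ['First paragraph.', 'Second paragraph.']},
    {'title': 'b.pdf', 'paragraphs': ['Only paragraph.']},
]
print(paragraph_counts(sample))
```

With the real dataframe, `paragraph_counts(df.to_dict('records'))` should give the same kind of summary, assuming the `paragraphs` column holds lists.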
cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib', max_df=1.0)
# Fit Retriever to documents
cdqa_pipeline.fit_retriever(X=df)
QAPipeline(reader=BertQA(bert_model='bert-base-uncased', do_lower_case=True, fp16=False, gradient_accumulation_steps=1, learning_rate=3e-05, local_rank=-1, loss_scale=0, max_answer_length=30, n_best_size=20, no_cuda=False, null_score_diff_threshold=0.0, num_train_epochs=2, output_dir=None, predict_batch_size=8, seed=42, server_ip='', server_port='', train_batch_size=12, verbose_logging=False, version_2_with_negative=False, warmup_proportion=0.1))
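The `max_df=1.0` argument looks like the scikit-learn vectorizer parameter of the same name, where a float is read as a proportion of documents and terms appearing in more than that proportion are dropped. Assuming cdQA forwards it to such a vectorizer (an assumption, not something this notebook confirms), a value of 1.0 keeps every term. The toy sketch below illustrates only that filtering rule in pure Python; `vocabulary_with_max_df` is a hypothetical helper, not cdQA or scikit-learn code.

```python
def vocabulary_with_max_df(documents, max_df=1.0):
    """Keep terms whose document frequency does not exceed max_df * n_documents."""
    n_docs = len(documents)
    doc_freq = {}
    for text in documents:
        # Count each term once per document
        for term in set(text.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df <= limit)


docs = ['net income up', 'net income down', 'cost of risk']
print(vocabulary_with_max_df(docs, max_df=1.0))  # every term is kept
print(vocabulary_with_max_df(docs, max_df=0.5))  # 'net' and 'income' are dropped
```

Lowering the threshold removes terms shared by many documents, which carry little information for telling documents apart.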
query = 'How many contracts did BNP Paribas Cardif sell in 2019?'
prediction = cdqa_pipeline.predict(query)
3it [00:00, 170.06it/s]
The pre-trained model you are loading is an uncased model but you have set `do_lower_case` to False. We are setting `do_lower_case=True` for you but you may want to check this behavior.
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))
query: How many contracts did BNP Paribas Cardif sell in 2019?
answer: 140,000
title: 1q19-pr-12648.pdf
paragraph: Insurance recorded a good level of activity with in particular the good performance of the international Savings and Protection Insurance businesses and the good development of the new property and casualty insurance offering in the FRB network via Cardif IARD4 (close to 140,000 contracts sold at the end of March 2019). The business committed to energy transition with a target of 3.5 billion euros in green investments by the end of 2020.
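The print cell indexes the prediction positionally: answer, then title, then paragraph. Unpacking into named variables can make that order easier to follow. The sketch below assumes the prediction holds at least those three items in that order, as the prints suggest; `format_prediction` is a hypothetical helper and the sample values are placeholders.

```python
def format_prediction(query, prediction):
    """Build a readable report from an (answer, title, paragraph, ...) sequence."""
    # Only the first three items are used; any extra items are ignored
    answer, title, paragraph = prediction[:3]
    return '\n'.join([
        'query: {}'.format(query),
        'answer: {}'.format(answer),
        'title: {}'.format(title),
        'paragraph: {}'.format(paragraph),
    ])


# Placeholder values, not real model output
print(format_prediction('What is asked?', ('an answer', 'doc.pdf', 'Some paragraph.')))
```

With the pipeline above, `print(format_prediction(query, prediction))` would produce the same four lines as the original print cell.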