Notebook [2]: Using the PDF converter

This notebook shows how to use the PDF converter to create an input dataframe for the cdQA pipeline from a directory of PDF files.

Note: To run this notebook you will need to have access to GPU. If you are using colab, you will need to install cdQA by executing !pip install cdqa in a cell.

In [1]:
import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model
/Users/andre.farias/python3.7.0/lib/python3.7/site-packages/tqdm/autonotebook/__init__.py:18: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  " (e.g. in jupyter console)", TqdmExperimentalWarning)

Download pre-trained reader model and PDF files

In [2]:
# Download model
download_model(model='bert-squad_1.1', dir='./models')
Downloading trained model...
In [3]:
# Download pdf files from BNP Paribas public news
def download_pdf():
    import os
    import wget
    directory = './data/pdf/'
    models_url = [
      'https://invest.bnpparibas.com/documents/1q19-pr-12648',
      'https://invest.bnpparibas.com/documents/4q18-pr-18000',
      'https://invest.bnpparibas.com/documents/4q17-pr'
    ]

    print('\nDownloading PDF files...')

    if not os.path.exists(directory):
        os.makedirs(directory)
    for url in models_url:
        wget.download(url=url, out=directory)

download_pdf()
Downloading PDF files...

Convert the PDF files into a DataFrame for cdQA pipeline

In [4]:
df = pdf_converter(directory_path='./data/pdf/')
df.head()
2019-07-20 15:43:22,713 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /var/folders/fy/3wb1p_ms5r3g97jm4y93pqd40000gn/T/tika-server.jar.
2019-07-20 15:43:34,191 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /var/folders/fy/3wb1p_ms5r3g97jm4y93pqd40000gn/T/tika-server.jar.md5.
2019-07-20 15:43:34,617 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
Out[4]:
title paragraphs
0 4q17-pr.pdf [GOOD START OF THE 2020 PLAN * COST OF RISK...
1 4q18-pr2.pdf [SIGNIFICANT PROGRESS IN THE DIGITAL TRANSFORM...
2 1q19-pr-12648.pdf [The business of BNP Paribas was up this quart...

Instantiate the cdQA pipeline from a pre-trained reader model

In [7]:
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)

# Fit Retriever to documents
cdqa_pipeline.fit_retriever(df=df)
Out[7]:
QAPipeline(reader=BertQA(bert_model='bert-base-uncased', do_lower_case=True,
                         fp16=False, gradient_accumulation_steps=1,
                         learning_rate=3e-05, local_rank=-1, loss_scale=0,
                         max_answer_length=30, n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=2,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_port='', train_batch_size=12,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1))

Execute a query

In [8]:
query = 'How many contracts did BNP Paribas Cardif sell in 2019?'
prediction = cdqa_pipeline.predict(query)
3it [00:00, 170.06it/s]
The pre-trained model you are loading is an uncased model but you have set `do_lower_case` to False. We are setting `do_lower_case=True` for you but you may want to check this behavior.

Explore predictions

In [9]:
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))
query: How many contracts did BNP Paribas Cardif sell in 2019?
answer: 140,000
title: 1q19-pr-12648.pdf
paragraph: Insurance recorded a good level of activity with in particular the good performance of the international Savings and Protection Insurance businesses and the good development of the new property and casualty insurance offering in the FRB network via Cardif IARD4 (close to 140,000 contracts sold at the end of March 2019). The business committed to energy transition with a target of 3.5 billion euros in green investments by the end of 2020.