%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
In this notebook, we build a practical, end-to-end Question-Answering (QA) system with BERT in roughly 3 lines of code. We will treat a corpus of text documents as a knowledge base to which we can pose questions and retrieve exact answers using BERT. This goes beyond simplistic keyword searches.
For this example, we will use the 20 Newsgroup dataset as the text corpus. As a collection of newsgroup postings containing an abundance of opinions and debates, the corpus is not ideal as a knowledge base. It is better to use fact-based documents such as Wikipedia articles or even news articles. However, this dataset will suffice for this example.
Let us begin by loading the dataset into an array using scikit-learn and importing ktrain modules.
# load the 20newsgroups dataset into an array
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
docs = newsgroups_train.data + newsgroups_test.data
import ktrain
from ktrain import text
We will first index the documents into a search engine that will be used to quickly retrieve documents that are likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist.
Since the newsgroup postings are small and fit in memory, we will set `commit_every` to a large value to speed up the indexing process. This means results will not be written to the index until the end. If you experience issues, you can lower this value.
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs),
multisegment=True, procs=4, # these args speed up indexing
breakup_docs=True # this slows indexing but speeds up answer retrieval
)
For document sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g., `.txt` files) by default. If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can supply the `use_text_extraction=True` argument to `index_from_folder`, which will use the textract package to extract text from different file types and index this text into the search engine for answer retrieval. You can also manually convert them to `.txt` files with `ktrain.text.textutils.extract_copy` or tools like Apache Tika or textract.
By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`), with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`). These values can be changed as arguments to `index_from_list` or `index_from_folder` to speed up indexing. See the whoosh documentation for more information on these parameters and how to use them. In this case, we've used `multisegment=True` and `procs=4`.
Note that larger documents will cause inferences in STEP 3 (see below) to be very slow. If your dataset consists of larger documents (e.g., long articles), we recommend breaking them up into pages (e.g., splitting the original PDF with something like pdfseparate) or into paragraphs (paragraphs are probably preferable). The latter can be done with ktrain using:

`ktrain.text.textutils.paragraph_tokenize(document, join_sentences=True)`

If you supply `breakup_docs=True` in the cell above, this will be done automatically. Note that `breakup_docs=True` will slightly slow indexing (i.e., STEP 1) but speed up answer retrieval (i.e., STEP 3 below). A second way to speed up answer retrieval is to increase `batch_size` in STEP 3 if using a GPU, which will be discussed later.
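To make the effect of paragraph splitting concrete, here is a minimal, stdlib-only sketch of the kind of splitting described above. It is not ktrain's implementation of `paragraph_tokenize`; the function name and the blank-line heuristic are assumptions for illustration:

```python
import re

def split_into_paragraphs(document: str) -> list[str]:
    """Split a document on blank lines and normalize whitespace.

    A stdlib-only sketch of the kind of paragraph splitting that
    paragraph_tokenize performs; NOT ktrain's actual implementation.
    """
    # Treat runs of one or more blank lines as paragraph boundaries.
    raw_paragraphs = re.split(r"\n\s*\n", document)
    # Join each paragraph's lines into a single string (analogous to
    # join_sentences=True) and drop empty chunks.
    return [" ".join(p.split()) for p in raw_paragraphs if p.strip()]

doc = "First paragraph.\nStill first.\n\nSecond paragraph."
print(split_into_paragraphs(doc))
# → ['First paragraph. Still first.', 'Second paragraph.']
```

Each resulting paragraph can then be indexed as its own, smaller document.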
The above steps need only be performed once. Once an index has been created, you can skip this step and proceed directly to STEP 2 to begin using your system.
Next, we create a QA instance. This step will automatically download the BERT SQuAD model if it does not already exist on your system.
qa = text.SimpleQA(INDEXDIR)
That's it! In roughly 3 lines of code, we have built an end-to-end QA system that can now be used to generate answers to questions. Let's ask our system some questions.
We will invoke the `ask` method to issue questions to the text corpus we indexed and retrieve answers. We will also use the `qa.display_answers` method to nicely display the top 5 results in this Jupyter notebook. The answers are inferred using a BERT model fine-tuned on the SQuAD dataset. The model will comb through paragraphs and sentences to find candidate answers. By default, `ask` currently uses a `batch_size` of 8, but, if necessary, you can experiment with lowering it by setting the `batch_size` parameter. On a CPU, for instance, you may want to try `batch_size=1`.
Note also that the 20 Newsgroup Dataset covers events in the early to mid 1990s, so references to recent events will not exist.
answers = qa.ask('When did the Cassini probe launch?')
qa.display_answers(answers[:5])
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | in october of 1997 | cassini is scheduled for launch aboard a titan iv / centaur in october of 1997 . | 0.819032 | 59 |
| 1 | on january 26,1962 | ranger 3, launched on january 26,1962 , was intended to land an instrument capsule on the surface of the moon, but problems during the launch caused the probe to miss the moon and head into solar orbit. | 0.151229 | 8525 |
| 2 | - 10 / 06 / 97 | key scheduled dates for the cassini mission (vvejga trajectory) 10 / 06 / 97 - titan iv / centaur launch 04 / 21 / 98 - venus 1 gravity assist 06 / 20 / 99 - venus 2 gravity assist 08 / 16 / 99 - earth gravity assist 12 / 30 / 00 - jupiter gravity assist 06 / 25 / 04 - saturn arrival 01 / 09 / 05 - titan probe release 01 / 30 / 05 - titan probe entry 06 / 25 / 08 - end of primary mission (schedule last updated 7 / 22 / 92) - 10 / 06 / 97 | 0.029694 | 59 |
| 3 | * 98 | cassini * * * * * * * * * * * * * * * * * * 98 ,115 * * * * | 0.000026 | 5356 |
| 4 | the latter part of the 1990s | scheduled for launch in the latter part of the 1990s , the craf and cassini missions are a collaborative project of nasa, the european space agency and the federal space agencies of germany and italy, as well as the united states air force and the department of energy. | 0.000017 | 18684 |
As you can see, the top candidate answer indicates that the Cassini space probe was launched in October of 1997, which appears to be correct. The correct answer will not always be the top answer, but it is in this case.
Note that, since we used `index_from_list` to index documents, the last column (i.e., Document Reference) shows the list index associated with the newsgroup posting containing the answer, which can be used to peruse the entire document containing the answer. If using `index_from_folder` to index documents, the last column will show the relative path and filename of the document. The Document Reference values can be customized by supplying a `references` parameter to `index_from_list`.
To see the text of the document that contains the top answer, uncomment and execute the following line (it's a comparatively long post).
#print(docs[59])
The 20 Newsgroup dataset contains lots of posts discussing and debating religions like Christianity and Islam, as well. Let's ask a question on this subject.
answers = qa.ask('Who was Muhammad?')
qa.display_answers(answers[:5])
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | the holy prophet of islam | just a small reminder to all my muslim brothers, did _ ever _ the holy prophet of islam (muhammad pbuh), say to anyone who called himself a muslim : | 0.762464 | 1278 |
| 1 | the messenger of allah | muhammad is the messenger of allah , and those who are with him are firm against the unbelievers and merciful among each other. | 0.201862 | 4876 |
| 2 | is the last prophet of islam | muhammad peace and blessings of allah be upon him (saw) is the last prophet of islam . | 0.035121 | 4640 |
| 3 | either a liar, or he was crazy (a modern day mad mahdi) or he was actually who he said he was | the book says that muhammad was either a liar, or he was crazy (a modern day mad mahdi) or he was actually who he said he was . | 0.000293 | 4934 |
| 4 | [ mahound ' s | muhammad ' s [ mahound ' s ] integrity is not really impugned in this part of the story, and there ' s no reason to think this was rushdie ' s intent : gibreel, as the archangel, produces the verses (divine and satanic), though he does not know their provenance. | 0.000138 | 15852 |
Here, we see different views on who Muhammad, the founder of Islam, was, as debated and discussed in this document set.
Finally, the 20 Newsgroup dataset also contains many groups about computing hardware and software. Let's ask a technical support question.
answers = qa.ask('What causes computer images to be too dark?')
qa.display_answers(answers[:5])
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | if your viewer does not do gamma correction | if your viewer does not do gamma correction , then linear images will look too dark, and gamma corrected images will ok. | 0.937990 | 13873 |
| 1 | is gamma correction | this, is gamma correction (or the lack of it). | 0.045165 | 13873 |
| 2 | so if you just dump your nice linear image out to a crt | so if you just dump your nice linear image out to a crt , the image will look much too dark. | 0.010337 | 13873 |
| 3 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 6987 |
| 4 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 12344 |
As you can see, a lack of gamma correction is the top answer.
## The `batch_size` Argument to `ask`

As of ktrain v0.22.x, the `ask` method uses `batch_size=8` by default, which means 8 question-document pairs are fed to the model at a time. Older versions of ktrain used a `batch_size` of 1. A `batch_size` of 8 speeds up answer retrieval. If you experience an Out-of-Memory (OOM) error, you can reduce the batch size by setting the `batch_size` argument to `ask` (e.g., `batch_size=1`). Reducing `batch_size` may also be beneficial if `ask` is being invoked on a CPU instead of a GPU.
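To make the batching behavior concrete, here is a generic sketch of how a `batch_size` of 8 groups question-document pairs before they are fed to the model. The `batch` helper is hypothetical and is not ktrain code:

```python
def batch(items, batch_size):
    """Yield successive chunks of at most batch_size items.

    A generic sketch (not ktrain code) of how batch_size groups
    question-document pairs before model inference.
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 20 candidate documents paired with one question, batched 8 at a time
pairs = [("When did the Cassini probe launch?", f"doc {i}") for i in range(20)]
print([len(b) for b in batch(pairs, batch_size=8)])  # → [8, 8, 4]
```

A larger batch trades memory for throughput, which is why lowering it to 1 is the usual fix for OOM errors.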
To deploy this system, the only state that needs to be persisted is the search index we initialized and populated in STEP 1. Once a search index is initialized and populated, one can simply re-run from STEP 2.
## `SimpleQA` as a Simple Search Engine

Once an index is created, `SimpleQA` can also be used as a conventional search engine to perform keyword searches using the `search` method:
qa.search(' "solar orbit" AND "battery power" ') # find documents that contain both these phrases
See the whoosh documentation for more information on query syntax.
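For intuition, the query above asks the engine for documents containing both phrases. Here is a pure-Python sketch of that AND semantics; the `match_all_phrases` helper is hypothetical and ignores the tokenization, stemming, and ranking that whoosh performs:

```python
def match_all_phrases(documents, phrases):
    """Return indices of documents containing every phrase.

    A pure-Python sketch of what '"solar orbit" AND "battery power"'
    asks the search engine to do; real whoosh queries also handle
    tokenization, stemming, and relevance ranking.
    """
    phrases = [p.lower() for p in phrases]
    return [i for i, doc in enumerate(documents)
            if all(p in doc.lower() for p in phrases)]

docs = [
    "the probe entered solar orbit running on battery power",
    "the probe entered solar orbit",
    "battery power was restored",
]
print(match_all_phrases(docs, ["solar orbit", "battery power"]))  # → [0]
```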
## The `index_from_folder` Method

Earlier, we mentioned that the `index_from_folder` method could be used to index documents of different file types (e.g., `.pdf`, `.docx`, `.ppt`, etc.). Here is a brief code example:
# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
multisegment=True, procs=4, # these args speed up indexing
breakup_docs=True) # speeds up answer retrieval
# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What is ktrain?', batch_size=8)
# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
# "ktrain is a low-code platform for machine learning"
In this notebook, we created and populated our own search index of documents. As mentioned above, ktrain uses whoosh internally for this. It is relatively easy to use the ktrain `qa` module with an existing, pre-populated search engine like Apache Solr or Elasticsearch. You can simply subclass the `QA` class and override the `search` method:
from ktrain.text.qa import QA

class MyCustomQA(QA):
    """
    Custom QA Module
    """
    def __init__(self,
                 bert_squad_model='bert-large-uncased-whole-word-masking-finetuned-squad',
                 bert_emb_model='bert-base-uncased'):
        """
        MyCustomQA constructor. Include other parameters as needed.
        Args:
          bert_squad_model(str): name of BERT SQuAD model to use
          bert_emb_model(str): BERT model to use to generate embeddings for semantic similarity
        """
        super().__init__(bert_squad_model=bert_squad_model, bert_emb_model=bert_emb_model)

    def search(self, query, limit=10, min_context_length=50):
        """
        search index for query
        Args:
          query(str): search query
          limit(int): number of top search results to return
        Returns:
          list of dicts with keys: reference, rawtext
        """
        # ADD CODE HERE TO QUERY YOUR SEARCH ENGINE
        # The query is the text of the question being asked.
        # This code should find documents that match words in the question
        # and return them as a list of dicts with keys: reference, rawtext.
If the back-end search engine is already populated with documents, you can now simply instantiate a `QA` object and invoke `ask` normally:
qa = MyCustomQA()
qa.ask('What is the best search engine?')
Note that, as mentioned above, this will work best when documents stored in the search engine are broken into smaller contexts (e.g., paragraphs), if they are not already.