%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
In this notebook, we build a practical, end-to-end Question-Answering (QA) system with BERT in roughly 3 lines of code. We will treat a corpus of text documents as a knowledge base to which we can pose questions and retrieve exact answers using BERT. This goes beyond simplistic keyword searches.
For this example, we will use the 20 Newsgroup dataset as the text corpus. As a collection of newsgroup postings containing an abundance of opinions and debates, the corpus is not ideal as a knowledge base. It is better to use fact-based documents such as Wikipedia articles or even news articles. However, this dataset will suffice for this example.
Let us begin by loading the dataset into an array using scikit-learn and importing ktrain modules.
# load the 20newsgroups dataset into an array
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
docs = newsgroups_train.data + newsgroups_test.data
import ktrain
from ktrain import text
We will first index the documents into a search engine that will be used to quickly retrieve documents that are likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist.
Since the newsgroup postings are small and fit in memory, we will set `commit_every` to a large value to speed up the indexing process. This means results will not be written to the index until the end. If you experience issues, you can lower this value.
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs),
multisegment=True, procs=4, # these args speed up indexing
breakup_docs=True # this slows indexing but speeds up answer retrieval
)
For document sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g., `.txt` files) by default. If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can supply the `use_text_extraction=True` argument to `index_from_folder`, which will use the textract package to extract text from different file types and index this text into the search engine for answer retrieval. You can also manually convert them to `.txt` files with `ktrain.text.textutils.extract_copy` or tools like Apache Tika or textract.
By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`), with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`). These values can be changed as arguments to `index_from_list` or `index_from_folder` to speed up indexing. See the whoosh documentation for more information on these parameters and how to use them. In this case, we've used `multisegment=True` and `procs=4`.
Note that larger documents will cause inferences in STEP 3 (see below) to be very slow. If your dataset consists of larger documents (e.g., long articles), we recommend breaking them up into pages (e.g., splitting the original PDF with something like pdfseparate) or into paragraphs (paragraphs are probably preferable). The latter can be done with ktrain using:

`ktrain.text.textutils.paragraph_tokenize(document, join_sentences=True)`

If you supply `breakup_docs=True` in the cell above, this will be done automatically. Note that `breakup_docs=True` will slightly slow indexing (i.e., STEP 1) but speed up answer retrieval (i.e., STEP 3 below). A second way to speed up answer retrieval is to increase `batch_size` in STEP 3 if using a GPU, which will be discussed later.
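To make the effect of paragraph splitting concrete, here is a minimal, stdlib-only sketch of the kind of splitting described above. It is not ktrain's implementation of `paragraph_tokenize`; the function name and the blank-line heuristic are assumptions for illustration:

```python
import re

def split_into_paragraphs(document: str) -> list[str]:
    """Split a document on blank lines and normalize whitespace.

    A stdlib-only sketch of the kind of paragraph splitting that
    paragraph_tokenize performs; NOT ktrain's actual implementation.
    """
    # Treat runs of one or more blank lines as paragraph boundaries.
    raw_paragraphs = re.split(r"\n\s*\n", document)
    # Join each paragraph's lines into a single string (analogous to
    # join_sentences=True) and drop empty chunks.
    return [" ".join(p.split()) for p in raw_paragraphs if p.strip()]

doc = "First paragraph.\nStill first.\n\nSecond paragraph."
print(split_into_paragraphs(doc))
# → ['First paragraph. Still first.', 'Second paragraph.']
```

Each resulting paragraph can then be indexed as its own, smaller document.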
The above steps need only be performed once. Once an index has been created, you can skip this step and proceed directly to STEP 2 to begin using your system.
Next, we create a QA instance. This step will automatically download the BERT SQuAD model if it does not already exist on your system.
qa = text.SimpleQA(INDEXDIR)
That's it! In roughly 3 lines of code, we have built an end-to-end QA system that can now be used to generate answers to questions. Let's ask our system some questions.
We will invoke the `ask` method to issue questions to the text corpus we indexed and retrieve answers. We will also use the `qa.display_answers` method to nicely display the top 5 results in this Jupyter notebook. The answers are inferred using a BERT model fine-tuned on the SQuAD dataset. The model will comb through paragraphs and sentences to find candidate answers. By default, `ask` currently uses a `batch_size` of 8, but, if necessary, you can experiment with lowering it by setting the `batch_size` parameter. On a CPU, for instance, you may want to try `batch_size=1`.
Note also that the 20 Newsgroup Dataset covers events in the early to mid 1990s, so references to recent events will not exist.
answers = qa.ask('When did the Cassini probe launch?')
qa.display_answers(answers[:5])
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | in october of 1997 | cassini is scheduled for launch aboard a titan iv / centaur in october of 1997 . | 0.819032 | 59 |
| 1 | on january 26,1962 | ranger 3, launched on january 26,1962 , was intended to land an instrument capsule on the surface of the moon, but problems during the launch caused the probe to miss the moon and head into solar orbit. | 0.151229 | 8525 |
| 2 | - 10 / 06 / 97 | key scheduled dates for the cassini mission (vvejga trajectory) 10 / 06 / 97 - titan iv / centaur launch 04 / 21 / 98 - venus 1 gravity assist 06 / 20 / 99 - venus 2 gravity assist 08 / 16 / 99 - earth gravity assist 12 / 30 / 00 - jupiter gravity assist 06 / 25 / 04 - saturn arrival 01 / 09 / 05 - titan probe release 01 / 30 / 05 - titan probe entry 06 / 25 / 08 - end of primary mission (schedule last updated 7 / 22 / 92) - 10 / 06 / 97 | 0.029694 | 59 |
| 3 | * 98 | cassini * * * * * * * * * * * * * * * * * * 98 ,115 * * * * | 0.000026 | 5356 |
| 4 | the latter part of the 1990s | scheduled for launch in the latter part of the 1990s , the craf and cassini missions are a collaborative project of nasa, the european space agency and the federal space agencies of germany and italy, as well as the united states air force and the department of energy. | 0.000017 | 18684 |
As you can see, the top candidate answer indicates that the Cassini space probe was launched in October of 1997, which appears to be correct. The correct answer will not always be the top answer, but it is in this case.
Note that, since we used `index_from_list` to index documents, the last column (i.e., Document Reference) shows the list index associated with the newsgroup posting containing the answer, which can be used to peruse the entire document containing the answer. If using `index_from_folder` to index documents, the last column will show the relative path and filename of the document. The Document Reference values can be customized by supplying a `references` parameter to `index_from_list`.
To see the text of the document that contains the top answer, uncomment and execute the following line (it's a comparatively long post).
#print(docs[59])
The 20 Newsgroup dataset contains lots of posts discussing and debating religions like Christianity and Islam, as well. Let's ask a question on this subject.
answers = qa.ask('Who was Muhammad?')
qa.display_answers(answers[:5])
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | the holy prophet of islam | just a small reminder to all my muslim brothers, did _ ever _ the holy prophet of islam (muhammad pbuh), say to anyone who called himself a muslim : | 0.762464 | 1278 |
| 1 | the messenger of allah | muhammad is the messenger of allah , and those who are with him are firm against the unbelievers and merciful among each other. | 0.201862 | 4876 |
| 2 | is the last prophet of islam | muhammad peace and blessings of allah be upon him (saw) is the last prophet of islam . | 0.035121 | 4640 |
| 3 | either a liar, or he was crazy (a modern day mad mahdi) or he was actually who he said he was | the book says that muhammad was either a liar, or he was crazy (a modern day mad mahdi) or he was actually who he said he was . | 0.000293 | 4934 |
| 4 | [ mahound ' s | muhammad ' s [ mahound ' s ] integrity is not really impugned in this part of the story, and there ' s no reason to think this was rushdie ' s intent : gibreel, as the archangel, produces the verses (divine and satanic), though he does not know their provenance. | 0.000138 | 15852 |
Here, we see different views on who Muhammad, the founder of Islam, was, as debated and discussed in this document set.
Finally, the 20 Newsgroup dataset also contains many groups about computing hardware and software. Let's ask a technical support question.
answers = qa.ask('What causes computer images to be too dark?')
qa.display_answers(answers[:5])
| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | if your viewer does not do gamma correction | if your viewer does not do gamma correction , then linear images will look too dark, and gamma corrected images will ok. | 0.937990 | 13873 |
| 1 | is gamma correction | this, is gamma correction (or the lack of it). | 0.045165 | 13873 |
| 2 | so if you just dump your nice linear image out to a crt | so if you just dump your nice linear image out to a crt , the image will look much too dark. | 0.010337 | 13873 |
| 3 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 6987 |
| 4 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 12344 |
As you can see, a lack of gamma correction is the top answer.
## The `batch_size` Argument to `ask`

As of ktrain v0.22.x, the `ask` method uses `batch_size=8` by default, which means 8 question-document pairs are fed to the model at a time. Older versions of ktrain used a `batch_size` of 1. A `batch_size` of 8 speeds up answer retrieval. If you experience an Out-of-Memory (OOM) error, you can reduce the batch size by setting the `batch_size` argument to `ask` (e.g., `batch_size=1`). Reducing `batch_size` may also be beneficial if `ask` is being invoked on a CPU instead of a GPU.
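To make the batching behavior concrete, here is a generic sketch of how a `batch_size` of 8 groups question-document pairs before they are fed to the model. The `batch` helper is hypothetical and is not ktrain code:

```python
def batch(items, batch_size):
    """Yield successive chunks of at most batch_size items.

    A generic sketch (not ktrain code) of how batch_size groups
    question-document pairs before model inference.
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 20 candidate documents paired with one question, batched 8 at a time
pairs = [("When did the Cassini probe launch?", f"doc {i}") for i in range(20)]
print([len(b) for b in batch(pairs, batch_size=8)])  # → [8, 8, 4]
```

A larger batch trades memory for throughput, which is why lowering it to 1 is the usual fix for OOM errors.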
To deploy this system, the only state that needs to be persisted is the search index we initialized and populated in STEP 1. Once a search index is initialized and populated, one can simply re-run from STEP 2.
## `SimpleQA` as a Simple Search Engine

Once an index is created, `SimpleQA` can also be used as a conventional search engine to perform keyword searches using the `search` method:
qa.search(' "solar orbit" AND "battery power" ') # find documents that contain both these phrases
See the whoosh documentation for more information on query syntax.
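For intuition, the query above asks the engine for documents containing both phrases. Here is a pure-Python sketch of that AND semantics; the `match_all_phrases` helper is hypothetical and ignores the tokenization, stemming, and ranking that whoosh performs:

```python
def match_all_phrases(documents, phrases):
    """Return indices of documents containing every phrase.

    A pure-Python sketch of what '"solar orbit" AND "battery power"'
    asks the search engine to do; real whoosh queries also handle
    tokenization, stemming, and relevance ranking.
    """
    phrases = [p.lower() for p in phrases]
    return [i for i, doc in enumerate(documents)
            if all(p in doc.lower() for p in phrases)]

docs = [
    "the probe entered solar orbit running on battery power",
    "the probe entered solar orbit",
    "battery power was restored",
]
print(match_all_phrases(docs, ["solar orbit", "battery power"]))  # → [0]
```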
## The `index_from_folder` Method

Earlier, we mentioned that the `index_from_folder` method could be used to index documents of different file types (e.g., `.pdf`, `.docx`, `.ppt`, etc.). Here is a brief code example:
# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
multisegment=True, procs=4, # these args speed up indexing
breakup_docs=True) # speeds up answer retrieval
# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What is ktrain?', batch_size=8)
# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
# "ktrain is a low-code platform for machine learning"
In this notebook, we created and populated our own search index of documents. As mentioned above, ktrain uses whoosh internally for this. It is relatively easy to use the ktrain `qa` module with an existing, pre-populated search engine like Apache Solr or Elasticsearch. You can simply subclass the `QA` class and override the `search` method:
from ktrain.text.qa import QA

class MyCustomQA(QA):
    """
    Custom QA Module
    """
    def __init__(self,
                 bert_squad_model='bert-large-uncased-whole-word-masking-finetuned-squad',
                 bert_emb_model='bert-base-uncased'):
        """
        MyCustomQA constructor. Include other parameters as needed.
        Args:
          bert_squad_model(str): name of BERT SQuAD model to use
          bert_emb_model(str): BERT model to use to generate embeddings for semantic similarity
        """
        super().__init__(bert_squad_model=bert_squad_model, bert_emb_model=bert_emb_model)

    def search(self, query, limit=10, min_context_length=50):
        """
        search index for query
        Args:
          query(str): search query
          limit(int): number of top search results to return
        Returns:
          list of dicts with keys: reference, rawtext
        """
        # ADD CODE HERE TO QUERY YOUR SEARCH ENGINE
        # The query is the text of the question being asked.
        # This code should find documents that match words in the question
        # and return them as a list of dicts with keys: reference, rawtext.
If the back-end search engine is already populated with documents, you can now simply instantiate a `QA` object and invoke `ask` normally:
qa = MyCustomQA()
qa.ask('What is the best search engine?')
Note that, as mentioned above, this will work best when documents stored in the search engine are broken into smaller contexts (e.g., paragraphs), if they are not already.