#!/usr/bin/env python
# coding: utf-8

# In[1]:


get_ipython().run_line_magic('reload_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
get_ipython().run_line_magic('matplotlib', 'inline')
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


# ## Building an End-to-End Question-Answering System With BERT
#
# In this notebook, we build a practical, end-to-end Question-Answering (QA) system with BERT in roughly 3 lines of code. We will treat a corpus of text documents as a knowledge base to which we can ask questions and retrieve exact answers using [BERT](https://arxiv.org/abs/1810.04805). This goes beyond simplistic keyword searches.
#
# For this example, we will use the [20 Newsgroup dataset](http://qwone.com/~jason/20Newsgroups/) as the text corpus. As a collection of newsgroup postings containing an abundance of opinions and debates, the corpus is not ideal as a knowledge base; fact-based documents such as Wikipedia articles or news articles are better suited to that role. However, this dataset will suffice for this example.
#
# Let us begin by loading the dataset into an array using **scikit-learn** and importing *ktrain* modules.

# In[2]:


# load 20newsgroups dataset into an array
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
docs = newsgroups_train.data + newsgroups_test.data


# In[3]:


import ktrain
from ktrain.text.qa import SimpleQA


# ### STEP 1: Index the Documents
#
# We will first index the documents into a search engine that will be used to quickly retrieve documents likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist.
#
# Since the newsgroup postings are small and fit in memory, we will set `commit_every` to a large value to speed up the indexing process. This means results will not be written to the index until the end. If you experience issues, you can lower this value.

# In[4]:


INDEXDIR = '/tmp/myindex'


# In[5]:


SimpleQA.initialize_index(INDEXDIR)
SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs),
                         multisegment=True, procs=4,  # these args speed up indexing
                         breakup_docs=True            # this slows indexing but speeds up answer retrieval
                         )


# For document sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g., `.txt` files) by default. If your documents are in formats like `.pdf`, `.docx`, or `.pptx`, you can supply the `use_text_extraction=True` argument to `index_from_folder`, which will use the [textract](https://textract.readthedocs.io/en/stable/) package to extract text from different file types and index this text into the search engine for answer retrieval. You can also manually convert them to `.txt` files with `ktrain.text.textutils.extract_copy` or tools like [Apache Tika](https://tika.apache.org/) or [textract](https://textract.readthedocs.io/en/stable/).
#
# #### Speeding Up Indexing
#
# By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`), with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`). These values can be changed as arguments to `index_from_list` or `index_from_folder` to speed up indexing. See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/batch.html) for more information on these parameters and how to use them to speed up indexing. In this case, we've used `multisegment=True` and `procs=4`, as in the sketch below.
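# To illustrate, here is a minimal sketch of passing all three of these parameters at indexing time (the value `limitmb=512` is just an example and should be tuned to your hardware):
#
# ```python
# SimpleQA.index_from_list(docs, INDEXDIR,
#                          procs=4,            # number of indexing processes
#                          limitmb=512,        # example value: memory cap (in MB) per process
#                          multisegment=True,  # write separate segments rather than merging into one
#                          breakup_docs=True)
# ```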
# #### Speeding Up Answer Retrieval
#
# Note that larger documents will cause inferences in STEP 3 (see below) to be very slow. If your dataset consists of larger documents (e.g., long articles), we recommend breaking them up into pages (e.g., splitting the original PDF using something like `pdfseparate`) or splitting them into paragraphs (paragraphs are probably preferable). The latter can be done with *ktrain* using:
# ```python
# ktrain.text.textutils.paragraph_tokenize(document, join_sentences=True)
# ```
# If you supply `breakup_docs=True` in the cell above, this will be done automatically. Note that `breakup_docs=True` will slightly **slow indexing** (i.e., STEP 1) but **speed up answer retrieval** (i.e., STEP 3 below). A second way to speed up answer retrieval is to increase `batch_size` in STEP 3 if using a GPU, which will be discussed later.
#
# The above steps only need to be performed once. Once an index has been created, you can skip this step and proceed directly to **STEP 2** to begin using your system.

# ### STEP 2: Create a QA instance
#
# Next, we create a QA instance. This step will automatically download the BERT SQuAD model if it does not already exist on your system.

# In[6]:


qa = SimpleQA(INDEXDIR)


# That's it! In roughly **3 lines of code**, we have built an end-to-end QA system that can now be used to generate answers to questions. Let's ask our system some questions.

# ### STEP 3: Ask Questions
#
# We will invoke the `ask` method to issue questions to the text corpus we indexed and retrieve answers. We will also use the `qa.display` method to nicely display the top 5 results in this Jupyter notebook. The answers are inferred using a BERT model fine-tuned on the SQuAD dataset. The model will comb through paragraphs and sentences to find candidate answers. By default, `ask` currently uses a `batch_size` of 8; if necessary, you can experiment with lowering it by setting the `batch_size` parameter. On a CPU, for instance, you may want to try `batch_size=1`.
#
# Note also that the 20 Newsgroup dataset covers events in the early to mid 1990s, so references to more recent events will not exist.
#
# #### Space Question

# In[7]:


answers = qa.ask('When did the Cassini probe launch?')
qa.display_answers(answers[:5])


# As you can see, the top candidate answer indicates that the Cassini space probe was launched in October of 1997, which appears to be correct. The correct answer will not always be the top answer, but it is in this case.
#
# Note that, since we used `index_from_list` to index documents, the last column (i.e., **Document Reference**) shows the list index associated with the newsgroup posting containing the answer, which can be used to peruse the entire document containing the answer. If using `index_from_folder` to index documents, the last column will show the relative path and filename of the document. The **Document Reference** values can be customized by supplying a `references` parameter to `index_from_list`, as sketched below.
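# For instance, something like the following could be used to assign a human-readable label to each document (this sketch assumes `references` accepts a list of strings parallel to `docs`; the label format here is purely hypothetical):
#
# ```python
# refs = ['post-%05d' % i for i in range(len(docs))]  # hypothetical labels, one per document
# SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs), references=refs)
# ```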
# To see the text of the document that contains the top answer, uncomment and execute the following line (it's a comparatively long post).

# In[8]:


#print(docs[59])


# The 20 Newsgroup dataset also contains many posts discussing and debating religions like Christianity and Islam. Let's ask a question on this subject.
#
# #### Religious Question

# In[9]:


answers = qa.ask('Who was Muhammad?')
qa.display_answers(answers[:5])


# Here, we see different views on who Muhammad, the founder of Islam, was, as debated and discussed in this document set.
#
# Finally, the 20 Newsgroup dataset also contains many newsgroups about computing hardware and software. Let's ask a technical support question.
#
# #### Technical Question

# In[10]:


answers = qa.ask('What causes computer images to be too dark?')
qa.display_answers(answers[:5])


# As you can see, a lack of *gamma correction* is the top answer.
#
# ### The `batch_size` Argument to `ask`
#
# As of **ktrain v0.22.x**, the `ask` method uses `batch_size=8` by default, which means 8 question-document pairs are fed to the model at a time (older versions of **ktrain** used a `batch_size` of 1). A `batch_size` of 8 speeds up answer retrieval. If you experience an Out-of-Memory (OOM) error, you can reduce the batch size by setting the `batch_size` argument to `ask` (e.g., `batch_size=1`). Reducing `batch_size` may also be beneficial if `ask` is being invoked on a **CPU** instead of a **GPU**.
#
# ### Deploying the QA System
#
# To deploy this system, the only state that needs to be persisted is the search index we initialized and populated in **STEP 1**. Once a search index is initialized and populated, one can simply re-run from **STEP 2**.
#
# ### Using `SimpleQA` as a Simple Search Engine
#
# Once an index is created, `SimpleQA` can also be used as a conventional search engine to perform keyword searches using the `search` method:
#
# ```python
# qa.search(' "solar orbit" AND "battery power" ')  # find documents that contain both these phrases
# ```
#
# See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/querylang.html) for more information on query syntax.
#
# ### The `index_from_folder` method
#
# Earlier, we mentioned that the `index_from_folder` method can be used to index documents of different file types (e.g., `.pdf`, `.docx`, `.pptx`, etc.). Here is a brief code example:
#
# ```python
# # index documents of different types into a built-in search engine
# from ktrain.text.qa import SimpleQA
# INDEXDIR = '/tmp/myindex'
# SimpleQA.initialize_index(INDEXDIR)
# corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
# SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
#                            multisegment=True, procs=4,                      # these args speed up indexing
#                            breakup_docs=True)                               # speeds up answer retrieval
#
# # ask questions (setting a higher batch size can further speed up answer retrieval)
# qa = SimpleQA(INDEXDIR)
# answers = qa.ask('What is ktrain?', batch_size=8)
#
# # top answer snippet extracted from https://arxiv.org/abs/2004.10703:
# #   "ktrain is a low-code platform for machine learning"
# ```
#
# ### Connecting the QA System to an Existing Search Engine
#
# In this notebook, we created and populated our own search index of documents. As mentioned above, **ktrain** uses [whoosh](https://whoosh.readthedocs.io/en/latest/querylang.html) internally for this. It is relatively easy to use the **ktrain** `qa` module with an existing, pre-populated search engine like [Apache Solr](https://lucene.apache.org/solr/) or [Elasticsearch](https://github.com/elastic/elasticsearch).
# You can simply subclass the `QA` class and override the `search` method:
#
# ```python
# from ktrain.text.qa import QA
#
# class MyCustomQA(QA):
#     """
#     Custom QA Module
#     """
#     def __init__(self,
#                  bert_squad_model='bert-large-uncased-whole-word-masking-finetuned-squad',
#                  bert_emb_model='bert-base-uncased'):
#         """
#         MyCustomQA constructor.  Include other parameters as needed.
#         Args:
#           bert_squad_model(str): name of BERT SQuAD model to use
#           bert_emb_model(str): BERT model to use to generate embeddings for semantic similarity
#         """
#         super().__init__(bert_squad_model=bert_squad_model, bert_emb_model=bert_emb_model)
#
#     def search(self, query, limit=10, min_context_length=50):
#         """
#         search index for query
#         Args:
#           query(str): search query
#           limit(int): number of top search results to return
#         Returns:
#           list of dicts with keys: reference, rawtext
#         """
#         # ADD CODE HERE TO QUERY YOUR SEARCH ENGINE
#         # The query is the text of the question being asked.
#         # This code should find documents that match words in the question.
# ```
#
# If the back-end search engine is already populated with documents, you can now simply instantiate a `QA` object and invoke `ask` normally:
#
# ```python
# qa = MyCustomQA()
# qa.ask('What is the best search engine?')
# ```
#
# Note that, as mentioned above, this will work best when documents stored in the search engine are broken into smaller contexts (e.g., paragraphs), if they are not already. A concrete sketch of such a subclass appears below.
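# As an illustration, here is a hedged sketch of a subclass backed by Elasticsearch. It assumes the official `elasticsearch` Python client (8.x-style API), and the index name (`mydocs`) and document field (`rawtext`) are hypothetical placeholders for your own schema:
#
# ```python
# from elasticsearch import Elasticsearch  # assumes the official Python client is installed
# from ktrain.text.qa import QA
#
# class ElasticQA(QA):
#     """QA module that retrieves candidate documents from Elasticsearch (sketch)."""
#     def __init__(self, es_host='http://localhost:9200', es_index='mydocs',
#                  bert_squad_model='bert-large-uncased-whole-word-masking-finetuned-squad',
#                  bert_emb_model='bert-base-uncased'):
#         self.es = Elasticsearch(es_host)   # connect to the existing search engine
#         self.es_index = es_index           # hypothetical index name
#         super().__init__(bert_squad_model=bert_squad_model,
#                          bert_emb_model=bert_emb_model)
#
#     def search(self, query, limit=10, min_context_length=50):
#         # full-text match of the question against a hypothetical 'rawtext' field
#         res = self.es.search(index=self.es_index,
#                              query={'match': {'rawtext': query}},
#                              size=limit)
#         # convert Elasticsearch hits into the list of dicts that ktrain expects
#         return [{'reference': hit['_id'], 'rawtext': hit['_source']['rawtext']}
#                 for hit in res['hits']['hits']]
# ```
#
# With the back-end index already populated, `ElasticQA().ask(...)` can then be used exactly as shown above.

# In[ ]: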