#!/usr/bin/env python
# coding: utf-8

# In[1]:

get_ipython().run_line_magic('reload_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
get_ipython().run_line_magic('matplotlib', 'inline')

import os

# Make CUDA device numbering match nvidia-smi and restrict this notebook to GPU 0.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# # QA-Based Information Extraction
#
# The latest version of ktrain (v0.28.0), an open-source machine learning library, now includes
# a “universal” information extractor, which uses a Question-Answering model to extract any
# information of interest from documents.
#
# Suppose you have a table (e.g., an Excel spreadsheet) that looks like the DataFrame below.
# (In this example, each document is a single sentence, but each row can potentially be an
# entire report with many paragraphs.)

# In[2]:

# Sample documents: one row of the eventual DataFrame per entry.
texts = [
    'Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .',
    'There is a risk of Donald Trump running again in 2024.',
    """This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.""",
    """Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .""",
    'Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.',
    'His speciality is medical risk assessments, and he is 30 years old.',
    """Results: A total of nine studies including 356 patients were included in this study, the mean age was 52.4 years and 221 (62.1%) were male.""",
]

import pandas as pd

# Show full cell contents so long documents are not truncated in the notebook display.
pd.set_option("display.max_colwidth", None)
df = pd.DataFrame(texts, columns=['Text'])
df.head(10)

# Let's pretend your boss wants you to extract both the reported risk factors from each document
# and the sample sizes for the reported studies.
# This can easily be accomplished with the `AnswerExtractor` in **ktrain**, a kind of universal
# information extractor based on a Question-Answering model.

# In[3]:

from ktrain.text import AnswerExtractor

extractor = AnswerExtractor()

# Each (question, label) pair adds one extracted column to the returned DataFrame.
df = extractor.extract(
    df.Text.values,
    df,
    [('What are the risk factors?', 'Risk Factors'),
     ('How many individuals in sample?', 'Sample Size')],
)
df.head(10)

# As you can see, all that's required is that you phrase the type information you want to
# extract as a question (e.g., *What are the risk factors?*) and provide a label (e.g., *Risk
# Factors*). The above command will return a new DataFrame with additional columns containing
# the information of interest.

# ### Additional Examples
#
# QA-based information extraction is surprisingly versatile. Here, we use it to extract
# **URLs**, **dates**, and **amounts**.

# In[4]:

data = [
    "Closing price for Square on October 8th was $238.57, for details - https://finance.google.com",
    """The film "The Many Saints of Newark" was released on 10/01/2021.""",
    "Release delayed until the 1st of October due to COVID-19",
    "Price of Bitcoin fell to forty thousand dollars",
    "Documentation can be found at: amaiya.github.io/causalnlp",
]
df = pd.DataFrame(data, columns=['Text'])
df = extractor.extract(
    df.Text.values,
    df,
    [('What is the amount?', 'Amount'),
     ('What is the URL?', 'URL'),
     ('What is the date?', 'Date')],
)
df.head(10)

# For our last example, let's extract universities from a sample of the 20 Newsgroup dataset:

# In[5]:

# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)

# let's examine the first 10 posts
df = pd.DataFrame(train_b.data[:10], columns=['Text'])
df = extractor.extract(df.Text.values, df, [('What is the university?', 'University')])
df.head(10)

# ### Customizing the `AnswerExtractor` to Your Use Case
#
# If there are false positives (or false
# negatives), you can adjust the `min_conf` parameter (i.e., minimum confidence threshold) until
# you’re happy (default is `min_conf=6`). If `return_conf=True`, then columns showing the
# confidence scores of each extraction will also be included in the resultant DataFrame.
#
# If adjusting the confidence threshold is not sufficient to address the false positives and
# false negatives you're seeing, you can also try fine-tuning the QA model to your custom
# dataset by providing only a small handful of examples:
#
# **Example:**
# ```python
# data = [
#     {"question": "What is the URL?",
#      "context": "Closing price for Square on October 8th was $238.57, for details - https://finance.google.com",
#      "answers": "https://finance.google.com"},
#     {"question": "What is the URL?",
#      "context": "HTTP is a protocol for fetching resources.",
#      "answers": None},
# ]
# from ktrain.text import AnswerExtractor
# ae = AnswerExtractor(bert_squad_model='distilbert-base-cased-distilled-squad')
# ae.finetune(data)
# ```
#
# Note that, by default, the `AnswerExtractor` uses a `bert-large-*` model that requires a lot
# of memory to train. If fine-tuning, you may want to switch to a smaller model like DistilBERT,
# as shown in the example above.
#
# Finally, the `finetune` method accepts other parameters such as `batch_size` and
# `max_seq_length` that you can adjust depending on your speed requirements, dataset
# characteristics, and system resources.

# In[ ]: