In [1]:
%reload_ext autoreload
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0";


# QA-Based Information Extraction

As of v0.28.x, ktrain includes a “universal” information extractor that uses a Question-Answering model to extract information of interest from documents.

Suppose you have a table (e.g., an Excel spreadsheet) that looks like the DataFrame below. (In this example, each document is a single sentence, but each row can potentially be an entire report with many paragraphs.)

In [2]:
data = [
'Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .',
'There is a risk of Donald Trump running again in 2024.',
"""This risk was consistent across patients stratified by history of CVD, risk factors
but no CVD, and neither CVD nor risk factors.""",
"""Risk factors associated with subsequent death include older age, hypertension, diabetes,
ischemic heart disease, obesity and chronic lung disease; however, sometimes
there are no obvious risk factors .""",
'Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.',
'His speciality is medical risk assessments, and he is 30 years old.',
"""Results: A total of nine studies including 356 patients were included in this study,
the mean age was 52.4 years and 221 (62.1%) were male."""]
import pandas as pd
pd.set_option("display.max_colwidth", None)
df = pd.DataFrame(data, columns=['Text'])

Out[2]:
Text
0 Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .
1 There is a risk of Donald Trump running again in 2024.
2 This risk was consistent across patients stratified by history of CVD, risk factors \nbut no CVD, and neither CVD nor risk factors.
3 Risk factors associated with subsequent death include older age, hypertension, diabetes, \nischemic heart disease, obesity and chronic lung disease; however, sometimes \nthere are no obvious risk factors .
4 Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.
5 His speciality is medical risk assessments, and he is 30 years old.
6 Results: A total of nine studies including 356 patients were included in this study, \nthe mean age was 52.4 years and 221 (62.1%) were male.

Let's pretend your boss wants you to extract both the reported risk factors from each document and the sample sizes for the reported studies. This can easily be accomplished with the AnswerExtractor in ktrain, a kind of universal information extractor based on a Question-Answering model.

In [3]:
from ktrain.text.qa import AnswerExtractor
ae = AnswerExtractor()  # create the extractor with its default QA model
df = ae.extract(df.Text.values, df, [('What are the risk factors?', 'Risk Factors'),
('How many individuals in sample?', 'Sample Size')])

Out[3]:
Text Risk Factors Sample Size
0 Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . sex, obesity, genetic factors and mechanical factors None
1 There is a risk of Donald Trump running again in 2024. None None
2 This risk was consistent across patients stratified by history of CVD, risk factors \nbut no CVD, and neither CVD nor risk factors. neither CVD nor risk factors None
3 Risk factors associated with subsequent death include older age, hypertension, diabetes, \nischemic heart disease, obesity and chronic lung disease; however, sometimes \nthere are no obvious risk factors . older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease None
4 Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. sex (male), age (≥60), and severe pneumonia None
5 His speciality is medical risk assessments, and he is 30 years old. None None
6 Results: A total of nine studies including 356 patients were included in this study, \nthe mean age was 52.4 years and 221 (62.1%) were male. None 356

As you can see, all that's required is that you phrase the type of information you want to extract as a question (e.g., What are the risk factors?) and provide a label (e.g., Risk Factors). The above command returns a new DataFrame with additional columns containing the information of interest.
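To make the question-and-label pattern concrete, here is a minimal sketch of how each (question, label) pair can become a new column. It does not use ktrain: `extract_columns` and `toy_answer` are hypothetical names, and `toy_answer` is a regex stand-in for a real QA model, used only so the example runs anywhere.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

def extract_columns(
    texts: List[str],
    question_label_pairs: List[Tuple[str, str]],
    answer_fn: Callable[[str, str], Optional[str]],
) -> List[Dict[str, Optional[str]]]:
    """Ask every question of every text and store each answer under its label."""
    rows = []
    for text in texts:
        row: Dict[str, Optional[str]] = {"Text": text}
        for question, label in question_label_pairs:
            row[label] = answer_fn(question, text)
        rows.append(row)
    return rows

def toy_answer(question: str, context: str) -> Optional[str]:
    """Hypothetical stand-in for a QA model: answers two fixed questions by regex."""
    patterns = {
        "What is the URL?": r"https?://\S+",
        "What is the year?": r"\b(?:19|20)\d{2}\b",
    }
    pattern = patterns.get(question)
    match = re.search(pattern, context) if pattern else None
    return match.group(0) if match else None

texts = [
    "For details - https://finance.yahoo.com",
    "There is a risk of Donald Trump running again in 2024.",
]
rows = extract_columns(
    texts,
    [("What is the URL?", "URL"), ("What is the year?", "Year")],
    toy_answer,
)
for row in rows:
    print(row)
```

A real QA model replaces the regex lookup, which is what lets the same interface handle arbitrary questions; documents with no answer get None in that column, as in the tables above.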

QA-based information extraction is surprisingly versatile. Here, we use it to extract URLs, dates, and amounts.

In [4]:
data = ["Closing price for Square on October 8th was $238.57, for details - https://finance.yahoo.com",
"""The film "The Many Saints of Newark" was released on 10/01/2021.""",
"Release delayed until the 1st of October due to COVID-19",
"Price of Bitcoin fell to forty thousand dollars",
"Documentation can be found at: amaiya.github.io/causalnlp",
]
df = pd.DataFrame(data, columns=['Text'])
df = ae.extract(df.Text.values, df, [('What is the amount?', 'Amount'),
('What is the URL?', 'URL'),
('What is the date?', 'Date')])
df.head(10)

Out[4]:
Text Amount URL Date
0 Closing price for Square on October 8th was $238.57, for details - https://finance.yahoo.com 238.57 https://finance.yahoo.com October 8th
1 The film "The Many Saints of Newark" was released on 10/01/2021. None None 10/01/2021
2 Release delayed until the 1st of October due to COVID-19 None None 1st of October
3 Price of Bitcoin fell to forty thousand dollars forty thousand dollars None None
4 Documentation can be found at: amaiya.github.io/causalnlp None amaiya.github.io/causalnlp None

For our last example, let's extract universities from a sample of the 20 Newsgroups dataset:

In [5]:
# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
df = pd.DataFrame(train_b.data[:10], columns=['Text']) # let's examine the first 10 posts
df = ae.extract(df.Text.values, df, [('What is the university?', 'University')])

Out[5]:
Text University
0 From: [email protected] (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format. We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance. Michael.\n-- \nMichael Collier (Programmer) The Computer Unit,\nEmail: [email protected] The City University,\nTel: 071 477-8000 x3769 London,\nFax: 071 477-8565 EC1V 0HB.\n The City University
1 From: [email protected] (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the problem:\n\tI have a rectangular mesh in the uv domain, i.e the mesh is a \n\tmapping of a 3d Bezier patch into 2d. The area in this domain\n\twhich is inside a trimming loop had to be rendered. The trimming\n\tloop is a set of 2d Bezier curve segments.\n\tFor the sake of notation: the mesh is made up of cells.\n\n\tMy problem is this :\n\tThe trimming area has to be split up into individual smaller\n\tcells bounded by the trimming curve segments. If a cell\n\tis wholly inside the area...then it is output as a whole ,\n\telse it is trivially rejected. \n\n\tDoes any body know how thiss can be done, or is there any algo. \n\tsomewhere for doing this.\n\n\tAny help would be appreciated.\n\n\tThanks, \n\tAni.\n-- \nTo get irritated is human, to stay cool, divine.\n University Of Kentucky
3 From: [email protected] (M.M. Zwart)\nSubject: catholic church poland\nOrganization: Faculteit der Letteren, Rijksuniversiteit Groningen, NL\nLines: 10\n\nHello,\n\nI'm writing a paper on the role of the catholic church in Poland after 1989. \nCan anyone tell me more about this, or fill me in on recent books/articles(\nin english, german or french). Most important for me is the role of the \nchurch concerning the abortion-law, religious education at schools,\nbirth-control and the relation church-state(government). Thanx,\n\n Masja,\n"M.M.Zwart"<[email protected]>\n Rijksuniversiteit Groningen
4 From: [email protected]sc.ncr.com (stanly)\nSubject: Re: Elder Brother\nOrganization: NCR Corp., Columbia SC\nLines: 15\n\nIn article <[email protected]> [email protected] writes:\n>In article <[email protected]> [email protected]\n>Matt. 22:9-14 'Go therefore to the main highways, and as many as you find\n>there, invite to the wedding feast.'...\n\n>hmmmmmm. Sounds like your theology and Christ's are at odds. Which one am I \n>to believe?\n\nIn this parable, Jesus tells the parable of the wedding feast. "The kingdom\nof heaven is like unto a certain king which made a marriage for his son".\nSo the wedding clothes were customary, and "given" to those who "chose" to\nattend. This man "refused" to wear the clothes. The wedding clothes are\nequalivant to the "clothes of righteousness". When Jesus died for our sins,\nthose "clothes" were then provided. Like that man, it is our decision to\nput the clothes on.\n rutgers
5 From: [email protected] (Virgilio (Dean) B. Velasco Jr.)\nSubject: Re: The arrogance of Christians\nOrganization: Case Western Reserve Univ. Cleveland, Ohio (USA)\nLines: 28\n\nIn article <[email protected]> [email protected] (Steve Hayes) writes:\n\n>A similar analogy might be a medical doctor who believes that a blood \n>transfusion is necessary to save the life of a child whose parents are \n>Jehovah's Witnesses and so have conscientious objections to blood \n>transfusion. The doctor's efforts to persuade them to agree to a blood \n>transfusion could be perceived to be arrogant in precisely the same way as \n>Christians could be perceived to be arrogant.\n\n>The truth or otherwise of the belief that a blood transfusion is necessary \n>to save the life of the child is irrelevant here. What matters is that the \n>doctor BELIEVES it to be true, and could be seen to be trying to foce his \n>beliefs on the parents, and this could well be perceived as arrogance.\n\nLet me carry that a step further. Most doctors would not claim to be \ninfallible. Indeed, they would generally admit that they could conceivably\nbe wrong, e.g. that in this case, a blood tranfusion might not turn out to \nbe necessary after all. However, the doctors would have enough confidence\nand conviction to claim, out of genuine concern, that is IS necessary. As\nfallible human beings, they must acknowledge the possibility that they are\nwrong. However, they would also say that such doubts are not reasonable,\nand stand by their convictions.\n\n-- \nVirgilio "Dean" Velasco Jr, Department of Electrical Eng'g and Applied Physics \n\t CWRU graduate student, roboticist-in-training and Q wannabee\n "Bullwinkle, that man's intimidating a referee!" | My boss is a \n "Not very well. He doesn't look like one at all!" | Jewish carpenter.\n Case Western Reserve Univ. Cleveland, Ohio
6 From: [email protected] (joseph dale fisher)\nSubject: Re: anger\nOrganization: Indiana University\nLines: 34\n\nIn article <[email protected]> [email protected] writes:\n>>Paul Conditt writes:\n[insert deletion of Paul's and Aaron's discourse on anger, ref Galatians\n5:19-20]\n>\n>I don't know why it is so obvious. We are not speaking of acts of the \n>flesh. We are just speaking of emotions. Emotions are not of themselves\n>moral or immoral, good or bad. Emotions just are. The first step is\n>not to label his emotion as good or bad or to numb ourselves so that\n>we hide our true feelings, it is to accept ourselves as we are, as God\n>accepts us. \n\nOh, but they definitely can be. Please look at Colossians 3:5-10 and\nEphesians 4:25-27. Emotions can be controlled and God puts very strong\nemphasis on self-control, otherwise, why would he have Paul write to\nTimothy so much about making sure to teach self-control? \n\n[insert deletion of remainder of paragraph]\n\n>\n>Re-think it, Aaron. Don't be quick to judge. He has forgiven those with\n>AIDS, he has dealt with and taken responsibility for his feelings and made\n>appropriate choices for action on such feelings. He has not given in to\n>his anger.\n\nPlease, re-think and re-read for yourself, Joe. Again, the issue is\nself-control especially over feelings and actions, for our actions stem\nfrom our feelings in many instances. As for God giving in to his anger,\nthat comes very soon.\n\n>\n>Joe Moore\n\nJoe Fisher\n None
8 From: [email protected] (Gordon Banks)\nSubject: Re: Blindsight\nReply-To: [email protected] (Gordon Banks)\nOrganization: Univ. of Pittsburgh Computer Science\nLines: 18\n\nIn article <[email protected]> [email protected] (John Werner) writes:\n>In article <[email protected]tt.UUCP>, [email protected] (Gordon Banks) wrote:\n>> \n>> Explain. I thought there were 3 types of cones, equivalent to RGB.\n>\n>You're basically right, but I think there are just 2 types. One is\n>sensitive to red and green, and the other is sensitive to blue and yellow. \n>This is why the two most common kinds of color-blindness are red-green and\n>blue-yellow.\n>\n\nYes, I remember that now. Well, in that case, the cones are indeed\ncolor sensitive, contrary to what the original respondent had claimed.\n-- \n----------------------------------------------------------------------------\nGordon Banks N3JXP | "Skepticism is the chastity of the intellect, and\[email protected] | it is shameful to surrender it too soon." \n----------------------------------------------------------------------------\n Univ. of Pittsburgh

### Customizing the AnswerExtractor to Your Use Case

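If there are false positives (or false negatives), you can adjust the min_conf parameter (i.e., the minimum confidence threshold) until you’re happy (default is min_conf=6). Raising the threshold discards more low-confidence answers and so reduces false positives; lowering it keeps more answers and so reduces false negatives. If return_conf=True, then columns showing the confidence scores of each extraction will also be included in the resultant DataFrame.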
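The effect of the threshold can be illustrated with a small standalone sketch. This is not ktrain's internal code: `keep_if_confident` is a hypothetical helper, the confidence scores are made up, and the choice of `>=` for the comparison is an assumption.

```python
from typing import Optional

def keep_if_confident(answer: str, conf: float, min_conf: float = 6) -> Optional[str]:
    """Return the answer only when its confidence reaches the threshold."""
    # Assumption: an answer exactly at the threshold is kept.
    return answer if conf >= min_conf else None

# Hypothetical (answer, confidence) pairs; the scores are invented for illustration.
candidates = [("https://finance.yahoo.com", 11.2), ("2024", 3.1), ("356", 7.5)]

for threshold in (2, 6, 10):
    kept = [keep_if_confident(a, c, min_conf=threshold) for a, c in candidates]
    print(threshold, kept)
```

With the real extractor, the same trade-off is explored by passing different min_conf values to extract, and return_conf=True helps pick a sensible value by showing the scores next to each extraction.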

If adjusting the confidence threshold is not sufficient to address the false positives and false negatives you're seeing, you can also try fine-tuning the QA model on your custom dataset by providing only a small handful of examples:

Example:

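data = [
{"question": "What is the URL?",
"context": "Closing price for Square on October 8th was $238.57, for details - https://finance.yahoo.com",
"answers": "https://finance.yahoo.com"},
{"question": "What is the URL?",
"context": "HTTP is a protocol for fetching resources.",
"answers": None},  # this context contains no URL
]
ae = AnswerExtractor(bert_squad_model='distilbert-base-cased-distilled-squad')  # smaller DistilBERT model
ae.finetune(data)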

Note that, by default, the AnswerExtractor uses a bert-large-* model that requires a lot of memory to train. If fine-tuning, you may want to switch to a smaller model like DistilBERT, as shown in the example above.
Finally, the finetune method accepts other parameters such as batch_size and max_seq_length that you can adjust depending on your speed requirements, dataset characteristics, and system resources.
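
Before fine-tuning, it can help to sanity-check the examples. The sketch below is a standalone helper, not part of ktrain: `check_examples` is a hypothetical name, and the "answers" key is an assumption about the example format. It relies on the fact that an extractive QA model selects a span of the context, so a non-empty answer should appear verbatim in its context.

```python
from typing import Any, Dict, List

def check_examples(examples: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the examples look consistent."""
    problems = []
    for i, ex in enumerate(examples):
        # Assumed format: each example has these three keys.
        missing = [k for k in ("question", "context", "answers") if k not in ex]
        if missing:
            problems.append(f"example {i}: missing keys {missing}")
            continue
        answer = ex["answers"]
        # None marks a context that contains no answer; otherwise the answer
        # must be a substring of the context.
        if answer is not None and answer not in ex["context"]:
            problems.append(f"example {i}: answer not found in context")
    return problems

examples = [
    {"question": "What is the URL?",
     "context": "For details - https://finance.yahoo.com",
     "answers": "https://finance.yahoo.com"},
    {"question": "What is the URL?",
     "context": "HTTP is a protocol for fetching resources.",
     "answers": None},
    {"question": "What is the URL?",
     "context": "Documentation can be found at: amaiya.github.io/causalnlp",
     "answers": "https://amaiya.github.io/causalnlp"},  # deliberately inconsistent
]
problems = check_examples(examples)
print(problems)
```

Here the third example is flagged because its answer adds a scheme that does not occur in the context.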