In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
# Order GPUs by PCI bus ID and expose only GPU 0 to this notebook
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

QA-Based Information Extraction

As of v0.28.x, ktrain includes a “universal” information extractor, which uses a Question-Answering model to extract information of interest from documents.

Suppose you have a table (e.g., an Excel spreadsheet) that looks like the DataFrame below. (In this example, each document is a single sentence, but each row can potentially be an entire report with many paragraphs.)

In [2]:
data = [
'Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .',
'There is a risk of Donald Trump running again in 2024.',
"""This risk was consistent across patients stratified by history of CVD, risk factors 
but no CVD, and neither CVD nor risk factors.""",
"""Risk factors associated with subsequent death include older age, hypertension, diabetes, 
ischemic heart disease, obesity and chronic lung disease; however, sometimes 
there are no obvious risk factors .""",
'Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.',
'His speciality is medical risk assessments, and he is 30 years old.',
"""Results: A total of nine studies including 356 patients were included in this study, 
the mean age was 52.4 years and 221 (62.1%) were male."""]
import pandas as pd
pd.set_option("display.max_colwidth", None)
df = pd.DataFrame(data, columns=['Text'])
df.head(10)
Out[2]:
Text
0 Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .
1 There is a risk of Donald Trump running again in 2024.
2 This risk was consistent across patients stratified by history of CVD, risk factors \nbut no CVD, and neither CVD nor risk factors.
3 Risk factors associated with subsequent death include older age, hypertension, diabetes, \nischemic heart disease, obesity and chronic lung disease; however, sometimes \nthere are no obvious risk factors .
4 Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.
5 His speciality is medical risk assessments, and he is 30 years old.
6 Results: A total of nine studies including 356 patients were included in this study, \nthe mean age was 52.4 years and 221 (62.1%) were male.

Let's pretend your boss wants you to extract both the reported risk factors from each document and the sample sizes for the reported studies. This can easily be accomplished with the AnswerExtractor in ktrain, a kind of universal information extractor based on a Question-Answering model.

In [3]:
from ktrain.text.qa import AnswerExtractor
ae = AnswerExtractor()
df = ae.extract(df.Text.values, df, [('What are the risk factors?', 'Risk Factors'), 
                                     ('How many individuals in sample?', 'Sample Size')])
df.head(10)
Out[3]:
Text Risk Factors Sample Size
0 Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . sex, obesity, genetic factors and mechanical factors None
1 There is a risk of Donald Trump running again in 2024. None None
2 This risk was consistent across patients stratified by history of CVD, risk factors \nbut no CVD, and neither CVD nor risk factors. neither CVD nor risk factors None
3 Risk factors associated with subsequent death include older age, hypertension, diabetes, \nischemic heart disease, obesity and chronic lung disease; however, sometimes \nthere are no obvious risk factors . older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease None
4 Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. sex (male), age (≥60), and severe pneumonia None
5 His speciality is medical risk assessments, and he is 30 years old. None None
6 Results: A total of nine studies including 356 patients were included in this study, \nthe mean age was 52.4 years and 221 (62.1%) were male. None 356

As you can see, all that's required is that you phrase the type of information you want to extract as a question (e.g., What are the risk factors?) and provide a label (e.g., Risk Factors). The above command returns a new DataFrame with one additional column per label, each containing the extracted information (or None when nothing was extracted).
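To make the question/label mechanics concrete, here is a minimal sketch of how (question, label) pairs turn into new columns. It does not use ktrain: `toy_answer`, `TOY_PATTERNS`, and `extract_columns` are hypothetical names, and the regex lookup is only a stand-in for the Question-Answering model that the AnswerExtractor actually runs.

```python
import re

# Hypothetical stand-in for a QA model: each question maps to a regex, and the
# "answer" is the first captured group, or None when nothing matches.
# A real extractor would run a Question-Answering model here instead.
TOY_PATTERNS = {
    "How many individuals in sample?": re.compile(r"(\d+) patients"),
    "What is the age?": re.compile(r"(\d+) years old"),
}

def toy_answer(question, text):
    pattern = TOY_PATTERNS.get(question)
    match = pattern.search(text) if pattern else None
    return match.group(1) if match else None

def extract_columns(texts, question_label_pairs, answer_fn=toy_answer):
    """Return one dict per text, with one new key per (question, label) pair."""
    rows = []
    for text in texts:
        row = {"Text": text}
        for question, label in question_label_pairs:
            # The label becomes the column name; the answer becomes the cell value.
            row[label] = answer_fn(question, text)
        rows.append(row)
    return rows

texts = [
    "His speciality is medical risk assessments, and he is 30 years old.",
    "A total of nine studies including 356 patients were included in this study.",
]
pairs = [("How many individuals in sample?", "Sample Size"),
         ("What is the age?", "Age")]
rows = extract_columns(texts, pairs)
for row in rows:
    print(row["Sample Size"], row["Age"])
```

Each label appears as a key in every row, so documents without an answer simply hold None under that label, which mirrors the shape of the DataFrame shown in Out[3].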

Additional Examples

QA-based information extraction is surprisingly versatile. Here, we use it to extract URLs, dates, and amounts.

In [4]:
data = ["Closing price for Square on October 8th was $238.57, for details - https://finance.yahoo.com",
        """The film "The Many Saints of Newark" was released on 10/01/2021.""",
           "Release delayed until the 1st of October due to COVID-19",
           "Price of Bitcoin fell to forty thousand dollars",
           "Documentation can be found at: amaiya.github.io/causalnlp",
]
df = pd.DataFrame(data, columns=['Text'])
df = ae.extract(df.Text.values, df, [('What is the amount?', 'Amount'),
                                     ('What is the URL?', 'URL'), 
                                     ('What is the date?', 'Date')])
df.head(10)
Out[4]:
Text Amount URL Date
0 Closing price for Square on October 8th was $238.57, for details - https://finance.yahoo.com 238.57 https://finance.yahoo.com October 8th
1 The film "The Many Saints of Newark" was released on 10/01/2021. None None 10/01/2021
2 Release delayed until the 1st of October due to COVID-19 None None 1st of October
3 Price of Bitcoin fell to forty thousand dollars forty thousand dollars None None
4 Documentation can be found at: amaiya.github.io/causalnlp None amaiya.github.io/causalnlp None

For our last example, let's extract universities from a sample of the 20 Newsgroups dataset:

In [5]:
# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
df = pd.DataFrame(train_b.data[:10], columns=['Text']) # let's examine the first 10 posts
df = ae.extract(df.Text.values, df, [('What is the university?', 'University')])
df.head(10)
Out[5]:
Text University
0 From: [email protected] (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format. We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance. Michael.\n-- \nMichael Collier (Programmer) The Computer Unit,\nEmail: [email protected] The City University,\nTel: 071 477-8000 x3769 London,\nFax: 071 477-8565 EC1V 0HB.\n The City University
1 From: [email protected] (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the problem:\n\tI have a rectangular mesh in the uv domain, i.e the mesh is a \n\tmapping of a 3d Bezier patch into 2d. The area in this domain\n\twhich is inside a trimming loop had to be rendered. The trimming\n\tloop is a set of 2d Bezier curve segments.\n\tFor the sake of notation: the mesh is made up of cells.\n\n\tMy problem is this :\n\tThe trimming area has to be split up into individual smaller\n\tcells bounded by the trimming curve segments. If a cell\n\tis wholly inside the area...then it is output as a whole ,\n\telse it is trivially rejected. \n\n\tDoes any body know how thiss can be done, or is there any algo. \n\tsomewhere for doing this.\n\n\tAny help would be appreciated.\n\n\tThanks, \n\tAni.\n-- \nTo get irritated is human, to stay cool, divine.\n University Of Kentucky
2 From: [email protected] (Darin Johnson)\nSubject: Re: harrassed at work, could use some prayers\nOrganization: =CSE Dept., U.C. San Diego\nLines: 63\n\n(Well, I'll email also, but this may apply to other people, so\nI'll post also.)\n\n>I've been working at this company for eight years in various\n>engineering jobs. I'm female. Yesterday I counted and realized that\n>on seven different occasions I've been sexually harrassed at this\n>company.\n\n>I dreaded coming back to work today. What if my boss comes in to ask\n>me some kind of question...\n\nYour boss should be the person bring these problems to. If he/she\ndoes not seem to take any action, keep going up higher and higher.\nSexual harrassment does not need to be tolerated, and it can be an\nenormous emotional support to discuss this with someone and know that\nthey are trying to do something about it. If you feel you can not\ndiscuss this with your boss, perhaps your company has a personnel\ndepartment that can work for you while preserving your privacy. Most\ncompanies will want to deal with this problem because constant anxiety\ndoes seriously affect how effectively employees do their jobs.\n\nIt is unclear from your letter if you have done this or not. It is\nnot inconceivable that management remains ignorant of employee\nproblems/strife even after eight years (it's a miracle if they do\nnotice). Perhaps your manager did not bring to the attention of\nhigher ups? If the company indeed does seem to want to ignore the\nentire problem, there may be a state agency willing to fight with\nyou. (check with a lawyer, a women's resource center, etc to find out)\n\nYou may also want to discuss this with your paster, priest, husband,\netc. That is, someone you know will not be judgemental and that is\nsupportive, comforting, etc. This will bring a lot of healing.\n\n>So I returned at 11:25, only to find that ever single\n>person had already left for lunch. They left at 11:15 or so. 
No one\n>could be bothered to call me at the other building, even though my\n>number was posted.\n\nThis happens to a lot of people. Honest. I believe it may seem\nto be due to gross insensitivity because of the feelings you are\ngoing through. People in offices tend to be more insensitive while\nworking than they normally are (maybe it's the hustle or stress or...)\nI've had this happen to me a lot, often because they didn't realize\nmy car was broken, etc. Then they will come back and wonder why I\ndidn't want to go (this would tend to make me stop being angry at\nbeing ignored and make me laugh). Once, we went off without our\nboss, who was paying for the lunch :-)\n\n>For this\n>reason I hope good Mr. Moderator allows me this latest indulgence.\n\nWell, if you can't turn to the computer for support, what would\nwe do? (signs of the computer age :-)\n\nIn closing, please don't let the hateful actions of a single person\nharm you. They are doing it because they are still the playground\nbully and enjoy seeing the hurt they cause. And you should not\naccept the opinions of an imbecile that you are worthless - much\nwiser people hold you in great esteem.\n-- \nDarin Johnson\[email protected]\n - Luxury! In MY day, we had to make do with 5 bytes of swap...\n U.C.San Diego
3 From: [email protected] (M.M. Zwart)\nSubject: catholic church poland\nOrganization: Faculteit der Letteren, Rijksuniversiteit Groningen, NL\nLines: 10\n\nHello,\n\nI'm writing a paper on the role of the catholic church in Poland after 1989. \nCan anyone tell me more about this, or fill me in on recent books/articles(\nin english, german or french). Most important for me is the role of the \nchurch concerning the abortion-law, religious education at schools,\nbirth-control and the relation church-state(government). Thanx,\n\n Masja,\n"M.M.Zwart"<[email protected]>\n Rijksuniversiteit Groningen
4 From: [email protected]sc.ncr.com (stanly)\nSubject: Re: Elder Brother\nOrganization: NCR Corp., Columbia SC\nLines: 15\n\nIn article <[email protected]> [email protected] writes:\n>In article <[email protected]> [email protected]\n>Matt. 22:9-14 'Go therefore to the main highways, and as many as you find\n>there, invite to the wedding feast.'...\n\n>hmmmmmm. Sounds like your theology and Christ's are at odds. Which one am I \n>to believe?\n\nIn this parable, Jesus tells the parable of the wedding feast. "The kingdom\nof heaven is like unto a certain king which made a marriage for his son".\nSo the wedding clothes were customary, and "given" to those who "chose" to\nattend. This man "refused" to wear the clothes. The wedding clothes are\nequalivant to the "clothes of righteousness". When Jesus died for our sins,\nthose "clothes" were then provided. Like that man, it is our decision to\nput the clothes on.\n rutgers
5 From: [email protected] (Virgilio (Dean) B. Velasco Jr.)\nSubject: Re: The arrogance of Christians\nOrganization: Case Western Reserve Univ. Cleveland, Ohio (USA)\nLines: 28\n\nIn article <[email protected]> [email protected] (Steve Hayes) writes:\n\n>A similar analogy might be a medical doctor who believes that a blood \n>transfusion is necessary to save the life of a child whose parents are \n>Jehovah's Witnesses and so have conscientious objections to blood \n>transfusion. The doctor's efforts to persuade them to agree to a blood \n>transfusion could be perceived to be arrogant in precisely the same way as \n>Christians could be perceived to be arrogant.\n\n>The truth or otherwise of the belief that a blood transfusion is necessary \n>to save the life of the child is irrelevant here. What matters is that the \n>doctor BELIEVES it to be true, and could be seen to be trying to foce his \n>beliefs on the parents, and this could well be perceived as arrogance.\n\nLet me carry that a step further. Most doctors would not claim to be \ninfallible. Indeed, they would generally admit that they could conceivably\nbe wrong, e.g. that in this case, a blood tranfusion might not turn out to \nbe necessary after all. However, the doctors would have enough confidence\nand conviction to claim, out of genuine concern, that is IS necessary. As\nfallible human beings, they must acknowledge the possibility that they are\nwrong. However, they would also say that such doubts are not reasonable,\nand stand by their convictions.\n\n-- \nVirgilio "Dean" Velasco Jr, Department of Electrical Eng'g and Applied Physics \n\t CWRU graduate student, roboticist-in-training and Q wannabee\n "Bullwinkle, that man's intimidating a referee!" | My boss is a \n "Not very well. He doesn't look like one at all!" | Jewish carpenter.\n Case Western Reserve Univ. Cleveland, Ohio
6 From: [email protected] (joseph dale fisher)\nSubject: Re: anger\nOrganization: Indiana University\nLines: 34\n\nIn article <[email protected]> [email protected] writes:\n>>Paul Conditt writes:\n[insert deletion of Paul's and Aaron's discourse on anger, ref Galatians\n5:19-20]\n>\n>I don't know why it is so obvious. We are not speaking of acts of the \n>flesh. We are just speaking of emotions. Emotions are not of themselves\n>moral or immoral, good or bad. Emotions just are. The first step is\n>not to label his emotion as good or bad or to numb ourselves so that\n>we hide our true feelings, it is to accept ourselves as we are, as God\n>accepts us. \n\nOh, but they definitely can be. Please look at Colossians 3:5-10 and\nEphesians 4:25-27. Emotions can be controlled and God puts very strong\nemphasis on self-control, otherwise, why would he have Paul write to\nTimothy so much about making sure to teach self-control? \n\n[insert deletion of remainder of paragraph]\n\n>\n>Re-think it, Aaron. Don't be quick to judge. He has forgiven those with\n>AIDS, he has dealt with and taken responsibility for his feelings and made\n>appropriate choices for action on such feelings. He has not given in to\n>his anger.\n\nPlease, re-think and re-read for yourself, Joe. Again, the issue is\nself-control especially over feelings and actions, for our actions stem\nfrom our feelings in many instances. As for God giving in to his anger,\nthat comes very soon.\n\n>\n>Joe Moore\n\nJoe Fisher\n None
7 From: [email protected] (Jacquelin Aldridge)\nSubject: Re: Teenage acne\nOrganization: NETCOM On-line Communication Services (408 241-9760 guest)\nLines: 57\n\[email protected] (Pat Churchill) writes:\n\n\n>My 14-y-o son has the usual teenage spotty chin and greasy nose. I\n>bought him Clearasil face wash and ointment. I think that is probably\n>enough, along with the usual good diet. However, he is on at me to\n>get some product called Dalacin T, which used to be a\n>doctor's-prescription only treatment but is not available over the\n>chemist's counter. I have asked a couple of pharmacists who say\n>either his acne is not severe enough for Dalacin T, or that Clearasil\n>is OK. I had the odd spots as a teenager, nothing serious. His\n>father was the same, so I don't figure his acne is going to escalate\n>into something disfiguring. But I know kids are senstitive about\n>their appearance. I am wary because a neighbour's son had this wierd\n>malady that was eventually put down to an overdose of vitamin A from\n>acne treatment. I want to help - but with appropriate treatment.\n\n>My son also has some scaliness around the hairline on his scalp. Sort\n>of teenage cradle cap. Any pointers/advice on this? We have tried a\n>couple of anti dandruff shampoos and some of these are inclined to\n>make the condition worse, not better.\n\n>Shall I bury the kid till he's 21 :)\n\n:) No...I was one of the lucky ones. Very little acne as a teenager. I\ndidn't have any luck with clearasil. Even though my skin gets oily it\nreally only gets miserable pimples when it's dry. \n\nFrequent lukewarm water rinses on the face might help. Getting the scalp\nthing under control might help (that could be as simple as submerging under\nthe bathwater till it's softened and washing it out). Taking a one a day\nvitamin/mineral might help. 
I've heard iodine causes trouble and that it \nis used in fast food restaurants to sterilize equipment which might be\nwhere the belief that greasy foods cause acne came from. I notice grease \non my face, not immediately removed will cause acne (even from eating\nmeat).\n\nKeeping hair rinse, mousse, dip, and spray off the face will help. Warm\nwater bath soaks or cloths on the face to soften the oil in the pores will\nhelp prevent blackheads. Body oil is hydrophilic, loves water and it\nsoftens and washes off when it has a chance. That's why hair goes limp with\noilyness. \n\nBecoming convinced that the best thing to do with\na whitehead is leave it alone will save him days of pimple misery. Any\nprying of black or whiteheads can cause infections, the red spots of\npimples. Usually a whitehead will break naturally in a day and there won't\nbe an infection afterwards.\n\nTell him that it's normal to have some pimples but the cosmetic industry\nmakes it's money off of selling people on the idea that they are an\nincredible defect to be hidden at any cost (even that of causing more pimples). \n\n\n-Jackie-\n\n\n None
8 From: [email protected] (Gordon Banks)\nSubject: Re: Blindsight\nReply-To: [email protected] (Gordon Banks)\nOrganization: Univ. of Pittsburgh Computer Science\nLines: 18\n\nIn article <[email protected]> [email protected] (John Werner) writes:\n>In article <[email protected]tt.UUCP>, [email protected] (Gordon Banks) wrote:\n>> \n>> Explain. I thought there were 3 types of cones, equivalent to RGB.\n>\n>You're basically right, but I think there are just 2 types. One is\n>sensitive to red and green, and the other is sensitive to blue and yellow. \n>This is why the two most common kinds of color-blindness are red-green and\n>blue-yellow.\n>\n\nYes, I remember that now. Well, in that case, the cones are indeed\ncolor sensitive, contrary to what the original respondent had claimed.\n-- \n----------------------------------------------------------------------------\nGordon Banks N3JXP | "Skepticism is the chastity of the intellect, and\[email protected] | it is shameful to surrender it too soon." \n----------------------------------------------------------------------------\n Univ. of Pittsburgh
9 From: [email protected] (Marlena Libman)\nSubject: Need advice with doctor-patient relationship problem\nOrganization: University of Southern California, Los Angeles, CA\nLines: 64\nNNTP-Posting-Host: hsc.usc.edu\n\nI need advice with a situation which occurred between me and a physican\nwhich upset me. I saw this doctor for a problem with recurring pain.\nHe suggested medication and a course of treatment, and told me that I\nneed to call him 7 days after I begin the medication so that he may\nmonitor its effectiveness, as well as my general health.\n\nI did exactly as he asked, and made the call (reaching his secretary).\nI explained to her that I was following up at the doctor's request,\nand that I was worried because the pain episodes were becoming more\nfrequent and the medication did not seem effective.\n\nThe doctor called me back, and his first words were, "Whatever you want,\nyou'd better make it quick. I'm very busy and don't have time to chit-\nchat with you!" I told him I was simply following his instructions to\ncall on the 7th day to status him, and that I was feeling worse. I \nthen asked if perhaps there was a better time for us to talk when he\nhad more time. He responded, "Just spit it out now because no time is\na good time." (Said in a raised voice.) I started to feel upset and\ntried to explain quickly what was going on with my condition but my\nnervousness interfered with my choice of words and I kind of stuttered\nand then said "well, never mind" and he said he'll talk to various\ncolleagues about other medications and he'll call me some other time.\n\nThis doctor called me that evening and said because I didn't express\nmyself well, he was confused about what I wanted. At this point I\nwas pretty upset and I told him (in an amazingly polite voice considering\nhow angry I felt) that his earlier manner had hurt my feelings. He told\nme that he just doesn't have time to "rap with patients" and thought\nthat was what I wanted. 
I told him that to assume I was calling to\n"rap" was insulting, and said again that I was just following through\non his orders. He responded that he resented the implication that he \nfelt I was making that he was not interested in learning about what his\npatients have to say about their condition status. He then gave me\nthis apology: "I am sorry that there was a miscommunication and you\nmistakenly thought I was insulting. I am not trying to insult you\nbut I am not that knowledgeable about pain, and I don't have a lot of\ntime to deal with that." He then told me to call him the next day\nfor further instructions on how do deal with my pain and medication.\n\nI am still upset and have not yet called.\n\nMy questions: (1) Should I continue to have this doctor manage my care?\n(2) Since I am in pain off and on, I realize that this may cause me to\nbe more anxietous so am I perhaps over-reacting or overly sensitive?\nIf this doctor refers me to his colleague who knows more about the type\nof pain I have, he still wants me to status him on my condition but\nnow I am afraid to call him.\n\n\t\t\t--Marlena\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n University of Southern California

Customizing the AnswerExtractor to Your Use Case

If you see false positives or false negatives, adjust the min_conf parameter (the minimum confidence threshold; the default is min_conf=6): raising it discards more low-confidence extractions, while lowering it lets more through. If return_conf=True, columns showing the confidence score of each extraction are also included in the resulting DataFrame.
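The effect of a minimum confidence threshold can be illustrated with a small, ktrain-free sketch. `apply_min_conf` is a hypothetical helper, the answers and scores are invented for illustration, and the `>=` comparison is an assumption; this is not ktrain's internal logic.

```python
def apply_min_conf(candidates, min_conf=6, return_conf=False):
    """Filter (answer, confidence) pairs by a minimum confidence threshold.

    candidates: list of (answer, confidence) tuples, one per document.
    Answers scoring below min_conf are replaced with None.
    """
    results = []
    for answer, conf in candidates:
        kept = answer if (answer is not None and conf >= min_conf) else None
        # Optionally keep the score alongside the (possibly discarded) answer.
        results.append((kept, conf) if return_conf else kept)
    return results

# Invented candidates: (answer, confidence score)
candidates = [("answer A", 11.2), ("answer B", 6.4), ("answer C", 2.1)]

print(apply_min_conf(candidates))                                # default threshold
print(apply_min_conf(candidates, min_conf=8))                    # stricter threshold
print(apply_min_conf(candidates, min_conf=8, return_conf=True))  # keep the scores
```

Keeping the scores visible, as return_conf=True does in the real extractor, makes it easier to choose a threshold: you can see how far borderline extractions sit from the cutoff before changing it.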

If adjusting the confidence threshold is not sufficient to address the false positives and false negatives you're seeing, you can also try fine-tuning the QA model on your own dataset by providing only a small handful of examples:

Example:

data = [
    {"question": "What is the URL?",
     "context": "Closing price for Square on October 8th was $238.57, for details - https://finance.yahoo.com",
     "answers": "https://finance.yahoo.com"},
    {"question": "What is the URL?",
     "context": "HTTP is a protocol for fetching resources.",
     "answers": None},
]
from ktrain.text.qa import AnswerExtractor
ae = AnswerExtractor(bert_squad_model='distilbert-base-cased-distilled-squad')
ae.finetune(data)

Note that, by default, the AnswerExtractor uses a bert-large-* model that requires a lot of memory to train. If fine-tuning, you may want to switch to a smaller model like DistilBERT, as shown in the example above.
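Before fine-tuning, it can help to sanity-check the examples. The sketch below validates a list in the format shown above. `check_examples` is a hypothetical helper, not part of ktrain, and the rule that a non-None answer must appear verbatim in its context is an assumption based on how extractive QA models are usually trained, not something confirmed for ktrain here.

```python
def check_examples(examples):
    """Return a list of human-readable problems; an empty list means none were found."""
    required = {"question", "context", "answers"}
    problems = []
    for i, ex in enumerate(examples):
        missing = required - set(ex)
        if missing:
            problems.append(f"example {i}: missing keys {sorted(missing)}")
            continue
        answer = ex["answers"]
        # Assumption: extractive QA expects the answer to be a span of the context.
        # None marks an example whose context contains no answer.
        if answer is not None and answer not in ex["context"]:
            problems.append(f"example {i}: answer not found verbatim in context")
    return problems

good = [
    {"question": "What is the URL?",
     "context": "Closing price for Square on October 8th was $238.57, for details - https://finance.yahoo.com",
     "answers": "https://finance.yahoo.com"},
    {"question": "What is the URL?",
     "context": "HTTP is a protocol for fetching resources.",
     "answers": None},
]
bad = [
    {"question": "What is the URL?",
     "context": "Documentation can be found at: amaiya.github.io/causalnlp",
     "answers": "https://amaiya.github.io/causalnlp"},  # not a verbatim span
    {"question": "What is the URL?", "context": "No answer key here."},
]

print(check_examples(good))
print(check_examples(bad))
```

Including a few examples with "answers": None, as the original data does, gives the model explicit cases where it should decline to extract anything.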

Finally, the finetune method accepts other parameters such as batch_size and max_seq_length that you can adjust depending on your speed requirements, dataset characteristics, and system resources.
