Notebook

🔎 Answering Questions on Financial Texts¶

📜 One of the biggest recent advances in NLP is Language Models and their ability to answer questions expressed in natural language.

*While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020...

... We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019 ...*

- What is the profit increase?
- What was the decline in revenue?
- What was the operating loss in 2020?
- What was the operating loss in 2019?

📜 Question Answering (QA) uses specific Language Models trained to carry out Natural Language Inference (NLI).

NLI works as follows:

- Given a text as a Premise (P);
- Given a hypothesis (H) as a question to be solved;
- Then, we ask the Language Model whether H is entailed by, contradicted by, or unrelated to P.

Although we won't get into the maths here, this is basically done by using a Language Model to encode P and H and then carrying out sentence similarity operations.

📌 Applications of NLI: The basics¶

The most straightforward application is retrieving answers to natural language questions.

Type 1: Open-book questions, where you give the text (P) to the model.
Type 2: Closed-book questions, where you rely only on what the pretrained Language Model learned from texts at training time.

📌 Applications of NLI: Zero-shot¶

At John Snow Labs, we have developed our own NLI-based annotators, which not only carry out Question Answering but also use QA to:

Retrieve Entities, also known as Zero-shot NER;
Retrieve Relations, also known as Zero-shot Relation Extraction.

✔️ How do we achieve Zero-shot NER with QA?¶

Given a question Q, for example "What was the profit increase in 2017?", and given the text P, "In 2017, the Company reported a profit decline of $4 million dollars compared to 2016", we:

Generate Hypotheses H with the tokens of the text
- The profit increase in 2017 was 2017: contradiction
- The profit increase in 2017 was Company: contradiction
- The profit increase in 2017 was ...: contradiction
- The profit increase in 2017 was $4: entailment
- The profit increase in 2017 was million: entailment
We check every H against P to see whether it is entailed. If so, we return the token as a NER entity. If several consecutive tokens are entailed, we check whether they can be part of the same chunk.
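The token-by-token procedure above can be sketched in plain Python. This is only an illustration under stated assumptions, not the John Snow Labs implementation: `question_to_hypothesis` and `zero_shot_chunks` are hypothetical helper names, and `toy_is_entailed` is a hard-coded stand-in for a real NLI model that simply reproduces the entailment labels listed in the example.

```python
from typing import Callable, List


def question_to_hypothesis(question: str, candidate: str) -> str:
    """Turn 'What was X?' into the statement 'X was <candidate>'.

    Only the 'What was ...?' pattern from the example is supported.
    """
    prefix = "What was "
    if not (question.startswith(prefix) and question.endswith("?")):
        raise ValueError("unsupported question pattern")
    body = question[len(prefix):-1]
    return f"{body[0].upper()}{body[1:]} was {candidate}"


def zero_shot_chunks(
    question: str,
    premise: str,
    is_entailed: Callable[[str, str], bool],
) -> List[str]:
    """Build one hypothesis per token and merge consecutive entailed tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for token in premise.split():
        hypothesis = question_to_hypothesis(question, token)
        if is_entailed(premise, hypothesis):
            current.append(token)  # extend the running chunk
        elif current:
            chunks.append(" ".join(current))  # an entailed run just ended
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


def toy_is_entailed(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model: mirrors the labels of the example only."""
    return hypothesis.endswith(("was $4", "was million"))


Q = "What was the profit increase in 2017?"
P = ("In 2017, the Company reported a profit decline of "
     "$4 million dollars compared to 2016")
print(zero_shot_chunks(Q, P, toy_is_entailed))  # ['$4 million']
```

In a real setting, `is_entailed` would call an NLI model and probably apply a score threshold; the merging of consecutive tokens would stay the same.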

Let's take a look at some examples of applications of QA to Financial Texts.

🎬 Installation¶

In [ ]:

! pip install -q johnsnowlabs

🔗 Automatic Installation¶

Using my.johnsnowlabs.com SSO

In [2]:

from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

🔗 Manual downloading¶

If you are not registered in my.johnsnowlabs.com, you received a license via email, or you are using Safari, you may need to upload the license manually.

Go to my.johnsnowlabs.com
Download your license
Upload it using the following command

In [ ]:

from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Install it

In [ ]:

nlp.install()

📌 Starting¶

In [5]:

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7162 (6).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.2, running on ⚡ PySpark==3.1.2

🔎 Open Book Questions¶

In [ ]:

! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt

In [ ]:

with open('cdns-20220101.html.txt', 'r') as f:
  cadence_sec10k = f.read()

Let's take a random piece of text from our 10-K filing...

In [ ]:

random_piece = cadence_sec10k[135000:144000]
print(random_piece)

necessary, on commercially reasonable terms or at all and, even if successful, those alternative actions may not allow us to meet our scheduled debt service obligations. The agreement governing our revolving credit facility restricts our ability to dispose of assets and use the proceeds from those dispositions and may also restrict our ability to raise debt or equity capital to be used to repay other indebtedness when it becomes due. We may not be able to consummate those dispositions or to obtain proceeds in an amount sufficient to meet any debt service obligations then due.
In addition, we conduct a substantial portion of our operations through our subsidiaries, none of which are currently guarantors of our indebtedness. Accordingly, repayment of our indebtedness is dependent on the generation of cash flow by our subsidiaries and their ability to make such cash available to us, by dividend, debt repayment or otherwise. Our subsidiaries do not have any obligation to pay amounts due on our indebtedness or to make funds available for that purpose. Our subsidiaries may not be able to, or may not be permitted to, make distributions to enable us to make payments in respect of our indebtedness. Each subsidiary is a distinct legal entity, and, under certain circumstances, legal and contractual restrictions may limit our ability to obtain cash from our subsidiaries. In the event that we do not receive distributions from our subsidiaries, we may be unable to make required principal and interest payments on our indebtedness.
24
Table of Contents
If we cannot make scheduled payments on our debt, we will be in default and holders of our debt could declare all outstanding principal and interest to be due and payable, the lenders under our revolving credit facility could terminate their commitments to loan money and we could be forced into bankruptcy or liquidation. In addition, a material default on our indebtedness could suspend our eligibility to register securities using certain registration statement forms under SEC guidelines that permit incorporation by reference of substantial information regarding us, potentially hindering our ability to raise capital through the issuance of our securities and increasing our costs of registration.
Despite our current level of indebtedness, we and our subsidiaries may incur substantially more debt. This could further exacerbate the risks to our financial condition described above.
We and our subsidiaries may incur significant additional indebtedness in the future. Although the agreement governing our revolving credit facility contains restrictions on the incurrence of additional indebtedness, these restrictions are subject to a number of qualifications and exceptions, and the additional indebtedness incurred in compliance with these restrictions could be substantial. If we incur any additional indebtedness that ranks equally with the 2024 Notes, then subject to any collateral arrangements we may enter into, the holders of that debt will be entitled to share ratably in any proceeds distributed in connection with any insolvency, liquidation, reorganization, dissolution or other winding up of our company.
Our variable rate indebtedness subjects us to interest rate risk, which could cause our debt service obligations to increase significantly.
Borrowings under our revolving credit facility are at variable rates of interest and expose us to interest rate risk. If interest rates were to increase, our debt service obligations on our variable rate indebtedness would increase even though the amount borrowed remained the same, and our net income and cash flows, including cash available for servicing our indebtedness, would correspondingly decrease. In the future, we may enter into interest rate swaps that involve the exchange of floating for fixed rate interest payments in order to reduce interest rate volatility. However, we may not maintain interest rate swaps with respect to all of our variable rate indebtedness, and any swaps we enter into may not fully mitigate our interest rate risk.
Our revolving credit facility utilizes, at our option, either (1) LIBOR, plus a margin of between 0.750% and 1.250%, determined by reference to the credit rating of our unsecured debt, or (2) base rate plus a margin of 0.000% to 0.250%, determined by reference to the credit rating of our unsecured debt, to calculate the amount of accrued interest on any borrowings. Regulators in certain jurisdictions including the United Kingdom and the United States have begun to phase out the use of LIBOR, ceasing publication for certain tenors of the U.S. dollar (and other) LIBOR at the end of 2021, with plans to cease publication for the remaining tenors of U.S. dollar LIBOR beginning June 30, 2023. Our revolving credit facility contains provisions that contemplate the transition from LIBOR under specified events; however, the transition from LIBOR to a new replacement benchmark remains uncertain at this time and the consequences of such developments cannot be entirely predicted, but could result in an increase in the cost of our borrowings under our existing credit facility and any future borrowings.
In addition, our revolving credit facility uses a pricing grid based on our credit ratings. If our credit ratings are downgraded or other negative action is taken, the interest rate payable by us under our revolving credit facility would increase. Credit rating downgrades could also restrict our ability to obtain additional financing in the future and affect the terms of any such financing.
Various factors could increase our future borrowing costs or reduce our access to capital, including a lowering or withdrawal of the ratings assigned to us and our 2024 Notes by credit rating agencies.
We may in the future seek additional financing for a variety of reasons, and our future borrowing costs, terms and access to capital could be affected by factors including the condition of the debt and equity markets, the condition of the economy generally, prevailing interest rates, our level of indebtedness, our credit rating and our business and financial condition. In addition, the 2024 Notes currently have an investment grade credit rating, which could be lowered or withdrawn entirely by a credit rating agency based on adverse changes to circumstances relating to the basis of the credit rating. Consequently, real or anticipated changes in our credit ratings will generally affect the market value of the 2024 Notes. Any future lowering of the credit ratings of the 2024 Notes likely would make it more difficult or more expensive for us to obtain additional debt financing.

Item 1B. Unresolved Staff Comments
None.

Item 2. Properties
We own land and buildings at our headquarters located in San Jose, California. We also own buildings in India. As of January 1, 2022, the total square footage of our owned buildings was approximately 1,010,000.
We lease additional facilities in the United States and various other countries. We may sublease certain of these facilities where space is not fully utilized.
We believe that these facilities are adequate for our current needs and that suitable additional or substitute space will be available as needed to accommodate any expansion of our operations.

25
Table of Contents
Item 3. Legal Proceedings
From time to time, we are involved in various disputes and legal proceedings that arise in the ordinary course of business. These include disputes and legal proceedings related to intellectual property, indemnification obligations, mergers and acquisitions, licensing, contracts, customers, products, distribution and other commercial arrangements and employee relations matters. At least quarterly, we review the status of each significant matter and assess its potential financial exposure. If the potential loss from any claim or legal proceeding is considered probable and the amount or the range of loss can be estimated, we accrue a liability for the estimated loss. Legal proceedings are subject to uncertainties, and the outcomes are difficult to predict. Because of such uncertainties, accruals are based on our judgments using the best information available at the time. As additional information becomes available, we reassess the potential liability related to pending claims and legal proceedings and may revise estimates.

Item 4. Mine Safety Disclosures
Not applicable.

26
Table of Contents
PART II.

Item 5. Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
Our common stock is traded on the Nasdaq Global Select Market under the symbol CDNS. As of February 5, 2022, we had 384 registered stockholders and approximately 340,000 beneficial owners of our common stock.

Stockholder Return Performance Graph
The following graph compares the cumulative 5-year total stockholder return on our common stock relative to the cumulative total return of the Nasdaq Composite Index,

Items 2, 3, and 5 look like good candidates to ask questions about!

In [ ]:

item2 = """We own land and buildings at our headquarters located in San Jose, California. We also own buildings in India. As of January 1, 2022, the total square footage of our owned buildings was approximately 1,010,000.
We lease additional facilities in the United States and various other countries. We may sublease certain of these facilities where space is not fully utilized."""

item3 = """From time to time, we are involved in various disputes and legal proceedings that arise in the ordinary course of business. These include disputes and legal proceedings related to intellectual property, indemnification obligations, mergers and acquisitions, licensing, contracts, customers, products, distribution and other commercial arrangements and employee relations matters. At least quarterly, we review the status of each significant matter and assess its potential financial exposure. If the potential loss from any claim or legal proceeding is considered probable and the amount or the range of loss can be estimated, we accrue a liability for the estimated loss. Legal proceedings are subject to uncertainties, and the outcomes are difficult to predict. Because of such uncertainties, accruals are based on our judgments using the best information available at the time. As additional information becomes available, we reassess the potential liability related to pending claims and legal proceedings and may revise estimates."""

item5 = """Our common stock is traded on the Nasdaq Global Select Market under the symbol CDNS. As of February 5, 2022, we had 384 registered stockholders and approximately 340,000 beneficial owners of our common stock."""
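Instead of copying the items by hand, they could be sliced out of the printed text with a regular expression. The sketch below is an assumption-laden illustration: `split_items` and `NOISE_LINE` are hypothetical names, headings are assumed to start a line as `Item <number>.`, and lines consisting only of a page number, "Table of Contents", or a "PART" marker are treated as noise, as seen in the output above.

```python
import re

# Lines that are page furniture rather than item content (assumed patterns).
NOISE_LINE = re.compile(r"^(\d+|Table of Contents|PART [IVX]+\.)$")

# A heading such as "Item 1B. Unresolved Staff Comments" at the start of a line.
ITEM_HEADING = re.compile(r"^Item[ \t]+(\d+[A-Z]?)\.[ \t]*(.*)$", re.MULTILINE)


def split_items(filing_text: str) -> dict:
    """Map labels like 'Item 2' to the cleaned body text under that heading."""
    matches = list(ITEM_HEADING.finditer(filing_text))
    items = {}
    for i, match in enumerate(matches):
        # The body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        body = filing_text[match.end():end]
        kept = [
            line.strip()
            for line in body.splitlines()
            if line.strip() and not NOISE_LINE.match(line.strip())
        ]
        # If a label occurs twice, the later occurrence overwrites the earlier one.
        items[f"Item {match.group(1)}"] = "\n".join(kept)
    return items


sample = (
    "Item 1B. Unresolved Staff Comments\nNone.\n\n"
    "Item 2. Properties\nWe own land.\n\n25\nTable of Contents\n"
    "Item 3. Legal Proceedings\nFrom time to time.\n"
)
items = split_items(sample)
print(items["Item 2"])  # We own land.
```

On the notebook's data this would be called as `split_items(random_piece)`. A full filing may mention "Item N." at the start of other lines (for example in cross-references), so the result should be checked before use.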

🚀 Let's create a pipeline¶

We will use a BERT-based QA model named finqa_bert.

📜 To do that, our pipeline uses:

a MultiDocumentAssembler, which puts together questions (Q to create H) and context (P).
a BertForQuestionAnswering pretrained model.

🚀 IMPORTANT: We highly recommend using setCaseSensitive(False), so that uppercase words are not treated as proper nouns, which could trigger out-of-vocabulary (OOV) tokens.

In [ ]:

documentAssembler = nlp.MultiDocumentAssembler()\
  .setInputCols(["question", "context"])\
  .setOutputCols(["document_question", "document_context"])

spanClassifier = nlp.BertForQuestionAnswering.pretrained("finqa_bert","en", "finance/models") \
  .setInputCols(["document_question", "document_context"]) \
  .setOutputCol("answer") \
  .setCaseSensitive(False)

qa_pipeline = nlp.Pipeline().setStages([
  documentAssembler,
  spanClassifier
])

finqa_bert download started this may take some time.
Approximate size to download 389 MB
[OK!]

In [ ]:

P = item2

Q = [
     "Where are the headquarters?",
     "What is the total square footage?",
     "In which countries do they lease facilities?"
]

Q_P = [ [q, P] for q in Q]

example = spark.createDataFrame(Q_P).toDF("question", "context")

example.show()

+--------------------+--------------------+
|            question|             context|
+--------------------+--------------------+
|Where are the hea...|We own land and b...|
|What is the total...|We own land and b...|
|In which countrie...|We own land and b...|
+--------------------+--------------------+

In [ ]:

result = qa_pipeline.fit(example).transform(example)

result.select('question', 'answer.result', 'answer').show(truncate=False)

+--------------------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                    |result                 |answer                                                                                                                                                                 |
+--------------------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Where are the headquarters?                 |[San Jose , California]|[{chunk, 0, 20, San Jose , California, {chunk -> 0, start_score -> 0.8019189, score -> 0.83842176, end -> 20, start -> 17, end_score -> 0.8749246, sentence -> 0}, []}]|
|What is the total square footage?           |[1 , 010 , 000]        |[{chunk, 0, 12, 1 , 010 , 000, {chunk -> 0, start_score -> 0.66597635, score -> 0.7811918, end -> 55, start -> 50, end_score -> 0.89640725, sentence -> 0}, []}]       |
|In which countries do they lease facilities?|[United States]        |[{chunk, 0, 12, United States, {chunk -> 0, start_score -> 0.5888994, score -> 0.52136713, end -> 64, start -> 63, end_score -> 0.4538349, sentence -> 0}, []}]        |
+--------------------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
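The answers above come back with the tokenizer's spacing, for example `San Jose , California` and `1 , 010 , 000`. A small post-processing step can tidy them up. This is a heuristic sketch, not part of the library: `tidy_answer` is a hypothetical helper, and the thousands-separator rule assumes that a comma followed by exactly three digits belongs to a number.

```python
import re


def tidy_answer(span: str) -> str:
    """Remove the spaces a tokenizer leaves around punctuation in an answer."""
    # "San Jose , California" -> "San Jose, California"
    span = re.sub(r"\s+([,.;:!?])", r"\1", span)
    # "1, 010, 000" -> "1,010,000" (comma followed by a three-digit group)
    span = re.sub(r"(?<=\d),\s+(?=\d{3}\b)", ",", span)
    return span.strip()


for raw in ["San Jose , California", "1 , 010 , 000", "United States"]:
    print(repr(tidy_answer(raw)))
```

The rule for numbers can misfire on text such as "2020, 150 employees", so it is best applied only to answers that are expected to be numeric.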

In [ ]:

P = item5

Q = [
     "Where is their common stock traded?",
     "Which is the trading symbol?"
]

Q_P = [ [q, P] for q in Q]

example = spark.createDataFrame(Q_P).toDF("question", "context")

result = qa_pipeline.fit(example).transform(example)

result.select('question', 'answer.result', 'answer').show(truncate=False)

+-----------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                           |result                       |answer                                                                                                                                                                        |
+-----------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Where is their common stock traded?|[Nasdaq Global Select Market]|[{chunk, 0, 26, Nasdaq Global Select Market, {chunk -> 0, start_score -> 0.30269945, score -> 0.41721502, end -> 21, start -> 16, end_score -> 0.5317306, sentence -> 0}, []}]|
|Which is the trading symbol?       |[CDNS]                       |[{chunk, 0, 3, CDNS, {chunk -> 0, start_score -> 0.8779542, score -> 0.8447887, end -> 25, start -> 24, end_score -> 0.8116232, sentence -> 0}, []}]                          |
+-----------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
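Each answer carries a `score` in its metadata, which can be used to route uncertain answers to manual review. The sketch below works on plain Python tuples copied from the table above; `split_by_confidence` and the 0.5 threshold are hypothetical choices made for illustration, not recommended values.

```python
from typing import List, Tuple

Answer = Tuple[str, str, float]  # (question, answer text, score)


def split_by_confidence(
    answers: List[Answer], threshold: float = 0.5
) -> Tuple[List[Answer], List[Answer]]:
    """Separate answers into (kept, needs_review) using the model score."""
    kept = [a for a in answers if float(a[2]) >= threshold]
    needs_review = [a for a in answers if float(a[2]) < threshold]
    return kept, needs_review


# Values taken from the output table above.
answers = [
    ("Where is their common stock traded?", "Nasdaq Global Select Market", 0.41721502),
    ("Which is the trading symbol?", "CDNS", 0.8447887),
]
kept, needs_review = split_by_confidence(answers)
print([a[1] for a in kept])          # ['CDNS']
print([a[1] for a in needs_review])  # ['Nasdaq Global Select Market']
```

Note that the answer sent to review here is actually correct, which shows that any fixed threshold is a trade-off and should be tuned on your own data. Annotation metadata values are typically stored as strings, hence the `float(...)` conversion.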

In [ ]:

P = item3

Q = [
     "What kind of disputes or legal proceedings related to?"
]

Q_P = [ [q, P] for q in Q]

example = spark.createDataFrame(Q_P).toDF("question", "context")

result = qa_pipeline.fit(example).transform(example)

result.select('question', 'answer.result', 'answer').show(truncate=False)

+------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                              |result                                                                                                                                                                                                         |answer                                                                                                                                                                                                                                                                                                                                                           |
+------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What kind of disputes or legal proceedings related to?|[intellectual property , indemnification obligations , mergers and acquisitions , licensing , contracts , customers , products , distribution and other commercial arrangements and employee relations matters]|[{chunk, 0, 204, intellectual property , indemnification obligations , mergers and acquisitions , licensing , contracts , customers , products , distribution and other commercial arrangements and employee relations matters, {chunk -> 0, start_score -> 0.63349277, score -> 0.56178546, end -> 71, start -> 43, end_score -> 0.4900781, sentence -> 0}, []}]|
+------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

🔎 Automatic Question Generation¶

Now the question is ... is there a way to generate the questions automatically?

The answer is simple: YES, there is!

We have several ways to generate a series of questions, given for example:

A SUBJECT of a sentence;
An ACTION (verb);

More specifically, the ways to do it include:

Using the grammatical information (Part of Speech and Dependency Tree);
Using NER, Contextual Parser, or another method to retrieve the SUBJECT and VERB.

Check the notebook "Automatic Question Generation" for examples of how to do it.
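To give a feel for the idea, here is a minimal template-based sketch that builds questions from a SUBJECT and an ACTION. It is purely illustrative and is not the approach implemented in that notebook: `generate_questions` and the templates are hypothetical, and the verb is assumed to be given in its base form.

```python
from typing import List

# Hypothetical question templates; real systems would use many more.
TEMPLATES = [
    "What did {subject} {verb}?",
    "When did {subject} {verb}?",
    "Why did {subject} {verb}?",
]


def generate_questions(subject: str, verb: str) -> List[str]:
    """Fill each template with the extracted subject and base-form verb."""
    return [t.format(subject=subject, verb=verb) for t in TEMPLATES]


for question in generate_questions("the Company", "report"):
    print(question)
```

The generated questions can then be paired with the context, exactly like the hand-written `Q` lists used earlier.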

🔎 Table Question Answering¶

For table question answering we have a specific notebook, which you will find in this workshop. Feel free to check it out too!

But in the meantime, a small spoiler...

🔎 1. From csv files¶

Let's create a csv file with information about clients and agreements.

In [ ]:

import pandas as pd

df_data = {
    "header": ['client name', 'last operation year', 'last operation amount', 'document'],
    "rows": [
        ['John Smith', '2007', '$200000', 'NDA'],
        ['Jack Gordon', '2017', '$10000', 'Credit Agreement'],
        ['Mary Lean', '2001', '$120000', 'License Agreement'],
        ['Jessica James', '2022', '$1200000', 'Purchase Agreement'],
    ],
}


df = pd.DataFrame(df_data['rows'], columns=df_data['header'])

df.to_csv('table.csv', index=False)

In [ ]:

df_data

Out[ ]:

{'header': ['client name',
  'last operation year',
  'last operation amount',
  'document'],
 'rows': [['John Smith', '2007', '$200000', 'NDA'],
  ['Jack Gordon', '2017', '$10000', 'Credit Agreement'],
  ['Mary Lean', '2001', '$120000', 'License Agreement'],
  ['Jessica James', '2022', '$1200000', 'Purchase Agreement']]}

In [ ]:

df

Out[ ]:

	client name	last operation year	last operation amount	document
0	John Smith	2007	$200000	NDA
1	Jack Gordon	2017	$10000	Credit Agreement
2	Mary Lean	2001	$120000	License Agreement
3	Jessica James	2022	$1200000	Purchase Agreement

In [ ]:

import json
json.dumps(df_data)

Out[ ]:

'{"header": ["client name", "last operation year", "last operation amount", "document"], "rows": [["John Smith", "2007", "$200000", "NDA"], ["Jack Gordon", "2017", "$10000", "Credit Agreement"], ["Mary Lean", "2001", "$120000", "License Agreement"], ["Jessica James", "2022", "$1200000", "Purchase Agreement"]]}'
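The JSON string above has a `header` list and a `rows` list of string cells. If you start from an existing csv file rather than from a Python dictionary, the same structure can be built with the standard library. This is a sketch under that assumption; `csv_to_table_json` is a hypothetical helper name.

```python
import csv
import io
import json


def csv_to_table_json(csv_text: str) -> str:
    """Convert csv text into a JSON string with 'header' and 'rows' keys."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                 # first line holds the column names
    rows = [row for row in reader if row]  # skip empty lines; cells stay strings
    return json.dumps({"header": header, "rows": rows})


csv_text = (
    "client name,last operation year,last operation amount,document\n"
    "John Smith,2007,$200000,NDA\n"
    "Jack Gordon,2017,$10000,Credit Agreement\n"
)
table_json = csv_to_table_json(csv_text)
print(table_json)
```

For the file written earlier, this would be called as `csv_to_table_json(open('table.csv').read())`.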

Now, some questions...

In [ ]:

queries = [
    "Who signed an NDA?",
    "Who operated last time in 2022?", 
    "What is the total amount of operations?",
    "Which year a Credit Agreement was signed?",
]

Now, we will use the following specific components:

A MultiDocumentAssembler, to put together the questions and the table in json format
A TableAssembler to assemble the table from a json

In [ ]:

data = spark.createDataFrame([
        [json.dumps(df_data), " ".join(queries)]
    ]).toDF("table_json", "questions")

In [ ]:

data.show()

+--------------------+--------------------+
|          table_json|           questions|
+--------------------+--------------------+
|{"header": ["clie...|Who signed an NDA...|
+--------------------+--------------------+

In [ ]:

document_assembler = nlp.MultiDocumentAssembler() \
    .setInputCols("table_json", "questions") \
    .setOutputCols("document_table", "document_questions")

text_splitter = finance.TextSplitter() \
    .setInputCols(["document_questions"]) \
    .setOutputCol("questions")

table_assembler = nlp.TableAssembler()\
    .setInputCols(["document_table"])\
    .setOutputCol("table")

The last component is TapasForQuestionAnswering, which carries out the inference.

In [ ]:

tapas = nlp.TapasForQuestionAnswering.pretrained("table_qa_tapas_base_finetuned_wtq", "en")\
    .setInputCols(["questions", "table"])\
    .setOutputCol("answers")

table_qa_tapas_base_finetuned_wtq download started this may take some time.
Approximate size to download 394.7 MB
[OK!]

Now the pipeline looks as follows:

In [ ]:

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter,
    table_assembler,
    tapas
])

And this is the result on fit/transform:

In [ ]:

model = pipeline.fit(data)
res = model\
    .transform(data)\
    .selectExpr("explode(answers) AS answer")\
    .select("answer")
res.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answer                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 10, John Smith, {question -> Who signed an NDA?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 1.0}, []}                                                                                            |
|{chunk, 0, 13, Jessica James, {question -> Who operated last time in 2022?, aggregation -> NONE, cell_positions -> [0, 3], cell_scores -> 1.0}, []}                                                                            |
|{chunk, 0, 41, COUNT($200000, $10000, $120000, $1200000), {question -> What is the total amount of operations?, aggregation -> COUNT, cell_positions -> [2, 0], [2, 1], [2, 2], [2, 3], cell_scores -> 1.0, 1.0, 1.0, 1.0}, []}|
|{chunk, 0, 4, 2017, {question -> Which year a Credit Agreement was signed?, aggregation -> NONE, cell_positions -> [1, 1], cell_scores -> 1.0}, []}                                                                            |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In [ ]:

from pyspark.sql import functions as F
res.select("answer.metadata.question", F.expr('answer.result as answer'), F.expr('answer.metadata["aggregation"] as metadata')).show(truncate=False)

+-----------------------------------------+-----------------------------------------+--------+
|question                                 |answer                                   |metadata|
+-----------------------------------------+-----------------------------------------+--------+
|Who signed an NDA?                       |John Smith                               |NONE    |
|Who operated last time in 2022?          |Jessica James                            |NONE    |
|What is the total amount of operations?  |COUNT($200000, $10000, $120000, $1200000)|COUNT   |
|Which year a Credit Agreement was signed?|2017                                     |NONE    |
+-----------------------------------------+-----------------------------------------+--------+
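For aggregated answers, the output contains the operator and the selected cells, such as `COUNT($200000, $10000, $120000, $1200000)`, rather than a final number. Note that the model chose COUNT for the "total amount" question, where a reader would probably expect a sum. The sketch below computes the value for a given operator so that you can also override it. It assumes the operator names NONE, COUNT, SUM and AVERAGE and cell values without thousands separators; `resolve_aggregation` is a hypothetical helper.

```python
import re


def resolve_aggregation(answer: str, aggregation: str):
    """Compute the numeric result for an answer like 'COUNT($1, $2)'."""
    if aggregation == "NONE":
        return answer  # plain cell value, nothing to compute
    wrapped = re.fullmatch(r"[A-Z]+\((.*)\)", answer.strip())
    cells = (wrapped.group(1) if wrapped else answer).split(",")
    # Keep digits, dots and minus signs only: "$200000" -> 200000.0
    values = [
        float(re.sub(r"[^0-9.\-]", "", cell))
        for cell in cells
        if re.search(r"\d", cell)
    ]
    if aggregation == "COUNT":
        return len(values)
    if not values:
        raise ValueError("no numeric cells to aggregate")
    if aggregation == "SUM":
        return sum(values)
    if aggregation == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"unknown aggregation: {aggregation}")


raw = "COUNT($200000, $10000, $120000, $1200000)"
print(resolve_aggregation(raw, "COUNT"))  # 4
print(resolve_aggregation(raw, "SUM"))    # 1530000.0
```

Overriding the operator is a manual decision; whether it is appropriate depends on the question being asked.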

🔎 2. From tables in scanned documents¶

You will need Visual NLP, another licensed John Snow Labs product, to extract tables from documents.

The result is just a csv, so you can apply the same code shown above after extracting the table from your documents.

Check the notebook Financial_Visual_Document_Understanding for more details.

🔎 Flan-T5 Question Answering¶

FLAN-T5 is a state-of-the-art language model developed by Google AI that uses the T5 architecture for text generation tasks. It is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format.

During the training phase, FLAN-T5 was fed a large corpus of text data and trained to predict missing words in an input text via a fill-in-the-blank style objective. This process was repeated until the model learned to generate text similar to the input data.

Once trained, FLAN-T5 can be used to perform a variety of NLP tasks, such as text generation, language translation, sentiment analysis, and text classification.

What are a few use cases?

FLAN-T5 has a few potential use-cases:

Text Generation: FLAN-T5 can be used to generate text based on a prompt or input. This is ideal for content creation and creative writing including writing fiction, poetry, news articles, or product descriptions. The model can be fine-tuned for specific writing styles or genres to improve the quality of the output.
Text Classification: FLAN-T5 can be used to classify text into different categories, such as spam or non-spam, positive or negative, or topics such as politics, sports, or entertainment. This can be useful for a variety of applications, such as content moderation, customer support, or personalized recommendations.
Text Summarization: FLAN-T5 can be fine-tuned to generate concise summaries of long articles and documents, making it ideal for news aggregation and information retrieval.
Sentiment Analysis: FLAN-T5 can be used to analyze the sentiment of text, such as online reviews, news articles, or social media posts. This can help businesses to understand how their products or services are being received, and to make informed decisions based on this data.
Question-Answering: FLAN-T5 can be fine-tuned to answer questions in a conversational manner, making it ideal for customer service and support.
Translation: FLAN-T5 can be fine-tuned to perform machine translation, making it ideal for multilingual content creation and localization.
Chatbots and Conversational AI: FLAN-T5 can be used to create conversational AI systems that can respond to user input in a natural and engaging manner. The model can be trained to handle a wide range of topics and respond in a conversational tone that is appropriate for the target audience.

finqa_flant5_finetuned¶

The finqa_flant5_finetuned Question Answering model is a FLAN-T5 model fine-tuned on finance data. It answers finance questions from a context passage that you supply together with the question.

In [6]:

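# Turn the raw "question" and "context" columns into document annotations.
document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

# Load the finance-tuned FLAN-T5 model; the custom prompt controls how
# question and context are combined before being passed to the model.
fin_qa = finance.QuestionAnswering.pretrained("finqa_flant5_finetuned","en","finance/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")\
    .setMaxNewTokens(100)\
    .setOutputCol("answer")

pipeline = nlp.Pipeline(stages=[document_assembler, fin_qa])

# Nothing is trained here, so fitting on an empty DataFrame is enough.
empty_data = spark.createDataFrame([["",""]]).toDF("question", "context")

model = pipeline.fit(empty_data)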

finqa_flant5_finetuned download started this may take some time.
[OK!]
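To make the role of the custom prompt concrete, here is a minimal plain-Python sketch of how the `{QUESTION}` and `{CONTEXT}` placeholders could be filled in. How the annotator performs this substitution internally is not shown in this notebook, so the hypothetical `render_prompt` helper only illustrates the string the model is presumably given.

```python
CUSTOM_PROMPT = "question: {QUESTION} context: {CONTEXT}"


def render_prompt(template: str, question: str, context: str) -> str:
    """Fill the two named placeholders of the template."""
    return template.format(QUESTION=question, CONTEXT=context)


prompt = render_prompt(
    CUSTOM_PROMPT,
    "What was the operating loss in 2020?",
    "We reported an operating loss in 2020.",
)
print(prompt)
```

Changing the template (for example, putting the context first) changes what the model sees, so it is worth keeping the template close to the format the model was fine-tuned with.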

In [12]:

context = """Our business strategy has been to develop data processing and product technologies that can displace intermediaries within the online advertising ecosystem, while cultivating relationships that can provide access to media spend (advertisers) and media inventory (websites). In this regard, we have proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by launching a SaaS version of the IntentKey in 2021. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners where the cash generated from the business can be used to accelerate growth of the IntentKey. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. Our business strategy is focused on providing differentiation through the AI analytics and data products we own and protect through patents. For the marketing and advertising industries we serve, this strategy aligns with the components of the value chain that are the principal drivers of value to our clients. 
As part of our growth strategy, we evaluate acquisition candidates from time to time as opportunities arise with a focus on companies that have either advertisers or advertising relationships we do not possess or publishers or publishing partners who have content we do not possess."""

questions = ["""What are the key components of the business strategy described?""",
             """What is the immediate strategy for scaling the IntentKey platform?""",
             """How does the company aim to provide differentiation in the market?"""]

Q_P = [ [q, context] for q in questions]

data = spark.createDataFrame(Q_P).toDF("question", "context")

data.show(truncate = 60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                    question|                                                     context|
+------------------------------------------------------------+------------------------------------------------------------+
|What are the key components of the business strategy desc...|Our business strategy has been to develop data processing...|
|What is the immediate strategy for scaling the IntentKey ...|Our business strategy has been to develop data processing...|
|How does the company aim to provide differentiation in th...|Our business strategy has been to develop data processing...|
+------------------------------------------------------------+------------------------------------------------------------+

In [13]:

result = model.transform(data)

result.select('question', 'answer.result').show(truncate=False)

+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                          |result                                                                                                                                                                                                                                                                                                            |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What are the key components of the business strategy described?   |[The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ]|
|What is the immediate strategy for scaling the IntentKey platform?|[The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ]                                               |
|How does the company aim to provide differentiation in the market?|[The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ]                                                                                                                                                                                   |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Using LightPipeline¶

In [42]:

context = """Our business strategy has been to develop data processing and product technologies that can displace intermediaries within the online advertising ecosystem, while cultivating relationships that can provide access to media spend (advertisers) and media inventory (websites). In this regard, we have proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by launching a SaaS version of the IntentKey in 2021. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners where the cash generated from the business can be used to accelerate growth of the IntentKey. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. Our business strategy is focused on providing differentiation through the AI analytics and data products we own and protect through patents. For the marketing and advertising industries we serve, this strategy aligns with the components of the value chain that are the principal drivers of value to our clients. 
As part of our growth strategy, we evaluate acquisition candidates from time to time as opportunities arise with a focus on companies that have either advertisers or advertising relationships we do not possess or publishers or publishing partners who have content we do not possess."""

questions = ["""What are the key components of the business strategy described?""",
             """What is the immediate strategy for scaling the IntentKey platform?""",
             """How does the company aim to provide differentiation in the market?"""]

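# LightPipeline runs the fitted pipeline directly on Python strings,
# without building a Spark DataFrame first.
light_model = nlp.LightPipeline(model)

all_result = []

# Annotate one (question, context) pair at a time.
for question in questions:
  light_result = light_model.annotate([question], [context])
  all_result.append(light_result)

all_result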

Out[42]:

[[{'document_question': ['What are the key components of the business strategy described?'],
   'document_context': ['Our business strategy has been to develop data processing and product technologies that can displace intermediaries within the online advertising ecosystem, while cultivating relationships that can provide access to media spend (advertisers) and media inventory (websites). In this regard, we have proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by launching a SaaS version of the IntentKey in 2021. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners where the cash generated from the business can be used to accelerate growth of the IntentKey. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. Our business strategy is focused on providing differentiation through the AI analytics and data products we own and protect through patents. For the marketing and advertising industries we serve, this strategy aligns with the components of the value chain that are the principal drivers of value to our clients. 
As part of our growth strategy, we evaluate acquisition candidates from time to time as opportunities arise with a focus on companies that have either advertisers or advertising relationships we do not possess or publishers or publishing partners who have content we do not possess.'],
   'answer': ['The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ']}],
 [{'document_question': ['What is the immediate strategy for scaling the IntentKey platform?'],
   'document_context': ['Our business strategy has been to develop data processing and product technologies that can displace intermediaries within the online advertising ecosystem, while cultivating relationships that can provide access to media spend (advertisers) and media inventory (websites). In this regard, we have proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by launching a SaaS version of the IntentKey in 2021. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners where the cash generated from the business can be used to accelerate growth of the IntentKey. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. Our business strategy is focused on providing differentiation through the AI analytics and data products we own and protect through patents. For the marketing and advertising industries we serve, this strategy aligns with the components of the value chain that are the principal drivers of value to our clients. 
As part of our growth strategy, we evaluate acquisition candidates from time to time as opportunities arise with a focus on companies that have either advertisers or advertising relationships we do not possess or publishers or publishing partners who have content we do not possess.'],
   'answer': ['The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ']}],
 [{'document_question': ['How does the company aim to provide differentiation in the market?'],
   'document_context': ['Our business strategy has been to develop data processing and product technologies that can displace intermediaries within the online advertising ecosystem, while cultivating relationships that can provide access to media spend (advertisers) and media inventory (websites). In this regard, we have proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by launching a SaaS version of the IntentKey in 2021. We have both direct and indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the immediate strategy is to maintain the business at current levels by working with existing partners where the cash generated from the business can be used to accelerate growth of the IntentKey. For the IntentKey platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. Our business strategy is focused on providing differentiation through the AI analytics and data products we own and protect through patents. For the marketing and advertising industries we serve, this strategy aligns with the components of the value chain that are the principal drivers of value to our clients. 
As part of our growth strategy, we evaluate acquisition candidates from time to time as opportunities arise with a focus on companies that have either advertisers or advertising relationships we do not possess or publishers or publishing partners who have content we do not possess.'],
   'answer': ['The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ']}]]
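The nested output above is awkward to work with, and some answers end with stray periods (`. . . `). The sketch below flattens a result of that shape into `(question, answer)` pairs and tidies the answer text. The helpers `clean_answer` and `flatten_results` are hypothetical names introduced for this example, and the sample record is a shortened stand-in for the real output.

```python
import re
from typing import Dict, List, Tuple


def clean_answer(text: str) -> str:
    """Collapse whitespace and reduce a trailing run like '. . .' to one period."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"(?:\s*\.)+$", ".", text)


def flatten_results(all_result: List[List[Dict[str, List[str]]]]) -> List[Tuple[str, str]]:
    """Turn the nested LightPipeline-style output into (question, answer) pairs."""
    pairs = []
    for per_question in all_result:
        for record in per_question:
            question = record["document_question"][0]
            answer = clean_answer(record["answer"][0])
            pairs.append((question, answer))
    return pairs


# Shortened stand-in with the same nesting as the output above.
sample_result = [[{
    "document_question": ["What are the key components?"],
    "document_context": ["(context omitted)"],
    "answer": ["The key components are fraud detection technologies. . . "],
}]]

pairs = flatten_results(sample_result)
for question, answer in pairs:
    print(question, "->", answer)
```

Note that this cleanup also shortens a genuine trailing ellipsis to a single period, which is acceptable for these answers but may not suit other text.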

In [44]:

import textwrap

# Wrap long strings at 120 characters so they are readable when printed.
# A separate name avoids overwriting `context`, which earlier cells use.
wrapped_context = textwrap.fill(all_result[0][0]['document_context'][0], width=120)

print("➤ Context: \n{}".format(wrapped_context))
print("\n")

for result in all_result:

  question = textwrap.fill(result[0]['document_question'][0], width=120)

  answer = textwrap.fill(result[0]['answer'][0], width=120)

  print("➤ Question: \n{}".format(question))
  print("\n")
  print("➤ Answer: \n{}".format(answer))
  print("\n")

➤ Context: 
Our business strategy has been to develop data processing and product technologies that can displace intermediaries
within the online advertising ecosystem, while cultivating relationships that can provide access to media spend
(advertisers) and media inventory (websites). In this regard, we have proprietary demand (media spend) and supply side
(media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and
data management technologies, and advertising fraud detection technologies. We have both direct and indirect
relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick platform, the
immediate strategy is to maintain the business at current levels by working with existing partners. For the IntentKey
platform, the immediate strategy is to scale through the hiring of additional sales professionals, growing existing
accounts and expanding the market size by launching a SaaS version of the IntentKey in 2021. We have both direct and
indirect relationships at some of the largest media buyers and/or consolidators in the industry. For the ValidClick
platform, the immediate strategy is to maintain the business at current levels by working with existing partners where
the cash generated from the business can be used to accelerate growth of the IntentKey. For the IntentKey platform, the
immediate strategy is to scale through the hiring of additional sales professionals, growing existing accounts and
expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. Our business
strategy is focused on providing differentiation through the AI analytics and data products we own and protect through
patents. For the marketing and advertising industries we serve, this strategy aligns with the components of the value
chain that are the principal drivers of value to our clients. As part of our growth strategy, we evaluate acquisition
candidates from time to time as opportunities arise with a focus on companies that have either advertisers or
advertising relationships we do not possess or publishers or publishing partners who have content we do not possess.


➤ Question: 
What are the key components of the business strategy described?


➤ Answer: 
The key components of the business strategy described are proprietary demand (media spend) and supply side (media
inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data
management technologies, and advertising fraud detection technologies. . .


➤ Question: 
What is the immediate strategy for scaling the IntentKey platform?


➤ Answer: 
The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales
professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the
IntentKey beginning in 2021.


➤ Question: 
How does the company aim to provide differentiation in the market?


➤ Answer: 
The company aims to provide differentiation through the AI analytics and data products they own and protect through
patents.