📜 Assertion Status, or Understanding Financial Entities in Context, is an NLP task in charge of analyzing entities extracted by NER models
📜 together with their surroundings (usually a sentence, although bigger spans can be used too) to assert different conditions / statuses on those entities, such as whether a role is past or present, whether an entity is negated, or whether a company is mentioned as a competitor.
📜 These are just some examples; Assertion DL models can be applied to many other scenarios where you need to classify an entity based on the context around it.
🚀 Let's see which pretrained models we have and how to train custom ones!
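To make the task concrete before building any pipeline, here is a tiny, hand-written illustration (the sentence and labels below are made up and were not produced by any model) of what an assertion model adds on top of NER output:

# Hand-written, hypothetical example: not produced by any model.
sentence = "Mr. Doe was the CFO of ACME until 2015."

# An NER model finds the entity and its type...
ner_chunk = {"chunk": "CFO", "entity": "ROLE"}

# ...and the assertion model reads the surrounding context ("was", "until 2015")
# to attach a status label to that same chunk.
assertion = {"chunk": "CFO", "entity": "ROLE", "assertion": "PAST"}

print(sentence)
print(ner_chunk, "->", assertion["assertion"])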
! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import *
# nlp.install(force_browser=True)
If you are not registered in my.johnsnowlabs.com, if you received a license via email, or if you are using Safari, you may need to upload your license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
Let's start with a small example: analyzing whether the ROLE of a person in a company is mentioned to be past or present.
📜 For that, we need:

- finner_bert_roles, which uses bert_embeddings_sec_bert_base embeddings to extract ROLE entities;
- finassertiondl_past_roles, a very specific model that detects time (past vs. present) on ROLE entities.

document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
tokenClassifier = finance.BertForTokenClassification.pretrained("finner_bert_roles","en","finance/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = finance.NerConverterInternal() \
.setInputCols(["document", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["ROLE"])
assertion = finance.AssertionDLModel.pretrained("finassertiondl_past_roles", "en", "finance/models")\
.setInputCols(["document", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
embeddings,
tokenClassifier,
ner_converter,
assertion
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
sample_texts = ["""From January 2009 to November 2017, Mr. Tan worked as the Managing Director of Cadence""",
"""Jane S. Smith works as a Computer Engineer and Product Lead at Globalize Cloud Services""",
"""Mrs. Johansson has been apointed CEO and President of Mileways""",
"""Tom Martin worked as Cadence's CTO until 2010""",
"""Mrs. Charles was before Managing Director at a big consultancy company""",
"""We are happy to announce that Mary Leigh joins Elephant as Web Designer and UX/UI Developer"""]
import pandas as pd
chunks=[]
entities=[]
status=[]
for i in range(len(sample_texts)):
    light_result = light_model.fullAnnotate(sample_texts[i])[0]
    for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
        chunks.append(n.result)
        entities.append(n.metadata['entity'])
        status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Director | ROLE | PAST |
| 1 | Computer Engineer | ROLE | NO_PAST |
| 2 | Product Lead | ROLE | NO_PAST |
| 3 | CEO | ROLE | NO_PAST |
| 4 | President | ROLE | NO_PAST |
| 5 | Cadence's CTO | ROLE | PAST |
| 6 | Managing Director | ROLE | PAST |
| 7 | Web Designer | ROLE | NO_PAST |
| 8 | UX/UI Developer | ROLE | NO_PAST |
for i in range(len(sample_texts)):
    light_result = light_model.fullAnnotate(sample_texts[i])[0]
    vis = nlp.viz.AssertionVisualizer()
    vis.display(light_result, 'ner_chunk', 'assertion')
📜 Now let's go bigger. We will use one 10K filing, extract several pages and apply assertion status to detect time for:

- PER (people)
- ORG (organizations)
- ROLE (roles of those people in that or past organizations)

📜 For that, we need:

- finner_org_per_role_date, which uses bert_embeddings_sec_bert_base embeddings to extract PERSON, ORG and ROLE entities;
- finassertion_time, a generic time assertion model, to detect time on the previously mentioned entities.

🚀 Please keep in mind that you can also use this model on other entities, but performance may degrade since it was not trained on other kinds of entities.
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt
import requests
URL = "https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt"
response = requests.get(URL)
cadence_sec10k = response.content.decode('utf-8')
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
# Split the filing into pages, using the recurring "Table of Contents" string as the only boundary
text_splitter = finance.TextSplitter() \
.setInputCols(["document"]) \
.setOutputCol("pages")\
.setCustomBounds(["Table of Contents"])\
.setUseCustomBoundsOnly(True)\
.setExplodeSentences(True)
nlp_pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter])
#fit: trains, configures and prepares the pipeline for inference.
sdf = spark.createDataFrame([[ cadence_sec10k ]]).toDF("text")
fit = nlp_pipeline.fit(sdf)
%%time
#transforms: executes inference on a fit pipeline
res = fit.transform(sdf)
res.show()
+--------------------+--------------------+--------------------+
|                text|            document|               pages|
+--------------------+--------------------+--------------------+
|Table of Contents...|[{document, 0, 34...|[{document, 18, 4...|
|Table of Contents...|[{document, 0, 34...|[{document, 4087,...|
|Table of Contents...|[{document, 0, 34...|[{document, 4215,...|
|Table of Contents...|[{document, 0, 34...|[{document, 5504,...|
|Table of Contents...|[{document, 0, 34...|[{document, 11617...|
|Table of Contents...|[{document, 0, 34...|[{document, 13985...|
|Table of Contents...|[{document, 0, 34...|[{document, 20001...|
|Table of Contents...|[{document, 0, 34...|[{document, 26059...|
|Table of Contents...|[{document, 0, 34...|[{document, 31638...|
|Table of Contents...|[{document, 0, 34...|[{document, 36733...|
|Table of Contents...|[{document, 0, 34...|[{document, 42440...|
|Table of Contents...|[{document, 0, 34...|[{document, 47053...|
|Table of Contents...|[{document, 0, 34...|[{document, 48328...|
|Table of Contents...|[{document, 0, 34...|[{document, 53745...|
|Table of Contents...|[{document, 0, 34...|[{document, 59341...|
|Table of Contents...|[{document, 0, 34...|[{document, 65403...|
|Table of Contents...|[{document, 0, 34...|[{document, 72330...|
|Table of Contents...|[{document, 0, 34...|[{document, 77951...|
|Table of Contents...|[{document, 0, 34...|[{document, 84131...|
|Table of Contents...|[{document, 0, 34...|[{document, 89718...|
+--------------------+--------------------+--------------------+
only showing top 20 rows

CPU times: user 48 ms, sys: 13.8 ms, total: 61.7 ms
Wall time: 6.41 s
%%time
import json
lp = nlp.LightPipeline(fit)
json_res = lp.annotate(cadence_sec10k)
print(json.dumps(json_res, indent=4))
pages = [json_res['pages'][i] for i in range(13)]
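Before picking which pages to analyze, it can help to sanity-check the split. A quick illustrative look at how many pages were produced and how the first one starts (indices depend on the filing):

# Quick sanity check of the page split (illustrative)
print("Number of pages:", len(json_res['pages']))
print(json_res['pages'][0][:300])  # first 300 characters of the first page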
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols("sentence", "token", "embeddings")\
.setOutputCol("ner")
chunk_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
assertion = finance.AssertionDLModel.pretrained("finassertion_time", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner,
chunk_converter,
assertion
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
lp = nlp.LightPipeline(model)
🚀 Let's start identifying time using finassertion_time. As in previous notebooks, we will be using a SEC 10K filing.
from johnsnowlabs import viz
texts = [pages[12]]
res = lp.fullAnnotate(texts)
vis = viz.AssertionVisualizer()
for r in res:
    vis.display(r, 'ner_chunk', 'assertion')
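If you prefer a tabular view over the visualizer, you can aggregate the chunks and assertions across all the selected pages with the same pattern used earlier. A small sketch reusing the lp LightPipeline defined above:

import pandas as pd

chunks, entities, status, page_ids = [], [], [], []

# fullAnnotate accepts a list of texts, so we can annotate all selected pages at once
results = lp.fullAnnotate(pages)

for page_id, r in enumerate(results):
    for n, m in zip(r['ner_chunk'], r['assertion']):
        chunks.append(n.result)
        entities.append(n.metadata['entity'])
        status.append(m.result)
        page_ids.append(page_id)

pd.DataFrame({'page': page_ids, 'chunks': chunks, 'entities': entities, 'assertion': status})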
This model uses Assertion Status to identify whether a PRODUCT or an ORG is mentioned as a COMPETITOR. By default, if nothing is mentioned, it returns NO_COMPETITOR.

Again, this is a model that uses the context around PRODUCT or ORGANIZATION entities to further subclassify them.

For that, we need:

- finner_orgs_prods_alias, which uses bert_embeddings_sec_bert_base embeddings to extract ORG, PRODUCT and ALIAS entities;
- finassertion_competitors, which tells whether a company or product is a COMPETITOR or NO_COMPETITOR.

🚀 Please keep in mind that you can also use this model on other entities, but performance may be affected.
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Text Splitter annotator, processes various sentences per line
text_splitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion
])
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_df)
light_model = nlp.LightPipeline(model)
sample_text = """Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = model.transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
.select(F.expr("cols['1']['sentence']").alias("sent_id"),
F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['2']").alias("assertion")).show(truncate=False)
+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
import pandas as pd
light_result = light_model.fullAnnotate(sample_text)[0]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | McAfee LLC | ORG | COMPETITOR |
| 1 | Broadcom Inc | ORG | COMPETITOR |
Visualization (COMPETITOR example)

# from sparknlp_display import AssertionVisualizer
vis = nlp.viz.AssertionVisualizer()
vis.display(light_result, 'ner_chunk', 'assertion')
You can generalize this by building components or full pipelines with helper functions. Here is an example of how to achieve that.
def get_base_pipeline(embeddings):

    documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    textSplitter = finance.TextSplitter()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    embeddings = nlp.BertEmbeddings.pretrained(embeddings, "en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    base_pipeline = nlp.Pipeline(stages=[
        documentAssembler,
        textSplitter,
        tokenizer,
        embeddings])

    return base_pipeline
def get_assertion(embeddings, ner_model, assertion_model):

    ner = finance.NerModel.pretrained(ner_model, "en", "finance/models")\
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ner_converter = nlp.NerConverter() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    assertion = finance.AssertionDLModel.pretrained(assertion_model, "en", "finance/models")\
        .setInputCols(["sentence", "ner_chunk", "embeddings"])\
        .setOutputCol("assertion")

    base_model = get_base_pipeline(embeddings)

    nlpPipeline = nlp.Pipeline(stages=[
        base_model,
        ner,
        ner_converter,
        assertion])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    light_model = nlp.LightPipeline(model)

    return light_model
sample_text = """EDH combines our Cloudera Data Warehouse, Cloudera Operational DB, and Cloudera Data Science with our SDX technology."""
embeddings = "bert_embeddings_sec_bert_base"
ner_model = "finner_orgs_prods_alias"
assertion_model = "finassertion_competitors"
light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
finner_orgs_prods_alias download started this may take some time.
[OK!]
finassertion_competitors download started this may take some time.
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | EDH | ORG | NO_COMPETITOR |
| 1 | Cloudera Data Warehouse | PRODUCT | NO_COMPETITOR |
| 2 | Cloudera Operational DB | PRODUCT | NO_COMPETITOR |
| 3 | Cloudera Data Science | PRODUCT | NO_COMPETITOR |
| 4 | SDX | PRODUCT | NO_COMPETITOR |
Visualization (NO_COMPETITOR example)

vis = nlp.viz.AssertionVisualizer()
vis.display(light_result, 'ner_chunk', 'assertion')
Negation in context

This model uses Assertion Status to identify whether an ORG or PRODUCT is affected by a negation particle in the context.

Again, this is a model that uses the context around PRODUCT or ORGANIZATION entities to further subclassify them.

For that, we need:

- finner_orgs_prods_alias, which uses bert_embeddings_sec_bert_base embeddings to extract ORG, PRODUCT and ALIAS entities;
- finassertion_negation, which tells whether an entity is mentioned in a positive or negative context.

🚀 Please keep in mind that you can also use this model on other entities, but performance may be affected.
sample_text = """EDH combines our Cloudera Data Warehouse, Cloudera Operational DB, and Cloudera Data Science with our SDX technology."""
embeddings = "bert_embeddings_sec_bert_base"
ner_model = "finner_orgs_prods_alias"
assertion_model = "finassertion_negation"
light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]
finner_orgs_prods_alias download started this may take some time.
[OK!]
finassertion_negation download started this may take some time.
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | EDH | ORG | positive |
| 1 | Cloudera Data Warehouse | PRODUCT | positive |
| 2 | Cloudera Operational DB | PRODUCT | positive |
| 3 | Cloudera Data Science | PRODUCT | positive |
| 4 | SDX | PRODUCT | positive |
sample_text = """Whatsapp did not borrow funds from Meta for its capital needs. Synapsis INC will not be considered as eligible for X Engineering, Inc. supplier financing program."""
light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]
finner_orgs_prods_alias download started this may take some time.
[OK!]
finassertion_negation download started this may take some time.
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Whatsapp | ORG | negative |
| 1 | Meta | ORG | positive |
| 2 | Synapsis INC | ORG | negative |
| 3 | X Engineering, Inc | ORG | positive |
vis = nlp.viz.AssertionVisualizer()
vis.display(light_result, 'ner_chunk', 'assertion')
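🚀 The intro also mentioned training custom assertion models. Below is only a minimal, hypothetical sketch, assuming finance.AssertionDLApproach exposes the standard Spark NLP AssertionDL training setters; the column names ("text", "target", "label", "start", "end"), the epoch count and the training DataFrame name are placeholders you would adapt to your own annotated data:

# Hypothetical training sketch: column names, epochs and the training DataFrame
# are placeholders; it assumes finance.AssertionDLApproach follows the usual
# AssertionDL training API (label column plus start/end token indices of the chunk).
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk_assembler = nlp.Doc2Chunk()\
    .setInputCols(["document"])\
    .setChunkCol("target")\
    .setOutputCol("chunk")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

assertion_approach = finance.AssertionDLApproach()\
    .setInputCols(["document", "chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setLabelCol("label")\
    .setStartCol("start")\
    .setEndCol("end")\
    .setEpochs(10)

training_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    chunk_assembler,
    tokenizer,
    embeddings,
    assertion_approach])

# assertion_train_df is your annotated Spark DataFrame (placeholder name)
# trained_model = training_pipeline.fit(assertion_train_df)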