📜 Assertion Status, or Understanding Financial Entities in Context, is an NLP task in charge of analyzing entities extracted by NER models
📜 together with their surroundings (usually a sentence, although bigger spans can be used too) to assert different conditions / statuses on those entities, such as whether a role is past or present, whether an entity is negated, or whether a company is mentioned as a competitor.
📜 These are just some examples; Assertion DL models can be applied to many other scenarios where you need to classify an entity based on the context around it.
🚀 Let's see which pretrained models we have and how to train custom ones!
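To make the task concrete before building any pipeline, here is a tiny, hand-written illustration (the sentence and labels below are made up and were not produced by any model) of what an assertion model adds on top of NER output:

# Hand-written, hypothetical example: not produced by any model.
sentence = "Mr. Doe was the CFO of ACME until 2015."

# An NER model finds the entity and its type...
ner_chunk = {"chunk": "CFO", "entity": "ROLE"}

# ...and the assertion model reads the surrounding context ("was", "until 2015")
# to attach a status label to that same chunk.
assertion = {"chunk": "CFO", "entity": "ROLE", "assertion": "PAST"}

print(sentence)
print(ner_chunk, "->", assertion["assertion"])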
! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import *
# nlp.install(force_browser=True)
If you are not registered in my.johnsnowlabs.com, if you received a license via email, or if you are using Safari, you may need to upload your license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
Let's start with a small example: analyzing whether the ROLE of a person in a company is mentioned to be past or present.
📜 For that, we need:

- finner_bert_roles, which uses bert_embeddings_sec_bert_base embeddings to extract ROLE entities;
- finassertiondl_past_roles, a very specific model that detects time (past vs. present) on ROLE entities.

document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
tokenClassifier = finance.BertForTokenClassification.pretrained("finner_bert_roles","en","finance/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = finance.NerConverterInternal() \
.setInputCols(["document", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["ROLE"])
assertion = finance.AssertionDLModel.pretrained("finassertiondl_past_roles", "en", "finance/models")\
.setInputCols(["document", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
embeddings,
tokenClassifier,
ner_converter,
assertion
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
sample_texts = ["""From January 2009 to November 2017, Mr. Tan worked as the Managing Director of Cadence""",
"""Jane S. Smith works as a Computer Engineer and Product Lead at Globalize Cloud Services""",
"""Mrs. Johansson has been apointed CEO and President of Mileways""",
"""Tom Martin worked as Cadence's CTO until 2010""",
"""Mrs. Charles was before Managing Director at a big consultancy company""",
"""We are happy to announce that Mary Leigh joins Elephant as Web Designer and UX/UI Developer"""]
import pandas as pd
chunks=[]
entities=[]
status=[]
for i in range(len(sample_texts)):
    light_result = light_model.fullAnnotate(sample_texts[i])[0]
    for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
        chunks.append(n.result)
        entities.append(n.metadata['entity'])
        status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Director | ROLE | PAST |
| 1 | Computer Engineer | ROLE | NO_PAST |
| 2 | Product Lead | ROLE | NO_PAST |
| 3 | CEO | ROLE | NO_PAST |
| 4 | President | ROLE | NO_PAST |
| 5 | Cadence's CTO | ROLE | PAST |
| 6 | Managing Director | ROLE | PAST |
| 7 | Web Designer | ROLE | NO_PAST |
| 8 | UX/UI Developer | ROLE | NO_PAST |
for i in range(len(sample_texts)):
    light_result = light_model.fullAnnotate(sample_texts[i])[0]
    vis = nlp.viz.AssertionVisualizer()
    vis.display(light_result, 'ner_chunk', 'assertion')
📜 Now let's go bigger. We will use one 10K filing, extract several pages and apply assertion status to detect time for:

- PER (people)
- ORG (organizations)
- ROLE (roles of those people in that or past organizations)

📜 For that, we need:

- finner_org_per_role_date, which uses bert_embeddings_sec_bert_base embeddings to extract PERSON, ORG and ROLE entities;
- finassertion_time, a generic time assertion model, to detect time on the previously mentioned entities.

🚀 Please keep in mind that you can also use this model on other entities, but performance may degrade since it was not trained on other kinds of entities.
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt
import requests
URL = "https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt"
response = requests.get(URL)
cadence_sec10k = response.content.decode('utf-8')
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
# Split the filing into pages, using the recurring "Table of Contents" string as the only boundary
text_splitter = finance.TextSplitter() \
.setInputCols(["document"]) \
.setOutputCol("pages")\
.setCustomBounds(["Table of Contents"])\
.setUseCustomBoundsOnly(True)\
.setExplodeSentences(True)
nlp_pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter])
#fit: trains, configures and prepares the pipeline for inference.
sdf = spark.createDataFrame([[ cadence_sec10k ]]).toDF("text")
fit = nlp_pipeline.fit(sdf)
%%time
#transforms: executes inference on a fit pipeline
res = fit.transform(sdf)
res.show()
+--------------------+--------------------+--------------------+
|                text|            document|               pages|
+--------------------+--------------------+--------------------+
|Table of Contents...|[{document, 0, 34...|[{document, 18, 4...|
|Table of Contents...|[{document, 0, 34...|[{document, 4087,...|
|Table of Contents...|[{document, 0, 34...|[{document, 4215,...|
|Table of Contents...|[{document, 0, 34...|[{document, 5504,...|
|Table of Contents...|[{document, 0, 34...|[{document, 11617...|
|Table of Contents...|[{document, 0, 34...|[{document, 13985...|
|Table of Contents...|[{document, 0, 34...|[{document, 20001...|
|Table of Contents...|[{document, 0, 34...|[{document, 26059...|
|Table of Contents...|[{document, 0, 34...|[{document, 31638...|
|Table of Contents...|[{document, 0, 34...|[{document, 36733...|
|Table of Contents...|[{document, 0, 34...|[{document, 42440...|
|Table of Contents...|[{document, 0, 34...|[{document, 47053...|
|Table of Contents...|[{document, 0, 34...|[{document, 48328...|
|Table of Contents...|[{document, 0, 34...|[{document, 53745...|
|Table of Contents...|[{document, 0, 34...|[{document, 59341...|
|Table of Contents...|[{document, 0, 34...|[{document, 65403...|
|Table of Contents...|[{document, 0, 34...|[{document, 72330...|
|Table of Contents...|[{document, 0, 34...|[{document, 77951...|
|Table of Contents...|[{document, 0, 34...|[{document, 84131...|
|Table of Contents...|[{document, 0, 34...|[{document, 89718...|
+--------------------+--------------------+--------------------+
only showing top 20 rows

CPU times: user 48 ms, sys: 13.8 ms, total: 61.7 ms
Wall time: 6.41 s
%%time
import json
lp = nlp.LightPipeline(fit)
json_res = lp.annotate(cadence_sec10k)
print(json.dumps(json_res, indent=4))
pages = [json_res['pages'][i] for i in range(13)]
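Before picking which pages to analyze, it can help to sanity-check the split. A quick illustrative look at how many pages were produced and how the first one starts (indices depend on the filing):

# Quick sanity check of the page split (illustrative)
print("Number of pages:", len(json_res['pages']))
print(json_res['pages'][0][:300])  # first 300 characters of the first page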
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols("sentence", "token", "embeddings")\
.setOutputCol("ner")
chunk_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
assertion = finance.AssertionDLModel.pretrained("finassertion_time", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner,
chunk_converter,
assertion
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
lp = nlp.LightPipeline(model)
🚀 Let's start identifying time using finassertion_time. As in previous notebooks, we will be using a SEC 10K filing.
from johnsnowlabs import viz
texts = [pages[12]]
res = lp.fullAnnotate(texts)
vis = viz.AssertionVisualizer()
for r in res:
    vis.display(r, 'ner_chunk', 'assertion')
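If you prefer a tabular view over the visualizer, you can aggregate the chunks and assertions across all the selected pages with the same pattern used earlier. A small sketch reusing the lp LightPipeline defined above:

import pandas as pd

chunks, entities, status, page_ids = [], [], [], []

# fullAnnotate accepts a list of texts, so we can annotate all selected pages at once
results = lp.fullAnnotate(pages)

for page_id, r in enumerate(results):
    for n, m in zip(r['ner_chunk'], r['assertion']):
        chunks.append(n.result)
        entities.append(n.metadata['entity'])
        status.append(m.result)
        page_ids.append(page_id)

pd.DataFrame({'page': page_ids, 'chunks': chunks, 'entities': entities, 'assertion': status})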
This model uses Assertion Status to identify whether a PRODUCT or an ORG is mentioned as a COMPETITOR. By default, if nothing is mentioned, it returns NO_COMPETITOR.

Again, this is a model that uses the context around PRODUCT or ORGANIZATION entities to further subclassify them.

For that, we need:

- finner_orgs_prods_alias, which uses bert_embeddings_sec_bert_base embeddings to extract ORG, PRODUCT and ALIAS entities;
- finassertion_competitors, which tells whether a company or product is a COMPETITOR or NO_COMPETITOR.

🚀 Please keep in mind that you can also use this model on other entities, but performance may be affected.
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Text Splitter annotator, processes various sentences per line
text_splitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion
])
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_df)
light_model = nlp.LightPipeline(model)
sample_text = """Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = model.transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
.select(F.expr("cols['1']['sentence']").alias("sent_id"),
F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['2']").alias("assertion")).show(truncate=False)
+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
import pandas as pd
light_result = light_model.fullAnnotate(sample_text)[0]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | McAfee LLC | ORG | COMPETITOR |
| 1 | Broadcom Inc | ORG | COMPETITOR |
Visualization (COMPETITOR example)

# from sparknlp_display import AssertionVisualizer
vis = nlp.viz.AssertionVisualizer()
vis.display(light_result, 'ner_chunk', 'assertion')
You can generalize this by building components or full pipelines with helper functions. Here is an example of how to achieve that.
def get_base_pipeline(embeddings):

    documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    textSplitter = finance.TextSplitter()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    embeddings = nlp.BertEmbeddings.pretrained(embeddings, "en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    base_pipeline = nlp.Pipeline(stages=[
        documentAssembler,
        textSplitter,
        tokenizer,
        embeddings])

    return base_pipeline
def get_assertion(embeddings, ner_model, assertion_model):

    ner = finance.NerModel.pretrained(ner_model, "en", "finance/models")\
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ner_converter = nlp.NerConverter() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    assertion = finance.AssertionDLModel.pretrained(assertion_model, "en", "finance/models")\
        .setInputCols(["sentence", "ner_chunk", "embeddings"])\
        .setOutputCol("assertion")

    base_model = get_base_pipeline(embeddings)

    nlpPipeline = nlp.Pipeline(stages=[
        base_model,
        ner,
        ner_converter,
        assertion])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    light_model = nlp.LightPipeline(model)

    return light_model
sample_text = """EDH combines our Cloudera Data Warehouse, Cloudera Operational DB, and Cloudera Data Science with our SDX technology."""
embeddings = "bert_embeddings_sec_bert_base"
ner_model = "finner_orgs_prods_alias"
assertion_model = "finassertion_competitors"
light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
finner_orgs_prods_alias download started this may take some time.
[OK!]
finassertion_competitors download started this may take some time.
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | EDH | ORG | NO_COMPETITOR |
| 1 | Cloudera Data Warehouse | PRODUCT | NO_COMPETITOR |
| 2 | Cloudera Operational DB | PRODUCT | NO_COMPETITOR |
| 3 | Cloudera Data Science | PRODUCT | NO_COMPETITOR |
| 4 | SDX | PRODUCT | NO_COMPETITOR |
Visualization (NO_COMPETITOR example)

vis = nlp.viz.AssertionVisualizer()
vis.display(light_result, 'ner_chunk', 'assertion')
Negation in context

This model uses Assertion Status to identify whether an ORG or PRODUCT is affected by a negation particle in the context.

Again, this is a model that uses the context around PRODUCT or ORGANIZATION entities to further subclassify them.

For that, we need:

- finner_orgs_prods_alias, which uses bert_embeddings_sec_bert_base embeddings to extract ORG, PRODUCT and ALIAS entities;
- finassertion_negation, which tells whether an entity is mentioned in a positive or negative context.

🚀 Please keep in mind that you can also use this model on other entities, but performance may be affected.
sample_text = """EDH combines our Cloudera Data Warehouse, Cloudera Operational DB, and Cloudera Data Science with our SDX technology."""
embeddings = "bert_embeddings_sec_bert_base"
ner_model = "finner_orgs_prods_alias"
assertion_model = "finassertion_negation"
light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]
finner_orgs_prods_alias download started this may take some time.
[OK!]
finassertion_negation download started this may take some time.
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | EDH | ORG | positive |
| 1 | Cloudera Data Warehouse | PRODUCT | positive |
| 2 | Cloudera Operational DB | PRODUCT | positive |
| 3 | Cloudera Data Science | PRODUCT | positive |
| 4 | SDX | PRODUCT | positive |
sample_text = """Whatsapp did not borrow funds from Meta for its capital needs. Synapsis INC will not be considered as eligible for X Engineering, Inc. supplier financing program."""
light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]
finner_orgs_prods_alias download started this may take some time.
[OK!]
finassertion_negation download started this may take some time.
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
chunks=[]
entities=[]
status=[]
for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})
df
| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Whatsapp | ORG | negative |
| 1 | Meta | ORG | positive |
| 2 | Synapsis INC | ORG | negative |
| 3 | X Engineering, Inc | ORG | positive |
vis = nlp.viz.AssertionVisualizer()
vis.display(light_result, 'ner_chunk', 'assertion')
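🚀 The intro also mentioned training custom assertion models. Below is only a minimal, hypothetical sketch, assuming finance.AssertionDLApproach exposes the standard Spark NLP AssertionDL training setters; the column names ("text", "target", "label", "start", "end"), the epoch count and the training DataFrame name are placeholders you would adapt to your own annotated data:

# Hypothetical training sketch: column names, epochs and the training DataFrame
# are placeholders; it assumes finance.AssertionDLApproach follows the usual
# AssertionDL training API (label column plus start/end token indices of the chunk).
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk_assembler = nlp.Doc2Chunk()\
    .setInputCols(["document"])\
    .setChunkCol("target")\
    .setOutputCol("chunk")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

assertion_approach = finance.AssertionDLApproach()\
    .setInputCols(["document", "chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setLabelCol("label")\
    .setStartCol("start")\
    .setEndCol("end")\
    .setEpochs(10)

training_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    chunk_assembler,
    tokenizer,
    embeddings,
    assertion_approach])

# assertion_train_df is your annotated Spark DataFrame (placeholder name)
# trained_model = training_pipeline.fit(assertion_train_df)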