! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, finance, viz
# nlp.install(force_browser=True)
If you are not registered on my.johnsnowlabs.com, if you received your license via e-mail, or if you are using Safari, you may need to upload the license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
spark = nlp.start()
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
👌 Launched cpu optimized session with: 🚀Spark-NLP==4.2.8, 💊Spark-Healthcare==4.2.8, running on ⚡ PySpark==3.1.2
Financial relation extraction is the process of automatically extracting structured information from unstructured text about finance and economics. It relies on natural language processing (NLP) techniques such as named entity recognition (NER) and relation extraction.
Examples include extracting information about companies and their financial performance (revenue, profits, debt), as well as information about financial markets and economic indicators (stock prices, exchange rates).
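As a concrete illustration of the target output, a relation extractor turns a sentence into typed, related entity pairs. The tuples below are hand-written for illustration, not model output:

```python
# Illustrative only: the structured output a relation extraction model aims for.
sentence = "AWR Corporation was acquired by Cadence in fiscal 2020."

# (relation, entity1_type, entity1_text, entity2_type, entity2_text)
relations = [
    ("was_acquired_by", "ORG", "AWR Corporation", "ORG", "Cadence"),
    ("has_acquisition_date", "ORG", "Cadence", "DATE", "fiscal 2020"),
]

for rel, t1, c1, t2, c2 in relations:
    print(f"{c1} ({t1}) --{rel}--> {c2} ({t2})")
```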
📚Here is the list of pretrained Relation Extraction models:
Relation Extraction Models
📚First, let's build a generic base pipeline. These components are common to all the pipelines we will use.
def get_generic_base_pipeline():
    """Common components used in all pipelines"""
    document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    text_splitter = finance.TextSplitter()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

    base_pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings
    ])

    return base_pipeline
generic_base_pipeline = get_generic_base_pipeline()
bert_embeddings_sec_bert_base download started this may take some time. Approximate size to download 390.4 MB [OK!]
# Text Classifier
def get_text_classification_pipeline(model):
    """This pipeline allows you to use different classification models to check
    whether an input text belongs to a specific class or is something else.
    It will be used to find the first summary page of a SEC 10-K filing, the
    sections on Acquisitions and Subsidiaries, or the places in the document
    where management roles and experiences are mentioned."""
    document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    embeddings = nlp.UniversalSentenceEncoder.pretrained()\
        .setInputCols("document")\
        .setOutputCol("sentence_embeddings")

    classifier = nlp.ClassifierDLModel.pretrained(model, "en", "finance/models")\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("category")

    nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        embeddings,
        classifier])

    return nlpPipeline
import pandas as pd

def get_relations_df(results, col='relations'):
    """Shows a DataFrame with the relations extracted by Spark NLP"""
    rel_pairs = []
    for rel in results[0][col]:
        rel_pairs.append((
            rel.result,
            rel.metadata['entity1'],
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity2'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['confidence']
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['relation', 'entity1', 'entity1_begin', 'entity1_end', 'chunk1',
                                              'entity2', 'entity2_begin', 'entity2_end', 'chunk2', 'confidence'])
    return rel_df
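To preview the helper's output without a Spark session, we can feed it stand-in annotation objects. The `FakeAnnotation` class below is a hypothetical stub mimicking Spark NLP's `Annotation` (a `result` string plus a `metadata` dict), and the helper is repeated so the snippet runs on its own:

```python
import pandas as pd

# Repeated from the cell above so this snippet runs standalone.
def get_relations_df(results, col='relations'):
    rel_pairs = []
    for rel in results[0][col]:
        m = rel.metadata
        rel_pairs.append((rel.result,
                          m['entity1'], m['entity1_begin'], m['entity1_end'], m['chunk1'],
                          m['entity2'], m['entity2_begin'], m['entity2_end'], m['chunk2'],
                          m['confidence']))
    return pd.DataFrame(rel_pairs, columns=['relation', 'entity1', 'entity1_begin', 'entity1_end', 'chunk1',
                                            'entity2', 'entity2_begin', 'entity2_end', 'chunk2', 'confidence'])

# Hypothetical stub mimicking Spark NLP's Annotation objects.
class FakeAnnotation:
    def __init__(self, result, metadata):
        self.result = result
        self.metadata = metadata

fake_result = [{
    "relations": [FakeAnnotation("was_acquired_by", {
        "entity1": "ORG", "entity1_begin": "490", "entity1_end": "504", "chunk1": "AWR Corporation",
        "entity2": "ORG", "entity2_begin": "440", "entity2_end": "446", "chunk2": "Cadence",
        "confidence": "0.99",
    })]
}]

df = get_relations_df(fake_result)
print(df[['relation', 'chunk1', 'chunk2']])
```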
NER by itself only extracts isolated entities. By combining NER models with the Relation Extraction annotators trained on top of them, you can determine whether the extracted entities are related to each other.
Suppose we want to extract information about Acquisitions and Subsidiaries. If we don't know where that information is in the document, we can use Text Classifiers to find it.
To detect the SEC 10-K Summary page, we have a specific model called "finclf_acquisitions_item".
Let's send some pages and check which one(s) contain that information. In a real case you would send all the pages to the model, but to save time we will use just a subset.
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt
with open('cdns-20220101.html.txt', 'r') as f:
    cadence_sec10k = f.read()
print(cadence_sec10k[:100])
Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 __________
pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 _____________________________________ FORM 10-K _____________________________________ (Mark One) ☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended January 1, 2022 OR ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from _________ to_________. Commission file number 000-15867 _____________________________________ CADENCE DESIGN SYSTEMS, INC. (Exact name of registrant as specified in its charter) ____________________________________ Delaware 00-0000000 (State or Other Jurisdiction ofIncorporation or Organization) (I.R.S. EmployerIdentification No.) 2655 Seely Avenue, Building 5, San Jose, California 95134 (Address of Principal Executive Offices) (Zip Code) (408) -943-1234 (Registrant’s Telephone Number, including Area Code) Securities registered pursuant to Section 12(b) of the Act: Title of Each Class Trading Symbol(s) Names of Each Exchange on which Registered Common Stock, $0.01 par value per share CDNS Nasdaq Global Select Market Securities registered pursuant to Section 12(g) of the Act: None Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. Yes ☒ No ☐ Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. Yes ☐ No ☒ Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes ☒ No ☐ Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes ☒ No ☐ Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act. Large Accelerated Filer ☒ Accelerated Filer ☐ Non-accelerated Filer ☐ Smaller Reporting Company ☐ Emerging Growth Company ☐ If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. ☐ Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. ☒ Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). Yes ☐ No ☒ The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000. On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding. 
DOCUMENTS INCORPORATED BY REFERENCE Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.
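The split above uses the recurring "Table of Contents" string as a page delimiter and drops empty segments. A toy illustration of the same idiom:

```python
# Split a document on its recurring page-header string, keeping only
# non-empty segments (the same idiom used for the 10-K above).
toy = "Table of Contents PAGE ONE Table of Contents PAGE TWO"
pages_toy = [x for x in toy.split("Table of Contents") if x.strip() != '']
print(pages_toy)  # two non-empty segments
```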
# Some examples
candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[67]]]
classification_pipeline = get_text_classification_pipeline('finclf_acquisitions_item')
df = spark.createDataFrame(candidates).toDF("text")
model = classification_pipeline.fit(df)
result = model.transform(df)
tfhub_use download started this may take some time. Approximate size to download 923.7 MB [OK!] finclf_acquisitions_item download started this may take some time. Approximate size to download 21.3 MB [OK!]
result.select('category.result').show()
+--------------+ | result| +--------------+ | [other]| | [other]| | [other]| |[acquisitions]| +--------------+
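To recover which original page was classified as containing acquisitions, pair the predictions with the candidate page indexes. The snippet below is a plain-Python sketch using the labels shown above (hard-coded here; in the notebook you would collect them from `result`):

```python
# Candidate page indexes sent to the classifier, in the same order
# as the rows of the prediction output shown above.
candidate_indexes = [0, 1, 35, 67]
predicted = [["other"], ["other"], ["other"], ["acquisitions"]]

acquisition_pages = [idx for idx, labels in zip(candidate_indexes, predicted)
                     if "acquisitions" in labels]
print(acquisition_pages)  # -> [67]
```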
📚Let's use some NER models to obtain information about Organizations and Dates, and a Relation Extraction model to understand whether they are related (e.g. which organization acquired which, and when). We will use the detected pages[67] as input.
ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_orgs"])\
    .setOutputCol("ner_chunk_org")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel.pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.1)

# This pipeline produces a single relation column ("relations_acq"),
# so the merger only needs that one input.
annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_acq")\
    .setOutputCol("relations")
nlpPipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL,
    annotation_merger])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
finner_sec_dates download started this may take some time. [OK!] finner_orgs_prods_alias download started this may take some time. [OK!] pos_anc download started this may take some time. Approximate size to download 3.9 MB [OK!] dependency_conllu download started this may take some time. Approximate size to download 16.7 MB [OK!] finre_acquisitions_subsidiaries_md download started this may take some time. [OK!]
sample_text = pages[67].replace("“", "\"").replace("”", "\"")
result = light_model.fullAnnotate(sample_text)
rel_df = get_relations_df(result)
rel_df
| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_acquisition_date | ORG | 440 | 446 | Cadence | DATE | 427 | 437 | fiscal 2020 | 0.99945384 |
| 1 | has_acquisition_date | ORG | 490 | 504 | AWR Corporation | DATE | 427 | 437 | fiscal 2020 | 0.99891853 |
| 2 | was_acquired_by | ORG | 490 | 504 | AWR Corporation | ORG | 440 | 446 | Cadence | 0.99111485 |
| 3 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 440 | 446 | Cadence | 0.99635243 |
| 4 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 490 | 504 | AWR Corporation | 0.94192755 |
| 5 | other | ORG | 1210 | 1212 | AWR | ORG | 1218 | 1226 | Integrand | 0.9999858 |
| 6 | other | ORG | 1229 | 1235 | Cadence | DATE | 1358 | 1367 | nine years | 0.996561 |
| 7 | other | ORG | 1905 | 1907 | AWR | ORG | 1913 | 1921 | Integrand | 0.9999651 |
| 8 | has_acquisition_date | ORG | 1955 | 1961 | Cadence | DATE | 2007 | 2017 | fiscal 2020 | 0.99776745 |
| 9 | other | DATE | 2219 | 2229 | fiscal 2021 | ORG | 2322 | 2330 | Cadence’s | 0.99219704 |
| 10 | other | DATE | 2235 | 2245 | fiscal 2020 | ORG | 2322 | 2330 | Cadence’s | 0.99703074 |
| 11 | other | DATE | 2539 | 2549 | fiscal 2021 | ORG | 2598 | 2606 | Cadence’s | 0.94122887 |
| 12 | other | DATE | 2552 | 2555 | 2020 | ORG | 2598 | 2606 | Cadence’s | 0.96238184 |
| 13 | other | DATE | 2560 | 2563 | 2019 | ORG | 2598 | 2606 | Cadence’s | 0.9658956 |
| 14 | other | DATE | 3191 | 3222 | the third quarter of fiscal 2021 | ORG | 3262 | 3270 | Cadence’s | 0.5690664 |
rel_df = rel_df[(rel_df["relation"] != "other") & (rel_df["relation"] != "no_rel")]
rel_df
| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_acquisition_date | ORG | 440 | 446 | Cadence | DATE | 427 | 437 | fiscal 2020 | 0.99945384 |
| 1 | has_acquisition_date | ORG | 490 | 504 | AWR Corporation | DATE | 427 | 437 | fiscal 2020 | 0.99891853 |
| 2 | was_acquired_by | ORG | 490 | 504 | AWR Corporation | ORG | 440 | 446 | Cadence | 0.99111485 |
| 3 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 440 | 446 | Cadence | 0.99635243 |
| 4 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 490 | 504 | AWR Corporation | 0.94192755 |
| 8 | has_acquisition_date | ORG | 1955 | 1961 | Cadence | DATE | 2007 | 2017 | fiscal 2020 | 0.99776745 |
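A filtered dataframe like the one above is easy to turn into (subject, relation, object) triples, e.g. to feed a knowledge graph. A minimal pandas sketch over two of the rows shown (hard-coded here for illustration):

```python
import pandas as pd

# Two rows from the filtered relations dataframe, hard-coded for illustration.
rel_df = pd.DataFrame([
    ("was_acquired_by", "AWR Corporation", "Cadence"),
    ("has_acquisition_date", "Cadence", "fiscal 2020"),
], columns=["relation", "chunk1", "chunk2"])

# Reorder columns into (subject, predicate, object) triples.
triples = list(rel_df[["chunk1", "relation", "chunk2"]].itertuples(index=False, name=None))
print(triples)
```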
re_vis = viz.RelationExtractionVisualizer()

re_vis.display(result=result[0],
               relation_col="relations",
               document_col="document",
               exclude_relations=["other", "no_rel"],
               show_relations=True)