🎬 Installation¶

In [ ]:

! pip install -q johnsnowlabs

🔗 Automatic Installation¶

Using my.johnsnowlabs.com SSO

In [ ]:

from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

🔗 Manual downloading¶

If you are not registered at my.johnsnowlabs.com, you received your license by email, or you are using Safari, you may need to upload the license manually.

Go to my.johnsnowlabs.com
Download your license
Upload it using the following command

In [ ]:

from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Install it

In [ ]:

nlp.install()

📌 Starting¶

In [ ]:

spark = nlp.start()

🔎 Financial Data Augmentation with Chunk Mappers¶

🚀 About Data Augmentation¶

Data Augmentation is the process of enriching an extracted data point with information from external sources.

For example, suppose I am working with a document that mentions the company Amazon. The document could be about stock prices, litigation, or a commercial agreement with a provider, among other things.

In the document, we can extract a company name using NER as an Organization, but that's all the information available about the company in that document.

With Data Augmentation, we can use external sources such as SEC EDGAR, Crunchbase, Nasdaq or even Wikipedia to enrich the company with much more information, allowing us to make better decisions.

Let's see how to do it.
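Before building the real pipeline, the idea can be sketched as a plain dictionary lookup. This is only an illustration, not how the Chunk Mapper models work internally: `EXTERNAL_RECORDS` and `augment` are hypothetical names, and the values are the ones printed on the Cadence 10-K cover page used later in this notebook, standing in for a real external source.

```python
# Hypothetical stand-in for an external source such as SEC EDGAR or Nasdaq.
# The values are copied from the Cadence 10-K cover page shown in this notebook.
EXTERNAL_RECORDS = {
    "cadence design systems, inc": {
        "ticker": "CDNS",
        "exchange": "Nasdaq Global Select Market",
        "state": "Delaware",
    },
}

def augment(entity_text, records=EXTERNAL_RECORDS):
    """Merge an extracted entity with whatever the lookup table knows about it."""
    # Very light key clean-up: trim, drop a trailing period, lowercase.
    key = entity_text.strip().rstrip(".").lower()
    extra = records.get(key, {})
    return {"chunk": entity_text, **extra}

augmented = augment("CADENCE DESIGN SYSTEMS, INC")
print(augmented)
```

An entity that is missing from the table comes back unchanged, with only its `chunk` field, so the lookup never invents information.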

📌 Sample Texts from Cadence Design Systems¶

Examples taken from publicly available information about Cadence in the SEC's EDGAR database and on Wikipedia.

In [ ]:

! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt

In [ ]:

with open('cdns-20220101.html.txt', 'r') as f:
  cadence_sec10k = f.read()
print(cadence_sec10k[:300])

Table of Contents
UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
_____________________________________ 
FORM 10-K 
_____________________________________  
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year en

In [ ]:

pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])

UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
_____________________________________ 
FORM 10-K 
_____________________________________  
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended January 1, 2022 
OR
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to_________.

Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:
None
Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.  
 Yes  
☒
    No  
☐
Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act.  
 Yes 
☐    
No  
☒
Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days.  
 Yes  
☒
    No  
☐
Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). 
 Yes  
☒
    No  
☐
Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.
Large Accelerated Filer
☒
Accelerated Filer
☐
Non-accelerated Filer
☐
Smaller Reporting Company
☐
Emerging Growth Company
☐
If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act.  
☐
Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. 
☒
Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). 
 Yes 
☐ 
No  
☒
The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000.
On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding.
DOCUMENTS INCORPORATED BY REFERENCE
Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.
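The page split used above relies on the text "Table of Contents" repeating at the top of every page of this filing. A minimal sketch of the same idea on a toy string follows; `split_pages` and `sample` are hypothetical names, and unlike the cell above this version also trims the whitespace around each page.

```python
def split_pages(text, marker="Table of Contents"):
    """Split a filing on a repeated page marker and drop empty fragments."""
    return [part.strip() for part in text.split(marker) if part.strip()]

# Toy input: the marker appears at the start of each page and once at the very end.
sample = (
    "Table of Contents\nFORM 10-K cover page\n"
    "Table of Contents\nPART I\n"
    "Table of Contents\n"
)

toy_pages = split_pages(sample)
print(len(toy_pages))
print(toy_pages[0])
```

Filtering out the empty fragments matters because the text starts with the marker, so the first element returned by `str.split` is an empty string.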

📌 Step 1: Using Text Classification to find Relevant Parts of the Document: 10K Summary¶

In this case, we know page 0 is always the page with summary information about the company. However, suppose we did not know that: we can use page classification to find it.

To detect the SEC 10-K summary page, there is a dedicated model called "finclf_form_10k_summary_item".

In [ ]:

# Text Classifier
# This pipeline allows you to use different classification models to understand if an input text is of a specific class or is something else.
  
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use_embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

classifier = finance.ClassifierDLModel.pretrained("finclf_form_10k_summary_item", "en", "finance/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler, 
    use_embeddings,
    classifier])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_form_10k_summary_item download started this may take some time.
[OK!]

In [ ]:

df = spark.createDataFrame([[pages[0]]]).toDF("text")

result = model.transform(df).cache()

In [ ]:

result.select('category.result').show()

+------------------+
|            result|
+------------------+
|[form_10k_summary]|
+------------------+
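If every page were classified rather than only page 0, the summary page could be located from the predictions. The sketch below assumes each row of `category.result` is a list of labels, as in the output above; `find_pages_with_label` is a hypothetical helper, the toy predictions are made up for illustration, and "other" is an assumed name for the negative class.

```python
def find_pages_with_label(predictions, target="form_10k_summary"):
    """Return the indices of pages whose predicted labels include the target."""
    return [i for i, labels in enumerate(predictions) if target in labels]

# Hypothetical predictions for a three-page document.
# "other" is an assumed label name, not confirmed by the model output above.
toy_predictions = [["form_10k_summary"], ["other"], ["other"]]

summary_pages = find_pages_with_label(toy_predictions)
print(summary_pages)
```

In the notebook, a list of this shape could probably be obtained with `[row.result for row in result.select('category.result').collect()]` after transforming a DataFrame with one row per page.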

📌 Step 2: Named Entity Recognition on 10K Summary¶

Named Entity Recognition (NER) is the main component for information extraction: it finds entities in text and labels them.

This time we will use a model trained to extract many entity types from 10-K summary pages.

In [ ]:

textSplitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    textSplitter,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_sec_10k_summary download started this may take some time.
[OK!]

✅ We use LightPipeline to get the result¶

In [ ]:

import pandas as pd

# fullAnnotate returns one result per input text, keyed by output column name
ner_result = light_model.fullAnnotate(pages[0])

# One row per detected chunk: text, character offsets and entity label
rows = [
    {'chunks': n.result, 'begin': n.begin, 'end': n.end, 'entities': n.metadata['entity']}
    for n in ner_result[0]['ner_chunk']
]

df = pd.DataFrame(rows, columns=['chunks', 'begin', 'end', 'entities'])

df.head(20)

Out[ ]:

	chunks	begin	end	entities
0	January 1, 2022	287	301	FISCAL_YEAR
1	000-15867	476	484	CFN
2	CADENCE DESIGN SYSTEMS, INC	527	553	ORG
3	Delaware	650	657	STATE
4	00-0000000	661	670	IRS
5	2655 Seely Avenue, Building 5,\nSan Jose,\nCal...	772	822	ADDRESS
6	(408)\n-943-1234	886	900	PHONE
7	Common Stock	1098	1109	TITLE_CLASS
8	$0.01	1112	1116	TITLE_CLASS_VALUE
9	CDNS	1138	1141	TICKER
10	Nasdaq Global Select Market	1143	1169	STOCK_EXCHANGE
11	Common Stock	3799	3810	TITLE_CLASS
12	$0.01	3813	3817	TITLE_CLASS_VALUE
13	Cadence Design Systems, Inc	3931	3957	ORG

Alright! CADENCE DESIGN SYSTEMS, INC has been detected as an organization.
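To pull out only the organization mentions, the rows can be filtered by entity label. The sketch below uses a subset of the rows from the table above as plain dictionaries, so it runs without Spark; `ner_rows` and `chunks_by_entity` are hypothetical names.

```python
# A subset of the rows from the NER output table above.
ner_rows = [
    {"chunk": "January 1, 2022", "entity": "FISCAL_YEAR"},
    {"chunk": "CADENCE DESIGN SYSTEMS, INC", "entity": "ORG"},
    {"chunk": "CDNS", "entity": "TICKER"},
    {"chunk": "Cadence Design Systems, Inc", "entity": "ORG"},
]

def chunks_by_entity(rows, entity):
    """Return the chunk texts whose entity label matches, in document order."""
    return [row["chunk"] for row in rows if row["entity"] == entity]

org_mentions = chunks_by_entity(ner_rows, "ORG")
print(org_mentions)
```

The same company appears twice with different casing, which shows why extracted names usually need some clean-up before they can be matched against an external source.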

Now, let's augment CADENCE DESIGN SYSTEMS, INC with more information about the company, since the 10-K form itself gives me no further details to use.

But before augmenting, there is a very important step we need to carry out: Company Name Normalization.

🚀We will continue this notebook in 10.1.Data_Augmentation_with_ChunkMappers.ipynb
