! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, finance
# nlp.install(force_browser=True)
If you are not registered in my.johnsnowlabs.com, received your license via email, or are using Safari, you may need to upload the license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
spark = nlp.start()
Data Augmentation is the process of enriching an extracted data point with information from external sources.
For example, suppose we are working with a document that mentions the company Amazon. The document could be about stock prices, litigation, or just a commercial agreement with a provider, among other topics.
From the document, we can extract the company name using NER as an Organization, but that is all the information the document provides about the company.
With Data Augmentation, we can use external sources such as SEC EDGAR, Crunchbase, Nasdaq, or even Wikipedia to enrich the company with much more information, allowing us to make better decisions.
Let's see how to do it.
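Before moving to the real pipeline, the idea can be sketched in plain Python. This is only a toy illustration: `EXTERNAL_SOURCE` is a hand-written dictionary standing in for an external source, and `augment` is a hypothetical helper, not part of the johnsnowlabs library.

```python
# Toy illustration of data augmentation: merge what NER extracted from a
# document with attributes looked up in an external source.
# EXTERNAL_SOURCE is a hand-written stand-in, not a real API or dataset.
EXTERNAL_SOURCE = {
    "Amazon": {
        "ticker": "AMZN",
        "stock_exchange": "Nasdaq",
    }
}


def augment(entity, source):
    """Return a copy of `entity` enriched with any fields found in `source`."""
    extra = source.get(entity["name"], {})  # no match -> nothing to add
    return {**entity, **extra}


# What NER alone gives us: a name and an entity label.
extracted = {"name": "Amazon", "label": "ORG"}
augmented = augment(extracted, EXTERNAL_SOURCE)
print(augmented)
```

Note that the lookup only succeeds when the extracted name matches the key in the external source exactly, so differently written names of the same company would not be found.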
! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt
with open('cdns-20220101.html.txt', 'r') as f:
    cadence_sec10k = f.read()
print(cadence_sec10k[:300])
Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 _____________________________________ FORM 10-K _____________________________________ (Mark One) ☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year en
pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 _____________________________________ FORM 10-K _____________________________________ (Mark One) ☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended January 1, 2022 OR ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from _________ to_________. Commission file number 000-15867 _____________________________________ CADENCE DESIGN SYSTEMS, INC. (Exact name of registrant as specified in its charter) ____________________________________ Delaware 00-0000000 (State or Other Jurisdiction ofIncorporation or Organization) (I.R.S. EmployerIdentification No.) 2655 Seely Avenue, Building 5, San Jose, California 95134 (Address of Principal Executive Offices) (Zip Code) (408) -943-1234 (Registrant’s Telephone Number, including Area Code) Securities registered pursuant to Section 12(b) of the Act: Title of Each Class Trading Symbol(s) Names of Each Exchange on which Registered Common Stock, $0.01 par value per share CDNS Nasdaq Global Select Market Securities registered pursuant to Section 12(g) of the Act: None Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. Yes ☒ No ☐ Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. Yes ☐ No ☒ Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes ☒ No ☐ Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes ☒ No ☐ Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act. Large Accelerated Filer ☒ Accelerated Filer ☐ Non-accelerated Filer ☐ Smaller Reporting Company ☐ Emerging Growth Company ☐ If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. ☐ Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. ☒ Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). Yes ☐ No ☒ The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000. On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding. 
DOCUMENTS INCORPORATED BY REFERENCE Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.
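The page-splitting step above can be checked on a small string. The sketch below uses a made-up `toy_filing` text (an assumption for illustration, not taken from the real filing) and applies the same split-and-filter logic.

```python
# Toy filing text; the notebook itself uses the downloaded 10-K instead.
toy_filing = (
    "Table of Contents Cover page text "
    "Table of Contents Item 1. Business "
    "Table of Contents   "
)

# Same logic as above: split on the marker and drop chunks that are empty
# or contain only whitespace.
toy_pages = [x for x in toy_filing.split("Table of Contents") if x.strip() != ""]

print(len(toy_pages))
print([p.strip() for p in toy_pages])
```

The text before the first marker and the whitespace-only chunk after the last marker are both discarded, which is why the first real page ends up at index 0.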
Here we know that page 0 contains the summary information about the company. However, suppose we did not know that: we could use Page Classification to find it.
To detect the SEC 10-K summary page, there is a specific model called "finclf_form_10k_summary_item".
# Text Classifier
# This pipeline allows you to use different classification models to understand if an input text is of a specific class or is something else.
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
use_embeddings = nlp.UniversalSentenceEncoder.pretrained()\
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
classifier = finance.ClassifierDLModel.pretrained("finclf_form_10k_summary_item", "en", "finance/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
use_embeddings,
classifier])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_form_10k_summary_item download started this may take some time.
[OK!]
df = spark.createDataFrame([[pages[0]]]).toDF("text")
result = model.transform(df).cache()
result.select('category.result').show()
+------------------+
|            result|
+------------------+
|[form_10k_summary]|
+------------------+
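If the summary page were not known in advance, one option would be to classify every page and keep the indices whose label is `form_10k_summary`. The sketch below covers only the final selection step in plain Python; `find_summary_pages` is a hypothetical helper, and the `"other"` label is an assumed name for the negative class, not confirmed by the model's documentation.

```python
def find_summary_pages(labels, target="form_10k_summary"):
    """Return the indices of pages whose predicted label equals `target`."""
    return [i for i, label in enumerate(labels) if label == target]


# Hypothetical predictions for a three-page document, one label per page.
# "other" is an assumed name for the negative class.
predicted = ["form_10k_summary", "other", "other"]
print(find_summary_pages(predicted))
```

In the notebook, the list of labels would come from running the classification pipeline on one row per page and taking the first element of each `category.result` array.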
Named Entity Recognition (NER) is the main component for carrying out information extraction and extracting entities from texts.
This time we will use a model trained to extract many entities from 10-K summaries.
textSplitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
textSplitter,
tokenizer,
embeddings,
ner_model,
ner_converter,
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_sec_10k_summary download started this may take some time.
[OK!]
import pandas as pd
ner_result = light_model.fullAnnotate(pages[0])
chunks = []
entities = []
begin = []
end = []
for n in ner_result[0]['ner_chunk']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'entities':entities})
df.head(20)
| | chunks | begin | end | entities |
|---|---|---|---|---|
| 0 | January 1, 2022 | 287 | 301 | FISCAL_YEAR |
| 1 | 000-15867 | 476 | 484 | CFN |
| 2 | CADENCE DESIGN SYSTEMS, INC | 527 | 553 | ORG |
| 3 | Delaware | 650 | 657 | STATE |
| 4 | 00-0000000 | 661 | 670 | IRS |
| 5 | 2655 Seely Avenue, Building 5,\nSan Jose,\nCal... | 772 | 822 | ADDRESS |
| 6 | (408)\n-943-1234 | 886 | 900 | PHONE |
| 7 | Common Stock | 1098 | 1109 | TITLE_CLASS |
| 8 | $0.01 | 1112 | 1116 | TITLE_CLASS_VALUE |
| 9 | CDNS | 1138 | 1141 | TICKER |
| 10 | Nasdaq Global Select Market | 1143 | 1169 | STOCK_EXCHANGE |
| 11 | Common Stock | 3799 | 3810 | TITLE_CLASS |
| 12 | $0.01 | 3813 | 3817 | TITLE_CLASS_VALUE |
| 13 | Cadence Design Systems, Inc | 3931 | 3957 | ORG |
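To work with these results, it can help to group the extracted chunks by entity label. The following is a minimal sketch in plain Python using a few rows copied from the table above; `group_by_entity` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# A few (chunk, label) pairs copied from the NER results table.
rows = [
    ("CADENCE DESIGN SYSTEMS, INC", "ORG"),
    ("Delaware", "STATE"),
    ("CDNS", "TICKER"),
    ("Nasdaq Global Select Market", "STOCK_EXCHANGE"),
    ("Cadence Design Systems, Inc", "ORG"),
]


def group_by_entity(pairs):
    """Map each entity label to the list of chunks tagged with it."""
    by_label = defaultdict(list)
    for chunk, label in pairs:
        by_label[label].append(chunk)
    return dict(by_label)


grouped = group_by_entity(rows)
print(grouped["ORG"])
```

The two ORG chunks refer to the same company but differ in casing, so a naive exact-match lookup would treat them as different names.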
Alright! CADENCE DESIGN SYSTEMS, INC has been detected as an organization.
Now, let's augment CADENCE DESIGN SYSTEMS, INC with more information about the company, since the 10-K form itself provides no further details we can use.
But before augmenting, there is a very important step we need to carry out: Company Name Normalization.
🚀 We will continue this notebook in 10.1.Data_Augmentation_with_ChunkMappers.ipynb