! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, finance, viz
# nlp.install(force_browser=True)
If you are not registered on my.johnsnowlabs.com, if you received your license via e-mail, or if you are using Safari, you may need to upload the license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
spark = nlp.start()
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
👌 Launched cpu optimized session with: 🚀Spark-NLP==4.2.8, 💊Spark-Healthcare==4.2.8, running on ⚡ PySpark==3.1.2
Financial relation extraction is the process of automatically extracting structured information from unstructured text about finance and economics. It relies on natural language processing (NLP) techniques such as named entity recognition (NER) and relation extraction.
Examples include extracting information about companies and their financial performance (revenue, profits, debt), as well as information about financial markets and economic indicators (stock prices, exchange rates).
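As a concrete illustration of the target output, a relation extractor turns a sentence into typed, related entity pairs. The tuples below are hand-written for illustration, not model output:

```python
# Illustrative only: the structured output a relation extraction model aims for.
sentence = "AWR Corporation was acquired by Cadence in fiscal 2020."

# (relation, entity1_type, entity1_text, entity2_type, entity2_text)
relations = [
    ("was_acquired_by", "ORG", "AWR Corporation", "ORG", "Cadence"),
    ("has_acquisition_date", "ORG", "Cadence", "DATE", "fiscal 2020"),
]

for rel, t1, c1, t2, c2 in relations:
    print(f"{c1} ({t1}) --{rel}--> {c2} ({t2})")
```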
📚Here is the list of pretrained Relation Extraction models:
Relation Extraction Models
📚First, let's build a generic base pipeline. These components are common to all the pipelines we will use.
def get_generic_base_pipeline():
    """Common components used in all pipelines"""
    document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    text_splitter = finance.TextSplitter()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

    base_pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings
    ])

    return base_pipeline
generic_base_pipeline = get_generic_base_pipeline()
bert_embeddings_sec_bert_base download started this may take some time. Approximate size to download 390.4 MB [OK!]
# Text Classifier
def get_text_classification_pipeline(model):
    """This pipeline allows you to use different classification models to check
    whether an input text belongs to a specific class or is something else.
    It will be used to find the first summary page of a SEC 10-K filing, the
    sections on Acquisitions and Subsidiaries, or the places in the document
    where management roles and experiences are mentioned."""
    document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    embeddings = nlp.UniversalSentenceEncoder.pretrained()\
        .setInputCols("document")\
        .setOutputCol("sentence_embeddings")

    classifier = nlp.ClassifierDLModel.pretrained(model, "en", "finance/models")\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("category")

    nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        embeddings,
        classifier])

    return nlpPipeline
import pandas as pd

def get_relations_df(results, col='relations'):
    """Shows a DataFrame with the relations extracted by Spark NLP"""
    rel_pairs = []
    for rel in results[0][col]:
        rel_pairs.append((
            rel.result,
            rel.metadata['entity1'],
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity2'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['confidence']
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['relation', 'entity1', 'entity1_begin', 'entity1_end', 'chunk1',
                                              'entity2', 'entity2_begin', 'entity2_end', 'chunk2', 'confidence'])
    return rel_df
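To preview the helper's output without a Spark session, we can feed it stand-in annotation objects. The `FakeAnnotation` class below is a hypothetical stub mimicking Spark NLP's `Annotation` (a `result` string plus a `metadata` dict), and the helper is repeated so the snippet runs on its own:

```python
import pandas as pd

# Repeated from the cell above so this snippet runs standalone.
def get_relations_df(results, col='relations'):
    rel_pairs = []
    for rel in results[0][col]:
        m = rel.metadata
        rel_pairs.append((rel.result,
                          m['entity1'], m['entity1_begin'], m['entity1_end'], m['chunk1'],
                          m['entity2'], m['entity2_begin'], m['entity2_end'], m['chunk2'],
                          m['confidence']))
    return pd.DataFrame(rel_pairs, columns=['relation', 'entity1', 'entity1_begin', 'entity1_end', 'chunk1',
                                            'entity2', 'entity2_begin', 'entity2_end', 'chunk2', 'confidence'])

# Hypothetical stub mimicking Spark NLP's Annotation objects.
class FakeAnnotation:
    def __init__(self, result, metadata):
        self.result = result
        self.metadata = metadata

fake_result = [{
    "relations": [FakeAnnotation("was_acquired_by", {
        "entity1": "ORG", "entity1_begin": "490", "entity1_end": "504", "chunk1": "AWR Corporation",
        "entity2": "ORG", "entity2_begin": "440", "entity2_end": "446", "chunk2": "Cadence",
        "confidence": "0.99",
    })]
}]

df = get_relations_df(fake_result)
print(df[['relation', 'chunk1', 'chunk2']])
```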
NER by itself only extracts isolated entities. By combining NER models with the Relation Extraction annotators trained on top of them, you can determine whether the extracted entities are related to each other.
Suppose we want to extract information about Acquisitions and Subsidiaries. If we don't know where that information is in the document, we can use Text Classifiers to find it.
To detect the SEC 10-K Summary page, we have a specific model called "finclf_acquisitions_item".
Let's send some pages and check which one(s) contain that information. In a real case you would send all the pages to the model, but to save time we will use just a subset.
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt
with open('cdns-20220101.html.txt', 'r') as f:
    cadence_sec10k = f.read()
print(cadence_sec10k[:100])
Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 __________
pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 _____________________________________ FORM 10-K _____________________________________ (Mark One) ☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended January 1, 2022 OR ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from _________ to_________. Commission file number 000-15867 _____________________________________ CADENCE DESIGN SYSTEMS, INC. (Exact name of registrant as specified in its charter) ____________________________________ Delaware 00-0000000 (State or Other Jurisdiction ofIncorporation or Organization) (I.R.S. EmployerIdentification No.) 2655 Seely Avenue, Building 5, San Jose, California 95134 (Address of Principal Executive Offices) (Zip Code) (408) -943-1234 (Registrant’s Telephone Number, including Area Code) Securities registered pursuant to Section 12(b) of the Act: Title of Each Class Trading Symbol(s) Names of Each Exchange on which Registered Common Stock, $0.01 par value per share CDNS Nasdaq Global Select Market Securities registered pursuant to Section 12(g) of the Act: None Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. Yes ☒ No ☐ Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. Yes ☐ No ☒ Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. 
Yes ☒ No ☐ Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes ☒ No ☐ Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act. Large Accelerated Filer ☒ Accelerated Filer ☐ Non-accelerated Filer ☐ Smaller Reporting Company ☐ Emerging Growth Company ☐ If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. ☐ Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. ☒ Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). Yes ☐ No ☒ The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000. On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding. 
DOCUMENTS INCORPORATED BY REFERENCE Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.
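The split above uses the recurring "Table of Contents" string as a page delimiter and drops empty segments. A toy illustration of the same idiom:

```python
# Split a document on its recurring page-header string, keeping only
# non-empty segments (the same idiom used for the 10-K above).
toy = "Table of Contents PAGE ONE Table of Contents PAGE TWO"
pages_toy = [x for x in toy.split("Table of Contents") if x.strip() != '']
print(pages_toy)  # two non-empty segments
```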
# Some examples
candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[67]]]
classification_pipeline = get_text_classification_pipeline('finclf_acquisitions_item')
df = spark.createDataFrame(candidates).toDF("text")
model = classification_pipeline.fit(df)
result = model.transform(df)
tfhub_use download started this may take some time. Approximate size to download 923.7 MB [OK!] finclf_acquisitions_item download started this may take some time. Approximate size to download 21.3 MB [OK!]
result.select('category.result').show()
+--------------+ | result| +--------------+ | [other]| | [other]| | [other]| |[acquisitions]| +--------------+
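To recover which original page was classified as containing acquisitions, pair the predictions with the candidate page indexes. The snippet below is a plain-Python sketch using the labels shown above (hard-coded here; in the notebook you would collect them from `result`):

```python
# Candidate page indexes sent to the classifier, in the same order
# as the rows of the prediction output shown above.
candidate_indexes = [0, 1, 35, 67]
predicted = [["other"], ["other"], ["other"], ["acquisitions"]]

acquisition_pages = [idx for idx, labels in zip(candidate_indexes, predicted)
                     if "acquisitions" in labels]
print(acquisition_pages)  # -> [67]
```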
📚Let's use some NER models to obtain information about Organizations and Dates, and a Relation Extraction model to understand whether they are related (e.g. which organization acquired which, and when). We will use the detected pages[67] as input.
ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_orgs"])\
    .setOutputCol("ner_chunk_org")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel.pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.1)

# This pipeline produces a single relation column ("relations_acq"),
# so the merger only needs that one input.
annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_acq")\
    .setOutputCol("relations")
nlpPipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL,
    annotation_merger])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
finner_sec_dates download started this may take some time. [OK!] finner_orgs_prods_alias download started this may take some time. [OK!] pos_anc download started this may take some time. Approximate size to download 3.9 MB [OK!] dependency_conllu download started this may take some time. Approximate size to download 16.7 MB [OK!] finre_acquisitions_subsidiaries_md download started this may take some time. [OK!]
sample_text = pages[67].replace("“", "\"").replace("”", "\"")
result = light_model.fullAnnotate(sample_text)
rel_df = get_relations_df(result)
rel_df
| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_acquisition_date | ORG | 440 | 446 | Cadence | DATE | 427 | 437 | fiscal 2020 | 0.99945384 |
| 1 | has_acquisition_date | ORG | 490 | 504 | AWR Corporation | DATE | 427 | 437 | fiscal 2020 | 0.99891853 |
| 2 | was_acquired_by | ORG | 490 | 504 | AWR Corporation | ORG | 440 | 446 | Cadence | 0.99111485 |
| 3 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 440 | 446 | Cadence | 0.99635243 |
| 4 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 490 | 504 | AWR Corporation | 0.94192755 |
| 5 | other | ORG | 1210 | 1212 | AWR | ORG | 1218 | 1226 | Integrand | 0.9999858 |
| 6 | other | ORG | 1229 | 1235 | Cadence | DATE | 1358 | 1367 | nine years | 0.996561 |
| 7 | other | ORG | 1905 | 1907 | AWR | ORG | 1913 | 1921 | Integrand | 0.9999651 |
| 8 | has_acquisition_date | ORG | 1955 | 1961 | Cadence | DATE | 2007 | 2017 | fiscal 2020 | 0.99776745 |
| 9 | other | DATE | 2219 | 2229 | fiscal 2021 | ORG | 2322 | 2330 | Cadence’s | 0.99219704 |
| 10 | other | DATE | 2235 | 2245 | fiscal 2020 | ORG | 2322 | 2330 | Cadence’s | 0.99703074 |
| 11 | other | DATE | 2539 | 2549 | fiscal 2021 | ORG | 2598 | 2606 | Cadence’s | 0.94122887 |
| 12 | other | DATE | 2552 | 2555 | 2020 | ORG | 2598 | 2606 | Cadence’s | 0.96238184 |
| 13 | other | DATE | 2560 | 2563 | 2019 | ORG | 2598 | 2606 | Cadence’s | 0.9658956 |
| 14 | other | DATE | 3191 | 3222 | the third quarter of fiscal 2021 | ORG | 3262 | 3270 | Cadence’s | 0.5690664 |
rel_df = rel_df[(rel_df["relation"] != "other") & (rel_df["relation"] != "no_rel")]
rel_df
| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_acquisition_date | ORG | 440 | 446 | Cadence | DATE | 427 | 437 | fiscal 2020 | 0.99945384 |
| 1 | has_acquisition_date | ORG | 490 | 504 | AWR Corporation | DATE | 427 | 437 | fiscal 2020 | 0.99891853 |
| 2 | was_acquired_by | ORG | 490 | 504 | AWR Corporation | ORG | 440 | 446 | Cadence | 0.99111485 |
| 3 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 440 | 446 | Cadence | 0.99635243 |
| 4 | was_acquired_by | ORG | 518 | 540 | Integrand Software, Inc | ORG | 490 | 504 | AWR Corporation | 0.94192755 |
| 8 | has_acquisition_date | ORG | 1955 | 1961 | Cadence | DATE | 2007 | 2017 | fiscal 2020 | 0.99776745 |
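A filtered dataframe like the one above is easy to turn into (subject, relation, object) triples, e.g. to feed a knowledge graph. A minimal pandas sketch over two of the rows shown (hard-coded here for illustration):

```python
import pandas as pd

# Two rows from the filtered relations dataframe, hard-coded for illustration.
rel_df = pd.DataFrame([
    ("was_acquired_by", "AWR Corporation", "Cadence"),
    ("has_acquisition_date", "Cadence", "fiscal 2020"),
], columns=["relation", "chunk1", "chunk2"])

# Reorder columns into (subject, predicate, object) triples.
triples = list(rel_df[["chunk1", "relation", "chunk2"]].itertuples(index=False, name=None))
print(triples)
```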
re_vis = viz.RelationExtractionVisualizer()

re_vis.display(result=result[0],
               relation_col="relations",
               document_col="document",
               exclude_relations=["other", "no_rel"],
               show_relations=True)