# Install the John Snow Labs Python library (quiet mode; notebook shell magic).
! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import *
# nlp.install(force_browser=True)
If you are not registered on my.johnsnowlabs.com, if you received your license via e-mail, or if you are using Safari, you may need to upload the license manually.
# Colab-only setup: prompt for the John Snow Labs license file, install the
# licensed libraries, then start a Spark session with them on the classpath.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
# files.upload() opens a browser file picker and returns {filename: bytes}.
license_keys = files.upload()
nlp.install()
from johnsnowlabs import *
# spark is reused by every pipeline below.
spark = nlp.start()
▒▒▒▒▒▒▒▒▒▒ 100% ᴄᴏᴍᴘʟᴇᴛᴇ!
import requests

# Raw-text sample contracts from the JohnSnowLabs spark-nlp-workshop repository.
_DATA_BASE = ("https://raw.githubusercontent.com/JohnSnowLabs/"
              "spark-nlp-workshop/master/legal-nlp/data")
URL = f"{_DATA_BASE}/commercial_lease_1.txt"
URL_2 = f"{_DATA_BASE}/commercial_lease_2.txt"
URL_3 = f"{_DATA_BASE}/credit_agreement_2.txt"
URL_4 = f"{_DATA_BASE}/loan_agreement.txt"

response = requests.get(URL)
response2 = requests.get(URL_2)
response3 = requests.get(URL_3)
response4 = requests.get(URL_4)
# Fail fast if any download returned an error page instead of the document;
# otherwise the classifiers below would silently classify HTML error text.
for _resp in (response, response2, response3, response4):
    _resp.raise_for_status()

commercial_lease = response.content.decode('utf-8')
commercial_lease_2 = response2.content.decode('utf-8')
credit_agreement = response3.content.decode('utf-8')
loan_agreement = response4.content.decode('utf-8')

# One single-column ([text]) row per document; the row order determines how the
# classifier outputs below line up with the documents.
documents = [[doc] for doc in (commercial_lease, credit_agreement,
                               loan_agreement, commercial_lease_2)]
# Document-level classification: tag each document as commercial lease or other.
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Sentence-BERT embeddings computed over each whole document.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols("document")\
.setOutputCol("sentence_embeddings")
# Pretrained legal classifier: labels are "commercial-lease" / "other"
# (see the output below).
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_commercial_lease", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
embeddings,
doc_classifier])
# One row per document; all stages are pretrained models, so fit() just
# assembles the PipelineModel.
df = spark.createDataFrame(documents).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
result.select('category.result').show(truncate=False)
sent_bert_base_cased download started this may take some time. Approximate size to download 389.1 MB [OK!] legclf_commercial_lease download started this may take some time. [OK!] +------------------+ |result | +------------------+ |[commercial-lease]| |[other] | |[other] | |[commercial-lease]| +------------------+
# Same document-classification pipeline, now with the loan-agreement classifier.
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols("document")\
.setOutputCol("sentence_embeddings")
# Pretrained legal classifier: labels are "loan-agreement" / "other"
# (see the output below).
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_loan_agreement_bert", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
embeddings,
doc_classifier])
df = spark.createDataFrame(documents).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
result.select('category.result').show(truncate=False)
sent_bert_base_cased download started this may take some time. Approximate size to download 389.1 MB [OK!] legclf_loan_agreement_bert download started this may take some time. [OK!] +----------------+ |result | +----------------+ |[other] | |[other] | |[loan-agreement]| |[other] | +----------------+
📜 Explanation:
.setCustomBounds(["\r\n"]) sets an array of regular expressions that tell the annotator how to split the document (here we split by paragraph).
.setUseCustomBoundsOnly(True) makes the annotator ignore the default regexes ('\n', ...), since the default behaviour of SentenceDetector is sentence splitting.
.setExplodeSentences(True)
# .setExplodeSentences(True) creates one new row in the dataframe per split.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into "pages" on the custom boundary "\r\n\r\n " instead
# of the annotator's default sentence-splitting behaviour.
text_splitter = legal.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("pages") \
    .setCustomBounds(["\r\n\r\n "]) \
    .setUseCustomBoundsOnly(True) \
    .setExplodeSentences(True)

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter])

# Fit once on one lease, then reuse the fitted model as a LightPipeline to
# annotate the second lease directly from a Python string.
sdf = spark.createDataFrame([[commercial_lease]]).toDF("text")
fit = nlp_pipeline.fit(sdf)
lp = nlp.LightPipeline(fit)
res = lp.annotate(commercial_lease_2)
pages = res['pages']
pages = [p for p in pages if p.strip() != '']  # We remove empty pages
len(pages)
87
# Classify each page produced by the TextSplitter above: introduction clause
# or other. Reuses the document_assembler defined in the splitting section.
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
.setInputCols("document")\
.setOutputCol("sentence_embeddings")
# Pretrained clause classifier: labels are "introduction" / "other".
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_introduction_clause", "en", "legal/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
embeddings,
doc_classifier])
# One row per page.
texts = [[i] for i in pages]
df = spark.createDataFrame(texts).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
result.select('category.result').show()
sent_bert_base_cased download started this may take some time. Approximate size to download 389.1 MB [OK!] legclf_introduction_clause download started this may take some time. [OK!] +--------------+ | result| +--------------+ |[introduction]| | [other]| |[introduction]| |[introduction]| | [other]| | [other]| | [other]| | [other]| | [other]| | [other]| | [other]| | [other]| | [other]| | [other]| | [other]| | [other]| |[introduction]| |[introduction]| | [other]| | [other]| +--------------+ only showing top 20 rows
# Keep only the pages whose top predicted category is not 'other' (i.e. the
# introduction clauses) and pull their raw text back to the driver.
introductory_clause = result.select('text').filter("category.result[0] != 'other'").collect()
print(introductory_clause[1][0])
THIS Lease Agreement , is made and entered into this _____day of May, 2006 by and between Global, Inc., (hereinafter called "Landlord"), and IMI Global, Inc., with a mailing address of ___, (hereinafter referred as "Tenant").
Spark NLP provides pretrained pipelines that have already been fitted with specific annotators and transformers for various use cases, so you don't have to create a pipeline from scratch. If you need to adjust the parameters of the Relation Extraction model, you can use the Relation Extraction pipeline shown below.
Next, we run a pretrained pipeline on the introductory clause
# Pretrained pipeline for detecting entities (NER) with an Introductory-Clause
# specific NER model, and then mapping the relations between them.
legal_pipeline = nlp.PretrainedPipeline("legpipe_ner_contract_doc_parties_alias_former", "en", "legal/models")

# Annotate the second introduction clause found above.
text = [introductory_clause[1][0]]
sdf = spark.createDataFrame([text]).toDF("text")
df = legal_pipeline.transform(sdf)

# fullAnnotate keeps annotation objects (with metadata), which the
# visualizer below needs.
result = legal_pipeline.fullAnnotate(text)[0]
result.keys()

from johnsnowlabs import viz

ner_viz = viz.NerVisualizer()
ner_viz.display(result, label_col='ner_chunk')
# NOTE(review): this import is unused below — nlp.PretrainedPipeline is called
# instead; kept as-is to avoid changing module behaviour.
from sparknlp.pretrained import PretrainedPipeline
# End-to-end pretrained NER + relation-extraction pipeline for contract parties.
pipeline = nlp.PretrainedPipeline("legpipe_re_contract_doc_parties_alias", "en", "legal/models")
legpipe_re_contract_doc_parties_alias download started this may take some time. Approx size to download 868 MB [OK!]
import pandas as pd
def get_relations_df(results, col='relations'):
    """Return a DataFrame with the relations extracted by Spark NLP.

    Parameters
    ----------
    results : list
        Output of ``fullAnnotate``; only ``results[0]`` is read.
    col : str, optional
        Name of the annotation column holding the relation annotations.

    Returns
    -------
    pandas.DataFrame
        One row per relation: relation label, both entities' labels, character
        spans and chunk texts, plus the model confidence (all from the
        annotation metadata, as strings).
    """
    rel_pairs = [
        (
            rel.result,
            rel.metadata['entity1'],
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity2'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['confidence'],
        )
        for rel in results[0][col]
    ]
    return pd.DataFrame(
        rel_pairs,
        columns=['relation', 'entity1', 'entity1_begin', 'entity1_end', 'chunk1',
                 'entity2', 'entity2_begin', 'entity2_end', 'chunk2', 'confidence'])
# Run the pretrained RE pipeline on the introduction clause and tabulate the
# predicted relations, hiding the 'other' (no-relation) rows.
result = pipeline.fullAnnotate(text)
rel_df = get_relations_df(result)
rel_df[rel_df["relation"] != "other"]
relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |
---|---|---|---|---|---|---|---|---|---|---|
0 | dated_as | DOC | 0 | 19 | THIS Lease Agreement | EFFDATE | 62 | 73 | of May, 2006 | 0.9999546 |
1 | signed_by | DOC | 0 | 19 | THIS Lease Agreement | PARTY | 90 | 100 | Global, Inc | 0.9911765 |
2 | has_alias | PARTY | 90 | 100 | Global, Inc | ALIAS | 125 | 132 | Landlord | 0.9999889 |
3 | has_alias | PARTY | 141 | 155 | IMI Global, Inc | ALIAS | 216 | 221 | Tenant | 0.9999893 |
# NOTE(review): this import is unused below — viz.RelationExtractionVisualizer
# is used instead; kept as-is to avoid changing module behaviour.
from sparknlp_display import RelationExtractionVisualizer
# Render the extracted relations as arrows over the clause text.
re_vis = viz.RelationExtractionVisualizer()
re_vis.display(result = result[0], relation_col = "relations", document_col = "document", exclude_relations = ["other"], show_relations=True)
The visualizer above displays the extracted relations between the entities. For more information, see the model's page on the Models Hub:
https://nlp.johnsnowlabs.com/models?q=legre_contract_doc_parties&task=Relation+Extraction
# Manual (non-pretrained-pipeline) relation-extraction stack, stage by stage:
# document -> sentences -> tokens -> embeddings/POS/dependencies -> NER ->
# chunk filtering -> deep-learning relation extraction.
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

textSplitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

# FIX: the original statement ended with a stray trailing backslash, which
# glued it to the pos_tagger assignment below and broke the script.
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

# POS tags and a dependency parse are required by RENerChunksFilter below.
# (pretrained() is called on the class — no throwaway instance needed.)
pos_tagger = nlp.PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_lg', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner1")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner1"])\
    .setOutputCol("ner_chunks")

# Keep only entity pairs within 7 dependency hops whose labels match one of
# the relation types the RE model can predict.
re_ner_chunk_filter = legal.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(7)\
    .setRelationPairs(["DOC-EFFDATE", "DOC-PARTY", "PARTY-FORMER_NAME", "ALIAS-PARTY", "PARTY-ALIAS"])

reDL = legal.RelationExtractionDLModel.pretrained('legre_contract_doc_parties_lg', 'en', 'legal/models')\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentence"])\
    .setOutputCol("relations")
# Assemble and run the manual relation-extraction pipeline.
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
textSplitter,
tokenizer,
embeddings,
pos_tagger,
dependency_parser,
ner_model,
ner_converter,
re_ner_chunk_filter,
reDL
])
text = introductory_clause[1][0]
# All stages are pretrained, so fitting on an empty dataframe just builds
# the PipelineModel without training anything.
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_df)
sdf = spark.createDataFrame([[text]]).toDF("text")
res = model.transform(sdf)
res.show(20,truncate=False)
import pyspark.sql.functions as F
# Explode the relations annotation array into one row per relation, pulling
# entity labels, chunk texts, confidence and syntactic distance out of the
# annotation metadata; drop the 'other' (no-relation) predictions.
result_df = res.select(F.explode(F.arrays_zip(res.relations.result,
res.relations.metadata)).alias("cols")) \
.select(
F.expr("cols['0']").alias("relations"),\
F.expr("cols['1']['entity1']").alias("relations_entity1"),\
F.expr("cols['1']['chunk1']" ).alias("relations_chunk1" ),\
F.expr("cols['1']['entity2']").alias("relations_entity2"),\
F.expr("cols['1']['chunk2']" ).alias("relations_chunk2" ),\
F.expr("cols['1']['confidence']" ).alias("confidence" ),\
F.expr("cols['1']['syntactic_distance']" ).alias("syntactic_distance" ),\
).filter("relations!='other'")
result_df.show()
+---------+-----------------+--------------------+-----------------+----------------+----------+------------------+ |relations|relations_entity1| relations_chunk1|relations_entity2|relations_chunk2|confidence|syntactic_distance| +---------+-----------------+--------------------+-----------------+----------------+----------+------------------+ | dated_as| DOC|THIS Lease Agreement| EFFDATE| of May, 2006| 0.9999546| 6| |signed_by| DOC|THIS Lease Agreement| PARTY| Global, Inc| 0.9911765| 7| |has_alias| PARTY| Global, Inc| ALIAS| Landlord| 0.9999889| 4| |has_alias| PARTY| IMI Global, Inc| ALIAS| Tenant| 0.9999893| 4| +---------+-----------------+--------------------+-----------------+----------------+----------+------------------+
# Wrap the fitted model in a LightPipeline for fast single-string annotation,
# then visualize the predicted relations (hiding the "no_rel" label).
light_model = nlp.LightPipeline(model)
result = light_model.fullAnnotate(text)
# from sparknlp_display import RelationExtractionVisualizer
re_vis = viz.RelationExtractionVisualizer()
re_vis.display(result = result[0],
relation_col = "relations",
document_col = "document",
exclude_relations = ["no_rel"],
show_relations=True
)
import json


def _write_rule(path, rule):
    """Serialize a ContextualParser rule definition to *path* as JSON."""
    with open(path, 'w') as f:
        json.dump(rule, f)


# Rule 1: any double-quoted span anywhere in the document is an ALIAS
# candidate (matched at sub-token level).
alias = {
    "entity": "ALIAS",
    "ruleScope": "document",
    "completeMatchRegex": "true",
    "regex": r'".*?"',
    "matchScope": "sub-token",
    "contextLength": 100
}
_write_rule('alias.json', alias)

# Rule 2: stricter variant — the quoted alias must appear inside parentheses,
# e.g. ("Landlord"). Raw string fixes the invalid '\(' escape of the original.
alias_2 = {
    "entity": "ALIAS",
    "ruleScope": "document",
    "completeMatchRegex": "true",
    "regex": r'\("(.*?)"\)',
    "matchScope": "sub-token",
    "contextLength": 100
}
_write_rule('alias_2.json', alias_2)
# Preprocessing for the merged NER pipeline: document -> sentences -> tokens,
# plus two rule-based ContextualParser stages driven by the JSON files
# written above (alias.json / alias_2.json).
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
textSplitter = legal.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
# Case-insensitive quoted-span matcher (rule 1).
alias_parser = legal.ContextualParserApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("subheader")\
.setJsonPath("alias.json") \
.setPrefixAndSuffixMatch(False)\
.setOptionalContextRules(True)\
.setCaseSensitive(False)
# Case-sensitive parenthesized-alias matcher (rule 2).
alias_parser2 = legal.ContextualParserApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("subheader2")\
.setJsonPath("alias_2.json") \
.setCaseSensitive(True) \
.setPrefixAndSuffixMatch(False)\
.setOptionalContextRules(False)
# FIX: the original statement ended with a stray trailing backslash, which
# glued it to the ner_model assignment below and broke the script.
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

# Trained NER model for contract parties.
ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_lg', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setThreshold(0.7)\
    .setOutputCol("ner_chunk")

# Zero-shot NER: each entity is defined by natural-language questions rather
# than a trained label set.
zero_shot_ner = legal.ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.3)\
    .setEntityDefinitions(
        {
            "PARTY": ["which Inc?", "Which Ltd?","Which company?","Which party?"],
            "EFFDATE": ["What is the date?"],
            "ALIAS": ["Where is the location?","What Aliases are used to refer to the PARTY?","What Aliases are used to refer to the effdate?","What Aliases are used to refer to the DOC?"],
            "FORMER_NAME": ['Formerly known as?'],
            "ADDRESS":["What is the full location?","where is the address?","Where is the principal location of business?"],
            "DOC":["What agreement?"]
        })

ner_converter_zeroshot = legal.NerConverterInternal()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk_zeroshot")\
    .setGreedyMode(True)

# Merge chunks from the trained NER, the zero-shot NER and both contextual
# parsers into a single column.
chunk_merger = legal.ChunkMergeApproach()\
    .setInputCols("ner_chunk", "ner_chunk_zeroshot", "subheader", "subheader2")\
    .setOutputCol('merged_ner_chunks')
# Full merged-NER pipeline: rule-based parsers + trained NER + zero-shot NER,
# combined by the chunk merger.
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
textSplitter,
tokenizer,
alias_parser,
alias_parser2,
embeddings,
ner_model,
ner_converter,
zero_shot_ner,
ner_converter_zeroshot,
chunk_merger
])
roberta_embeddings_legal_roberta_base download started this may take some time. Approximate size to download 447.2 MB [OK!] legner_contract_doc_parties_lg download started this may take some time. [OK!] legner_roberta_zeroshot download started this may take some time. [OK!]
# NOTE(review): StructType/StructField/StringType are not used below;
# import kept as-is to avoid changing module behaviour.
from pyspark.sql.types import StructType,StructField, StringType
# Fit on an empty dataframe (contextual parsers train on their JSON rules),
# then annotate the clause and render the merged NER chunks.
p_model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
lp = nlp.LightPipeline(p_model)
# from sparknlp_display import NerVisualizer
visualiser = nlp.viz.NerVisualizer()
lp_res_1 = lp.fullAnnotate(text)
visualiser.display(lp_res_1[0], label_col='merged_ner_chunks', document_col='document')
from pyspark.sql import functions as F
# Tabulate the merged chunks: one row per chunk with its entity label,
# pulled from the annotation metadata.
df = spark.createDataFrame([[text]]).toDF("text")
result = p_model.transform(df)
result.select(F.explode(F.arrays_zip(result.merged_ner_chunks.result, result.merged_ner_chunks.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |THIS Lease Agreement |DOC | |of May, 2006 |EFFDATE | |Global, Inc |PARTY | |"Landlord" |ALIAS | |IMI Global, Inc |PARTY | |mailing address of ___|ADDRESS | |"Tenant" |ALIAS | +----------------------+---------+