Notebook

🎬 Installation¶

In [ ]:

! pip install -q johnsnowlabs

🔗 Automatic Installation¶

Using my.johnsnowlabs.com SSO

In [ ]:

from johnsnowlabs import nlp, finance, legal

nlp.install(refresh_install=True, visual=True, force_browser = True)

🔗 Manual downloading¶

If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.

Go to my.johnsnowlabs.com
Download your license
Upload it using the following command

In [ ]:

from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Install it

In [ ]:

nlp.install()

📌 Starting¶

In [ ]:

spark = nlp.start()

🔎 Legal Subpoenas NER (small)¶

✍Explanation:

The Legal Subpoenas NER (small) statement refers to a pre-trained named entity recognition (NER) model specifically designed for legal text processing.

The legner_subpoena model is trained specifically to recognize and extract information related to subpoenas in legal documents. A subpoena is a legal document issued by a court that commands an individual or organization to provide specific documents, testimony, or evidence relevant to a legal case. Recognizing and extracting subpoena-related information from large volumes of legal texts can be a time-consuming task, and the legner_subpoena model is designed to automate this process.

📚Entities:

ADDRESS, MATTER_VS, APPOINTMENT_HOUR, DOCUMENT_TOPIC, DOCUMENT_PERSON, COURT_ADDRESS, APPOINTMENT_DATE, COUNTY, CASE, SIGNER, COURT, DOCUMENT_DATE_TO, DOCUMENT_TYPE, STATE, DOCUMENT_DATE_FROM, RECEIVER, MATTER, SUBPOENA_DATE, DOCUMENT_DATE_YEAR

In [5]:

document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

textSplitter = legal.TextSplitter()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = nlp.Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)
  
loaded_ner_model = legal.NerModel.pretrained('legner_subpoena','en','legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = nlp.Pipeline(stages = [
                                            document,
                                            textSplitter,
                                            token,
                                            roberta_embeddings,
                                            loaded_ner_model,
                                            converter
                                            ])

empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)

text = """SUBPOENA TO PRODUCE DOCUMENTS, INFORMATION, OR OBJECTS OR TO PERMIT INSPECTION OF PREMISES IN A CIVIL ACTION

UNITED STATES DISTRICT COURT
DISTRICT OF NEW YORK

Plaintiff: Chang Lee
v.
Defendant: Jie Chen

To: Kim Nguyen
789 Elm Street
New York, NY 10003

You are hereby commanded to produce at the time, date, and place set forth below the following documents, electronically stored information, or tangible things:

All financial records, including bank statements, credit card statements, and tax returns for Jie Chen from January 1, 2017 to present;
All emails and other correspondence between Jie Chen and any business partners, associates or employees related to the above financial records from January 1, 2017 to present;
All contracts and agreements entered into by Jie Chen, including any non-disclosure agreements, from January 1, 2017 to present.
The production shall occur at the following time and location:

Date: August 15, 2023
Time: 10:00 a.m.
Location: Law Office of Lee & Associates, 456 Broadway, Suite 800, New York, NY 10003.

You are further commanded to preserve and protect the confidentiality of any documents, electronically stored information, or tangible things produced or inspected, in accordance with the applicable law or agreement.

You are not required to produce or permit inspection of any privileged or protected documents or information.

This subpoena is issued by the court at the request of the Plaintiff's attorney, and you are hereby ordered to comply with this subpoena as provided by the Federal Rules of Civil Procedure.

You must comply with this subpoena under the penalty of law.

Dated: May 4, 2023

[Signature of Clerk of Court]
By: Sarah Johnson
Deputy Clerk"""

sample_data = spark.createDataFrame([[text]]).toDF("text")

result = prediction_model.transform(sample_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_subpoena download started this may take some time.
[OK!]

In [6]:

from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.token.result, 
                                     result.ner.result, 
                                     result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence")).show(200, truncate=100)

+--------------+--------------------+----------+
|         token|           ner_label|confidence|
+--------------+--------------------+----------+
|      SUBPOENA|                   O|       1.0|
|            TO|                   O|       1.0|
|       PRODUCE|                   O|    0.9991|
|     DOCUMENTS|     B-DOCUMENT_TYPE|    0.9844|
|             ,|                   O|       1.0|
|   INFORMATION|     B-DOCUMENT_TYPE|    0.9345|
|             ,|                   O|    0.9993|
|            OR|                   O|    0.9999|
|       OBJECTS|                   O|    0.9624|
|            OR|                   O|       1.0|
|            TO|                   O|       1.0|
|        PERMIT|                   O|       1.0|
|    INSPECTION|                   O|    0.9982|
|            OF|                   O|       1.0|
|      PREMISES|                   O|    0.9966|
|            IN|                   O|    0.9999|
|             A|                   O|    0.9999|
|         CIVIL|                   O|    0.9797|
|        ACTION|                   O|    0.9995|
|        UNITED|                   O|    0.9977|
|        STATES|                   O|     0.985|
|      DISTRICT|                   O|    0.5852|
|         COURT|                   O|    0.3786|
|      DISTRICT|                   O|    0.4604|
|            OF|                   O|    0.9222|
|           NEW|             B-STATE|    0.3463|
|          YORK|            I-COUNTY|    0.6732|
|     Plaintiff|                   O|    0.7924|
|             :|                   O|    0.9997|
|         Chang|            B-MATTER|    0.6147|
|           Lee|            I-MATTER|    0.8425|
|             v|                   O|      0.98|
|             .|                   O|    0.9977|
|     Defendant|                   O|    0.7419|
|             :|                   O|       1.0|
|           Jie|          B-RECEIVER|    0.7958|
|          Chen|          I-RECEIVER|    0.5887|
|            To|                   O|     0.997|
|             :|                   O|    0.9992|
|           Kim|          B-RECEIVER|    0.8381|
|        Nguyen|          I-RECEIVER|    0.7569|
|           789|           B-ADDRESS|    0.9807|
|           Elm|           I-ADDRESS|    0.9924|
|        Street|           I-ADDRESS|    0.9932|
|           New|           I-ADDRESS|    0.9938|
|          York|           I-ADDRESS|     0.992|
|             ,|           I-ADDRESS|    0.9922|
|            NY|           I-ADDRESS|    0.9936|
|         10003|           I-ADDRESS|    0.9934|
|           You|                   O|       1.0|
|           are|                   O|       1.0|
|        hereby|                   O|    0.9999|
|     commanded|                   O|    0.9998|
|            to|                   O|       1.0|
|       produce|                   O|    0.9996|
|            at|                   O|       1.0|
|           the|                   O|       1.0|
|          time|                   O|       1.0|
|             ,|                   O|       1.0|
|          date|                   O|    0.9999|
|             ,|                   O|    0.9998|
|           and|                   O|       1.0|
|         place|                   O|       1.0|
|           set|                   O|       1.0|
|         forth|                   O|       1.0|
|         below|                   O|       1.0|
|           the|                   O|       1.0|
|     following|                   O|       1.0|
|     documents|     B-DOCUMENT_TYPE|    0.9848|
|             ,|                   O|    0.9997|
|electronically|     B-DOCUMENT_TYPE|    0.9925|
|        stored|     I-DOCUMENT_TYPE|    0.9401|
|   information|     I-DOCUMENT_TYPE|    0.9836|
|             ,|                   O|    0.9996|
|            or|                   O|    0.9995|
|      tangible|                   O|     0.948|
|        things|                   O|    0.9717|
|             :|                   O|       1.0|
|           All|                   O|    0.9995|
|     financial|     B-DOCUMENT_TYPE|    0.9791|
|       records|     I-DOCUMENT_TYPE|    0.9918|
|             ,|                   O|    0.9999|
|     including|                   O|       1.0|
|          bank|     B-DOCUMENT_TYPE|    0.8418|
|    statements|     I-DOCUMENT_TYPE|     0.963|
|             ,|                   O|    0.9993|
|        credit|     B-DOCUMENT_TYPE|    0.7652|
|          card|     I-DOCUMENT_TYPE|    0.4994|
|    statements|     I-DOCUMENT_TYPE|    0.9563|
|             ,|                   O|    0.9997|
|           and|                   O|    0.9999|
|           tax|     B-DOCUMENT_TYPE|    0.8595|
|       returns|     I-DOCUMENT_TYPE|    0.9063|
|           for|                   O|    0.9999|
|           Jie|   B-DOCUMENT_PERSON|    0.9927|
|          Chen|   I-DOCUMENT_PERSON|    0.9865|
|          from|                   O|    0.9997|
|       January|B-DOCUMENT_DATE_FROM|    0.9993|
|             1|I-DOCUMENT_DATE_FROM|    0.9997|
|             ,|I-DOCUMENT_DATE_FROM|    0.9995|
|          2017|I-DOCUMENT_DATE_FROM|    0.9981|
|            to|                   O|    0.9998|
|       present|                   O|    0.9904|
|             ;|                   O|       1.0|
|           All|                   O|    0.9907|
|        emails|     B-DOCUMENT_TYPE|    0.9979|
|           and|                   O|    0.9999|
|         other|                   O|    0.9319|
|correspondence|     B-DOCUMENT_TYPE|    0.9553|
|       between|                   O|    0.9998|
|           Jie|   B-DOCUMENT_PERSON|    0.9817|
|          Chen|   I-DOCUMENT_PERSON|    0.9883|
|           and|                   O|    0.9998|
|           any|                   O|    0.9997|
|      business|   B-DOCUMENT_PERSON|    0.6979|
|      partners|   I-DOCUMENT_PERSON|    0.4181|
|             ,|                   O|       1.0|
|    associates|   B-DOCUMENT_PERSON|    0.6085|
|            or|                   O|    0.9997|
|     employees|   B-DOCUMENT_PERSON|    0.9321|
|       related|                   O|    0.9999|
|            to|                   O|    0.9999|
|           the|                   O|    0.9998|
|         above|                   O|    0.9998|
|     financial|     B-DOCUMENT_TYPE|    0.4994|
|       records|     I-DOCUMENT_TYPE|    0.6143|
|          from|                   O|    0.9997|
|       January|B-DOCUMENT_DATE_FROM|    0.9991|
|             1|I-DOCUMENT_DATE_FROM|    0.9998|
|             ,|I-DOCUMENT_DATE_FROM|    0.9994|
|          2017|I-DOCUMENT_DATE_FROM|    0.9959|
|            to|                   O|       1.0|
|       present|                   O|    0.9958|
|             ;|                   O|       1.0|
|           All|                   O|    0.9994|
|     contracts|     B-DOCUMENT_TYPE|    0.9421|
|           and|                   O|    0.9998|
|    agreements|     B-DOCUMENT_TYPE|    0.9462|
|       entered|                   O|    0.9382|
|          into|                   O|    0.9981|
|            by|                   O|    0.9998|
|           Jie|   B-DOCUMENT_PERSON|    0.9464|
|          Chen|   I-DOCUMENT_PERSON|    0.9799|
|             ,|                   O|    0.9996|
|     including|                   O|       1.0|
|           any|                   O|       1.0|
|non-disclosure|                   O|     0.931|
|    agreements|     B-DOCUMENT_TYPE|    0.3859|
|             ,|                   O|    0.9999|
|          from|                   O|    0.9998|
|       January|B-DOCUMENT_DATE_FROM|    0.9992|
|             1|I-DOCUMENT_DATE_FROM|    0.9998|
|             ,|I-DOCUMENT_DATE_FROM|     0.999|
|          2017|I-DOCUMENT_DATE_FROM|    0.9978|
|            to|                   O|       1.0|
|       present|                   O|    0.9993|
|             .|                   O|    0.9998|
|           The|                   O|    0.9996|
|    production|                   O|    0.9709|
|         shall|                   O|       1.0|
|         occur|                   O|       1.0|
|            at|                   O|       1.0|
|           the|                   O|       1.0|
|     following|                   O|       1.0|
|          time|                   O|       1.0|
|           and|                   O|       1.0|
|      location|                   O|       1.0|
|             :|                   O|       1.0|
|          Date|                   O|       1.0|
|             :|                   O|       1.0|
|        August|  B-APPOINTMENT_DATE|    0.8871|
|            15|  I-APPOINTMENT_DATE|     0.856|
|             ,|  I-APPOINTMENT_DATE|    0.9067|
|          2023|  I-APPOINTMENT_DATE|    0.9204|
|          Time|                   O|       1.0|
|             :|                   O|       1.0|
|         10:00|  B-APPOINTMENT_HOUR|    0.9982|
|           a.m|  I-APPOINTMENT_HOUR|    0.9995|
|             .|                   O|    0.9653|
|      Location|                   O|    0.9998|
|             :|                   O|       1.0|
|           Law|                   O|    0.9499|
|        Office|                   O|    0.9776|
|            of|                   O|    0.9892|
|           Lee|   B-DOCUMENT_PERSON|    0.8784|
|             &|   I-DOCUMENT_PERSON|    0.9816|
|    Associates|   I-DOCUMENT_PERSON|    0.9753|
|             ,|                   O|    0.9478|
|           456|     B-COURT_ADDRESS|    0.5613|
|      Broadway|     I-COURT_ADDRESS|    0.7624|
|             ,|     I-COURT_ADDRESS|    0.8556|
|         Suite|     I-COURT_ADDRESS|    0.9617|
|           800|     I-COURT_ADDRESS|    0.9469|
|             ,|     I-COURT_ADDRESS|     0.907|
|           New|     I-COURT_ADDRESS|    0.8847|
|          York|     I-COURT_ADDRESS|    0.8566|
|             ,|     I-COURT_ADDRESS|    0.7641|
|            NY|     I-COURT_ADDRESS|    0.7735|
|         10003|     I-COURT_ADDRESS|    0.7114|
|             .|                   O|    0.8257|
+--------------+--------------------+----------+
only showing top 200 rows

In [7]:

result.select(F.explode(F.arrays_zip(result.ner_span.result, result.ner_span.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)

+---------------------------------+------------------+----------+
|chunk                            |ner_label         |confidence|
+---------------------------------+------------------+----------+
|DOCUMENTS                        |DOCUMENT_TYPE     |0.9844    |
|INFORMATION                      |DOCUMENT_TYPE     |0.9345    |
|NEW YORK                         |STATE             |0.50975   |
|Chang Lee                        |MATTER            |0.7286    |
|Jie Chen                         |RECEIVER          |0.69225   |
|Kim Nguyen                       |RECEIVER          |0.7975    |
|789 Elm Street
New York, NY 10003|ADDRESS           |0.99141246|
|documents                        |DOCUMENT_TYPE     |0.9848    |
|electronically stored information|DOCUMENT_TYPE     |0.9720667 |
|financial records                |DOCUMENT_TYPE     |0.98545   |
|bank statements                  |DOCUMENT_TYPE     |0.9024    |
|credit card statements           |DOCUMENT_TYPE     |0.7403    |
|tax returns                      |DOCUMENT_TYPE     |0.8829    |
|Jie Chen                         |DOCUMENT_PERSON   |0.9896    |
|January 1, 2017                  |DOCUMENT_DATE_FROM|0.99915004|
|emails                           |DOCUMENT_TYPE     |0.9979    |
|correspondence                   |DOCUMENT_TYPE     |0.9553    |
|Jie Chen                         |DOCUMENT_PERSON   |0.985     |
|business partners                |DOCUMENT_PERSON   |0.55799997|
|associates                       |DOCUMENT_PERSON   |0.6085    |
+---------------------------------+------------------+----------+
only showing top 20 rows

🖨️ Getting Result with LightPipeline¶

In [8]:

import pandas as pd

light_model = nlp.LightPipeline(prediction_model)

light_result = light_model.fullAnnotate(text)


chunks = []
entities = []
sentence= []
begin = []
end = []

for n in light_result[0]['ner_span']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    

df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities})

df.head(20)

Out[8]:

	chunks	begin	end	entities
0	DOCUMENTS	20	28	DOCUMENT_TYPE
1	INFORMATION	31	41	DOCUMENT_TYPE
2	NEW YORK	151	158	STATE
3	Chang Lee	172	180	MATTER
4	Jie Chen	196	203	RECEIVER
5	Kim Nguyen	210	219	RECEIVER
6	789 Elm Street\nNew York, NY 10003	221	253	ADDRESS
7	documents	351	359	DOCUMENT_TYPE
8	electronically stored information	362	394	DOCUMENT_TYPE
9	financial records	422	438	DOCUMENT_TYPE
10	bank statements	451	465	DOCUMENT_TYPE
11	credit card statements	468	489	DOCUMENT_TYPE
12	tax returns	496	506	DOCUMENT_TYPE
13	Jie Chen	512	519	DOCUMENT_PERSON
14	January 1, 2017	526	540	DOCUMENT_DATE_FROM
15	emails	558	563	DOCUMENT_TYPE
16	correspondence	575	588	DOCUMENT_TYPE
17	Jie Chen	598	605	DOCUMENT_PERSON
18	business partners	615	631	DOCUMENT_PERSON
19	associates	634	643	DOCUMENT_PERSON

📌 NER Visualizer¶

For saving the visualization result as html, provide save_path parameter in the display function.

In [9]:

# from sparknlp_display import NerVisualizer

visualiser = nlp.viz.NerVisualizer()

visualiser.display(light_result[0], label_col='ner_span', document_col='document')

SUBPOENA TO PRODUCE DOCUMENTS DOCUMENT_TYPE, INFORMATION DOCUMENT_TYPE, OR OBJECTS OR TO PERMIT INSPECTION OF PREMISES IN A CIVIL ACTION

UNITED STATES DISTRICT COURT
DISTRICT OF NEW YORK STATE

Plaintiff: Chang Lee MATTER
v.
Defendant: Jie Chen RECEIVER

To: Kim Nguyen RECEIVER
789 Elm Street
New York, NY 10003 ADDRESS

You are hereby commanded to produce at the time, date, and place set forth below the following documents DOCUMENT_TYPE, electronically stored information DOCUMENT_TYPE, or tangible things:

All financial records DOCUMENT_TYPE, including bank statements DOCUMENT_TYPE, credit card statements DOCUMENT_TYPE, and tax returns DOCUMENT_TYPE for Jie Chen DOCUMENT_PERSON from January 1, 2017 DOCUMENT_DATE_FROM to present;
All emails DOCUMENT_TYPE and other correspondence DOCUMENT_TYPE between Jie Chen DOCUMENT_PERSON and any business partners DOCUMENT_PERSON, associates DOCUMENT_PERSON or employees DOCUMENT_PERSON related to the above financial records DOCUMENT_TYPE from January 1, 2017 DOCUMENT_DATE_FROM to present;
All contracts DOCUMENT_TYPE and agreements DOCUMENT_TYPE entered into by Jie Chen DOCUMENT_PERSON, including any non-disclosure agreements DOCUMENT_TYPE, from January 1, 2017 DOCUMENT_DATE_FROM to present.
The production shall occur at the following time and location:

Date: August 15, 2023 APPOINTMENT_DATE
Time: 10:00 a.m APPOINTMENT_HOUR.
Location: Law Office of Lee & Associates DOCUMENT_PERSON, 456 Broadway, Suite 800, New York, NY 10003 COURT_ADDRESS.

You are further commanded to preserve and protect the confidentiality of any documents DOCUMENT_TYPE, electronically stored information DOCUMENT_TYPE, or tangible things produced or inspected, in accordance with the applicable law or agreement DOCUMENT_TYPE.

You are not required to produce or permit inspection of any privileged or protected documents DOCUMENT_TYPE or information DOCUMENT_TYPE.

This subpoena is issued by the court at the request of the Plaintiff's DOCUMENT_PERSON attorney DOCUMENT_PERSON, and you are hereby ordered to comply with this subpoena as provided by the Federal Rules of Civil Procedure.

You must comply with this subpoena under the penalty of law.

Dated: May 4, 2023 SUBPOENA_DATE

[Signature of Clerk of Court]
By: Sarah Johnson SIGNER
Deputy Clerk