! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, finance, legal
nlp.install(refresh_install=True, visual=True, force_browser = True)
If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
spark = nlp.start()
✍Explanation:
The Legal Subpoenas NER (small) statement refers to a pre-trained named entity recognition (NER) model specifically designed for legal text processing.
legner_subpoena
model is trained specifically to recognize and extract information related to subpoenas in legal documents. A subpoena is a legal document issued by a court that commands an individual or organization to provide specific documents, testimony, or evidence relevant to a legal case. Recognizing and extracting subpoena-related information from large volumes of legal texts can be a time-consuming task, and the legner_subpoena model is designed to automate this process.📚Entities:
ADDRESS
, MATTER_VS
, APPOINTMENT_HOUR
, DOCUMENT_TOPIC
, DOCUMENT_PERSON
, COURT_ADDRESS
, APPOINTMENT_DATE
, COUNTY
, CASE
, SIGNER
, COURT
, DOCUMENT_DATE_TO
, DOCUMENT_TYPE
, STATE
, DOCUMENT_DATE_FROM
, RECEIVER
, MATTER
, SUBPOENA_DATE
, DOCUMENT_DATE_YEAR
document = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
textSplitter = legal.TextSplitter()\
.setInputCols(['document'])\
.setOutputCol('sentence')
token = nlp.Tokenizer()\
.setInputCols(['sentence'])\
.setOutputCol('token')
roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings") \
.setMaxSentenceLength(512)
loaded_ner_model = legal.NerModel.pretrained('legner_subpoena','en','legal/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_span")
ner_prediction_pipeline = nlp.Pipeline(stages = [
document,
textSplitter,
token,
roberta_embeddings,
loaded_ner_model,
converter
])
empty_data = spark.createDataFrame([['']]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(empty_data)
text = """SUBPOENA TO PRODUCE DOCUMENTS, INFORMATION, OR OBJECTS OR TO PERMIT INSPECTION OF PREMISES IN A CIVIL ACTION
UNITED STATES DISTRICT COURT
DISTRICT OF NEW YORK
Plaintiff: Chang Lee
v.
Defendant: Jie Chen
To: Kim Nguyen
789 Elm Street
New York, NY 10003
You are hereby commanded to produce at the time, date, and place set forth below the following documents, electronically stored information, or tangible things:
All financial records, including bank statements, credit card statements, and tax returns for Jie Chen from January 1, 2017 to present;
All emails and other correspondence between Jie Chen and any business partners, associates or employees related to the above financial records from January 1, 2017 to present;
All contracts and agreements entered into by Jie Chen, including any non-disclosure agreements, from January 1, 2017 to present.
The production shall occur at the following time and location:
Date: August 15, 2023
Time: 10:00 a.m.
Location: Law Office of Lee & Associates, 456 Broadway, Suite 800, New York, NY 10003.
You are further commanded to preserve and protect the confidentiality of any documents, electronically stored information, or tangible things produced or inspected, in accordance with the applicable law or agreement.
You are not required to produce or permit inspection of any privileged or protected documents or information.
This subpoena is issued by the court at the request of the Plaintiff's attorney, and you are hereby ordered to comply with this subpoena as provided by the Federal Rules of Civil Procedure.
You must comply with this subpoena under the penalty of law.
Dated: May 4, 2023
[Signature of Clerk of Court]
By: Sarah Johnson
Deputy Clerk"""
sample_data = spark.createDataFrame([[text]]).toDF("text")
result = prediction_model.transform(sample_data)
roberta_embeddings_legal_roberta_base download started this may take some time. Approximate size to download 447.2 MB [OK!] legner_subpoena download started this may take some time. [OK!]
from pyspark.sql import functions as F
result.select(F.explode(F.arrays_zip(result.token.result,
result.ner.result,
result.ner.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ner_label"),
F.expr("cols['2']['confidence']").alias("confidence")).show(200, truncate=100)
+--------------+--------------------+----------+ | token| ner_label|confidence| +--------------+--------------------+----------+ | SUBPOENA| O| 1.0| | TO| O| 1.0| | PRODUCE| O| 0.9991| | DOCUMENTS| B-DOCUMENT_TYPE| 0.9844| | ,| O| 1.0| | INFORMATION| B-DOCUMENT_TYPE| 0.9345| | ,| O| 0.9993| | OR| O| 0.9999| | OBJECTS| O| 0.9624| | OR| O| 1.0| | TO| O| 1.0| | PERMIT| O| 1.0| | INSPECTION| O| 0.9982| | OF| O| 1.0| | PREMISES| O| 0.9966| | IN| O| 0.9999| | A| O| 0.9999| | CIVIL| O| 0.9797| | ACTION| O| 0.9995| | UNITED| O| 0.9977| | STATES| O| 0.985| | DISTRICT| O| 0.5852| | COURT| O| 0.3786| | DISTRICT| O| 0.4604| | OF| O| 0.9222| | NEW| B-STATE| 0.3463| | YORK| I-COUNTY| 0.6732| | Plaintiff| O| 0.7924| | :| O| 0.9997| | Chang| B-MATTER| 0.6147| | Lee| I-MATTER| 0.8425| | v| O| 0.98| | .| O| 0.9977| | Defendant| O| 0.7419| | :| O| 1.0| | Jie| B-RECEIVER| 0.7958| | Chen| I-RECEIVER| 0.5887| | To| O| 0.997| | :| O| 0.9992| | Kim| B-RECEIVER| 0.8381| | Nguyen| I-RECEIVER| 0.7569| | 789| B-ADDRESS| 0.9807| | Elm| I-ADDRESS| 0.9924| | Street| I-ADDRESS| 0.9932| | New| I-ADDRESS| 0.9938| | York| I-ADDRESS| 0.992| | ,| I-ADDRESS| 0.9922| | NY| I-ADDRESS| 0.9936| | 10003| I-ADDRESS| 0.9934| | You| O| 1.0| | are| O| 1.0| | hereby| O| 0.9999| | commanded| O| 0.9998| | to| O| 1.0| | produce| O| 0.9996| | at| O| 1.0| | the| O| 1.0| | time| O| 1.0| | ,| O| 1.0| | date| O| 0.9999| | ,| O| 0.9998| | and| O| 1.0| | place| O| 1.0| | set| O| 1.0| | forth| O| 1.0| | below| O| 1.0| | the| O| 1.0| | following| O| 1.0| | documents| B-DOCUMENT_TYPE| 0.9848| | ,| O| 0.9997| |electronically| B-DOCUMENT_TYPE| 0.9925| | stored| I-DOCUMENT_TYPE| 0.9401| | information| I-DOCUMENT_TYPE| 0.9836| | ,| O| 0.9996| | or| O| 0.9995| | tangible| O| 0.948| | things| O| 0.9717| | :| O| 1.0| | All| O| 0.9995| | financial| B-DOCUMENT_TYPE| 0.9791| | records| I-DOCUMENT_TYPE| 0.9918| | ,| O| 0.9999| | including| O| 1.0| | bank| B-DOCUMENT_TYPE| 0.8418| | statements| I-DOCUMENT_TYPE| 0.963| | ,| O| 0.9993| | credit| B-DOCUMENT_TYPE| 0.7652| | card| I-DOCUMENT_TYPE| 0.4994| | statements| I-DOCUMENT_TYPE| 0.9563| | ,| O| 0.9997| | and| O| 0.9999| | tax| B-DOCUMENT_TYPE| 0.8595| | returns| I-DOCUMENT_TYPE| 0.9063| | for| O| 0.9999| | Jie| B-DOCUMENT_PERSON| 0.9927| | Chen| I-DOCUMENT_PERSON| 0.9865| | from| O| 0.9997| | January|B-DOCUMENT_DATE_FROM| 0.9993| | 1|I-DOCUMENT_DATE_FROM| 0.9997| | ,|I-DOCUMENT_DATE_FROM| 0.9995| | 2017|I-DOCUMENT_DATE_FROM| 0.9981| | to| O| 0.9998| | present| O| 0.9904| | ;| O| 1.0| | All| O| 0.9907| | emails| B-DOCUMENT_TYPE| 0.9979| | and| O| 0.9999| | other| O| 0.9319| |correspondence| B-DOCUMENT_TYPE| 0.9553| | between| O| 0.9998| | Jie| B-DOCUMENT_PERSON| 0.9817| | Chen| I-DOCUMENT_PERSON| 0.9883| | and| O| 0.9998| | any| O| 0.9997| | business| B-DOCUMENT_PERSON| 0.6979| | partners| I-DOCUMENT_PERSON| 0.4181| | ,| O| 1.0| | associates| B-DOCUMENT_PERSON| 0.6085| | or| O| 0.9997| | employees| B-DOCUMENT_PERSON| 0.9321| | related| O| 0.9999| | to| O| 0.9999| | the| O| 0.9998| | above| O| 0.9998| | financial| B-DOCUMENT_TYPE| 0.4994| | records| I-DOCUMENT_TYPE| 0.6143| | from| O| 0.9997| | January|B-DOCUMENT_DATE_FROM| 0.9991| | 1|I-DOCUMENT_DATE_FROM| 0.9998| | ,|I-DOCUMENT_DATE_FROM| 0.9994| | 2017|I-DOCUMENT_DATE_FROM| 0.9959| | to| O| 1.0| | present| O| 0.9958| | ;| O| 1.0| | All| O| 0.9994| | contracts| B-DOCUMENT_TYPE| 0.9421| | and| O| 0.9998| | agreements| B-DOCUMENT_TYPE| 0.9462| | entered| O| 0.9382| | into| O| 0.9981| | by| O| 0.9998| | Jie| B-DOCUMENT_PERSON| 0.9464| | Chen| I-DOCUMENT_PERSON| 0.9799| | ,| O| 0.9996| | including| O| 1.0| | any| O| 1.0| |non-disclosure| O| 0.931| | agreements| B-DOCUMENT_TYPE| 0.3859| | ,| O| 0.9999| | from| O| 0.9998| | January|B-DOCUMENT_DATE_FROM| 0.9992| | 1|I-DOCUMENT_DATE_FROM| 0.9998| | ,|I-DOCUMENT_DATE_FROM| 0.999| | 2017|I-DOCUMENT_DATE_FROM| 0.9978| | to| O| 1.0| | present| O| 0.9993| | .| O| 0.9998| | The| O| 0.9996| | production| O| 0.9709| | shall| O| 1.0| | occur| O| 1.0| | at| O| 1.0| | the| O| 1.0| | following| O| 1.0| | time| O| 1.0| | and| O| 1.0| | location| O| 1.0| | :| O| 1.0| | Date| O| 1.0| | :| O| 1.0| | August| B-APPOINTMENT_DATE| 0.8871| | 15| I-APPOINTMENT_DATE| 0.856| | ,| I-APPOINTMENT_DATE| 0.9067| | 2023| I-APPOINTMENT_DATE| 0.9204| | Time| O| 1.0| | :| O| 1.0| | 10:00| B-APPOINTMENT_HOUR| 0.9982| | a.m| I-APPOINTMENT_HOUR| 0.9995| | .| O| 0.9653| | Location| O| 0.9998| | :| O| 1.0| | Law| O| 0.9499| | Office| O| 0.9776| | of| O| 0.9892| | Lee| B-DOCUMENT_PERSON| 0.8784| | &| I-DOCUMENT_PERSON| 0.9816| | Associates| I-DOCUMENT_PERSON| 0.9753| | ,| O| 0.9478| | 456| B-COURT_ADDRESS| 0.5613| | Broadway| I-COURT_ADDRESS| 0.7624| | ,| I-COURT_ADDRESS| 0.8556| | Suite| I-COURT_ADDRESS| 0.9617| | 800| I-COURT_ADDRESS| 0.9469| | ,| I-COURT_ADDRESS| 0.907| | New| I-COURT_ADDRESS| 0.8847| | York| I-COURT_ADDRESS| 0.8566| | ,| I-COURT_ADDRESS| 0.7641| | NY| I-COURT_ADDRESS| 0.7735| | 10003| I-COURT_ADDRESS| 0.7114| | .| O| 0.8257| +--------------+--------------------+----------+ only showing top 200 rows
result.select(F.explode(F.arrays_zip(result.ner_span.result, result.ner_span.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)
+---------------------------------+------------------+----------+ |chunk |ner_label |confidence| +---------------------------------+------------------+----------+ |DOCUMENTS |DOCUMENT_TYPE |0.9844 | |INFORMATION |DOCUMENT_TYPE |0.9345 | |NEW YORK |STATE |0.50975 | |Chang Lee |MATTER |0.7286 | |Jie Chen |RECEIVER |0.69225 | |Kim Nguyen |RECEIVER |0.7975 | |789 Elm Street New York, NY 10003|ADDRESS |0.99141246| |documents |DOCUMENT_TYPE |0.9848 | |electronically stored information|DOCUMENT_TYPE |0.9720667 | |financial records |DOCUMENT_TYPE |0.98545 | |bank statements |DOCUMENT_TYPE |0.9024 | |credit card statements |DOCUMENT_TYPE |0.7403 | |tax returns |DOCUMENT_TYPE |0.8829 | |Jie Chen |DOCUMENT_PERSON |0.9896 | |January 1, 2017 |DOCUMENT_DATE_FROM|0.99915004| |emails |DOCUMENT_TYPE |0.9979 | |correspondence |DOCUMENT_TYPE |0.9553 | |Jie Chen |DOCUMENT_PERSON |0.985 | |business partners |DOCUMENT_PERSON |0.55799997| |associates |DOCUMENT_PERSON |0.6085 | +---------------------------------+------------------+----------+ only showing top 20 rows
import pandas as pd
light_model = nlp.LightPipeline(prediction_model)
light_result = light_model.fullAnnotate(text)
chunks = []
entities = []
sentence= []
begin = []
end = []
for n in light_result[0]['ner_span']:
begin.append(n.begin)
end.append(n.end)
chunks.append(n.result)
entities.append(n.metadata['entity'])
sentence.append(n.metadata['sentence'])
df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end,
'sentence_id':sentence, 'entities':entities})
df.head(20)
chunks | begin | end | sentence_id | entities | |
---|---|---|---|---|---|
0 | DOCUMENTS | 20 | 28 | 0 | DOCUMENT_TYPE |
1 | INFORMATION | 31 | 41 | 0 | DOCUMENT_TYPE |
2 | NEW YORK | 151 | 158 | 0 | STATE |
3 | Chang Lee | 172 | 180 | 0 | MATTER |
4 | Jie Chen | 196 | 203 | 0 | RECEIVER |
5 | Kim Nguyen | 210 | 219 | 0 | RECEIVER |
6 | 789 Elm Street\nNew York, NY 10003 | 221 | 253 | 0 | ADDRESS |
7 | documents | 351 | 359 | 0 | DOCUMENT_TYPE |
8 | electronically stored information | 362 | 394 | 0 | DOCUMENT_TYPE |
9 | financial records | 422 | 438 | 0 | DOCUMENT_TYPE |
10 | bank statements | 451 | 465 | 0 | DOCUMENT_TYPE |
11 | credit card statements | 468 | 489 | 0 | DOCUMENT_TYPE |
12 | tax returns | 496 | 506 | 0 | DOCUMENT_TYPE |
13 | Jie Chen | 512 | 519 | 0 | DOCUMENT_PERSON |
14 | January 1, 2017 | 526 | 540 | 0 | DOCUMENT_DATE_FROM |
15 | emails | 558 | 563 | 0 | DOCUMENT_TYPE |
16 | correspondence | 575 | 588 | 0 | DOCUMENT_TYPE |
17 | Jie Chen | 598 | 605 | 0 | DOCUMENT_PERSON |
18 | business partners | 615 | 631 | 0 | DOCUMENT_PERSON |
19 | associates | 634 | 643 | 0 | DOCUMENT_PERSON |
For saving the visualization result as html, provide save_path parameter in the display function.
# from sparknlp_display import NerVisualizer
visualiser = nlp.viz.NerVisualizer()
visualiser.display(light_result[0], label_col='ner_span', document_col='document')