! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, finance
# nlp.install(force_browser=True)
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
spark = nlp.start()
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.3.0, 💊Spark-Healthcare==4.3.0, running on ⚡ PySpark==3.1.2
This pretrained pipeline de-identifies financial information in text. Its output includes both masked and obfuscated versions of the input. It covers the following entities: DOC, EFFDATE, PARTY, ALIAS, SIGNING_PERSON, SIGNING_TITLE, COUNTRY, CITY, STATE, STREET, ZIP, EMAIL, FAX, LOCATION-OTHER, DATE, and PHONE.
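To make the difference between masking and obfuscation concrete, here is a minimal pure-Python sketch. It is not the pipeline's implementation: the `deidentify` helper, its mode names, and the hand-written chunk offsets are assumptions made for illustration, chosen to echo the output columns shown later in this notebook.

```python
def deidentify(text, chunks, mode="entity_labels", fake_values=None):
    """Replace each (start, end, label) span of `text` according to `mode`.

    Hypothetical helper for illustration only; offsets are character
    positions with an exclusive end.
    """
    fake_values = fake_values or {}
    pieces = []
    cursor = 0
    for start, end, label in sorted(chunks):
        original = text[start:end]
        if mode == "entity_labels":
            replacement = f"<{label}>"
        elif mode == "same_length_chars":
            # Brackets plus asterisks, padded to the original length.
            replacement = "[" + "*" * max(len(original) - 2, 0) + "]"
        elif mode == "fixed_length_chars":
            replacement = "****"
        elif mode == "obfuscate":
            # Swap in a realistic-looking surrogate when one is supplied.
            replacement = fake_values.get(label, f"<{label}>")
        else:
            raise ValueError(f"unknown mode: {mode}")
        pieces.append(text[cursor:start])
        pieces.append(replacement)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Signed By : Sherly Johnson"
start = sample.index("Sherly Johnson")
chunks = [(start, start + len("Sherly Johnson"), "PERSON")]

for mode in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(mode, "->", deidentify(sample, chunks, mode))
print("obfuscate ->", deidentify(sample, chunks, "obfuscate", {"PERSON": "Ashley Patrick"}))
```

Masking keeps the position of the entity visible while hiding its value, whereas obfuscation substitutes a surrogate so the text still reads naturally.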
deid_pipeline = nlp.PretrainedPipeline("finpipe_deid", "en", "finance/models")
finpipe_deid download started this may take some time. Approx size to download 437.3 MB [OK!]
deid_pipeline.model.stages
[DocumentAssembler_20aaea0b09c9, SentenceDetector_f836f3c49dd7, REGEX_TOKENIZER_3d88a1dee1d9, BERT_EMBEDDINGS_29ce72cd673e, FinanceNerModel_1e04a0ea86dc, NER_CONVERTER_053dc2c885dc, FinanceNerModel_99ecfbac41c1, NER_CONVERTER_c31e7133c116, FinanceNerModel_fae1a65403a6, NER_CONVERTER_e54c4e5afd15, CONTEXTUAL-PARSER_72fff5ea72a3, CONTEXTUAL-PARSER_247b3d47153a, CONTEXTUAL-PARSER_8804c3848e07, CONTEXTUAL-PARSER_138e93ac7638, CONTEXTUAL-PARSER_222a1bc3dc39, MERGE_72dccb34a947, DE-IDENTIFICATION_95319986720c, DE-IDENTIFICATION_e98c1ba6424c, DE-IDENTIFICATION_b423b4e6a14e, DE-IDENTIFICATION_d6ea024c8838]
text= """ REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF
Commvault Systems, Inc.
(Exact name of registrant as specified in its charter)
Signed By : Sherly Johnson
(Address of principal executive offices, including zip code)
(732) 870-4000
(telephone number, including area code)
Name of each exchange on which registered
CVLT
The NASDAQ Stock Market
"""
deid_res= deid_pipeline.annotate(text)
deid_res.keys()
dict_keys(['obfuscated', 'ner_10k_chunk', 'email', 'document', 'ner_signers_chunk', 'deidentified', 'alias', 'chiefs', 'masked_fixed_length_chars', 'token', 'ner_signers', 'ner_generic_chunk', 'embeddings', 'merged_ner_chunks', 'ner_10k', 'sentence', 'phone', 'orgs', 'masked_with_chars', 'ner_generic'])
import pandas as pd
pd.set_option("display.max_colwidth", 100)
df = pd.DataFrame(
    list(zip(deid_res["sentence"],
             deid_res["deidentified"],
             deid_res["masked_with_chars"],
             deid_res["masked_fixed_length_chars"],
             deid_res["obfuscated"])),
    columns=["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])
df
| | Sentence | Masked | Masked with Chars | Masked with Fixed Chars | Obfuscated |
---|---|---|---|---|---|
0 | REPORT PURSUANT TO SECTION 13 OR 15 | REPORT PURSUANT TO SECTION 13 OR 15 | REPORT PURSUANT TO SECTION 13 OR 15 | REPORT PURSUANT TO SECTION 13 OR 15 | REPORT PURSUANT TO SECTION 13 OR 15 |
1 | (d) OF THE SECURITIES EXCHANGE ACT OF\nCommvault Systems, Inc. | (d) OF <ORG>. | (d) OF [***************************************************]. | (d) OF ****. | (d) OF Gillespie Inc. |
2 | (Exact name of registrant as specified in its charter) \nSigned By : Sherly Johnson\n(Address of... | (Exact name of registrant as specified in its charter) \nSigned By : <PERSON>\n(Address of princ... | (Exact name of registrant as specified in its charter) \nSigned By : [************]\n(Address of... | (Exact name of registrant as specified in its charter) \nSigned By : ****\n(Address of principal... | (Exact name of registrant as specified in its charter) \nSigned By : Ashley Patrick\n(Address of... |
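The table is built with `zip`, which silently stops at the shortest input, so a key with fewer entries than the others would drop rows without warning. The following is a small sketch of a length check that could be run before building the DataFrame. The helper names `column_lengths` and `check_aligned` are hypothetical, and `toy_result` is stand-in data, not real pipeline output.

```python
def column_lengths(result, keys):
    """Return {key: number of entries} for the selected keys of a dict of lists."""
    return {key: len(result[key]) for key in keys}


def check_aligned(result, keys):
    """Raise ValueError if the selected keys do not all have the same length."""
    lengths = column_lengths(result, keys)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return lengths


# Stand-in data shaped like a dict of string lists (an assumption for this sketch).
toy_result = {
    "sentence": ["Signed By : Sherly Johnson", "CVLT"],
    "deidentified": ["Signed By : <PERSON>", "CVLT"],
    "obfuscated": ["Signed By : Ashley Patrick"],  # deliberately one entry short
}

print(check_aligned(toy_result, ["sentence", "deidentified"]))
```

With the real result, the same check could be applied as `check_aligned(deid_res, ["sentence", "deidentified", "masked_with_chars", "masked_fixed_length_chars", "obfuscated"])`.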