! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, legal
# nlp.install(force_browser=True)
If you are not registered in my.johnsnowlabs.com, if you received a license via e-mail, or if you are using Safari, you may need to upload the license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
spark = nlp.start()
Here, we will train a legal resolver model with a sample dataset. Specifically, we will train a company name normalization model. All columns of our dataset have to be of object (string) type.
Let's start to train.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_company_name.csv
import pandas as pd
df = pd.read_csv('sample_company_name.csv')
df
| | company_name | irs_number | comp_abbreviation_var |
|---|---|---|---|
0 | StepOne Personal Health, Inc. | 900785095 | StepOne Personal Health |
1 | StepOne Personal Health, Inc. | 900785095 | StepOne Personal Health Inc |
2 | StepOne Personal Health, Inc. | 900785095 | STEPONE PERSONAL HEALTH INC |
3 | StepOne Personal Health, Inc. | 900785095 | StepOne Personal Health inc |
4 | StepOne Personal Health, Inc. | 900785095 | StepOne Personal Health INC |
... | ... | ... | ... |
9995 | INGLES MARKETS INC | 560846267 | Ingles Markets Inc |
9996 | INGLES MARKETS INC | 560846267 | INGLES MARKETS Inc. |
9997 | INGLES MARKETS INC | 560846267 | INGLES MARKETS inc. |
9998 | INGLES MARKETS INC | 560846267 | INGLES MARKETS INC |
9999 | INGLES MARKETS INC | 560846267 | INGLES MARKETS |
10000 rows × 3 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   company_name           10000 non-null  object
 1   irs_number             10000 non-null  int64 
 2   comp_abbreviation_var  10000 non-null  object
dtypes: int64(1), object(2)
memory usage: 234.5+ KB
df['comp_abbreviation_var'] = df['comp_abbreviation_var'].astype(str)
df['irs_number'] = df['irs_number'].astype(str)
df['company_name'] = df['company_name'].astype(str)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   company_name           10000 non-null  object
 1   irs_number             10000 non-null  object
 2   comp_abbreviation_var  10000 non-null  object
dtypes: object(3)
memory usage: 234.5+ KB
df.shape
(10000, 3)
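As a quick sanity check, the string-casting step above can be illustrated on a small synthetic frame (the values below are made up; only the column layout mirrors the real CSV):

```python
import pandas as pd

# Tiny made-up frame with the same three columns as sample_company_name.csv
toy = pd.DataFrame({
    'company_name': ['StepOne Personal Health, Inc.', 'INGLES MARKETS INC'],
    'irs_number': [900785095, 560846267],  # int64 before the cast
    'comp_abbreviation_var': ['StepOne Personal Health', 'Ingles Markets Inc'],
})

# Cast every column to str so Spark infers StringType for all of them
for col in toy.columns:
    toy[col] = toy[col].astype(str)

print(toy.dtypes)                 # all three columns are now object
print(toy['irs_number'].iloc[0])  # '900785095', a string
```

Casting everything to string before `spark.createDataFrame(df)` avoids mixed-type schema inference on the Spark side.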
Now we will get the sentence embeddings of the comp_abbreviation_var column.
data = spark.createDataFrame(df)
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("comp_abbreviation_var")\
.setOutputCol("sentence")
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
training_pipeline = nlp.Pipeline(stages = [
documentAssembler,
embeddings])
training_model = training_pipeline.fit(data)
final_data = training_model.transform(data)
tfhub_use download started this may take some time. Approximate size to download 923.7 MB [OK!]
final_data.show()
+--------------------+----------+---------------------+--------------------+--------------------+
|        company_name|irs_number|comp_abbreviation_var|            sentence| sentence_embeddings|
+--------------------+----------+---------------------+--------------------+--------------------+
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 22...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| STEPONE PERSONAL ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| Stepone Personal ...|[{document, 0, 26...|[{sentence_embedd...|
|StepOne Personal ...| 900785095| StepOne Personal ...|[{document, 0, 27...|[{sentence_embedd...|
|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|
|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|
|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 24...|[{sentence_embedd...|
|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|
|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 20...|[{sentence_embedd...|
|Equity One Net In...| 320467879| EQUITY ONE NET IN...|[{document, 0, 24...|[{sentence_embedd...|
|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|
|Equity One Net In...| 320467879| Equity One Net In...|[{document, 0, 25...|[{sentence_embedd...|
|AmeriCredit Autom...| 880475154| AmeriCredit Autom...|[{document, 0, 39...|[{sentence_embedd...|
|GROUNDFLOOR FINAN...| 463414189| GROUNDFLOOR FINAN...|[{document, 0, 23...|[{sentence_embedd...|
|GROUNDFLOOR FINAN...| 463414189| GROUNDFLOOR FINAN...|[{document, 0, 23...|[{sentence_embedd...|
+--------------------+----------+---------------------+--------------------+--------------------+
only showing top 20 rows
We now have a sentence_embeddings column in our training dataframe that we will use as input while training the model.
%%time
use = legal.SentenceEntityResolverApproach()\
.setNeighbours(50)\
.setThreshold(10000)\
.setInputCols("sentence_embeddings")\
.setLabelCol("company_name")\
.setOutputCol('original_company_name')\
.setNormalizedCol("company_name")\
.setDistanceFunction("EUCLIDEAN")\
.setCaseSensitive(False)\
.setUseAuxLabel(True)\
.setAuxLabelCol('irs_number')
model = use.fit(final_data)
CPU times: user 108 ms, sys: 13.8 ms, total: 121 ms Wall time: 15.3 s
# Save model
model.write().overwrite().save("use_company_name")
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
.setInputCols("ner_chunk") \
.setOutputCol("sentence_embeddings")
resolver = legal.SentenceEntityResolverModel.load("use_company_name") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("normalized_name")\
.setDistanceFunction("EUCLIDEAN")
pipeline = nlp.Pipeline(
stages = [
documentAssembler,
embeddings,
resolver,
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)
tfhub_use download started this may take some time. Approximate size to download 923.7 MB [OK!]
# returns LP resolution results
import pandas as pd
pd.set_option('display.max_colwidth', 0)
def get_codes(lp, text, vocab='company_name', hcc=False):
    """Runs a LightPipeline on `text` and collects the resolver results into a DataFrame."""
    full_light_result = lp.fullAnnotate(text)
    chunks = []
    codes = []
    begin = []
    end = []
    resolutions = []
    all_codes = []
    all_cosines = []
    all_k_aux_labels = []
    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):
        begin.append(chunk.begin)
        end.append(chunk.end)
        chunks.append(chunk.result)
        codes.append(code.result)
        all_codes.append(code.metadata['all_k_results'].split(':::'))
        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
        if hcc:
            try:
                all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
            except KeyError:
                all_k_aux_labels.append([])
        else:
            all_k_aux_labels.append([])
    # Note: the 'all_distances' column holds the cosine distances.
    df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end, 'code': codes,
                       'all_codes': all_codes, 'resolutions': resolutions,
                       'all_k_aux_labels': all_k_aux_labels, 'all_distances': all_cosines})
    if hcc:
        # Aux labels are '||'-separated triples: billable||hcc_status||hcc_code
        df['billable'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])
        df['hcc_status'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])
        df['hcc_code'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])
        df = df.drop(['all_k_aux_labels'], axis=1)
    return df
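The metadata strings that get_codes unpacks are ':::'-separated candidate lists (and, for aux labels, '||'-separated triples). That parsing logic can be sketched standalone; the metadata dict below is a made-up example shaped like the resolver's annotation metadata, not real output:

```python
# Hypothetical metadata dict, shaped like a resolver annotation's metadata
metadata = {
    'all_k_results': 'AmeriCann, Inc.:::LUMIOX, INC.:::AGILYSYS INC',
    'all_k_cosine_distances': '0.0000:::0.1080:::0.1110',
    'all_k_aux_labels': '1||billable||A:::0||nonbillable||B:::1||billable||C',
}

# ':::' separates the top-k candidates
codes = metadata['all_k_results'].split(':::')
distances = [float(d) for d in metadata['all_k_cosine_distances'].split(':::')]
# each aux label is itself a '||'-separated triple
aux = [label.split('||') for label in metadata['all_k_aux_labels'].split(':::')]

print(codes[0], distances[0])  # best candidate and its cosine distance
print(aux[1])                  # ['0', 'nonbillable', 'B']
```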
text = "AmeriCann Inc"
%time
get_codes(light_model, text, vocab='normalized_name')
CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs Wall time: 5.01 µs
| | chunks | begin | end | code | all_codes | resolutions | all_distances |
|---|---|---|---|---|---|---|---|
0 | AmeriCann Inc | 0 | 12 | AmeriCann, Inc. | [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] | [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] | [0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170] |
text = 'AmeriCann inc'
%time get_codes(light_model, text, vocab='normalized_name')
CPU times: user 9.2 ms, sys: 277 µs, total: 9.48 ms Wall time: 61.2 ms
| | chunks | begin | end | code | all_codes | resolutions | all_distances |
|---|---|---|---|---|---|---|---|
0 | AmeriCann inc | 0 | 12 | AmeriCann, Inc. | [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] | [AmeriCann, Inc., LUMIOX, INC., AGILYSYS INC, Ameresco, Inc., IMMUCOR INC, AAON INC, CRYOLIFE INC] | [0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170] |
text = 'StepOne Personal Health inc'
%time get_codes(light_model, text, vocab='normalized_name')
CPU times: user 4.66 ms, sys: 2.18 ms, total: 6.84 ms Wall time: 52.7 ms
| | chunks | begin | end | code | all_codes | resolutions | all_distances |
|---|---|---|---|---|---|---|---|
0 | StepOne Personal Health inc | 0 | 26 | StepOne Personal Health, Inc. | [StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.] | [StepOne Personal Health, Inc., Kura Oncology, Inc., Axsome Therapeutics, Inc., CVS HEALTH Corp, EDGEWELL PERSONAL CARE Co, Cardiovascular Systems Inc, CESCA THERAPEUTICS INC.] | [0.0000, 0.2224, 0.2714, 0.2729, 0.2802, 0.2868, 0.2874] |
text = 'Alzamend Neuro INC'
%time get_codes(light_model, text, vocab='normalized_name')
CPU times: user 7.07 ms, sys: 732 µs, total: 7.81 ms Wall time: 67 ms
| | chunks | begin | end | code | all_codes | resolutions | all_distances |
|---|---|---|---|---|---|---|---|
0 | Alzamend Neuro INC | 0 | 17 | Alzamend Neuro, Inc. | [Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP] | [Alzamend Neuro, Inc., Kura Oncology, Inc., REGENERON PHARMACEUTICALS INC, Dipexium Pharmaceuticals, Inc., AEOLUS PHARMACEUTICALS, INC., Flex Pharma, Inc., PROTO SCRIPT PHARMACEUTICAL CORP] | [0.0000, 0.1704, 0.1802, 0.1934, 0.2149, 0.2162, 0.2254] |
text = 'MMEX Resources Corporation'
%time get_codes(light_model, text, vocab='normalized_name')
CPU times: user 7.98 ms, sys: 2.37 ms, total: 10.3 ms Wall time: 56.6 ms
| | chunks | begin | end | code | all_codes | resolutions | all_distances |
|---|---|---|---|---|---|---|---|
0 | MMEX Resources Corporation | 0 | 25 | MMEX Resources Corp | [MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.] | [MMEX Resources Corp, ANTERO RESOURCES Corp, ARTESIAN RESOURCES CORP, ESTERLINE TECHNOLOGIES CORP, Timberline Resources Corp, CATALYST PAPER CORP, INFRASTRUCTURE DEVELOPMENTS CORP.] | [0.1096, 0.1540, 0.1624, 0.2054, 0.2202, 0.2406, 0.2451] |
text = 'Alphadyne Asset Management Lp.'
%time get_codes(light_model, text, vocab='normalized_name')
CPU times: user 7.89 ms, sys: 941 µs, total: 8.83 ms Wall time: 43.4 ms
| | chunks | begin | end | code | all_codes | resolutions | all_distances |
|---|---|---|---|---|---|---|---|
0 | Alphadyne Asset Management Lp. | 0 | 29 | Alphadyne Asset Management LP | [Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.] | [Alphadyne Asset Management LP, YACKTMAN ASSET MANAGEMENT LP, TOCQUEVILLE ASSET MANAGEMENT L.P., SYSTEMATIC FINANCIAL MANAGEMENT LP, Madyson Equity Group, LP, AMERIGAS PARTNERS LP, CAPRIN ASSET MANAGEMENT LLC /ADV, ALLIANCEBERNSTEIN HOLDING L.P.] | [0.0000, 0.0724, 0.1040, 0.2378, 0.2470, 0.2570, 0.2614, 0.2722] |
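Downstream code often needs just the single best candidate, not the full top-k list. A minimal post-processing sketch, using the candidate list and cosine distances returned for "AmeriCann Inc" above (the 0.1 cut-off is an arbitrary illustration, not a recommended value):

```python
# Candidates and cosine distances as returned for "AmeriCann Inc" above
all_codes = ['AmeriCann, Inc.', 'LUMIOX, INC.', 'AGILYSYS INC', 'Ameresco, Inc.',
             'IMMUCOR INC', 'AAON INC', 'CRYOLIFE INC']
all_distances = [0.0000, 0.1080, 0.1110, 0.1133, 0.1145, 0.1165, 0.1170]

# Keep only candidates closer than an (arbitrary) 0.1 cut-off, nearest first
matches = sorted((d, c) for c, d in zip(all_codes, all_distances) if d < 0.1)
best = matches[0][1] if matches else None
print(best)  # 'AmeriCann, Inc.'
```

With this query only the exact match survives the cut-off; a looser threshold would also admit near-neighbours like LUMIOX, INC.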