📜 Let's have a look at what it takes to train your custom AssertionDL model for negation. We will use chunks from the finner_orgs_prods_alias NER model, which requires bert_embeddings_sec_bert_base embeddings.
! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, finance
# nlp.install(force_browser=True)
If you are not registered in my.johnsnowlabs.com, received your license via e-mail, or are using Safari, you may need to upload the license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/assertion_df.csv
import pandas as pd
training_df = pd.read_csv('./assertion_df.csv')
training_df
| | text | target | label | start | end |
|---|---|---|---|---|---|
| 0 | CEC ENTERTAINMENT INC is not purchasing GEO GR... | CEC ENTERTAINMENT INC | negative | 0 | 2 |
| 1 | CEC ENTERTAINMENT INC is not purchasing GEO GR... | GEO GROUP INC | positive | 6 | 8 |
| 2 | BRAVE ASSET MANAGEMENT INC is paying Mondelez ... | Mondelez International , Inc . | positive | 6 | 10 |
| 3 | BRAVE ASSET MANAGEMENT INC is paying Mondelez ... | BRAVE ASSET MANAGEMENT INC | positive | 0 | 3 |
| 4 | Compound Natural Foods Inc . is not investing ... | Compound Natural Foods Inc . | negative | 0 | 4 |
| ... | ... | ... | ... | ... | ... |
| 93 | Cboe EDGA Exchange , Inc . is not providing Bl... | BlueStar Financial Group , Inc . | positive | 9 | 14 |
| 94 | VSOURCE INC is hiring URSTADT BIDDLE PROPERTIE... | URSTADT BIDDLE PROPERTIES INC | positive | 4 | 7 |
| 95 | VSOURCE INC is hiring URSTADT BIDDLE PROPERTIE... | VSOURCE INC | positive | 0 | 1 |
| 96 | Emergent BioSolutions Inc . is not providing C... | Emergent BioSolutions Inc . | negative | 0 | 3 |
| 97 | Emergent BioSolutions Inc . is not providing C... | CUMMINS INC | positive | 7 | 8 |

98 rows × 5 columns
📜 The training data has the following columns:

- text: your text examples;
- target: your NER chunk, extracted using finner_orgs_prods_alias in our case;
- label: the assertion label. In our example, we have two labels: positive and negative;
- start: the first token index of the chunk. You can get this information from the begin column in your NER model metadata;
- end: the last token index of the chunk. You can get this information from the end column in your NER model metadata.

# Create Spark Dataframe
training_data = spark.createDataFrame(training_df)
training_data.show()
+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|Marijuana Co of A...|PVM International...|positive|   10| 14|
|Marijuana Co of A...|Marijuana Co of A...|positive|    0|  6|
|NORTEK INC is not...|          NORTEK INC|negative|    0|  1|
|NORTEK INC is not...|EN2GO INTERNATION...|positive|    6|  8|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|    0|  1|
|TransDigm Group I...| TransDigm Group INC|negative|    0|  2|
|TransDigm Group I...|         ABIOMED INC|positive|    9| 10|
|Fundrise Income e...|MIDDLETON & CO IN...|positive|    9| 12|
|Fundrise Income e...|Fundrise Income e...|negative|    0|  5|
|Nexeo Solutions ,...|Nexeo Solutions ,...|negative|    0|  4|
|Nexeo Solutions ,...|ARCA biopharma , ...|positive|    8| 12|
|Angie's List , In...|        RC-1 , Inc .|positive|   11| 14|
|Angie's List , In...|Angie's List , Inc .|negative|    0|  4|
+--------------------+--------------------+--------+-----+---+
only showing top 20 rows
training_data.printSchema()
root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
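If you ever need to build such start/end columns yourself from raw text, they can be derived by aligning the whitespace tokens of the target chunk against the text. A minimal pure-Python sketch (find_chunk_span is a hypothetical helper, not part of the library, and assumes both strings tokenize on whitespace the same way):

```python
def find_chunk_span(text, target):
    """Return the inclusive (start, end) token indices of `target` inside
    `text`, assuming both are whitespace-tokenized the same way."""
    tokens = text.split()
    chunk = target.split()
    for i in range(len(tokens) - len(chunk) + 1):
        if tokens[i:i + len(chunk)] == chunk:
            return i, i + len(chunk) - 1  # inclusive indices
    return None

# First two rows of the training data above
sentence = "CEC ENTERTAINMENT INC is not purchasing GEO GROUP INC"
print(find_chunk_span(sentence, "CEC ENTERTAINMENT INC"))  # → (0, 2)
print(find_chunk_span(sentence, "GEO GROUP INC"))          # → (6, 8)
```

In practice you would take begin/end from the NER model metadata instead, since real tokenizers do not split purely on whitespace.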
%time
training_data.count()
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 7.15 µs
98
(train_data, test_data) = training_data.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))
Training Dataset Count: 69
Test Dataset Count: 29
train_data.show()
+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|3AM TECHNOLOGIES ...|3AM TECHNOLOGIES INC|negative|    0|  2|
|3AM TECHNOLOGIES ...|NATURAL ALTERNATI...|positive|    6|  9|
|ALEXANDRIA REAL E...|ALEXANDRIA REAL E...|negative|    0|  4|
|ATMI INC is eligi...|            ATMI INC|positive|    0|  1|
|ATMI INC is eligi...|NEAH POWER SYSTEM...|positive|    5| 10|
|Angie's List , In...|Angie's List , Inc .|negative|    0|  4|
|Angie's List , In...|        RC-1 , Inc .|positive|   11| 14|
|Artificial Intell...| APA OPTICS INC /MN/|positive|   10| 13|
|Artificial Intell...|Artificial Intell...|negative|    0|  5|
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|DELTA APPAREL , I...| DELTA APPAREL , INC|positive|    0|  3|
|DELTA APPAREL , I...|Long-Term Stock E...|positive|   10| 15|
|Fundrise Income e...|Fundrise Income e...|negative|    0|  5|
|GHP Investment Ad...|PARALLAX HEALTH S...|positive|   10| 15|
|LANDAUER INC is n...|        LANDAUER INC|negative|    0|  1|
|LANDAUER INC is n...|PLANTRONICS INC /CA/|positive|    5|  7|
|MGP INGREDIENTS I...|    LAND O LAKES INC|positive|    6|  9|
|Marijuana Co of A...|Marijuana Co of A...|positive|    0|  6|
|Marijuana Co of A...|PVM International...|positive|   10| 14|
+--------------------+--------------------+--------+-----+---+
only showing top 20 rows
test_data.show()
+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|ALEXANDRIA REAL E...|            CDEX INC|positive|    9| 10|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Fundrise Income e...|MIDDLETON & CO IN...|positive|    9| 12|
|GHP Investment Ad...|GHP Investment Ad...|positive|    0|  5|
|MGP INGREDIENTS I...| MGP INGREDIENTS INC|negative|    0|  2|
|Mountain Capital ...|Mountain Capital ...|negative|    0|  5|
|Palo Alto Network...|Palo Alto Network...|negative|    0|  3|
|QUAINT OAK BANCOR...|QUAINT OAK BANCOR...|positive|    0|  3|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|    0|  1|
|SHOE CARNIVAL INC...|   SHOE CARNIVAL INC|negative|    0|  2|
|SURMODICS INC is ...|Cboe EDGA Exchang...|positive|    4|  9|
|WACCAMAW BANKSHAR...|WACCAMAW BANKSHAR...|positive|    0|  2|
|AMERICAN CENTURY ...|AMERICAN CENTURY ...|negative|    0|  5|
|AMERICAN CENTURY ...| Charmed Homes Inc .|positive|    9| 12|
|EATON VANCE MASSA...|BATS Y-Exchange ,...|positive|   13| 17|
|EATON VANCE MASSA...|EATON VANCE MASSA...|positive|    0|  5|
+--------------------+--------------------+--------+-----+---+
only showing top 20 rows
The embeddings are calculated using the bert_embeddings_sec_bert_base model on your text column:
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
.setInputCols("document", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
bert_embeddings_sec_bert_base download started this may take some time. Approximate size to download 390.4 MB [OK!]
document = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
chunk = nlp.Doc2Chunk()\
.setInputCols("document")\
.setOutputCol("doc_chunk")\
.setChunkCol("target")\
.setStartCol("start")\
.setStartColByTokenIndex(True)\
.setFailOnMissing(False)\
.setLowerCase(False)
token = nlp.Tokenizer()\
.setInputCols(['document'])\
.setOutputCol('token')
We save the test data in Parquet format so it can be passed to AssertionDLApproach() via setTestDataset().
assertion_pipeline = nlp.Pipeline(
stages = [
document,
chunk,
token,
bert_embeddings])
assertion_test_data = assertion_pipeline.fit(test_data).transform(test_data)
assertion_test_data.write.mode('overwrite').parquet('test_data.parquet')
assertion_test_data.columns
['text', 'target', 'label', 'start', 'end', 'document', 'doc_chunk', 'token', 'embeddings']
assertion_train_data = assertion_pipeline.fit(training_data).transform(training_data)
assertion_train_data.write.mode('overwrite').parquet('train_data.parquet')
assertion_train_data.columns
['text', 'target', 'label', 'start', 'end', 'document', 'doc_chunk', 'token', 'embeddings']
! pip install -q tensorflow==2.7.0
! pip install -q tensorflow-addons
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 489.6/489.6 MB 2.6 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 463.1/463.1 KB 41.7 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 73.3 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 44.4 MB/s eta 0:00:00
We will use the TFGraphBuilder annotator, which creates the TensorFlow graph used in the model training pipeline.
TFGraphBuilder inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
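Since the graph builder requires TensorFlow <= 2.7, it can be worth verifying the installed version before training. A minimal sketch of such a check (tf_version_ok is a hypothetical helper; in a notebook you would pass it tensorflow.__version__):

```python
def tf_version_ok(version: str, maximum=(2, 7)) -> bool:
    """Check that a TensorFlow version string is at or below `maximum`
    (major, minor), as required by TFGraphBuilder."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= maximum

print(tf_version_ok("2.7.0"))   # → True
print(tf_version_ok("2.11.0"))  # → False
```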
graph_folder= "./tf_graphs"
assertion_graph_builder = finance.TFGraphBuilder()\
.setModelName("assertion_dl")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label")\
.setGraphFolder(graph_folder)\
.setGraphFile("assertion_graph.pb")\
.setMaxSequenceLength(1200)\
.setHiddenUnitsNumber(25)
📜 Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models

This parameter allows you to train the Assertion Status Models to focus on specific context windows when resolving the status of a NER chunk. The window is in the format [X, Y], where X is the number of tokens to consider on the left of the chunk and Y the maximum number of tokens to consider on the right. Let's take a look at what different windows mean:

- [-1, -1] means that the Assertion Status will look at all of the tokens in the sentence/document (up to the maximum number of tokens set in setMaxSentLen()).
- [0, 0] means "don't pay attention to any token except the ner_chunk", i.e., no context at all is considered for the assertion resolution.
- [9, 15] is what empirically seems to be the best baseline: we look at up to 9 tokens on the left and 15 on the right of the NER chunk to understand the context and resolve the status.

Check the Scope Window Tuning Assertion Status Detection notebook, which illustrates the effect of the different windows and how to properly fine-tune your AssertionDLModels to get the best out of them.

In our case, the best Scope Window is around [10,10]
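To build intuition, here is what a scope window does in plain Python terms (scope_tokens is a hypothetical illustration of the idea, not the library's implementation):

```python
def scope_tokens(tokens, start, end, window):
    """Return the tokens an assertion model with scope `window` = [X, Y]
    would consider around the chunk at inclusive token indices [start, end]."""
    x, y = window
    if x == -1 and y == -1:          # [-1, -1]: use the whole sentence
        return tokens
    left = max(0, start - x)
    right = min(len(tokens), end + 1 + y)
    return tokens[left:right]

sent = "CEC ENTERTAINMENT INC is not purchasing GEO GROUP INC".split()
# Chunk "GEO GROUP INC" spans tokens 6..8
print(scope_tokens(sent, 6, 8, [0, 0]))  # chunk only → ['GEO', 'GROUP', 'INC']
print(scope_tokens(sent, 6, 8, [2, 2]))  # plus two tokens of context each side
```

Note how with window [0, 0] the decisive token "not" is outside the scope, which is why overly narrow windows hurt negation detection.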
scope_window = [50, 50]
assertionStatus = finance.AssertionDLApproach()\
.setLabelCol("label")\
.setInputCols("document", "doc_chunk", "embeddings")\
.setOutputCol("assertion")\
.setBatchSize(128)\
.setLearningRate(0.001)\
.setEpochs(2)\
.setStartCol("start")\
.setEndCol("end")\
.setMaxSentLen(1200)\
.setEnableOutputLogs(True)\
.setOutputLogsPath('training_logs/')\
.setGraphFolder(graph_folder)\
.setGraphFile(f"{graph_folder}/assertion_graph.pb")\
.setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
.setScopeWindow(scope_window)
#.setValidationSplit(0.2)\
#.setDropout(0.1)\
assertion_pipeline = nlp.Pipeline(
stages = [
assertion_graph_builder,
assertionStatus])
training_data.printSchema()
root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
assertion_train_data = spark.read.parquet('train_data.parquet')
assertion_train_data.groupBy('label').count().show()
+--------+-----+
|   label|count|
+--------+-----+
|positive|   71|
|negative|   27|
+--------+-----+
%%time
assertion_model = assertion_pipeline.fit(assertion_train_data)
TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 768, 'max_seq_len': 1200, 'n_hidden': 25}
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term
Device mapping: no known devices.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/sparknlp_jsl/_tf_graph_builders/tf2contrib/rnn.py:229: bidirectional_dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/rnn.py:441: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/layers/legacy_rnn/rnn_cell_impl.py:766: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Device mapping: no known devices.
assertion_dl graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 11.1 s, sys: 691 ms, total: 11.8 s
Wall time: 55.5 s
Checking the results saved in the log file
import os
log_files = os.listdir("./training_logs")
log_files
['AssertionDLApproach_1288a0afc47a.log']
with open("./training_logs/"+log_files[0]) as log_file:
print(log_file.read())
Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 98
Epoch: 0 started, learning rate: 0.001, dataset size: 98
Done, 21.154691869
total training loss: 2.5376515, avg training loss: 2.5376515, batches: 1
Quality on test dataset:
time to finish evaluation: 2.22s
Total test loss: 2.1600 Avg test loss: 2.1600
label tp fp fn prec rec f1
negative 10 19 0 0.3448276 1.0 0.5128205
positive 0 0 19 0.0 0.0 0.0
tp: 10 fp: 19 fn: 19 labels: 2
Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024
Micro-average prec: 0.3448276, rec: 0.3448276, f1: 0.3448276
Epoch: 1 started, learning rate: 9.5E-4, dataset size: 98
Done, 6.591055361
total training loss: 2.3828928, avg training loss: 2.3828928, batches: 1
Quality on test dataset:
time to finish evaluation: 1.69s
Total test loss: 2.0273 Avg test loss: 2.0273
label tp fp fn prec rec f1
negative 10 19 0 0.3448276 1.0 0.5128205
positive 0 0 19 0.0 0.0 0.0
tp: 10 fp: 19 fn: 19 labels: 2
Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024
Micro-average prec: 0.3448276, rec: 0.3448276, f1: 0.3448276
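If you prefer to track per-epoch metrics programmatically instead of eyeballing the raw log, the Macro-average lines can be extracted with a small regex. A hedged sketch (parse_macro_f1 is a hypothetical helper; the exact log format may differ across library versions):

```python
import re

def parse_macro_f1(log_text):
    """Extract the macro-average F1 score reported after each epoch."""
    pattern = r"Macro-average\s.*?f1:\s*([0-9.]+)"
    return [float(m) for m in re.findall(pattern, log_text)]

log = """Epoch: 0 ... Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024
Epoch: 1 ... Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024"""
print(parse_macro_f1(log))  # → [0.25641024, 0.25641024]
```

This makes it easy to plot the learning curve or stop a long run early when the test F1 plateaus.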
assertion_test_data = spark.read.parquet('test_data.parquet')
preds = assertion_model.transform(assertion_test_data).select('label','assertion.result')
preds.show()
+--------+----------+
|   label|    result|
+--------+----------+
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|negative|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
+--------+----------+
only showing top 20 rows
preds_df = preds.toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x: x[0] if len(x) else pd.NA)
preds_df.dropna(inplace=True)
preds_df
| | label | result |
|---|---|---|
| 0 | positive | negative |
| 1 | positive | negative |
| 2 | positive | negative |
| 3 | positive | negative |
| 4 | negative | negative |
| 5 | positive | negative |
| 6 | positive | negative |
| 7 | negative | negative |
| 8 | negative | negative |
| 9 | negative | negative |
| 10 | positive | negative |
| 11 | positive | negative |
| 12 | positive | negative |
| 13 | negative | negative |
| 14 | positive | negative |
| 15 | positive | negative |
| 16 | negative | negative |
| 17 | positive | negative |
| 18 | positive | negative |
| 19 | positive | negative |
| 20 | positive | negative |
| 21 | negative | negative |
| 22 | negative | negative |
| 23 | positive | negative |
| 24 | positive | negative |
| 25 | negative | negative |
| 26 | negative | negative |
| 27 | positive | negative |
| 28 | positive | negative |
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report
print (classification_report( preds_df['label'], preds_df['result']))
              precision    recall  f1-score   support

    negative       0.34      1.00      0.51        10
    positive       0.00      0.00      0.00        19

    accuracy                           0.34        29
   macro avg       0.17      0.50      0.26        29
weighted avg       0.12      0.34      0.18        29
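The figures in the report can be cross-checked against the confusion counts from the training log (for the negative class: tp=10, fp=19, fn=0). A quick sketch of the underlying arithmetic:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false
    negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 'negative' class: all 29 test chunks were predicted negative, 10 truly are
p, r, f = prf1(tp=10, fp=19, fn=0)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.34 1.0 0.51
```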
assertion_model.stages
[TFGraphBuilderModel_a389b6a16cae, FINANCE-ASSERTION_DL_13ea29236849]
# Save a Spark NLP model
assertion_model.stages[-1].write().overwrite().save('Assertion')
# cd into saved dir and zip
! cd /content/Assertion ; zip -r /content/Assertion.zip *
adding: fields/ (stored 0%)
adding: fields/datasetParams/ (stored 0%)
adding: fields/datasetParams/_SUCCESS (stored 0%)
adding: fields/datasetParams/.part-00000.crc (stored 0%)
adding: fields/datasetParams/part-00001 (deflated 95%)
adding: fields/datasetParams/part-00000 (deflated 27%)
adding: fields/datasetParams/.part-00001.crc (deflated 44%)
adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
adding: metadata/ (stored 0%)
adding: metadata/_SUCCESS (stored 0%)
adding: metadata/.part-00000.crc (stored 0%)
adding: metadata/part-00000 (deflated 38%)
adding: metadata/._SUCCESS.crc (stored 0%)
adding: tensorflow (deflated 39%)
The model was trained on very little data, since it was created as a playground for the certification trainings to run quickly. So don't expect high performance (for that, you have the pretrained version used earlier in this notebook).
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = finance.TextSplitter() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
finassertion = finance.AssertionDLModel.load("Assertion")\
.setInputCols(["sentence", "ner_chunk", "embeddings"])\
.setOutputCol("finlabel")
pipe = nlp.Pipeline(stages = [ document_assembler, text_splitter, tokenizer, embeddings, ner, ner_converter, finassertion])
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_orgs_prods_alias download started this may take some time.
[OK!]
text = "Gradio INC will enter into a joint agreement with Hugging Face, Inc."
sdf = spark.createDataFrame([[text]]).toDF("text")
res = pipe.fit(sdf).transform(sdf)
import pyspark.sql.functions as F
res.select(F.explode(F.arrays_zip(res.ner_chunk.result,
res.finlabel.result)).alias("cols"))\
.select(F.expr("cols['0']").alias("ner_chunk"),
F.expr("cols['1']").alias("assertion")).show(200, truncate=100)
+-----------------+---------+
|        ner_chunk|assertion|
+-----------------+---------+
|       Gradio INC| negative|
|Hugging Face, Inc| negative|
+-----------------+---------+