📜 Let's have a look at what it takes to train your custom AssertionDL model for negation. We will use chunks from the finner_orgs_prods_alias NER model, which requires bert_embeddings_sec_bert_base embeddings.
! pip install -q johnsnowlabs
Using my.johnsnowlabs.com SSO
from johnsnowlabs import nlp, finance
# nlp.install(force_browser=True)
If you are not registered in my.johnsnowlabs.com, received your license via e-mail, or are using Safari, you may need to upload the license manually.
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
nlp.install()
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/assertion_df.csv
import pandas as pd
training_df = pd.read_csv('./assertion_df.csv')
training_df
| | text | target | label | start | end |
|---|---|---|---|---|---|
| 0 | CEC ENTERTAINMENT INC is not purchasing GEO GR... | CEC ENTERTAINMENT INC | negative | 0 | 2 |
| 1 | CEC ENTERTAINMENT INC is not purchasing GEO GR... | GEO GROUP INC | positive | 6 | 8 |
| 2 | BRAVE ASSET MANAGEMENT INC is paying Mondelez ... | Mondelez International , Inc . | positive | 6 | 10 |
| 3 | BRAVE ASSET MANAGEMENT INC is paying Mondelez ... | BRAVE ASSET MANAGEMENT INC | positive | 0 | 3 |
| 4 | Compound Natural Foods Inc . is not investing ... | Compound Natural Foods Inc . | negative | 0 | 4 |
| ... | ... | ... | ... | ... | ... |
| 93 | Cboe EDGA Exchange , Inc . is not providing Bl... | BlueStar Financial Group , Inc . | positive | 9 | 14 |
| 94 | VSOURCE INC is hiring URSTADT BIDDLE PROPERTIE... | URSTADT BIDDLE PROPERTIES INC | positive | 4 | 7 |
| 95 | VSOURCE INC is hiring URSTADT BIDDLE PROPERTIE... | VSOURCE INC | positive | 0 | 1 |
| 96 | Emergent BioSolutions Inc . is not providing C... | Emergent BioSolutions Inc . | negative | 0 | 3 |
| 97 | Emergent BioSolutions Inc . is not providing C... | CUMMINS INC | positive | 7 | 8 |

98 rows × 5 columns
📜 The training data has the following columns:

- text: your text examples;
- target: your NER chunk, extracted using finner_orgs_prods_alias in our case;
- label: the assertion label. In our example, we have two labels: positive and negative;
- start: the first token index of the chunk. You can get this information from the begin column in your NER model metadata;
- end: the last token index of the chunk. You can get this information from the end column in your NER model metadata.

# Create Spark Dataframe
training_data = spark.createDataFrame(training_df)
training_data.show()
+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|Marijuana Co of A...|PVM International...|positive|   10| 14|
|Marijuana Co of A...|Marijuana Co of A...|positive|    0|  6|
|NORTEK INC is not...|          NORTEK INC|negative|    0|  1|
|NORTEK INC is not...|EN2GO INTERNATION...|positive|    6|  8|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|    0|  1|
|TransDigm Group I...| TransDigm Group INC|negative|    0|  2|
|TransDigm Group I...|         ABIOMED INC|positive|    9| 10|
|Fundrise Income e...|MIDDLETON & CO IN...|positive|    9| 12|
|Fundrise Income e...|Fundrise Income e...|negative|    0|  5|
|Nexeo Solutions ,...|Nexeo Solutions ,...|negative|    0|  4|
|Nexeo Solutions ,...|ARCA biopharma , ...|positive|    8| 12|
|Angie's List , In...|        RC-1 , Inc .|positive|   11| 14|
|Angie's List , In...|Angie's List , Inc .|negative|    0|  4|
+--------------------+--------------------+--------+-----+---+
only showing top 20 rows
training_data.printSchema()
root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
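If you ever need to build such start/end columns yourself from raw text, they can be derived by aligning the whitespace tokens of the target chunk against the text. A minimal pure-Python sketch (find_chunk_span is a hypothetical helper, not part of the library, and assumes both strings tokenize on whitespace the same way):

```python
def find_chunk_span(text, target):
    """Return the inclusive (start, end) token indices of `target` inside
    `text`, assuming both are whitespace-tokenized the same way."""
    tokens = text.split()
    chunk = target.split()
    for i in range(len(tokens) - len(chunk) + 1):
        if tokens[i:i + len(chunk)] == chunk:
            return i, i + len(chunk) - 1  # inclusive indices
    return None

# First two rows of the training data above
sentence = "CEC ENTERTAINMENT INC is not purchasing GEO GROUP INC"
print(find_chunk_span(sentence, "CEC ENTERTAINMENT INC"))  # → (0, 2)
print(find_chunk_span(sentence, "GEO GROUP INC"))          # → (6, 8)
```

In practice you would take begin/end from the NER model metadata instead, since real tokenizers do not split purely on whitespace.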
%time
training_data.count()
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 7.15 µs
98
(train_data, test_data) = training_data.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))
Training Dataset Count: 69
Test Dataset Count: 29
train_data.show()
+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|3AM TECHNOLOGIES ...|3AM TECHNOLOGIES INC|negative|    0|  2|
|3AM TECHNOLOGIES ...|NATURAL ALTERNATI...|positive|    6|  9|
|ALEXANDRIA REAL E...|ALEXANDRIA REAL E...|negative|    0|  4|
|ATMI INC is eligi...|            ATMI INC|positive|    0|  1|
|ATMI INC is eligi...|NEAH POWER SYSTEM...|positive|    5| 10|
|Angie's List , In...|Angie's List , Inc .|negative|    0|  4|
|Angie's List , In...|        RC-1 , Inc .|positive|   11| 14|
|Artificial Intell...| APA OPTICS INC /MN/|positive|   10| 13|
|Artificial Intell...|Artificial Intell...|negative|    0|  5|
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|DELTA APPAREL , I...| DELTA APPAREL , INC|positive|    0|  3|
|DELTA APPAREL , I...|Long-Term Stock E...|positive|   10| 15|
|Fundrise Income e...|Fundrise Income e...|negative|    0|  5|
|GHP Investment Ad...|PARALLAX HEALTH S...|positive|   10| 15|
|LANDAUER INC is n...|        LANDAUER INC|negative|    0|  1|
|LANDAUER INC is n...|PLANTRONICS INC /CA/|positive|    5|  7|
|MGP INGREDIENTS I...|    LAND O LAKES INC|positive|    6|  9|
|Marijuana Co of A...|Marijuana Co of A...|positive|    0|  6|
|Marijuana Co of A...|PVM International...|positive|   10| 14|
+--------------------+--------------------+--------+-----+---+
only showing top 20 rows
test_data.show()
+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|ALEXANDRIA REAL E...|            CDEX INC|positive|    9| 10|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Fundrise Income e...|MIDDLETON & CO IN...|positive|    9| 12|
|GHP Investment Ad...|GHP Investment Ad...|positive|    0|  5|
|MGP INGREDIENTS I...| MGP INGREDIENTS INC|negative|    0|  2|
|Mountain Capital ...|Mountain Capital ...|negative|    0|  5|
|Palo Alto Network...|Palo Alto Network...|negative|    0|  3|
|QUAINT OAK BANCOR...|QUAINT OAK BANCOR...|positive|    0|  3|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|    0|  1|
|SHOE CARNIVAL INC...|   SHOE CARNIVAL INC|negative|    0|  2|
|SURMODICS INC is ...|Cboe EDGA Exchang...|positive|    4|  9|
|WACCAMAW BANKSHAR...|WACCAMAW BANKSHAR...|positive|    0|  2|
|AMERICAN CENTURY ...|AMERICAN CENTURY ...|negative|    0|  5|
|AMERICAN CENTURY ...| Charmed Homes Inc .|positive|    9| 12|
|EATON VANCE MASSA...|BATS Y-Exchange ,...|positive|   13| 17|
|EATON VANCE MASSA...|EATON VANCE MASSA...|positive|    0|  5|
+--------------------+--------------------+--------+-----+---+
only showing top 20 rows
The embeddings are calculated using the bert_embeddings_sec_bert_base model on your text column:
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
.setInputCols("document", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
bert_embeddings_sec_bert_base download started this may take some time. Approximate size to download 390.4 MB [OK!]
document = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
chunk = nlp.Doc2Chunk()\
.setInputCols("document")\
.setOutputCol("doc_chunk")\
.setChunkCol("target")\
.setStartCol("start")\
.setStartColByTokenIndex(True)\
.setFailOnMissing(False)\
.setLowerCase(False)
token = nlp.Tokenizer()\
.setInputCols(['document'])\
.setOutputCol('token')
We save the test data in Parquet format so it can be passed to AssertionDLApproach() via setTestDataset().
assertion_pipeline = nlp.Pipeline(
stages = [
document,
chunk,
token,
bert_embeddings])
assertion_test_data = assertion_pipeline.fit(test_data).transform(test_data)
assertion_test_data.write.mode('overwrite').parquet('test_data.parquet')
assertion_test_data.columns
['text', 'target', 'label', 'start', 'end', 'document', 'doc_chunk', 'token', 'embeddings']
assertion_train_data = assertion_pipeline.fit(training_data).transform(training_data)
assertion_train_data.write.mode('overwrite').parquet('train_data.parquet')
assertion_train_data.columns
['text', 'target', 'label', 'start', 'end', 'document', 'doc_chunk', 'token', 'embeddings']
! pip install -q tensorflow==2.7.0
! pip install -q tensorflow-addons
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 489.6/489.6 MB 2.6 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 463.1/463.1 KB 41.7 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 73.3 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 44.4 MB/s eta 0:00:00
We will use the TFGraphBuilder annotator, which creates the TensorFlow graph used in the model training pipeline.
TFGraphBuilder inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
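Since the graph builder requires TensorFlow <= 2.7, it can be worth verifying the installed version before training. A minimal sketch of such a check (tf_version_ok is a hypothetical helper; in a notebook you would pass it tensorflow.__version__):

```python
def tf_version_ok(version: str, maximum=(2, 7)) -> bool:
    """Check that a TensorFlow version string is at or below `maximum`
    (major, minor), as required by TFGraphBuilder."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= maximum

print(tf_version_ok("2.7.0"))   # → True
print(tf_version_ok("2.11.0"))  # → False
```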
graph_folder= "./tf_graphs"
assertion_graph_builder = finance.TFGraphBuilder()\
.setModelName("assertion_dl")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label")\
.setGraphFolder(graph_folder)\
.setGraphFile("assertion_graph.pb")\
.setMaxSequenceLength(1200)\
.setHiddenUnitsNumber(25)
📜 Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models

This parameter allows you to train the Assertion Status Models to focus on specific context windows when resolving the status of a NER chunk. The window is in the format [X, Y], where X is the number of tokens to consider on the left of the chunk and Y the maximum number of tokens to consider on the right. Let's take a look at what different windows mean:

- [-1, -1] means that the Assertion Status will look at all of the tokens in the sentence/document (up to the maximum number of tokens set in setMaxSentLen()).
- [0, 0] means "don't pay attention to any token except the ner_chunk", i.e., no context at all is considered for the assertion resolution.
- [9, 15] is what empirically seems to be the best baseline: we look at up to 9 tokens on the left and 15 on the right of the NER chunk to understand the context and resolve the status.

Check the Scope Window Tuning Assertion Status Detection notebook, which illustrates the effect of the different windows and how to properly fine-tune your AssertionDLModels to get the best out of them.

In our case, the best Scope Window is around [10,10]
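To build intuition, here is what a scope window does in plain Python terms (scope_tokens is a hypothetical illustration of the idea, not the library's implementation):

```python
def scope_tokens(tokens, start, end, window):
    """Return the tokens an assertion model with scope `window` = [X, Y]
    would consider around the chunk at inclusive token indices [start, end]."""
    x, y = window
    if x == -1 and y == -1:          # [-1, -1]: use the whole sentence
        return tokens
    left = max(0, start - x)
    right = min(len(tokens), end + 1 + y)
    return tokens[left:right]

sent = "CEC ENTERTAINMENT INC is not purchasing GEO GROUP INC".split()
# Chunk "GEO GROUP INC" spans tokens 6..8
print(scope_tokens(sent, 6, 8, [0, 0]))  # chunk only → ['GEO', 'GROUP', 'INC']
print(scope_tokens(sent, 6, 8, [2, 2]))  # plus two tokens of context each side
```

Note how with window [0, 0] the decisive token "not" is outside the scope, which is why overly narrow windows hurt negation detection.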
scope_window = [50, 50]
assertionStatus = finance.AssertionDLApproach()\
.setLabelCol("label")\
.setInputCols("document", "doc_chunk", "embeddings")\
.setOutputCol("assertion")\
.setBatchSize(128)\
.setLearningRate(0.001)\
.setEpochs(2)\
.setStartCol("start")\
.setEndCol("end")\
.setMaxSentLen(1200)\
.setEnableOutputLogs(True)\
.setOutputLogsPath('training_logs/')\
.setGraphFolder(graph_folder)\
.setGraphFile(f"{graph_folder}/assertion_graph.pb")\
.setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
.setScopeWindow(scope_window)
#.setValidationSplit(0.2)\
#.setDropout(0.1)\
assertion_pipeline = nlp.Pipeline(
stages = [
assertion_graph_builder,
assertionStatus])
training_data.printSchema()
root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
assertion_train_data = spark.read.parquet('train_data.parquet')
assertion_train_data.groupBy('label').count().show()
+--------+-----+
|   label|count|
+--------+-----+
|positive|   71|
|negative|   27|
+--------+-----+
%%time
assertion_model = assertion_pipeline.fit(assertion_train_data)
TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 768, 'max_seq_len': 1200, 'n_hidden': 25}
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term
Device mapping: no known devices.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/sparknlp_jsl/_tf_graph_builders/tf2contrib/rnn.py:229: bidirectional_dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/rnn.py:441: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/layers/legacy_rnn/rnn_cell_impl.py:766: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Device mapping: no known devices.
assertion_dl graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 11.1 s, sys: 691 ms, total: 11.8 s
Wall time: 55.5 s
Checking the results saved in the log file
import os
log_files = os.listdir("./training_logs")
log_files
['AssertionDLApproach_1288a0afc47a.log']
with open("./training_logs/"+log_files[0]) as log_file:
print(log_file.read())
Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 98
Epoch: 0 started, learning rate: 0.001, dataset size: 98
Done, 21.154691869
total training loss: 2.5376515, avg training loss: 2.5376515, batches: 1
Quality on test dataset:
time to finish evaluation: 2.22s
Total test loss: 2.1600 Avg test loss: 2.1600
label tp fp fn prec rec f1
negative 10 19 0 0.3448276 1.0 0.5128205
positive 0 0 19 0.0 0.0 0.0
tp: 10 fp: 19 fn: 19 labels: 2
Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024
Micro-average prec: 0.3448276, rec: 0.3448276, f1: 0.3448276
Epoch: 1 started, learning rate: 9.5E-4, dataset size: 98
Done, 6.591055361
total training loss: 2.3828928, avg training loss: 2.3828928, batches: 1
Quality on test dataset:
time to finish evaluation: 1.69s
Total test loss: 2.0273 Avg test loss: 2.0273
label tp fp fn prec rec f1
negative 10 19 0 0.3448276 1.0 0.5128205
positive 0 0 19 0.0 0.0 0.0
tp: 10 fp: 19 fn: 19 labels: 2
Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024
Micro-average prec: 0.3448276, rec: 0.3448276, f1: 0.3448276
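If you prefer to track per-epoch metrics programmatically instead of eyeballing the raw log, the Macro-average lines can be extracted with a small regex. A hedged sketch (parse_macro_f1 is a hypothetical helper; the exact log format may differ across library versions):

```python
import re

def parse_macro_f1(log_text):
    """Extract the macro-average F1 score reported after each epoch."""
    pattern = r"Macro-average\s.*?f1:\s*([0-9.]+)"
    return [float(m) for m in re.findall(pattern, log_text)]

log = """Epoch: 0 ... Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024
Epoch: 1 ... Macro-average prec: 0.1724138, rec: 0.5, f1: 0.25641024"""
print(parse_macro_f1(log))  # → [0.25641024, 0.25641024]
```

This makes it easy to plot the learning curve or stop a long run early when the test F1 plateaus.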
assertion_test_data = spark.read.parquet('test_data.parquet')
preds = assertion_model.transform(assertion_test_data).select('label','assertion.result')
preds.show()
+--------+----------+
|   label|    result|
+--------+----------+
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|negative|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
+--------+----------+
only showing top 20 rows
preds_df = preds.toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x: x[0] if len(x) else pd.NA)
preds_df.dropna(inplace=True)
preds_df
| | label | result |
|---|---|---|
| 0 | positive | negative |
| 1 | positive | negative |
| 2 | positive | negative |
| 3 | positive | negative |
| 4 | negative | negative |
| 5 | positive | negative |
| 6 | positive | negative |
| 7 | negative | negative |
| 8 | negative | negative |
| 9 | negative | negative |
| 10 | positive | negative |
| 11 | positive | negative |
| 12 | positive | negative |
| 13 | negative | negative |
| 14 | positive | negative |
| 15 | positive | negative |
| 16 | negative | negative |
| 17 | positive | negative |
| 18 | positive | negative |
| 19 | positive | negative |
| 20 | positive | negative |
| 21 | negative | negative |
| 22 | negative | negative |
| 23 | positive | negative |
| 24 | positive | negative |
| 25 | negative | negative |
| 26 | negative | negative |
| 27 | positive | negative |
| 28 | positive | negative |
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report
print (classification_report( preds_df['label'], preds_df['result']))
              precision    recall  f1-score   support

    negative       0.34      1.00      0.51        10
    positive       0.00      0.00      0.00        19

    accuracy                           0.34        29
   macro avg       0.17      0.50      0.26        29
weighted avg       0.12      0.34      0.18        29
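The figures in the report can be cross-checked against the confusion counts from the training log (for the negative class: tp=10, fp=19, fn=0). A quick sketch of the underlying arithmetic:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false
    negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 'negative' class: all 29 test chunks were predicted negative, 10 truly are
p, r, f = prf1(tp=10, fp=19, fn=0)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.34 1.0 0.51
```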
assertion_model.stages
[TFGraphBuilderModel_a389b6a16cae, FINANCE-ASSERTION_DL_13ea29236849]
# Save a Spark NLP model
assertion_model.stages[-1].write().overwrite().save('Assertion')
# cd into saved dir and zip
! cd /content/Assertion ; zip -r /content/Assertion.zip *
adding: fields/ (stored 0%)
adding: fields/datasetParams/ (stored 0%)
adding: fields/datasetParams/_SUCCESS (stored 0%)
adding: fields/datasetParams/.part-00000.crc (stored 0%)
adding: fields/datasetParams/part-00001 (deflated 95%)
adding: fields/datasetParams/part-00000 (deflated 27%)
adding: fields/datasetParams/.part-00001.crc (deflated 44%)
adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
adding: metadata/ (stored 0%)
adding: metadata/_SUCCESS (stored 0%)
adding: metadata/.part-00000.crc (stored 0%)
adding: metadata/part-00000 (deflated 38%)
adding: metadata/._SUCCESS.crc (stored 0%)
adding: tensorflow (deflated 39%)
The model was trained on very little data, since it was created as a playground for the certification trainings to run quickly. So don't expect high performance (for that, you have the pretrained version used earlier in this notebook).
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = finance.TextSplitter() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
finassertion = finance.AssertionDLModel.load("Assertion")\
.setInputCols(["sentence", "ner_chunk", "embeddings"])\
.setOutputCol("finlabel")
pipe = nlp.Pipeline(stages = [ document_assembler, text_splitter, tokenizer, embeddings, ner, ner_converter, finassertion])
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_orgs_prods_alias download started this may take some time.
[OK!]
text = "Gradio INC will enter into a joint agreement with Hugging Face, Inc."
sdf = spark.createDataFrame([[text]]).toDF("text")
res = pipe.fit(sdf).transform(sdf)
import pyspark.sql.functions as F
res.select(F.explode(F.arrays_zip(res.ner_chunk.result,
res.finlabel.result)).alias("cols"))\
.select(F.expr("cols['0']").alias("ner_chunk"),
F.expr("cols['1']").alias("assertion")).show(200, truncate=100)
+-----------------+---------+
|        ner_chunk|assertion|
+-----------------+---------+
|       Gradio INC| negative|
|Hugging Face, Inc| negative|
+-----------------+---------+