Integrate_NLP_Libraries.ipynb: Combine the Outputs of Multiple NLP Libraries using Text Extensions for Pandas

Introduction

This notebook demonstrates the interoperability features of the open source library Text Extensions for Pandas. Specifically, we use Pandas DataFrames as a bridge between multiple natural language processing libraries. The example shown here combines IBM's Watson Natural Language Understanding service with SpaCy to perform NLP tasks such as extracting entities, relations, spans, and sentiment.

Environment Setup

This notebook requires a Python 3.7 or later environment with the following packages: pandas, spacy, ibm_watson, and text_extensions_for_pandas.

You can satisfy the dependency on text_extensions_for_pandas in either of two ways:

  • Run pip install text_extensions_for_pandas before running this notebook. This command adds the library to your Python environment from the latest PyPI release.
  • Alternately, run this notebook out of your local copy of the Text Extensions for Pandas project's source tree. In that case, the notebook will use the version of Text Extensions for Pandas in your local source tree if the package is not installed in your Python environment.
In [1]:
# Uncomment and run this cell if you are using this notebook in a cloud environment such 
# as IBM Watson Studio or Google Colab and you want to install the required packages. 
# Note: This will install packages to your environment so only run if you need to install 
# these packages.

# Uncomment below cell to install packages
# !pip install ibm_watson spacy text_extensions_for_pandas
In [2]:
# Core Python libraries
import json
import os
import sys
import pandas as pd
from typing import *

# IBM Watson libraries
import ibm_watson
import ibm_watson.natural_language_understanding_v1 as nlu
import ibm_cloud_sdk_core

# SpaCy
import spacy

# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp

Using the Watson Natural Language Understanding Service

In this section, we will set up the Watson NLU service and pass our document through it to obtain various features and outputs.

This section is divided into subsections that set up, connect to, and use the service.

Set up the Watson Natural Language Understanding Service

In this part of the notebook, we will use the Watson Natural Language Understanding (NLU) service to extract key features from our example document.

You can create an instance of Watson NLU on the IBM Cloud for free by navigating to this page and clicking on the button marked "Get started free". You can also install your own instance of Watson NLU on OpenShift by using IBM Watson Natural Language Understanding for IBM Cloud Pak for Data.

You'll need two pieces of information to access your instance of Watson NLU: An API key and a service URL. If you're using Watson NLU on the IBM Cloud, you can find your API key and service URL in the IBM Cloud web UI. Navigate to the resource list and click on your instance of Natural Language Understanding to open the management UI for your service. Then click on the "Manage" tab to show a page with your API key and service URL.

The cell that follows assumes that you are using the environment variables IBM_API_KEY and IBM_SERVICE_URL to store your credentials. If you're running this notebook in Jupyter on your laptop, you can set these environment variables while starting up jupyter notebook or jupyter lab. For example:

IBM_API_KEY='<my API key>' \
IBM_SERVICE_URL='<my service URL>' \
  jupyter lab

Alternately, you can uncomment the first two lines of code below to set the IBM_API_KEY and IBM_SERVICE_URL environment variables directly. Be careful not to store your API key in any publicly-accessible location!

In [3]:
# If you need to embed your credentials inline, uncomment the following two lines and
# paste your credentials in the indicated locations.
# os.environ["IBM_API_KEY"] = "<API key goes here>"
# os.environ["IBM_SERVICE_URL"] = "<Service URL goes here>"

# Retrieve the API key for your Watson NLU service instance
if "IBM_API_KEY" not in os.environ:
    raise ValueError("Expected Watson NLU api key in the environment variable 'IBM_API_KEY'")
api_key = os.environ.get("IBM_API_KEY")
    
# Retrieve the service URL for your Watson NLU service instance
if "IBM_SERVICE_URL" not in os.environ:
    raise ValueError("Expected Watson NLU service URL in the environment variable 'IBM_SERVICE_URL'")
service_url = os.environ.get("IBM_SERVICE_URL")  

Connect to the Watson Natural Language Understanding Python API

This notebook uses the IBM Watson Python SDK to perform authentication on the IBM Cloud via the IAMAuthenticator class. See the IBM Watson Python SDK documentation for more information.

We start by using the API key and service URL from the previous cell to create an instance of the Python API for Watson NLU.

In [4]:
natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(
    version="2019-07-12",
    authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)
)
natural_language_understanding.set_service_url(service_url)
natural_language_understanding
Out[4]:
<ibm_watson.natural_language_understanding_v1.NaturalLanguageUnderstandingV1 at 0x7fcc501455e0>

Pass a Document through the Watson NLU Service

Once you've opened a connection to the Watson NLU service, you can pass documents through the service by invoking the analyze() method.

The example document that we use here is an excerpt from the plot summary for Monty Python and the Holy Grail, drawn from the Wikipedia entry for that movie.

Let's preview what the raw text looks like:

In [5]:
from IPython.core.display import display, HTML
doc_file = "../resources/holy_grail_short.txt"
with open(doc_file, "r") as f:
    doc_text = f.read()
    
display(HTML(f"<b>Document Text:</b><blockquote>{doc_text}</blockquote>"))
Document Text:
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.

Watson Natural Language Understanding can perform multiple kinds of analysis on the example document.

We will be looking at the following:

  • entities (with sentiment)
  • keywords (with sentiment and emotion)
  • relations
  • semantic_roles
  • syntax (with sentences, tokens, and part of speech)

See the Watson NLU documentation for a full description of the types of analysis that NLU can perform.

In [6]:
# Make the request
response = natural_language_understanding.analyze(
    text=doc_text,
    # TODO: Use this URL once we've pushed the shortened document to Github
    #url="https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail_short.txt",
    return_analyzed_text=True,
    features=nlu.Features(
        entities=nlu.EntitiesOptions(sentiment=True),
        keywords=nlu.KeywordsOptions(sentiment=True, emotion=True),
        relations=nlu.RelationsOptions(),
        semantic_roles=nlu.SemanticRolesOptions(),
        syntax=nlu.SyntaxOptions(sentences=True, 
                                 tokens=nlu.SyntaxOptionsTokens(lemma=True, part_of_speech=True))
    )).get_result()

The response from the analyze() method is a Python dictionary. The dictionary contains an entry for each pass of analysis requested, plus some additional entries with metadata about the API request itself. Here's a list of the keys in response:

In [7]:
response.keys()
Out[7]:
dict_keys(['usage', 'syntax', 'semantic_roles', 'relations', 'language', 'keywords', 'entities', 'analyzed_text'])

Text Extensions for Pandas includes a handy function tp.io.watson.nlu.parse_response() that turns the output of Watson NLU's analyze() method into a dictionary of Pandas DataFrames. This makes it much easier to process the output from NLU and perform downstream operations. Let's run the NLU response object through that conversion below.

In [8]:
dfs = tp.io.watson.nlu.parse_response(response)
dfs.keys()
Out[8]:
dict_keys(['syntax', 'entities', 'entity_mentions', 'keywords', 'relations', 'semantic_roles'])

The output of each analysis pass that Watson NLU performed is now a DataFrame. Let's look at the outputs of the "relations" pass. Here's the original output as Python objects:

In [9]:
response["relations"]
Out[9]:
[{'type': 'partOfMany',
  'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.",
  'score': 0.610221,
  'arguments': [{'text': 'Galahad',
    'location': [208, 215],
    'entities': [{'type': 'Person', 'text': 'Galahad'}]},
   {'text': 'their',
    'location': [323, 328],
    'entities': [{'type': 'Person', 'text': 'their'}]}]},
 {'type': 'partOfMany',
  'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.",
  'score': 0.710112,
  'arguments': [{'text': 'Lancelot',
    'location': [266, 274],
    'entities': [{'type': 'Person', 'text': 'Lancelot'}]},
   {'text': 'their',
    'location': [323, 328],
    'entities': [{'type': 'Person', 'text': 'their'}]}]},
 {'type': 'parentOf',
  'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.",
  'score': 0.3821,
  'arguments': [{'text': 'their',
    'location': [323, 328],
    'entities': [{'type': 'Person', 'text': 'their'}]},
   {'text': 'squires',
    'location': [329, 336],
    'entities': [{'type': 'Person', 'text': 'squires'}]}]},
 {'type': 'residesIn',
  'sentence': 'Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place".',
  'score': 0.492869,
  'arguments': [{'text': 'Arthur',
    'location': [362, 368],
    'entities': [{'type': 'Person', 'text': 'King Arthur'}]},
   {'text': 'Camelot',
    'location': [386, 393],
    'entities': [{'type': 'GeopoliticalEntity', 'text': 'Camelot'}]}]},
 {'type': 'locatedAt',
  'sentence': 'Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place".',
  'score': 0.339446,
  'arguments': [{'text': 'men',
    'location': [379, 382],
    'entities': [{'type': 'Person', 'text': 'men'}]},
   {'text': 'Camelot',
    'location': [386, 393],
    'entities': [{'type': 'GeopoliticalEntity', 'text': 'Camelot'}]}]},
 {'type': 'affectedBy',
  'sentence': 'As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.',
  'score': 0.604304,
  'arguments': [{'text': 'them',
    'location': [572, 576],
    'entities': [{'type': 'Person', 'text': 'their'}]},
   {'text': 'speaks',
    'location': [562, 568],
    'entities': [{'type': 'EventCommunication', 'text': 'speaks'}]}]}]

And here's the DataFrame version of the same information:

In [10]:
dfs["relations"]
Out[10]:
type sentence_span score arguments.0.span arguments.1.span arguments.0.entities.type arguments.1.entities.type arguments.0.entities.text arguments.1.entities.text
0 partOfMany [130, 361): 'Along the way, he recruits Sir Be... 0.610221 [208, 215): 'Galahad' [323, 328): 'their' Person Person Galahad their
1 partOfMany [130, 361): 'Along the way, he recruits Sir Be... 0.710112 [266, 274): 'Lancelot' [323, 328): 'their' Person Person Lancelot their
2 parentOf [130, 361): 'Along the way, he recruits Sir Be... 0.382100 [323, 328): 'their' [329, 336): 'squires' Person Person their squires
3 residesIn [362, 512): 'Arthur leads the men to Camelot, ... 0.492869 [362, 368): 'Arthur' [386, 393): 'Camelot' Person GeopoliticalEntity King Arthur Camelot
4 locatedAt [362, 512): 'Arthur leads the men to Camelot, ... 0.339446 [379, 382): 'men' [386, 393): 'Camelot' Person GeopoliticalEntity men Camelot
5 affectedBy [513, 629): 'As they turn away, God (an image ... 0.604304 [572, 576): 'them' [562, 568): 'speaks' Person EventCommunication their speaks

As the output above shows, the information is much more organized and convenient to work with as a DataFrame.

Each row in the DataFrame contains information about a single relationship that Watson Natural Language Understanding identified in our input text. As you can see, Watson NLU returns a lot of information about each relationship. For simplicity, let's focus on three columns:

  • "type": The type of relationship between the two entities
  • "arguments.0.span": Span of characters in the original text where the first entity in the relationship appeared
  • "argmennts.1.span": Span of the second entity in the relationship
In [11]:
relations = dfs["relations"][["type", "arguments.0.span", "arguments.1.span"]].copy()
relations
Out[11]:
type arguments.0.span arguments.1.span
0 partOfMany [208, 215): 'Galahad' [323, 328): 'their'
1 partOfMany [266, 274): 'Lancelot' [323, 328): 'their'
2 parentOf [323, 328): 'their' [329, 336): 'squires'
3 residesIn [362, 368): 'Arthur' [386, 393): 'Camelot'
4 locatedAt [379, 382): 'men' [386, 393): 'Camelot'
5 affectedBy [572, 576): 'them' [562, 568): 'speaks'

Manipulate Span Data

Text Extensions for Pandas uses Pandas extension types to represent spans (regions of a document) and tensors (multi-dimensional arrays). For example, the "arguments.0.span" and "arguments.1.span" columns in the above DataFrame are both stored using the extension type for spans.

Here's the Pandas data type (also known as "dtype") information for the three columns of this DataFrame:

In [12]:
relations.dtypes
Out[12]:
type                   object
arguments.0.span    SpanDtype
arguments.1.span    SpanDtype
dtype: object

Note how the "arguments.0.span" and "arguments.1.span" columns are of dtype SpanDtype. SpanDtype is a Pandas extension type from the Text Extensions for Pandas library. The SpanDtype data type corresponds to two Python classes: Span for scalar values and SpanArray for array values. SpanArray is a subclass of the Pandas ExtensionArray class, which is the base class for custom 1-D array types in Pandas.

You can access the array object behind any Pandas extension type via the pandas.Series.array property:

In [13]:
print(relations["arguments.0.span"].array)
<SpanArray>
[ [208, 215): 'Galahad', [266, 274): 'Lancelot',    [323, 328): 'their',
   [362, 368): 'Arthur',      [379, 382): 'men',     [572, 576): 'them']
Length: 6, dtype: SpanDtype

Extension types support most of the functionality of built-in Pandas types like Int64Dtype and DatetimeTZDtype.

For example, SpanDtype defines the + (also known as __add__()) operation for spans to mean "the shortest span that completely covers both input spans". So we can "add" the contents of the "arguments.0.span" and "arguments.1.span" columns of our DataFrame to obtain a span that covers both arguments, plus the text in between them. The cell below demonstrates a simple + operation with Spans.

In [14]:
relations["context"] = relations["arguments.0.span"] + relations["arguments.1.span"]
relations
Out[14]:
type arguments.0.span arguments.1.span context
0 partOfMany [208, 215): 'Galahad' [323, 328): 'their' [208, 328): 'Galahad the Pure, Sir Robin the N...
1 partOfMany [266, 274): 'Lancelot' [323, 328): 'their' [266, 328): 'Lancelot, and Sir Not-Appearing-i...
2 parentOf [323, 328): 'their' [329, 336): 'squires' [323, 336): 'their squires'
3 residesIn [362, 368): 'Arthur' [386, 393): 'Camelot' [362, 393): 'Arthur leads the men to Camelot'
4 locatedAt [379, 382): 'men' [386, 393): 'Camelot' [379, 393): 'men to Camelot'
5 affectedBy [572, 576): 'them' [562, 568): 'speaks' [562, 576): 'speaks to them'

Take a look at the last row of the DataFrame above: the span in "arguments.0.span" comes after the span in "arguments.1.span", but the "context" column is still correct.
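The "shortest span that completely covers both input spans" semantics can be sketched in plain Python, with (begin, end) tuples standing in for the library's Span objects (a conceptual sketch, not the library's implementation):

```python
# A minimal pure-Python sketch of the span "+" semantics described above.
# Spans are (begin, end) tuples here instead of Span objects.
def combine(a, b):
    """Shortest span that completely covers both input spans."""
    return (min(a[0], b[0]), max(a[1], b[1]))

# Mirrors the last row of the DataFrame above: argument 0 begins *after*
# argument 1, yet the combined span is still correct.
print(combine((572, 576), (562, 568)))  # (562, 576)
```

Because the operation takes the minimum begin and the maximum end, it is symmetric: the order of the two arguments does not matter.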

A SpanArray can also render itself using Jupyter Notebook callbacks. To see the HTML representation of the SpanArray, pass the array object to Jupyter's display() function, or make that object the last line of the cell, as in the following example:

In [15]:
relations["context"].array
Out[15]:
begin end context
0 208 328 Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their
1 266 328 Lancelot, and Sir Not-Appearing-in-this-Film, along with their
2 323 336 their squires
3 362 393 Arthur leads the men to Camelot
4 379 393 men to Camelot
5 562 576 speaks to them

In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.

Your notebook viewer does not support Javascript execution. The above rendering will not be interactive.

This makes it very easy to visually inspect relevant portions of a document and to present any findings.

You can also convert an individual element of the array into a Python object of type Span that represents that single span as a scalar value:

In [16]:
target_span = relations.iloc[0]["arguments.1.span"]
target_span
Out[16]:
[323, 328): 'their'

You can use a Span object to create a Pandas selection condition. For example, we can use a selection condition to select the rows from the relations DataFrame whose second argument's span matches the span we just stored in the variable target_span:

In [17]:
relations[relations["arguments.1.span"] == target_span]
Out[17]:
type arguments.0.span arguments.1.span context
0 partOfMany [208, 215): 'Galahad' [323, 328): 'their' [208, 328): 'Galahad the Pure, Sir Robin the N...
1 partOfMany [266, 274): 'Lancelot' [323, 328): 'their' [266, 328): 'Lancelot, and Sir Not-Appearing-i...

Pandas extension types also support aggregation. Let's use the sum() aggregate to find the portion of the document that includes the context for all the relationships in the above DataFrame.

Recall that Text Extensions for Pandas defines the addition operator for spans as "the shortest span that completely covers both input spans". Similarly, the "sum" of a collection of spans is the shortest span that completely covers all of the spans.

In [18]:
max_context_span = relations[relations["arguments.1.span"] == target_span]["context"].sum()
print(f"""
Span: {str(max_context_span)}
Covered text: "{max_context_span.covered_text}"
""")
Span: [208, 328): 'Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and [...]'
Covered text: "Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their"

Extract Additional Features with SpaCy

With Text Extensions for Pandas, you can use Pandas DataFrames as a common representation for your NLP application's intermediate data, regardless of which NLP library you used to produce that data.

In the cell that follows, we take the text that we just ran through Watson NLU and feed that text through a SpaCy language model. Then we use the make_tokens_and_features() function from Text Extensions for Pandas to convert this output to a Pandas DataFrame of token features.

Before loading the SpaCy language model, download it using the following command: $ python -m spacy download en_core_web_sm

You can also add a line to the cell below to install it inline: !python -m spacy download en_core_web_sm

In [19]:
doc_text = response["analyzed_text"]
spacy_language_model = spacy.load("en_core_web_sm")
token_features = tp.io.spacy.make_tokens_and_features(doc_text, spacy_language_model)
token_features
Out[19]:
id span lemma pos tag dep head shape ent_iob ent_type is_alpha is_stop sentence
0 0 [0, 2): 'In' in ADP IN prep 12 Xx O True True [0, 129): 'In AD 932, King Arthur and his squi...
1 1 [3, 5): 'AD' ad NOUN NN pobj 0 XX B DATE True False [0, 129): 'In AD 932, King Arthur and his squi...
2 2 [6, 9): '932' 932 NUM CD nummod 1 ddd I DATE False False [0, 129): 'In AD 932, King Arthur and his squi...
3 3 [9, 10): ',' , PUNCT , punct 12 , O False False [0, 129): 'In AD 932, King Arthur and his squi...
4 4 [11, 15): 'King' King PROPN NNP compound 5 Xxxx O True False [0, 129): 'In AD 932, King Arthur and his squi...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
142 142 [606, 613): 'finding' find VERB VBG pcomp 141 xxxx O True False [513, 629): 'As they turn away, God (an image ...
143 143 [614, 617): 'the' the DET DT det 145 xxx O True True [513, 629): 'As they turn away, God (an image ...
144 144 [618, 622): 'Holy' Holy PROPN NNP compound 145 Xxxx O True False [513, 629): 'As they turn away, God (an image ...
145 145 [623, 628): 'Grail' Grail PROPN NNP dobj 142 Xxxxx O True False [513, 629): 'As they turn away, God (an image ...
146 146 [628, 629): '.' . PUNCT . punct 133 . O False False [513, 629): 'As they turn away, God (an image ...

147 rows × 13 columns

Recall that, in the previous section of this notebook, we defined a variable max_context_span containing the region of the text that covers the elements of some relationships that Watson Natural Language Understanding identified:

In [20]:
max_context_span
Out[20]:
[208, 328): 'Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and [...]'

Let's identify all the rows of our SpaCy token features DataFrame that overlap with this span of the document.

The SpanArray class has a built-in operation overlaps() for building Pandas selection conditions based on overlapping spans. Here we use overlaps() to filter the token_features DataFrame based on the value of max_context_span:

In [21]:
spacy_context_tokens = token_features[token_features["span"].array.overlaps(max_context_span)]
spacy_context_tokens
Out[21]:
id span lemma pos tag dep head shape ent_iob ent_type is_alpha is_stop sentence
44 44 [208, 215): 'Galahad' Galahad PROPN NNP npadvmod 32 Xxxxx B PERSON True False [130, 235): 'Along the way, he recruits Sir Be...
45 45 [216, 219): 'the' the DET DT det 46 xxx I PERSON True True [130, 235): 'Along the way, he recruits Sir Be...
46 46 [220, 224): 'Pure' Pure PROPN NNP appos 44 Xxxx I PERSON True False [130, 235): 'Along the way, he recruits Sir Be...
47 47 [224, 225): ',' , PUNCT , punct 46 , O False False [130, 235): 'Along the way, he recruits Sir Be...
48 48 [226, 229): 'Sir' Sir PROPN NNP compound 49 Xxx O True False [130, 235): 'Along the way, he recruits Sir Be...
49 49 [230, 235): 'Robin' Robin PROPN NNP appos 46 Xxxxx B PERSON True False [130, 235): 'Along the way, he recruits Sir Be...
50 50 [236, 239): 'the' the DET DT det 57 xxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
51 51 [240, 243): 'Not' not PART RB neg 53 Xxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
52 52 [243, 244): '-' - PUNCT HYPH punct 53 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
53 53 [244, 249): 'Quite' Quite PROPN NNP compound 55 Xxxxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
54 54 [249, 250): '-' - PUNCT HYPH punct 55 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
55 55 [250, 252): 'So' so ADV RB advmod 57 Xx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
56 56 [252, 253): '-' - PUNCT HYPH punct 57 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
57 57 [253, 258): 'Brave' brave NOUN NN ROOT 57 Xxxxx O True False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
58 58 [258, 259): '-' - PUNCT HYPH punct 57 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
59 59 [259, 261): 'as' as ADP IN prep 57 xx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
60 60 [261, 262): '-' - PUNCT HYPH punct 59 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
61 61 [262, 265): 'Sir' Sir PROPN NNP compound 63 Xxx O True False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
62 62 [265, 266): '-' - PUNCT HYPH punct 63 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
63 63 [266, 274): 'Lancelot' Lancelot PROPN NNP pobj 59 Xxxxx O True False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
64 64 [274, 275): ',' , PUNCT , punct 57 , O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
65 65 [276, 279): 'and' and CCONJ CC cc 57 xxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
66 66 [280, 283): 'Sir' Sir PROPN NNP compound 69 Xxx O True False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
67 67 [284, 287): 'Not' not PART RB neg 69 Xxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
68 68 [287, 288): '-' - PUNCT HYPH punct 69 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
69 69 [288, 297): 'Appearing' appearing NOUN NN conj 57 Xxxxx O True False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
70 70 [297, 298): '-' - PUNCT HYPH punct 69 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
71 71 [298, 300): 'in' in ADP IN prep 69 xx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
72 72 [300, 301): '-' - PUNCT HYPH punct 71 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
73 73 [301, 305): 'this' this DET DT det 75 xxxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
74 74 [305, 306): '-' - PUNCT HYPH punct 75 - O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
75 75 [306, 310): 'Film' Film PROPN NNP pobj 71 Xxxx O True False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
76 76 [310, 311): ',' , PUNCT , punct 69 , O False False [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
77 77 [312, 317): 'along' along ADP IN prep 57 xxxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
78 78 [318, 322): 'with' with ADP IN prep 77 xxxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
79 79 [323, 328): 'their' their PRON PRP$ poss 80 xxxx O True True [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...
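The overlap test that overlaps() performs on half-open [begin, end) spans can be sketched in plain Python, with (begin, end) tuples standing in for Span objects (a conceptual sketch of the semantics, not the library's implementation):

```python
# Two half-open intervals [a_begin, a_end) and [b_begin, b_end) overlap
# exactly when each one begins before the other ends.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

max_context = (208, 328)
print(overlaps((208, 215), max_context))  # True:  the 'Galahad' token
print(overlaps((130, 143), max_context))  # False: a token before the context
```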

Combine Outputs of Both Libraries

Notice that the "sentence" column of the SpaCy output in the previous cell contains multiple different values, even though all the tokens are actually from the same sentence. SpaCy's language model has incorrectly split this sentence into multiple smaller sentences. We can use pandas.DataFrame.drop_duplicates() to show exactly which sentence fragments are present in this slice of the SpaCy output:

In [22]:
spacy_context_tokens[["sentence"]].drop_duplicates()
Out[22]:
sentence
44 [130, 235): 'Along the way, he recruits Sir Be...
50 [236, 361): 'the Not-Quite-So-Brave-as-Sir-Lan...

Alternately, we can drill down to the SpanArray object to show a HTML representation of these sentence fragments in context:

In [23]:
spacy_context_tokens["sentence"].array.unique()
Out[23]:
begin end begin token end token context
0 130 235 27 50 Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin
1 236 361 50 86 the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.

In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.

Your notebook viewer does not support Javascript execution. The above rendering will not be interactive.

The sentence identification component of Watson Natural Language Understanding does a better job than SpaCy on this region of the document.

Earlier in this notebook, we created a Python dictionary dfs, where each item in the dictionary is a DataFrame. The DataFrame under the key "syntax" holds the output of Watson NLU's syntax analysis, which includes sentence information. Let's extract the section of this Watson NLU output that matches our target span.

In [24]:
watson_syntax = dfs["syntax"]
watson_context_tokens = watson_syntax[watson_syntax["span"].array.overlaps(max_context_span)]
watson_context_tokens
Out[24]:
span part_of_speech lemma sentence
44 [208, 215): 'Galahad' PROPN None [130, 361): 'Along the way, he recruits Sir Be...
45 [216, 219): 'the' DET the [130, 361): 'Along the way, he recruits Sir Be...
46 [220, 224): 'Pure' PROPN None [130, 361): 'Along the way, he recruits Sir Be...
47 [224, 225): ',' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
48 [226, 229): 'Sir' PROPN Sir [130, 361): 'Along the way, he recruits Sir Be...
49 [230, 235): 'Robin' PROPN Robin [130, 361): 'Along the way, he recruits Sir Be...
50 [236, 239): 'the' DET the [130, 361): 'Along the way, he recruits Sir Be...
51 [240, 243): 'Not' ADV not [130, 361): 'Along the way, he recruits Sir Be...
52 [243, 244): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
53 [244, 249): 'Quite' PROPN None [130, 361): 'Along the way, he recruits Sir Be...
54 [249, 250): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
55 [250, 252): 'So' ADV so [130, 361): 'Along the way, he recruits Sir Be...
56 [252, 253): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
57 [253, 258): 'Brave' ADJ brave [130, 361): 'Along the way, he recruits Sir Be...
58 [258, 259): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
59 [259, 261): 'as' ADP as [130, 361): 'Along the way, he recruits Sir Be...
60 [261, 262): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
61 [262, 265): 'Sir' PROPN Sir [130, 361): 'Along the way, he recruits Sir Be...
62 [265, 266): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
63 [266, 274): 'Lancelot' PROPN None [130, 361): 'Along the way, he recruits Sir Be...
64 [274, 275): ',' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
65 [276, 279): 'and' CCONJ and [130, 361): 'Along the way, he recruits Sir Be...
66 [280, 283): 'Sir' PROPN Sir [130, 361): 'Along the way, he recruits Sir Be...
67 [284, 287): 'Not' ADV not [130, 361): 'Along the way, he recruits Sir Be...
68 [287, 288): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
69 [288, 297): 'Appearing' PROPN None [130, 361): 'Along the way, he recruits Sir Be...
70 [297, 298): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
71 [298, 300): 'in' ADP in [130, 361): 'Along the way, he recruits Sir Be...
72 [300, 301): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
73 [301, 305): 'this' PRON this [130, 361): 'Along the way, he recruits Sir Be...
74 [305, 306): '-' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
75 [306, 310): 'Film' PROPN Film [130, 361): 'Along the way, he recruits Sir Be...
76 [310, 311): ',' PUNCT None [130, 361): 'Along the way, he recruits Sir Be...
77 [312, 317): 'along' ADP along [130, 361): 'Along the way, he recruits Sir Be...
78 [318, 322): 'with' ADP with [130, 361): 'Along the way, he recruits Sir Be...
79 [323, 328): 'their' PRON their [130, 361): 'Along the way, he recruits Sir Be...

The Watson NLU output correctly maps every one of these tokens to the same sentence, and the sentence's span is correct:

In [25]:
watson_context_tokens["sentence"].unique()
Out[25]:
begin end begin_token end_token context
0 130 361 27 86 Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.

In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.


Let's create a DataFrame of token metadata that combines the higher-quality sentence information from Watson NLU with the token features from SpaCy.

In [26]:
context_tokens = spacy_context_tokens.copy()  # Make a copy so we can modify the copy
context_tokens["sentence"] = watson_context_tokens["sentence"].copy()
context_tokens.head(10)  # Show first 10 rows
Out[26]:
id span lemma pos tag dep head shape ent_iob ent_type is_alpha is_stop sentence
44 44 [208, 215): 'Galahad' Galahad PROPN NNP npadvmod 32 Xxxxx B PERSON True False [130, 361): 'Along the way, he recruits Sir Be...
45 45 [216, 219): 'the' the DET DT det 46 xxx I PERSON True True [130, 361): 'Along the way, he recruits Sir Be...
46 46 [220, 224): 'Pure' Pure PROPN NNP appos 44 Xxxx I PERSON True False [130, 361): 'Along the way, he recruits Sir Be...
47 47 [224, 225): ',' , PUNCT , punct 46 , O False False [130, 361): 'Along the way, he recruits Sir Be...
48 48 [226, 229): 'Sir' Sir PROPN NNP compound 49 Xxx O True False [130, 361): 'Along the way, he recruits Sir Be...
49 49 [230, 235): 'Robin' Robin PROPN NNP appos 46 Xxxxx B PERSON True False [130, 361): 'Along the way, he recruits Sir Be...
50 50 [236, 239): 'the' the DET DT det 57 xxx O True True [130, 361): 'Along the way, he recruits Sir Be...
51 51 [240, 243): 'Not' not PART RB neg 53 Xxx O True True [130, 361): 'Along the way, he recruits Sir Be...
52 52 [243, 244): '-' - PUNCT HYPH punct 53 - O False False [130, 361): 'Along the way, he recruits Sir Be...
53 53 [244, 249): 'Quite' Quite PROPN NNP compound 55 Xxxxx O True True [130, 361): 'Along the way, he recruits Sir Be...

The columns "head", "id", and "dep" of the SpaCy features map the tokens to nodes of the sentence's dependency parse. Specifically:

  • the "id" column gives each token an integer ID
  • the "head" column indicates the ID of the parent, or head, token of each token in the parse tree
  • the "dep" column indicates the type of relationship between each parent-child pair in the parse tree

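To make the semantics of these three columns concrete, here is a minimal sketch using a hypothetical three-token DataFrame (toy data, not drawn from the document above). It joins each token to its head token by matching the "head" column against the "id" column, which is exactly how the parse-tree edges are encoded:

```python
import pandas as pd

# Toy token table with the parse-tree columns described above.
# By convention, the root token's "head" points to itself.
tokens = pd.DataFrame({
    "id":   [0, 1, 2],
    "text": ["Sir", "Robin", "recruits"],
    "head": [1, 2, 2],
    "dep":  ["compound", "nsubj", "ROOT"],
})

# Resolve each token's head text via a self-join on "head" -> "id"
with_heads = tokens.merge(
    tokens[["id", "text"]].rename(columns={"id": "head", "text": "head_text"}),
    on="head",
)

print(with_heads[["text", "dep", "head_text"]])
```

Each row now shows a parent-child edge of the parse tree: "Sir" is a compound modifier of "Robin", "Robin" is the subject of "recruits", and "recruits" is the root.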
Text Extensions for Pandas includes a function render_parse_tree() that displays parse trees using displaCy. Let's use render_parse_tree() to render the SpaCy parse tree information for the tokens in our DataFrame:

In [27]:
tp.io.spacy.render_parse_tree(spacy_context_tokens)
(displaCy rendering of the dependency parse tree for the tokens above, showing each token's tag and the labeled arcs between heads and dependents; interactive when run in a live notebook)

Conclusion

In this notebook we demonstrated how Text Extensions for Pandas can be used to perform various NLP tasks. We started by loading our document and passing it through the Watson NLU service, extracting various entities and relations. We used Text Extensions for Pandas to manipulate the span data and visualize some of our findings. We then passed the document through a SpaCy language model, which gave us additional features such as part-of-speech tags. Finally, we combined the results from both libraries and rendered a parse tree.

This notebook also demonstrates how easily Text Extensions for Pandas interoperates with other popular packages such as SpaCy, Pandas, and IBM Watson NLU.

In [ ]: