Suppose our goal is to have Python read a sentence and extract some content from it. The most common application is sentiment analysis, wherein Python scans over a sentence and tells us whether the sentence has a particular sentiment (e.g. "good" or "bad").
For example:
"We had an awful quarter, sales have been terrible."
has a negative tone. Python can detect this tone by being fed a list of negative words (which would include "awful" and "terrible") and then finding those words in the example sentence. This application is fairly straightforward; the sample code below tells us the sentence is 100% negative.
# example sentence
sentence = "We had an awful quarter, sales have been terrible."
# example tone lists (real lists would be much longer than these)
positive_words = ["great", "tremendous", "amazing"]
negative_words = ["awful", "terrible", "horrific"]
# tone = num. neg. words / (num. neg. words + num. pos. words)
# strip trailing punctuation so that e.g. "terrible." matches "terrible"
num_pos = len([word for word in sentence.split() if word.strip('.,!?') in positive_words])
num_neg = len([word for word in sentence.split() if word.strip('.,!?') in negative_words])
tone = num_neg / (num_neg + num_pos)
print(tone)
1.0
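The split()-based matching above is brittle: punctuation clings to tokens and case differences can hide matches. A small helper along the following lines makes the scoring reusable (tone_score is a name invented for this sketch; the regex tokenizer is one of several reasonable choices):

```python
import re

positive_words = ["great", "tremendous", "amazing"]
negative_words = ["awful", "terrible", "horrific"]

def tone_score(sentence, positive_words, negative_words):
    """Fraction of tone words that are negative (1.0 = fully negative)."""
    # lowercase and tokenize on letters so "terrible." matches "terrible"
    tokens = re.findall(r"[a-z']+", sentence.lower())
    num_pos = sum(t in positive_words for t in tokens)
    num_neg = sum(t in negative_words for t in tokens)
    return num_neg / (num_neg + num_pos)

print(tone_score("We had an awful quarter, sales have been terrible.",
                 positive_words, negative_words))
# 1.0
```

A real implementation should also handle sentences containing no tone words at all, where the denominator above would be zero.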
We can go deeper than this. Python has modules that allow us to unpack the grammar of a sentence. By doing so, we can look for more specific types of content. Here, we'll consider a search for news articles that report management issued guidance.
To begin, consider an obvious instance of management guidance:
"XYZ announced that earnings will increase this year."
No sentence could be more plain than this. XYZ, the hypothetical company in the example above, announces that earnings are expected to increase this year. Because the topic ("earnings") pertains to a future period ("this year") rather than a prior period (e.g. "last quarter"), the statement is forward-looking.
The task for finding management issued guidance can be broken down into three parts: (1) find earnings-related and forward-looking language in a sentence, (2) verify that the forward-looking language applies to the earnings-related topic, and (3) determine that the speaker is management.
Let's start with task (1). Given a sentence
sent = 'XYZ announced that earnings will increase this year.'
Begin by looking for earnings-related words:
# list of financial words/phrases, the full list could be much longer
earnings_words = ['earnings', 'profitability', 'dollars per share']
# scan over earnings_words and check whether these words appear in the sentence of interest
[w in sent for w in earnings_words]
[True, False, False]
Over the three words in the list earnings_words, the first of these ("earnings") appears in the sentence.
Next look for forward-looking language:
# list of forward-looking words, the full list could be much longer
forward_words = ['forecasted', 'estimated', 'will', 'expected']
# scan over forward_words and check whether these words appear in the sentence of interest
[w in sent for w in forward_words]
[False, False, True, False]
Over the four words in forward_words, the third of these ("will") appears in the sentence.
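The two checks can be folded into one helper that returns the matching words themselves rather than booleans (find_matches is a name invented for this sketch). Substring matching has the advantage of handling multi-word phrases like "dollars per share", but it can also over-match (e.g. "will" inside "willing"), so a production version should match on token boundaries:

```python
sent = 'XYZ announced that earnings will increase this year.'
earnings_words = ['earnings', 'profitability', 'dollars per share']
forward_words = ['forecasted', 'estimated', 'will', 'expected']

def find_matches(sentence, word_list):
    """Return the entries of word_list that occur as substrings of sentence."""
    return [w for w in word_list if w in sentence]

print(find_matches(sent, earnings_words))  # ['earnings']
print(find_matches(sent, forward_words))   # ['will']
```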
We must be careful that the forward-looking language is being applied to the earnings-related word, rather than elsewhere in the sentence. For example, in the sentence below, the earnings word ("earnings") is in a separate and independent clause from the forward word ("will").
bad_sent = '''XYZ stated that although earnings had fallen last year,
the board remained confident in how the new CEO will manage the company.'''
To ensure that the forward-looking word and the earnings-related word are connected in the sentence, the grammar of the sentence must be considered.
To do this, one can run the sentence through spaCy to analyze the text.
Version warning: for compatibility with a module discussed later, I'm using spacy version 2.1.0 here. This is an ancient (~ June 2019) version of the module that happened to erroneously ignore the jupyter=False flag. This was fixed in later versions along the 2.1.x chain. If you want to save rendered grammar maps in version 2.1.0, correct the spacy/displacy/__init__.py file in your site-packages.
# load spaCy module
import spacy
# pass the sentence through spaCy's text-processing pipeline
nlp = spacy.load("en_core_web_lg")
doc = nlp(sent)
# display the grammar of the sentence
svg = spacy.displacy.render(doc,
                            style="dep",                 # show the dependency structure
                            options={'distance': 110,    # make the output smaller
                                     'collapse_phrases': True},  # collapse noun phrases
                            jupyter=False)               # disable Jupyter auto-rendering (return the rendered SVG)
with open('assets/dependency_map.svg', 'w', encoding='utf-8') as fout:
    fout.write(svg)
All words have a part of speech (e.g. VERB, NOUN) as well as a dependency. For example, "XYZ" is a proper noun and is the subject (dependency type) for the verb "announced" (the dependency word).
We can access all of this information from the doc
object returned from nlp()
.
for w in doc:
print(w.text, w.pos_, w.dep_, w.head.text)
XYZ PROPN nsubj announced
announced VERB ROOT announced
that ADP mark increase
earnings NOUN nsubj increase
will VERB aux increase
increase VERB ccomp announced
this DET det year
year NOUN npadvmod increase
. PUNCT punct announced
One simple way to verify that the earnings-related word and the forward-looking word are discussing the same component of a sentence is to ensure that the two words share the same verb. This ignores more complicated sentence structures, and additional checks should be added to the code.
The verb for the earnings-related word is found:
e_words = [w for w in doc if w.text in earnings_words]
def get_verb(w):
    # walk up the dependency tree until we reach a (non-auxiliary) verb
    h = w
    while not (h.pos_ == 'VERB' and h.dep_ != 'aux'):
        if h.head == h:  # reached the root without finding a verb
            break
        h = h.head
    return h
e_verbs = {w: get_verb(w) for w in e_words}
for e, v in e_verbs.items():
    print(e.text, v.text)
earnings increase
The verb for the forward-looking word is similarly found:
f_words = [w for w in doc if w.text in forward_words]
f_verbs = {w: get_verb(w) for w in f_words}
for f, v in f_verbs.items():
    print(f.text, v.text)
will increase
Because "earnings" (the earnings-related word) and "will" (the forward-looking word) share the verb "increase", we can understand that the forward-looking language is being used to discuss the earnings-related topic.
Note that we ignored verbs with dependency "aux" in the above. Auxiliary verbs modify other verbs; they are not the principal verb of the subject-verb pair that we are looking for. However, auxiliary verbs are important because they help us verify forward-looking language. English does not have a formal future tense. Rather, future actions are indicated by auxiliary phrases. For instance, "this year's earnings increase" is in the present tense, whereas "next year's earnings will increase" is forward-looking. In the latter case, the verb "increase" is modified by the auxiliary verb "will". Auxiliary verbs do not always indicate a future tense; their presence is more nuanced. For example:
sent1 = 'XYZ had expected earnings to increase last year.'
sent2 = 'XYZ expected earnings to increase next year.'
In sent1, "had" modifies "expected" to place it in the past tense. In sent2, the lack of an auxiliary modifier on "expected" leaves it in the present tense; because "expected" is understood to be about future events, we know that the present tense of this word discusses future events.
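To make the auxiliary-based tense check concrete without rerunning the pipeline, here is a sketch using a minimal stand-in for spaCy tokens (the Tok class and is_future helper are inventions for this example; on real spaCy tokens the same test runs over token.children):

```python
FUTURE_AUX = {'will', 'shall', "'ll"}

class Tok:
    """Minimal stand-in for a spaCy token: text, dependency, children."""
    def __init__(self, text, dep_):
        self.text, self.dep_, self.children = text, dep_, []
    def add(self, child):
        self.children.append(child)
        return self

def is_future(verb):
    """Heuristic: a verb is forward-looking if it carries a future auxiliary."""
    return any(c.dep_ == 'aux' and c.text.lower() in FUTURE_AUX
               for c in verb.children)

# "earnings will increase" -- the auxiliary "will" marks the future
print(is_future(Tok('increase', 'ccomp').add(Tok('will', 'aux'))))   # True
# "earnings had increased" -- the auxiliary "had" marks the past
print(is_future(Tok('increased', 'ccomp').add(Tok('had', 'aux'))))   # False
```

This is only a heuristic: as discussed above, a bare present-tense "expected" can also describe future events, so a real parser needs additional rules.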
What remains is to determine whether the forward-looking statement about an earnings-related topic is being given by management. We don't, for instance, wish to include forecasts made by analysts. To determine the speaker in the sentence, we need to find other subjects in the sentence. The word "earnings" in sent is the subject for "increase" whereas the noun phrase "XYZ" is the subject for "announced". These two verbs are linked together (the former is a clausal complement of the latter). We begin by mapping each verb to a subject:
def get_subjMap(doc):
    # map each verb to its nominal subject
    subj_map = {}
    for s in doc.sents:
        for w in s:
            if w.dep_ == 'nsubj':
                subj_map.update({w.head: w})
    return subj_map
subj_map = get_subjMap(doc)
for v, w in subj_map.items():
    print(w, v)
XYZ announced
earnings increase
Then, starting at the verb we discovered earlier (and saved in e_verbs), we look for related subject-verb phrases.
for e, v in e_verbs.items():
    subj = subj_map[v.head]
    print(e, subj)
earnings XYZ
This gives confirmation that the agent doing the forecasting is XYZ.
What about instances in which it is not immediately clear from the subject of the sentence what the affiliation of the speaker is? For example:
para = '''
XYZ announced strong results for the quarter.
Alice Smith, CEO of XYZ, remains optimistic.
Bob Johnson, an analyst covering XYZ pressured Smith for details on the latest earnings call.
Smith stated that she expected earnings growth over the next year.
'''
It is the last sentence that has a forecast. However, the subject doing the forecasting is "Smith". Absent any other context, it is unclear from that sentence alone whether "Smith" is affiliated with the company. Note that her affiliation is clarified two sentences earlier.
Because we've expanded the text to contain multiple sentences, before going any further let's define a function to check each sentence for the information we've thus far been able to extract. If the function finds a forward-looking statement about an earnings-related item, it should return: (1) the earnings-related word, (2) the forward-looking word, (3) the shared verb, and (4) the sentence in which these were found.
A sentence may have multiple instances of items (1)-(3), so the function should be structured to return a list of those instances as well as item (4).
docp = nlp(para)
def find_sentence(doc):
    return_items = {}
    for s in doc.sents:
        # look for earnings-related words
        ep_words = [w for w in s if w.text in earnings_words]
        ep_verbs = {w: get_verb(w) for w in ep_words}
        # look for forward-looking words
        fp_words = [w for w in s if w.text in forward_words]
        fp_verbs = {w: get_verb(w) for w in fp_words}
        # verify that the forward and earnings word match
        for e, ev in ep_verbs.items():
            for f, fv in fp_verbs.items():
                if ev == fv:
                    if s not in return_items:
                        return_items.update({s: [[e.text, f.text, ev]]})
                    else:
                        return_items[s].append([e.text, f.text, ev])
    return return_items
found_sentences = find_sentence(docp)
for sentence, instances in found_sentences.items():
    print(sentence)
    for instance in instances:
        print('\t', instance)
Smith stated that she expected earnings growth over the next year.
     ['earnings', 'expected', expected]
The map of subject-verb pairs in the paragraph is given by:
subj_map = get_subjMap(docp)
for v, w in subj_map.items():
    print(w, v)
XYZ announced
Smith remains
Johnson pressured
Smith stated
she expected
And so if we go looking for the speaker in our forecast sentence:
for instances in found_sentences.values():
    for instance in instances:
        e, f, v = instance
        subj = subj_map[v]
        print(e, f, v, subj)
earnings expected expected she
We find that the speaker is simply "she".
To figure out who the "she" refers to, utilize a co-reference tool. The tool is in the neuralcoref module and can be added to a spacy pipeline. (Technical note: neuralcoref requires spacy==2.1.0, though a version for spacy 3+ is in development.)
import neuralcoref
# create a new spacy pipeline
nlp2 = spacy.load('en_core_web_lg')
# add neuralcoref to this pipeline
neuralcoref.add_to_pipe(nlp2)
<spacy.lang.en.English at 0x2055f8c2f48>
Now, when we pass the paragraph to spacy, the output model includes a list of coreference clusters.
docp2 = nlp2(para)
for item in docp2._.coref_clusters:
    print(item.main, item.mentions)
XYZ [XYZ, XYZ]
Smith [Alice Smith, CEO of XYZ, Smith, Smith, she]
The second coreference cluster shows us that the "she" we're interested in is in the same co-reference cluster as "Alice Smith", indicating that the "she" refers to "Alice Smith". Also within this co-reference cluster is the phrase "CEO of XYZ". Given that "XYZ" is the company we are interested in, we can usually deduce that the "she" is representing XYZ.
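That deduction can be automated with a simple check over the text of the speaker's coreference mentions (speaker_is_management is a hypothetical helper invented for this sketch; real code would work on the mention spans' .text and also look for titles like "CEO" or "CFO"):

```python
def speaker_is_management(mentions, company):
    """Treat the speaker as management if any coreference mention of the
    speaker names the company, e.g. "Alice Smith, CEO of XYZ"."""
    return any(company in mention for mention in mentions)

# mention texts from the "Smith" cluster found above
mentions = ['Alice Smith, CEO of XYZ', 'Smith', 'Smith', 'she']
print(speaker_is_management(mentions, 'XYZ'))  # True
print(speaker_is_management(mentions, 'ABC'))  # False
```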
found_sentences2 = find_sentence(docp2)
subj_map2 = get_subjMap(docp2)
for instances in found_sentences2.values():
    for instance in instances:
        e, f, v = instance
        subj = subj_map2[v]
        print(e, f, v, subj, subj._.coref_clusters)
earnings expected expected she [Smith: [Alice Smith, CEO of XYZ, Smith, Smith, she]]
Hence, with a little bit of grammar-parsing, it is possible to find reports of management issued guidance in a news article. Obviously, the English language can be far more complex than what's shown above. A fully developed text-parser will need to consider a much richer array of problems (a text-parser I built for this sort of project needed about 800 lines of Python code just to read over the document and check various grammatical constructs). However, it's nice to see what Python can do in this simplified example.