A Basic Guide To Creating Your First Text Feature

In this notebook we'll create a feature that identifies which news articles contain references to a certain semantic idea. We could do this for any idea or concept, e.g. resignation, war, earthquakes, coronavirus! For the sake of this example we're going to pick just one topic, 'war in the Middle East', which we hypothesise might be predictive of oil prices and therefore of oil stock prices.

Let's Begin

First of all we need to capture the countries we're interested in. So, here's a list of Middle Eastern countries:

In [1]:

countries_list = ['Bahrain', 'Qatar', 'Saudi Arabia', 'United Arab Emirates', 'UAE', 'Yemen', 'Egypt', 'Oman', 'Iran','Iraq', 'Israel', 'Jordan', 'Kuwait', 'Lebanon', 'Palestine', 'Syria', 'Syrian Arab Republic']

The next step is to capture the idea of war. This is probably the hardest part to code, as people rarely convey ideas in a consistent or black-and-white manner. Whilst we might just say 'There is a war in the Middle East', we could also say 'There is an uprising in Yemen' or 'Armed forces forced to act against militias in rural Iraq' (no direct mention of war), or even 'ISIS take key foothold in Aleppo' (no direct mention of war or even soldiers).
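To make the problem concrete, here is a minimal sketch of the most naive approach: checking whether the literal word 'war' appears in each of the example sentences above. The helper name `mentions_war` is just an illustration, not part of any library.

```python
# The four example sentences from the paragraph above
headlines = [
    "There is a war in the middle east",
    "There is an uprising in Yemen",
    "Armed forces forced to act against militias in rural Iraq",
    "ISIS take key foothold in Aleppo",
]

def mentions_war(text):
    """Return True if the exact word 'war' appears in the text (hypothetical helper)."""
    return "war" in text.lower().split()

matches = [mentions_war(headline) for headline in headlines]
print(matches)  # [True, False, False, False]
```

Only the first sentence is caught, even though all four describe conflict, which is why we need something richer than a single keyword.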

Your ability to create models that can successfully capture these nuances in language will largely define how effective this type of approach will be. In this example we're just going to scratch the surface. Once you're happy with the basics you should do your own research and thinking to come up with ways to improve this approach. Now let's get a few words related to war and disturbance:

In [2]:

# WordNet gives us synonyms for our hotwords.
# The corpus must be downloaded once first: nltk.download('wordnet')
from nltk.corpus import wordnet

In [12]:

hotwords = ['war', 'insurgency', 'revolt', 'coup', 'massacre']
synonyms = []

# Collect every lemma name from every WordNet synset of each hotword
for word in hotwords:
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())

In [14]:

set(synonyms)

Out[14]:

{'butchery',
 'carnage',
 'churn_up',
 'coup',
 "coup_d'etat",
 'disgust',
 'gross_out',
 'insurgence',
 'insurgency',
 'insurrection',
 'mass_murder',
 'massacre',
 'mow_down',
 'nauseate',
 'putsch',
 'rebellion',
 'repel',
 'revolt',
 'rising',
 'sicken',
 'slaughter',
 'state_of_war',
 'takeover',
 'uprising',
 'war',
 'warfare'}
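Notice that WordNet joins multi-word lemmas with underscores ('state_of_war', 'mass_murder'). If these names are later compared against individual words in an article, the underscored entries are unlikely to ever match. One possible clean-up, sketched here with a hypothetical helper `split_synonyms` and a hand-copied sample of the output above, is to separate single words from phrases:

```python
# A small sample of the lemma names shown in the output above
raw_synonyms = {"war", "warfare", "state_of_war", "mass_murder", "coup_d'etat", "uprising"}

def split_synonyms(names):
    """Split lemma names into single words and multi-word phrases (hypothetical helper).

    Underscores are replaced with spaces and everything is lower-cased.
    """
    single_words, phrases = set(), set()
    for name in names:
        cleaned = name.replace("_", " ").lower()
        if " " in cleaned:
            phrases.add(cleaned)
        else:
            single_words.add(cleaned)
    return single_words, phrases

single_words, phrases = split_synonyms(raw_synonyms)
print(sorted(single_words))  # ['uprising', 'war', 'warfare']
print(sorted(phrases))       # ["coup d'etat", 'mass murder', 'state of war']
```

The single words can be matched against individual words directly, while the phrases would need a separate check against the raw text. It is also worth reviewing the list by hand: entries such as 'disgust' and 'nauseate' come from the verb sense of 'revolt' and have little to do with war.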

To make it easy for us to work with this data, we are going to use the spaCy library. Specifically, we're going to use it to speed up extracting locations from our news strings. We're also using just two sample strings (news1 and news2) to keep things simple in this notebook.

In [16]:

import spacy

news1 = 'With the backing of Iran and Russia, Hezbollah successfully supported the Syrian Arab Army (SAA) in largely putting down the armed insurgency.'
news2 = 'Iran\'s Voters Send a Clear Message to the Regime'
nlp = spacy.load('en_core_web_md')

docs = []
for news in [news1, news2]:
    doc = nlp(news)
    docs.append(doc)

Now let's create a simple function that returns 1 if an article contains both a country from our Middle East list and a war-related word from our synonyms, and 0 otherwise. By making it a function we can reuse it for several different combinations of countries and synonyms (e.g. Financial Crisis in Europe, Trump/Elections in America etc.):

In [40]:

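def featurize(doc, set_countries, set_words):
    """Return 1 if doc mentions a listed country and a hotword, else 0."""
    # Collect the geopolitical entities (countries, cities, states) spaCy found
    places = []
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            places.append(ent.text)
    # No country of interest mentioned: the feature is off
    if len(set_countries & set(places)) == 0:
        return 0

    # Fire as soon as any token's lemma is one of our hotwords
    for token in doc:
        if token.lemma_ in set_words:
            return 1
    return 0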

Once we're happy with this function we can then use it to check for articles containing references to the idea of 'war' in the Middle East:

In [41]:

featureval = [ featurize(doc, set(countries_list), set(synonyms)) for doc in docs]

Here is the result for the two example pieces of news above:

In [42]:

featureval

Out[42]:

[1, 0]

Done! Our simple feature is able to capture instances of warlike situations being reported about Middle Eastern countries.

Obviously this is pretty naive and too simple to capture the whole context around a story. There's a lot more you can and should do:

Instead of finding synonyms you can look at the similarities between word embeddings (between the hotwords and the tokens in the news text), as the context of a word can drastically change its meaning (what do the following words mean: left, play, bear, kind, lie, read etc.?)
Currently our feature is binary; you can try to make it a continuous float value based on the number and types of occurrences.

and so much more....
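As one illustration of the continuous-value idea above, here is a rough sketch that scores an article by the share of its words that are hotwords, provided a country of interest is mentioned. The function name `war_score` and its inputs are assumptions made for this example: the lemmas and places are written by hand here, whereas in the notebook they would come from the spaCy doc.

```python
def war_score(lemmas, places, set_countries, set_words):
    """Return the fraction of lemmas that are hotwords, or 0.0 if no
    country of interest is mentioned (hypothetical scoring scheme)."""
    if not lemmas or not (set_countries & set(places)):
        return 0.0
    hits = sum(1 for lemma in lemmas if lemma.lower() in set_words)
    return hits / len(lemmas)

# Hand-written stand-ins for what spaCy would extract from an article
lemmas = ["the", "army", "put", "down", "the", "insurgency"]
score_in_region = war_score(lemmas, ["Iran"], {"Iran", "Iraq"}, {"war", "insurgency"})
score_elsewhere = war_score(lemmas, ["France"], {"Iran", "Iraq"}, {"war", "insurgency"})
print(score_in_region)  # one hotword out of six lemmas
print(score_elsewhere)  # 0.0
```

Dividing by the article length is only one choice; you might instead use raw counts, or weight some hotwords more heavily than others.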

You can work out logic to find any features that you think might be relevant to price change and test your hypothesis. Some possible features might be:

News related to Russia or Putin
Stories related to trade war between US and China
Any large-scale pandemics...

Or in more detail, here are just a few of the things that affected oil prices during the last decade:

Japan earthquake and nuclear spill
Arab Spring
Middle Eastern Tension
Eurozone Crisis / Greek Bailout
Iran Tensions
Refinery Fires

There are plenty more things than just this list, so do a bit of research. What about the other indexes?
