This Notebook explores various tools for analysing and comparing texts at the corpus level. As such, these are your first ventures into "macro-analysis" with Python. The methods described here are particularly powerful in combination with the techniques for content selection explained in Notebook 5 Corpus Creation.
More specifically, we will have a closer look at:
- Keyword-in-Context (KWIC) analysis with concordances
- collocations
- finding distinctive words that characterize a corpus
Computers are excellent at indexing, organizing and retrieving information. However, interpreting information (especially natural language) is still a difficult task. Keyword-in-Context (KWIC) analysis brings together the best of both worlds: the retrieval power of machines with the close-reading skills of the historian. KWIC (or concordance) centres a corpus on a specific query term, showing n words (or characters) to the left and the right.
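As a minimal sketch of the idea (using a made-up sentence rather than the corpus we build below), NLTK's concordance method does exactly this:
import nltk # natural language toolkit
from nltk.tokenize import wordpunct_tokenize # simple rule-based tokenizer
# toy example: wrap a list of tokens in an nltk Text object and print a concordance for one query term
tokens = wordpunct_tokenize("the poor state of housing and the poor themselves worried the poor law officers")
nltk.text.Text(tokens).concordance("poor", width=40)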
In this section, we investigate reports of the London Medical Officers of Health, collected in the London's Pulse corpus.
The reports were produced each year by the Medical Officer of Health (MOH) of a district and set out the work done by his public health and sanitary officers. The reports provided vital data on birth and death rates, infant mortality, incidence of infectious and other diseases, and a general statement on the health of the population.
Source: https://wellcomelibrary.org/moh/about-the-reports/about-the-medical-officer-of-health-reports/
We start by importing the necessary libraries. Some of the code is explained in previous Notebooks, so we won't discuss it in detail here.
The tools we need are:
- nltk: the Natural Language Toolkit, for tokenization and concordances
- pathlib: a library for managing files and folders
import nltk # import natural language toolkit
nltk.download('stopwords')
from pathlib import Path # import Path object from pathlib
from nltk.tokenize import wordpunct_tokenize # import word_tokenize function from nltk.tokenize
[nltk_data] Downloading package stopwords to [nltk_data] /Users/kbeelen/nltk_data... [nltk_data] Package stopwords is already up-to-date!
!ls data/MOH/ # list all files in data/MOH/
antconc python python.zip
# in case you unzipped data before
!rm -r data/MOH/python
!unzip data/MOH/python.zip -d data/MOH/
Archive: data/MOH/python.zip creating: data/MOH/python/ inflating: data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt inflating: data/MOH/python/CityofWestminster.1932.b18247945.txt inflating: data/MOH/python/CityofWestminster.1921.b18247830.txt inflating: data/MOH/python/PoplarandBromley.1900.b18245754.txt inflating: data/MOH/python/Poplar.1919.b18120878.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt inflating: data/MOH/python/CityofWestminster.1907.b18247726.txt inflating: data/MOH/python/CityofWestminster.1906.b18247714.txt inflating: data/MOH/python/CityofWestminster.1903.b18247684.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1903.b1824578x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1938.b18246102.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1960.b18246321.txt inflating: data/MOH/python/CityofWestminster.1920.b18247829.txt inflating: data/MOH/python/CityofWestminster.1945.b1824807x.txt inflating: data/MOH/python/CityofWestminster.1904.b18247696.txt inflating: data/MOH/python/Westminster.1898.b19874340.txt inflating: data/MOH/python/Westminster.1900.b19823228.txt inflating: data/MOH/python/CityofWestminster.1951.b18248135.txt inflating: data/MOH/python/CityofWestminster.1902.b18247672.txt inflating: data/MOH/python/CityofWestminster.1905.b18247702.txt inflating: data/MOH/python/Poplar.1894.b17999157.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1930.b18246023.txt inflating: data/MOH/python/CityofWestminster.1942.b18248044.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1907.b18245821.txt inflating: data/MOH/python/CityofWestminster.1928.b18247908.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1943.b18246151.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1959.b1824631x.txt inflating: data/MOH/python/CityofWestminster.1959.b18248214.txt inflating: data/MOH/python/CityofWestminster.1936.b18247982.txt inflating: data/MOH/python/CityofWestminster.1901.b18247660.txt inflating: data/MOH/python/Poplar.1918.b18120866.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1942.b1824614x.txt inflating: data/MOH/python/Poplar.1898.b18222882.txt inflating: data/MOH/python/CityofWestminster.1969.b18248317.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1909.b18245845.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1913.b18245882.txt inflating: data/MOH/python/CityofWestminster.1908.b18247738.txt inflating: data/MOH/python/CityofWestminster.1966.b18248287.txt inflating: data/MOH/python/CityofWestminster.1971.b18248330.txt inflating: data/MOH/python/CityofWestminster.1922.b18247842.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1964.b18246369.txt inflating: data/MOH/python/Poplar.1898.b18222833.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1957.b18246291.txt inflating: data/MOH/python/CityofWestminster.1911.b18247763.txt inflating: data/MOH/python/CityofWestminster.1910.b18247751.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1956.b1824628x.txt inflating: data/MOH/python/Poplar.1896.b19885040.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1912.b18245870.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1915.b18245900.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1937.b18246096.txt inflating: data/MOH/python/CityofWestminster.1956.b18248184.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1940.b18246126.txt inflating: 
data/MOH/python/CityofWestminster.1970.b18248329.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1927.b18245997.txt inflating: data/MOH/python/CityofWestminster.1925.b18247878.txt inflating: data/MOH/python/CityofWestminster.1941.b18248032.txt inflating: data/MOH/python/CityofWestminster.1952.b18248147.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1925.b18245973.txt inflating: data/MOH/python/CityofWestminster.1924.b18247866.txt inflating: data/MOH/python/CityofWestminster.1953.b18248159.txt inflating: data/MOH/python/CityofWestminster.1912.b18247775.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1953.b18246254.txt inflating: data/MOH/python/Westminster.1889.b20057076.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1914.b18245894.txt inflating: data/MOH/python/CityofWestminster.1954.b18248160.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1941.b18246138.txt inflating: data/MOH/python/Westminster.1861.b18248408.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1963.b18246357.txt inflating: data/MOH/python/CityofWestminster.1948.b1824810x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1910.b18245857.txt inflating: data/MOH/python/CityofWestminster.1949.b18248111.txt inflating: data/MOH/python/CityofWestminster.1913.b18247787.txt inflating: data/MOH/python/Westminster.1859.b1824838x.txt inflating: data/MOH/python/Westminster.1858.b18248378.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1922.b18245948.txt inflating: data/MOH/python/CityofWestminster.1931.b18247933.txt inflating: data/MOH/python/Poplar.1893.b17950454.txt inflating: data/MOH/python/Westminster.1893.b18018312.txt inflating: data/MOH/python/Westminster.1899.b18223011.txt inflating: data/MOH/python/CityofWestminster.1967.b18248299.txt inflating: data/MOH/python/CityofWestminster.1964.b18248263.txt inflating: data/MOH/python/CityofWestminster.1917.b18247817.txt inflating: data/MOH/python/CityofWestminster.1915.b18247805.txt inflating: data/MOH/python/CityofWestminster.1926.b1824788x.txt inflating: data/MOH/python/Westminster.1860.b18248391.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1955.b18246278.txt inflating: data/MOH/python/CityofWestminster.1965.b18248275.txt inflating: data/MOH/python/PoplarDistrictBowandStratford.1900.b18245730.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1932.b18246047.txt inflating: data/MOH/python/CityofWestminster.1940.b18248020.txt inflating: data/MOH/python/Westminster.1894.b18018324.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1926.b18245985.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1952.b18246242.txt inflating: data/MOH/python/CityofWestminster.1914.b18247799.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1936.b18246084.txt inflating: data/MOH/python/CityofWestminster.1957.b18248196.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1908.b18245833.txt inflating: data/MOH/python/CityofWestminster.1927.b18247891.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1911.b18245869.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1928.b1824600x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1962.b18246345.txt inflating: data/MOH/python/Westminster.1896.b18038207.txt inflating: data/MOH/python/Poplar.1897.b18222869.txt inflating: data/MOH/python/Westminster.1897.b19874352.txt inflating: data/MOH/python/Westminster.1858.b18248366.txt inflating: data/MOH/python/CityofWestminster.1955.b18248172.txt inflating: 
data/MOH/python/CityofWestminster.1963.b18248251.txt inflating: data/MOH/python/Poplar.1916.b18120854.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1923.b1824595x.txt inflating: data/MOH/python/Westminster.1895.b19874364.txt inflating: data/MOH/python/Westminster.1888.b20057064.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1949.b18246217.txt inflating: data/MOH/python/PoplarandBromley.1895.b18245742.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1917.b18245912.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1933.b18246059.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1924.b18245961.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1954.b18246266.txt inflating: data/MOH/python/CityofWestminster.1930.b18247921.txt inflating: data/MOH/python/CityofWestminster.1962.b1824824x.txt inflating: data/MOH/python/CityofWestminster.1923.b18247854.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1929.b18246011.txt inflating: data/MOH/python/CityofWestminster.1958.b18248202.txt inflating: data/MOH/python/CityofWestminster.1937.b18247994.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1921.b18245936.txt inflating: data/MOH/python/CityofWestminster.1938.b18248007.txt inflating: data/MOH/python/CityofWestminster.1947.b18248093.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1950.b18246229.txt inflating: data/MOH/python/Westminster.1891.b2005709x.txt inflating: data/MOH/python/Westminster.1857.b18248342.txt inflating: data/MOH/python/CityofWestminster.1933.b18247957.txt inflating: data/MOH/python/Poplar.1899.b18222894.txt inflating: data/MOH/python/CityofWestminster.1944.b18248068.txt inflating: data/MOH/python/CityofWestminster.1909.b1824774x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1946.b18246187.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1931.b18246035.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1951.b18246230.txt inflating: data/MOH/python/Westminster.1857.b18248354.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1904.b18245791.txt inflating: data/MOH/python/CityofWestminster.1960.b18248226.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1961.b18246333.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1939.b18246114.txt inflating: data/MOH/python/CityofWestminster.1961.b18248238.txt inflating: data/MOH/python/CityofWestminster.1943.b18248056.txt inflating: data/MOH/python/CityofWestminster.1950.b18248123.txt inflating: data/MOH/python/CityofWestminster.1934.b18247969.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1947.b18246199.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1905.b18245808.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1906.b1824581x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1901.b18245766.txt inflating: data/MOH/python/Westminster.1892.b20057106.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1944.b18246163.txt inflating: data/MOH/python/CityofWestminster.1968.b18248305.txt inflating: data/MOH/python/Poplar.1893.b17997835.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1958.b18246308.txt inflating: data/MOH/python/CityofWestminster.1929.b1824791x.txt inflating: data/MOH/python/CityofWestminster.1939.b18248019.txt inflating: data/MOH/python/CityofWestminster.1935.b18247970.txt inflating: data/MOH/python/Poplar.1896.b19885039.txt
The data are stored in the following folder structure:
data
|___ MOH
|___ python
|____ CityofWestminster.1901.b18247660.txt
|____ ...
The code below:
- collects all .txt files in data/MOH/python
- stores the resulting paths in a list
moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python
We can print the paths to the first ten documents with list slicing: [:10] means get the documents at index positions 0 to 9 (i.e. the first ten items).
print(moh_reports_paths[:10]) # print the first ten items
[PosixPath('data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt'), PosixPath('data/MOH/python/CityofWestminster.1932.b18247945.txt'), PosixPath('data/MOH/python/CityofWestminster.1921.b18247830.txt'), PosixPath('data/MOH/python/PoplarandBromley.1900.b18245754.txt'), PosixPath('data/MOH/python/Poplar.1919.b18120878.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt'), PosixPath('data/MOH/python/CityofWestminster.1907.b18247726.txt'), PosixPath('data/MOH/python/CityofWestminster.1906.b18247714.txt'), PosixPath('data/MOH/python/CityofWestminster.1903.b18247684.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt')]
Once we know where all the files are located, we can create a corpus. To do this, we apply the following steps:
- iterate over the paths to the MOH reports
- read each file and lowercase its content
- tokenize the resulting string
- keep only the alphabetic tokens and add them to the corpus
The general flow of the program is similar to what we've seen before: we create an empty list where we store information from our text collection, in this case all alphabetic tokens.
We use one more Notebook feature, %%time, to print how long the cell took to run.
It could take a few seconds for the cell to run, so please be a bit patient:
%%time
corpus = [] # initialize an empty list where we will store the MOH reports
for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths
    text_lower = open(p).read().lower() # read the text file and lowercase the string
    tokens = wordpunct_tokenize(text_lower) # tokenize the string
    for token in tokens: # iterate over the tokens
        if token.isalpha(): # test if the token only contains alphabetic characters
            corpus.append(token) # if the above test evaluates to True, append the token to the corpus list
print('collected', len(corpus),'tokens')
collected 3550169 tokens CPU times: user 2.75 s, sys: 150 ms, total: 2.9 s Wall time: 2.97 s
While this program works perfectly fine, it's not the most efficient code. The example below is a bit better, especially if you're confronted with lots of text files.
- The with open statement is a convenient way of handling the opening and closing of files (it makes sure you don't keep all information in memory, which would slow down the execution of your program).
- A list comprehension does the same work as a for loop but is faster and more concise.
We won't spend too much time discussing list comprehensions. The example below should suffice for now. We write a small program that collects odd numbers. First, we generate a list of numbers with range(10) ...
# see the output of range(10)
list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
... we test for division by 2: % is the modulus operator, which "returns the remainder after dividing the left-hand operand by the right-hand operand". n % 2 evaluates to 0 if a number n can be divided by 2. In Python 0 is equal to False, meaning that if n % 2 evaluates to 0/False we won't append the number to odd. If it evaluates to any other integer, we'll append n to odd.
print(10%2)
print(15%2)
0 1
%%time
# program for finding odd numbers
numbers = range(10) # get numbers 0 to 9
odd = [] # empty list where we store the odd numbers
for k in numbers: # iterate over numbers
    if k % 2: # test if the number is not divisible by 2 (i.e. is odd)
        odd.append(k) # if True, append
print(odd) # print the odd numbers
[1, 3, 5, 7, 9] CPU times: user 422 µs, sys: 444 µs, total: 866 µs Wall time: 486 µs
The same can be achieved with just one line of code using a list comprehension.
%time
odd = [k for k in range(10) if k % 2]
print(odd)
CPU times: user 3 µs, sys: 1 µs, total: 4 µs Wall time: 7.87 µs [1, 3, 5, 7, 9]
To see differences in performance, do the following:
- remove the print() statement
- change range(10) to range(1000000)
A sketch of such a comparison follows below.
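Below is a minimal sketch of the comparison as plain Python (using time.perf_counter instead of the %%time magic; the exact timings will differ on your machine):
import time

n = 1000000 # a large range makes the difference visible

start = time.perf_counter()
odd_loop = [] # collect odd numbers with a plain for loop
for k in range(n):
    if k % 2:
        odd_loop.append(k)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
odd_comp = [k for k in range(n) if k % 2] # the same with a list comprehension
comp_seconds = time.perf_counter() - start

print(odd_loop == odd_comp) # True: both approaches produce the same list
print(loop_seconds, comp_seconds) # the comprehension is usually the faster of the two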
Now returning to our example: run the slightly more efficient code and observe that it produces the same output, just faster!
%%time
corpus = [] # initialize an empty list where we will store the MOH reports
for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths
    with open(p) as in_doc: # make sure to close the document after opening it
        tokens = wordpunct_tokenize(in_doc.read().lower())
        corpus.extend([t for t in tokens if t.isalpha()]) # list comprehension
print('collected', len(corpus),'tokens') # print number of tokens collected
collected 3550169 tokens CPU times: user 2.44 s, sys: 150 ms, total: 2.59 s Wall time: 2.62 s
After collecting all tokens in a list we can convert this to another data type: an NLTK Text object. The cell below shows the results of the conversion.
print(type(corpus))
nltk_corpus = nltk.text.Text(corpus) # convert the list of tokens to a nltk.text.Text object
print(type(nltk_corpus))
<class 'list'> <class 'nltk.text.Text'>
Why is this useful? Well, the NLTK Text object comes with many useful methods for corpus exploration. To inspect all the tools attached to a Text object, apply the help() function to nltk_corpus (or help(nltk.text.Text), which does the same trick). You have to scroll down a bit (ignore all methods starting with __) to inspect the class methods.
help(nltk_corpus) # show methods attached to the nltk.text.Text object or nltk_corpus variable
Help on Text in module nltk.text object: class Text(builtins.object) | Text(tokens, name=None) | | A wrapper around a sequence of simple (string) tokens, which is | intended to support initial exploration of texts (via the | interactive console). Its methods perform a variety of analyses | on the text's contexts (e.g., counting, concordancing, collocation | discovery), and display the results. If you wish to write a | program which makes use of these analyses, then you should bypass | the ``Text`` class, and use the appropriate analysis function or | class directly instead. | | A ``Text`` is typically initialized from a given document or | corpus. E.g.: | | >>> import nltk.corpus | >>> from nltk.text import Text | >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt')) | | Methods defined here: | | __getitem__(self, i) | | __init__(self, tokens, name=None) | Create a Text object. | | :param tokens: The source text. | :type tokens: sequence of str | | __len__(self) | | __repr__(self) | Return repr(self). | | __str__(self) | Return str(self). | | __unicode__ = __str__(self) | | collocations(self, num=20, window_size=2) | Print collocations derived from the text, ignoring stopwords. | | :seealso: find_collocations | :param num: The maximum number of collocations to print. | :type num: int | :param window_size: The number of tokens spanned by a collocation (default=2) | :type window_size: int | | common_contexts(self, words, num=20) | Find contexts where the specified words appear; list | most frequent common contexts first. | | :param word: The word used to seed the similarity search | :type word: str | :param num: The number of words to generate (default=20) | :type num: int | :seealso: ContextIndex.common_contexts() | | concordance(self, word, width=79, lines=25) | Prints a concordance for ``word`` with the specified context window. | Word matching is not case-sensitive. | | :param word: The target word | :type word: str | :param width: The width of each line, in characters (default=80) | :type width: int | :param lines: The number of lines to display (default=25) | :type lines: int | | :seealso: ``ConcordanceIndex`` | | concordance_list(self, word, width=79, lines=25) | Generate a concordance for ``word`` with the specified context window. | Word matching is not case-sensitive. | | :param word: The target word | :type word: str | :param width: The width of each line, in characters (default=80) | :type width: int | :param lines: The number of lines to display (default=25) | :type lines: int | | :seealso: ``ConcordanceIndex`` | | count(self, word) | Count the number of times this word appears in the text. | | dispersion_plot(self, words) | Produce a plot showing the distribution of the words through the text. | Requires pylab to be installed. | | :param words: The words to be plotted | :type words: list(str) | :seealso: nltk.draw.dispersion_plot() | | findall(self, regexp) | Find instances of the regular expression in the text. | The text is a list of tokens, and a regexp pattern to match | a single token must be surrounded by angle brackets. E.g. | | >>> print('hack'); from nltk.book import text1, text5, text9 | hack... 
| >>> text5.findall("<.*><.*><bro>") | you rule bro; telling you bro; u twizted bro | >>> text1.findall("<a>(<.*>)<man>") | monied; nervous; dangerous; white; white; white; pious; queer; good; | mature; white; Cape; great; wise; wise; butterless; white; fiendish; | pale; furious; better; certain; complete; dismasted; younger; brave; | brave; brave; brave | >>> text9.findall("<th.*>{3,}") | thread through those; the thought that; that the thing; the thing | that; that that thing; through these than through; them that the; | through the thick; them that they; thought that the | | :param regexp: A regular expression | :type regexp: str | | generate(self, words) | Issues a reminder to users following the book online | | index(self, word) | Find the index of the first occurrence of the word in the text. | | plot(self, *args) | See documentation for FreqDist.plot() | :seealso: nltk.prob.FreqDist.plot() | | readability(self, method) | | similar(self, word, num=20) | Distributional similarity: find other words which appear in the | same contexts as the specified word; list most similar words first. | | :param word: The word used to seed the similarity search | :type word: str | :param num: The number of words to generate (default=20) | :type num: int | :seealso: ContextIndex.similar_words() | | unicode_repr = __repr__(self) | | vocab(self) | :seealso: nltk.prob.FreqDist | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined)
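If the full help() output feels overwhelming, a quicker way to get an overview is to list only the public method names with Python's built-in dir():
print([m for m in dir(nltk_corpus) if not m.startswith('_')]) # list the public methods of the Text object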
Let's have a closer look at .concordance(). According to the official documentation, this method "prints a concordance for word with the specified context window. Word matching is not case-sensitive."
It takes multiple arguments:
- word: the query term
- width: the context window, i.e. the number of characters printed
- lines: the number of lines (i.e. KWIC examples) to show
The first line of the output states the total number of hits for the query term (Displaying * of * matches:).
The example code below prints the context of the word "poor".
nltk_corpus.concordance('poor',width=100,lines=10) # print the context of 'poor', window = 100 characters
Displaying 10 of 1112 matches: mes difficult to arrange but the friends of the poor and the charity organisation society can often l analysis in one case the milk proved to be of poor quality the work is carried out in the ordinary fat fair quality between per cent and per cent poor quality between per cent and per cent adulterat able i district total good quality fair quality poor quality adulterated no percent no percent no pe in which the applicant is already in receipt of poor law relief or is considered ought to be referen ing cases previously notified under to to total poor law institutions sanatoria poor law institution der to to total poor law institutions sanatoria poor law institutions sanatoria pulmonary males fema g to tuberculosis and the treatment of cases in poor law and other hospitals advance in social well y in which the fat was between and per cent and poor or inferior quality in which the fat was betwee ood quality no per cent fair quality no percent poor quality no percent adulterated no percent south
Use KWIC analysis to compare the word "poor" in MOH reports from the City of Westminster and Poplar. Using everything you learned in the previous Notebook:
- select the reports for each borough
- create a separate Text object for each borough and compare the concordances
# Enter code here
While KWIC analysis is useful for investigating the context of words, it is a method that doesn't scale well: it helps with the close reading of around 100 examples, but when examples run in the thousands it becomes more difficult. Collocations can help quantify the semantics of a term, or how the meaning of words differs between corpora or subsamples of a corpus.
Collocations, as explained in the AntConc section, are often multi-word expressions containing tokens that tend to co-occur, such as "New York City" (the span between words can be longer, they don't have to appear next to each other).
The NLTK Text object has a collocations() method. Below we print and explain the documentation.
collocations(self, num=20, window_size=2)
Print collocations derived from the text, ignoring stopwords.
It has the following parameters:
- num: the maximum number of collocations to print (if not specified it will print 20)
- window_size: the number of tokens spanned by a collocation (default=2)
If window_size=2, collocations will only include bigrams (words occurring next to each other). But sometimes we wish to include longer intervals, to make the co-occurrence of words within a broader window more visible; this allows us to go beyond multiword expressions and study the distribution of words in a corpus more generally. For example, we could check whether "men" and "women" are discussed in each other's context (within a span of 10), even if they don't appear next to each other.
%%time
nltk_corpus.collocations(window_size=2)
per cent; public health; county council; london county; medical officer; scarlet fever; whooping cough; males females; local government; legal proceedings; dwelling houses; poplar bromley; small pox; ice cream; sub district; government board; child welfare; city council; death rate; bromley bow CPU times: user 9.61 s, sys: 76.3 ms, total: 9.68 s Wall time: 9.72 s
%%time
nltk_corpus.collocations(window_size=5)
street street; per cent; public health; county council; months months; london county; medical officer; scarlet fever; males females; poplar bromley; london council; road street; bromley bow; see page; road road; whooping cough; officer health; medical health; poplar bow; small pox CPU times: user 29.2 s, sys: 389 ms, total: 29.6 s Wall time: 30.5 s
While the .collocations() method provides a convenient tool for obtaining collocations from a corpus, its functionality remains rather limited. Below we will inspect the collocation functions of NLTK in more detail, giving you more power as well as precision.
Before we start, we import all the required tools that nltk.collocations provides. This is handled by import *: similar to a wildcard, it matches and loads all functions from nltk.collocations.
import nltk
from nltk.collocations import *
We have to select an association measure to compute the "strength" with which two tokens are "attracted" to each other. In general, collocations are words that are likely to appear together (within a specific context or window size). This explains why "red wine" is a strong collocation and "the wine" less so.
NLTK provides us with different measures, which you can print and investigate in more detail. Many of the functions refer to the classic NLP handbook by Manning and Schütze, "Foundations of Statistical Natural Language Processing".
bigram_measures = nltk.collocations.BigramAssocMeasures()
help(bigram_measures)
Help on BigramAssocMeasures in module nltk.metrics.association object: class BigramAssocMeasures(NgramAssocMeasures) | A collection of bigram association measures. Each association measure | is provided as a function with three arguments:: | | bigram_score_fn(n_ii, (n_ix, n_xi), n_xx) | | The arguments constitute the marginals of a contingency table, counting | the occurrences of particular events in a corpus. The letter i in the | suffix refers to the appearance of the word in question, while x indicates | the appearance of any word. Thus, for example: | | n_ii counts (w1, w2), i.e. the bigram being scored | n_ix counts (w1, *) | n_xi counts (*, w2) | n_xx counts (*, *), i.e. any bigram | | This may be shown with respect to a contingency table:: | | w1 ~w1 | ------ ------ | w2 | n_ii | n_oi | = n_xi | ------ ------ | ~w2 | n_io | n_oo | | ------ ------ | = n_ix TOTAL = n_xx | | Method resolution order: | BigramAssocMeasures | NgramAssocMeasures | builtins.object | | Class methods defined here: | | chi_sq(n_ii, n_ix_xi_tuple, n_xx) from abc.ABCMeta | Scores bigrams using chi-square, i.e. phi-sq multiplied by the number | of bigrams, as in Manning and Schutze 5.3.3. | | fisher(*marginals) from abc.ABCMeta | Scores bigrams using Fisher's Exact Test (Pedersen 1996). Less | sensitive to small counts than PMI or Chi Sq, but also more expensive | to compute. Requires scipy. | | phi_sq(*marginals) from abc.ABCMeta | Scores bigrams using phi-square, the square of the Pearson correlation | coefficient. | | ---------------------------------------------------------------------- | Static methods defined here: | | dice(n_ii, n_ix_xi_tuple, n_xx) | Scores bigrams using Dice's coefficient. | | ---------------------------------------------------------------------- | Data and other attributes defined here: | | __abstractmethods__ = frozenset() | | ---------------------------------------------------------------------- | Class methods inherited from NgramAssocMeasures: | | jaccard(*marginals) from abc.ABCMeta | Scores ngrams using the Jaccard index. | | likelihood_ratio(*marginals) from abc.ABCMeta | Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4. | | pmi(*marginals) from abc.ABCMeta | Scores ngrams by pointwise mutual information, as in Manning and | Schutze 5.4. | | poisson_stirling(*marginals) from abc.ABCMeta | Scores ngrams using the Poisson-Stirling measure. | | student_t(*marginals) from abc.ABCMeta | Scores ngrams using Student's t test with independence hypothesis | for unigrams, as in Manning and Schutze 5.3.1. | | ---------------------------------------------------------------------- | Static methods inherited from NgramAssocMeasures: | | mi_like(*marginals, **kwargs) | Scores ngrams using a variant of mutual information. The keyword | argument power sets an exponent (default 3) for the numerator. No | logarithm of the result is calculated. | | raw_freq(*marginals) | Scores ngrams by their frequency | | ---------------------------------------------------------------------- | Data descriptors inherited from NgramAssocMeasures: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined)
In our example we use pointwise mutual information (pmi) to compute collocations.
help(bigram_measures.pmi)
Help on method pmi in module nltk.metrics.association: pmi(*marginals) method of abc.ABCMeta instance Scores ngrams by pointwise mutual information, as in Manning and Schutze 5.4.
pmi is a rather straightforward metric. In the case of bigrams (i.e. collocations of length two and window size two), we need:
- n: the total number of bigrams in the corpus (3435 in this toy example)
- the probability of a and b appearing as a bigram: if the bigram (a, b) occurs 10 times, the probability P(a,b) = 10/3435 = 0.0029
- the individual probabilities of a and b across the whole corpus: for example, if a appears 30 times and b 45 times, their respective probabilities are P(a) = 30/3435 = 0.0087 and P(b) = 45/3435 = 0.0131. We then multiply P(a) and P(b) to obtain the denominator 0.0087 * 0.0131 = 0.0001
pmi is the base-2 logarithm of the ratio between the bigram probability and this denominator: pmi = log2(P(a,b) / (P(a) * P(b))). The cell below computes it for the toy example.
from numpy import log2
nom = 10/3435
denom = (30/3435) * (45/3435)
pmi = log2(nom/denom)
pmi
4.6692787866546315
To rank collocations by their pmi scores, we first apply the .from_words() method to the nltk_corpus (or any list of tokens). The result of this operation is stored in a finder object, which we can subsequently use to rank and print collocations.
Note that the results below look somewhat strange; these aren't very meaningful collocates.
finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.nbest(bigram_measures.pmi, 10)
[('abso', 'lutely'), ('acidi', 'lacfc'), ('acquires', 'setiological'), ('adolph', 'mussi'), ('adolphus', 'massie'), ('adultorated', 'sanples'), ('adver', 'tising'), ('aeql', 'rrhage'), ('alathilde', 'christoffersen'), ('alio', 'wances')]
These results are rather spurious. Why? If, for example, a and b both appear only once and always next to each other, the pmi score will be high. But such collocations aren't meaningful collocations, more a rare artefact of the data.
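To see why, plug such a rare pair into the toy calculation from above (keeping the hypothetical corpus size of 3435 bigrams): if a and b each occur exactly once, and only together, the pmi comes out far higher than the 4.67 we computed for the more frequent pair.
from numpy import log2

# a and b occur once each, always together, in a corpus of 3435 bigrams
p_ab = 1 / 3435
p_a = p_b = 1 / 3435
print(log2(p_ab / (p_a * p_b))) # roughly 11.75, much higher than the 4.67 above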
To solve this problem, we filter by ngram frequency, in our case removing all bigrams that appear fewer than 3 times with the .apply_freq_filter() function.
help(finder.apply_freq_filter)
Help on method apply_freq_filter in module nltk.collocations: apply_freq_filter(min_freq) method of nltk.collocations.BigramCollocationFinder instance Removes candidate ngrams which have frequency less than min_freq.
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)
[('bowers', 'gifford'), ('carrie', 'simuelson'), ('culex', 'pipiens'), ('heatherfield', 'ascot'), ('holmes', 'godson'), ('lehman', 'ashmead'), ('locum', 'tenens'), ('nemine', 'contradicente'), ('quinton', 'polyclinic'), ('rhesus', 'incompatibility')]
Now many names appear. We can be even stricter and use a higher threshold for filtering.
finder.apply_freq_filter(20)
finder.nbest(bigram_measures.pmi, 10)
[('braxton', 'hicks'), ('herman', 'olsen'), ('posterior', 'basal'), ('arterio', 'sclerosis'), ('brucella', 'abortus'), ('burnishers', 'diamond'), ('pillows', 'bolsters'), ('carvers', 'gilders'), ('sweetmeats', 'cosaques'), ('bookbinding', 'lithographers')]
It is also possible to change the window size, but the larger the window size, the longer the computation takes.
finder = BigramCollocationFinder.from_words(nltk_corpus, window_size = 5)
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 10)
[('tr', 'tr'), ('felix', 'twede'), ('barmen', 'potmen'), ('harbott', 'chauffeur'), ('axel', 'welin'), ('betha', 'nicholson'), ('malcolm', 'donaldson'), ('roasters', 'grinders'), ('spasmodic', 'stridulous'), ('soapmaking', 'lubricating')]
Lastly, you can focus on collocations that contain a specific token, for example, getting all collocations with the token "poor". We have to pass a function to .apply_ngram_filter(). At this point, you shouldn't worry about the code, only understand how to adapt it (see exercise below).
def token_filter_poor(*w): # filter function: ngrams for which this returns True are removed
    return 'poor' not in w # keep only ngrams that contain the token 'poor'
finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.apply_freq_filter(3)
finder.apply_ngram_filter(token_filter_poor)
finder.nbest(bigram_measures.pmi, 10)
[('poor', 'attenders'), ('poor', 'palatines'), ('poor', 'genl'), ('poor', 'law'), ('compositions', 'poor'), ('poor', 'sufferers'), ('poor', 'quality'), ('sleep', 'poor'), ('poor', 'visibility'), ('failures', 'poor')]
Copy-paste the above code and create a program that prints the first 10 collocations with the word "women".
# Enter code here
In the last section of this Notebook, we explore computational methods for finding words that characterize a collection: we try to select tokens (or, more generally, features) that distinguish a particular set of documents vis-a-vis another corpus.
Such comparisons help us determine what type of language use was distinctive for a particular group (such as a political party), period or location. We continue with the example of the MOH reports, but compare the language of different boroughs: the affluent Westminster with the industrial, and considerably poorer, Poplar.
The code below should look familiar, but we made a few changes:
- we create two lists, corpus and labels. In the former we store our text documents (each item in the list is one text file/string); the latter contains labels, 0 for Poplar and 1 for Westminster. We collect these labels in parallel with the texts, i.e. if the first item in corpus is a text from Westminster, the first label in labels is 1.
- we use with open to automatically close each document after opening it
- we use an if else statement: if the string westminster appears in the file name we add 1 to labels, otherwise 0.
%%time
import nltk # import natural language toolkit
from pathlib import Path # import Path object from pathlib
from nltk.tokenize import wordpunct_tokenize # import word_tokenize function from nltk.tokenize
moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python
corpus = [] # save corpus here
labels = [] # save labels here
for r in moh_reports_paths: # iterate over documents
    with open(r) as in_doc: # open the document (also takes care of closing it later)
        corpus.append(in_doc.read().lower()) # append the lowercased document to corpus
    if 'westminster' in r.name.lower(): # check if westminster appears in the file name
        labels.append(1) # if so, append 1 to labels
    else: # if not
        labels.append(0) # append 0 to labels
CPU times: user 247 ms, sys: 61.6 ms, total: 308 ms Wall time: 368 ms
Each document should correspond to one label. The lists labels and corpus should have equal length.
print(len(labels),len(corpus))
159 159
print(len(labels) == len(corpus))
True
As said earlier, we collect labels for each document: 1 for Westminster and 0 for Poplar (it could also be the reverse, of course!). It is important that each label corresponds correctly with a text file in corpus.
print(labels[:10])
[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
We can check this by printing the first hundred characters of the first document (labelled as 0)...
Note that corpus[0] returns the first document, from which we slice the first hundred characters with [:100].
corpus[0][:100]
"l-'&rary pop s-/ metropolitan borough of poplar . abridged interim report on the health of the borou"
... and the second document (labelled as 1).
corpus[1][:100]
'city of westminster. report of the medical officer of health for the year . - 1932 andrew j. shinnie'
Checking your code by eyeballing the output is always good practice. Even if your code runs, it could still contain bugs, which are commonly referred to as "semantic errors".
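A lightweight way to go beyond eyeballing is a quick consistency check. The sketch below reuses the moh_reports_paths and labels variables defined above and raises an error if a label does not match its file name:
# sanity check: every document labelled 1 should come from a Westminster file
for path, label in zip(moh_reports_paths, labels):
    is_westminster = 'westminster' in path.name.lower()
    assert label == int(is_westminster), f"label mismatch for {path.name}"
print('labels and file names are consistent')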
To obtain the most distinctive words (for the reports from both Westminster and Poplar) we use an external library, TextFeatureSelection. Python has a very rich and fast-evolving ecosystem: if you have a problem, it's very likely someone wrote a library to help you with it. We first have to install this package (it's not yet part of Colab).
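If the library is missing from your environment, you can install it from the notebook itself (this assumes the package is published on PyPI under the name TextFeatureSelection):
!pip install TextFeatureSelection # install the library, only needed once per environment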
import TextFeatureSelection
Now we can apply the TextFeatureSelection library. The documentation is available here.
Computing the features requires only a few lines of code. You only need to provide:
- the text documents via the input_doc_list parameter
- the labels via the target parameter
TextFeatureSelection then uses various metrics to compute the extent to which words are associated with a label. The output of this process is a pandas.DataFrame. Working with tabular data and data frames will be extensively discussed in Part II of this course. For now, we show you how to sort information and get the most distinctive words or features.
help(TextFeatureSelection)
Help on module TextFeatureSelection: NAME TextFeatureSelection - Text features selection. CLASSES builtins.object TextFeatureSelection TextFeatureSelectionGA class TextFeatureSelection(builtins.object) | TextFeatureSelection(target, input_doc_list, stop_words=None, metric_list=['MI', 'CHI', 'PD', 'IG']) | | Compute score for each word to identify and select words which result in better model performance. | | Parameters | ---------- | target : list object which has categories of labels. for more than one category, no need to dummy code and instead provide label encoded values as list object. | | input_doc_list : List object which has text. each element of list is text corpus. No need to tokenize, as text will be tokenized in the module while processing. target and input_doc_list should have same length. | | stop_words : Words for which you will not want to have metric values calculated. Default is blank. | | metric_list : List object which has the metric to be calculated. There are 4 metric which are being computed as 'MI','CHI','PD','IG'. you can specify one or more than one as a list object. Default is ['MI','CHI','PD','IG']. | | Returns | ------- | values_df : pandas dataframe with results. unique words and score from the desried metric. | | Examples | -------- | The following example shows how to retrieve the 5 most informative | features in the Friedman #1 dataset. | | >>> from sklearn.feature_selection.text import TextFeatureSelection | | >>> #Multiclass classification problem | >>> input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?'] | >>> target=['Positive','Positive','Negative','Neutral','Neutral'] | >>> result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore() | >>> print(result_df) | | word list word occurence count Proportional Difference Mutual Information Chi Square Information Gain | 0 am 1 1.0 0.916291 1.875000 0.089257 | 1 an 1 1.0 0.916291 1.875000 0.089257 | 2 at 1 1.0 1.609438 5.000000 0.000000 | 3 awesome 1 1.0 0.916291 1.875000 0.089257 | 4 back 1 1.0 1.609438 5.000000 0.000000 | 5 chips 1 1.0 0.916291 1.875000 0.089257 | 6 difficult 1 1.0 1.609438 5.000000 0.000000 | 7 do 1 1.0 0.916291 1.875000 0.089257 | 8 had 2 1.0 0.223144 0.833333 0.008164 | 9 happy 1 1.0 0.916291 1.875000 0.089257 | 10 home 1 1.0 1.609438 5.000000 0.000000 | 11 is 1 1.0 1.609438 5.000000 0.000000 | 12 just 2 1.0 0.223144 0.833333 0.008164 | 13 lunch 1 1.0 0.916291 1.875000 0.089257 | 14 stayed 1 1.0 1.609438 5.000000 0.000000 | 15 terrain 1 1.0 1.609438 5.000000 0.000000 | 16 this 1 1.0 1.609438 5.000000 0.000000 | 17 to 1 1.0 1.609438 5.000000 0.000000 | 18 trek 1 1.0 1.609438 5.000000 0.000000 | 19 very 2 1.0 0.916291 2.222222 0.008164 | 20 want 1 1.0 0.916291 1.875000 0.089257 | 21 weekend 1 1.0 0.916291 1.875000 0.089257 | 22 wish 1 1.0 1.609438 5.000000 0.000000 | 23 you 1 1.0 0.916291 1.875000 0.089257 | | | | >>> #Binary classification | >>> input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars'] | >>> target=[1,1,0,1] | >>> result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore() | >>> print(result_df) | word list word occurence count Proportional Difference Mutual Information Chi Square Information Gain | 0 algebra 1 -1.0 1.386294 4.000000 0.0 | 1 am 2 1.0 -inf 1.333333 0.0 | 2 cannot 1 -1.0 
1.386294 4.000000 0.0 | 3 content 1 1.0 -inf 0.444444 0.0 | 4 go 1 1.0 -inf 0.444444 0.0 | 5 having 1 1.0 -inf 0.444444 0.0 | 6 learn 1 -1.0 1.386294 4.000000 0.0 | 7 learning 1 -1.0 1.386294 4.000000 0.0 | 8 life 1 1.0 -inf 0.444444 0.0 | 9 linear 1 -1.0 1.386294 4.000000 0.0 | 10 location 1 1.0 -inf 0.444444 0.0 | 11 machine 1 -1.0 1.386294 4.000000 0.0 | 12 mars 1 1.0 -inf 0.444444 0.0 | 13 my 1 1.0 -inf 0.444444 0.0 | 14 of 1 1.0 -inf 0.444444 0.0 | 15 the 1 1.0 -inf 0.444444 0.0 | 16 this 1 1.0 -inf 0.444444 0.0 | 17 time 1 1.0 -inf 0.444444 0.0 | 18 to 1 1.0 -inf 0.444444 0.0 | 19 want 1 1.0 -inf 0.444444 0.0 | 20 with 1 1.0 -inf 0.444444 0.0 | 21 without 1 -1.0 1.386294 4.000000 0.0 | 22 you 1 -1.0 1.386294 4.000000 0.0 | | | Notes | ----- | Chi-square (CHI): | - It measures the lack of independence between t and c. | - It has a natural value of zero if t and c are independent. If it is higher, then term is dependent | - It is not reliable for low-frequency terms | - For multi-class categories, we will calculate X^2 value for all categories and will take the Max(X^2) value across all categories at the word level. | - It is not to be confused with chi-square test and the values returned are not significance values | | Mutual information (MI): | - Rare terms will have a higher score than common terms. | - For multi-class categories, we will calculate MI value for all categories and will take the Max(MI) value across all categories at the word level. | | Proportional difference (PD): | - How close two numbers are from becoming equal. | - Helps find unigrams that occur mostly in one class of documents or the other | - We use the positive document frequency and negative document frequency of a unigram as the two numbers. | - If a unigram occurs predominantly in positive documents or predominantly in negative documents then the PD will be close to 1, however if distribution of unigram is almost similar, then PD is close to 0. | - We can set a threshold to decide which words to be included | - For multi-class categories, we will calculate PD value for all categories and will take the Max(PD) value across all categories at the word level. | | Information gain (IG): | - It gives discriminatory power of the word | | References | ---------- | Yiming Yang and Jan O. Pedersen "A Comparative Study on Feature Selection in Text Categorization" | http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=E5CC43FE63A1627AB4C0DBD2061FE4B9?doi=10.1.1.32.9956&rep=rep1&type=pdf | | Christine Largeron, Christophe Moulin, Mathias Géry "Entropy based feature selection for text categorization" | https://hal.archives-ouvertes.fr/hal-00617969/document | | Mondelle Simeon, Robert J. Hilderman "Categorical Proportional Difference: A Feature Selection Method for Text Categorization" | https://pdfs.semanticscholar.org/6569/9f0e1159a40042cc766139f3dfac2a3860bb.pdf | | Tim O`Keefe and Irena Koprinska "Feature Selection and Weighting Methods in Sentiment Analysis" | https://www.researchgate.net/publication/242088860_Feature_Selection_and_Weighting_Methods_in_Sentiment_Analysis | | Methods defined here: | | __init__(self, target, input_doc_list, stop_words=None, metric_list=['MI', 'CHI', 'PD', 'IG']) | Initialize self. See help(type(self)) for accurate signature. 
| | getScore(self) | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) class TextFeatureSelectionGA(builtins.object) | TextFeatureSelectionGA(generations=500, population=50, prob_crossover=0.9, prob_mutation=0.1, percentage_of_token=50, runtime_minutes=120) | | Use genetic algorithm for selecting text tokens which give best classification results | | Genetic Algorithm Parameters | ---------- | | generations : Number of generations to run genetic algorithm. 500 as deafult, as used in the original paper | | population : Number of individual chromosomes. 50 as default, as used in the original paper | | prob_crossover : Probability of crossover. 0.9 as default, as used in the original paper | | prob_mutation : Probability of mutation. 0.1 as default, as used in the original paper | | percentage_of_token : Percentage of word features to be included in a given chromosome. | 50 as default, as used in the original paper. | | runtime_minutes : Number of minutes to run the algorithm. This is checked in between generations. | At start of each generation it is checked if runtime has exceeded than alloted time. | If case run time did exceeds provided limit, best result from generations executed so far is given as output. | Default is 2 hours. i.e. 120 minutes. | | References | ---------- | Noria Bidi and Zakaria Elberrichi "Feature Selection For Text Classification Using Genetic Algorithms" | https://ieeexplore.ieee.org/document/7804223 | | Methods defined here: | | __init__(self, generations=500, population=50, prob_crossover=0.9, prob_mutation=0.1, percentage_of_token=50, runtime_minutes=120) | Initialize self. See help(type(self)) for accurate signature. | | getGeneticFeatures(self, doc_list, label_list, model=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, | intercept_scaling=1, l1_ratio=None, max_iter=100, | multi_class='auto', n_jobs=None, penalty='l2', | random_state=None, solver='lbfgs', tol=0.0001, verbose=0, | warm_start=False), model_metric='f1', avrg='binary', analyzer='word', min_df=2, max_df=1.0, stop_words=None, tokenizer=None, token_pattern='(?u)\\b\\w\\w+\\b', lowercase=True) | Data Parameters | ---------- | doc_list : text documents in a python list. | Example: ['i had dinner','i am on vacation','I am happy','Wastage of time'] | | label_list : labels in a python list. | Example: ['Neutral','Neutral','Positive','Negative'] | | | Modelling Parameters | ---------- | model : Set a model which has .fit function to train model and .predict function to predict for test data. | This model should also be able to train classifier using TfidfVectorizer feature. | Default is set as Logistic regression in sklearn | | model_metric : Classifier cost function. Select one from: ['f1','precision','recall']. | Default is F1 | | avrg : Averaging used in model_metric. Select one from ['micro', 'macro', 'samples','weighted', 'binary']. | For binary classification, default is 'binary' and for multi-class classification, default is 'micro'. | | | TfidfVectorizer Parameters | ---------- | analyzer : {'word', 'char', 'char_wb'} or callable, default='word' | Whether the feature should be made of word or character n-grams. | Option 'char_wb' creates character n-grams only from text inside | word boundaries; n-grams at the edges of words are padded with space. 
| | min_df : float or int, default=2 | When building the vocabulary ignore terms that have a document | frequency strictly lower than the given threshold. This value is also | called cut-off in the literature. | If float in range of [0.0, 1.0], the parameter represents a proportion | of documents, integer absolute counts. | This parameter is ignored if vocabulary is not None. | | max_df : float or int, default=1.0 | When building the vocabulary ignore terms that have a document | frequency strictly higher than the given threshold (corpus-specific | stop words). | If float in range [0.0, 1.0], the parameter represents a proportion of | documents, integer absolute counts. | This parameter is ignored if vocabulary is not None. | | stop_words : {'english'}, list, default=None | If a string, it is passed to _check_stop_list and the appropriate stop | list is returned. 'english' is currently the only supported string | value. | There are several known issues with 'english' and you should | consider an alternative (see :ref:`stop_words`). | | If a list, that list is assumed to contain stop words, all of which | will be removed from the resulting tokens. | Only applies if ``analyzer == 'word'``. | | If None, no stop words will be used. max_df can be set to a value | in the range [0.7, 1.0) to automatically detect and filter stop | words based on intra corpus document frequency of terms. | | tokenizer : callable, default=None | Override the string tokenization step while preserving the | preprocessing and n-grams generation steps. | Only applies if ``analyzer == 'word'`` | | token_pattern : str, default=r"(?u)\b\w\w+\b" | Regular expression denoting what constitutes a "token", only used | if ``analyzer == 'word'``. The default regexp selects tokens of 2 | or more alphanumeric characters (punctuation is completely ignored | and always treated as a token separator). | | If there is a capturing group in token_pattern then the | captured group content, not the entire match, becomes the token. | At most one capturing group is permitted. | | lowercase : bool, default=True | Convert all characters to lowercase before tokenizing. | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) FILE /usr/local/lib/python3.7/site-packages/TextFeatureSelection.py
from TextFeatureSelection import TextFeatureSelection # import TextFeatureSelection
fsOBJ=TextFeatureSelection(target=labels,input_doc_list=corpus) # compute features
df=fsOBJ.getScore() # get features as a dataframe
df
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
0 | 00 | 103 | -0.009709 | 0.094959 | 2.463282 | 0.004326 |
1 | 000 | 149 | 0.073826 | 0.008605 | 0.150191 | 0.000266 |
2 | 000000 | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
3 | 0001 | 3 | 1.000000 | -inf | 2.595483 | 0.000000 |
4 | 000163 | 1 | 1.000000 | -inf | 0.854210 | 0.000000 |
... | ... | ... | ... | ... | ... | ... |
42232 | ¾gallons | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42233 | ¾ths | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42234 | ægis | 1 | 1.000000 | -inf | 0.854210 | 0.000000 |
42235 | æration | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42236 | œsophagus | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42237 rows × 6 columns
A pandas.DataFrame is similar to an Excel spreadsheet. It contains several columns which we can use for selecting and sorting information. In fact, if you are familiar with Excel, you can export the data frame and open it as a spreadsheet. The code below takes care of this.
df.to_excel('data/result_features.xlsx')
If you want to know more about working with DataFrames, consult the following Notebooks.
We use the following columns to select and rank words:
- word occurence count: to remove infrequent words
- Proportional Difference: its sign indicates which borough a word leans towards (positive for Westminster, negative for Poplar, following the variable names in the code below)
- Information Gain (or Chi Square): to rank words by how strongly they distinguish the two boroughs
westminster_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] > 0 )]
westminster_df.sort_values('Information Gain',ascending=False)[:10]
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
21070 | horseferry | 71 | 0.971831 | -3.484235 | 102.313762 | 0.239720 |
41176 | wes | 64 | 0.968750 | -3.380438 | 84.840438 | 0.205713 |
30188 | pimlico | 63 | 0.968254 | -3.364690 | 82.552451 | 0.201071 |
7989 | await | 64 | 0.906250 | -2.281826 | 73.305433 | 0.165213 |
9838 | buckingham | 61 | 0.901639 | -2.233817 | 66.974989 | 0.152432 |
20355 | harrison | 63 | 0.873016 | -1.978396 | 65.767627 | 0.144732 |
33224 | restaurant | 58 | 0.896552 | -2.183386 | 61.025147 | 0.140108 |
18111 | fines | 91 | 0.692308 | -1.093357 | 79.850990 | 0.139906 |
10445 | carpentry | 47 | 0.957447 | -3.071703 | 51.509404 | 0.133109 |
25739 | marshall | 46 | 0.956522 | -3.050197 | 49.861808 | 0.129219 |
poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]
poplar_df.sort_values('Information Gain',ascending=False)[:10]
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
30606 | pop | 59 | -1.000000 | 0.778445 | 110.515890 | 0.184282 |
15282 | dock | 66 | -0.787879 | 0.666327 | 85.911216 | 0.149069 |
22759 | intimations | 92 | -0.478261 | 0.476164 | 68.933936 | 0.146510 |
22037 | india | 67 | -0.761194 | 0.651290 | 82.833472 | 0.144441 |
31245 | procured | 70 | -0.714286 | 0.624294 | 79.780218 | 0.141947 |
41149 | wellington | 86 | -0.511628 | 0.498485 | 66.399455 | 0.131704 |
17897 | ferry | 68 | -0.705882 | 0.619380 | 74.205624 | 0.129299 |
34547 | seamen | 65 | -0.723077 | 0.629409 | 71.698892 | 0.122070 |
14529 | devons | 47 | -1.000000 | 0.778445 | 78.605431 | 0.118591 |
26460 | millwall | 49 | -0.959184 | 0.757825 | 77.262501 | 0.118218 |
poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]
poplar_df.sort_values('Chi Square',ascending=False)[:10]
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
30606 | pop | 59 | -1.000000 | 0.778445 | 110.515890 | 0.184282 |
9432 | bow | 89 | -0.640449 | 0.580268 | 106.152339 | 0.000000 |
42219 | zymotic | 94 | -0.553191 | 0.525609 | 93.326942 | 0.000000 |
15282 | dock | 66 | -0.787879 | 0.666327 | 85.911216 | 0.149069 |
22037 | india | 67 | -0.761194 | 0.651290 | 82.833472 | 0.144441 |
31245 | procured | 70 | -0.714286 | 0.624294 | 79.780218 | 0.141947 |
14529 | devons | 47 | -1.000000 | 0.778445 | 78.605431 | 0.118591 |
26460 | millwall | 49 | -0.959184 | 0.757825 | 77.262501 | 0.118218 |
33936 | ruston | 46 | -1.000000 | 0.778445 | 76.252152 | 0.114306 |
17897 | ferry | 68 | -0.705882 | 0.619380 | 74.205624 | 0.129299 |