This Notebook explores various tools for analysing and comparing texts at the corpus level. As such, these are your first ventures into "macro-analysis" with Python. The methods described here are particularly powerful in combination with the techniques for content selection explained in Notebook 5 Corpus Creation.
More specifically, we will have a closer look at:
- Keyword-in-Context (KWIC) analysis with concordances
- collocations
- finding distinctive words that characterize a corpus
Computers are excellent at indexing, organizing and retrieving information. However, interpreting information (especially natural language) is still a difficult task. Keyword-in-Context (KWIC) analysis brings together the best of both worlds: the retrieval power of machines with the close-reading skills of the historian. KWIC (or concordance) centres a corpus on a specific query term, showing n words (or characters) to the left and the right.
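As a minimal sketch of the idea (using a made-up sentence rather than the corpus we build below), NLTK's concordance method does exactly this:
import nltk # natural language toolkit
from nltk.tokenize import wordpunct_tokenize # simple rule-based tokenizer
# toy example: wrap a list of tokens in an nltk Text object and print a concordance for one query term
tokens = wordpunct_tokenize("the poor state of housing and the poor themselves worried the poor law officers")
nltk.text.Text(tokens).concordance("poor", width=40)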
In this section, we investigate reports of the London Medical Officers of Health, collected in the London's Pulse corpus.
The reports were produced each year by the Medical Officer of Health (MOH) of a district and set out the work done by his public health and sanitary officers. The reports provided vital data on birth and death rates, infant mortality, incidence of infectious and other diseases, and a general statement on the health of the population.
Source: https://wellcomelibrary.org/moh/about-the-reports/about-the-medical-officer-of-health-reports/
We start by importing the necessary libraries. Some of the code is explained in previous Notebooks, so we won't discuss it in detail here.
The tools we need are:
- nltk: the Natural Language Toolkit, for tokenization and concordances
- pathlib: a library for managing files and folders
import nltk # import natural language toolkit
nltk.download('stopwords')
from pathlib import Path # import Path object from pathlib
from nltk.tokenize import wordpunct_tokenize # import word_tokenize function from nltk.tokenize
[nltk_data] Downloading package stopwords to [nltk_data] /Users/kbeelen/nltk_data... [nltk_data] Package stopwords is already up-to-date!
!ls data/MOH/ # list all files in data/MOH/
antconc python python.zip
# in case you unzipped data before
!rm -r data/MOH/python
!unzip data/MOH/python.zip -d data/MOH/
Archive: data/MOH/python.zip creating: data/MOH/python/ inflating: data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt inflating: data/MOH/python/CityofWestminster.1932.b18247945.txt inflating: data/MOH/python/CityofWestminster.1921.b18247830.txt inflating: data/MOH/python/PoplarandBromley.1900.b18245754.txt inflating: data/MOH/python/Poplar.1919.b18120878.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt inflating: data/MOH/python/CityofWestminster.1907.b18247726.txt inflating: data/MOH/python/CityofWestminster.1906.b18247714.txt inflating: data/MOH/python/CityofWestminster.1903.b18247684.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1903.b1824578x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1938.b18246102.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1960.b18246321.txt inflating: data/MOH/python/CityofWestminster.1920.b18247829.txt inflating: data/MOH/python/CityofWestminster.1945.b1824807x.txt inflating: data/MOH/python/CityofWestminster.1904.b18247696.txt inflating: data/MOH/python/Westminster.1898.b19874340.txt inflating: data/MOH/python/Westminster.1900.b19823228.txt inflating: data/MOH/python/CityofWestminster.1951.b18248135.txt inflating: data/MOH/python/CityofWestminster.1902.b18247672.txt inflating: data/MOH/python/CityofWestminster.1905.b18247702.txt inflating: data/MOH/python/Poplar.1894.b17999157.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1930.b18246023.txt inflating: data/MOH/python/CityofWestminster.1942.b18248044.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1907.b18245821.txt inflating: data/MOH/python/CityofWestminster.1928.b18247908.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1943.b18246151.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1959.b1824631x.txt inflating: data/MOH/python/CityofWestminster.1959.b18248214.txt inflating: data/MOH/python/CityofWestminster.1936.b18247982.txt inflating: data/MOH/python/CityofWestminster.1901.b18247660.txt inflating: data/MOH/python/Poplar.1918.b18120866.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1942.b1824614x.txt inflating: data/MOH/python/Poplar.1898.b18222882.txt inflating: data/MOH/python/CityofWestminster.1969.b18248317.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1909.b18245845.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1913.b18245882.txt inflating: data/MOH/python/CityofWestminster.1908.b18247738.txt inflating: data/MOH/python/CityofWestminster.1966.b18248287.txt inflating: data/MOH/python/CityofWestminster.1971.b18248330.txt inflating: data/MOH/python/CityofWestminster.1922.b18247842.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1964.b18246369.txt inflating: data/MOH/python/Poplar.1898.b18222833.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1957.b18246291.txt inflating: data/MOH/python/CityofWestminster.1911.b18247763.txt inflating: data/MOH/python/CityofWestminster.1910.b18247751.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1956.b1824628x.txt inflating: data/MOH/python/Poplar.1896.b19885040.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1912.b18245870.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1915.b18245900.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1937.b18246096.txt inflating: data/MOH/python/CityofWestminster.1956.b18248184.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1940.b18246126.txt inflating: 
data/MOH/python/CityofWestminster.1970.b18248329.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1927.b18245997.txt inflating: data/MOH/python/CityofWestminster.1925.b18247878.txt inflating: data/MOH/python/CityofWestminster.1941.b18248032.txt inflating: data/MOH/python/CityofWestminster.1952.b18248147.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1925.b18245973.txt inflating: data/MOH/python/CityofWestminster.1924.b18247866.txt inflating: data/MOH/python/CityofWestminster.1953.b18248159.txt inflating: data/MOH/python/CityofWestminster.1912.b18247775.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1953.b18246254.txt inflating: data/MOH/python/Westminster.1889.b20057076.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1914.b18245894.txt inflating: data/MOH/python/CityofWestminster.1954.b18248160.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1941.b18246138.txt inflating: data/MOH/python/Westminster.1861.b18248408.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1963.b18246357.txt inflating: data/MOH/python/CityofWestminster.1948.b1824810x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1910.b18245857.txt inflating: data/MOH/python/CityofWestminster.1949.b18248111.txt inflating: data/MOH/python/CityofWestminster.1913.b18247787.txt inflating: data/MOH/python/Westminster.1859.b1824838x.txt inflating: data/MOH/python/Westminster.1858.b18248378.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1922.b18245948.txt inflating: data/MOH/python/CityofWestminster.1931.b18247933.txt inflating: data/MOH/python/Poplar.1893.b17950454.txt inflating: data/MOH/python/Westminster.1893.b18018312.txt inflating: data/MOH/python/Westminster.1899.b18223011.txt inflating: data/MOH/python/CityofWestminster.1967.b18248299.txt inflating: data/MOH/python/CityofWestminster.1964.b18248263.txt inflating: data/MOH/python/CityofWestminster.1917.b18247817.txt inflating: data/MOH/python/CityofWestminster.1915.b18247805.txt inflating: data/MOH/python/CityofWestminster.1926.b1824788x.txt inflating: data/MOH/python/Westminster.1860.b18248391.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1955.b18246278.txt inflating: data/MOH/python/CityofWestminster.1965.b18248275.txt inflating: data/MOH/python/PoplarDistrictBowandStratford.1900.b18245730.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1932.b18246047.txt inflating: data/MOH/python/CityofWestminster.1940.b18248020.txt inflating: data/MOH/python/Westminster.1894.b18018324.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1926.b18245985.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1952.b18246242.txt inflating: data/MOH/python/CityofWestminster.1914.b18247799.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1936.b18246084.txt inflating: data/MOH/python/CityofWestminster.1957.b18248196.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1908.b18245833.txt inflating: data/MOH/python/CityofWestminster.1927.b18247891.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1911.b18245869.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1928.b1824600x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1962.b18246345.txt inflating: data/MOH/python/Westminster.1896.b18038207.txt inflating: data/MOH/python/Poplar.1897.b18222869.txt inflating: data/MOH/python/Westminster.1897.b19874352.txt inflating: data/MOH/python/Westminster.1858.b18248366.txt inflating: data/MOH/python/CityofWestminster.1955.b18248172.txt inflating: 
data/MOH/python/CityofWestminster.1963.b18248251.txt inflating: data/MOH/python/Poplar.1916.b18120854.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1923.b1824595x.txt inflating: data/MOH/python/Westminster.1895.b19874364.txt inflating: data/MOH/python/Westminster.1888.b20057064.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1949.b18246217.txt inflating: data/MOH/python/PoplarandBromley.1895.b18245742.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1917.b18245912.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1933.b18246059.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1924.b18245961.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1954.b18246266.txt inflating: data/MOH/python/CityofWestminster.1930.b18247921.txt inflating: data/MOH/python/CityofWestminster.1962.b1824824x.txt inflating: data/MOH/python/CityofWestminster.1923.b18247854.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1929.b18246011.txt inflating: data/MOH/python/CityofWestminster.1958.b18248202.txt inflating: data/MOH/python/CityofWestminster.1937.b18247994.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1921.b18245936.txt inflating: data/MOH/python/CityofWestminster.1938.b18248007.txt inflating: data/MOH/python/CityofWestminster.1947.b18248093.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1950.b18246229.txt inflating: data/MOH/python/Westminster.1891.b2005709x.txt inflating: data/MOH/python/Westminster.1857.b18248342.txt inflating: data/MOH/python/CityofWestminster.1933.b18247957.txt inflating: data/MOH/python/Poplar.1899.b18222894.txt inflating: data/MOH/python/CityofWestminster.1944.b18248068.txt inflating: data/MOH/python/CityofWestminster.1909.b1824774x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1946.b18246187.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1931.b18246035.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1951.b18246230.txt inflating: data/MOH/python/Westminster.1857.b18248354.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1904.b18245791.txt inflating: data/MOH/python/CityofWestminster.1960.b18248226.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1961.b18246333.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1939.b18246114.txt inflating: data/MOH/python/CityofWestminster.1961.b18248238.txt inflating: data/MOH/python/CityofWestminster.1943.b18248056.txt inflating: data/MOH/python/CityofWestminster.1950.b18248123.txt inflating: data/MOH/python/CityofWestminster.1934.b18247969.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1947.b18246199.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1905.b18245808.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1906.b1824581x.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1901.b18245766.txt inflating: data/MOH/python/Westminster.1892.b20057106.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1944.b18246163.txt inflating: data/MOH/python/CityofWestminster.1968.b18248305.txt inflating: data/MOH/python/Poplar.1893.b17997835.txt inflating: data/MOH/python/PoplarMetropolitanBorough.1958.b18246308.txt inflating: data/MOH/python/CityofWestminster.1929.b1824791x.txt inflating: data/MOH/python/CityofWestminster.1939.b18248019.txt inflating: data/MOH/python/CityofWestminster.1935.b18247970.txt inflating: data/MOH/python/Poplar.1896.b19885039.txt
The data are stored in the following folder structure:
data
|___ MOH
|___ python
|____ CityofWestminster.1901.b18247660.txt
|____ ...
The code below:
- collects all .txt files in data/MOH/python
- stores the resulting paths in a list
moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python
We can print the paths to the first ten documents with list slicing: [:10] means get the documents at index positions 0 to 9 (i.e. the first ten items).
print(moh_reports_paths[:10]) # print the first ten items
[PosixPath('data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt'), PosixPath('data/MOH/python/CityofWestminster.1932.b18247945.txt'), PosixPath('data/MOH/python/CityofWestminster.1921.b18247830.txt'), PosixPath('data/MOH/python/PoplarandBromley.1900.b18245754.txt'), PosixPath('data/MOH/python/Poplar.1919.b18120878.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt'), PosixPath('data/MOH/python/CityofWestminster.1907.b18247726.txt'), PosixPath('data/MOH/python/CityofWestminster.1906.b18247714.txt'), PosixPath('data/MOH/python/CityofWestminster.1903.b18247684.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt')]
Once we know where all the files are located, we can create a corpus. To do this, we apply the following steps:
- iterate over the paths to the MOH reports
- read each file and lowercase its content
- tokenize the resulting string
- keep only the alphabetic tokens and add them to the corpus
The general flow of the program is similar to what we've seen before: we create an empty list where we store information from our text collection, in this case all alphabetic tokens.
We use one more Notebook feature, %%time, to print how long the cell took to run.
It could take a few seconds for the cell to run, so please be a bit patient:
%%time
corpus = [] # initialize an empty list where we will store the MOH reports
for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths
    text_lower = open(p).read().lower() # read the text file and lowercase the string
    tokens = wordpunct_tokenize(text_lower) # tokenize the string
    for token in tokens: # iterate over the tokens
        if token.isalpha(): # test if the token only contains alphabetic characters
            corpus.append(token) # if the above test evaluates to True, append the token to the corpus list
print('collected', len(corpus),'tokens')
collected 3550169 tokens CPU times: user 2.75 s, sys: 150 ms, total: 2.9 s Wall time: 2.97 s
While this program works perfectly fine, it's not the most efficient code. The example below is a bit better, especially if you're confronted with lots of text files.
- The with open statement is a convenient way of handling the opening and closing of files (it makes sure you don't keep all information in memory, which would slow down the execution of your program).
- A list comprehension does the same work as a for loop but is faster and more concise.
We won't spend too much time discussing list comprehensions. The example below should suffice for now. We write a small program that collects odd numbers. First, we generate a list of numbers with range(10) ...
# see the output of range(10)
list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
... we test for division by 2: % is the modulus operator, which "returns the remainder after dividing the left-hand operand by the right-hand operand". n % 2 evaluates to 0 if a number n can be divided by 2. In Python 0 is equal to False, meaning that if n % 2 evaluates to 0/False we won't append the number to odd. If it evaluates to any other integer, we'll append n to odd.
print(10%2)
print(15%2)
0 1
%%time
# program for finding odd numbers
numbers = range(10) # get numbers 0 to 9
odd = [] # empty list where we store the odd numbers
for k in numbers: # iterate over numbers
    if k % 2: # test if the number is not divisible by 2 (i.e. is odd)
        odd.append(k) # if True, append
print(odd) # print the odd numbers
[1, 3, 5, 7, 9] CPU times: user 422 µs, sys: 444 µs, total: 866 µs Wall time: 486 µs
The same can be achieved with just one line of code using a list comprehension.
%time
odd = [k for k in range(10) if k % 2]
print(odd)
CPU times: user 3 µs, sys: 1 µs, total: 4 µs Wall time: 7.87 µs [1, 3, 5, 7, 9]
To see differences in performance, do the following:
- remove the print() statement
- change range(10) to range(1000000)
A sketch of such a comparison follows below.
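Below is a minimal sketch of the comparison as plain Python (using time.perf_counter instead of the %%time magic; the exact timings will differ on your machine):
import time

n = 1000000 # a large range makes the difference visible

start = time.perf_counter()
odd_loop = [] # collect odd numbers with a plain for loop
for k in range(n):
    if k % 2:
        odd_loop.append(k)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
odd_comp = [k for k in range(n) if k % 2] # the same with a list comprehension
comp_seconds = time.perf_counter() - start

print(odd_loop == odd_comp) # True: both approaches produce the same list
print(loop_seconds, comp_seconds) # the comprehension is usually the faster of the two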
Now returning to our example: run the slightly more efficient code and observe that it produces the same output, just faster!
%%time
corpus = [] # initialize an empty list where we will store the MOH reports
for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths
    with open(p) as in_doc: # make sure to close the document after opening it
        tokens = wordpunct_tokenize(in_doc.read().lower())
        corpus.extend([t for t in tokens if t.isalpha()]) # list comprehension
print('collected', len(corpus),'tokens') # print number of tokens collected
collected 3550169 tokens CPU times: user 2.44 s, sys: 150 ms, total: 2.59 s Wall time: 2.62 s
After collecting all tokens in a list we can convert this to another data type: an NLTK Text object. The cell below shows the results of the conversion.
print(type(corpus))
nltk_corpus = nltk.text.Text(corpus) # convert the list of tokens to a nltk.text.Text object
print(type(nltk_corpus))
<class 'list'> <class 'nltk.text.Text'>
Why is this useful? Well, the NLTK Text object comes with many useful methods for corpus exploration. To inspect all the tools attached to a Text object, apply the help() function to nltk_corpus (or help(nltk.text.Text), which does the same trick). You have to scroll down a bit (ignore all methods starting with __) to inspect the class methods.
help(nltk_corpus) # show methods attached to the nltk.text.Text object or nltk_corpus variable
Help on Text in module nltk.text object: class Text(builtins.object) | Text(tokens, name=None) | | A wrapper around a sequence of simple (string) tokens, which is | intended to support initial exploration of texts (via the | interactive console). Its methods perform a variety of analyses | on the text's contexts (e.g., counting, concordancing, collocation | discovery), and display the results. If you wish to write a | program which makes use of these analyses, then you should bypass | the ``Text`` class, and use the appropriate analysis function or | class directly instead. | | A ``Text`` is typically initialized from a given document or | corpus. E.g.: | | >>> import nltk.corpus | >>> from nltk.text import Text | >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt')) | | Methods defined here: | | __getitem__(self, i) | | __init__(self, tokens, name=None) | Create a Text object. | | :param tokens: The source text. | :type tokens: sequence of str | | __len__(self) | | __repr__(self) | Return repr(self). | | __str__(self) | Return str(self). | | __unicode__ = __str__(self) | | collocations(self, num=20, window_size=2) | Print collocations derived from the text, ignoring stopwords. | | :seealso: find_collocations | :param num: The maximum number of collocations to print. | :type num: int | :param window_size: The number of tokens spanned by a collocation (default=2) | :type window_size: int | | common_contexts(self, words, num=20) | Find contexts where the specified words appear; list | most frequent common contexts first. | | :param word: The word used to seed the similarity search | :type word: str | :param num: The number of words to generate (default=20) | :type num: int | :seealso: ContextIndex.common_contexts() | | concordance(self, word, width=79, lines=25) | Prints a concordance for ``word`` with the specified context window. | Word matching is not case-sensitive. | | :param word: The target word | :type word: str | :param width: The width of each line, in characters (default=80) | :type width: int | :param lines: The number of lines to display (default=25) | :type lines: int | | :seealso: ``ConcordanceIndex`` | | concordance_list(self, word, width=79, lines=25) | Generate a concordance for ``word`` with the specified context window. | Word matching is not case-sensitive. | | :param word: The target word | :type word: str | :param width: The width of each line, in characters (default=80) | :type width: int | :param lines: The number of lines to display (default=25) | :type lines: int | | :seealso: ``ConcordanceIndex`` | | count(self, word) | Count the number of times this word appears in the text. | | dispersion_plot(self, words) | Produce a plot showing the distribution of the words through the text. | Requires pylab to be installed. | | :param words: The words to be plotted | :type words: list(str) | :seealso: nltk.draw.dispersion_plot() | | findall(self, regexp) | Find instances of the regular expression in the text. | The text is a list of tokens, and a regexp pattern to match | a single token must be surrounded by angle brackets. E.g. | | >>> print('hack'); from nltk.book import text1, text5, text9 | hack... 
| >>> text5.findall("<.*><.*><bro>") | you rule bro; telling you bro; u twizted bro | >>> text1.findall("<a>(<.*>)<man>") | monied; nervous; dangerous; white; white; white; pious; queer; good; | mature; white; Cape; great; wise; wise; butterless; white; fiendish; | pale; furious; better; certain; complete; dismasted; younger; brave; | brave; brave; brave | >>> text9.findall("<th.*>{3,}") | thread through those; the thought that; that the thing; the thing | that; that that thing; through these than through; them that the; | through the thick; them that they; thought that the | | :param regexp: A regular expression | :type regexp: str | | generate(self, words) | Issues a reminder to users following the book online | | index(self, word) | Find the index of the first occurrence of the word in the text. | | plot(self, *args) | See documentation for FreqDist.plot() | :seealso: nltk.prob.FreqDist.plot() | | readability(self, method) | | similar(self, word, num=20) | Distributional similarity: find other words which appear in the | same contexts as the specified word; list most similar words first. | | :param word: The word used to seed the similarity search | :type word: str | :param num: The number of words to generate (default=20) | :type num: int | :seealso: ContextIndex.similar_words() | | unicode_repr = __repr__(self) | | vocab(self) | :seealso: nltk.prob.FreqDist | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined)
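If the full help() output feels overwhelming, a quicker way to get an overview is to list only the public method names with Python's built-in dir():
print([m for m in dir(nltk_corpus) if not m.startswith('_')]) # list the public methods of the Text object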
Let's have a closer look at .concordance(). According to the official documentation, this method "prints a concordance for word with the specified context window. Word matching is not case-sensitive."
It takes multiple arguments:
- word: the query term
- width: the context window, i.e. the number of characters printed
- lines: the number of lines (i.e. KWIC examples) to show
The first line of the output states the total number of hits for the query term (Displaying * of * matches:).
The example code below prints the context of the word "poor".
nltk_corpus.concordance('poor',width=100,lines=10) # print the context of 'poor', window = 100 characters
Displaying 10 of 1112 matches: mes difficult to arrange but the friends of the poor and the charity organisation society can often l analysis in one case the milk proved to be of poor quality the work is carried out in the ordinary fat fair quality between per cent and per cent poor quality between per cent and per cent adulterat able i district total good quality fair quality poor quality adulterated no percent no percent no pe in which the applicant is already in receipt of poor law relief or is considered ought to be referen ing cases previously notified under to to total poor law institutions sanatoria poor law institution der to to total poor law institutions sanatoria poor law institutions sanatoria pulmonary males fema g to tuberculosis and the treatment of cases in poor law and other hospitals advance in social well y in which the fat was between and per cent and poor or inferior quality in which the fat was betwee ood quality no per cent fair quality no percent poor quality no percent adulterated no percent south
Use KWIC analysis to compare the word "poor" in MOH reports from the City of Westminster and Poplar. Using everything you learned in the previous Notebook:
- select the reports for each borough
- create a separate Text object for each borough and compare the concordances
# Enter code here
While KWIC analysis is useful for investigating the context of words, it is a method that doesn't scale well: it helps with the close reading of around 100 examples, but when examples run in the thousands it becomes more difficult. Collocations can help quantify the semantics of a term, or how the meaning of words differs between corpora or subsamples of a corpus.
Collocations, as explained in the AntConc section, are often multi-word expressions containing tokens that tend to co-occur, such as "New York City" (the span between words can be longer, they don't have to appear next to each other).
The NLTK Text object has a collocations() method. Below we print and explain the documentation.
collocations(self, num=20, window_size=2)
Print collocations derived from the text, ignoring stopwords.
It has the following parameters:
- num: the maximum number of collocations to print (if not specified it will print 20)
- window_size: the number of tokens spanned by a collocation (default=2)
If window_size=2, collocations will only include bigrams (words occurring next to each other). But sometimes we wish to include longer intervals, to make the co-occurrence of words within a broader window more visible; this allows us to go beyond multiword expressions and study the distribution of words in a corpus more generally. For example, we could check whether "men" and "women" are discussed in each other's context (within a span of 10), even if they don't appear next to each other.
%%time
nltk_corpus.collocations(window_size=2)
per cent; public health; county council; london county; medical officer; scarlet fever; whooping cough; males females; local government; legal proceedings; dwelling houses; poplar bromley; small pox; ice cream; sub district; government board; child welfare; city council; death rate; bromley bow CPU times: user 9.61 s, sys: 76.3 ms, total: 9.68 s Wall time: 9.72 s
%%time
nltk_corpus.collocations(window_size=5)
street street; per cent; public health; county council; months months; london county; medical officer; scarlet fever; males females; poplar bromley; london council; road street; bromley bow; see page; road road; whooping cough; officer health; medical health; poplar bow; small pox CPU times: user 29.2 s, sys: 389 ms, total: 29.6 s Wall time: 30.5 s
While the .collocations() method provides a convenient tool for obtaining collocations from a corpus, its functionality remains rather limited. Below we will inspect the collocation functions of NLTK in more detail, giving you more power as well as precision.
Before we start, we import all the required tools that nltk.collocations provides. This is handled by import *: similar to a wildcard, it matches and loads all functions from nltk.collocations.
import nltk
from nltk.collocations import *
We have to select an association measure to compute the "strength" with which two tokens are "attracted" to each other. In general, collocations are words that are likely to appear together (within a specific context or window size). This explains why "red wine" is a strong collocation and "the wine" less so.
NLTK provides us with different measures, which you can print and investigate in more detail. Many of the functions refer to the classic NLP handbook by Manning and Schütze, "Foundations of Statistical Natural Language Processing".
bigram_measures = nltk.collocations.BigramAssocMeasures()
help(bigram_measures)
Help on BigramAssocMeasures in module nltk.metrics.association object: class BigramAssocMeasures(NgramAssocMeasures) | A collection of bigram association measures. Each association measure | is provided as a function with three arguments:: | | bigram_score_fn(n_ii, (n_ix, n_xi), n_xx) | | The arguments constitute the marginals of a contingency table, counting | the occurrences of particular events in a corpus. The letter i in the | suffix refers to the appearance of the word in question, while x indicates | the appearance of any word. Thus, for example: | | n_ii counts (w1, w2), i.e. the bigram being scored | n_ix counts (w1, *) | n_xi counts (*, w2) | n_xx counts (*, *), i.e. any bigram | | This may be shown with respect to a contingency table:: | | w1 ~w1 | ------ ------ | w2 | n_ii | n_oi | = n_xi | ------ ------ | ~w2 | n_io | n_oo | | ------ ------ | = n_ix TOTAL = n_xx | | Method resolution order: | BigramAssocMeasures | NgramAssocMeasures | builtins.object | | Class methods defined here: | | chi_sq(n_ii, n_ix_xi_tuple, n_xx) from abc.ABCMeta | Scores bigrams using chi-square, i.e. phi-sq multiplied by the number | of bigrams, as in Manning and Schutze 5.3.3. | | fisher(*marginals) from abc.ABCMeta | Scores bigrams using Fisher's Exact Test (Pedersen 1996). Less | sensitive to small counts than PMI or Chi Sq, but also more expensive | to compute. Requires scipy. | | phi_sq(*marginals) from abc.ABCMeta | Scores bigrams using phi-square, the square of the Pearson correlation | coefficient. | | ---------------------------------------------------------------------- | Static methods defined here: | | dice(n_ii, n_ix_xi_tuple, n_xx) | Scores bigrams using Dice's coefficient. | | ---------------------------------------------------------------------- | Data and other attributes defined here: | | __abstractmethods__ = frozenset() | | ---------------------------------------------------------------------- | Class methods inherited from NgramAssocMeasures: | | jaccard(*marginals) from abc.ABCMeta | Scores ngrams using the Jaccard index. | | likelihood_ratio(*marginals) from abc.ABCMeta | Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4. | | pmi(*marginals) from abc.ABCMeta | Scores ngrams by pointwise mutual information, as in Manning and | Schutze 5.4. | | poisson_stirling(*marginals) from abc.ABCMeta | Scores ngrams using the Poisson-Stirling measure. | | student_t(*marginals) from abc.ABCMeta | Scores ngrams using Student's t test with independence hypothesis | for unigrams, as in Manning and Schutze 5.3.1. | | ---------------------------------------------------------------------- | Static methods inherited from NgramAssocMeasures: | | mi_like(*marginals, **kwargs) | Scores ngrams using a variant of mutual information. The keyword | argument power sets an exponent (default 3) for the numerator. No | logarithm of the result is calculated. | | raw_freq(*marginals) | Scores ngrams by their frequency | | ---------------------------------------------------------------------- | Data descriptors inherited from NgramAssocMeasures: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined)
In our example we use pointwise mutual information (pmi) to compute collocations.
help(bigram_measures.pmi)
Help on method pmi in module nltk.metrics.association: pmi(*marginals) method of abc.ABCMeta instance Scores ngrams by pointwise mutual information, as in Manning and Schutze 5.4.
pmi is a rather straightforward metric. In the case of bigrams (i.e. collocations of length two and window size two), we need:
- n: the total number of bigrams in the corpus (3435 in this toy example)
- the probability of a and b appearing as a bigram: if the bigram (a, b) occurs 10 times, the probability P(a,b) = 10/3435 = 0.0029
- the individual probabilities of a and b across the whole corpus: for example, if a appears 30 times and b 45 times, their respective probabilities are P(a) = 30/3435 = 0.0087 and P(b) = 45/3435 = 0.0131. We then multiply P(a) and P(b) to obtain the denominator 0.0087 * 0.0131 = 0.0001
pmi is the base-2 logarithm of the ratio between the bigram probability and this denominator: pmi = log2(P(a,b) / (P(a) * P(b))). The cell below computes it for the toy example.
from numpy import log2
nom = 10/3435
denom = (30/3435) * (45/3435)
pmi = log2(nom/denom)
pmi
4.6692787866546315
To rank collocations by their pmi scores, we first apply the .from_words() method to the nltk_corpus (or any list of tokens). The result of this operation is stored in a finder object, which we can subsequently use to rank and print collocations.
Note that the results below look somewhat strange; these aren't very meaningful collocates.
finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.nbest(bigram_measures.pmi, 10)
[('abso', 'lutely'), ('acidi', 'lacfc'), ('acquires', 'setiological'), ('adolph', 'mussi'), ('adolphus', 'massie'), ('adultorated', 'sanples'), ('adver', 'tising'), ('aeql', 'rrhage'), ('alathilde', 'christoffersen'), ('alio', 'wances')]
These results are rather spurious. Why? If, for example, a and b both appear only once and always next to each other, the pmi score will be high. But such collocations aren't meaningful collocations, more a rare artefact of the data.
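To see why, plug such a rare pair into the toy calculation from above (keeping the hypothetical corpus size of 3435 bigrams): if a and b each occur exactly once, and only together, the pmi comes out far higher than the 4.67 we computed for the more frequent pair.
from numpy import log2

# a and b occur once each, always together, in a corpus of 3435 bigrams
p_ab = 1 / 3435
p_a = p_b = 1 / 3435
print(log2(p_ab / (p_a * p_b))) # roughly 11.75, much higher than the 4.67 above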
To solve this problem, we filter by ngram frequency, in our case removing all bigrams that appear fewer than 3 times with the .apply_freq_filter() function.
help(finder.apply_freq_filter)
Help on method apply_freq_filter in module nltk.collocations: apply_freq_filter(min_freq) method of nltk.collocations.BigramCollocationFinder instance Removes candidate ngrams which have frequency less than min_freq.
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)
[('bowers', 'gifford'), ('carrie', 'simuelson'), ('culex', 'pipiens'), ('heatherfield', 'ascot'), ('holmes', 'godson'), ('lehman', 'ashmead'), ('locum', 'tenens'), ('nemine', 'contradicente'), ('quinton', 'polyclinic'), ('rhesus', 'incompatibility')]
Now many names appear. We can be even stricter and use a higher threshold for filtering.
finder.apply_freq_filter(20)
finder.nbest(bigram_measures.pmi, 10)
[('braxton', 'hicks'), ('herman', 'olsen'), ('posterior', 'basal'), ('arterio', 'sclerosis'), ('brucella', 'abortus'), ('burnishers', 'diamond'), ('pillows', 'bolsters'), ('carvers', 'gilders'), ('sweetmeats', 'cosaques'), ('bookbinding', 'lithographers')]
It is also possible to change the window size, but the larger the window size, the longer the computation takes.
finder = BigramCollocationFinder.from_words(nltk_corpus, window_size = 5)
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 10)
[('tr', 'tr'), ('felix', 'twede'), ('barmen', 'potmen'), ('harbott', 'chauffeur'), ('axel', 'welin'), ('betha', 'nicholson'), ('malcolm', 'donaldson'), ('roasters', 'grinders'), ('spasmodic', 'stridulous'), ('soapmaking', 'lubricating')]
Lastly, you can focus on collocations that contain a specific token, for example, getting all collocations with the token "poor". We have to pass a function to .apply_ngram_filter(). At this point, you shouldn't worry about the code, only understand how to adapt it (see exercise below).
def token_filter_poor(*w): # filter function: ngrams for which this returns True are removed
    return 'poor' not in w # keep only ngrams that contain the token 'poor'
finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.apply_freq_filter(3)
finder.apply_ngram_filter(token_filter_poor)
finder.nbest(bigram_measures.pmi, 10)
[('poor', 'attenders'), ('poor', 'palatines'), ('poor', 'genl'), ('poor', 'law'), ('compositions', 'poor'), ('poor', 'sufferers'), ('poor', 'quality'), ('sleep', 'poor'), ('poor', 'visibility'), ('failures', 'poor')]
Copy-paste the above code and create a program that prints the first 10 collocations with the word "women".
# Enter code here
In the last section of this Notebook, we explore computational methods for finding words that characterize a collection: we try to select tokens (or, more generally, features) that distinguish a particular set of documents vis-a-vis another corpus.
Such comparisons help us determine what type of language use was distinctive for a particular group (such as a political party), period or location. We continue with the example of the MOH reports, but compare the language of different boroughs: the affluent Westminster with the industrial, and considerably poorer, Poplar.
The code below should look familiar, but we made a few changes:
- we create two lists, corpus and labels. In the former we store our text documents (each item in the list is one text file/string); the latter contains labels, 0 for Poplar and 1 for Westminster. We collect these labels in parallel with the texts, i.e. if the first item in corpus is a text from Westminster, the first label in labels is 1.
- we use with open to automatically close each document after opening it
- we use an if else statement: if the string westminster appears in the file name we add 1 to labels, otherwise 0.
%%time
import nltk # import natural language toolkit
from pathlib import Path # import Path object from pathlib
from nltk.tokenize import wordpunct_tokenize # import word_tokenize function from nltk.tokenize
moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python
corpus = [] # save corpus here
labels = [] # save labels here
for r in moh_reports_paths: # iterate over documents
    with open(r) as in_doc: # open the document (also takes care of closing it later)
        corpus.append(in_doc.read().lower()) # append the lowercased document to corpus
    if 'westminster' in r.name.lower(): # check if westminster appears in the file name
        labels.append(1) # if so, append 1 to labels
    else: # if not
        labels.append(0) # append 0 to labels
CPU times: user 247 ms, sys: 61.6 ms, total: 308 ms Wall time: 368 ms
Each document should correspond to one label. The lists labels and corpus should have equal length.
print(len(labels),len(corpus))
159 159
print(len(labels) == len(corpus))
True
As said earlier, we collect labels for each document: 1 for Westminster and 0 for Poplar (it could also be the reverse, of course!). It is important that each label corresponds correctly with a text file in corpus.
print(labels[:10])
[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
We can check this by printing the first hundred characters of the first document (labelled as 0)...
Note that corpus[0] returns the first document, from which we slice the first hundred characters with [:100].
corpus[0][:100]
"l-'&rary pop s-/ metropolitan borough of poplar . abridged interim report on the health of the borou"
... and the second document (labelled as 1).
corpus[1][:100]
'city of westminster. report of the medical officer of health for the year . - 1932 andrew j. shinnie'
Checking your code by eyeballing the output is always good practice. Even if your code runs, it could still contain bugs, which are commonly referred to as "semantic errors".
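A lightweight way to go beyond eyeballing is a quick consistency check. The sketch below reuses the moh_reports_paths and labels variables defined above and raises an error if a label does not match its file name:
# sanity check: every document labelled 1 should come from a Westminster file
for path, label in zip(moh_reports_paths, labels):
    is_westminster = 'westminster' in path.name.lower()
    assert label == int(is_westminster), f"label mismatch for {path.name}"
print('labels and file names are consistent')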
To obtain the most distinctive words (for the reports from both Westminster and Poplar) we use an external library, TextFeatureSelection. Python has a very rich and fast-evolving ecosystem: if you have a problem, it's very likely someone wrote a library to help you with it. We first have to install this package (it's not yet part of Colab).
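If the library is missing from your environment, you can install it from the notebook itself (this assumes the package is published on PyPI under the name TextFeatureSelection):
!pip install TextFeatureSelection # install the library, only needed once per environment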
import TextFeatureSelection
Now we can apply the TextFeatureSelection library. The documentation is available here.
Computing the features requires only a few lines of code. You only need to provide:
- the text documents via the input_doc_list parameter
- the labels via the target parameter
TextFeatureSelection then uses various metrics to compute the extent to which words are associated with a label. The output of this process is a pandas.DataFrame. Working with tabular data and data frames will be extensively discussed in Part II of this course. For now, we show you how to sort information and get the most distinctive words or features.
help(TextFeatureSelection)
Help on module TextFeatureSelection: NAME TextFeatureSelection - Text features selection. CLASSES builtins.object TextFeatureSelection TextFeatureSelectionGA class TextFeatureSelection(builtins.object) | TextFeatureSelection(target, input_doc_list, stop_words=None, metric_list=['MI', 'CHI', 'PD', 'IG']) | | Compute score for each word to identify and select words which result in better model performance. | | Parameters | ---------- | target : list object which has categories of labels. for more than one category, no need to dummy code and instead provide label encoded values as list object. | | input_doc_list : List object which has text. each element of list is text corpus. No need to tokenize, as text will be tokenized in the module while processing. target and input_doc_list should have same length. | | stop_words : Words for which you will not want to have metric values calculated. Default is blank. | | metric_list : List object which has the metric to be calculated. There are 4 metric which are being computed as 'MI','CHI','PD','IG'. you can specify one or more than one as a list object. Default is ['MI','CHI','PD','IG']. | | Returns | ------- | values_df : pandas dataframe with results. unique words and score from the desried metric. | | Examples | -------- | The following example shows how to retrieve the 5 most informative | features in the Friedman #1 dataset. | | >>> from sklearn.feature_selection.text import TextFeatureSelection | | >>> #Multiclass classification problem | >>> input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?'] | >>> target=['Positive','Positive','Negative','Neutral','Neutral'] | >>> result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore() | >>> print(result_df) | | word list word occurence count Proportional Difference Mutual Information Chi Square Information Gain | 0 am 1 1.0 0.916291 1.875000 0.089257 | 1 an 1 1.0 0.916291 1.875000 0.089257 | 2 at 1 1.0 1.609438 5.000000 0.000000 | 3 awesome 1 1.0 0.916291 1.875000 0.089257 | 4 back 1 1.0 1.609438 5.000000 0.000000 | 5 chips 1 1.0 0.916291 1.875000 0.089257 | 6 difficult 1 1.0 1.609438 5.000000 0.000000 | 7 do 1 1.0 0.916291 1.875000 0.089257 | 8 had 2 1.0 0.223144 0.833333 0.008164 | 9 happy 1 1.0 0.916291 1.875000 0.089257 | 10 home 1 1.0 1.609438 5.000000 0.000000 | 11 is 1 1.0 1.609438 5.000000 0.000000 | 12 just 2 1.0 0.223144 0.833333 0.008164 | 13 lunch 1 1.0 0.916291 1.875000 0.089257 | 14 stayed 1 1.0 1.609438 5.000000 0.000000 | 15 terrain 1 1.0 1.609438 5.000000 0.000000 | 16 this 1 1.0 1.609438 5.000000 0.000000 | 17 to 1 1.0 1.609438 5.000000 0.000000 | 18 trek 1 1.0 1.609438 5.000000 0.000000 | 19 very 2 1.0 0.916291 2.222222 0.008164 | 20 want 1 1.0 0.916291 1.875000 0.089257 | 21 weekend 1 1.0 0.916291 1.875000 0.089257 | 22 wish 1 1.0 1.609438 5.000000 0.000000 | 23 you 1 1.0 0.916291 1.875000 0.089257 | | | | >>> #Binary classification | >>> input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars'] | >>> target=[1,1,0,1] | >>> result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore() | >>> print(result_df) | word list word occurence count Proportional Difference Mutual Information Chi Square Information Gain | 0 algebra 1 -1.0 1.386294 4.000000 0.0 | 1 am 2 1.0 -inf 1.333333 0.0 | 2 cannot 1 -1.0 
1.386294 4.000000 0.0 | 3 content 1 1.0 -inf 0.444444 0.0 | 4 go 1 1.0 -inf 0.444444 0.0 | 5 having 1 1.0 -inf 0.444444 0.0 | 6 learn 1 -1.0 1.386294 4.000000 0.0 | 7 learning 1 -1.0 1.386294 4.000000 0.0 | 8 life 1 1.0 -inf 0.444444 0.0 | 9 linear 1 -1.0 1.386294 4.000000 0.0 | 10 location 1 1.0 -inf 0.444444 0.0 | 11 machine 1 -1.0 1.386294 4.000000 0.0 | 12 mars 1 1.0 -inf 0.444444 0.0 | 13 my 1 1.0 -inf 0.444444 0.0 | 14 of 1 1.0 -inf 0.444444 0.0 | 15 the 1 1.0 -inf 0.444444 0.0 | 16 this 1 1.0 -inf 0.444444 0.0 | 17 time 1 1.0 -inf 0.444444 0.0 | 18 to 1 1.0 -inf 0.444444 0.0 | 19 want 1 1.0 -inf 0.444444 0.0 | 20 with 1 1.0 -inf 0.444444 0.0 | 21 without 1 -1.0 1.386294 4.000000 0.0 | 22 you 1 -1.0 1.386294 4.000000 0.0 | | | Notes | ----- | Chi-square (CHI): | - It measures the lack of independence between t and c. | - It has a natural value of zero if t and c are independent. If it is higher, then term is dependent | - It is not reliable for low-frequency terms | - For multi-class categories, we will calculate X^2 value for all categories and will take the Max(X^2) value across all categories at the word level. | - It is not to be confused with chi-square test and the values returned are not significance values | | Mutual information (MI): | - Rare terms will have a higher score than common terms. | - For multi-class categories, we will calculate MI value for all categories and will take the Max(MI) value across all categories at the word level. | | Proportional difference (PD): | - How close two numbers are from becoming equal. | - Helps find unigrams that occur mostly in one class of documents or the other | - We use the positive document frequency and negative document frequency of a unigram as the two numbers. | - If a unigram occurs predominantly in positive documents or predominantly in negative documents then the PD will be close to 1, however if distribution of unigram is almost similar, then PD is close to 0. | - We can set a threshold to decide which words to be included | - For multi-class categories, we will calculate PD value for all categories and will take the Max(PD) value across all categories at the word level. | | Information gain (IG): | - It gives discriminatory power of the word | | References | ---------- | Yiming Yang and Jan O. Pedersen "A Comparative Study on Feature Selection in Text Categorization" | http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=E5CC43FE63A1627AB4C0DBD2061FE4B9?doi=10.1.1.32.9956&rep=rep1&type=pdf | | Christine Largeron, Christophe Moulin, Mathias Géry "Entropy based feature selection for text categorization" | https://hal.archives-ouvertes.fr/hal-00617969/document | | Mondelle Simeon, Robert J. Hilderman "Categorical Proportional Difference: A Feature Selection Method for Text Categorization" | https://pdfs.semanticscholar.org/6569/9f0e1159a40042cc766139f3dfac2a3860bb.pdf | | Tim O`Keefe and Irena Koprinska "Feature Selection and Weighting Methods in Sentiment Analysis" | https://www.researchgate.net/publication/242088860_Feature_Selection_and_Weighting_Methods_in_Sentiment_Analysis | | Methods defined here: | | __init__(self, target, input_doc_list, stop_words=None, metric_list=['MI', 'CHI', 'PD', 'IG']) | Initialize self. See help(type(self)) for accurate signature. 
| | getScore(self) | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) class TextFeatureSelectionGA(builtins.object) | TextFeatureSelectionGA(generations=500, population=50, prob_crossover=0.9, prob_mutation=0.1, percentage_of_token=50, runtime_minutes=120) | | Use genetic algorithm for selecting text tokens which give best classification results | | Genetic Algorithm Parameters | ---------- | | generations : Number of generations to run genetic algorithm. 500 as deafult, as used in the original paper | | population : Number of individual chromosomes. 50 as default, as used in the original paper | | prob_crossover : Probability of crossover. 0.9 as default, as used in the original paper | | prob_mutation : Probability of mutation. 0.1 as default, as used in the original paper | | percentage_of_token : Percentage of word features to be included in a given chromosome. | 50 as default, as used in the original paper. | | runtime_minutes : Number of minutes to run the algorithm. This is checked in between generations. | At start of each generation it is checked if runtime has exceeded than alloted time. | If case run time did exceeds provided limit, best result from generations executed so far is given as output. | Default is 2 hours. i.e. 120 minutes. | | References | ---------- | Noria Bidi and Zakaria Elberrichi "Feature Selection For Text Classification Using Genetic Algorithms" | https://ieeexplore.ieee.org/document/7804223 | | Methods defined here: | | __init__(self, generations=500, population=50, prob_crossover=0.9, prob_mutation=0.1, percentage_of_token=50, runtime_minutes=120) | Initialize self. See help(type(self)) for accurate signature. | | getGeneticFeatures(self, doc_list, label_list, model=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, | intercept_scaling=1, l1_ratio=None, max_iter=100, | multi_class='auto', n_jobs=None, penalty='l2', | random_state=None, solver='lbfgs', tol=0.0001, verbose=0, | warm_start=False), model_metric='f1', avrg='binary', analyzer='word', min_df=2, max_df=1.0, stop_words=None, tokenizer=None, token_pattern='(?u)\\b\\w\\w+\\b', lowercase=True) | Data Parameters | ---------- | doc_list : text documents in a python list. | Example: ['i had dinner','i am on vacation','I am happy','Wastage of time'] | | label_list : labels in a python list. | Example: ['Neutral','Neutral','Positive','Negative'] | | | Modelling Parameters | ---------- | model : Set a model which has .fit function to train model and .predict function to predict for test data. | This model should also be able to train classifier using TfidfVectorizer feature. | Default is set as Logistic regression in sklearn | | model_metric : Classifier cost function. Select one from: ['f1','precision','recall']. | Default is F1 | | avrg : Averaging used in model_metric. Select one from ['micro', 'macro', 'samples','weighted', 'binary']. | For binary classification, default is 'binary' and for multi-class classification, default is 'micro'. | | | TfidfVectorizer Parameters | ---------- | analyzer : {'word', 'char', 'char_wb'} or callable, default='word' | Whether the feature should be made of word or character n-grams. | Option 'char_wb' creates character n-grams only from text inside | word boundaries; n-grams at the edges of words are padded with space. 
| | min_df : float or int, default=2 | When building the vocabulary ignore terms that have a document | frequency strictly lower than the given threshold. This value is also | called cut-off in the literature. | If float in range of [0.0, 1.0], the parameter represents a proportion | of documents, integer absolute counts. | This parameter is ignored if vocabulary is not None. | | max_df : float or int, default=1.0 | When building the vocabulary ignore terms that have a document | frequency strictly higher than the given threshold (corpus-specific | stop words). | If float in range [0.0, 1.0], the parameter represents a proportion of | documents, integer absolute counts. | This parameter is ignored if vocabulary is not None. | | stop_words : {'english'}, list, default=None | If a string, it is passed to _check_stop_list and the appropriate stop | list is returned. 'english' is currently the only supported string | value. | There are several known issues with 'english' and you should | consider an alternative (see :ref:`stop_words`). | | If a list, that list is assumed to contain stop words, all of which | will be removed from the resulting tokens. | Only applies if ``analyzer == 'word'``. | | If None, no stop words will be used. max_df can be set to a value | in the range [0.7, 1.0) to automatically detect and filter stop | words based on intra corpus document frequency of terms. | | tokenizer : callable, default=None | Override the string tokenization step while preserving the | preprocessing and n-grams generation steps. | Only applies if ``analyzer == 'word'`` | | token_pattern : str, default=r"(?u)\b\w\w+\b" | Regular expression denoting what constitutes a "token", only used | if ``analyzer == 'word'``. The default regexp selects tokens of 2 | or more alphanumeric characters (punctuation is completely ignored | and always treated as a token separator). | | If there is a capturing group in token_pattern then the | captured group content, not the entire match, becomes the token. | At most one capturing group is permitted. | | lowercase : bool, default=True | Convert all characters to lowercase before tokenizing. | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) FILE /usr/local/lib/python3.7/site-packages/TextFeatureSelection.py
from TextFeatureSelection import TextFeatureSelection # import TextFeatureSelection
fsOBJ=TextFeatureSelection(target=labels,input_doc_list=corpus) # compute features
df=fsOBJ.getScore() # get features as a dataframe
df
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
0 | 00 | 103 | -0.009709 | 0.094959 | 2.463282 | 0.004326 |
1 | 000 | 149 | 0.073826 | 0.008605 | 0.150191 | 0.000266 |
2 | 000000 | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
3 | 0001 | 3 | 1.000000 | -inf | 2.595483 | 0.000000 |
4 | 000163 | 1 | 1.000000 | -inf | 0.854210 | 0.000000 |
... | ... | ... | ... | ... | ... | ... |
42232 | ¾gallons | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42233 | ¾ths | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42234 | ægis | 1 | 1.000000 | -inf | 0.854210 | 0.000000 |
42235 | æration | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42236 | œsophagus | 1 | -1.000000 | 0.778445 | 1.185538 | 0.001507 |
42237 rows × 6 columns
A pandas.DataFrame is similar to an Excel spreadsheet. It contains several columns which we can use for selecting and sorting information. In fact, if you are familiar with Excel, you can export the data frame and open it as a spreadsheet. The code below takes care of this.
df.to_excel('data/result_features.xlsx')
If you want to know more about working with DataFrames, consult the following Notebooks.
We use the following columns to select and rank words:
- word occurence count: to remove infrequent words
- Proportional Difference: its sign indicates which borough a word leans towards (positive for Westminster, negative for Poplar, following the variable names in the code below)
- Information Gain (or Chi Square): to rank words by how strongly they distinguish the two boroughs
westminster_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] > 0 )]
westminster_df.sort_values('Information Gain',ascending=False)[:10]
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
21070 | horseferry | 71 | 0.971831 | -3.484235 | 102.313762 | 0.239720 |
41176 | wes | 64 | 0.968750 | -3.380438 | 84.840438 | 0.205713 |
30188 | pimlico | 63 | 0.968254 | -3.364690 | 82.552451 | 0.201071 |
7989 | await | 64 | 0.906250 | -2.281826 | 73.305433 | 0.165213 |
9838 | buckingham | 61 | 0.901639 | -2.233817 | 66.974989 | 0.152432 |
20355 | harrison | 63 | 0.873016 | -1.978396 | 65.767627 | 0.144732 |
33224 | restaurant | 58 | 0.896552 | -2.183386 | 61.025147 | 0.140108 |
18111 | fines | 91 | 0.692308 | -1.093357 | 79.850990 | 0.139906 |
10445 | carpentry | 47 | 0.957447 | -3.071703 | 51.509404 | 0.133109 |
25739 | marshall | 46 | 0.956522 | -3.050197 | 49.861808 | 0.129219 |
poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]
poplar_df.sort_values('Information Gain',ascending=False)[:10]
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
30606 | pop | 59 | -1.000000 | 0.778445 | 110.515890 | 0.184282 |
15282 | dock | 66 | -0.787879 | 0.666327 | 85.911216 | 0.149069 |
22759 | intimations | 92 | -0.478261 | 0.476164 | 68.933936 | 0.146510 |
22037 | india | 67 | -0.761194 | 0.651290 | 82.833472 | 0.144441 |
31245 | procured | 70 | -0.714286 | 0.624294 | 79.780218 | 0.141947 |
41149 | wellington | 86 | -0.511628 | 0.498485 | 66.399455 | 0.131704 |
17897 | ferry | 68 | -0.705882 | 0.619380 | 74.205624 | 0.129299 |
34547 | seamen | 65 | -0.723077 | 0.629409 | 71.698892 | 0.122070 |
14529 | devons | 47 | -1.000000 | 0.778445 | 78.605431 | 0.118591 |
26460 | millwall | 49 | -0.959184 | 0.757825 | 77.262501 | 0.118218 |
poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]
poplar_df.sort_values('Chi Square',ascending=False)[:10]
 | word list | word occurence count | Proportional Difference | Mutual Information | Chi Square | Information Gain |
---|---|---|---|---|---|---|
30606 | pop | 59 | -1.000000 | 0.778445 | 110.515890 | 0.184282 |
9432 | bow | 89 | -0.640449 | 0.580268 | 106.152339 | 0.000000 |
42219 | zymotic | 94 | -0.553191 | 0.525609 | 93.326942 | 0.000000 |
15282 | dock | 66 | -0.787879 | 0.666327 | 85.911216 | 0.149069 |
22037 | india | 67 | -0.761194 | 0.651290 | 82.833472 | 0.144441 |
31245 | procured | 70 | -0.714286 | 0.624294 | 79.780218 | 0.141947 |
14529 | devons | 47 | -1.000000 | 0.778445 | 78.605431 | 0.118591 |
26460 | millwall | 49 | -0.959184 | 0.757825 | 77.262501 | 0.118218 |
33936 | ruston | 46 | -1.000000 | 0.778445 | 76.252152 | 0.114306 |
17897 | ferry | 68 | -0.705882 | 0.619380 | 74.205624 | 0.129299 |