We're now switching focus away from Network Science (for a little bit) and beginning to think about Language Processing instead. In other words, today will be all about learning to parse and make sense of textual data. This ties in nicely with our work on the network of Computational Social Scientists, because papers naturally contain text.
We've looked at the network so far; now let's see if we can include the text as well.
Video Lecture. Intro to Natural Language Processing. Today is all about working with NLTK, so not much lecturing. We will start with a perspective on text analysis by Sune (you will hear him talking about Wikipedia data here and there; everything he says applies to other textual data as well!).
from IPython.display import YouTubeVideo
YouTubeVideo("Ph0EHmFT3n4",width=800, height=450)
Reading: The reading for today is Natural Language Processing with Python (NLPP) Chapter 1, Sections 1, 2, and 3. It's free online.
Exercises: NLPP Chapter 1.
- First, install nltk if it isn't already installed (there are some tips below that I recommend checking out before installing).
- Second, work through Chapter 1. The book is set up as a kind of tutorial with lots of examples for you to work through. I recommend you read the text with an open IPython Notebook and type out the examples that you see. *It becomes much more fun if you add a few variations and see what happens*. Some of those examples might very well be due as assignments (see below the install tips), so those ones should definitely be in a notebook.
Check to see if nltk is installed on your system by typing import nltk in a notebook. If it's not already installed, install it as part of Anaconda by typing conda install nltk at the command prompt. If you don't have the corpora yet, you can download them using a command-line version of the downloader that runs in Python notebooks. In the IPython notebook, run the code
import nltk
nltk.download()
Now you can hit d to download, then type "book" to fetch the collection needed for today's nltk session. Now that everything is up and running, let's get to the actual exercises.
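If you prefer to skip the interactive downloader, you can also fetch the same collection non-interactively (assuming a working internet connection and write access to the default nltk_data directory):

import nltk

# Download the "book" collection (the texts and corpora used in NLPP Chapter 1)
# directly, without the interactive downloader.
nltk.download("book")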
Exercises: NLPP Chapter 1 (the stuff that might be due in an upcoming assignment).
The following exercises from Chapter 1 are what might be due in an assignment later on.
- Try out the concordance method, using another text and a word of your own choosing.
- Also try out the similar and common_contexts methods for a few of your own examples.
- Create your own version of a dispersion plot ("your own version" means another text and a different word).
- Explain in your own words what aspect of language lexical diversity describes.
- Create frequency distributions for text2, including the cumulative frequency plot for the 75 most common words.
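To get you started, here is a minimal sketch of those calls using the book's own texts (text2 and text4 from the nltk.book collection); the exercises ask you to swap in another text and words of your own choosing.

from nltk import FreqDist
from nltk.book import text2, text4

# Concordance: every occurrence of a word together with its surrounding context.
text2.concordance("affection")

# Words that appear in contexts similar to the given word.
text2.similar("affection")

# Contexts shared by two or more words.
text2.common_contexts(["monstrous", "very"])

# Dispersion plot: where in the text do these words occur?
text4.dispersion_plot(["citizens", "democracy", "freedom"])

# Frequency distribution with a cumulative plot of the 75 most common words.
fdist = FreqDist(text2)
fdist.plot(75, cumulative=True)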
Ok. So Chapter 3 in NLPP is all about working with text from the real world: getting text from the internet, cleaning it, tokenizing, and modifying it (e.g. stemming, converting to lower case, etc.) to get the text in shape to work with the NLTK tools you've already learned about.
Video lecture: Short overview of chapter 3 + a few words about kinds of language processing that we don't address in this class.
from IPython.display import YouTubeVideo
YouTubeVideo("Rwakh-HXPJk",width=800, height=450)
Reading: NLPP Chapter 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.9, and 3.10. It's not important that you go in depth with everything here - the key thing is that you know that Chapter 3 of this book exists, and that it's a great place to return to if you're ever in need of an explanation on topics that you forget as soon as you stop using them (and don't worry, I forget about those things too).
Zipf's Law: Let $f(w)$ be the frequency of a word $w$ in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. $f \times r = k$, for some constant $k$). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
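As a quick check of that example: if $f(r) = k/r$, then

$$\frac{f(50)}{f(150)} = \frac{k/50}{k/150} = \frac{150}{50} = 3.$$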
Reading: Skim through the Wikipedia page on Zipf's law.
Exercise 1: Tokenization and Zipf's law. Consider the list of abstracts of Computational Social Science papers. Take the paperIds of Computational Social Science papers (from Week 4), then go back to your abstract dataframe (Week 2) to get the abstracts for those papers only.
- Tokenize the text of each abstract. Create a column tokens in your dataframe containing the tokens. Remember the bullets below for success (a rough sketch of one possible pipeline appears after this exercise list).
- If you don't know what tokenization means, go back and read Chapter 3 again. The advice to go back and check Chapter 3 is valid for every cleaning step below.
- Exclude punctuation.
- Exclude URLs.
- Exclude stop words (if you don't know what stop words are, go back and read NLPP1e again).
- Exclude numbers.
- Set everything to lower case.
- Note that none of the above has to be perfect. And there's some room for improvisation. You can try using stemming. Choices like that are up to you.
- Create a single list that includes the concatenation of all the tokens (from all abstracts).
- What are the top 50 most common tokens in your corpus?
- Write a function to process your list of tokens and plot word frequency against word rank. Do you confirm Zipf's law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
- Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's law in the light of this?
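Here is a minimal sketch of how the cleaning pipeline and the rank-frequency plot might look. The toy dataframe and the column name abstract are assumptions for illustration; replace them with your own abstract dataframe from Week 2, and tweak the cleaning steps as you see fit.

import re

import matplotlib.pyplot as plt
import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("stopwords")

# Toy stand-in for your abstract dataframe from Week 2 (replace with your own).
df = pd.DataFrame({"abstract": [
    "Computational social science studies social media at scale. See https://example.org",
    "Machine learning methods for the analysis of social networks.",
]})

stop_words = set(stopwords.words("english"))

def clean_tokens(text):
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    tokens = nltk.word_tokenize(text.lower())   # lower case, then tokenize
    # Keep alphabetic tokens (drops punctuation and numbers) that are not stop words.
    return [t for t in tokens if t.isalpha() and t not in stop_words]

df["tokens"] = df["abstract"].apply(clean_tokens)

# One list with the concatenation of all tokens, and its frequency distribution.
all_tokens = [t for tokens in df["tokens"] for t in tokens]
fdist = nltk.FreqDist(all_tokens)
print(fdist.most_common(50))

# Rank-frequency plot on log-log axes; Zipf's law predicts roughly a straight line.
freqs = sorted(fdist.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()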
In this course, we use a "bag-of-words" approach, because using simple methods to explore the data is very important before applying any complex model. Here, we learn how to account for an issue that often comes up when using a bag-of-words approach to study textual data: collocations, i.e. pairs of words that tend to appear together more often than by chance. Collocation is an important concept in linguistics.
In the case of collocations, the words should be considered together to retain their original meaning (e.g. machine learning is not simply machine and learning; the same applies to computer science, social media, and computational social science).
How do we find out if a pair of words $w_1, w_2$ appears in a corpus more often than one would expect by chance? We study the corresponding contingency table. Given a corpus, and two words $w_1$ and $w_2$, a contingency table is a matrix with the following elements:
$$C_{w_1,w_2}= \begin{bmatrix} n_{ii} & n_{oi} \\ n_{io} & n_{oo} \end{bmatrix}$$
$n_{ii}$: the number of times the bigram ($w_1$, $w_2$) appears in the corpus
$n_{io}$: the number of bigrams ($w_1$, * ), where the first element is $w_1$ and the second element is not $w_2$
$n_{oi}$: the number of bigrams ( * , $w_2$ ), where the first element is not $w_1$ and the second element is $w_2$
$n_{oo}$: the number of bigrams ( * , * ) where the first element is not $w_1$ and the second is not $w_2$.
Then, we can compare the observed number of occurrences of the bigram, $n_{ii}$, with the number of occurrences we would expect by random chance alone. The value we would expect by chance is the product of three terms: the total number of bigrams, $N$; the probability that a bigram starts with $w_1$, which is equal to $(n_{ii} + n_{io})/N$; and the probability that a bigram ends with $w_2$, which is equal to $(n_{ii} + n_{oi})/N$.
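Putting that paragraph into a formula, the count we would expect by chance is

$$E[n_{ii}] = N \cdot \frac{n_{ii}+n_{io}}{N} \cdot \frac{n_{ii}+n_{oi}}{N} = \frac{(n_{ii}+n_{io})\,(n_{ii}+n_{oi})}{N}.$$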
To check whether the number of times our bigram appears, $n_{ii}$, is close to the value we would expect by chance, we can run a chi-squared test.
Note: contingency tables can be used in any statistical problem where one wants to study multivariate frequency distributions (not just in the case of textual data).
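To make the procedure concrete, here is a minimal sketch of building a contingency table from bigram counts and running the test with scipy.stats.chi2_contingency. The toy token list and the variable names are illustrative assumptions; in Exercise 2 you would use the bigrams from your abstracts.

from collections import Counter

from scipy.stats import chi2_contingency

# Toy token list; in Exercise 2 you would use all tokens from the abstracts.
tokens = ["machine", "learning", "for", "social", "media", "and",
          "machine", "learning", "in", "computational", "social", "science"]
bigrams = list(zip(tokens[:-1], tokens[1:]))

bigram_counts = Counter(bigrams)
first_counts = Counter(b[0] for b in bigrams)   # bigrams starting with a given word
second_counts = Counter(b[1] for b in bigrams)  # bigrams ending with a given word
N = len(bigrams)

def contingency_table(w1, w2):
    n_ii = bigram_counts[(w1, w2)]
    n_io = first_counts[w1] - n_ii    # (w1, *) where the second word is not w2
    n_oi = second_counts[w2] - n_ii   # (*, w2) where the first word is not w1
    n_oo = N - n_ii - n_io - n_oi     # all remaining bigrams
    return [[n_ii, n_oi], [n_io, n_oo]]

table = contingency_table("machine", "learning")
chi2, p_value, dof, expected = chi2_contingency(table)
print(table, p_value)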
Exercise 2: Bigrams and contingency tables.
- Find the list of bigrams in each of the abstracts. If you don't remember how to do it, go back to Chapter 1 of your book. Store all the bigrams in a single list.
- For each unique bigram in your list:
- compute the corresponding contingency table (see the theory just above)
- compute the p-value associated to the Chi-squared test
- What is the sum of all the elements of a contingency table?
- Find the list of bigrams with p-value smaller than 0.001.
- How many bigrams have you found? Print out 10 of them. What do you observe? Which bigrams does this list include?
- (Optional) Recompute the tokens column in your dataframe. This time, do not split pairs of words that constitute a collocation (they should be part of the same token). Hint: You can use the MWETokenizer.
- Save your filtered abstract dataframe with the new tokens column. You can also delete your old abstract dataframe, which included all papers.
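For the optional step, here is a minimal sketch of how nltk's MWETokenizer can merge collocations into single tokens. The collocation list below is made up for illustration; in practice you would use the significant bigrams you found above.

from nltk.tokenize import MWETokenizer

# Collocations to merge into single tokens (illustrative examples only).
collocations = [("machine", "learning"), ("social", "media"), ("computational", "social")]
mwe_tokenizer = MWETokenizer(collocations, separator="_")

tokens = ["machine", "learning", "methods", "for", "social", "media", "data"]
print(mwe_tokenizer.tokenize(tokens))
# ['machine_learning', 'methods', 'for', 'social_media', 'data']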