In [1]:
from urllib2 import urlopen
stxt=urlopen('http://www.gutenberg.org/cache/epub/100/pg100.txt').read()
In [2]:
#rough total number of words
len(stxt.split())
Out[2]:
904061

starts with:
1609 THE SONNETS by William Shakespeare

ends with:
End of the Project Gutenberg EBook of The Complete Works of William Shakespeare, by William Shakespeare

In [3]:
#find beginning and end
stxt.index('1609'),stxt.index('End of the Project')
Out[3]:
(7598, 5568865)
In [4]:
#check them
stxt[7598:7598+200]
Out[4]:
"1609\r\n\r\nTHE SONNETS\r\n\r\nby William Shakespeare\r\n\r\n\r\n\r\n                     1\r\n  From fairest creatures we desire increase,\r\n  That thereby beauty's rose might never die,\r\n  But as the riper should by t"
In [5]:
stxt[5568865:5568865+200]
Out[5]:
'End of the Project Gutenberg EBook of The Complete Works of William\r\nShakespeare, by William Shakespeare\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***\r\n\r\n***** Thi'
In [6]:
import re
swords=re.findall("[a-z']+",stxt[7598:5568865].lower())
len(swords)
Out[6]:
903810
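To see what the pattern "[a-z']+" keeps and drops, here is a minimal sketch on a made-up sample string (the sample is an illustration, not a line from the downloaded file). It suggests why the regex count differs slightly from the whitespace split: digits and punctuation are discarded, apostrophes are kept, and hyphenated words are split in two.

```python
import re

# hypothetical sample, chosen to contain digits, punctuation,
# an apostrophe and a hyphenated word
sample = "1609 THE SONNETS: beauty's rose, eye-drops"

# whitespace split keeps digits and attached punctuation
split_tokens = sample.split()

# same pattern as in the cell above: runs of lowercase letters and apostrophes
regex_tokens = re.findall("[a-z']+", sample.lower())

print(split_tokens)
print(regex_tokens)
```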
In [7]:
from collections import Counter
Counter(swords).most_common(20)
Out[7]:
[('the', 27554),
 ('and', 26725),
 ('i', 20296),
 ('to', 19637),
 ('of', 18127),
 ('a', 14545),
 ('you', 13611),
 ('my', 12474),
 ('that', 11122),
 ('in', 10989),
 ('is', 9567),
 ('not', 8723),
 ('for', 8219),
 ('with', 7992),
 ('me', 7772),
 ('it', 7685),
 ('be', 7074),
 ('your', 6866),
 ('his', 6853),
 ('this', 6805)]
In [8]:
wlist,nlist=zip(*Counter(swords).most_common())
In [9]:
print(wlist[:20])
('the', 'and', 'i', 'to', 'of', 'a', 'you', 'my', 'that', 'in', 'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your', 'this', 'his')

brown_top20: ('the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', 'was', 'he', 'for', 'it', 'with', 'as', 'his', 'on', 'be', 'at', 'by', 'i')

Listen from the 24:00 minute mark here: http://www.radiolab.org/story/91728-words-that-change-the-world/

tiny subset of 'un-' words potentially first coined by Shakespeare:
unnerved, unaware, uncomfortable, unearthly, unhand, undress, uneducated, ungoverned, unmitigated, unwillingness, unpublished, unsolicited, unswayed, unclogged, unappeased, unchanging, unreal, ...

or things like:
madcap, ladybird, eye-drops, eyesore, eyeball, fainthearted

or all these expressions, whose first documented use is in Shakespeare:
Truth will out. What's done is done. Crack of doom. Dead as a doornail. A dish fit for the gods. A dog will have his day. Fool's paradise. Forever and a day. Foregone conclusion. The game is afoot. The game is up. Greek to me. I'm in a pickle. In my heart of hearts. In my mind's eye. Kill with kindness. Knock, knock, who's there? Sorry sight. Love is blind. What the dickens. All's well that ends well. Something wicked this way comes.

Shakespeare Zipf

In [10]:
from scipy.stats import linregress
import numpy as np

xz=np.arange(len(wlist))+1
slope1,intercept1=linregress(np.log(xz)[:1000],np.log(nlist)[:1000])[:2]
slope2,intercept2=linregress(np.log(xz),np.log(nlist))[:2]
slope1,intercept1,slope2,intercept2
Out[10]:
(-1.030628927624901,
 11.770585933080547,
 -1.4677534901237208,
 14.753251384499743)
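The fit above is an ordinary least-squares line through the points (log rank, log frequency), so the slope is the Zipf exponent and exp(intercept) is the prefactor. As a sanity check of that idea, here is a minimal sketch in plain Python on synthetic data that follows count = 1000/rank exactly; the helper name loglog_fit and the synthetic counts are assumptions made for illustration, not part of the notebook.

```python
import math

def loglog_fit(ranks, counts):
    """Least-squares slope and intercept of log(count) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# synthetic, exactly Zipfian data: count = 1000 / rank
ranks = list(range(1, 101))
counts = [1000.0 / r for r in ranks]

slope, intercept = loglog_fit(ranks, counts)
print(slope, math.exp(intercept))  # slope close to -1, prefactor close to 1000
```

On real counts the result depends on which ranks are included, which is why the cell above reports one slope for the first 1000 ranks and another for the full list.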
In [11]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(5,4.5))
plt.loglog(xz,nlist)
plt.loglog(xz,np.exp(intercept1)*xz**slope1,'--')
plt.loglog(xz,np.exp(intercept2)*xz**slope2,'--')
plt.ylim(1,1e5)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.legend(['data','slope=-1.03','slope=-1.47']);

Shakespeare Heaps

Recall that swords contains the tokenized words. Consider reading it in blocks of 10000 words (swords[m:m+10000] for m = 0, 10000, 20000, ... while m < len(swords)).

Adding the words of each block to a set accumulates the distinct words (the 'vocabulary'), and the length of that set gives the vocabulary size.
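The block-by-block accumulation just described can be sketched as follows. The function name vocab_growth and the toy token list are hypothetical, used only to keep the example self-contained; on the notebook's data one would pass swords with the default block size.

```python
def vocab_growth(tokens, block=10000):
    """Return a list of (tokens_read, vocab_size) pairs, one per block."""
    seen = set()    # distinct words encountered so far
    sizes = []
    for m in range(0, len(tokens), block):
        seen.update(tokens[m:m + block])         # add this block's words
        read = min(m + block, len(tokens))       # last block may be short
        sizes.append((read, len(seen)))
    return sizes

# toy example with a block size of 4 instead of 10000
toy = "the cat and the dog and the bird saw the cat".split()
growth = vocab_growth(toy, block=4)
print(growth)  # [(4, 3), (8, 5), (11, 6)]
```

Each pair gives the number of tokens read so far and the vocabulary size at that point, which are the two quantities to plot against each other.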

The vocabulary must grow much more slowly than linearly: it starts at 2243 distinct words for the first 10000 words, and reaches only 26962 for the full 903810 words:

In [12]:
#recall len(swords) was 903810
len(set(swords[:10000])),len(set(swords))
Out[12]:
(2243, 26962)
In [13]:
#same data on linear-linear, then log-log, can read off exponent from slope on right
plt.figure(figsize=(8.5,4))
...