from urllib.request import urlopen
stxt=urlopen('http://www.gutenberg.org/cache/epub/100/pg100.txt').read().decode('utf-8')
#rough total number of words
len(stxt.split())
904061
starts with:
1609
THE SONNETS
by William Shakespeare
ends with:
End of the Project Gutenberg EBook of The Complete Works of William
Shakespeare, by William Shakespeare
#find beginning and end
stxt.index('1609'),stxt.index('End of the Project')
(7598, 5568865)
#check them
stxt[7598:7598+200]
"1609\r\n\r\nTHE SONNETS\r\n\r\nby William Shakespeare\r\n\r\n\r\n\r\n 1\r\n From fairest creatures we desire increase,\r\n That thereby beauty's rose might never die,\r\n But as the riper should by t"
stxt[5568865:5568865+200]
'End of the Project Gutenberg EBook of The Complete Works of William\r\nShakespeare, by William Shakespeare\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***\r\n\r\n***** Thi'
import re
swords=re.findall("[a-z']+",stxt[7598:5568865].lower())
len(swords)
903810
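As a quick sanity check on the tokenizer, the same pattern applied to a short sample shows how apostrophes are kept inside words while other punctuation splits them (the sample phrase is just an illustration):

```python
import re

# the same pattern used above: runs of lowercase letters and apostrophes
sample = "What's done is done; 'tis but a scratch."
tokens = re.findall("[a-z']+", sample.lower())
# apostrophes stay inside tokens ("what's", "'tis"); semicolons and periods split them
```

Note that a leading apostrophe, as in 'tis, is also kept, which slightly inflates the vocabulary with variant spellings.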
from collections import Counter
Counter(swords).most_common(20)
[('the', 27554), ('and', 26725), ('i', 20296), ('to', 19637), ('of', 18127), ('a', 14545), ('you', 13611), ('my', 12474), ('that', 11122), ('in', 10989), ('is', 9567), ('not', 8723), ('for', 8219), ('with', 7992), ('me', 7772), ('it', 7685), ('be', 7074), ('your', 6866), ('his', 6853), ('this', 6805)]
wlist,nlist=zip(*Counter(swords).most_common())
print(wlist[:20])
('the', 'and', 'i', 'to', 'of', 'a', 'you', 'my', 'that', 'in', 'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your', 'this', 'his')
brown_top20: ('the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', 'was', 'he', 'for', 'it', 'with', 'as', 'his', 'on', 'be', 'at', 'by', 'i')
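To put a number on the difference between the two corpora, one can count how many words the two top-20 lists share — a sketch using the tuples printed above:

```python
shakespeare_top20 = ('the', 'and', 'i', 'to', 'of', 'a', 'you', 'my', 'that', 'in',
                     'is', 'not', 'for', 'with', 'me', 'it', 'be', 'your', 'this', 'his')
brown_top20 = ('the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', 'was', 'he',
               'for', 'it', 'with', 'as', 'his', 'on', 'be', 'at', 'by', 'i')
shared = set(shakespeare_top20) & set(brown_top20)
# 14 of 20 words are common to both lists; the dialogue-heavy plays push
# pronouns like 'you', 'my', 'me' up the Shakespeare ranking
```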
24:00 mark here: http://www.radiolab.org/story/91728-words-that-change-the-world/
tiny subset of 'un-' words potentially first coined by Shakespeare:
unnerved, unaware, uncomfortable, unearthly, unhand, undress, uneducated, ungoverned, unmitigated, unwillingness, unpublished, unsolicited, unswayed, unclogged, unappeased, unchanging, unreal, ...
or things like:
madcap, ladybird, eye-drops, Eyesore, eyeball, fainthearted
or expressions whose first documented use is in Shakespeare:
So truth will out. What's done is done.
Crack of doom. Dead as a doornail. A dish fit for the gods.
A dog will have his day. Fool's paradise. Forever and a day.
Foregone conclusion. The game is afoot. The game is up.
Greek to me. I'm in a pickle. In my heart of hearts. In my mind's eye. Kill with kindness. Knock, knock, who's there?
A sorry sight. Love is blind. What the Dickens. All's well that ends well.
Something wicked this way comes.
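Locating any of these expressions in the downloaded text is a one-liner with str.find() (or the in operator); a minimal sketch on a stand-in snippet, since the full text isn't reproduced here:

```python
# stand-in for stxt; the real search would run on the full downloaded text
snippet = ("By the pricking of my thumbs,\r\n"
           "Something wicked this way comes.")
phrase = "something wicked this way comes"
pos = snippet.lower().find(phrase)
# find() returns the character offset of the first match, or -1 if absent
```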
from scipy.stats import linregress
import numpy as np
xz=np.arange(len(wlist))+1
#fit the log-log slope over the top 1000 ranks, then over all ranks
slope1,intercept1=linregress(np.log(xz)[:1000],np.log(nlist)[:1000])[:2]
slope2,intercept2=linregress(np.log(xz),np.log(nlist))[:2]
slope1,intercept1,slope2,intercept2
(-1.030628927624901, 11.770585933080547, -1.4677534901237208, 14.753251384499743)
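As a sanity check that a least-squares fit on the logged values recovers a known exponent, here is a sketch on exact power-law data (np.polyfit is used here as a stand-in for linregress; both fit a line to the logged data):

```python
import numpy as np

ranks = np.arange(1, 1001)
freqs = 5.0e4 * ranks ** -1.03            # exact Zipf-like data, exponent -1.03
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
# on exact power-law data the fitted slope is -1.03 up to rounding,
# and the intercept is log(5e4)
```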
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(5,4.5))
plt.loglog(xz,nlist)
plt.loglog(xz,np.exp(intercept1)*xz**slope1,'--')
plt.loglog(xz,np.exp(intercept2)*xz**slope2,'--')
plt.ylim(1,1e5)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.legend(['data','slope=-1.03','slope=-1.47']);
Recall that swords[] contains the tokenized words. Consider reading it in blocks of 10000 words
(swords[m:m+10000]
for m from 0 up to len(swords)).
Adding each block of new words to a set accumulates the distinct words (the 'vocabulary'), and the length of that set gives its size.
This must grow much more slowly than linearly, since it starts at vocabulary size 2243 for the first 10000 words, and reaches only 26962 for the full 903810 words:
#recall len(swords) was 903810
len(set(swords[:10000])),len(set(swords))
(2243, 26962)
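The block-wise accumulation described above can be sketched as follows (shown on a tiny stand-in token list; on swords itself the resulting curve starts at 2243 and ends at 26962):

```python
def vocab_growth(tokens, block=10000):
    # running vocabulary size after each block of tokens
    vocab, sizes = set(), []
    for m in range(0, len(tokens), block):
        vocab.update(tokens[m:m+block])
        sizes.append(len(vocab))
    return sizes

# tiny stand-in for swords: repeated words make the growth sublinear
toy = ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
sizes = vocab_growth(toy, block=4)
# sizes grows by less than the block size whenever a block repeats earlier words
```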
#same data on linear-linear, then log-log, can read off exponent from slope on right
plt.figure(figsize=(8.5,4))
...