In this notebook, I'm going to take you through a couple of simple, well-known techniques for exploring small sequences of text (like lines of poetry or sentences). These techniques include: representing sentences as vectors by averaging their word vectors; building a tiny semantic search engine with approximate nearest neighbors; visualizing sentence vectors with dimensionality reduction; clustering them; and training a classifier on them.
We're going to work with a corpus of several million lines of poetry, scraped from Project Gutenberg. Before you continue, download the file by executing the cell below:
!curl -L -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz
I'm going to use spaCy extensively, both as a way to parse text into sentences and as a source of pre-trained word vectors. Make sure you have it installed, along with the en_core_web_md or en_core_web_lg model. This notebook also assumes that you have scikit-learn, numpy, matplotlib and simpleneighbors installed.
Review the concept of word vectors before continuing.
Word vectors work great when we're interested in individual words. More often, though, we're interested in longer stretches of text, like sentences, lines, or paragraphs. If we had a way to represent these longer stretches of text as vectors, we could perform all of the same operations on them that word vectors allow us to perform on words. But how do we represent stretches of text as vectors?
There are lots of different ways! The classic technique in machine learning is to represent each sequence by the frequency of the terms it contains (methods like tf-idf); a related approach, doc2vec, learns document vectors directly. Another way is to train a neural network (like an LSTM or transformer) for the task. There are any number of pre-trained models you can download and use for this, including Google's Universal Sentence Encoder and the Sentence-Transformers Python package.
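For reference, here's roughly what the term-frequency approach looks like in scikit-learn. This is just a sketch with arbitrary example texts, not something we'll use below:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["He clasps the crag with crooked hands;",
        "Close to the sun in lonely lands,"]
# each text becomes a sparse vector with one tf-idf weight per vocabulary term
tfidf_vectors = TfidfVectorizer().fit_transform(docs)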
But a surprisingly effective technique is to simply average together the word vectors for each word in the sentence. A big advantage of this technique is that no further training is needed, beyond the training needed to calculate the word vectors; if you're using pre-trained vectors, even that step can be skipped. You won't get state-of-the-art results on NLP benchmarks with this technique, but it's a good baseline and still useful for many tasks.
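The averaging itself is trivial. Here's a minimal sketch with numpy, where word_vec stands in (hypothetically) for any function that returns a vector for a given word; below, we'll let spaCy do this work for us:

import numpy as np

def average_vector(words, word_vec):
    # stack one vector per word, then average elementwise; the result
    # has the same dimensionality as the individual word vectors
    return np.mean([word_vec(w) for w in words], axis=0)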
In the section below, I sample ten thousand lines of poetry from a Project Gutenberg poetry corpus and assign each a vector using this averaging technique.
import numpy as np
import spacy
import gzip, json, random
Load spaCy's language model:
nlp = spacy.load('en_core_web_md')
And then load up all of the lines of poetry:
lines = []
for line in gzip.open("./gutenberg-poetry-v001.ndjson.gz"):
    data = json.loads(line)
    lines.append(data['s'])
len(lines)
3085117
To make things a bit faster and less memory-intensive, I'm only going to use ten thousand lines of poetry, sampled randomly. (You can change this number to something bigger if you want! But note that some of the stuff we're doing later on in the notebook will take longer.)
sampled_lines = random.sample(lines, 10000)
Every spaCy span (i.e., documents, sentences, etc.) has a .vector attribute, which is calculated as the average of the vectors for each token in the span. The summary() function below parses the string you pass to it with spaCy and returns the vector that spaCy computes. (Here I disable the parser, tagger and ner pipelines in order to make the process faster. We're just after the tokens and vectors—we don't need parts of speech, etc.)
def summary(sent):
    return nlp(sent, disable=['parser', 'tagger', 'ner']).vector
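As a quick sanity check, the result should be a 300-dimensional array, matching the dimensionality of the en_core_web_md word vectors:

summary("Shall I compare thee to a summer's day?").shape

(300,)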
The code in the cell below computes "summary vectors" for the lines of poetry sampled above:
embeddings = [summary(line) for line in sampled_lines]
And here's what they look like:
rand_idx = random.randrange(len(sampled_lines))
sampled_lines[rand_idx], embeddings[rand_idx]
('(Her brother he) this costly present gave.', array([-1.75622299e-01, 1.32986456e-01, -8.44215900e-02, 4.82430980e-02, 1.74502004e-02, -8.76883119e-02, 1.76149942e-02, -3.61497015e-01, -5.74538112e-02, 2.43268013e+00, ..., 8.19397271e-02, 5.43425977e-02, 6.63953200e-02, -6.93768114e-02], dtype=float32))
Sentence vectors aren't especially interesting on their own! One thing we can use them for is to build a tiny "search engine" based on semantic similarity between lines. To do this, we need to calculate the distance between a target sentence's vector (not necessarily a sentence from our corpus) and the vector of every sentence in the corpus, then return the corpus sentences ranked by that distance. Doing this comparison exhaustively is computationally expensive and potentially very slow. Instead, we'll use an approximate nearest neighbors algorithm, which uses some tricks to make the computation faster, at the cost of a little bit of accuracy. I'm going to use the Simple Neighbors package as a way to build an approximate nearest neighbors index quickly and easily.
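For contrast, here's a minimal sketch of what the exact, brute-force version of this search would look like with numpy alone: compute the cosine similarity between the query vector and every vector in the corpus, then sort. (nearest_brute_force is just an illustrative name; this is fine for ten thousand lines but gets painful as the corpus grows.)

def nearest_brute_force(query_vec, vectors, texts, n=5):
    emb = np.array(vectors)
    # cosine similarity: dot products divided by the vector magnitudes
    # (the tiny epsilon guards against division by zero)
    sims = (emb @ query_vec) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    # indices of the n largest similarities, most similar first
    return [texts[i] for i in np.argsort(-sims)[:n]]

nearest_brute_force(summary("I don't want the words, I want the sound."), embeddings, sampled_lines)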
from simpleneighbors import SimpleNeighbors
In the cell below, I build a nearest-neighbor lookup for the sampled lines of poetry:
lookup = SimpleNeighbors(300)
for vec, line in zip(embeddings, sampled_lines):
    lookup.add_one(line, vec)
lookup.build()
The .nearest() method returns the sentences from the corpus whose vectors are closest to the vector you pass in. The code in the cell below uses the summary() function to return the sentences most similar to the sentence you type in. The second argument controls how many sentences should be returned.
lookup.nearest(summary("I don't want the words, I want the sound."), 5)
['I do you wrong? I do not hear your praise', 'I think you know it. Fitzwalter, I can save you,', "Mother, if you don't mind, I should like to become the boatman of", "I tell them they can't get me through the door, though:", 'And I would tell it all to you;']
To get neighbors for a random item in the corpus:
lookup.neighbors(random.choice(lookup.corpus))
['Wild winds went straying,', 'Of wailing winds, and naked woods, and meadows brown and sere.', 'Girt with rough skins, hies to the deserts wild,', 'On Cowper Green I stray, tis a desert strange and chill,', 'To slay wild beasts and chase the roving hind,', 'And, brightly leaping down the hills,', 'With thicket overgrown, grotesque and wild,', "No--'twas the wind that, whirring, rose,", 'To bear her cloudy flame,', 'The little rills, and waters numberless,', 'Walked hog-eyed Allen, terror of the hills', 'Sat upon a grassy hillock,']
Another thing you can do with sentence vectors is visualize them. But the vectors are large (in our case, 300 dimensions), and there's no obvious way to map 300 dimensions onto the two dimensions of a plot. Thankfully, there are a number of algorithms for reducing the dimensionality of vectors. We're going to use t-SNE ("t-distributed stochastic neighbor embedding"), but there are others to experiment with that might be just as good or better for your application (like PCA or UMAP).
Note that in the code below, I'm using an even smaller subset of the data. (That's what the [:2000] is doing—just using the first 2000 samples. This is because t-SNE is slow, as is drawing the results of a t-SNE.)
from sklearn.manifold import TSNE
mapped_embeddings = TSNE(n_components=2,
                         metric='cosine',
                         init='pca',
                         verbose=1).fit_transform(embeddings[:2000])
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 2000 samples in 0.001s...
[t-SNE] Computed neighbors for 2000 samples in 0.208s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2000
[t-SNE] Computed conditional probabilities for sample 2000 / 2000
[t-SNE] Mean sigma: 0.103510
[t-SNE] KL divergence after 250 iterations with early exaggeration: 82.455208
[t-SNE] KL divergence after 1000 iterations: 2.354854
The following cell draws a very large image with the results of the t-SNE. (You might want to right-click to save the image and then bring it up in an image viewer to see the details.)
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(100, 100))
x = mapped_embeddings[:,0]
y = mapped_embeddings[:,1]
plt.scatter(x, y)
for i, txt in enumerate(sampled_lines[:2000]):
    plt.annotate(txt, (x[i], y[i]))
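If t-SNE is too slow for your corpus, PCA (mentioned above) is a much faster alternative, though its layouts are usually less visually distinct. A minimal sketch of swapping it in:

from sklearn.decomposition import PCA

# project the 300-dimensional vectors down to 2 dimensions
mapped_pca = PCA(n_components=2).fit_transform(np.array(embeddings[:2000]))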
The Google Embedding Projector is a handy web-based tool for exploring and visualizing data using different dimensionality reduction techniques. The code in the following cells exports our lines of poetry and their embeddings in a format that you can upload to the tool:
with open("emb-proj-vecs.tsv", "w") as fh:
for vec in embeddings[:2000]:
fh.write("\t".join(["%0.5f" % val for val in vec]))
fh.write("\n")
with open("emb-proj-labels.tsv", "w") as fh:
fh.write("\n".join(sampled_lines[:2000]))
Click on the "Load" button in the interface, and upload the emb-proj-vecs.tsv
file as "Vectors" and emb-proj-labels.tsv
as "Metadata."
In the visualization above, you may have seen some evidence of "clustering"—groups of items that seem to be related. There are algorithms that facilitate finding such clusters automatically. This can be an interesting and valuable way to explore your data—you might find clusters of meaning that you didn't expect.
We're going to use the K-Means clustering algorithm (in particular, scikit-learn's MiniBatchKMeans implementation).
cluster_count = 25 # adjust this until it starts giving you good results!
from sklearn.cluster import MiniBatchKMeans
clusterer = MiniBatchKMeans(n_clusters=cluster_count)
clusters = clusterer.fit_predict(embeddings)
from collections import defaultdict
group_by_cluster = defaultdict(list)
for i, item in enumerate(clusters):
    group_by_cluster[item].append(sampled_lines[i])

for i in range(cluster_count):
    print(f"Cluster {i} ({len(group_by_cluster[i])} items)")
    print("Closest to center: ", lookup.nearest(clusterer.cluster_centers_[i], 1)[0])
    print()
    print("\n".join(random.sample(group_by_cluster[i],
                                  min(8, len(group_by_cluster[i])))))
    print("\n---")
Cluster 0 (2 items)
Closest to center: Out of old graves arose the cry of life; Beneath whose shade the graves of heroes lie; Out of old graves arose the cry of life;
---
Cluster 1 (665 items)
Closest to center: scenery. The river Peneus ran through it, but not with the And where within the surface of the river With power to walk at will the ocean-floor, Could Sicily more charming forests show The crocus breaking earth; The dreary road we all must tread, Is pumped up brisk now, through the main ventricle, While the waters murmur tranquilly The sunlight leaves behind,
---
Cluster 2 (352 items)
Closest to center: she fated to redeem one man of them from an evil doom. So Curse their own soul. Well deemed the little God his ancient sway was o'er. And now a mighty shade of me shall go beneath the earth! And from his holy lips these accents broke: Like eternities of ice! This caravan of life passes in curious guise! Be on thy (On her soul may our Lady have gramercy!) That human hearts appall!
---
Cluster 3 (471 items)
Closest to center: Through the green leaves, a glimmer as of gold, Saw his bright face flashed back from golden sand, And scarlet breast-knot gay. And the white and crimson bands of dawn Though on them glows the copper tint, though African their race, Till Age snow white hairs on thee; Like a wavering summer sea. Impenetrable around their shoulders spread. As it falls and flecks an oak-tree
---
Cluster 4 (734 items)
Closest to center: than in the fields, and whoso chooses will give it me. For To say another is an ass--at least, to all intent; It must expelled be although with pain "Save yourself, no God will save you; But love's old burden makes their soul so weak With him, that taught them first to glow? Those eyes in tears their fruitless grief must send; "Encumbering me were sundry sick, so fallen For us be suing,--
---
Cluster 5 (769 items)
Closest to center: And when he to rest has laid him, Crushed and wounded leave the heart. A lady fair, and gently her mincing steps upstay'd. That the fierce drink burned inside him. "That wild roving Bee who was hanging about her, And very calm was her reply, Forth into the village went he, Till he stands at the gate; And to Gum Webster and his wife,
---
Cluster 6 (209 items)
Closest to center: nam si curent, bene bonis sit, male malis; quod nunc abest. Qui solus legit ac facit poetas,-- O Vashti, noble Vashti! Summoned out The{n} her dyscressio{n} schal wel knov & fynde He schal wel finde his coveitise _La chasse au bouquin_ still pursue Est amor egra salus, vexata quies, pius error, quin et Ixion Tityosque uoltu "'Cum tot sustineas et tanta negotia solus
---
Cluster 7 (223 items)
Closest to center: While Sister S. still floated soft, a-gazing at the sky! And desperate encounter at mad feud Whate'er might tempt these little feet to stray, Of monsters dying neath their blows, The mistress snored loud as a pig, Urban muses stammer How touchingly inadequate shuddering shadows throng Little tail quivering,
---
Cluster 8 (34 items)
Closest to center: This week this person purchased a whole days' amusement; One day the pupil from the choir was gone; repeatedly two or three times. But a few fresh arrivals decided the day; From five o'clock till five o'clock Nine and seventy years ago. This week this person purchased a whole days' amusement; About three nights in every week could Ephraim's yellow mare In his solitary hours.
---
Cluster 9 (630 items)
Closest to center: so you pretend... you see his face up in the ceiling. Sitting in this bare chamber with my thoughts, Looking to find the inn. And here is some one So glad again of the coming rain The thorn-tree had a mind to Him On filthy straw they sit in the gloom, each face And all the ill is left behind. Sat down at the bedside. Made it sacred by their touch.
---
Cluster 10 (325 items)
Closest to center: Like her. An' me well knowin' she was square. "I said, 'My steed neighs in the court, "All these for Fourpence." "O go mit me," said Breitmann, Good-night to Marmion." "O Printerman of sallow face, "Ah," you say; "the large appearance "Ay!" murmur'd the Soeur Seraphine... "heart to heart! "The Ancestor remote of Man,"
---
Cluster 11 (40 items)
Closest to center: ¡Qué tanto puede una mujer que llora! Mas como Grande del reino Solitaria flor que el viento Cuando me gozo, Señor, criado fuese entre damas Los dueños buscan, que medrosos huyen! Sostenido por sus pajes Falta matiz alguno; y bebe en ellas Por telaraña sucia y asquerosa,
---
Cluster 12 (741 items)
Closest to center: Hushed all the joyful voices; and we, who held you dear, To wander in her sleep, thro' ways unknown, And the most ancient heavens, through Thee, are fresh and strong. For frantic boast and foolish word, Of blessed dreams, sown in his heart, takes roots; He, with his spells and shapes of devilish kind, Still shall the valour, love, and truth, Hushed all the joyful voices; and we, who held you dear, Mingled with things forgotten. Until then,
---
Cluster 13 (114 items)
Closest to center: He wagged his head this way . . . that way . . . Is sweeter than my city life. Breeze-like about his face. Her frown is more to fear. But really build eidolons. Her mission." Who set my brain distraught. Sweet leaves that live no more. wedded state with joys so rife.
---
Cluster 14 (373 items)
Closest to center: Then what was his failing? come, tell it, and burn ye! A world God loved so well, Thinking, perchance, it was a glorious thing, “I, only I, will take the field, But Oh! what roots my feet?--what spells, what charms "I told you, Willis, when you first came in. "Now meditate with me a bolder plan, Light as Sienna's? Sure not France herself This bravery is, since these times show'd me you--DONNE.
---
Cluster 15 (2 items)
Closest to center: As one by a dark stair into a great light, As one by a dark stair into a great light, For a little way and a black
---
Cluster 16 (572 items)
Closest to center: Come, Postumus, and face it, and, facing it, confess Searched each thicket, dingle, hollow; With miracles shod, Me vallate Cydoniis, But more refin'd, more spiritous, and pure, Two bowls I have, well turned, of beechen wood; With varying vanities, from ev'ry part, A scene of tender, delicate repose, Kathrina dear,
---
Cluster 17 (268 items)
Closest to center: That wert a promise to me ere thy birth, Speaks thee descendant of ethereal race; Which thou must witness ere thy mortal hour, Each glory of thy Cæsar's reign, Hither come where thou art needed, We cherish freedom - back with thee and thine Then, laying 'gainst her bosom the white flower, Thy jeeres and laughter then forbeare, Life, thou soul of every blessing,
---
Cluster 18 (458 items)
Closest to center: To him, in place of men, for he is old, suffice A sober after-growth is seen, In ordered conflict re-arrange, and stand One, two, three, four, five. Still as he hung, by the retaining steel And another came, and another came, Courts for cowards were erected, and outer precincts, where they would also eat--and the upper To him, in place of men, for he is old, suffice
---
Cluster 19 (171 items)
Closest to center: I gasp for breath, and now I know I've eaten far too much; "I would, when first I knew the hardy maid I knew I deserved the whipping, I have two thousand bottles, I'm on the sea! I'm on the sea! I turned my head, Methinks I hear in accents low, All night I slept, oblivious of my pain: (I am so lonely)
---
Cluster 20 (568 items)
Closest to center: I see them not--yet know him well _And were I only young again_! And if in life her love she'll not agree me, There at least I am not in the way. What ails you? Are you pleased or pained? What notion---- Come to my longing arms and let me prove But I tried to take it, my ambition fired Want no content; I feed on manna too. Your name I knew not, and in love's sweet font
---
Cluster 21 (711 items)
Closest to center: of the mystery. But the prophet of God, skilled in the law and Who slew the savage Buffaloon In the living rock of Law. His comrades then the carcase flay'd and dress'd: As from the power of sacred lays I think the keenness of the living ray And the waves shall smile in the sun The dulcet tumult of their silver tongues.-- Of half voluptuousness and half command.
---
Cluster 22 (447 items)
Closest to center: may vouchsafe thee aught. But now will I speak out, and my And tenne good gobbs I will unto thee tell, Thogh it be fals and god be wroth. How pale, how cold, thy lips will be! If I were a swift cloud to fly with thee; Who loves thee not is traitor to himself, What demands thy next attention, Methinks for such as thou Yet God hath marked and sealed the spot,
---
Cluster 23 (616 items)
Closest to center: "Yet, in spite of his love of form, there is nothing frigid or And Quidlibet, who is a pleasant body to deal with,--only he has Is drawn unto that mockery of a land, Soft is the note, and sad the lay "A modest manner fits a maid, against a strong bearing-post along with the many other spears Which Freedom built on Virtue's plan, And sleekest broadcloth counts no more Take the horse that brought the bridegroom,
---
Cluster 24 (505 items)
Closest to center: And ere one dreams of it, lo! _there_ is a romance. With unreproachful stare. (Stanzas i.-xv., stanza lxxiv.) translation, of "The Cid" by Ormsby.] "A hundred torne y haffe schot with hem, And yit with al myn hole herte, _His fearful Rati._] The wife of Kama, or Love. To Faith,--to Purity! To Helenns counsayle Troy had nat ben brent.
---
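Eyeballing the clusters and adjusting cluster_count is perfectly fine. If you want something more systematic, one option is the silhouette score, which measures how well each item fits the cluster it was assigned to (higher is better). A sketch, scoring a few candidate values of k on a sample of the data to keep things fast:

from sklearn.metrics import silhouette_score

for k in (10, 25, 50):
    labels = MiniBatchKMeans(n_clusters=k).fit_predict(embeddings)
    # sample_size keeps the pairwise-distance computation manageable
    print(k, silhouette_score(np.array(embeddings), labels, sample_size=2000))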
Sentence vectors also make it easy to classify texts: that is, based on an existing labelled corpus, predict what category some new text will fall into. Let's build a simple classifier that tries to determine whether a given stretch of text is more like a line of poetry or more like a sentence from a recipe book. First, download the plain-text version of this book and put it in the same directory as this notebook; the code below assumes it's saved as pg26209.txt. (You might go in and remove the Project Gutenberg boilerplate, but you don't have to.)
The following cell parses this text into sentences:
recipe_sents = [sent.text.strip() for sent in nlp(open("./pg26209.txt").read().replace("\n", " "),
                                                  disable=['tagger', 'ner']).sents]
len(recipe_sents)
3254
random.sample(recipe_sents, 10)
['They should be cut with a jagging iron.', 'If the sauce is too thick add a little more water.', 'When it boils stir a beaten egg quickly into it, remove at once from the fire.', 'It is better that the dough be made the day before the cakes are to be baked that it may dry a little, as they are spoiled if too much flour is added.', 'Not till one hour after the last vegetarian did the first meat-eater appear, completely exhausted.', '101', 'VENISON CAKES (a Norwegian Recipe).', 'SPICED CURRANTS.', 'Put a layer of the tomatoes in a baking dish, season with salt, pepper and a little sugar, cover with a layer of bread crumbs, dot freely with bits of butter, then put another layer of tomatoes, and lastly a layer of bread crumbs, with bits of butter, and sprinkle with a dessertspoonful of sugar.', 'And everyone who eats flesh meat has part in that brutalization; everyone who uses what they provide is guilty of this degradation of his fellow-men.']
We'll pick exactly as many lines from our poetry corpus as the number of sentences we found in the recipe book. (This is cheating a little bit: most machine learning classification algorithms work best with "balanced" classes, i.e., roughly the same number of examples in each category, and real-world data is rarely this tidy.)
classify_poem_lines = random.sample(lines, len(recipe_sents))
A classifier tries to predict a category y based on some data X. The machine learning algorithm is essentially trying to approximate a function y = f(X) that most accurately gives the corresponding label y for each value in X. Our X is going to be the sentence vector for each sentence, and our y is going to be 0 for poetry and 1 for recipes. (Category labels are conventionally integers; starting with 0 and counting up is best.)
X = [] # embeddings of lines
y = [] # categories of lines
all_text = [] # actual text
for text in classify_poem_lines:
    all_text.append(text)
    X.append(summary(text))
    y.append(0) # 0 for poem

for text in recipe_sents:
    all_text.append(text)
    X.append(summary(text))
    y.append(1) # 1 for recipes
Verify that our classes are balanced:
from collections import Counter
Counter(y).most_common()
[(0, 3254), (1, 3254)]
Because the classes are balanced, a classifier that guesses at random will be correct about 50% of the time. So for our classifier to be any good, it has to predict correctly noticeably more often than that.
We'll train the model with some of our data, and set aside some of the data for testing the model. It's important to set aside testing data as a way to validate that the model is accurate on data that it has never seen. The train_test_split() function in scikit-learn is an easy way to do this: it shuffles the data, then partitions it into training data (75% by default) and testing data (25%).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, text_train, text_test = train_test_split(np.array(X), np.array(y), all_text)
len(X_train), len(y_train), len(text_train), len(X_test), len(y_test), len(text_test)
(4881, 4881, 4881, 1627, 1627, 1627)
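(One note: train_test_split shuffles at random, so the exact split, and therefore the numbers below, will vary from run to run. If you want reproducible results, pass a fixed random_state:)

X_train, X_test, y_train, y_test, text_train, text_test = train_test_split(
    np.array(X), np.array(y), all_text, random_state=42)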
Now we'll train the classifier. We won't go into the details of how the classifier works! You could drop in any other classifier, really, including a neural network, but the random forest classifier works just fine and is surprisingly fast.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200, class_weight="balanced", verbose=1, n_jobs=-1)
rfc.fit(X_train, y_train)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.0s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 4.5s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed: 4.7s finished
RandomForestClassifier(class_weight='balanced', n_estimators=200, n_jobs=-1, verbose=1)
Now, we'll see how accurate the model is by predicting the categories in the test set.
preds = rfc.predict(X_test)
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed: 0.1s finished
from sklearn import metrics
The accuracy score is simply the percentage of predictions that were correct:
metrics.accuracy_score(y_test, preds)
0.9655808236017209
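(Equivalently, accuracy is just the fraction of predictions that match the true labels, which you can compute directly with numpy:)

(preds == y_test).mean()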
But there are other useful measures of how good the model is at classifying things. Precision tells us, of the items predicted to be in a class, what fraction really belong to it; recall tells us, of the items really in a class, what fraction the model found:
print(metrics.classification_report(y_test, preds))
              precision    recall  f1-score   support

           0       0.94      0.99      0.97       796
           1       0.99      0.94      0.97       831

    accuracy                           0.97      1627
   macro avg       0.97      0.97      0.97      1627
weighted avg       0.97      0.97      0.97      1627
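Another useful view is the confusion matrix, which counts test items by true category (rows) and predicted category (columns); the off-diagonal cells are the model's mistakes:

metrics.confusion_matrix(y_test, preds)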
Let's try classifying individual, randomly-selected items. (Remember, 0 is poem, 1 is recipe.)
rand_idx = random.randrange(len(X_test))
print(rfc.predict([X_test[rand_idx]]), text_test[rand_idx])
[0] Which lifts plain life to the divine,
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed: 0.1s finished
The .predict_proba() method gives us the probability of each category, rather than just the category. The first item in the square brackets is the probability of category 0 (i.e., poem), and the second item is the probability of category 1 (i.e., recipe); the columns are in the order given by the classifier's .classes_ attribute. The larger the number, the more probable it is (according to the classifier) that the text belongs to that category.
rand_idx = random.randrange(len(X_test))
print(rfc.predict_proba([X_test[rand_idx]]), text_test[rand_idx])
[[0.835 0.165]] But when a day or two confirms her stay
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed: 0.1s finished
Predicting everything in the test set along with probabilities:
preds_with_prob = rfc.predict_proba(X_test)
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed: 0.1s finished
In the test set, here are the items deemed most likely to be poetry, along with their predicted probabilities and true category:
for idx in np.argsort(preds_with_prob[:,0])[-10:]:
    print(text_test[idx], preds_with_prob[idx], y_test[idx])
His hand that trembled as one terrifyde; [0.995 0.005] 0
He falls, earth thunders, and his arms resound. [1. 0.] 0
"Oh, what reck I thy gold?" quoth Earl Sigurd, the bold; [1. 0.] 0
And menaced vengeance, ere he reach'd his shore; [1. 0.] 0
I'll be a missionarer like her oldest brother, Dan, [1. 0.] 0
Still he did not cease his singing, [1. 0.] 0
He, anger'd by his father, roam'd away [1. 0.] 0
That, kneeling in the silence of his tent, [1. 0.] 0
Mine eyes true-opening, and my heart much eased; [1. 0.] 0
Unto the dumb lips of the flock he lent Sad, pleading words, showing how man, who prays [1. 0.] 1
Likewise, those deemed most likely to be from the recipe book:
for idx in np.argsort(preds_with_prob[:,0])[:10]:
    print(text_test[idx], preds_with_prob[idx], y_test[idx])
Serve with tomato sauce, or around a dish of stewed tomatoes. [0. 1.] 1
EGGS IN A BROWN SAUCE. [0. 1.] 1
Butter a mould, sprinkle with bread crumbs, pour the pudding into it and set it in a pan of hot water in a moderate oven. [0. 1.] 1
APPLE MERINGUE. [0. 1.] 1
| 18.1 [0. 1.] 1
Butter a mould, sprinkle with fine bread crumbs, take the buns out of the custard, lay them in the mould and pour the custard over them. [0. 1.] 1
Three egg yolks, a pint and a half of cream, three-quarters of a pound of butter, an even teaspoonful of soda, one pound and a half of sugar, and flour enough to roll. [0. 1.] 1
Pare, core and quarter eight or nine good cooking apples, put them into a double boiler with two tablespoonfuls of butter, half a cup of sugar, the juice and grated rind of a lemon; cook until tender. [0. 1.] 1
Put all into a saucepan with only water enough to cook them tender, cover tightly, when done, brown a little butter and flour together to make the gravy [0. 1.] 1
| 6.6 [0. 1.] 1
The most ambiguous items (i.e., the items whose probabilities are closest to 50% for each category):
diffs = np.abs(preds_with_prob[:,0] - preds_with_prob[:,1])
for idx in np.argsort(diffs)[:20]:
    print(text_test[idx], preds_with_prob[idx], y_test[idx])
It will become clear. [0.495 0.505] 1
Hard [0.505 0.495] 1
And we’ll carry home some watter [0.505 0.495] 0
O was an orange [0.51 0.49] 0
The vegetarian can extract from his food all the principles necessary for the growth and support of the body, as well as for the production of heat and force. [0.49 0.51] 1
We do not solicit donations in locations where we have not received written confirmation of compliance. [0.51 0.49] 1
They are very nice but very troublesome to prepare. [0.51 0.49] 1
Nearly all the individual works in the collection are in the public domain in the United States. [0.485 0.515] 1
Despite these efforts, Project Gutenberg-tm electronic works, and the medium on which they may be stored, may contain "Defects," such as, but not limited to, incomplete, inaccurate or corrupt data, transcription errors, a copyright or other intellectual property infringement, a defective or damaged disk or other medium, a computer virus, or computer codes that damage or cannot be read by your equipment. [0.485 0.515] 1
The Foundation makes no representations concerning the copyright status of any work in any country outside the United States. [0.485 0.515] 1
Cypris jam nova, jam recens, [0.515 0.485] 0
Bottle and cork tight and keep in a dark place. [0.485 0.515] 1
Richmond Maids of Honor 138 [0.48 0.52] 1
Contact the Foundation as set forth in Section 3 below. [0.52 0.48] 1
Simple Cottage was clear and clean, [0.525 0.475] 0
Pointes d'Asperges [0.53 0.47] 1
Do not leave it in too long, or it will become dry. [0.53 0.47] 1
Lux coruscat fulgida: [0.535 0.465] 0
they can be raised over steam [0.465 0.535] 1
Lincoln, our more than Washington. [0.535 0.465] 0
Finally, let's look at the lines of poetry deemed most recipe-like:
recipelike = []
for idx in np.argsort(preds_with_prob[:,0]):
    if y_test[idx] == 0:
        recipelike.append((text_test[idx], preds_with_prob[idx]))

for item, score in recipelike[:12]:
    print(item, score)
Though strawberries and raspberries, [0.115 0.885]
To prune these growing Plants, & tend these Flours, [0.295 0.705]
Stir of sweet enlightenments; [0.355 0.645]
Is a little red house with a little straw cap [0.435 0.565]
saved. [0.445 0.555]
And for no less than aromatic wine [0.455 0.545]
And we’ll carry home some watter [0.505 0.495]
O was an orange [0.51 0.49]
Cypris jam nova, jam recens, [0.515 0.485]
Simple Cottage was clear and clean, [0.525 0.475]
Lincoln, our more than Washington. [0.535 0.465]
Lux coruscat fulgida: [0.535 0.465]
And poem-like recipe sentences:
poemlike = []
for idx in np.argsort(preds_with_prob[:,1]):
    if y_test[idx] == 1:
        poemlike.append((text_test[idx], preds_with_prob[idx]))

for item, score in poemlike[:12]:
    print(item, score)
Unto the dumb lips of the flock he lent Sad, pleading words, showing how man, who prays [1. 0.]
One day I was speaking to an authority on this subject, and I asked him how it was that he knew so decidedly that most of the murders and the crimes with the knife were perpetrated by that particular class of men, and his answer was suggestive, although horrible. [0.975 0.025]
"Suffer the ox to plough, and impute his death to age and Nature's hand. [0.975 0.025]
He that killeth an ox is as if he slew a man.--Isaiah lxvi. [0.965 0.035]
Then said Daniel to Melzar [the steward], whom the prince of the eunuchs had set over Daniel, Hananiah, Mishael, and Azariah: Prove thy servants, I beseech thee, ten days; and let them give us pulse to eat, and water to drink. [0.96 0.04]
as the one dieth, so dieth the other; yea, they have all one breath; so that a man hath no preeminence above a beast: for all is vanity. [0.945 0.055]
THE CHRIST IDEAL WRITTEN WITHIN, IN THEIR OWN SOULS [0.91 0.09]
I ask you to recognize your duty as men and women who should _raise_ [0.895 0.105]
Being constantly in the sight and the smell of blood, their whole nature is coarsened; accustomed to kill thousands of creatures, they lose all sense of reverence for sentient life, they grow indifferent to the suffering they continually see around them; accustomed to inflict pain, they grow callous to the sight of pain; accustomed to kill swiftly, and sometimes not even waiting until the creature is dead before the skin is stripped from it, their nerves become coarsened, hardened, and brutalized, and they are less men as men because they are slaughterers of animals. [0.88 0.12]
, THIS BOOK IS Affectionately Inscribed. [0.87 0.13]
who should try to make it _pure_, not _foul_; and therefore, in the name of Human Brotherhood, I appeal to you to leave your own tables free from the stain of blood and your consciences free from the degradation of your fellow-men." [0.865 0.135]
"If I may not appeal to you in the name of the animals--if under mistaken views you regard animals as not sharing _your kind of life_--then I appeal to you in the name of _human brotherhood_, and remind you of your duty to your fellow-men, your duty to your nation, which must be built up partly of the children of those who slaughter--who physically inherit the very signs of this brutalizing occupation. [0.855 0.145]
You can predict the category of an arbitrary sentence by first vectorizing it using the same method you used to vectorize the sentences in the model, then passing it to .predict() or .predict_proba():
rfc.predict_proba([summary("Roses are red, violets are blue")])
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed: 0.1s finished
array([[0.675, 0.325]])
This tells us the model believes the sentence we gave it is a bit more likely to be poetry than a recipe. You can repeat this for an entire text! The following cell reads in the contents of a file line by line, calculates the vectors for those lines, and runs the prediction on them. (The file used here, sonnets.txt, contains Shakespeare's sonnets.)
file_lines = [line.strip() for line in open("sonnets.txt").readlines() if len(line.strip()) > 0]
file_vecs = [summary(line) for line in file_lines]
file_probs = rfc.predict_proba(file_vecs)
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed: 0.1s finished
And now we can sort the text according to its most recipe-like lines:
idx = np.argsort(file_probs[:,0])
for i in idx[:25]:
    print(file_probs[i], file_lines[i])
[0.435 0.565] Like stones of worth they thinly placed are,
[0.46 0.54] Growing a bath and healthful remedy,
[0.49 0.51] Which should example where your equal grew.
[0.495 0.505] Love's fire heats water, water cools not love.
[0.5 0.5] Or bends with the remover to remove:
[0.525 0.475] Or whether revolution be the same.
[0.535 0.465] And brought to medicine a healthful state
[0.54 0.46] So far from variation or quick change?
[0.54 0.46] Which works on leases of short-number'd hours,
[0.545 0.455] The worth of that is that which it contains,
[0.56 0.44] With eager compounds we our palate urge;
[0.57 0.43] Than public means which public manners breeds.
[0.58 0.42] To set a form upon desired change,
[0.585 0.415] Some fresher stamp of the time-bettering days.
[0.585 0.415] And the firm soil win of the watery main,
[0.59 0.41] Each changing place with that which goes before,
[0.595 0.405] They are but dressings of a former sight.
[0.595 0.405] For when these quicker elements are gone
[0.595 0.405] Nor services to do, till you require.
[0.595 0.405] And to his palate doth prepare the cup:
[0.605 0.395] What strained touches rhetoric can lend,
[0.605 0.395] Increasing store with loss, and loss with store;
[0.605 0.395] The dedicated words which writers use
[0.605 0.395] Will be a tatter'd weed of small worth held:
[0.61 0.39] When your sweet issue your sweet form should bear.
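Symmetrically, reversing the sort order surfaces the most poem-like lines:

for i in idx[::-1][:25]:
    print(file_probs[i], file_lines[i])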