!pip install fastai --upgrade
from fastai.text.all import *
import numpy as np
import pandas as pd
from IPython.display import display, HTML
path = untar_data(URLs.IMDB)
(path/'..').ls()
(#1) [Path('/root/.fastai/data/imdb/../imdb')]
txt_files = get_text_files(path, folders=['train', 'test', 'unsup'])
txt = txt_files[0].open().read()
txt
'I still can\'t figure out why any self-respecting person would ever attempt to make a film that is as stupid as In The Woods. Or better yet, why any decent person would ever rent, let alone buy, this piece of utter garbage.<br /><br />I think the writer should win the award for the dumbest storyline ever made into an actual movie.<br /><br />Everything about this movie just screams of stupidity. The acting is very mechanical and fake, the special effects (if you can call them that) and the "scary monster" seem like they\'re from an old 80\'s PBS tv show.<br /><br />Well, the list goes on and on. I won\'t bore you with all the details, if you want to be super bored you can go out and rent this movie!'
tokenizer = WordTokenizer()
toks = first(tokenizer([txt]))
print(f'{coll_repr(toks, 30)}\n{toks}')
(#151) ['I','still','ca',"n't",'figure','out','why','any','self','-','respecting','person','would','ever','attempt','to','make','a','film','that','is','as','stupid','as','In','The','Woods','.','Or','better'...]
(#151) ['I','still','ca',"n't",'figure','out','why','any','self','-'...]
first(tokenizer(["I'm going to U.S.A tomorrow."]))
(#7) ['I',"'m",'going','to','U.S.A','tomorrow','.']
first(tokenizer([["I'm going to U.S.A tomorrow."], ["Hey there how are you ?"]]))
(#11) ['[','"','I',"'m",'going','to','U.S.A','tomorrow','.','"'...]
Fastai's Tokenizer class adds extra functionality on top of WordTokenizer: special tokens such as xxbos and xxmaj, plus the preprocessing rules listed in defaults.text_proc_rules
tkn = Tokenizer(tokenizer)
tkn(txt)
(#163) ['xxbos','i','still','ca',"n't",'figure','out','why','any','self'...]
defaults.text_proc_rules
[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]
??fix_html
x = 'how are you, I && my sister'
# a hand-rolled mini version of such cleanup rules: replace '&&', drop the comma,
# then collapse the resulting double space
x.replace('&&', '&').replace(',', ' ').replace('  ', ' ')
'how are you I & my sister'
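To get a feel for how one of these rules might be implemented, here is a minimal regex sketch in the spirit of replace_rep; it is illustrative only, not fastai's actual code:
import re

def my_replace_rep(t):
    # collapse runs of 4+ identical characters into: xxrep <count> <char>
    def _repl(m):
        c, cc = m.groups()
        return f' xxrep {len(cc)+1} {c} '
    return re.sub(r'(\S)(\1{3,})', _repl, t)

my_replace_rep('soooooo good')  # 's xxrep 6 o  good'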
L(['JDKSLK', 'djksdj', 'djlskdlk', 2])
(#4) ['JDKSLK','djksdj','djlskdlk',2]
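L is fastcore's list replacement: it shows a count, truncates its repr at 10 items, and (assuming I remember the API correctly) supports indexing with several indices at once:
xs = L(range(10))
xs[2, 4, 6]  # expected: (#3) [2,4,6]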
# read the first 2,000 reviews into an L
sample_txt = L([f.open().read() for f in txt_files[:2000]])
sample_txt
(#2000) ['I still can\'t figure out why any self-respecting person would ever attempt to make a film that is as stupid as In The Woods. Or better yet, why any decent person would ever rent, let alone buy, this piece of utter garbage...',"There is so much tragedy that takes place in the world involving the military and others involved in physical conflict, yet it is rare that a soldier comes forward to tell the truth...","This was a terrible movie with a bad plot line. They repeated the same story line as the first but with a different love interest for the same main character...",'What a waste of talent and cinematography. I saw the DVD version and the photography and setting was great; very crisp and rich...',"I thought I could never find a completely bad movie. Well, I found it. It's this movie..."...]
!pip install 'sentencepiece!=0.1.90,!=0.1.91'
Collecting sentencepiece!=0.1.90,!=0.1.91
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
     |████████████████████████████████| 1.0MB 5.4MB/s
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.86
def subword(sz):
    # train a SentencePiece subword tokenizer with a vocabulary of `sz` tokens
    # on our 2,000-review sample, then show the first 40 tokens of `txt`
    st = SubwordTokenizer(vocab_sz=sz)
    st.setup(sample_txt)
    return ' '.join(first(st([txt]))[:40])
subword(1000)
"▁I ▁still ▁can ' t ▁figure ▁out ▁why ▁any ▁self - re s p ect ing ▁person ▁would ▁ever ▁attempt ▁to ▁make ▁a ▁film ▁that ▁is ▁as ▁stupid ▁as ▁In ▁The ▁W ood s . ▁O r ▁better ▁yet ,"
Increasing the vocab size increases the number of characters per token (tokens approach whole words); decreasing it pushes tokens down toward individual characters:
subword(200)
"▁I ▁st i l l ▁c an ' t ▁f i g u re ▁ o u t ▁w h y ▁ an y ▁ s e l f - re s p e c t ing ▁p er s"
subword(10000)
"▁I ▁still ▁can ' t ▁figure ▁out ▁why ▁any ▁self - respect ing ▁person ▁would ▁ever ▁attempt ▁to ▁make ▁a ▁film ▁that ▁is ▁as ▁stupid ▁as ▁In ▁The ▁Wood s . ▁Or ▁better ▁yet , ▁why ▁any ▁decent ▁person ▁would"
Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.
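To put rough numbers on that tradeoff, we can count how many tokens each vocab size produces for the same review (this re-trains SentencePiece three times, so it is a bit slow):
for sz in (200, 1000, 10000):
    st = SubwordTokenizer(vocab_sz=sz)
    st.setup(sample_txt)
    # larger vocabs should yield fewer (longer) tokens for the same text
    print(f'vocab {sz:>6}: {len(first(st([txt])))} tokens')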
Numericalization is the process of mapping tokens to integers
tokens = tkn(txt)
# .map applies the tkn function to each text in sample_txt
toks200 = sample_txt[:200].map(tkn)
toks200
(#200) [(#163) ['xxbos','i','still','ca',"n't",'figure','out','why','any','self'...],(#149) ['xxbos','xxmaj','there','is','so','much','tragedy','that','takes','place'...],(#196) ['xxbos','xxmaj','this','was','a','terrible','movie','with','a','bad'...],(#85) ['xxbos','xxmaj','what','a','waste','of','talent','and','cinematography','.'...],(#73) ['xxbos','i','thought','i','could','never','find','a','completely','bad'...],(#228) ['xxbos','i','first','saw','this','film','when','my','mother','bought'...],(#138) ['xxbos','i','have','a','very','hard','time','picking','a','favorite'...],(#211) ['xxbos','a','group','of','friends','are','out','partying','one','night'...],(#917) ['xxbos','xxmaj','some','films','are','mediocre',',','some','films','are'...],(#110) ['xxbos','i','do',"n't",'mind','the','movies','not','being','the'...]...]
num = Numericalize()
num.setup(toks200)
num.vocab[:5]
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld']
print(coll_repr(num.vocab, 30))
(#2024) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','in','i','it','"','this','that',"'s",'\n\n','-','was','as','with','for'...]
num(tokens)[:20]
tensor([ 2, 18, 138, 190, 39, 720, 62, 154, 113, 721, 25, 0, 295, 68, 155, 502, 15, 122, 12, 30])
tokens
(#163) ['xxbos','i','still','ca',"n't",'figure','out','why','any','self'...]
Building a numericalization function
def token2idx(token):
    # look up a token's integer id in the vocab; return 'unk' for anything missing
    try:
        return num.vocab.index(token)
    except ValueError:
        return 'unk'
token2idx('xxbos'), token2idx('his'), token2idx('jdksjdj'), token2idx(4)
(2, 43, 'unk', 'unk')
' '.join(num.vocab[idx] for idx in num(tokens)[:20])
"xxbos i still ca n't figure out why any self - xxunk person would ever attempt to make a film"
original_text = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
original_preprocessed_text = "xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \n xxmaj then we will study how we build a language model and train it for a while ."
my_preprocessed_text = ' '.join(tkn(original_text))
my_preprocessed_text
"xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \n xxmaj then we will study how we build a language model and train it for a while ."
original_preprocessed_text == my_preprocessed_text
True
bs, seq_len = 6, 15
stream = tkn(original_text)
[stream[i*seq_len: (i+1)*seq_len] for i in range(bs)]
[(#15) ['xxbos','xxmaj','in','this','chapter',',','we','will','go','back'...], (#15) ['movie','reviews','we','studied','in','chapter','1','and','dig','deeper'...], (#15) ['first','we','will','look','at','the','processing','steps','necessary','to'...], (#15) ['how','to','customize','it','.','xxmaj','by','doing','this',','...], (#15) ['of','the','preprocessor','used','in','the','data','block','xxup','api'...], (#15) ['will','study','how','we','build','a','language','model','and','train'...]]
df = pd.DataFrame(np.array([stream[i*seq_len: (i+1)*seq_len] for i in range(bs)]))
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | xxbos | xxmaj | in | this | chapter | , | we | will | go | back | over | the | example | of | classifying |
1 | movie | reviews | we | studied | in | chapter | 1 | and | dig | deeper | under | the | surface | . | xxmaj |
2 | first | we | will | look | at | the | processing | steps | necessary | to | convert | text | into | numbers | and |
3 | how | to | customize | it | . | xxmaj | by | doing | this | , | we | 'll | have | another | example |
4 | of | the | preprocessor | used | in | the | data | block | xxup | api | . | \n | xxmaj | then | we |
5 | will | study | how | we | build | a | language | model | and | train | it | for | a | while | . |
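For a language model the dependent variable is the same stream shifted one token to the right; here is a minimal sketch of the (x, y) pairs that LMDataLoader will build for us below:
# each target row is its input row advanced by one position in the stream
xs = [stream[i*seq_len     : (i+1)*seq_len    ] for i in range(bs)]
ys = [stream[i*seq_len + 1 : (i+1)*seq_len + 1] for i in range(bs)]
xs[0][:5], ys[0][:5]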
A quick digression: some simple tests of pandas HTML display
display(HTML(df.to_html(index=False, header=None)))
xxbos | xxmaj | in | this | chapter | , | we | will | go | back | over | the | example | of | classifying |
movie | reviews | we | studied | in | chapter | 1 | and | dig | deeper | under | the | surface | . | xxmaj |
first | we | will | look | at | the | processing | steps | necessary | to | convert | text | into | numbers | and |
how | to | customize | it | . | xxmaj | by | doing | this | , | we | 'll | have | another | example |
of | the | preprocessor | used | in | the | data | block | xxup | api | . | \n | xxmaj | then | we |
will | study | how | we | build | a | language | model | and | train | it | for | a | while | . |
sample_df = pd.DataFrame([[33, 44, 25]], columns=['epoch', 'error', 'accuracy'])
sample_df
epoch | error | accuracy | |
---|---|---|---|
0 | 33 | 44 | 25 |
for i in range(5):
    sample_df = pd.DataFrame([[i+1, i*44, i*25]], columns=['epoch', 'error', 'accuracy'])
    # to_html(index=False) drops the index column before rendering
    display(HTML(sample_df.to_html(index=False)))
epoch | error | accuracy |
---|---|---|
1 | 0 | 0 |
epoch | error | accuracy |
---|---|---|
2 | 44 | 25 |
epoch | error | accuracy |
---|---|---|
3 | 88 | 50 |
epoch | error | accuracy |
---|---|---|
4 | 132 | 75 |
epoch | error | accuracy |
---|---|---|
5 | 176 | 100 |
for i in range(5):
    sample_df = pd.DataFrame([[i+1, i*44, i*25]], columns=['epoch', 'error', 'accuracy'])
    display(sample_df)  # plain display keeps the index column
epoch | error | accuracy | |
---|---|---|---|
0 | 1 | 0 | 0 |
epoch | error | accuracy | |
---|---|---|---|
0 | 2 | 44 | 25 |
epoch | error | accuracy | |
---|---|---|---|
0 | 3 | 88 | 50 |
epoch | error | accuracy | |
---|---|---|---|
0 | 4 | 132 | 75 |
epoch | error | accuracy | |
---|---|---|---|
0 | 5 | 176 | 100 |
Back to business
nums200 = toks200.map(num)
dl = LMDataLoader(nums200, bs=32)
x, y = first(dl)
x.shape, y.shape
(torch.Size([32, 72]), torch.Size([32, 72]))
' '.join([num.vocab[i] for i in x[1][:20]]), ' '.join([num.vocab[i] for i in y[1][:20]])
('the xxmaj xxunk xxmaj party . i can understand why a lot of people look down on writing like his', 'xxmaj xxunk xxmaj party . i can understand why a lot of people look down on writing like his ,')
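We can check that offset directly; assuming LMDataLoader shifts the targets by exactly one token, the following should print True:
torch.equal(x[:, 1:], y[:, :-1])  # y is x advanced one position within the stream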
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
db = DataBlock(
    blocks=TextBlock.from_folder(path/'..', is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
)
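For reference, the higher-level TextDataLoaders factory can build language-model dataloaders in one call; this is a sketch assuming from_folder's signature, and it may not pick up the unsup folder the way our get_imdb does:
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1, bs=32, seq_len=80)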
class MyFirstClass:
    age = 26

    @classmethod
    def print_something(cls):
        # a classmethod receives the class itself as `cls`, not an instance
        return cls.age

    def show(self, num):
        return num

# the decorator is equivalent to:
# MyFirstClass.print_something = classmethod(MyFirstClass.print_something)
MyFirstClass.print_something()
26
MyFirstClass().show(4)
4
class Person:
    age = 25

    def printAge(cls):
        print('The age is:', cls.age)

# turn printAge into a class method after the fact
Person.printAge = classmethod(Person.printAge)
Person.printAge()
The age is: 25
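The idiomatic use of a classmethod is as an alternate constructor, which is exactly the pattern behind fastai's TextBlock.from_folder; here is a hypothetical sketch of that pattern:
class MyDataset:
    def __init__(self, items):
        self.items = items

    @classmethod
    def from_folder(cls, folder):
        # hypothetical alternate constructor: build the object from files on disk
        return cls(sorted(Path(folder).glob('*.txt')))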
Back to business
dls = db.dataloaders(path, bs=32, seq_len=80)
dls.show_batch(max_n=2)
text | text_ | |
---|---|---|
0 | xxbos xxmaj season after season , the players or characters in this show appear to be people who you 'd absolutely love to hate . xxmaj is this show rigged to be that or were they chosen for the same ? xxmaj each episode vilifies one single person specifically and he ends up getting killed off . xxmaj you enjoy seeing them get screwed although its totally wrong and sick . xxmaj you enjoy seeing them screwing others , getting | xxmaj season after season , the players or characters in this show appear to be people who you 'd absolutely love to hate . xxmaj is this show rigged to be that or were they chosen for the same ? xxmaj each episode vilifies one single person specifically and he ends up getting killed off . xxmaj you enjoy seeing them get screwed although its totally wrong and sick . xxmaj you enjoy seeing them screwing others , getting screwed |
1 | are simply hilarious but most of the film is rather lame . xxmaj at least the music score is very good but the music always ends abruptly because of the editing . xxmaj there are also a few scenes that are not logical and film also contains a very obvious ( and therefore disturbing ) continuity error . xxmaj jean xxmaj reno gives a decent performance and xxmaj christian xxmaj clavier turned out to be a very talented comedy actor | simply hilarious but most of the film is rather lame . xxmaj at least the music score is very good but the music always ends abruptly because of the editing . xxmaj there are also a few scenes that are not logical and film also contains a very obvious ( and therefore disturbing ) continuity error . xxmaj jean xxmaj reno gives a decent performance and xxmaj christian xxmaj clavier turned out to be a very talented comedy actor . |
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()
learn.fit_one_cycle(1, 2e-2)
learn.save('1epoch')
Path('models/1epoch.pth')
learn.load('1epoch')
<fastai.text.learner.LMLearner at 0x7f153a68fe80>
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
learn.save_encoder('finetuned')
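save_encoder stores only the encoder, i.e. the model minus the final layer that maps activations back to vocabulary probabilities; that encoder is what we reuse for classification. Assuming the usual AWD_LSTM layout, the encoder is the first element of the model:
learn.model[0]  # the encoder: embeddings + LSTM stack, without the LM head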
text = 'I like the movie because'
num_words = 40
learn.predict(text, num_words, temperature=0.75)
"i like the movie because it does n't work . i can not imagine the movies being mainly filmed in England . It 's an adaptation of William Shakespeare 's play of the same name to play the King"
db_classifier = DataBlock(
blocks=(TextBlock.from_folder(path, vocab=dls.vocab), CategoryBlock),
get_items = partial(get_text_files, folders=['train', 'test']),
get_y=parent_label,
splitter=GrandparentSplitter(valid_name='test')
)
dls_cls = db_classifier.dataloaders(path, bs=32, seq_len=72)
dls_cls.show_batch(max_n=2)
learn = text_classifier_learner(dls_cls, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(1, 2e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.559645 | 0.344728 | 0.851360 | 12:34 |
# unfreeze only the last two parameter groups; everything earlier stays frozen
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-3/(2.6**4),1e-3))
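The slice(lr/(2.6**4), lr) pattern applies discriminative learning rates: the earliest layer group trains at the low end and each later group gets a higher rate, with the 2.6-per-group factor coming from the ULMFiT work. The endpoints for the last stage:
lr = 1e-3
lr/(2.6**4), lr  # (~2.19e-05, 0.001): lowest and highest rates across the groups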