#!/usr/bin/env python
# coding: utf-8

# # (MultiFiT) Portuguese Text Classifier on TCU jurisprudência dataset

# ### MultiFiT configuration
# - **Architecture: 4 QRNN layers with 1550 hidden parameters each, SentencePiece tokenizer (15 000 tokens)**
# - **Hyperparameters and training method from the MultiFiT paper**
# - Author: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)
# - Date: **edition of October 15, 2019** (initial publication in September 2019)
# - Post on Medium: [link](https://medium.com/@pierre_guillou/nlp-fastai-portuguese-language-model-980c8ec75362)
# - Ref: [Fastai v1](https://docs.fast.ai/) (Deep Learning library on PyTorch)

# ## Warning (15/10/2019)

# **This notebook is a modified version of the v1 published in September 2019.** Indeed (thanks to [David Vieira](https://medium.com/@davidhsv/ol%C3%A1-pierre-tudo-bom-2bc8ae36dc14)), we noticed that the fine-tuning of the LM and classifier did not use the SentencePiece model and vocab trained for the General Portuguese Language Model ([lm3-portuguese.ipynb](https://github.com/piegu/language-models/blob/master/lm3-portuguese.ipynb)).
# 
# For example, the code used to create the fine-tuned Portuguese forward LM was:
# 
# ```data_lm = (TextList.from_df(df_trn_val, path, cols=reviews,
#                 processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])
#     .split_by_rand_pct(0.1, seed=42)
#     .label_for_lm()
#     .databunch(bs=bs, num_workers=1))```
# 
# It has been corrected by using the [SPProcessor.load()](https://github.com/fastai/fastai/blob/master/fastai/text/data.py#L481) function:
# 
# ```data_lm = (TextList.from_df(df_trn_val, path, cols=reviews, processor=SPProcessor.load(dest))
#     .split_by_rand_pct(0.1, seed=42)
#     .label_for_lm()
#     .databunch(bs=bs, num_workers=1))```
# 
# Therefore, we retrained the fine-tuned Portuguese forward LM and the classifier on the TCU jurisprudência dataset and **got better results! :-)** (see the Results section for all results)
# 
# - **(fine-tuned) Language Model**
#     - forward : (accuracy) **51.56%** instead of 44.66% | (perplexity) 11.38 instead of 15.97
#     - backward: (accuracy) **52.15%** instead of 44.97% | (perplexity) 12.54 instead of 18.73
# 
# - **(fine-tuned) Text Classifier**
#     - **Accuracy** (ensemble): **97.95%** instead of 97.39%
#     - **f1 score** (ensemble): **0.9795** instead of 0.9737

# ## Information

# ### Overview
# 
# According to the article "[MultiFiT: Efficient Multi-lingual Language Model Fine-tuning](https://arxiv.org/abs/1909.04761)" (September 10, 2019), the QRNN architecture and the SentencePiece tokenizer give better results than AWD-LSTM and the spaCy tokenizer, respectively.
# 
# Therefore, they have been used in this notebook to **fine-tune a Portuguese Bidirectional Language Model** by transfer learning from a Portuguese Bidirectional Language Model (also with the QRNN architecture and the SentencePiece tokenizer) trained on a Wikipedia corpus of 100 million tokens ([lm3-portuguese.ipynb](https://github.com/piegu/language-models/blob/master/lm3-portuguese.ipynb)).
# 
# This Portuguese Bidirectional Language Model has been **fine-tuned on the [tcu_jurisp_reduzido.csv dataset about TCU jurisprudência](https://github.com/fastai-bsb/nlp-tcu-enunciados/blob/master/tcu_jurisp_reduzido.csv?raw=true)** and **its encoder has been transferred to a text classifier, which was then trained on this corpus**.
# 
# This process **LM General --> LM fine-tuned --> Classifier fine-tuned** is called [ULMFiT](http://nlp.fast.ai/category/classification.html), but we trained our 3 models with the hyperparameter values and training method given at the end of the [MultiFiT](https://arxiv.org/abs/1909.04761) paper.
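# The three stages above map onto fastai v1 code used later in this notebook. The fenced snippet below is only a condensed, illustrative sketch (not an extra cell to run as-is): `data_lm`, `data_clas`, `lm_fns3` and `dest` are defined further down, the learning rates are written symbolically as `lr`, and the intermediate name `'ft_enc'` is a placeholder for the encoder file names actually used below.
# 
# ```# 1. General LM --> fine-tuned LM: QRNN config + pretrained Wikipedia weights
# config = awd_lstm_lm_config.copy()
# config['qrnn'], config['n_hid'], config['n_layers'] = True, 1550, 4
# learn_lm = language_model_learner(data_lm, AWD_LSTM, config=config,
#                                   pretrained_fnames=lm_fns3, drop_mult=1.)
# learn_lm.fit_one_cycle(2, lr*10, moms=(0.8,0.7))   # frozen
# learn_lm.unfreeze()
# learn_lm.fit_one_cycle(18, lr, moms=(0.8,0.7))     # unfrozen
# learn_lm.save_encoder('ft_enc')
# 
# # 2. Fine-tuned LM --> classifier: reuse the encoder
# config_clas = awd_lstm_clas_config.copy()
# config_clas['qrnn'], config_clas['n_hid'], config_clas['n_layers'] = True, 1550, 4
# learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config_clas, drop_mult=0.3)
# learn_c.load_encoder('ft_enc')
# 
# # 3. Gradual unfreezing with discriminative learning rates (1-cycle, cyclical momentum)
# learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))
# learn_c.freeze_to(-2)
# learn_c.fit_one_cycle(2, slice(lr/(2.6**4), lr), moms=(0.8,0.7))
# learn_c.unfreeze()
# learn_c.fit_one_cycle(4, slice(lr/10/(2.6**4), lr/10), moms=(0.8,0.7))```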
# ### Hyperparameters values
# 
# - Language Model
#     - (batch size) bs = 50
#     - (QRNN) 4 QRNN layers (default: 3) with 1550 hidden parameters each (default: 1152)
#     - (SentencePiece) vocab of 15000 tokens
#     - (dropout) drop_mult = 1.0
#     - (weight decay) wd = 0.1
#     - (number of training epochs) 20 epochs
#     - (learning rate) modified version of the 1-cycle learning rate schedule (Smith, 2018) that uses cosine instead of linear annealing, cyclical momentum and discriminative fine-tuning
#     - (loss) FlattenedLoss of LabelSmoothingCrossEntropy
# 
# 
# - Text Classifier
#     - (batch size) bs = 18
#     - (SentencePiece) vocab of 15000 tokens
#     - (dropout) drop_mult = 0.3
#     - (weight decay) wd = 0.1
#     - (number of training epochs) 14 epochs (forward) and 19 epochs (backward)
#     - (learning rate) modified version of the 1-cycle learning rate schedule (Smith, 2018) that uses cosine instead of linear annealing, cyclical momentum and discriminative fine-tuning
#     - (loss) FlattenedLoss of weighted LabelSmoothingCrossEntropy

# ## Results

# **We can conclude that this Bidirectional Portuguese LM model using the MultiFiT configuration is a good model to perform text classification but, with about 46 million parameters, it is far from being an LM that can compete with [GPT-2](https://openai.com/blog/better-language-models/) or [BERT](https://arxiv.org/abs/1810.04805) in NLP tasks like text generation.**
# 
# 
# - **About the data**: the dataset [tcu_jurisp_reduzido.csv](https://github.com/fastai-bsb/nlp-tcu-enunciados/blob/master/tcu_jurisp_reduzido.csv?raw=true) about "TCU jurisprudência" is unbalanced. Therefore, we used a weighted loss function (FlattenedLoss of weighted LabelSmoothingCrossEntropy).
#     - number of texts: 10263
#     - class 0: 3468 (33.79%)
#     - class 1: 2723 (26.53%)
#     - class 2: 2297 (22.38%)
#     - class 3: 1775 (17.30%)
# 
# 
# - **(fine-tuned) Language Model**
#     - forward : (accuracy) 51.56% | (perplexity) 11.38
#     - backward: (accuracy) 52.15% | (perplexity) 12.54
# 
# 
# - **(fine-tuned) Text Classifier**
# 
#     - **Accuracy**
#         - forward : (global) 97.08% | (class 0) 98.49% | (class 1) 98.24% | (class 2) 96.71% | (class 3) 93.40%
#         - backward: (global) 97.07% | (class 0) 99.10% | (class 1) 97.89% | (class 2) 96.71% | (class 3) 92.89%
#         - ensemble: (global) **97.95%** | (class 0) **99.40%** | (class 1) **99.30%** | (class 2) **97.18%** | (class 3) **94.42%**
# 
#     - **f1 score**
#         - forward: 0.9707
#         - backward: 0.9708
#         - ensemble: **0.9795**
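# Since the dataset is unbalanced (see the class counts above), the classifier loss used below is weighted. As a quick worked example of the weighting scheme (weight = 1 - class_count/total), here are the weights obtained from the full-dataset counts listed above; note that the notebook itself computes the weights from the training split, so the actual values differ slightly.

# worked example of the class-weight scheme used for the classifier loss: weight_c = 1 - count_c/total,
# so the most frequent class gets the smallest weight (counts are the full-dataset counts listed above)
counts = {0: 3468, 1: 2723, 2: 2297, 3: 1775}
total = sum(counts.values())  # 10263

weights = {c: 1 - n/total for c, n in counts.items()}
for c, w in weights.items():
    print(f'class {c}: count={counts[c]:5d}  weight={w:.3f}')
# class 0: count= 3468  weight=0.662  (most frequent class, smallest weight)
# class 3: count= 1775  weight=0.827  (least frequent class, largest weight)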
# ## Initialisation

# In[1]:

get_ipython().run_line_magic('reload_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
get_ipython().run_line_magic('matplotlib', 'inline')

from fastai import *
from fastai.text import *
from fastai.callbacks import *

import matplotlib.cm as cm

# In[2]:

get_ipython().system('python -m fastai.utils.show_install')

# In[2]:

# bs=48
# bs=24
bs=50

# In[3]:

torch.cuda.set_device(0)

# In[4]:

data_path = Config.data_path()

# This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the wikipedia contents. (For other languages, replace `{lang}` with the appropriate code from the [list of wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias).)

# In[5]:

lang = 'pt'

# In[6]:

name = f'{lang}wiki'
path = data_path/name
path.mkdir(exist_ok=True, parents=True)

lm_fns3 = [f'{lang}_wt_sp15_multifit', f'{lang}_wt_vocab_sp15_multifit']
lm_fns3_bwd = [f'{lang}_wt_sp15_multifit_bwd', f'{lang}_wt_vocab_sp15_multifit_bwd']

# In[7]:

from sklearn.metrics import f1_score

# weighted F1 computed with scikit-learn on the predicted class indices
@np_func
def f1(inp,targ): return f1_score(targ, np.argmax(inp, axis=-1), average='weighted')

# In[8]:

# source: https://github.com/fastai/fastai/blob/master//fastai/layers.py#L300:7
# blog: https://bfarzin.github.io/Label-Smoothing/

# label smoothing cross entropy with per-class weights applied to the NLL term
class WeightedLabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, weight, eps:float=0.1, reduction='mean'):
        super().__init__()
        self.weight,self.eps,self.reduction = weight,eps,reduction

    def forward(self, output, target):
        c = output.size()[-1]
        log_preds = F.log_softmax(output, dim=-1)
        if self.reduction=='sum':
            loss = -log_preds.sum()
        else:
            loss = -log_preds.sum(dim=-1)
            if self.reduction=='mean':
                loss = loss.mean()
        return loss*self.eps/c + (1-self.eps) * F.nll_loss(log_preds, target, weight=self.weight, reduction=self.reduction)

# In[9]:

import warnings
warnings.filterwarnings('ignore') # "error", "ignore", "always", "default", "module" or "once"

# ## Data

# TCU jurisprudência:
# - reduzido: https://github.com/fastai-bsb/nlp-tcu-enunciados/blob/master/tcu_jurisp_reduzido.csv
# - completo: https://github.com/fastai-bsb/nlp-tcu-enunciados/blob/master/tcu_jurisp.csv

# ### Download

# In[8]:

import urllib.request
from converter import *

# In[9]:

# create TCU folder
name_data = 'TCU'
path_data = data_path/name_data
path_data.mkdir(exist_ok=True, parents=True)

# In[10]:

get_ipython().run_cell_magic('time', '', "# Download each file from url and save it locally under file_name\n\nurl = 'https://github.com/fastai-bsb/nlp-tcu-enunciados/blob/master/tcu_jurisp_reduzido.csv?raw=true'\nfile_name = 'tcu_jurisp_reduzido.csv'\nurl_file = path_data/file_name\nurllib.request.urlretrieve(url, url_file)\n\nurl = 'https://raw.githubusercontent.com/fastai-bsb/nlp-tcu-enunciados/master/tcu_jurisp.csv'\nfile_name = 'tcu_jurisp.csv'\nurl_file = path_data/file_name\nurllib.request.urlretrieve(url, url_file)\n")

# In[11]:

path_data.ls()

# In[12]:

get_ipython().system('head -n4 {path_data.ls()[0]}')

# ### Overview

# In[13]:

# to solve a display error of pandas dataframes
get_ipython().config.get('IPKernelApp', {})['parent_appname'] = ""

# In[14]:

df = pd.read_csv(path_data/'tcu_jurisp_reduzido.csv', encoding='utf-8')
print(len(df))
print(Counter(df.labels))
df.head()

# In[15]:

df = pd.read_csv(path_data/'tcu_jurisp.csv', encoding='utf-8')
print(len(df))
print(Counter(df.labels))
df.head()

# ### Analysis (reduzido file)

# In[16]:

df = pd.read_csv(path_data/'tcu_jurisp_reduzido.csv', encoding='utf-8')
print(len(df))
print(Counter(df.labels))
df.head()

# In[17]:

# column names
reviews = "text"
label = "labels"

# keep columns
df2 = df[[reviews,label]].copy()
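# The `convert` function applied to the texts further down in this section comes from the local `converter` module imported in the Download paragraph, whose source is not shown in this notebook. Assuming it only turns HTML entities back into normal letters (as the comment "convert HTML characters to normal letters" suggests), a minimal stand-in could look like the sketch below; `convert_fallback` is a hypothetical name, not part of the original code.

# hypothetical stand-in for converter.convert, assuming it only unescapes HTML entities
# (e.g. '&amp;' -> '&', '&ccedil;' -> 'ç')
import html

def convert_fallback(text: str) -> str:
    return html.unescape(text)

# df2[reviews] = df2[reviews].apply(convert_fallback)  # same call pattern as the cell below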
# In[18]:

# number of reviews
print(f'(original csv) number of all reviews: {len(df2)}')

# keep not null reviews
## delete nan reviews
empty_nan = (df2[reviews].isnull()).sum()
df2 = df2[df2[reviews].notnull()]

## delete empty reviews
list_idx_none = []
for idxs, row in df2.iterrows():
    if row[reviews].strip() == "":
        df2.drop(idxs, axis=0, inplace=True)
        list_idx_none.append(idxs)
empty_none = len(list_idx_none)

## print results
empty = empty_nan+empty_none
if empty != 0:
    print(f'{empty} empty reviews were deleted')
else:
    print('there is no empty review.')

# # check that no review appears twice
# # keep the first of unique review_id reviews
# same = len(df2) - len(df2[idx].unique())
# if same != 0:
#     df2.drop_duplicates(subset=[idx], inplace=True)
#     print(f'from the {same} identical reviews ids, only the first one has been kept.')
# else:
#     print('there is no identical review id.')

## delete nan labels
empty_label_nan = (df2[label].isnull()).sum()
df2 = df2[df2[label].notnull()]
print(f'{empty_label_nan} reviews with nan label were deleted')

# number of reviews by class
counter = Counter(df2[label])
clas_0, clas_1, clas_2, clas_3 = counter[0], counter[1], counter[2], counter[3]
num = len(df2)
pc_clas_0, pc_clas_1 = round((clas_0/num)*100,2), round((clas_1/num)*100,2)
pc_clas_2, pc_clas_3 = round((clas_2/num)*100,2), round((clas_3/num)*100,2)
print(f'\nnumber of text of class 0: {clas_0} ({pc_clas_0}%)')
print(f'number of text of class 1: {clas_1} ({pc_clas_1}%)')
print(f'number of text of class 2: {clas_2} ({pc_clas_2}%)')
print(f'number of text of class 3: {clas_3} ({pc_clas_3}%)')
print(f'\n(final) number of all texts: {num}')

# convert HTML characters to normal letters
df2[reviews] = df2[reviews].apply(convert)
df2.head(5)

# In[19]:

df_trn_val = df2.copy()

# number of reviews by class
counter = Counter(df_trn_val[label])
clas_0, clas_1, clas_2, clas_3 = counter[0], counter[1], counter[2], counter[3]
num = len(df_trn_val)
pc_clas_0, pc_clas_1 = round((clas_0/num)*100,2), round((clas_1/num)*100,2)
pc_clas_2, pc_clas_3 = round((clas_2/num)*100,2), round((clas_3/num)*100,2)
print(f'\nnumber of text of class 0: {clas_0} ({pc_clas_0}%)')
print(f'number of text of class 1: {clas_1} ({pc_clas_1}%)')
print(f'number of text of class 2: {clas_2} ({pc_clas_2}%)')
print(f'number of text of class 3: {clas_3} ({pc_clas_3}%)')
print(f'\n(final) number of all texts: {num}')

# plot histogram
keys = list(df_trn_val[label].value_counts().keys())
values = list(df_trn_val[label].value_counts().array)
plt.bar(keys, values[::-1])
plt.xticks(keys, keys[::-1])
# print(df_trn_val['label'].value_counts())
plt.show()

# In[20]:

df_trn_val.head()

# In[21]:

df_trn_val.to_csv(path_data/'tcu_jurisp_reduzido_preprocessed.csv', index = None, header=True)

# ## Fine-tuning "forward LM"

# In[10]:

name_data = 'TCU'
path_data = data_path/name_data

# Load csv
df_trn_val = pd.read_csv(path_data/'tcu_jurisp_reduzido_preprocessed.csv')

# column names
reviews = "text"
label = "labels"

# In[11]:

dest = path/'corpus2_100'
(dest/'tmp').ls()

# ### Databunch

# In[22]:

get_ipython().run_cell_magic('time', '', 'data_lm = (TextList.from_df(df_trn_val, path, cols=reviews, processor=SPProcessor.load(dest))\n .split_by_rand_pct(0.1, seed=42)\n .label_for_lm() \n .databunch(bs=bs, num_workers=1))\n')

# In[23]:

data_lm.save(f'{path}/{lang}_databunch_lm_tcu_jurisp_reduzido_sp15_multifit_v2')

# ### Training

# In[24]:

data_lm = load_data(path, f'{lang}_databunch_lm_tcu_jurisp_reduzido_sp15_multifit_v2', bs=bs)

# In[25]:

config = awd_lstm_lm_config.copy()
config['qrnn'] = True
config['n_hid'] = 1550 #default 1152
config['n_layers'] = 4 #default 3

# In[26]:

get_ipython().run_cell_magic('time', '', 'perplexity = Perplexity()\nlearn_lm = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained_fnames=lm_fns3, drop_mult=1., \n metrics=[error_rate, accuracy, perplexity]).to_fp16()\n')

# In[27]:

# number of model parameters
sum([p.numel() for p in learn_lm.model.parameters()])

# In[28]:

learn_lm.model
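# The cell above gives the total number of parameters (roughly the 46 million quoted in the Results section). As an optional sanity check, the short helper below (plain PyTorch, not part of the original notebook) breaks that total down by top-level module of `learn_lm.model`.

# break the parameter count down by top-level module (plain PyTorch, optional sanity check)
def params_by_module(model):
    return {name: sum(p.numel() for p in module.parameters())
            for name, module in model.named_children()}

# for name, n in params_by_module(learn_lm.model).items():
#     print(f'{name}: {n:,} parameters')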
# #### Change loss function

# In[29]:

learn_lm.loss_func

# In[30]:

learn_lm.loss_func = FlattenedLoss(LabelSmoothingCrossEntropy)

# In[31]:

learn_lm.loss_func

# #### Training

# In[32]:

learn_lm.lr_find()

# In[33]:

learn_lm.recorder.plot()

# In[35]:

lr = 2e-2
lr *= bs/48  # scale the base learning rate with the batch size (reference bs = 48)
wd = 0.1

# In[36]:

learn_lm.fit_one_cycle(2, lr*10, wd=wd, moms=(0.8,0.7))

# In[37]:

learn_lm.save(f'{lang}fine_tuned1_tcu_jurisp_reduzido_sp15_multifit_v2')
learn_lm.save_encoder(f'{lang}fine_tuned1_enc_tcu_jurisp_reduzido_sp15_multifit_v2')

# In[38]:

learn_lm.unfreeze()
learn_lm.fit_one_cycle(18, lr, wd=wd, moms=(0.8,0.7), callbacks=[ShowGraph(learn_lm)])

# In[39]:

learn_lm.save(f'{lang}fine_tuned2_tcu_jurisp_reduzido_sp15_multifit_v2')
learn_lm.save_encoder(f'{lang}fine_tuned2_enc_tcu_jurisp_reduzido_sp15_multifit_v2')

# Save best LM learner and its encoder

# In[40]:

learn_lm.save(f'{lang}fine_tuned_tcu_jurisp_reduzido_sp15_multifit_v2')
learn_lm.save_encoder(f'{lang}fine_tuned_enc_tcu_jurisp_reduzido_sp15_multifit_v2')

# ## Fine-tuning "backward LM"

# ### Databunch

# In[41]:

get_ipython().run_cell_magic('time', '', 'data_lm = (TextList.from_df(df_trn_val, path, cols=reviews, processor=SPProcessor.load(dest))\n .split_by_rand_pct(0.1, seed=42)\n .label_for_lm() \n .databunch(bs=bs, num_workers=1, backwards=True))\n')

# In[42]:

data_lm.save(f'{path}/{lang}_databunch_lm_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# ### Training

# In[43]:

get_ipython().run_cell_magic('time', '', "data_lm = load_data(path, f'{lang}_databunch_lm_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', bs=bs, backwards=True)\n")

# In[44]:

config = awd_lstm_lm_config.copy()
config['qrnn'] = True
config['n_hid'] = 1550 #default 1152
config['n_layers'] = 4 #default 3

# In[45]:

get_ipython().run_cell_magic('time', '', 'perplexity = Perplexity()\nlearn_lm = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained_fnames=lm_fns3_bwd, drop_mult=1., \n metrics=[error_rate, accuracy, perplexity]).to_fp16()\n')

# #### Change loss function

# In[46]:

learn_lm.loss_func

# In[47]:

learn_lm.loss_func = FlattenedLoss(LabelSmoothingCrossEntropy)

# In[48]:

learn_lm.loss_func

# #### Training

# In[49]:

learn_lm.lr_find()

# In[50]:

learn_lm.recorder.plot()

# In[51]:

lr = 2e-2
lr *= bs/48
wd = 0.1

# In[52]:

learn_lm.fit_one_cycle(2, lr*10, wd=wd, moms=(0.8,0.7))

# In[53]:

learn_lm.save(f'{lang}fine_tuned1_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')
learn_lm.save_encoder(f'{lang}fine_tuned1_enc_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[54]:

learn_lm.unfreeze()
learn_lm.fit_one_cycle(18, lr, wd=wd, moms=(0.8,0.7), callbacks=[ShowGraph(learn_lm)])

# In[55]:

learn_lm.save(f'{lang}fine_tuned2_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')
learn_lm.save_encoder(f'{lang}fine_tuned2_enc_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# Save best LM learner and its encoder

# In[56]:

learn_lm.save(f'{lang}fine_tuned_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')
learn_lm.save_encoder(f'{lang}fine_tuned_enc_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')
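# Before moving on to the classifiers, the fine-tuned *forward* LM can optionally be sanity-checked by letting it generate a few words with fastai v1's `LanguageLearner.predict`. The fenced snippet below is only a hedged sketch: `learn_lm` must be the forward LM learner saved earlier (at this point in the notebook the variable holds the backward learner), and the prompt is just an illustrative Portuguese fragment.
# 
# ```learn_lm = learn_lm.to_fp32()   # generate in fp32
# prompt = "O Tribunal de Contas da União"
# print(learn_lm.predict(prompt, n_words=30, temperature=0.8))```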
"data_clas.save(f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_v2')\n") # ### Get weights to penalize loss function of the majority class # In[17]: get_ipython().run_cell_magic('time', '', "data_clas = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_v2', bs=bs, num_workers=1)\n") # In[18]: num_trn = len(data_clas.train_ds.x) num_val = len(data_clas.valid_ds.x) num_trn, num_val, num_trn+num_val # In[19]: trn_LabelCounts = np.unique(data_clas.train_ds.y.items, return_counts=True)[1] val_LabelCounts = np.unique(data_clas.valid_ds.y.items, return_counts=True)[1] trn_LabelCounts, val_LabelCounts # In[20]: trn_weights = [1 - count/num_trn for count in trn_LabelCounts] val_weights = [1 - count/num_val for count in val_LabelCounts] trn_weights, val_weights # ### Training (Loss = FlattenedLoss of weighted LabelSmoothingCrossEntropy) # In[46]: get_ipython().run_cell_magic('time', '', "data_clas = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_v2', bs=bs, num_workers=1)\n") # In[47]: config = awd_lstm_clas_config.copy() config['qrnn'] = True config['n_hid'] = 1550 #default 1152 config['n_layers'] = 4 #default 3 # In[48]: learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, pretrained=False, drop_mult=0.3, metrics=[accuracy,f1]).to_fp16() learn_c.load_encoder(f'{lang}fine_tuned_enc_tcu_jurisp_reduzido_sp15_multifit_v2'); # #### Change loss function # In[49]: learn_c.loss_func # In[50]: loss_weights = torch.FloatTensor(trn_weights).cuda() learn_c.loss_func = FlattenedLoss(WeightedLabelSmoothingCrossEntropy, weight=loss_weights) # In[51]: learn_c.loss_func # #### Training # In[52]: learn_c.freeze() # In[28]: learn_c.lr_find() # In[29]: learn_c.recorder.plot() # In[53]: lr = 2e-1 lr *= bs/48 wd = 0.1 # In[54]: learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7)) # In[55]: learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') # In[56]: learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7)) # In[57]: learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') # In[58]: learn_c.freeze_to(-2) learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), wd=wd, moms=(0.8,0.7)) # In[59]: learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') # In[60]: learn_c.freeze_to(-3) learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), wd=wd, moms=(0.8,0.7)) # In[61]: learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') # In[62]: learn_c.unfreeze() learn_c.fit_one_cycle(4, slice(lr/10/(2.6**4),lr/10), wd=wd, moms=(0.8,0.7)) # In[63]: learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') # In[64]: learn_c.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') learn_c.fit_one_cycle(4, slice(lr/100/(2.6**4),lr/100), wd=wd, moms=(0.8,0.7)) # In[65]: learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') # In[69]: learn_c.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') learn_c.fit_one_cycle(2, slice(lr/1000/(2.6**4),lr/1000), wd=wd, moms=(0.8,0.7)) # In[70]: learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2') # ### Confusion matrix # In[71]: get_ipython().run_cell_magic('time', '', "data_clas = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_v2', bs=bs, num_workers=1);\n\nconfig = awd_lstm_clas_config.copy()\nconfig['qrnn'] = True\nconfig['n_hid'] = 1550 #default 1152\nconfig['n_layers'] = 4 #default 3\n\nlearn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config)\n") # In[72]: learn_c.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2', 
# In[73]:

preds,y,losses = learn_c.get_preds(with_loss=True)
predictions = np.argmax(preds, axis = 1)
interp = ClassificationInterpretation(learn_c, preds, y, losses)
interp.plot_confusion_matrix()

# In[74]:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(np.array(y), np.array(predictions))
print(cm)

## acc
print(f'accuracy global: {(cm[0,0]+cm[1,1]+cm[2,2]+cm[3,3])/(cm.sum())}')

# accuracy per class
print(f'accuracy on class 0: {cm[0,0]/(cm.sum(1)[0])*100}')
print(f'accuracy on class 1: {cm[1,1]/(cm.sum(1)[1])*100}')
print(f'accuracy on class 2: {cm[2,2]/(cm.sum(1)[2])*100}')
print(f'accuracy on class 3: {cm[3,3]/(cm.sum(1)[3])*100}')

# In[75]:

learn_c.show_results()

# ### Predictions on some random sentences

# In[76]:

# Get the prediction
test_text = "A medida cautelar do TCU que determina a suspensão de licitação por falhas no edital não impede o órgão ou a entidade de rever seu ato convocatório, valendo-se do poder de autotutela (art. 49 da Lei 8.666/1993 c/c o art. 9º da Lei 10.520/2002) , com o objetivo de, antecipando-se a eventual deliberação do Tribunal, promover de modo próprio a anulação da licitação e o refazimento do edital, livre dos vícios apontados."
pred = learn_c.predict(test_text)
print(pred)

# In[77]:

# The darker the word-shading in the example below, the more it contributes to the classification.
txt_ci = TextClassificationInterpretation.from_learner(learn_c)
txt_ci.show_intrinsic_attention(test_text,cmap=plt.cm.Purples)

# In[78]:

txt_ci.intrinsic_attention(test_text)[1]

# In[79]:

# tabulation showing the first k texts in top_losses along with their prediction, actual, loss, and probability of the actual class.
# max_len is the maximum number of tokens displayed. If max_len=None, it will display all tokens.
txt_ci.show_top_losses(5)

# ## Fine-tuning "backward Classifier"

# In[80]:

import warnings
warnings.filterwarnings('ignore') # "error", "ignore", "always", "default", "module" or "once"

# In[81]:

bs = 18

# ### Databunch

# In[82]:

get_ipython().run_cell_magic('time', '', "data_lm = load_data(path, f'{lang}_databunch_lm_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', bs=bs, backwards=True)\n")

# In[83]:

get_ipython().run_cell_magic('time', '', 'data_clas = (TextList.from_df(df_trn_val, path, cols=reviews, processor=SPProcessor.load(dest), vocab=data_lm.vocab)\n .split_by_rand_pct(0.1, seed=42)\n .label_from_df(cols=label)\n .databunch(bs=bs, num_workers=1, backwards=True))\n')

# In[84]:

get_ipython().run_cell_magic('time', '', "data_clas.save(f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')\n")

# ### Get weights to down-weight the majority classes in the loss function

# In[85]:

get_ipython().run_cell_magic('time', '', "data_clas = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', bs=bs, num_workers=1, backwards=True)\n")

# In[86]:

num_trn = len(data_clas.train_ds.x)
num_val = len(data_clas.valid_ds.x)
num_trn, num_val, num_trn+num_val

# In[87]:

trn_LabelCounts = np.unique(data_clas.train_ds.y.items, return_counts=True)[1]
val_LabelCounts = np.unique(data_clas.valid_ds.y.items, return_counts=True)[1]
trn_LabelCounts, val_LabelCounts

# In[88]:

trn_weights = [1 - count/num_trn for count in trn_LabelCounts]
val_weights = [1 - count/num_val for count in val_LabelCounts]
trn_weights, val_weights

# ### Training (Loss = FlattenedLoss of weighted LabelSmoothingCrossEntropy)

# In[89]:

get_ipython().run_cell_magic('time', '', "data_clas = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', bs=bs, num_workers=1, backwards=True)\n")

# In[90]:

config = awd_lstm_clas_config.copy()
config['qrnn'] = True
config['n_hid'] = 1550 #default 1152
config['n_layers'] = 4 #default 3

# In[91]:

learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, drop_mult=0.3, metrics=[accuracy,f1]).to_fp16()
learn_c.load_encoder(f'{lang}fine_tuned_enc_tcu_jurisp_reduzido_sp15_multifit_bwd_v2');

# #### Change loss function

# In[92]:

learn_c.loss_func

# In[93]:

loss_weights = torch.FloatTensor(trn_weights).cuda()
learn_c.loss_func = FlattenedLoss(WeightedLabelSmoothingCrossEntropy, weight=loss_weights)

# In[94]:

learn_c.loss_func

# #### Training

# In[95]:

learn_c.freeze()

# In[96]:

learn_c.lr_find()

# In[97]:

learn_c.recorder.plot()

# In[98]:

lr = 2e-1
lr *= bs/48
wd = 0.1

# In[99]:

learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7))

# In[100]:

learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[101]:

learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7))

# In[102]:

learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[103]:

learn_c.freeze_to(-2)
learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), wd=wd, moms=(0.8,0.7))

# In[104]:

learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[105]:

learn_c.freeze_to(-3)
learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), wd=wd, moms=(0.8,0.7))

# In[106]:

learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[107]:

learn_c.unfreeze()
learn_c.fit_one_cycle(4, slice(lr/10/(2.6**4),lr/10), wd=wd, moms=(0.8,0.7))

# In[108]:

learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[109]:

learn_c.fit_one_cycle(4, slice(lr/100/(2.6**4),lr/100), wd=wd, moms=(0.8,0.7))

# In[110]:
learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[115]:

learn_c.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')
learn_c.fit_one_cycle(1, slice(lr/1000/(2.6**4),lr/1000), wd=wd, moms=(0.8,0.7))

# In[116]:

learn_c.fit_one_cycle(1, slice(lr/1000/(2.6**4),lr/1000), wd=wd, moms=(0.8,0.7))

# In[117]:

learn_c.save(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# In[118]:

learn_c.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2');
learn_c.to_fp32().export(f'{lang}_classifier_tcu_jurisp_reduzido_sp15_multifit_bwd_v2')

# ### Confusion matrix

# In[119]:

get_ipython().run_cell_magic('time', '', "data_clas = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', bs=bs, num_workers=1, backwards=True)\n\nconfig = awd_lstm_clas_config.copy()\nconfig['qrnn'] = True\nconfig['n_hid'] = 1550 #default 1152\nconfig['n_layers'] = 4 #default 3\n\nlearn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config)\n")

# In[120]:

learn_c.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', purge=False);

# In[121]:

preds,y,losses = learn_c.get_preds(with_loss=True)
predictions = np.argmax(preds, axis = 1)
interp = ClassificationInterpretation(learn_c, preds, y, losses)
interp.plot_confusion_matrix()

# In[122]:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(np.array(y), np.array(predictions))
print(cm)

## acc
print(f'accuracy global: {(cm[0,0]+cm[1,1]+cm[2,2]+cm[3,3])/(cm.sum())}')

# accuracy per class
print(f'accuracy on class 0: {cm[0,0]/(cm.sum(1)[0])*100}')
print(f'accuracy on class 1: {cm[1,1]/(cm.sum(1)[1])*100}')
print(f'accuracy on class 2: {cm[2,2]/(cm.sum(1)[2])*100}')
print(f'accuracy on class 3: {cm[3,3]/(cm.sum(1)[3])*100}')

# In[123]:

learn_c.show_results()

# ### Predictions on some random sentences

# In[124]:

# Get the prediction
test_text = "A medida cautelar do TCU que determina a suspensão de licitação por falhas no edital não impede o órgão ou a entidade de rever seu ato convocatório, valendo-se do poder de autotutela (art. 49 da Lei 8.666/1993 c/c o art. 9º da Lei 10.520/2002) , com o objetivo de, antecipando-se a eventual deliberação do Tribunal, promover de modo próprio a anulação da licitação e o refazimento do edital, livre dos vícios apontados."
pred = learn_c.predict(test_text)
print(pred)

# In[125]:

# The darker the word-shading in the example below, the more it contributes to the classification.
txt_ci = TextClassificationInterpretation.from_learner(learn_c)
txt_ci.show_intrinsic_attention(test_text,cmap=plt.cm.Purples)

# In[126]:

txt_ci.intrinsic_attention(test_text)[1]

# In[127]:

# tabulation showing the first k texts in top_losses along with their prediction, actual, loss, and probability of the actual class.
# max_len is the maximum number of tokens displayed. If max_len=None, it will display all tokens.
txt_ci.show_top_losses(5)

# ## Ensemble

# In[128]:

bs = 18

# In[129]:

config = awd_lstm_clas_config.copy()
config['qrnn'] = True
config['n_hid'] = 1550 #default 1152
config['n_layers'] = 4 #default 3

# In[130]:

data_clas = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_v2', bs=bs, num_workers=1)
learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, drop_mult=0.3, metrics=[accuracy,f1]).to_fp16()
learn_c.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_v2', purge=False);

# In[131]:

preds,targs = learn_c.get_preds(ordered=True)
accuracy(preds,targs),f1(preds,targs)

# In[132]:

data_clas_bwd = load_data(path, f'{lang}_textlist_class_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', bs=bs, num_workers=1, backwards=True)
learn_c_bwd = text_classifier_learner(data_clas_bwd, AWD_LSTM, config=config, drop_mult=0.3, metrics=[accuracy,f1]).to_fp16()
learn_c_bwd.load(f'{lang}clas_tcu_jurisp_reduzido_sp15_multifit_bwd_v2', purge=False);

# In[133]:

preds_b,targs_b = learn_c_bwd.get_preds(ordered=True)
accuracy(preds_b,targs_b),f1(preds_b,targs_b)

# In[134]:

# average the forward and backward class probabilities
preds_avg = (preds+preds_b)/2

# In[135]:

accuracy(preds_avg,targs_b),f1(preds_avg,targs_b)

# In[136]:

from sklearn.metrics import confusion_matrix

predictions = np.argmax(preds_avg, axis = 1)
cm = confusion_matrix(np.array(targs_b), np.array(predictions))
print(cm)

## acc
print(f'accuracy global: {(cm[0,0]+cm[1,1]+cm[2,2]+cm[3,3])/(cm.sum())}')

# accuracy per class
print(f'accuracy on class 0: {cm[0,0]/(cm.sum(1)[0])*100}')
print(f'accuracy on class 1: {cm[1,1]/(cm.sum(1)[1])*100}')
print(f'accuracy on class 2: {cm[2,2]/(cm.sum(1)[2])*100}')
print(f'accuracy on class 3: {cm[3,3]/(cm.sum(1)[3])*100}')

# In[ ]:
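# The per-class accuracy block above is repeated three times in this notebook (forward, backward and ensemble classifiers). A small helper such as the sketch below (not part of the original notebook) computes the same numbers from any confusion matrix: global accuracy is trace(cm)/cm.sum() and per-class accuracy is the diagonal divided by the row sums.

# helper equivalent to the repeated per-class accuracy prints above
import numpy as np

def report_accuracies(cm):
    cm = np.asarray(cm)
    print(f'accuracy global: {cm.trace()/cm.sum()*100:.2f}%')
    for c, (correct, total) in enumerate(zip(cm.diagonal(), cm.sum(axis=1))):
        print(f'accuracy on class {c}: {correct/total*100:.2f}%')

# report_accuracies(cm)  # e.g. with the ensemble confusion matrix computed above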