#!/usr/bin/env python
# coding: utf-8

# # Tuning Learning Rates with *ktrain*
# 
# Neural networks have many hyperparameters that need to be set before training begins. While, in practice, many hyperparameters have fairly reasonable defaults (e.g., ReLu activation, Xavier initialization, a kernel size of 3 in Convolutional Neural Networks), some do not and should be tuned. One of these is the learning rate, which governs the degree to which weights are adjusted during training.  Even after arriving at a good initial learning rate, it has been shown that varying the learning rate during training is effective in helping to minimize loss and improve generalization. *ktrain* provides a number of built-in methods to make it easy to tune and adjust learning rates to more effectively minimize loss during training.
# 
# To demonstrate these capabilities, we will begin by loading some text data into NumPy arrays and defining a simple text classification model.

# In[1]:


get_ipython().run_line_magic('reload_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
get_ipython().run_line_magic('matplotlib', 'inline')
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"; 
import ktrain


# In[2]:


# load  and prepare data as you normally would in Keras
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import imdb
NUM_WORDS = 20000
MAXLEN = 400
def load_data():
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=NUM_WORDS)
    x_train = sequence.pad_sequences(x_train, maxlen=MAXLEN)
    x_test = sequence.pad_sequences(x_test, maxlen=MAXLEN)
    return (x_train, y_train), (x_test, y_test)
(x_train, y_train), (x_test, y_test) = load_data()


# In[3]:


# build a model as you normally would in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
def get_model():
    model = Sequential()
    model.add(Embedding(NUM_WORDS, 50, input_length=MAXLEN))
    model.add(GlobalAveragePooling1D())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
model = get_model()


# To use *ktrain*, one simply wraps the model and the data in a Learner object using the ```get_learner``` function.  This Learner object will be used to help tune and train our network.

# In[4]:


learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data = (x_test, y_test))


# The wrapped model and data are both directly accessible.  For instance, the model can be saved and loaded like normal in Keras (e.g,. ```learner.model.save('my_model.h5')```).  

# ### A Learning Rate Finder
# 
# The Learner object can be used to find the best learning rate for your model.  First, we use ```lr_find``` to track the loss as the learning rate is increased and then use ```lr_plot``` to identify the maximal learning rate associated with a falling loss.

# In[5]:


learner.lr_find()


# In[6]:


learner.lr_plot()


# We would like the maximal learning rate associated with a still-falling loss (prior the loss diverging).  Based on the plot, we will start with a learning rate of 0.005.

# ## Interactive Training
# 
# It is sometimes advantageous to train interactively.  For instance, one can train a model for one or two epochs using one learning rate.  Then, based on the results, a higher or lower learning rate can be used for subsequent epochs.  *ktrain* makes such interactive training easy.  Here, using the fit method of the Learner object, we train a single epoch at the learning rate found previously and a second epoch at a slightly lower learning rate.  The first argument is the learning rate and the second argument is the number of epochs.

# In[7]:


# reinitialize the model to train from scratch 
learner.set_model(get_model())

hist = learner.fit(0.005, 1)
hist = learner.fit(0.0005, 1)


# ## Using Learning Rate Schedules
# 
# In the example above, a static learning rate is used throughout each epoch.  It is sometimes beneficial to employ the use of learning rate schedules to *automatically* adjust the learning rate during the course training to more effectively minimize loss. Such adjustments can help jump out of suboptimal areas in the loss landscape and get to "sweet spots" with minimal loss that generalize well.  *ktrain* allows you to easily employ a variety of demonstrably effective learning rate policies during training. These include:
# 
# * a [triangular learning rate policy](https://arxiv.org/abs/1506.01186) available via the ```autofit``` method
# * a [1cycle policy](https://arxiv.org/abs/1803.09820) available via the ```fit_onecycle``` method
# * an [SGDR](https://arxiv.org/abs/1608.03983) (Stochastic Gradient Descent with Restart) schedule available using the ```fit``` method by supplying a *cycle_len* argument.
# 
# 
# ### SGDR
# We will begin by covering SGDR. *ktrain* allows you to easily employ an SGDR learning rate policy in a similar style to that of the *fastai* library. We will begin with covering the cycle_len parameter.
# 
# **cycle_len:** When *cycle_len* is not None, the second argument fo ```fit``` is interpreted as the number of cycles instead of the number of epochs.  For instance, the following call runs 2 cycles each of length 2 epochs - totaling 4 (or 2 * 2) epochs.  The learning rate gradually decreases throughout the 2-epoch cycle and then restarts at 5e-3 at the start of a new 2-epoch cycle.  Decreases follow a functional form (cosine annealing). More information can be found in the original [SGDR paper](https://arxiv.org/abs/1608.03983).

# In[8]:


# reinitialize the model to train from scratch 
learner.set_model(get_model())

# training using cycle_len 
learner.fit(5e-3, 2, cycle_len=2)


# The ```learner.plot``` method can be used to plot the training-validation loss (with 'loss' as argument) in addition to plotting the learning rate schedule with ('lr' as argument) and momentum schedule (with 'momentum' as argument) where applicable. Here, we plot the learning rate schedule employed by the previous call to ```learner.fit```.

# In[9]:


learner.plot('lr')


# **cycle_mult:** The cycle_mult parameter allows you to increase the cycle length as training progresses.  For instance, cycle_mult=2 will double the length of the cycle.  In the example below, seven epochs are run:
# - first cycle has length of one epoch
# - second cycle has length two epochs
# - third cycle has length of four epochs
# Each cycle will begin at a learning rate of 5e-3 and gradually decrease until it resets at the beginning of the next cycle.
# 
# Note that the example below overfits. It is shown to merely illustrate the *cycle_mult* parameter.

# In[10]:


# rebuild the model to train from scratch 
learner.set_model(get_model())

# training using cycle_len 
learner.fit(5e-3, 3, cycle_len=1, cycle_mult=2)


# Here is what the learning rate schedule looks like when using the **cycle_mult** parameter.

# In[11]:


learner.plot('lr')


# ## Triangular Learning Rate Policy via ```autofit```
# 
# The ```autofit``` method in *ktrain* employs a default cyclical learning rate schedule that tends to work well in practice.  The default learning rate schedule in ```autofit``` is currently the [triangular learning rate policy](https://arxiv.org/abs/1506.01186), which some slight modifications.
# 
# The ```autofit``` method accepts two primary arguments. The first (required) is the learning rate (**lr**) to be used, which can be found using the learning rate finder above. The second is optional and indicates the number of epochs (**epochs**) to train.  If **epochs** is not supplied as a second argument, then ```autofit``` will train until the validation loss no longer improves after a certain period. This period can be configured using the **early_stopping** argument.  At the end of training, the weights producing the lowest validation loss are automatically loaded into the model, when **early_stopping** is enabled.  The ```autofit``` method can also automatically reduce the maximum (and base) learning rates in the triangular policy when validation loss no longer improves. This can be configured using the **reduce_on_plateau** and **reduce_factor** arguments to ```autofit```.  
# 
# Example:
# ```
# learner.autofit(0.001, 20, reduce_on_plateau=2, reduce_factor=10)
# ```
# 
# The above will reduce the maximum and base learning rates in the triangular policy by a factor of 10 after two consecutive epochs of no improvement in validation loss.  Validation loss (i.e., val_loss) is the default criterion for both **early_stopping** and **reduce_on_plateau**.  To use validation accuracy instead, use invoke ```autofit``` with ```monitor='val_acc'```.
# 
# Here, we will use the ```autofit``` method and run the main training phase for two epochs.  

# In[23]:


# rebuild the model to train from scratch 
learner.set_model(get_model())

# training using autofit
learner.autofit(0.005, 2)


# The ```autofit``` method runs a triangular learning rate schedule with two modifications.  First, it annihilates the learning rate at the end of each cycle:

# In[24]:


learner.plot('lr')


# Second, if using the Adam, Nadam, or Adamax optimizers, it cycles the momentum between 0.85 and 0.95 in such a way that higher learning rates have lower momentum and lower learning rates have higher momentum, as suggested in [this paper](https://arxiv.org/pdf/1803.09820.pdf).

# In[25]:


learner.plot('momentum')


# ### Additional Cooldowns
# Since we are not [overfitting](https://en.wikipedia.org/wiki/Overfitting#Machine_learning) yet (i.e., validation loss is not increasing while training loss decreases), let's do a few more "cooldowns" starting at a smaller learning rate to improve the accuracy score further using the regular ```fit``` method that employs SGDR. These "cooldown" epochs will start the learning rate at 0.005/10 and gradually decrease it to a very small value.  We will use the **checkpoint_folder** argument covered earlier, so that we can restore the weights from any epoch in case we train too much and overfit.  If you are not using Linux, you should set this to your folder path of choice.

# In[26]:


learner.fit(0.005, 1, cycle_len=1, checkpoint_folder='/tmp')


# In[27]:


learner.fit(0.005/10, 1, cycle_len=1, checkpoint_folder='/tmp')


# Note that we are running multiple short cooldown phases here - cycles of only one epoch. This essentially amounts to SGDR.  Although we are not doing it here, we can also run one longer cooldown by simply calling ```fit``` with a larger value for ```cycle_len``` and leaving the number of cycles at 1. 
# 

# ## The 1cycle Policy
# 
# The [1cycle policy](https://arxiv.org/pdf/1803.09820.pdf) was proposed by Leslie Smith (as was the triangular learning rate policy).  The 1cycle policy runs a single triangular cycle over the course of training and then annihilates the learning rate to a near-zero value towards the end.

# In[28]:


# rebuild the model to train from scratch 
learner.set_model(get_model())

# training using the 1cycle policy
learner.fit_onecycle(0.005, 3)


# The 1cycle policy runs a single triangular cycle over all the epochs and also cycles the momentum in the shape of a *V*.

# In[29]:


learner.plot('lr')


# In[30]:


learner.plot('momentum')


# The final accuracy here is **~89%** using only unigram feaures and a simple model. In the [text classification notebook](https://github.com/amaiya/ktrain/blob/master/tutorials/tutorial-04-text-classification.ipynb), we show that an accuracy of **~92.3%** can be acheived on this dataset in mere seconds using built-in convenience methods in *ktrain*.

# In[ ]: