Optimization refers to the task of minimizing/maximizing an objective function $f(x)$ parameterized by $x$. In machine/deep learning terminology, it is the task of minimizing the cost/loss function $J(w)$ parameterized by the model's parameters $w \in \mathbb{R}^d$. Optimization algorithms (in the case of minimization) have one of the following goals:
There are three kinds of optimization algorithms:
Gradient Descent is the most common optimization algorithm in machine learning and deep learning. It is a first-order optimization algorithm, which means it only takes into account the first derivative when performing the updates on the parameters. On each iteration, we update the parameters in the opposite direction of the gradient of the objective function $J(w)$ with respect to the parameters, since the gradient gives the direction of steepest ascent. The size of the step we take on each iteration to reach the local minimum is determined by the learning rate $\alpha$. Therefore, we follow the direction of the slope downhill until we reach a local minimum.
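Written out, a single gradient descent step updates the parameters as $$w = w - \alpha \nabla_w J(w)$$ where $\nabla_w J(w)$ is the gradient of the cost with respect to the parameters. The variants covered below differ only in how much of the training data is used to compute this gradient.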
In this notebook, we'll cover the gradient descent algorithm and its variants: Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent.
Let's first see how gradient descent and its associated steps work on logistic regression before going into the details of its variants. For the sake of simplicity, let's assume that the logistic regression model has only two parameters: a weight $w$ and a bias $b$.
Initialize weight $w$ and bias $b$ to any random numbers.
Pick a value for the learning rate $\alpha$. The learning rate determines how big the step would be on each iteration.
Therefore, plot the cost function against different values of $\alpha$ and pick the value of $\alpha$ that is right before the first value that didn't converge, so that we get a fast learning algorithm that still converges (see figure 2). A minimal sketch of these steps follows.
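Below is a minimal NumPy sketch of the steps above for a one-feature logistic regression model. The toy data x and y, the helper sigmoid, and the chosen values of alpha and the number of iterations are illustrative assumptions, not part of the original model:

import numpy as np

def sigmoid(z):
    # The logistic function maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data (assumed for this sketch)
x = np.array([0.5, 1.5, 2.0, 3.0, 3.5, 4.5])
y = np.array([0, 0, 0, 1, 1, 1])

# Step 1: initialize weight w and bias b to random numbers
rng = np.random.default_rng(0)
w, b = rng.normal(), rng.normal()

# Step 2: pick a learning rate
alpha = 0.1

# Step 3: repeatedly update w and b in the opposite direction of the gradient
for _ in range(1000):
    y_hat = sigmoid(w * x + b)        # predicted probabilities
    dw = np.mean((y_hat - y) * x)     # dJ/dw for the cross-entropy cost
    db = np.mean(y_hat - y)           # dJ/db
    w = w - alpha * dw
    b = b - alpha * db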
Now let's discuss the three variants of the gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step. The trade-off between them is the accuracy of the gradient versus the time complexity of each parameter update (learning step).
Batch Gradient Descent sums over all examples on each iteration when performing a parameter update. Therefore, for each update, we have to sum over all examples: $$w = w - \alpha \nabla_w J(w)\tag{6}$$
for i in range(num_epochs):
    # One update per epoch: the gradient is computed over the full training set
    grad = compute_gradient(data, params)
    params = params - learning_rate * grad
The main advantages:
The main disadvantages:
Instead of going over all examples, Mini-batch Gradient Descent sums over a smaller number of examples determined by the batch size. Therefore, learning happens on each mini-batch of $b$ examples:
$$w = w - \alpha \nabla_w J(x^{\{i:i + b\}}, y^{\{i: i + b\}}; w)\tag{7}$$

for i in range(num_epochs):
    # Shuffle so that each epoch sees the mini-batches in a different order
    np.random.shuffle(data)
    for batch in random_minibatches(data, batch_size=32):
        grad = compute_gradient(batch, params)
        params = params - learning_rate * grad
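The helper random_minibatches is not defined above; one possible implementation, assuming data is a NumPy array whose first axis indexes the training examples, is sketched below:

def random_minibatches(data, batch_size=32):
    # Yield consecutive slices of the (already shuffled) data, one mini-batch at a time.
    # The last mini-batch may be smaller if batch_size does not divide the number of examples.
    m = data.shape[0]
    for start in range(0, m, batch_size):
        yield data[start:start + batch_size]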
The batch size is something we can tune. It is usually chosen as a power of 2, such as 32, 64, 128, 256, or 512. The reason is that some hardware, such as GPUs, achieves better run time with batch sizes that are powers of 2.
The main advantages:
The main disadvantages:
With large training datasets, we don't usually need more than 2-10 passes over all training examples (epochs). Note: with batch size $b = m$, we get Batch Gradient Descent.
Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the parameter update on each individual example $(x^i, y^i)$. Therefore, learning happens on every example:
$$w = w - \alpha \nabla_w J(x^i, y^i; w)\tag{8}$$

for i in range(num_epochs):
    # Shuffle so that examples are visited in a different order each epoch
    np.random.shuffle(data)
    for example in data:
        grad = compute_gradient(example, params)
        params = params - learning_rate * grad
It shares most of its advantages and disadvantages with the mini-batch version. Below are the ones that are specific to SGD:
Below is a graph that shows the gradient descent variants and their directions toward the minimum:
Below are some challenges regarding the gradient descent algorithm in general, as well as its variants, mainly batch and mini-batch:
Gradient descent is a first-order optimization algorithm, which means it doesn't take the second derivatives of the cost function into account. However, the curvature of the function affects the size of each learning step. The gradient measures the steepness of the curve, while the second derivative measures its curvature. Therefore, if:
As a result, the direction that looks promising to the gradient may not be, and following it may slow the learning process or even cause it to diverge.
If the Hessian matrix has a poor condition number, i.e. the direction of greatest curvature has much more curvature than the direction of least curvature, then the cost function will be very sensitive in some directions and insensitive in others. This makes things harder for the gradient, because a direction that looks promising to the gradient may not lead to big changes in the cost function (see figure 7).
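As a concrete illustration (not from the original notebook), consider the quadratic cost $$J(w) = \frac{1}{2}\left(w_1^2 + 100\,w_2^2\right), \qquad \nabla^2 J(w) = \begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix},$$ whose Hessian has condition number $100$. A gradient descent step multiplies $w_1$ by $(1 - \alpha)$ and $w_2$ by $(1 - 100\alpha)$, so $\alpha$ must be smaller than $2/100$ to avoid diverging along the steep $w_2$ direction, which in turn makes progress along the flat $w_1$ direction very slow.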