Sascha Spors, Professorship Signal Theory and Digital Signal Processing, Institute of Communications Engineering (INT), Faculty of Computer Science and Electrical Engineering (IEF), University of Rostock, Germany
Winter Semester 2023/24 (Master Course #24512)
Feel free to contact lecturer frank.schultz@uni-rostock.de
Suppose, that the (made up) loss function of a model with two parameters $\beta_1$ and $\beta_2$ is analytically given as
$$\mathcal{L}(\beta_1, \beta_2) = (\beta_1 - 2)^2 + (\beta_2 - 1)^4 - (\beta_2 -1)^2$$In this toy example there is no data dependency involved, which is not how things work in practice, but it is good to understand the essence of finding a minimum numerically.
In order to find potential minima, and thereby the optimum model parameters $\hat{\beta_1}$ and $\hat{\beta_2}$, we need to solve gradient for zero
$$\nabla \mathcal{L} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial \beta_1}\\ \frac{\partial \mathcal{L}}{\partial \beta_2} \end{bmatrix}= \mathbf{0}$$The required partial derivatives of first order are
$$\frac{\partial \mathcal{L}}{\partial \beta_1} = 2 (\beta_1 - 2)^1$$$$\frac{\partial \mathcal{L}}{\partial \beta_2} = 4 (\beta_2 - 1)^3 - 2(\beta_2 -1)^1$$A check with the Hessian of $\mathcal{L}(\beta_1, \beta_2)$ yields whether we deal with a minimum, maximum, saddle point or neither of them for each of the zero gradient conditions.
We get first minimum at $$\beta_{1,min1} = 2\qquad \beta_{2,min1} = 1+\frac{1}{\sqrt{2}}$$
We get second minimum at $$\beta_{1,min2} = 2\qquad \beta_{2,min2} = 1-\frac{1}{\sqrt{2}}$$
Both minima yield the same function value $$\mathcal{L}(\beta_{1,min}, \beta_{2,min}) = -\frac{1}{4},$$ so we deal actually with two optimum models, as there is no global minimum with only one lowest function value.
We have one saddle point, that actually separates the two minima at $$\beta_{1,saddle} = 2\qquad \beta_{2,saddle} = 1$$
with function value $$\mathcal{L}(\beta_{1,saddle}, \beta_{2,saddle}) = 0$$
We could (and in ML problems we often need) to find the minima numerically. The most straightforward and simple numerical solver is the so called gradient descent (GD), a first order method.
It uses the (analytically known) gradient $\nabla\mathcal{L}$, evaluates it for an actual $\beta_{actual}$ and updates subsequently into direction of negative gradient, i.e. the (or rather a?!?) minimum that we want to find.
This iterative procedure can be written as $$(1):\quad \beta_{new} = \beta_{actual} - \mathrm{step size} \cdot \nabla\mathcal{L}\bigg|_{\beta_{actual}}\quad(2):\quad\beta_{new} \rightarrow \beta_{actual}\quad(3): \mathrm{go to}\,(1)$$ repeated until we hopefully converged to the $\beta$ that represents the minimum.
In practice GD is not often used, as it is not very robust and many things can go wrong. Let us check this with some illustrative examples below.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
matplotlib_widget_flag = False
# analytical solutions from above
minimum1, fminimum1 = np.array([2, 1 + 1 / np.sqrt(2)]), -1 / 4
minimum2, fminimum2 = np.array([2, 1 - 1 / np.sqrt(2)]), -1 / 4
saddle, fsaddle = np.array([2, 1]), 0
def get_gradient(beta):
beta_gradient = np.array(
[2 * (beta[0] - 2), 4 * (beta[1] - 1) ** 3 - 2 * (beta[1] - 1)]
)
return beta_gradient
def f(x, y):
# our loss function from above
# only with nicer to read variables x,y
# instead of beta1, beta2
return (x - 2) ** 2 + (y - 1) ** 4 - (y - 1) ** 2
x, y = np.linspace(0, 4, 2**7), np.linspace(-1 / 2, 5 / 2, 2**7)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
col_min, col_max, no_col = -1 / 2, 5, 12
col_tick = np.linspace(col_min, col_max, no_col, endpoint=True)
cmap = plt.cm.magma_r
norm = mpl.colors.BoundaryNorm(col_tick, cmap.N)
if matplotlib_widget_flag:
%matplotlib widget
fig = plt.figure()
ax = plt.axes(projection="3d")
c = ax.plot_surface(
X, Y, Z, rstride=1, cstride=1, cmap=cmap, norm=norm, edgecolor="none"
)
ax.plot(minimum1[0], minimum1[1], fminimum1, "C0o")
ax.plot(minimum2[0], minimum2[1], fminimum2, "C0o")
ax.plot(saddle[0], saddle[1], fsaddle, "C0o")
cbar = fig.colorbar(
c, ax=ax, ticks=col_tick[:: no_col // 10], label=r"$\mathcal{L}$"
)
ax.set_xlim(x[0], x[-1])
ax.set_ylim(y[0], y[-1])
ax.set_zlim(-1 / 4, 5)
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.set_zlabel(r"$\mathcal{L}$")
ax.view_init(elev=60, azim=-40)
def my_gradient_descent(beta, beta_gradient, step_size):
# simple update rule for GD
beta -= beta_gradient * step_size
return beta
def plot_contour_of_my_gradient_descent():
fig, ax = plt.subplots()
ax.contour(X, Y, Z, cmap="magma_r")
ax.plot(beta_path[0], beta_path[1], "C0.-", lw=0.3)
ax.plot(minimum1[0], minimum1[1], "kx", ms=10)
ax.plot(minimum2[0], minimum2[1], "kx", ms=10)
ax.plot(saddle[0], saddle[1], "kx", ms=10)
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.set_title(
r"last calculated beta1 = {f1:5.4f}, beta2 = {f2:5.4f}".format(
f1=beta_path[0, -1], f2=beta_path[1, -1]
)
)
ax.axis("equal")
ax.grid("True")
Let us start with an example giving a nice result.
The parameters that are inherent to the straightforward GD algorithm
work nicely out to find the minimum 1 at $$\beta_{1,min1} = 2\qquad \beta_{2,min1} = 1+\frac{1}{\sqrt{2}}$$
steps = 2**5
beta = np.array([4, 1.075])
step_size = 1 / 5
beta_path = np.zeros((2, steps + 1))
beta_path[:, 0] = beta # store chosen init beta
for step in range(steps):
# this is the GD
beta_gradient = get_gradient(beta)
beta = my_gradient_descent(beta, beta_gradient, step_size)
# store our GD path
beta_path[:, step + 1] = beta
# and plot
plot_contour_of_my_gradient_descent()
The chosen number of steps, the chosen step size and the chosen starting point work out to approach the minimum 2 at
$$\beta_{1,min2} = 2\qquad \beta_{2,min2} = 1-\frac{1}{\sqrt{2}},$$which we however not reached yet. Increasing the number of steps and/or slightly decreasing the step size should safely bring us to the minimum 2.
steps = 2**6
beta = np.array([0.0, -0.5])
step_size = 1 / 50
print(step_size)
beta_path = np.zeros((2, steps + 1))
beta_path[:, 0] = beta # store chosen init beta
for step in range(steps):
# this is the GD
beta_gradient = get_gradient(beta)
beta = my_gradient_descent(beta, beta_gradient, step_size)
# store our GD path
beta_path[:, step + 1] = beta
# and plot
plot_contour_of_my_gradient_descent()
This GD path is special: We start very close to the saddle point ridge, i.e. we start with a chosen starting value $\beta_2=0.9999$, which is very close to $\beta_{2,saddle} = 1$. The GD then initially moves along the saddle point ridge, but fortunately drops of to find its way to minimum 2. The chosen values for number of steps and step size are much more critical here compared to above examples.
What happens if we choose beta = np.array([4., 1.0001])
? Make first an expectation, then try...
steps = 2**6
beta = np.array([4.0, 0.9999])
step_size = 1 / 10
beta_path = np.zeros((2, steps + 1))
beta_path[:, 0] = beta # store chosen init beta
for step in range(steps):
# this is the GD
beta_gradient = get_gradient(beta)
beta = my_gradient_descent(beta, beta_gradient, step_size)
# store our GD path
beta_path[:, step + 1] = beta
# and plot
plot_contour_of_my_gradient_descent()
In the examples above fortunately the GDs approached or even finally found a minimum, which is what we desire when learning optimum model parameters.
Things could go wrong, if we set up things badly. We should realize, that ML libraries such as TensorFlow are nowadays very sophisticated, mature, highly double-checked and debugged and it is thus more likely that things go wrong because of us using these ML algorithms. So, human intelligence and know how is still needed to build the best models.
One prominent and illustrative example for bad things that happen, is given below. Here, we never reach one of the two minima, but rather we die on the saddle point. This is because the GD can by its gradient concept never drop of the ridge of the saddle point. Here in this toy example, we know that and why this happens. But what about practical, very complicated loss functions?!?! Dying on a saddle point is probably not a nicely learned model.
steps = 2**6
beta = np.array([4.0, 1.0])
step_size = 1 / 10
beta_path = np.zeros((2, steps + 1))
beta_path[:, 0] = beta # store chosen init beta
for step in range(steps):
# this is the GD
beta_gradient = get_gradient(beta)
beta = my_gradient_descent(beta, beta_gradient, step_size)
# store our GD path
beta_path[:, step + 1] = beta
# and plot
plot_contour_of_my_gradient_descent()
Another bad thing occurs in this example. The chosen large step size of $1$ and the chosen $\beta_2=1$ on the saddle point ridge yield a GD that oscillates between $$[\beta_1, \beta_2]^\mathrm{T} = [4, 1]^\mathrm{T}$$ and $$[\beta_1, \beta_2]^\mathrm{T} = [0, 1]^\mathrm{T}$$
We shortly should check the analytical equations, these numbers are not by accident. The gradient is
$$\frac{\partial \mathcal{L}}{\partial \beta_1} = 2 (\beta_1 - 2)^1$$$$\frac{\partial \mathcal{L}}{\partial \beta_2} = 4 (\beta_2 - 1)^3 - 2(\beta_2 -1)^1$$We start with initial $$\beta_{actual} = [\beta_1, \beta_2]^\mathrm{T} = [4, 1]^\mathrm{T},$$ hence $$\frac{\partial \mathcal{L}}{\partial \beta_1}\bigg|_{\beta_{actual}} = 2 (4 - 2)^1 = 4$$ $$\frac{\partial \mathcal{L}}{\partial \beta_2}\bigg|_{\beta_{actual}} = 4 (1 - 1)^3 - 2(1 -1)^1 = 0$$ The update rule in GD is $$\beta_{new} = \beta_{actual} - \mathrm{stepsize} \cdot \nabla\mathcal{L}\bigg|_{\beta_{actual}},$$ with inserted numbers $$\beta_{new} = \begin{bmatrix}4\\1 \end{bmatrix}
= \begin{bmatrix}0\\1 \end{bmatrix} $$ GD moves to the next step, so the model parameter vector which was newly calculated becomes the actual vector $\beta_{new} \rightarrow \beta_{actual}$.
So, next we deal with $$\beta_{actual} = [\beta_1, \beta_2]^\mathrm{T} = [0, 1]^\mathrm{T},$$ hence $$\frac{\partial \mathcal{L}}{\partial \beta_1}\bigg|_{\beta_{actual}} = 2 (0 - 2)^1 = -4$$ $$\frac{\partial \mathcal{L}}{\partial \beta_2}\bigg|_{\beta_{actual}} = 4 (1 - 1)^3 - 2(1 -1)^1 = 0$$ The update rule in GD is $$\beta_{new} = \beta_{actual} - \mathrm{stepsize} \cdot \nabla\mathcal{L}\bigg|_{\beta_{actual}},$$ with inserted numbers $$\beta_{new} = \begin{bmatrix}0\\1 \end{bmatrix}
= \begin{bmatrix}4\\1 \end{bmatrix} $$ In the next step, we again get $$\beta_{new} = \begin{bmatrix}0\\1 \end{bmatrix} $$, and so on...we see the oscillation. In this toy example the oscillation is very obvious and easy to check by the simple equations. In practice, oscillations could be more complex (more than two states) and hard to trace. We should be aware of such phenomenon as a minimum is never reached.
steps = 2**3
beta = np.array([4.0, 1])
step_size = 1
beta_path = np.zeros((2, steps + 1))
beta_path[:, 0] = beta # store chosen init beta
for step in range(steps):
# this is the GD
beta_gradient = get_gradient(beta)
beta = my_gradient_descent(beta, beta_gradient, step_size)
# store our GD path
beta_path[:, step + 1] = beta
# and plot
plot_contour_of_my_gradient_descent()
beta_path
We have seen a stable GD oscillation in the example directly above. Things could get even worse. In this example we deal with a dangerous phenomenon often referred to as exploding gradient descent in textbooks. Instead of approaching one minimum the GD starts to oscillate and with successively climbing up out of the valley. In the example here we have chosen the numbers, such that double precision still can handle the numbers. Very often such exploding gradients end in Nan, Inf for the gradients and the model parameters. Obviously, that's not useful at all.
We should be aware, that changing step_size = 0.29
to step_size = 0.28
the GD does not explode, but finds its way to minimum 1. So, a tiny change in one number rules over success vs. failure.
This indicates the importance of being cautious against chosen parameters and actually understanding what and why algorithms do the things they do.
steps = 2**3
beta = np.array([4.0, 2.5])
if True:
step_size = 0.29 # GD explodes
flag = True
else:
step_size = 0.28 # GD approaches minimum1
flag = False
beta_path = np.zeros((2, steps + 1))
beta_path[:, 0] = beta # store chosen init beta
for step in range(steps):
# this is the GD
beta_gradient = get_gradient(beta)
print("step", step, "gradient", beta_gradient)
beta = my_gradient_descent(beta, beta_gradient, step_size)
# store our GD path
beta_path[:, step + 1] = beta
# and plot
plot_contour_of_my_gradient_descent()
if flag:
plt.title("") # get rid of the super long title for larbe beta2
print("beta vector along the GD")
beta_path
In this example, we deal with a more complex oscillation which however is stable, i.e. the GD does not explode. It is nevertheless a meaningless solution to our problem.
We should take some time to figure out how to change the parameters steps
and step_size
Which minimum is found in the latter case? Why must this always be the case?
steps = 2**7 # default: 2**7
beta = np.array([4.0, 1.25])
step_size = 1 # default: 1
beta_path = np.zeros((2, steps + 1))
beta_path[:, 0] = beta # store chosen init beta
for step in range(steps):
# this is the GD
beta_gradient = get_gradient(beta)
beta = my_gradient_descent(beta, beta_gradient, step_size)
# store our GD path
beta_path[:, step + 1] = beta
# and plot
plot_contour_of_my_gradient_descent()