In this tutorial, we will cover: continuous optimization basics, gradient descent and its variants, stochastic gradient descent (SGD) in PyTorch, learning rate scheduling, and projected gradient descent with an adversarial-attack example.
Continuous optimization problems are fundamental in Computer Science.
They may be either unconstrained: $$ \min_x f(x) $$ $$ f: \mathbb{R}^d \rightarrow \mathbb{R} $$ or constrained: $$ \min_x f(x) \text{ subject to } x \in \mathcal{K} $$ $$ f: \mathbb{R}^d \rightarrow \mathbb{R} \text{, } \mathcal{K} \subseteq \mathbb{R}^d \text{ is closed and convex} $$
Many problems in CS can be written as continuous optimization problems:
In some cases, continuous optimization problems can be solved analytically:
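For example (a classical case): ordinary least squares, $\min_x \| Ax - b \|^2$ with $A$ of full column rank, has a closed-form solution obtained by setting the gradient to zero: $$ x^* = (A^T A)^{-1} A^T b $$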
Modern deep architectures include millions (sometimes billions) of parameters, the loss function is summed over the entire dataset (a memory burden), and the loss surface is often very noisy!
Therefore, efficient iterative optimization algorithms are required!
"GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models"
We will be mainly interested in functions $f: \mathbb{R}^d \rightarrow \mathbb{R}$.
The generalization of the derivative to the multivariate case is called the gradient, the vector of partial derivatives: $$ \nabla_x f = (\frac{\partial f}{\partial x_1},...,\frac{\partial f}{\partial x_d}) \in \mathbb{R}^d $$
The gradient gives us local information about the direction of steepest ascent: among all unit vectors $u$, the directional derivative $\nabla_x f(x)^T u$ is maximized by $u = \nabla_x f(x) / \| \nabla_x f(x) \|$ (by Cauchy-Schwarz).
If the gradient at some point $x \in \mathbb{R}^d$ is $\vec{0}$, then $x$ is called a stationary point.
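As a quick sanity check (a minimal sketch; the test function $f(x) = \|x\|^2$, whose gradient is $2x$, is our own choice), we can compare the analytic gradient with the one computed by PyTorch's autograd:
import torch
x = torch.randn(3, requires_grad=True)
f = (x ** 2).sum() # f(x) = ||x||^2
(grad,) = torch.autograd.grad(f, x) # gradient of f at x
print(torch.allclose(grad, 2 * x)) # True: matches the analytic gradient 2x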
The second derivative of a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is obtained by taking the partial derivatives of each entry of the gradient.
The resulting matrix is defined as the Hessian of $f$: $$ \nabla^2_x f = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_d \partial x_d} \\ \end{pmatrix} \in \mathbb{R}^{d \times d} $$
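PyTorch can compute the Hessian for us as well; a minimal sketch (the quadratic test function is our own choice), using the fact that $f(x) = x^T A x$ has Hessian $A + A^T$:
import torch
from torch.autograd.functional import hessian
A = torch.randn(3, 3)
f = lambda v: v @ A @ v # f(v) = v^T A v
H = hessian(f, torch.randn(3)) # d x d Hessian matrix
print(torch.allclose(H, A + A.T, atol=1e-5)) # True: equals A + A^T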
Gradient descent (GD) iteratively updates $$ x_{k+1} \leftarrow x_k - \eta \nabla_x f(x_k) $$ where $\eta > 0$, which determines the step size, is called the learning rate.
Using the first-order Taylor approximation around $x_k$: $$ f(x_k + \delta) = f(x_k) + \nabla_x f(x_k)^T \delta + o(\| \delta\|)$$ Substituting $\delta = - \eta \nabla_x f (x_k)$: $$ f(x_{k+1}) = f(x_k) - \eta \| \nabla_x f(x_k) \|^2 + o(\| \delta\|)$$ If $x_k$ is not a stationary point, then for a small enough $\eta > 0$, $f$ strictly decreases. This does not prove that GD converges to a local minimum, but it provides motivation. A convergence analysis of GD is given in: https://courses.cs.washington.edu/courses/cse546/15au/lectures/lecture09_optimization.pdf.
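To make this concrete, here is a minimal gradient-descent loop (a toy example of our own) on $f(x) = (x - 3)^2$, whose unique minimizer is $x^* = 3$:
import torch
x = torch.tensor(0.0, requires_grad=True)
eta = 0.1 # learning rate
for _ in range(100):
    f = (x - 3) ** 2
    (g,) = torch.autograd.grad(f, x)
    with torch.no_grad():
        x -= eta * g # x_{k+1} = x_k - eta * grad f(x_k)
print(x.item()) # ~3.0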
Heavy ball momentum
$$ v_{k+1} \leftarrow \gamma v_k + (1-\gamma) g_k $$ $$ x_{k+1} \leftarrow x_k - \eta v_{k+1} $$
AdaGrad
$$ G_{k+1} \leftarrow G_k + g_k \cdot g_k $$ $$ x_{k+1} \leftarrow x_k - \frac{\eta}{\sqrt{G_{k+1} + \epsilon}} \cdot g_k $$
RMSProp
$$ v_{k+1} \leftarrow \beta v_k + (1-\beta) g_k \cdot g_k $$ $$ x_{k+1} \leftarrow x_k - \frac{\eta}{\sqrt{v_{k+1} + \epsilon}} \cdot g_k $$
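For concreteness, here is a minimal sketch of these three update rules written as pure functions (the function names and hyperparameter defaults are ours, chosen for illustration):
import torch

def heavy_ball_step(x, v, g, eta=0.1, gamma=0.9):
    # v_{k+1} = gamma * v_k + (1 - gamma) * g_k ; x_{k+1} = x_k - eta * v_{k+1}
    v = gamma * v + (1 - gamma) * g
    return x - eta * v, v

def adagrad_step(x, G, g, eta=0.1, eps=1e-8):
    # G_{k+1} accumulates the elementwise squared gradients over all steps
    G = G + g * g
    return x - eta / torch.sqrt(G + eps) * g, G

def rmsprop_step(x, v, g, eta=0.01, beta=0.9, eps=1e-8):
    # v_{k+1} is an exponential moving average of the squared gradients
    v = beta * v + (1 - beta) * g * g
    return x - eta / torch.sqrt(v + eps) * g, v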
Let's demonstrate SGD by training a simple MLP architecture on hand-written digit recognition.
# Imports
import torch
from torchvision import datasets, transforms
import torch.nn as nn
import matplotlib.pyplot as plt
# Define an MLP architecture
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.in_dim = 784
self.hidden_dim = 120
self.out_dim = 10
        self.flatten = nn.Flatten() # (B,1,28,28) -> (B,784)
self.linear = nn.Linear(self.in_dim, self.hidden_dim)
self.activation = nn.ReLU()
self.classifier = nn.Linear(self.hidden_dim, self.out_dim)
def forward(self, x):
x = self.flatten(x)
x = self.linear(x)
x = self.activation(x)
x = self.classifier(x)
return x
model = Net() # Instantiate model
# Define the training dataset
transform = transforms.Compose([
transforms.ToTensor(), # Convert to tensor
    transforms.Normalize((0.1307,), (0.3081,)) # Subtract the MNIST mean 0.1307, then divide by the std 0.3081
])
dataset = datasets.MNIST('./data', train=True, download=True, transform=transform) # MNIST train set
# Define dataloader
batch_size = 64
loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) # Different order in each epoch
On each batch, optimization can be summarized as follows:
# Actual Training loop
num_epochs = 1
lr = 1e-1
loss_fn = nn.CrossEntropyLoss()
losses = [] # For plotting
model.train() # Training mode
for epoch in range(num_epochs):
for batch_idx, (x, y) in enumerate(loader):
# 1. Compute loss
logits = model(x)
loss = loss_fn(logits, y)
        # 2. "Magically" compute the gradient (automatic differentiation)
grad = torch.autograd.grad(loss, model.parameters())
# 3. Perform optimization step
for param, g in zip(model.parameters(), grad):
param.grad = g
param.data -= lr * param.grad
losses.append(loss.item())
Let's plot the loss over time!
plt.plot(losses)
[Plot: training loss per batch (batch size 64).]
Let's see what happens when we decrease the batch size!
# This time, let's try with a smaller batch size!
model = Net() # re-initialize net
# re-define dataloader
batch_size = 16
loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Actual Training loop
num_epochs = 1
lr = 1e-1
loss_fn = nn.CrossEntropyLoss()
losses = [] # For plotting
model.train() # Training mode
for epoch in range(num_epochs):
for batch_idx, (x, y) in enumerate(loader):
# 1. Compute loss
logits = model(x)
loss = loss_fn(logits, y)
        # 2. "Magically" compute the gradient (automatic differentiation)
grad = torch.autograd.grad(loss, model.parameters())
# 3. Perform optimization step
for param, g in zip(model.parameters(), grad):
param.grad = g
param.data -= lr * param.grad
losses.append(loss.item())
plt.plot(losses)
[Plot: training loss per batch (batch size 16), noticeably noisier.]
As we can observe, a smaller batch size yields a noisier optimization process. This is due to the higher variance of the stochastic gradient estimates!
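To quantify this, one can estimate the gradient variance across batches directly; the following is a rough sketch (the helper name grad_variance and the choice to inspect model.linear.weight are ours):
def grad_variance(batch_size, n_batches=20):
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    grads = []
    for (x, y), _ in zip(loader, range(n_batches)):
        loss = loss_fn(model(x), y)
        (g,) = torch.autograd.grad(loss, [model.linear.weight]) # first layer's gradient
        grads.append(g.flatten())
    # mean per-coordinate variance of the stochastic gradient across batches
    return torch.stack(grads).var(dim=0).mean().item()
print(grad_variance(16), grad_variance(64)) # expect the batch-16 estimate to be larger
In practice, we rarely write the update step by hand; PyTorch's torch.optim package provides ready-made optimizers: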
from torch.optim import SGD
model = Net() # re-initialize net
batch_size = 64
loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) # re-define dataloader
# define the optimizer
optimizer = SGD(model.parameters(), lr=lr)
model.train()
for epoch in range(num_epochs):
for batch_idx, (x, y) in enumerate(loader):
# compute loss
logits = model(x)
loss = loss_fn(logits, y)
# The three Musketeers!
optimizer.zero_grad() # sets p.grad = 0 for all params
loss.backward() # sets p.grad += dloss/dp
optimizer.step() # performs actual optimization step
It is often useful to decay the learning rate during training; PyTorch provides schedulers in torch.optim.lr_scheduler:
from torch.optim.lr_scheduler import MultiStepLR
model = Net()
num_epochs = 2
# define the optimizer and the scheduler
optimizer = SGD(model.parameters(), lr=lr)
scheduler = MultiStepLR(optimizer, milestones=[1], gamma=0.1) # multiply lr by gamma=0.1 after epoch 1
model.train()
for epoch in range(num_epochs):
for batch_idx, (x, y) in enumerate(loader):
# compute loss
logits = model(x)
loss = loss_fn(logits, y)
# The three Musketeers!
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (batch_idx + 1) % 300 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}] | Batch {batch_idx+1} | '
                  f'loss: {loss.item():.4f} | lr: {optimizer.param_groups[0]["lr"]:.4f}')
losses.append(loss.item())
# Inform the scheduler an epoch was done!
scheduler.step()
Epoch [1/2] | Batch 300 | loss: 0.4164 | lr: 0.1000
Epoch [1/2] | Batch 600 | loss: 0.1513 | lr: 0.1000
Epoch [1/2] | Batch 900 | loss: 0.1877 | lr: 0.1000
Epoch [2/2] | Batch 300 | loss: 0.1112 | lr: 0.0100
Epoch [2/2] | Batch 600 | loss: 0.1343 | lr: 0.0100
Epoch [2/2] | Batch 900 | loss: 0.1085 | lr: 0.0100
Additional learning rate scheduling strategies include exponential decay (ExponentialLR), cosine annealing (CosineAnnealingLR), and reducing the learning rate when a validation metric plateaus (ReduceLROnPlateau).
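For example, cosine annealing decays the learning rate smoothly along a half cosine; a minimal sketch (hyperparameters are illustrative):
from torch.optim.lr_scheduler import CosineAnnealingLR
optimizer = SGD(model.parameters(), lr=1e-1)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs) # lr goes from 1e-1 towards 0 over num_epochs scheduler steps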
For constrained problems, projected gradient descent (PGD) takes a gradient step and then projects the result back onto the feasible set $\mathcal{K}$: $$ x_{k+1} \leftarrow \Pi_{\mathcal{K}}(x_k - \eta g_k) $$ Common projections:
Projection onto the Euclidean ball of radius $R$ centered at the origin: $$ \Pi_{\mathcal{B}(R)}(x) = \min\{\frac{R}{\| x \|}, 1\} \cdot x $$
Projection onto a linear subspace $W \subseteq \mathbb{R}^d$: $$ \Pi_{W}(x) = \sum_{i=1}^m \langle x , \; w_i \rangle w_i $$ where $\{ w_1, ..., w_m\}$ is an orthonormal basis for $W$.
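These projections are straightforward to implement; a minimal sketch (function names are ours):
import torch

def project_ball(x, R):
    # Pi_{B(R)}(x): scale x down if it lies outside the ball of radius R
    return torch.clamp(R / torch.norm(x), max=1.0) * x

def project_subspace(x, W):
    # Pi_W(x) = sum_i <x, w_i> w_i, where the rows of W are an orthonormal basis
    return W.T @ (W @ x)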
# Define the test dataset
transform = transforms.Compose([
transforms.ToTensor(), # Convert to tensor
    transforms.Normalize((0.1307,), (0.3081,)) # Subtract the MNIST mean 0.1307, then divide by the std 0.3081
])
# MNIST test set
dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
sample_loader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False)
sample, true_y = next(iter(sample_loader))
sample = sample.detach()
# Visualize the sample
with torch.no_grad():
logit = model(sample)[0]
proba = torch.softmax(logit, dim=0)
pred = torch.argmax(proba)
fig = plt.figure()
plt.imshow(sample.reshape(28,28), cmap='gray', interpolation='none')
plt.title("Ground Truth: {}\nPrediction: {}, confidence: {:.2f}%". \
format(true_y.item(), pred, proba[pred]*100))
[Figure: the test sample. Ground Truth: 7; Prediction: 7, confidence: 99.89%.]
Now let's use projected gradient descent to craft an adversarial example: we maximize the loss over perturbations of the sample, projecting back onto a ball of radius $\epsilon$ around the original image after every step.
attacked_sample = sample.clone()
attacked_sample.requires_grad = True
# maximize loss instead of minimizing it!
adversarial_optimizer = SGD([attacked_sample], lr=1e-1) # we negate the loss below (equivalently, one could pass maximize=True)
eps = 7
n_iters = 10_000
loss_fn = nn.CrossEntropyLoss()
for iter_idx in range(n_iters):
logits = model(attacked_sample)
loss = -1. * loss_fn(logits, true_y)
# Gradient step
adversarial_optimizer.zero_grad()
loss.backward()
adversarial_optimizer.step()
# Projection!
delta = attacked_sample.data - sample.data
    delta *= min(1, eps/torch.norm(delta)) # scale delta down if its norm exceeds eps (the ball projection from above)
attacked_sample.data = sample + delta
if (iter_idx + 1) % 1000 == 0:
print(f'Iteration [{iter_idx+1}] | loss: {-1*loss.item():.4f}')
Iteration [1000] | loss: 0.0016
Iteration [2000] | loss: 0.0030
Iteration [3000] | loss: 0.3741
Iteration [4000] | loss: 7.4901
Iteration [5000] | loss: 7.4889
Iteration [6000] | loss: 7.4892
Iteration [7000] | loss: 7.4885
Iteration [8000] | loss: 7.4891
Iteration [9000] | loss: 7.4895
Iteration [10000] | loss: 7.4891
# Visualize the attacked sample
with torch.no_grad():
logit = model(attacked_sample)[0]
proba = torch.softmax(logit, dim=0)
pred = torch.argmax(proba)
fig = plt.figure()
plt.imshow(attacked_sample.detach().numpy().reshape(28,28), cmap='gray', interpolation='none')
plt.title("Ground Truth: {}\nPrediction: {}, confidence: {:.2f}%" \
.format(true_y.item(), pred, proba[pred]*100))
[Figure: the attacked sample. Ground Truth: 7; Prediction: 3, confidence: 99.81%.]
The projected attack succeeds: the model's prediction flips from 7 to 3, with high confidence.