%matplotlib inline
from fastai.basics import *
In this part of the lecture we will explain Stochastic Gradient Descent (SGD), which is an optimization method commonly used in neural networks. We will illustrate the concepts with concrete examples.
The goal of linear regression is to fit a line to a set of points.
n = 100
x = torch.ones(n,2)       # second column stays at 1. to act as the intercept/bias term
x[:,0].uniform_(-1.,1)    # fill the first column with uniform values in [-1, 1)
x[:5]
tensor([[-0.1957,  1.0000],
        [ 0.1826,  1.0000],
        [-0.1008,  1.0000],
        [-0.1449,  1.0000],
        [ 0.7091,  1.0000]])
a = tensor(3.,2); a   # the "true" parameters used to generate the data
tensor([3., 2.])
y = x@a + torch.rand(n)   # targets: a linear function of x plus uniform noise
plt.scatter(x[:,0], y);
You want to find parameters (weights) a such that you minimize the error between the points and the line x@a. Note that here a is unknown. For a regression problem the most common error function or loss function is the mean squared error.
def mse(y_hat, y): return ((y_hat-y)**2).mean()
Suppose we believe a = (-1.0,1.0); then we can compute y_hat, which is our prediction, and then compute our error.
a = tensor(-1.,1)
y_hat = x@a
mse(y_hat, y)
tensor(7.9356)
plt.scatter(x[:,0],y)
plt.scatter(x[:,0],y_hat);
So far we have specified the model (linear regression) and the evaluation criteria (or loss function). Now we need to handle optimization; that is, how do we find the best values for a? How do we find the best-fitting linear regression?
We would like to find the values of a that minimize the MSE loss.
Gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient.
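Concretely, writing the learning rate as $\eta$ (lr in the code below), each step updates the parameters as

$$a \leftarrow a - \eta \,\nabla_a\, \mathrm{mse}(x a,\ y)$$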
Here is gradient descent implemented in PyTorch.
a = nn.Parameter(a); a
Parameter containing:
tensor([-1., 1.], requires_grad=True)
def update():
    y_hat = x@a
    loss = mse(y, y_hat)
    if t % 10 == 0: print(loss)   # t is the loop counter defined below
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)       # step in the direction of the negative gradient
        a.grad.zero_()            # reset the gradient for the next iteration
lr = 1e-1   # learning rate
for t in range(100): update()
tensor(7.9356, grad_fn=<MeanBackward1>)
tensor(1.4609, grad_fn=<MeanBackward1>)
tensor(0.4824, grad_fn=<MeanBackward1>)
tensor(0.1995, grad_fn=<MeanBackward1>)
tensor(0.1147, grad_fn=<MeanBackward1>)
tensor(0.0893, grad_fn=<MeanBackward1>)
tensor(0.0816, grad_fn=<MeanBackward1>)
tensor(0.0793, grad_fn=<MeanBackward1>)
tensor(0.0786, grad_fn=<MeanBackward1>)
tensor(0.0784, grad_fn=<MeanBackward1>)
plt.scatter(x[:,0],y)
plt.scatter(x[:,0],x@a);
from matplotlib import animation, rc
rc('animation', html='jshtml')
a = nn.Parameter(tensor(-1.,1))
fig = plt.figure()
plt.scatter(x[:,0], y, c='orange')
line, = plt.plot(x[:,0], x@a)
plt.close()
def animate(i):
    update()                 # one gradient-descent step per animation frame
    line.set_ydata(x@a)
    return line,
animation.FuncAnimation(fig, animate, np.arange(0, 100), interval=20)
In practice, we don't compute the loss and gradients on the whole dataset at once; instead, we use mini-batches.
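As a rough sketch of the idea (not part of the original lesson), the update could loop over random mini-batches of the data instead of the full set; the batch size bs and the shuffling logic below are assumptions chosen only for illustration.

bs = 25   # assumed mini-batch size, purely illustrative

def update_minibatch():
    idx = torch.randperm(n)             # visit the points in a random order
    for i in range(0, n, bs):
        batch = idx[i:i+bs]             # indices of one mini-batch
        loss = mse(x[batch] @ a, y[batch])
        loss.backward()
        with torch.no_grad():
            a.sub_(lr * a.grad)         # one SGD step per mini-batch
            a.grad.zero_()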
For classification problems, we use the cross-entropy loss, also known as negative log likelihood loss. It penalizes predictions that are both confident and wrong, as well as correct predictions made with low confidence.
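As a quick illustration (the logits below are made-up numbers, not from the lesson), PyTorch's F.cross_entropy shows this behaviour directly: the confident wrong prediction gets the largest loss and the confident correct one the smallest.

import torch.nn.functional as F

target = tensor([0])                                   # the true class is class 0

confident_wrong   = tensor([[-2.0,  5.0, -2.0]])       # very sure of class 1 (wrong)
unconfident_right = tensor([[ 0.4,  0.3,  0.3]])       # barely prefers class 0 (right)
confident_right   = tensor([[ 5.0, -2.0, -2.0]])       # very sure of class 0 (right)

for name, logits in [("confident wrong",   confident_wrong),
                     ("unconfident right", unconfident_right),
                     ("confident right",   confident_right)]:
    print(name, F.cross_entropy(logits, target).item())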