```python
class Linear(Node):
    def __init__(self, x, w, b):
        Node.__init__(self, [x, w, b])

    def forward(self):
        # Cache the input values, they're reused in the backward pass.
        self.cache[0] = self.inbound_nodes[0].value  # X
        self.cache[1] = self.inbound_nodes[1].value  # W
        self.cache[2] = self.inbound_nodes[2].value  # b
        self.value = np.dot(self.cache[0], self.cache[1]) + self.cache[2]

    def backward(self):
        # Initialize a zero-filled gradient accumulator for each input.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            # Gradient of the loss with respect to this node's output.
            grad = n.gradients[self]
            # dL/dX = dL/dZ * W^T
            self.gradients[self.inbound_nodes[0]] += np.dot(grad, self.cache[1].T)
            # dL/dW = X^T * dL/dZ
            self.gradients[self.inbound_nodes[1]] += np.dot(self.cache[0].T, grad)
            # dL/db = column-wise sum of dL/dZ
            self.gradients[self.inbound_nodes[2]] += np.sum(grad, axis=0)
```
These are the shapes: $X$ is $m \times n$ (one input example per row), $W$ is $n \times p$, $b$ is a vector of length $p$, and the output $Z = XW + b$ is $m \times p$.
For the forward pass we compute

$$ f(X, W, b) = XW + b $$

and store the values of $X, W, b$ in the cache.
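As a quick sanity check of the shapes, here's a minimal sketch with made-up sizes ($m=4$, $n=3$, $p=2$):

```python
import numpy as np

m, n, p = 4, 3, 2          # made-up sizes for illustration
X = np.random.randn(m, n)  # input batch, one example per row
W = np.random.randn(n, p)  # weight matrix
b = np.random.randn(p)     # bias vector, broadcast across rows

Z = np.dot(X, W) + b       # forward pass: XW + b
print(Z.shape)             # (4, 2) -> m by p
```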
Note the different initialization here: `np.zeros_like` creates an array of the same shape as its input, filled with 0s.

```python
self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
```
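Here's a minimal sketch (with hypothetical arrays) of why the accumulator starts at zeros and uses `+=`: when a node feeds several outbound nodes, each one sends back a gradient, and the contributions must add up:

```python
import numpy as np

x_value = np.random.randn(4, 3)

# Hypothetical gradients flowing back from two downstream consumers.
grad_from_node_a = np.random.randn(4, 3)
grad_from_node_b = np.random.randn(4, 3)

total = np.zeros_like(x_value)   # same shape as x_value, filled with 0s
total += grad_from_node_a
total += grad_from_node_b        # contributions accumulate

assert np.allclose(total, grad_from_node_a + grad_from_node_b)
```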
For the backward pass it's very helpful to write out $f$ explicitly with the elements of the matrices and vector.
$$ Z_{ij} = \sum_{k=1}^n X_{ik}W_{kj} + b_j $$

This is a dot product between the $i$th row of $X$ and the $j$th column of $W$, followed by an addition of the $j$th element of $b$.
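To convince ourselves the elementwise formula matches the vectorized forward pass, here's a small sketch (made-up sizes again) computing $Z_{ij}$ with explicit loops and comparing against `np.dot`:

```python
import numpy as np

m, n, p = 4, 3, 2
X = np.random.randn(m, n)
W = np.random.randn(n, p)
b = np.random.randn(p)

# Z_ij = sum_k X_ik * W_kj + b_j, written out with explicit loops.
Z_loops = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        Z_loops[i, j] = sum(X[i, k] * W[k, j] for k in range(n)) + b[j]

assert np.allclose(Z_loops, np.dot(X, W) + b)
```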
Let's find the partial for $X_{ik}$. Remember, we treat everything other than $X_{ik}$ as a constant.
$$ \frac {\partial Z_{ij}} {\partial X_{ik}} = \frac {\partial} {\partial X_{ik}} \left( X_{i1}W_{1j} + \dots + X_{ik}W_{kj} + \dots + X_{in}W_{nj} + b_j \right) = W_{kj} $$

Every term that doesn't contain $X_{ik}$ is a constant and drops out, leaving $W_{kj}$. If we take a closer look we'll see that $X_{ik}$ is used to compute not only $Z_{ij}$, but all of $Z_{i1}, Z_{i2}, \ldots, Z_{ip}$. Thus, the partial with respect to $X_{ik}$ is the sum of all the $W_{kj}$:
$$ \frac {\partial Z} {\partial X_{ik}} = \sum_{j=1}^p W_{kj} $$

All that's left is to incorporate the values propagated backwards through the chain rule.
$$ \frac {\partial L} {\partial X_{ik}} = \sum_{j=1}^p \left( W_{kj} \frac {\partial L} {\partial Z_{ij}} \right) $$

In order to express this in code we need to transpose $W$ so it's $p$ by $n$.
$$ \frac {\partial L} {\partial X} = \frac {\partial L} {\partial Z} W^T $$

Similar reasoning leads to:

$$ \frac {\partial L} {\partial W} = X^T \frac {\partial L} {\partial Z} $$
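These two formulas are easy to verify numerically. Below is a sketch of a finite-difference check with hypothetical sizes, using the toy loss $L = \sum_{ij} Z_{ij}$ so that $\partial L / \partial Z$ is a matrix of ones:

```python
import numpy as np

m, n, p = 4, 3, 2
X = np.random.randn(m, n)
W = np.random.randn(n, p)
b = np.random.randn(p)

L = lambda X, W: np.sum(np.dot(X, W) + b)  # toy loss: sum of all Z entries
grad_Z = np.ones((m, p))                   # dL/dZ for this loss

# Analytic gradients from the formulas above.
grad_X = np.dot(grad_Z, W.T)   # (m, p) x (p, n) -> (m, n), same shape as X
grad_W = np.dot(X.T, grad_Z)   # (n, m) x (m, p) -> (n, p), same shape as W

# Finite-difference check on one entry of X and one entry of W.
eps = 1e-6
X_plus = X.copy()
X_plus[0, 0] += eps
assert np.isclose((L(X_plus, W) - L(X, W)) / eps, grad_X[0, 0], atol=1e-4)

W_plus = W.copy()
W_plus[0, 0] += eps
assert np.isclose((L(X, W_plus) - L(X, W)) / eps, grad_W[0, 0], atol=1e-4)
```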
Ok, now let's find the partial for $b_j$.

$$ \frac {\partial Z_{ij}} {\partial b_j} = \frac {\partial} {\partial b_j} \left( \sum_{k=1}^n X_{ik}W_{kj} + b_j \right) = 1 $$

The whole sum is constant with respect to $b_j$, so only the $1$ from $b_j$ itself remains. Once again note we use the same $b_j$ in every row $i$ of $Z$.
$$ \frac {\partial Z} {\partial b_j} = \sum_{i=1}^m 1 = m $$

Incorporating the chain rule:
$$ \frac {\partial L} {\partial b_{j}} = \sum_{i=1}^m \left( 1 \cdot \frac {\partial L} {\partial Z_{ij}} \right) $$
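In code this sum over the rows is exactly the `np.sum(grad, axis=0)` in `backward`. A quick sketch with a hypothetical upstream gradient:

```python
import numpy as np

m, p = 4, 2
grad_Z = np.random.randn(m, p)   # hypothetical dL/dZ flowing back

grad_b = np.sum(grad_Z, axis=0)  # sum over the m rows, shape (p,)
assert grad_b.shape == (p,)

# Each b_j collects the gradient from every row i of Z.
assert np.allclose(grad_b[0], sum(grad_Z[i, 0] for i in range(m)))
```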