```python
class Linear(Node):
    def __init__(self, x, w, b):
        Node.__init__(self, [x, w, b])

    def forward(self):
        # Cache the input values, they're reused in the backward pass.
        self.cache[0] = self.inbound_nodes[0].value  # X
        self.cache[1] = self.inbound_nodes[1].value  # W
        self.cache[2] = self.inbound_nodes[2].value  # b
        self.value = np.dot(self.cache[0], self.cache[1]) + self.cache[2]

    def backward(self):
        # Initialize a zero-filled gradient accumulator for each input.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            # Gradient of the loss with respect to this node's output.
            grad = n.gradients[self]
            # dL/dX = dL/dZ * W^T
            self.gradients[self.inbound_nodes[0]] += np.dot(grad, self.cache[1].T)
            # dL/dW = X^T * dL/dZ
            self.gradients[self.inbound_nodes[1]] += np.dot(self.cache[0].T, grad)
            # dL/db = column-wise sum of dL/dZ
            self.gradients[self.inbound_nodes[2]] += np.sum(grad, axis=0)
```
These are the shapes: $X$ is $m \times n$ (one input example per row), $W$ is $n \times p$, $b$ is a vector of length $p$, and the output $Z = XW + b$ is $m \times p$.
For the forward pass we compute

$$ f(X, W, b) = XW + b $$

and store the values of $X, W, b$ in the cache.
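As a quick sanity check of the shapes, here's a minimal sketch with made-up sizes ($m=4$, $n=3$, $p=2$):

```python
import numpy as np

m, n, p = 4, 3, 2          # made-up sizes for illustration
X = np.random.randn(m, n)  # input batch, one example per row
W = np.random.randn(n, p)  # weight matrix
b = np.random.randn(p)     # bias vector, broadcast across rows

Z = np.dot(X, W) + b       # forward pass: XW + b
print(Z.shape)             # (4, 2) -> m by p
```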
Note the different initialization here: `np.zeros_like` creates an array of the same shape as its input, filled with 0s.

```python
self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
```
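Here's a minimal sketch (with hypothetical arrays) of why the accumulator starts at zeros and uses `+=`: when a node feeds several outbound nodes, each one sends back a gradient, and the contributions must add up:

```python
import numpy as np

x_value = np.random.randn(4, 3)

# Hypothetical gradients flowing back from two downstream consumers.
grad_from_node_a = np.random.randn(4, 3)
grad_from_node_b = np.random.randn(4, 3)

total = np.zeros_like(x_value)   # same shape as x_value, filled with 0s
total += grad_from_node_a
total += grad_from_node_b        # contributions accumulate

assert np.allclose(total, grad_from_node_a + grad_from_node_b)
```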
For the backward pass it's very helpful to write out $f$ explicitly with the elements of the matrices and vector.
$$ Z_{ij} = \sum_{k=1}^n X_{ik}W_{kj} + b_j $$

This is a dot product between the $i$th row of $X$ and the $j$th column of $W$, followed by an addition of the $j$th element of $b$.
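To convince ourselves the elementwise formula matches the vectorized forward pass, here's a small sketch (made-up sizes again) computing $Z_{ij}$ with explicit loops and comparing against `np.dot`:

```python
import numpy as np

m, n, p = 4, 3, 2
X = np.random.randn(m, n)
W = np.random.randn(n, p)
b = np.random.randn(p)

# Z_ij = sum_k X_ik * W_kj + b_j, written out with explicit loops.
Z_loops = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        Z_loops[i, j] = sum(X[i, k] * W[k, j] for k in range(n)) + b[j]

assert np.allclose(Z_loops, np.dot(X, W) + b)
```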
Let's find the partial for $X_{ik}$. Remember, we treat everything other than $X_{ik}$ as a constant.
$$ \frac {\partial Z_{ij}} {\partial X_{ik}} = \frac {\partial} {\partial X_{ik}} \left( X_{i1}W_{1j} + \dots + X_{ik}W_{kj} + \dots + X_{in}W_{nj} + b_j \right) = W_{kj} $$

Every term that doesn't contain $X_{ik}$ is a constant and drops out, leaving $W_{kj}$. If we take a closer look we'll see that $X_{ik}$ is used to compute not only $Z_{ij}$, but all of $Z_{i1}, Z_{i2}, \ldots, Z_{ip}$. Thus, the partial with respect to $X_{ik}$ is the sum of all the $W_{kj}$:
$$ \frac {\partial Z} {\partial X_{ik}} = \sum_{j=1}^p W_{kj} $$

All that's left is to incorporate the values propagated backwards through the chain rule.
$$ \frac {\partial L} {\partial X_{ik}} = \sum_{j=1}^p \left( W_{kj} \frac {\partial L} {\partial Z_{ij}} \right) $$

In order to express this in code we need to transpose $W$ so it's $p$ by $n$.
$$ \frac {\partial L} {\partial X} = \frac {\partial L} {\partial Z} W^T $$

Similar reasoning leads to:

$$ \frac {\partial L} {\partial W} = X^T \frac {\partial L} {\partial Z} $$
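These two formulas are easy to verify numerically. Below is a sketch of a finite-difference check with hypothetical sizes, using the toy loss $L = \sum_{ij} Z_{ij}$ so that $\partial L / \partial Z$ is a matrix of ones:

```python
import numpy as np

m, n, p = 4, 3, 2
X = np.random.randn(m, n)
W = np.random.randn(n, p)
b = np.random.randn(p)

L = lambda X, W: np.sum(np.dot(X, W) + b)  # toy loss: sum of all Z entries
grad_Z = np.ones((m, p))                   # dL/dZ for this loss

# Analytic gradients from the formulas above.
grad_X = np.dot(grad_Z, W.T)   # (m, p) x (p, n) -> (m, n), same shape as X
grad_W = np.dot(X.T, grad_Z)   # (n, m) x (m, p) -> (n, p), same shape as W

# Finite-difference check on one entry of X and one entry of W.
eps = 1e-6
X_plus = X.copy()
X_plus[0, 0] += eps
assert np.isclose((L(X_plus, W) - L(X, W)) / eps, grad_X[0, 0], atol=1e-4)

W_plus = W.copy()
W_plus[0, 0] += eps
assert np.isclose((L(X, W_plus) - L(X, W)) / eps, grad_W[0, 0], atol=1e-4)
```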
Ok, now let's find the partial for $b_j$.

$$ \frac {\partial Z_{ij}} {\partial b_j} = \frac {\partial} {\partial b_j} \left( \sum_{k=1}^n X_{ik}W_{kj} + b_j \right) = 1 $$

The whole sum is constant with respect to $b_j$, so only the $1$ from $b_j$ itself remains. Once again note we use the same $b_j$ in every row $i$ of $Z$.
$$ \frac {\partial Z} {\partial b_j} = \sum_{i=1}^m 1 = m $$

Incorporating the chain rule:
$$ \frac {\partial L} {\partial b_{j}} = \sum_{i=1}^m \left( 1 \cdot \frac {\partial L} {\partial Z_{ij}} \right) $$
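In code this sum over the rows is exactly the `np.sum(grad, axis=0)` in `backward`. A quick sketch with a hypothetical upstream gradient:

```python
import numpy as np

m, p = 4, 2
grad_Z = np.random.randn(m, p)   # hypothetical dL/dZ flowing back

grad_b = np.sum(grad_Z, axis=0)  # sum over the m rows, shape (p,)
assert grad_b.shape == (p,)

# Each b_j collects the gradient from every row i of Z.
assert np.allclose(grad_b[0], sum(grad_Z[i, 0] for i in range(m)))
```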