Similarity between Backpropagation and Dynamic Programming
I recently read the lecture notes for Stanford CS294A to learn about autoencoders, but I thought the derivation of the partial derivatives for the backpropagation algorithm in the lecture notes was a bit unfriendly, so I will try to fill in the gap.
On page 7, he describes one iteration of gradient descent updates as follows:
$$W_{ij}^{(l)} :=
W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}}J\left(W,b\right)$$
We want to find the partial derivative. On the next page, he explains how the backpropagation algorithm can find the partial derivatives, but he introduces the error term $\delta$ out of nowhere, and it is hard to see where it actually comes from.
We’ll try to see how the error term $\delta$ emerges from the gradient update equation.
By the chain rule, and since $W_{ij}^{(l)}$ affects the cost only through $z_{i}^{(l+1)}$, the partial derivative in the update equation can be written as follows:
$$\frac {\partial J\left(W,b\right)} {\partial W_{ij}^{(l)}} = \frac {\partial J\left(W,b\right)} {\partial z_{i}^{(l+1)}} \frac {\partial z_{i}^{(l+1)}} {\partial W_{ij}^{(l)}}$$From equation (6) on page 5, $z^{(l+1)} = W^{(l)}a^{(l)} +b^{(l)}$, we can see that $\frac {\partial z_{i}^{(l+1)}} {\partial W_{ij}^{(l)}} = a_j^{(l)}$. So,
$$\frac {\partial J\left(W,b\right)} {\partial W_{ij}^{(l)}} = \frac {\partial J\left(W,b\right)} {\partial z_{i}^{(l+1)}} a_j^{(l)}$$Now, we introduce the delta term by letting
$$\delta_i^{(l)}
= \frac {\partial J\left(W,b\right)} {\partial z_{i}^{(l)}}
=\frac {\partial J\left(W,b\right)} {\partial a_i^{(l)}} \frac {\partial a_i^{(l)}} {\partial z_{i}^{(l)}}
= \frac {\partial J\left(W,b\right)} {\partial a_i^{(l)}} f'(z_{i}^{(l)})$$
Let us look at the partial derivative that appears above: $\frac {\partial J\left(W,b\right)} {\partial a_i^{(l)}}$. We can expand this partial derivative by the chain rule as follows:
When $l = n_l$:
$$\frac {\partial J\left(W,b\right)} {\partial a_i^{(l)}} = \frac {\partial} {\partial a_{i}^{(n_l)}}{\frac {1} {2} \left\| y-h_{W,b}(x)\right\|^2} = -(y_i-a_i^{(n_l)})$$Note that
$$a^{(n_l)} = h_{W,b}(x) \\
J(W,b) ={\frac {1} {2} \left\| y-h_{W,b}(x)\right\|^2}$$
as defined on page 6 in the lecture notes.
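To make this concrete, here is a small numerical sanity check, a sketch only: it assumes a single sigmoid output unit (the activation used in the lecture notes), and the values of $z$ and $y$ are made up. It compares $\delta^{(n_l)} = -(y - a)f'(z)$, which follows from the expression above combined with the definition of the delta term, against a finite-difference estimate of $\partial J / \partial z$.

```python
# A numerical sanity check for the output-layer case, using a single sigmoid
# unit and the squared-error cost from the notes. The values of z and y are
# made up for illustration.
import numpy as np

def f(z):                     # sigmoid activation, as in the lecture notes
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):               # its derivative: f'(z) = f(z) * (1 - f(z))
    return f(z) * (1.0 - f(z))

def J(z, y):                  # cost as a function of the pre-activation z
    return 0.5 * (y - f(z)) ** 2

z, y, eps = 0.3, 1.0, 1e-6

# dJ/da = -(y - a), so delta = dJ/dz = -(y - a) * f'(z)
delta = -(y - f(z)) * f_prime(z)

# central finite-difference approximation of dJ/dz
delta_numeric = (J(z + eps, y) - J(z - eps, y)) / (2 * eps)

print(delta, delta_numeric)   # the two numbers should agree closely
```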
When $l \neq n_l$:
$$\frac {\partial J\left(W,b\right)} {\partial a_{i}^{(l)}} = \frac {\partial J\left(W,b\right)} {\partial z_{1}^{(l+1)}} \frac {\partial z_{1}^{(l+1)}} {\partial a_i^{(l)}} + \frac {\partial J\left(W,b\right)} {\partial z_{2}^{(l+1)}} \frac {\partial z_{2}^{(l+1)}} {\partial a_i^{(l)}} + ... + \frac {\partial J\left(W,b\right)} {\partial z_{s_{l+1}}^{(l+1)}} \frac {\partial z_{s_{l+1}}^{(l+1)}} {\partial a_i^{(l)}} \\
= \sum _{j=1}^{s_{l+1}}W_{ji}^{(l)}\delta_j^{(l+1)}$$
Note that from equation (6) in the lecture notes and from the definition of the delta term above,
$$\frac {\partial z_{j}^{(l+1)}} {\partial a_{i}^{(l)}} = W_{ji}^{(l)} \\
\delta_j^{(l+1)} = \frac {\partial J\left(W,b\right)} {\partial z_{j}^{(l+1)}}$$
The idea behind the derivation for $l < n_l$ (or actually, behind the whole backpropagation algorithm) is quite similar to dynamic programming. We want to know the partial derivative $\frac {\partial J\left(W,b\right)} {\partial a_{i}^{(l)}}$. That is, we want to know how a small change in $a_i^{(l)}$ affects the overall cost function $J(W,b)$. To find that, we look at the consequences in the layer right after layer $l$, namely layer $l+1$. The delta values in layer $l+1$ have to be computed before the one for the current node $a_i^{(l)}$ in layer $l$, and indeed they are already available at that point, because we start with the last layer and move backward. This is exactly the idea behind dynamic programming.
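To make the dynamic-programming analogy explicit, here is a minimal sketch in which $\delta_i^{(l)}$ is computed top-down as a memoized recursion; the memo is exactly the 2D table of delta terms indexed by (layer, unit), and each entry is computed once and then reused. The network sizes, weights, and data below are made up, and the sigmoid activation is assumed as in the lecture notes.

```python
# A sketch of the dynamic-programming view: delta_i^(l) is defined recursively
# in terms of the deltas of layer l+1, and memoization ensures each entry of
# the (layer, unit) table is computed only once. Sizes, weights, and data are
# made up for illustration; layers are 0-indexed here (layer 0 is the input).
import numpy as np
from functools import lru_cache

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                        # units per layer; the last layer is n_l
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
x = rng.normal(size=sizes[0])            # a made-up input
y = rng.normal(size=sizes[-1])           # a made-up target

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
f_prime = lambda z: f(z) * (1.0 - f(z))

# forward pass: store a^(l) and z^(l) for every layer (z is undefined for layer 0)
a, z = [x], [None]
for l in range(len(W)):
    z.append(W[l] @ a[l] + b[l])
    a.append(f(z[-1]))
n_l = len(sizes) - 1                     # index of the output layer

@lru_cache(maxsize=None)
def delta(l, i):
    """delta_i^(l) = dJ/dz_i^(l), computed top-down with memoization."""
    if l == n_l:                         # base case: the output layer
        return -(y[i] - a[n_l][i]) * f_prime(z[n_l][i])
    # recursive case: sum over the units j of layer l+1 that a_i^(l) feeds into
    back = sum(W[l][j, i] * delta(l + 1, j) for j in range(sizes[l + 1]))
    return back * f_prime(z[l][i])

# fill the whole table of deltas; each (l, i) entry is computed exactly once
table = {(l, i): delta(l, i) for l in range(1, n_l + 1) for i in range(sizes[l])}
print(table)
```

The backpropagation algorithm in the notes is simply the bottom-up version of this recursion: it fills the table layer by layer, starting from the output layer, which is exactly what the expressions below compute.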
Finally, from the above, we arrive at the expressions given in the backpropagation algorithm on page 8 of the lecture notes:
$$\delta_i^{(n_l)} = -(y_i-a_i^{(n_l)})f'(z_i^{(n_l)}) \\
\delta_i^{(l)} = \left( \sum _{j=1}^{s_{l+1}}W_{ji}^{(l)}\delta_j^{(l+1)}\right)f'(z_i^{(l)}) \\
\frac {\partial } {\partial W_{ij}^{(l)}} J(W,b) = a_j^{(l)}\delta_i^{(l+1)}$$
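Putting the three expressions together, here is a compact sketch of the full backward pass for a small made-up network (sigmoid activations assumed, all names illustrative), followed by a finite-difference check of one weight to confirm that $\frac {\partial } {\partial W_{ij}^{(l)}} J(W,b) = a_j^{(l)}\delta_i^{(l+1)}$.

```python
# A sketch of the full backward pass built from the three expressions above,
# followed by a finite-difference check of one gradient entry. Sizes, weights,
# and data are made up; layers are 0-indexed (layer 0 is the input) and the
# activation is the sigmoid used in the lecture notes.
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 4, 2]
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
x = rng.normal(size=sizes[0])
y = rng.normal(size=sizes[-1])

f = lambda z: 1.0 / (1.0 + np.exp(-z))
f_prime = lambda z: f(z) * (1.0 - f(z))

def forward(W, b, x):
    a, z = [x], [None]
    for l in range(len(W)):
        z.append(W[l] @ a[l] + b[l])
        a.append(f(z[-1]))
    return a, z

def cost(W, b, x, y):
    a, _ = forward(W, b, x)
    return 0.5 * np.sum((y - a[-1]) ** 2)

# backward pass: fill the deltas from the output layer down to the first hidden
# layer, then read off the gradients dJ/dW^(l)_{ij} = a_j^(l) * delta_i^(l+1)
a, z = forward(W, b, x)
n_l = len(sizes) - 1
delta = [None] * (n_l + 1)
delta[n_l] = -(y - a[n_l]) * f_prime(z[n_l])
for l in range(n_l - 1, 0, -1):
    delta[l] = (W[l].T @ delta[l + 1]) * f_prime(z[l])
grad_W = [np.outer(delta[l + 1], a[l]) for l in range(len(W))]

# finite-difference check of a single weight (here W^(0)_{0,0} in 0-based terms)
l, i, j, eps = 0, 0, 0, 1e-6
W_plus = [w.copy() for w in W]
W_plus[l][i, j] += eps
W_minus = [w.copy() for w in W]
W_minus[l][i, j] -= eps
numeric = (cost(W_plus, b, x, y) - cost(W_minus, b, x, y)) / (2 * eps)
print(grad_W[l][i, j], numeric)   # the two numbers should agree closely
```

Here the deltas of a whole layer are computed at once as a vector, which is just the vectorized form of the per-unit sum above.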
Overall, the reason the derivation of how the partial derivatives can be represented using the delta term is a bit hard to follow is that its logical flow is flipped. The way the lecture notes set up the delta term seems magical because they never explain why we would define the term that way. The answer is that we define it simply because it makes the equations in the derivation look much simpler (and indeed, with the delta term, we can visualize filling up a 2D table where each entry is a corresponding delta term, just as we do in dynamic programming). If the delta term had been introduced explicitly as a convenience that simplifies the equations in the sequence of computations starting from the update equation, the derivation would have been much easier to follow.