Neural Network Practice Problem: 5.25

I decided to do all the practice problems in Chapter 5 of Pattern Recognition and Machine Learning by Bishop. This chapter is about neural nets. Although the material itself is a bit old, the fundamentals of neural networks are explained very well here, so even today, when a bunch of new deep neural network papers come out on arXiv almost every day, I think it's worth reading.

So today I did 5.25. The problem reads:


Consider a quadratic error function of the form $ E = E_0 + \frac{1}{2}(w - w^{\ast})^T H (w - w^{\ast}) $, where $w^{\ast}$ represents the minimum, and the Hessian matrix $H$ is positive definite and constant. Suppose the initial weight vector $w^{(0)}$ is chosen to be at the origin and is updated using simple gradient descent
$$
w^{(\tau)} = w^{(\tau-1)} - \rho \nabla E
$$
where $\tau$ denotes the step number, and $\rho$ is the learning rate (which is assumed to be small). Show that, after $\tau$ steps, the components of the weight vector parallel to the eigenvectors of $H$ can be written
$$
w_j^{(\tau)} = (1 - (1 - \rho \lambda_j)^{\tau}) w_j^{\ast}
$$
where $w_j = w^T u_j$, and $u_j$ and $\lambda_j$ are the eigenvectors and eigenvalues, respectively, of $H$. Show that as $\tau \rightarrow \infty$, this gives $w^{(\tau)} \rightarrow w^*$ as expected, provided $|1 - \rho \lambda_j| < 1$.


I'm assuming that since $H$ is constant, it is evaluated at $w^{\ast}$, and that $E_0 = E(w^{\ast})$.

First, since $H$ is a symmetric matrix, its eigenvalues are all real, and its eigenvectors can be chosen to be orthogonal to each other.

WLOG, we assume the $u_i$s are orthonormal, so they form an orthonormal basis, and we can express any vector in terms of them. In particular, $w - w^{\ast} = \sum_{i=1}^n v_i u_i$, where the $v_i$s are appropriate coefficients. In matrix form this is $w - w^{\ast} = U v \iff v = U^T (w - w^{\ast})$, where the columns of $U$ are the $u_i$s. This expression gives us a new perspective: we can view the weight vector $w$ in a new coordinate system, obtained from the original one by moving the origin to $w^{\ast}$ and rotating the axes by the orthogonal matrix $U$.
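
As a quick numerical sanity check (my own sketch, not part of the book; the random Hessian, the dimension $n$, and the NumPy setup below are arbitrary choices), we can confirm that $v = U^T(w - w^{\ast})$ and $w - w^{\ast} = Uv$ really are inverses of each other and that $U$ is orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Arbitrary symmetric positive-definite Hessian and an arbitrary minimum w_star.
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)
w_star = rng.standard_normal(n)

# Eigendecomposition: the columns of U are the orthonormal eigenvectors u_i.
lam, U = np.linalg.eigh(H)

# Change of coordinates: v = U^T (w - w_star), and back again.
w = rng.standard_normal(n)
v = U.T @ (w - w_star)
assert np.allclose(U @ v, w - w_star)   # w - w_star = U v
assert np.allclose(U.T @ U, np.eye(n))  # U is orthogonal, so U^T = U^{-1}
```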

Let’s plug this into the given error function E. We get
$$
E = E_0 + \frac{1}{2} (Uv)^T H (Uv) = E_0 +\frac{1}{2} v^T U^T H U v
$$

Note that $H$ is symmetric, meaning it is orthogonally diagonalizable as $H = UDU^T$. So we get
$$
E = E_0 + \frac{1}{2} v^T U^T UDU^T U v
$$

Since $U$ is an orthogonal matrix, we have the identity $U^T = U^{-1}$, so everything cancels out, yielding $E = E_0 + \frac{1}{2} v^T D v$.

Note that $E$ is a function of $w$: $E(w)$. What if we look at it from the new coordinates? The constant term $E_0 = E(w^{\ast})$ is just the error at the new origin; it does not affect the gradient, so we can drop it. Then, we have
$$
E(v) = \frac{1}{2} v^T D v = \frac{1}{2} \sum_{i=1}^n \lambda_i v_i^2
$$
(Note that $D$ is the diagonal matrix whose diagonal elements are the eigenvalues $\lambda_i$.)
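
To convince myself of this diagonalized form, here is a small numerical check (again my own toy setup with an arbitrary $H$, $w^{\ast}$, and $E_0$): the error computed directly from the quadratic in $w$ matches $E_0 + \frac{1}{2}\sum_i \lambda_i v_i^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)       # symmetric positive-definite Hessian
w_star = rng.standard_normal(n)   # minimum of E
E0 = 1.7                          # arbitrary constant E(w_star)

lam, U = np.linalg.eigh(H)        # H = U D U^T with D = diag(lam)

def E(w):
    d = w - w_star
    return E0 + 0.5 * d @ H @ d

w = rng.standard_normal(n)
v = U.T @ (w - w_star)
# E(w) - E0 should equal (1/2) * sum_i lam_i * v_i^2
assert np.isclose(E(w) - E0, 0.5 * np.sum(lam * v**2))
```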

Then,
$$
\nabla_v E = D v
$$
whose $i$-th component is $\lambda_i v_i$. Since $v = U^T(w - w^{\ast})$, subtracting $w^{\ast}$ from the gradient descent update for $w$ and multiplying by $U^T$ gives the corresponding update in the new coordinates:
$$
v^{(\tau)} = v^{(\tau-1)} - \rho D v^{(\tau-1)} = (I - \rho D) v^{(\tau - 1)}
$$
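
Here is the same statement checked numerically for a single step (a sketch under the same arbitrary setup as above; `rho` is just a small value I picked): one gradient descent step in the original $w$-coordinates, re-expressed in the $v$-coordinates, coincides with multiplying by $(I - \rho D)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)
w_star = rng.standard_normal(n)
lam, U = np.linalg.eigh(H)
rho = 0.05  # small learning rate (arbitrary)

# One gradient descent step in the original coordinates: grad E(w) = H (w - w_star).
w = rng.standard_normal(n)
w_next = w - rho * H @ (w - w_star)

# The same step expressed in the rotated coordinates v = U^T (w - w_star).
v = U.T @ (w - w_star)
v_next = (np.eye(n) - rho * np.diag(lam)) @ v

assert np.allclose(U.T @ (w_next - w_star), v_next)
```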

If we look at the last equation coordinate-wise, we get $v_i^{(\tau)} = (1 - \rho \lambda_i) v_i^{(\tau - 1)}$, so by unrolling the recursion, we get
$$
v_i^{(\tau)} = (1 - \rho \lambda_i)^{\tau} v_i^{(0)}
$$
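
Unrolling the recursion is easy to verify on a single coordinate (the scalar values below are arbitrary):

```python
import numpy as np

rho, lam_i, v0 = 0.1, 3.0, 2.5  # arbitrary learning rate, eigenvalue, and initial v_i
tau = 20

# Iterate v_i <- (1 - rho * lam_i) * v_i and compare with the closed form.
v = v0
for _ in range(tau):
    v = (1.0 - rho * lam_i) * v
assert np.isclose(v, (1.0 - rho * lam_i) ** tau * v0)
```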
Now, let's go back to the original coordinates. Since $v_i = u_i^T (w - w^{\ast}) = w_i - w_i^{\ast}$, this reads
$$
w_i^{(\tau)} - w_i^{\ast} = (1 - \rho \lambda_i)^{\tau} (w_i^{(0)} - w_i^{\ast})
$$
Since $w^{(0)}$ is the origin, $w_i^{(0)} = 0$, and moving $w_i^{\ast}$ to the right-hand side, we get
$$
w_i^{(\tau)} = (1 - (1 - \rho \lambda_i)^{\tau})w_i^{\ast}
$$
which is exactly what we wanted to show. Finally, provided $|1 - \rho \lambda_i| < 1$ for all $i$, we have $(1 - \rho \lambda_i)^{\tau} \rightarrow 0$ as $\tau \rightarrow \infty$, so each component $w_i^{(\tau)} \rightarrow w_i^{\ast}$, i.e., $w^{(\tau)} \rightarrow w^{\ast}$ as expected.
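
Finally, here is an end-to-end sanity check of the result (my own sketch; the Hessian, learning rate, and number of steps are arbitrary, with $\rho$ chosen so that $|1 - \rho \lambda_j| < 1$ for all $j$): running plain gradient descent from the origin reproduces the closed-form components $w_j^{(\tau)} = (1 - (1 - \rho \lambda_j)^{\tau}) w_j^{\ast}$, and the iterates converge to $w^{\ast}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)       # symmetric positive-definite Hessian
w_star = rng.standard_normal(n)   # minimum of E
lam, U = np.linalg.eigh(H)

rho = 0.9 / lam.max()             # ensures 0 < rho * lam_j < 1, so |1 - rho*lam_j| < 1
tau = 200

# Plain gradient descent on E starting from the origin: grad E(w) = H (w - w_star).
w = np.zeros(n)
for _ in range(tau):
    w = w - rho * H @ (w - w_star)

# Closed-form prediction for the components w_j = w^T u_j.
w_components_pred = (1.0 - (1.0 - rho * lam) ** tau) * (U.T @ w_star)
assert np.allclose(U.T @ w, w_components_pred)

# With |1 - rho*lam_j| < 1, the iterates approach w_star as tau grows.
assert np.allclose(w, w_star, atol=1e-6)
```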