Krylov Subspace

Introduction

Problem setting

  • linear solver : $ Ax = b $

  • eigenvalue problem: $ Ax = \lambda x$

The Krylov subspace method is considered one of the hallmarks of modern numerical linear algebra. It is often effective when matrix-vector products are cheap to obtain (e.g. $A$ is sparse, or $Ax$ can be computed with a single function call), and it comes with a nice convergence analysis.

Krylov subspace

$K_d := \text{span } \{ b, Ab, A^2b, ..., A^{d-1}b \}$
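As a quick illustration (my own sketch, not code from the original post), here is a minimal NumPy snippet that stacks $b, Ab, \dots, A^{d-1}b$ into a matrix. Its condition number grows rapidly with $d$, which is exactly why the Arnoldi process below orthogonalizes the basis instead of using the raw powers of $A$.

```python
import numpy as np

def krylov_basis(A, b, d):
    """Columns span K_d = span{b, Ab, ..., A^{d-1} b} (naive version, for illustration only)."""
    K = np.zeros((b.shape[0], d))
    v = b.astype(float)
    for j in range(d):
        K[:, j] = v
        v = A @ v          # next power of A applied to b
    return K

# Toy usage: the columns quickly become nearly parallel (ill-conditioned).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)
print(np.linalg.cond(krylov_basis(A, b, 10)))
```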

At a high level, the method has two components: 1. project the problem onto the Krylov subspace and solve it there; 2. map that solution back to the original space.

Some characteristics:

  • As $d$ increases, the approximation error decreases (as with other iterative methods).
  • For large enough $d$, we obtain the exact solution (in exact arithmetic), just as a direct method would.
  • In practice, we need preconditioning to control numerical errors and keep the number of iterations small (see the sketch below).
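As a concrete example of the last point, here is a small sketch (my addition, not from the post) that passes an incomplete-LU preconditioner to SciPy's GMRES, one of the standard Krylov solvers. The test matrix is a hypothetical 1D Laplacian-like tridiagonal matrix chosen just for illustration.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy sparse test problem: tridiagonal (Laplacian-like) matrix.
n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Incomplete LU factorization used as a preconditioner M ~ A^{-1}.
ilu = spla.spilu(A)
M = spla.LinearOperator((n, n), matvec=ilu.solve)

x, info = spla.gmres(A, b, M=M)   # info == 0 means the solver converged
print(info, np.linalg.norm(A @ x - b))
```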

Arnoldi process

Goal: obtain a set of orthonormal vectors $\{ q_1, ..., q_d \}$ such that $\text{span} \{ q_1, ..., q_d \} = \text{span} \{ b, Ab, ..., A^{d-1}b \}$ via Gram–Schmidt-type iterations. The process satisfies the recurrence $AQ_d = Q_d H_d + r_d e_d^T$ (we terminate when $\|r_d\|$ becomes small enough), where $Q_d$ has orthonormal columns and $H_d$ is upper Hessenberg. For $A = A^T$, $H_d$ becomes tridiagonal (the Lanczos process). A sketch of the iteration is given below.
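Here is a minimal NumPy implementation of the Arnoldi iteration (my own illustration): it builds the orthonormal basis and the Hessenberg matrix one column at a time using modified Gram–Schmidt.

```python
import numpy as np

def arnoldi(A, b, d):
    """Return Q (orthonormal columns spanning K_d) and H ((d+1) x d, upper Hessenberg)."""
    n = b.shape[0]
    Q = np.zeros((n, d + 1))
    H = np.zeros((d + 1, d))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(d):
        w = A @ Q[:, j]                      # expand the subspace by one matrix-vector product
        for i in range(j + 1):               # modified Gram-Schmidt against previous q_i
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:              # "happy breakdown": K_d is already invariant
            return Q[:, : j + 1], H[: j + 2, : j + 1]
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H
```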

Projection onto Krylov subspace

$H_d = Q_d^* A Q_d$

This matrix can be considered as the orthogonal projection of $A$ onto $K_d$, expressed in the basis $\{ q_1, \dots, q_d \}$. (Recall: similarity transformation = change of basis.)

Because $H_d$ is in Hessenberg form (and much smaller than $A$), it is computationally easier to work with, so we might as well solve the problem of interest with respect to $H_d$ instead of $A$.
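For the eigenvalue problem from the introduction, "solving with respect to $H_d$" means taking the eigenvalues of the small projected matrix (the Ritz values) as approximations to eigenvalues of $A$. A sketch, reusing the `arnoldi` function above on a symmetric random test matrix:

```python
import numpy as np

# Assumes the arnoldi() sketch above; H[:d, :d] is the d x d projection Q_d^* A Q_d.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
A = A + A.T                            # symmetric, so eigenvalues are real
b = rng.standard_normal(200)

d = 30
Q, H = arnoldi(A, b, d)
ritz = np.linalg.eigvalsh(H[:d, :d])   # Ritz values: eigenvalues of the projected matrix

# The extreme Ritz values are typically already close to the extreme eigenvalues of A.
print(ritz[-1], np.linalg.eigvalsh(A)[-1])
```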

Example: (generalized) least squares

Instead of working directly with $A$, we project $A$ onto $K_d$ to obtain $H_d$. The problem then becomes $\hat{x} := \arg\min_{y} \| H_d y - \hat{b} \|$, where $\hat{b} := Q_d^* b$. We can obtain the solution in the original space via $x = Q_d \hat{x}$.
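Here is a sketch of the full project/solve/lift pipeline for $Ax = b$ (my own illustration, reusing the `arnoldi` sketch above). Note that $\hat{b} = \|b\| e_1$ because $q_1 = b / \|b\|$, and the test matrix is just a well-conditioned random example.

```python
import numpy as np

def krylov_lstsq(A, b, d):
    """GMRES-style solve: project onto K_d, solve a small least-squares problem, lift back."""
    Q, H = arnoldi(A, b, d)                        # assumes the arnoldi() sketch above
    k = H.shape[1]                                 # actual subspace dimension (may stop early)
    bhat = np.zeros(k + 1)
    bhat[0] = np.linalg.norm(b)                    # Q_d^* b = ||b|| e_1 since q_1 = b / ||b||
    y, *_ = np.linalg.lstsq(H, bhat, rcond=None)   # small (k+1) x k least-squares problem
    return Q[:, :k] @ y                            # lift the solution back to the original space

rng = np.random.default_rng(2)
A = np.eye(300) + 0.01 * rng.standard_normal((300, 300))
b = rng.standard_normal(300)
x = krylov_lstsq(A, b, 30)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # should be a small relative residual
```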

Application to training Deep Neural Networks?

In the next post, I’ll go over how the Krylov subspace method can be applied to deep learning.