Notes for "$\ell_1$-Regularized Neural Networks are Improperly Learnable in Polynomial Time"

Problem Setting

Model:

For $|| x ||_2 \le 1$, the class of functions we consider is

$\mathcal{N}_{k,L,\sigma} = \left\{ f(x) = \sigma\left( \cdots \sigma\left(\sigma\left(x\mathbf{W}^{(1)}\right)\mathbf{W}^{(2)}\right) \cdots \mathbf{W}^{(k)} \right) : \left\lVert \mathbf{W}^{(1)}_i \right\rVert_2 \le L,\ \left\lVert \mathbf{W}^{(l)}_i \right\rVert_1 \le L \text{ for } 2 \le l \le k \right\}$

where $\mathbf{W}^{(l)}_i$ denotes the $i$th row vector of $\mathbf{W}^{(l)}$.
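
As a concrete picture of the class, here is a minimal sketch (not code from the paper): $\sigma_{erf}$ is stood in for by the Gauss error function from SciPy, the layer widths, the constant $L$, and the random weights are arbitrary, and each $\mathbf{W}^{(l)}$ is stored as a (fan-out $\times$ fan-in) matrix so that "row $i$" is the weight vector feeding neuron $i$.

```python
# Minimal sketch: evaluating one member of N_{k,L,sigma}.
# Assumptions (not from the paper): sigma is the Gauss error function,
# widths/L/weights are arbitrary, W^(l) has shape (fan_out, fan_in).
import numpy as np
from scipy.special import erf

def clip_rows(W, L, ord):
    """Rescale any row whose ord-norm exceeds L back onto the norm ball."""
    norms = np.linalg.norm(W, ord=ord, axis=1, keepdims=True)
    return W * np.minimum(1.0, L / np.maximum(norms, 1e-12))

def sample_network(widths, L, rng):
    """widths = [d, m_1, ..., m_k]; returns W^(1), ..., W^(k)."""
    Ws = []
    for l in range(1, len(widths)):
        W = rng.standard_normal((widths[l], widths[l - 1]))
        # first layer: l2-bounded rows; all later layers: l1-bounded rows
        Ws.append(clip_rows(W, L, ord=2 if l == 1 else 1))
    return Ws

def forward(x, Ws, sigma=erf):
    h = x
    for W in Ws:
        h = sigma(W @ h)  # input to each neuron, then the activation
    return h

rng = np.random.default_rng(0)
Ws = sample_network([5, 4, 3, 1], L=2.0, rng=rng)
x = rng.standard_normal(5)
x /= max(1.0, np.linalg.norm(x))  # enforce ||x||_2 <= 1
print(forward(x, Ws))
```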

Assumptions (in words):

  • The input vector $x$ has bounded $\ell_2$ norm: $\left\lVert x \right\rVert_2 \le 1$
  • The edge weights in the first layer have bounded $\ell_2$ norms.
    ($\rightarrow$ each row of $\mathbf{W^{(1)}}$ is bounded)

  • The edge weights in the other layers have bounded $\ell_1$ norms.
    ($\rightarrow$ The $\ell_1$-regularization induces sparsity on the neural network.)

Goal:

Learn a predictor $f: \mathcal{X} \rightarrow \mathbb{R}$ whose generalization loss is at most $\epsilon$ worse than that of the best neural network in $\mathcal{N}_{k,L,\sigma}$.
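
Written out (introducing $\ell$ for the loss and $\mathcal{D}$ for the data distribution, which the notes above leave implicit), this is the usual excess-risk guarantee of improper/agnostic learning: with high probability over the training sample, the learned predictor $\hat{f}$ satisfies

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(\hat{f}(x), y)\big] \;\le\; \min_{f \in \mathcal{N}_{k,L,\sigma}} \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x), y)\big] + \epsilon.$$

"Improperly" learnable refers to the fact that $\hat{f}$ itself is not required to lie in $\mathcal{N}_{k,L,\sigma}$ (here it will come from the larger RKHS ball $F_{k,B}$).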

What’s shown in the paper:

Claim:

Theorem 1 of the paper implies that neural networks activated by $\sigma_{erf}$ or $\sigma_{sh}$ are learnable in polynomial time for any constant $(k, L)$.

Proof Steps:

  • First show that the RKHS $F_{k,B}$ (induced by the kernel $K^k(x,y)$, which I will describe shortly) contains $\mathcal{N}_{k,L,\sigma}$, by arguing that the input to any neuron (which is a function of $x \in \mathcal{X}$) is an element of the RKHS $F_{k,B}$.

  • Then show that if we use $\sigma_{erf}$ or $\sigma_{sh}$, the RKHS norm of the input to any neuron in any layer is bounded, which gives a finite $B$ such that every $f \in \mathcal{N}_{k,L,\sigma}$ lies in the ball $F_{k,B}$ of radius $B$. (This part corresponds to Proposition 1 in the paper.)

  • Invoke Theorem 2.2 from (Shalev-Shwartz, 2011) to establish the claim.

To be more precise about Step 2: they first construct a quantity $H(\lambda)$ that upper-bounds the norm of the input to any neuron (this is shown in the proof of Lemma 1 in Appendix A), and then show that $H(\lambda)$ is bounded if we use $\sigma_{erf}$ or $\sigma_{sh}$.
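
As a loose illustration of the last step (learning over an RKHS ball with a convex kernel method), here is a minimal sketch using the kernel $K(x,y) = 1/(2 - \langle x,y\rangle)$ described in the next section. `KernelRidge` from scikit-learn with a precomputed Gram matrix is only a stand-in for whatever regularized kernel learner the cited theorem is instantiated with, and the synthetic data is arbitrary; this is not the paper's actual algorithm.

```python
# Sketch only: a generic kernel method with K(x, y) = 1 / (2 - <x, y>)
# on l2-normalized inputs. KernelRidge stands in for the convex kernel
# learner invoked in the proof; the paper's algorithm/loss may differ.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def gram(X, Y):
    """Gram matrix of the kernel K(x, y) = 1 / (2 - <x, y>)."""
    return 1.0 / (2.0 - X @ Y.T)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x||_2 <= 1
y = np.tanh(X @ rng.standard_normal(5))  # arbitrary synthetic target

Xtr, Xte, ytr, yte = X[:150], X[150:], y[:150], y[150:]
model = KernelRidge(alpha=1e-2, kernel="precomputed").fit(gram(Xtr, Xtr), ytr)
pred = model.predict(gram(Xte, Xtr))
print("test MSE:", np.mean((pred - yte) ** 2))
```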

Kernel function

The kernel function they use is the following:
$$K(x,y) := \frac{1}{2 - \langle x,y\rangle}$$

The validity of this kernel is proven by explicitly constructing a mapping $\psi(x)$ s.t. $K(x,y) = \langle\psi(x), \psi(y)\rangle$.
That is, they consider $\psi(x)$ as an infinite-dimensional vector whose entries are indexed by tuples $i = (k_1, \ldots, k_j)$, ranging over all $(k_1,\ldots,k_j) \in \{1,\ldots,n\}^j$ and all lengths $j = 0, 1, 2, \ldots$ (So if we kept only the index tuples of length at most $M$, $\psi(x)$ would have $\sum_{j=0}^{M} n^j$ entries; this is just to get a picture of the vector, since in reality the index length is unbounded.)
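
Assuming the standard construction in which the entry of $\psi(x)$ indexed by $(k_1, \ldots, k_j)$ equals $2^{-(j+1)/2}\, x_{k_1} \cdots x_{k_j}$ (this exact scaling is my reconstruction, not quoted from the paper), the inner product collapses into a geometric series:

$$\langle \psi(x), \psi(y)\rangle = \sum_{j=0}^{\infty} \sum_{(k_1,\ldots,k_j)\in\{1,\ldots,n\}^j} 2^{-(j+1)}\, x_{k_1}\cdots x_{k_j}\, y_{k_1}\cdots y_{k_j} = \sum_{j=0}^{\infty} \frac{\langle x, y\rangle^{j}}{2^{j+1}} = \frac{1/2}{1 - \langle x,y\rangle/2} = \frac{1}{2 - \langle x,y\rangle},$$

which is exactly $K(x,y)$; the series converges since $|\langle x,y\rangle| \le \lVert x\rVert_2 \lVert y\rVert_2 \le 1 < 2$.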

Recursive Kernel Method