Notes on "$\ell_1$-Regularized Neural Networks are Improperly Learnable in Polynomial Time"
Problem Setting
Model:
For $|| x ||_2 \le 1$, the class of functions we consider is
$\mathcal{N}_{k,L,\sigma} = \left\{ f(x) = \sigma\!\left( \mathbf{W}^{(k)} \sigma\!\left( \cdots \sigma\!\left( \mathbf{W}^{(1)} x \right) \cdots \right) \right) : \left\lVert \mathbf{W}^{(1)}_i \right\rVert_2 \le L,\ \left\lVert \mathbf{W}^{(l)}_i \right\rVert_1 \le L \text{ for } 2 \le l \le k \right\},$ where $\mathbf{W}^{(l)}_i$ denotes the $i$th row of $\mathbf{W}^{(l)}$, i.e. the weight vector of the $i$th neuron in layer $l$.
Assumptions (in words):
- The input vector $x$ has bounded $\ell_2$ norm: $\left\lVert x \right\rVert_2 \le 1$.
- The edge weights in the first layer have bounded $\ell_2$ norms.
  ($\rightarrow$ each row of $\mathbf{W}^{(1)}$ is bounded)
- The edge weights in the other layers have bounded $\ell_1$ norms.
  ($\rightarrow$ The $\ell_1$-regularization induces sparsity on the neural network.)
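To make the class concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper; the helper names `network_value` and `in_class`, and the choice of `erf` as the activation, are assumptions for the example):

```python
import numpy as np
from scipy.special import erf  # standing in for sigma_erf

def network_value(x, Ws, sigma=erf):
    """Evaluate f(x) = sigma(W^(k) sigma( ... sigma(W^(1) x) ... )).

    x  : input vector with ||x||_2 <= 1
    Ws : list [W1, ..., Wk]; row i of each matrix is the weight vector
         of the i-th neuron in that layer.
    """
    h = np.asarray(x, dtype=float)
    for W in Ws:
        h = sigma(W @ h)   # layer-by-layer forward pass
    return h               # shape (1,) if the top layer has a single neuron

def in_class(Ws, L):
    """Check the N_{k,L,sigma} constraints: rows of W1 have l2-norm <= L,
    rows of every later layer have l1-norm <= L."""
    first_ok = np.all(np.linalg.norm(Ws[0], axis=1) <= L)              # l2 on layer 1
    rest_ok = all(np.all(np.abs(W).sum(axis=1) <= L) for W in Ws[1:])  # l1 on the rest
    return bool(first_ok and rest_ok)

# Tiny usage example: a 2-layer network on R^4 with 3 hidden neurons.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
x /= np.linalg.norm(x)   # enforce ||x||_2 <= 1
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 3))
print(in_class([W1, W2], L=5.0), network_value(x, [W1, W2]))
```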
Goal:
Learn a predictor $f: \mathcal{X} \rightarrow \mathbb{R}$ whose generalization loss is at most $\epsilon$ worse than that of the best neural network in $\mathcal{N}_{k,L,\sigma}$. (The predictor need not itself be a neural network; this is what makes the learning improper.)
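Spelled out (my paraphrase; the paper states this with explicit sample-size and confidence parameters), the learner should output a predictor $\hat f$ satisfying

$$\mathbb{E}\big[\ell(\hat f(x), y)\big] \;\le\; \min_{f \in \mathcal{N}_{k,L,\sigma}} \mathbb{E}\big[\ell(f(x), y)\big] + \epsilon$$

with high probability, for the loss $\ell$ under consideration, using time and samples polynomial in $1/\epsilon$ and the input dimension when $(k, L)$ is constant.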
What’s shown in the paper:
Claim:
Theorem 1 implies that neural networks activated by $\sigma_{erf}$ or $\sigma_{sh}$ are learnable in polynomial time for any constant $(k, L)$.
Proof Steps:
1. First show that the RKHS $F_{k,B}$ (induced by the kernel $K^k(x,y)$, which I will describe shortly) contains $\mathcal{N}_{k,L,\sigma}$, by arguing that any input to a neuron (which is a function of $x \in \mathcal{X}$) is an element of the RKHS $F_{k,B}$. (A sketch of this step is given right after this list.)
2. Then show that if we use $\sigma_{erf}$ or $\sigma_{sh}$, the norm of the input to any neuron in any layer is bounded; in other words, every $f \in \mathcal{N}_{k,L,\sigma}$ has RKHS norm at most $B$, so that $\mathcal{N}_{k,L,\sigma} \subseteq F_{k,B}$. (This part corresponds to Proposition 1 in the paper.)
3. Invoke Theorem 2.2 from (Shalev-Shwartz, 2011) to establish the claim.
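A quick sketch of why Step 1 holds for the first layer (my own paraphrase, assuming the activation has a convergent power-series expansion $\sigma(z) = \sum_{j \ge 0} a_j z^j$): the output of a first-layer neuron with weight vector $w$ is

$$\sigma(\langle w, x\rangle) \;=\; \sum_{j \ge 0} a_j \langle w, x\rangle^{j} \;=\; \sum_{j \ge 0} a_j \sum_{k_1,\dots,k_j=1}^{n} w_{k_1}\cdots w_{k_j}\; x_{k_1}\cdots x_{k_j},$$

i.e. a linear combination of the monomials $x_{k_1}\cdots x_{k_j}$, which (up to scaling) are exactly the coordinates of the feature map $\psi(x)$ described in the next section. So whenever the coefficients $a_j$ decay fast enough relative to $\lVert w \rVert$, this function has finite RKHS norm; deeper layers are handled by repeating the argument with the composed kernel $K^k$.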
To be more precise about Step 2: they first construct a quantity $H(\lambda)$ that gives an upper bound on the norm of the input to any neuron (this is shown in the proof of Lemma 1 in Appendix A), and they then show that $H(\lambda)$ is bounded if we use $\sigma_{erf}$ or $\sigma_{sh}$.
Kernel function
The kernel function they used is the following:
$$K(x,y) := \frac{1}{2 - \langle x, y \rangle}$$
The validity of this kernel is proven by explicitly constructing a mapping $\psi(x)$ s.t. $K(x,y) = \langle \psi(x), \psi(y) \rangle$.
That is, they consider $\psi(x)$ as an infinite-dimensional vector whose entries are indexed by tuples $i = (k_1, \dots, k_j)$, where $(k_1, \dots, k_j) \in \{1, \dots, n\}^j$ and $j = 0, 1, \dots, \infty$. (So if we only kept the indices of length at most $M$, $\psi(x)$ would have $\sum_{j=0}^M n^j$ entries. Of course, this is just to grasp the picture of the vector; in reality, all lengths $j$ are included.)
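Concretely, one construction that makes this work (I believe it matches the paper's, but take the exact scaling as my reconstruction) sets the entry of $\psi(x)$ indexed by $(k_1, \dots, k_j)$ to $2^{-(j+1)/2}\, x_{k_1}\cdots x_{k_j}$, with the empty tuple ($j = 0$) giving the constant entry $2^{-1/2}$. Then, since $\lvert \langle x, y \rangle \rvert \le \lVert x \rVert_2 \lVert y \rVert_2 \le 1$,

$$\langle \psi(x), \psi(y)\rangle \;=\; \sum_{j=0}^{\infty} \frac{1}{2^{j+1}} \sum_{k_1,\dots,k_j=1}^{n} (x_{k_1} y_{k_1})\cdots (x_{k_j} y_{k_j}) \;=\; \sum_{j=0}^{\infty} \frac{\langle x, y\rangle^{j}}{2^{j+1}} \;=\; \frac{1/2}{1 - \langle x, y\rangle/2} \;=\; \frac{1}{2 - \langle x, y\rangle},$$

so $K$ is indeed a valid (positive semidefinite) kernel on the unit ball.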