
Notes on “$\ell_1$-Regularized Neural Networks are Improperly Learnable in Polynomial Time”

Problem Setting

Model:

For $|| x ||_2 \le 1$, the class of functions we consider is

$\mathcal{N}_{k,L,\sigma} = \left\{ f(x) = \sigma\left( \cdots \sigma\left(\sigma(x\mathbf{W}^{(1)})\mathbf{W}^{(2)}\right) \cdots \mathbf{W}^{(k)} \right) : \left\lVert \mathbf{W}^{(1)}_i \right\rVert_2 \le L, \left\lVert \mathbf{W}^{(l)}_i \right\rVert_1 \le L \text{ for } 2 \le l \le k \right\}$

where $\mathbf{W}^{(l)}_i$ denotes the $i$th row vector of $\mathbf{W}^{(l)}$.

Assumption (in words):

  • The input vector $x$ has bounded $\ell_2$ norm: $\left\lVert x \right\rVert \le 1$
  • The edge weights in the first layer have bounded $\ell_2$ norms.
    ($\rightarrow$ each row of $\mathbf{W^{(1)}}$ is bounded)

  • The edge weights in the other layers have bounded $\ell_1$ norms.
    ($\rightarrow$ The $\ell_1$-regularization induces sparsity on the neural network.)

Goal:

Learn a predictor $f: \mathcal{X} \rightarrow \mathbb{R}$ whose generalization loss is at most $\epsilon$ worse than that of the best neural network in $\mathcal{N}_{k,L,\sigma}$

What’s shown in the paper:

Claim:

Theorem 1 implies that the class of neural networks activated by $\sigma_{erf}$ or $\sigma_{sh}$ is learnable in polynomial time for any constant $(k, L)$.

Proof Steps:

  • First show that the RKHS $F_{k,B}$ (induced by the kernel $K^k(x,y)$, which I will describe shortly) contains $\mathcal{N}_{k,L,\sigma}$, by arguing that the input to any neuron (which is a function of $x \in \mathcal{X}$) is an element of the RKHS $F_{k,B}$.

  • Then show that if we use $\sigma_{erf}$ or $\sigma_{sh}$, the RKHS norm of the input to a neuron in any layer is bounded, i.e. $||f|| \le B$, so that each such $f$ indeed lies in $F_{k,B}$. (This part corresponds to Proposition 1 in the paper.)

  • Invoke Theorem 2.2 from (Shalev-Shwartz, 2011) to establish the claim.

To be more precise about Step 2, they first construct a quantity $H(\lambda)$ that upper-bounds the norm of the input to any neuron (this is shown in the proof of Lemma 1 in Appendix A). They then show that $H(\lambda)$ is bounded if we use $\sigma_{erf}$ or $\sigma_{sh}$.

Kernel function

The kernel function they used is the following:
$$K(x,y) := \frac{1}{2 - <x,y>}$$

The validity of this kernel is proven by explicitly constructing a mapping $\psi(x)$ s.t. $K(x,y) = <\psi(x), \psi(y)>$.
That is, they consider $\psi(x)$ as an infinite-dimensional vector indexed by tuples $i = (k_1, \ldots, k_j)$, over all $(k_1,...,k_j) \in \{1,...,n\}^j$ and $j = 0, 1, \ldots, \infty$. (If we truncated the index depth at some finite $M$, $\psi(x)$ would have $\sum_{j=0}^M n^j$ entries; of course, this is just to get a picture of the vector, since in reality the depth is unbounded.)
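As a quick sanity check (my own sketch, not from the paper), we can build the Gram matrix of $K(x,y) = \frac{1}{2 - <x,y>}$ on a few random unit-norm inputs and confirm numerically that it is positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(x, y):
    # K(x, y) = 1 / (2 - <x, y>), well defined for ||x||_2, ||y||_2 <= 1
    return 1.0 / (2.0 - x @ y)

# random points on the unit sphere in R^5
X = rng.normal(size=(20, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Gram matrix and its smallest eigenvalue (should be >= 0 up to round-off)
G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
print(np.linalg.eigvalsh(G).min())
```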

Recursive Kernel Method

Model Uncertainty Implementation Notes

  • Model uncertainty can be obtained from dropout NN models.

Our approximate predictive distribution is
$$q(y^* | x^*) = \int p(y^*| x^*, w) q(w) dw$$
where $w = \{ W_i \}_{i=1}^L$ is the set of random variables for a model with $L$ layers.

Sample $T$ sets of Bernoulli-distributed random masks, which gives us $T$ realizations of the weights $\{ W_1^t,...,W_L^t \}_{t=1}^T$. Then
$$E_{q(y^*|x^*)} (y^*) \approx \frac{1}{T} \sum_{t=1}^T \hat{y}^*(x^*, W_1^t,...,W_L^t)$$

This is equivalent to performing $T$ stochastic forward passes through the network and averaging the results.

As for the variance,

$$\text{Var}_{q(y^*|x^*)} (y^*) \approx \tau^{-1} I + \frac{1}{T} \sum_{t=1}^T \hat{y}^*(x^*, W_1^t,...,W_L^t)^T \hat{y}^*(x^*, W_1^t,...,W_L^t) - E_{q(y^*|x^*)}(y^*)^T E_{q(y^*|x^*)}(y^*)$$

This is equivalent to the sample variance of the $T$ stochastic forward passes through the NN plus the inverse model precision.

Model precision can be found by the following identity:
$$\tau = \frac{p l^2}{2 N \lambda}$$

We can estimate our predictive log-likelihood by Monte Carlo integration of the predictive probability of $y$, which is an estimate of how well the model fits the mean and uncertainty.

For regression, this is given by
$$\log p(y^*|x^*, X, Y) \approx \text{logsumexp}_t \left( - \frac{1}{2} \tau ||y^* - \hat{y}^*_t||^2 \right) - \log T - \frac{1}{2} \log 2\pi + \frac{1}{2} \log \tau$$
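A minimal sketch (my own, assuming scalar outputs and the formula above, with hypothetical variable names) of how one might estimate the per-point predictive log-likelihood from the $T$ stochastic forward passes:

```python
import numpy as np
from scipy.special import logsumexp

def predictive_log_likelihood(y_true, y_hat_samples, tau):
    """y_true: scalar target; y_hat_samples: length-T array of stochastic predictions."""
    T = len(y_hat_samples)
    ll = logsumexp(-0.5 * tau * (y_true - y_hat_samples) ** 2)
    return ll - np.log(T) - 0.5 * np.log(2 * np.pi) + 0.5 * np.log(tau)

# toy usage: T = 100 noisy predictions scattered around the true value
samples = 1.0 + 0.1 * np.random.randn(100)
print(predictive_log_likelihood(1.0, samples, tau=100.0))
```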

Procedure

Given point $x$:

  • Drop units at test time
  • Repeat $T$ times
  • Look at mean and sample variance
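A minimal numpy sketch of this procedure (my own illustration with a made-up single-hidden-layer ReLU network and untrained weights `W1`, `W2`; only the mechanics matter here), keeping dropout on at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical weights of a single-hidden-layer network (untrained, for illustration)
D, K = 3, 50                      # input dim, hidden units
W1, b = rng.normal(size=(D, K)), np.zeros(K)
W2 = rng.normal(size=(K, 1))
p = 0.9                           # probability of keeping a unit

def stochastic_forward(x):
    # drop hidden units at test time, exactly as during dropout training
    mask = rng.binomial(1, p, size=K)
    h = np.maximum(x @ W1 + b, 0.0) * mask
    return (h @ W2).item()

x_star = rng.normal(size=D)
T = 200
preds = np.array([stochastic_forward(x_star) for _ in range(T)])

tau = 1.0                         # model precision, from tau = p l^2 / (2 N lambda)
mean = preds.mean()
var = 1.0 / tau + preds.var()     # sample variance plus inverse model precision
print(mean, var)
```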

RKHS Intuition

At first the RKHS looked scary, but things became much simpler once I realized that it is just a space in which every element is a linear combination of (positive-definite) kernel functions, which lets us replace inner product calculations in this space with kernel evaluations.

After all, the following two facts lead to the reproducing property (which can be verified by simple algebra):

  • the RKHS is spanned by kernels

  • the “regular” inner product definition

For example, take two functions $f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i)$ and $g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j)$.
Define the inner product:
$<f, g>_{\mathcal{H}_k} = \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$

Then,
$<f, k(\cdot, x'_k)> = \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j) = \sum_{i=1}^m \alpha_i k(x_i,x'_k) = f(x'_k)$
where I set $\beta_j = 0$ for $j \neq k$ and $\beta_k = 1$ to represent $k(\cdot, x'_k)$ as a linear combination of kernels in the Hilbert space.
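A tiny numerical check of this (my own sketch, using a Gaussian kernel): build $f = \sum_i \alpha_i k(\cdot, x_i)$, compute $<f, k(\cdot, x)>_{\mathcal{H}_k}$ through the kernel values, and confirm it equals $f(x)$:

```python
import numpy as np

def k(x, y, gamma=0.5):
    # Gaussian (RBF) kernel, positive definite
    return np.exp(-gamma * (x - y) ** 2)

rng = np.random.default_rng(1)
xs = rng.normal(size=5)            # centers x_1, ..., x_m
alphas = rng.normal(size=5)        # coefficients alpha_1, ..., alpha_m

def f(x):
    # f(.) = sum_i alpha_i k(., x_i)
    return sum(a * k(x, xi) for a, xi in zip(alphas, xs))

x_new = 0.3
# <f, k(., x_new)>_H = sum_i alpha_i k(x_i, x_new)
inner = sum(a * k(xi, x_new) for a, xi in zip(alphas, xs))
print(inner, f(x_new))             # the two values coincide
```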

Often the reproducing property is used to define an RKHS (perhaps because it is simpler), but at least for me it was easier to grasp the concept by viewing the reproducing property as a natural consequence of the two facts above, rather than as a condition imposed in order to construct the RKHS, although this is a bit of a chicken-and-egg distinction.

On Cross Validation

Setting

  • We have models $\mathcal{M}_1, …, \mathcal{M}_k$.
  • There are $2n$ data points
  • Split the data randomly into two halves: $D = (Y_1, …, Y_n)$ and $T = (Y^*_1,…,Y^*_n)$.

Definition 1

A random variable $X$ with mean $\mu = E[X]$ is sub-Gaussian if there is a positive number $\sigma$ such that

$$E[e^{\lambda (X - \mu)}] \le e^{\frac{\sigma^2 \lambda^2}{2}} \quad \text{for all } \lambda \in \mathbb{R}$$
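For example (a standard fact, not from these notes), a Gaussian random variable is sub-Gaussian with parameter equal to its standard deviation, since its centered moment generating function attains the bound with equality:

$$X \sim N(\mu, \sigma^2) \Rightarrow E\left[e^{\lambda (X - \mu)}\right] = e^{\frac{\sigma^2 \lambda^2}{2}} \text{ for all } \lambda \in \mathbb{R}$$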

Step 1:

  • Find MLE $\hat{\theta}_j$ using $D$.

(WIP)


Supplementary Notes for Dropout as a Bayesian Approximation

(notes to myself)

Summary

The dropout objective minimises the KL divergence between an approximate distribution and the posterior of a deep Gaussian process (marginalized over its finite rank covariance function parameters).

Gaussian Processes

  • models distributions over functions.
  • offers a way to get uncertainty estimates over the function values, robustness to over-fitting, and principled ways of hyper-parameter tuning.

How to get uncertainty information from these deep learning models for free – without changing a thing

  • it’s long been known that these deep tools can be related to Gaussian processes
  • What are the uncertainty estimates that we can get from Gaussian processes?
    – Variance of a predictive distribution.

  • So when we say we can get uncertainty information from a dropout network, it means that we can simply calculate the variance of the predictive distribution induced by the dropout network.

How exactly do we calculate the variance of the predictive distribution?

    1. Define a prior length-scale $l$, which captures our belief over the function frequency. A short length-scale $l$ corresponds to high-frequency data, and a long length-scale corresponds to low-frequency data.
    2. Take the length-scale squared, and divide it by the weight decay. We then scale the result by half the dropout probability over the number of data points. That is,
$$\tau = \frac{l^2 p}{2 N \lambda}$$

where $p$ indicates the probability of the units not being dropped. (usually $p$ is the probability of units being dropped, so be careful!)

    3. This $\tau$ is the Gaussian process precision. (To see why, see the derivation part.)
    4. Next, simulate the network on a test point $x^*$ with dropout on. Repeat this $T$ times with different units dropped, and collect the results $\{ \hat{y}^*_t \}_{t=1}^T$. These are empirical samples from our approximate predictive posterior distribution.
    5. We can get an empirical estimator for the predictive mean of our approximate posterior, as well as the predictive variance (our uncertainty), from these samples as follows:
$$E[ y^*] \approx \frac{1}{T} \sum_{t=1}^T \hat{y}^*_t(x^*) \\ Var[y^*] \approx \tau^{-1} I + \frac{1}{T} \sum_{t=1}^T \hat{y}^*_t (x^*)^T \hat{y}^*_t (x^*) - E[y^*]^T E[y^*]$$

Bayesian approach to function approximation

Given a training set $\{ X_i, Y_i \}_{i=1}^n$, we want to estimate a function $y = f(x)$ that is likely to describe the relationship between $X$ and $Y$.

Following the Bayesian approach, we can put some prior distribution over the space of functions $p(f)$.
We then look for the posterior distribution given data $(X,Y)$:

$$p(f|X,Y) \propto p(Y|X, f) p(f)$$

This distribution captures the most likely functions given $(X,Y)$.
A prediction can be done in the following way:

$$p(y^*|x^*, X,Y) = \int p(y^*|x^*,f) p(f|X,Y) df \\ \iff p(y^*|x^*, X,Y) = \int p(y^*|f) p(f|x^*,X,Y) df$$

A covariance function is like a kernel function: it takes two arguments and returns the similarity between them. Therefore, given an $n \times p$ data matrix, this function induces an $n \times n$ covariance matrix.

How do these functions correspond to the Gaussian process?

Why can we think of the (dropout) noise as performing approximate integration?

  • Because averaging forward passes through the network is equivalent to Monte Carlo integration over a Gaussian process posterior approximation.

Derivation

  • averaging forward passes:
$$p(y^* | x^*, X, Y) = \int p(y^*|x^*,\omega) p(\omega|X,Y) d\omega$$

where one forward pass calculation (given all the weights $\omega$ and the test input $x^*$) is equal to one sample draw from this distribution $p(y^*|x^*,\omega)$, where we define

$$p(y^*|x^*,\omega) = N(y^*; \sqrt{\frac{1}{K}} \sigma(x^* W_1 + b) W_2, \tau^{-1} I_n)$$

Gaussian Process

  • A Gaussian process is completely specified by its mean function and covariance function.

  • The predictions from a GP model take the form of a full predictive distribution

  • A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

  • The random variables represent the value of the function $f(x)$ at location $x$.

  • A GP is often introduced with time as the index set of the random variables, but that need not be the case.

  • Usually, the index set $\mathcal{X}$ is the set of all possible inputs, i.e. $\mathbb{R}^D$.

Question: When you say “distribution over functions”, it means we put some probability mass on every possible function in the function space of interest. How exactly do we assign such mass, and in what way?

-> We can see this from the fact that we can draw samples from the distribution of function values evaluated at any number of points; i.e. we choose a set of input points $X_*$ and write out the corresponding covariance matrix by applying the pre-defined covariance function elementwise. Then we generate a random Gaussian vector with this covariance matrix:

$$f_* \sim N(0, K(X_*, X_*))$$
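A short numpy sketch of this sampling step (my own, with a squared-exponential covariance function and a small jitter term for numerical stability):

```python
import numpy as np

def se_cov(a, b, length_scale=1.0):
    # squared-exponential covariance k(a, b) = exp(-(a - b)^2 / (2 l^2))
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

X_star = np.linspace(-5, 5, 100)
K = se_cov(X_star, X_star) + 1e-6 * np.eye(len(X_star))  # jitter keeps K positive definite

# draw f_* ~ N(0, K(X_*, X_*)) via the Cholesky factor
L = np.linalg.cholesky(K)
f_star = L @ np.random.randn(len(X_star))
print(f_star[:5])
```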

(WIP)

Reference:

Yarin Gal’s blog post about the paper

The paper: “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”

Textbook: Gaussian Processes for Machine Learning

Pathwise Estimator

Assumption

  • $z = t(\epsilon, v)$ for $\epsilon \sim s(\epsilon)$ implies $z \sim q(z; v)$.

Example

$$\epsilon \sim N(0,1) \\ z = t(\epsilon, v) \text{, where } t(\epsilon,v) = \epsilon v_1 + v_0 \\ \Rightarrow z \sim q(z; v) \text{, where } q(z; v) = N(v_0, v_1^2)$$

That is, the distribution of $z$, $q(z; v)$, is parameterized by $v = [v_0, v_1]$, with mean $\mu = v_0$ and variance $\sigma^2 = v_1^2$.

  • $\log p(x,z)$ and $\log q(z; v)$ are differentiable with respect to $z$.

Usage

Recall that the ELBO is the following:

$$L(v) = E_{q(z;v)}[\log p(x,z) - \log q(z;v)]$$

If we define $g(z,v) = \log p(x,z) - \log q(z;v)$, then we can express $\nabla_v L(v)$ as:

$$\nabla_v L(v) = \nabla_v \int g(z,v) q(z; v) dz \\ = \int \nabla_v q(z; v) g(z,v) + q(z; v) \nabla_v g(z,v) dz \\ = \int q(z;v) \nabla_v \log q(z;v) g(z,v) + q(z;v) \nabla_v g(z,v) dz \\ = E_{q(z;v)} \left [ \nabla_v \log q(z;v) g(z,v) + \nabla_v g(z,v) \right ]$$

We can rewrite the above gradient using $z = t(\epsilon, v)$:

$$\nabla_v L(v) = \int [\nabla_v \log q(z;v) g(z,v) + \nabla_v g(z,v)] q(z;v) dz \\ = E_{s(\epsilon)} \left [ \nabla_v \log s(\epsilon) g(t(\epsilon, v),v) + \nabla_v g(t(\epsilon, v),v) \right ]$$

Since $\log s(\epsilon)$ doesn’t depend on $v$, the first term will disappear. Then,

$$\nabla_v L(v) = E_{s(\epsilon)} [\nabla_v g(t(\epsilon, v),v)] \\ = E_{s(\epsilon)} \left[ \nabla_v \left( \log p(x, t(\epsilon, v)) - \log q(t(\epsilon, v);v) \right) \right] \\ = E_{s(\epsilon)} \left[ \nabla_z \left( \log p(x,z) - \log q(z;v) \right) \Big|_{z = t(\epsilon, v)} \nabla_v t(\epsilon, v) - \nabla_v \log q(z; v) \Big|_{z = t(\epsilon, v)} \right]$$
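A minimal numerical check of the pathwise estimator (my own toy example, not from the tutorial): take $g(z) = z^2$ as a stand-in for $\log p(x,z) - \log q(z;v)$, so that $L(v) = E_{q(z;v)}[z^2] = v_0^2 + v_1^2$ has the known gradient $(2v_0, 2v_1)$, and compare it with the Monte Carlo pathwise estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy objective L(v) = E_{q(z;v)}[z^2] with q(z;v) = N(v0, v1^2);
# closed form is v0^2 + v1^2, so the true gradient is (2 v0, 2 v1)
v0, v1 = 1.5, 0.7
S = 100_000

eps = rng.standard_normal(S)       # eps ~ s(eps) = N(0, 1)
z = v0 + v1 * eps                  # z = t(eps, v), so z ~ N(v0, v1^2)

# pathwise (reparameterization) gradients: d(z^2)/dv0 = 2z, d(z^2)/dv1 = 2z * eps
grad_v0 = np.mean(2 * z)
grad_v1 = np.mean(2 * z * eps)

print(grad_v0, 2 * v0)             # both close to 3.0
print(grad_v1, 2 * v1)             # both close to 1.4
```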

(WIP)


Nonparametric Regression 03

In this post, we will modify the blockwise estimation procedure we discussed last time, in order to reduce the constant factor that appeared in our upper bound on the expected risk.

Alternative Blockwise Estimator

The issue is that the number of blocks could be very small compared to the number of observations. For example, when $n=1024$, the number of blocks is only $K = 10$.

Previously, we set each block size to be a power of 2. That is, $|B_k| = 2^k$.

To have a smaller blocksize, we can set $|B_k| = \lfloor (1+a)^k \rfloor$, where $0 < a < 1$.

In fact, if we set $a = \frac{1}{\log_2 n}$, we see that:

$$n = \sum_{k=1}^K \left( 1 + \frac{1}{\log_2 n} \right)^k = \cdots$$
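A quick sketch (my own) comparing the two blocking schemes for $n = 1024$: powers of two give only $K = 10$ blocks, while $|B_k| = \lfloor (1+a)^k \rfloor$ with $a = \frac{1}{\log_2 n}$ gives many more, smaller blocks:

```python
import math

n = 1024

def num_blocks(block_size):
    """Count blocks until the cumulative block size covers all n observations."""
    total, k = 0, 0
    while total < n:
        k += 1
        total += block_size(k)
    return k

# scheme 1: |B_k| = 2^k (dyadic blocks)
K_dyadic = num_blocks(lambda k: 2 ** k)

# scheme 2: |B_k| = floor((1 + a)^k) with a = 1 / log2(n)
a = 1.0 / math.log2(n)
K_slow = num_blocks(lambda k: math.floor((1 + a) ** k))

print(K_dyadic, K_slow)   # 10 dyadic blocks vs. many more, smaller blocks
```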

(WIP)

Nonparametric Regression 02

In the last post, we investigated a Gaussian sequence model and derived the minimax rate:

$$\inf_{\hat{\theta}} \sup_{\Theta} E || \hat{\theta} - \theta ||^2 \asymp M^{\frac{1}{2\alpha + 1}} n^{-\frac{2\alpha}{2\alpha+1}}$$

The crucial assumption in deriving this bound is that $\alpha$ and $M$ are known; together they define the Sobolev ellipsoid that serves as our parameter space.

In this post, we remove that assumption and attempt to achieve the same risk bound (up to some constant).

To recap, the following are our model description and parameter space.

Model

$$Y_i = \theta_i + \frac{1}{\sqrt{n}} \sigma Z_i, \text{ where } Z_i \sim^{iid} N(0,1)$$

Parameter space

$$\Theta = \{ \theta : \sum_{i=1}^{\infty} i^{2 \alpha} \theta^2_i \le M \}$$

Adaptive Estimation

Recall that the naive estimation procedure in the previous post simply uses the observation itself as the estimator for small $i$ and $0$ for large $i$.
It worked because we were able to optimize the threshold $I$ based on $\alpha$ and $M$.
Now, since we no longer have access to $\alpha$ and $M$, we have to take an alternative estimation procedure.
Since the only information available to us is the data, we want our procedure to use that information in a meaningful way.

To get a hint, we will look into how the James-Stein estimator was conceived.

James-Stein Estimator

The James-Stein estimator is a linear-looking procedure of the form $\delta(X) = c(X) X$, where the shrinkage coefficient $c(X)$ is data-dependent.
(That is, the JS estimator adapts to $|| \theta ||^2$.)

If we have $m$ observations such that

$$X_i = \theta_i + \sigma Z_i , Z_i \sim^{i.i.d} N(0,1)$$

Then the JS estimator is

$$\hat{\theta}_{JS} = (1 - \frac{(m-2)\sigma^2}{\sum_{i=1}^m X_i^2} ) X$$

where $X$ represents an $m$-dimensional vector (so we stack all $m$ observations into one vector).

We say that the JS estimator mimics the best linear estimator because the difference between the two risks is bounded by a constant:

$$E || \hat{\theta}_{JS} - \theta ||^2 - \inf_c E || c X - \theta ||^2 \le \text{constant}$$

To see this, we will investigate each term and show the difference is bounded.
(assume $\sigma=1$ without loss of generality.)

The First Term

$$E || \hat{\theta}_{JS} - \theta ||^2 = E_{X|\theta} \sum_{i=1}^m \left(X_i - \frac{m-2}{||X||^2} X_i - \theta_i\right)^2 \\ = E_{X|\theta} \sum_{i=1}^m \left( (X_i - \theta_i)^2 + \frac{(m-2)^2}{(||X||^2)^2} X_i^2 - 2(X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right) \\ = m + E_{X|\theta} \left(\frac{(m-2)^2}{||X||^2} \right) - 2 \sum_{i=1}^m E_{X|\theta} \left[ (X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right]$$

To go further, we will need Stein’s identity (simplified version):

$$\text{For } Y \sim N(\theta, 1), \text{ we have } E_{Y|\theta} (Y - \theta) g(Y) = E_{Y|\theta} g'(Y)$$

under some regularity assumption for $g$.

In our case, $Y=X_i$ and $g(X_i) = \frac{m-2}{||X||^2} X_i$.

By the above identity, we have

$$\sum_{i=1}^m E_{X|\theta} \left[ (X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right] = \sum_{i=1}^m E_{X|\theta} \frac{\partial}{\partial X_i}\left(\frac{m-2}{||X||^2}X_i\right) \\ = \sum_{i=1}^m E_{X|\theta}\left( \frac{m-2}{||X||^2} - \frac{2(m-2) X_i^2}{(||X||^2)^2} \right) \\ = E_{X|\theta} \left(\frac{m(m-2)}{||X||^2} - \frac{2 (m-2)}{||X||^2} \right) \\ = E_{X|\theta} \left(\frac{(m-2)^2}{||X||^2} \right)$$

This gives us

$$E || \hat{\theta}_{JS} - \theta ||^2 = m + E_{X|\theta} \left(\frac{(m-2)^2}{||X||^2} \right) - 2 \sum_{i=1}^m E_{X|\theta} \left[ (X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right] \\ = m - E_{X|\theta} \frac{(m-2)^2}{||X||^2}$$

Now, to get an upper bound on this, we will apply Jensen's inequality, but before that we use the following property of the noncentral chi-squared distribution to simplify (noncentral because the $X_i$ have nonzero means $\theta_i$):

$$||X||^2 = \sum_{i=1}^m X_i^2 \sim \chi^2_{m+2N}, \text{ where } N \sim Poisson \left(\frac{||\theta||^2}{2} \right)$$

By conditioning on $N$, we have

$$m - E_{X|\theta} \frac{(m-2)^2}{||X||^2} = m - E_N \left[ E_{X|N} \left[ \frac{(m-2)^2}{||X||^2} \,\middle|\, N \right] \right] \\ = m - E_N \frac{(m-2)^2}{m+2N-2} \quad \left(\text{since } E\frac{1}{\chi^2_d} = \frac{1}{d-2} \right) \\ \le m - \frac{(m-2)^2}{m-2 + 2E_N[N]} \quad \left(\text{Jensen's inequality: } E \frac{1}{Y} \ge \frac{1}{EY} \right) \\ = m - \frac{(m-2)^2}{m-2 + ||\theta||^2} \\ = 2 + (m - 2) - \frac{(m-2)^2}{m-2 + ||\theta||^2} \\ = 2 + \frac{(m-2)||\theta||^2}{m-2 + ||\theta||^2} \\ \le 2 + \frac{m||\theta||^2}{m + ||\theta||^2}$$

When $\sigma \neq 1$, it’s going to be

$$E ||\hat{\theta} - \theta||^2 \le 2\sigma^2 + \frac{m \sigma^2 ||\theta||^2}{m\sigma^2 + ||\theta||^2}$$

The Second Term

We can see that
$$\inf_c E || c X - \theta ||^2 = \frac{m \sigma^2 ||\theta||^2}{||\theta||^2 + m \sigma^2}$$

by optimizing over $c$ to attain the infimum.

First, consider the bias-variance decomposition for squared loss:

$$E || cX - \theta ||^2 = (c-1)^2 ||\theta||^2 + c^2 m \sigma^2$$

By taking the derivative w.r.t. $c$ and setting it equal to zero, we have

$$2(c-1) ||\theta||^2 + 2cm\sigma^2 = 0 \\ c = \frac{||\theta||^2}{||\theta||^2 + m \sigma^2}$$

By plugging this value into $E||cX - \theta||^2$, we have

$$(c-1)^2 ||\theta||^2 + c^2 m \sigma^2 = (\frac{-m\sigma^2}{||\theta||^2 + m \sigma^2})^2 ||\theta||^2 + (\frac{||\theta||^2}{||\theta||^2 + m \sigma^2})^2 m \sigma^2 \\ = \frac{m \sigma^2 ||\theta||^2 }{||\theta||^2 + m \sigma^2 }$$

By combining the results from First Term and Second Term, we have

$$E || \hat{\theta}_{JS} - \theta ||^2 - \inf_c E || c X - \theta ||^2 \le 2 \sigma^2$$

which shows that the risk difference is bounded by $2 \sigma^2$, a constant that does not depend on $\theta$.
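A small simulation (my own sketch, for one fixed $\theta$) of the two risks; the gap stays below $2\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

m, sigma = 20, 1.0
theta = rng.normal(size=m)                 # fixed true parameter
reps = 20_000

X = theta + sigma * rng.standard_normal((reps, m))

# James-Stein estimate for each replication
shrink = 1 - (m - 2) * sigma**2 / np.sum(X**2, axis=1, keepdims=True)
risk_js = np.mean(np.sum((shrink * X - theta) ** 2, axis=1))

# best linear estimate c * X with the oracle c = ||theta||^2 / (||theta||^2 + m sigma^2)
c = np.sum(theta**2) / (np.sum(theta**2) + m * sigma**2)
risk_lin = np.mean(np.sum((c * X - theta) ** 2, axis=1))

print(risk_js - risk_lin)                  # bounded by 2 * sigma^2 = 2
```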

Shrinkage Estimator

Now let’s go back to our original problem of constructing an estimation procedure that achieves

$$E || \hat{\theta} - \theta ||^2 \asymp M^\frac{1}{1+2 \alpha} n^{- \frac{2 \alpha}{2 \alpha+1}}$$

By leveraging the idea of James-Stein estimator, we can think of a form of shrinkage estimator for this problem as well.
However, simply using $\hat{\theta} = \hat{\theta}_{JS}$ doesn't work, because JS implicitly assumes that the $\theta_i$ ($i=1,…,m$) are all of roughly the same magnitude, which means that when some $\theta_i$ are big and others are small, the performance will be really bad.
This makes sense because we shrink every coordinate of $X$ by the same factor: $(1 - \frac{c}{||X||^2})$

The natural extension of the JS procedure is to divide the observations into many blocks and apply a different shrinkage factor to each block, according to its magnitude. (WLOG, we can assume our observations are sorted in decreasing order.)

Risk Calculation

The modified JS-like estimation procedure leads us to the following risk:

$$E || \hat{\theta} - \theta ||^2 = E \sum_{i=1}^n (\hat{\theta}_i - \theta_i)^2 + \sum_{i=n+1}^{\infty} \theta_i^2 \\ \le \sum_{k=1}^K \left[ \frac{|B_k| \frac{1}{n} ||\theta_{B_k}||^2}{|B_k|\frac{1}{n}+||\theta_{B_k}||^2} + 2 \frac{1}{n} \right] + O(n^{-\frac{2\alpha}{2\alpha+1}})$$

where $K$ is the number of blocks, $B_k$ represents the $k$th block, and $|B_k|$ is the size of that block.
We can naively set the number of elements in each block to be a power of 2, which allows us to write $K = \log_2 n$, where $n$ is the total number of observations.

The correspondence from the James-Stein estimator risk upper bound is:

$$\sigma^2 = \frac{1}{n}, ||\theta||^2 = ||\theta_{B_k}||^2, m = |B_k|$$
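A sketch of the blockwise procedure as I read it (my own implementation, using the positive-part variant of the shrinkage factor for stability):

```python
import numpy as np

def blockwise_js(Y, sigma2):
    """Apply James-Stein shrinkage separately within dyadic blocks |B_k| = 2^k."""
    theta_hat = np.zeros_like(Y)
    start, size = 0, 2
    while start < len(Y):
        block = Y[start:start + size]
        m = len(block)
        if m > 2:
            # positive-part JS shrinkage factor for this block
            shrink = max(0.0, 1.0 - (m - 2) * sigma2 / np.sum(block ** 2))
        else:
            shrink = 1.0
        theta_hat[start:start + m] = shrink * block
        start += m
        size *= 2
    return theta_hat

# toy usage on the Gaussian sequence model with sigma^2 = 1/n
rng = np.random.default_rng(0)
n = 1024
theta = np.arange(1, n + 1, dtype=float) ** -2.0      # a smooth, decaying truth
Y = theta + rng.standard_normal(n) / np.sqrt(n)
print(np.sum((blockwise_js(Y, 1.0 / n) - theta) ** 2))
```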

Recall that we introduced the threshold index $I$ by considering the bias-variance trade-off.
This time, we will try to find $K_0$ (a number of blocks) such that $I \le |B_1| + \cdots + |B_{K_0}| \le 2I$ holds.

Now observe that for any positive real numbers $a$ and $b$, we have $\frac{ab}{a+b} \le a$ and $\frac{ab}{a+b} \le b$.
Therefore,

$$\sum_{k=1}^K \left[ \frac{|B_k| \frac{1}{n} ||\theta_{B_k}||^2}{|B_k|\frac{1}{n}+||\theta_{B_k}||^2} + 2 \frac{1}{n} \right] + O(n^{-\frac{2\alpha}{2\alpha+1}}) \\ \le \sum_{k=1}^{K_0} |B_k|\frac{1}{n} + \sum_{k=K_0+1}^K ||\theta_{B_k}||^2 + \frac{2K}{n} + O(n^{-\frac{2 \alpha}{2\alpha+1}}) \\ \le \frac{2I}{n} + \sum_{i > I} \theta_i^2 + \frac{2K}{n} + O(n^{-\frac{2 \alpha}{2\alpha+1}})$$

From the calculation in the last post, we see that this will be upper bounded by

$$\le c M^{\frac{1}{1+2\alpha}} n^{-\frac{2\alpha}{2\alpha+1}}$$

One criticism of this procedure is that the constant $c$ could be huge.
Is there a way to reduce it? We will explore an alternative blockwise procedure next time.

Nonparametric Regression 01

The goal of this post is to show an example derivation of a risk bound for nonparametric regression.

Model

For simplicity, we will focus on Gaussian sequence models. That is, our model of interest is the following type:

$$Y_i = \theta_i + \frac{1}{\sqrt{n}} \sigma Z_i, \text{ where } Z_i \sim^{iid} N(0,1)$$

for $i = 1, …, n$, where we assume $\sigma=1$. In other words, we observe a noisy parameter $Y$, and we are interested in denoising $Y$ to recover the true parameter $\theta$.

Parameter space

Our parameter space is a type of Sobolev ellipsoid:
$$\Theta = \{ \theta : \sum_{i=1}^{\infty} i^{2 \alpha} \theta^2_i \le M \}$$

This way of choosing our parameter space implicitly imposes a smoothness assumption on the function space we might be interested in. For example, if we view our estimation problem as function estimation, the problem becomes estimating a function $f(x)$ from a set of discrete sample pairs $(X_i, Y_i)$.
By Fourier basis expansion,

$$f(x) = \sum_{i=1}^{\infty} <f, \varphi_i> \varphi_i(x) \\ = \sum_{i=1}^{\infty} \theta_i \varphi_i(x)$$

where $\theta_i = <f, \varphi_i>$ and

$$\varphi_i(x) = \begin{cases} \cos(cix) \\ \sin(cix) \end{cases}$$

Imposing a smoothness assumption on $f$ by bounding the integral of $f$'s second derivative is equivalent to the condition in our parameter space. That is,

$$f''(x) = \sum_{i=1}^{\infty} \theta_i (\varphi_i(x))'' \\ = \sum_{i=1}^{\infty} \theta_i (ci)^2 \varphi_i(x)$$

So

$$\int (f''(x))^2 dx \le M' \iff \sum_{i=1}^{\infty} [\theta_i (ci)^2]^2 \le M' \iff \sum_{i=1}^{\infty} c^4 i^4 \theta_i^2 \le M'$$

(In this derivation, we’ve assumed $\alpha=2$.)

Note that without any parameter space condition, the best estimation procedure is the linear procedure. We can show this through the Pinsker bound, which we might cover in future posts under the decision-theory tag.

Goal of inference problem

Estimate $\theta$. Our loss function is

$$|| \hat{\theta} - \theta ||^2 = \sum_{i=1}^{\infty} (\hat{\theta}_i - \theta_i)^2$$

Estimation Procedure

Intuitively, due to our parameter space restriction, $\theta_i$ has to become smaller as $i$ grows, and beyond some point it becomes negligible. This observation leads to the following naive estimator:

$$\hat{\theta}_i = \begin{cases} Y_i & i \le I \\ 0 & i > I \end{cases}$$

where $I$ is the threshold value we will optimize soon (in terms of obtaining a tighter risk upper bound).

Risk Calculation

With this procedure, we can easily calculate the risk in each case. When $\hat{\theta}_i = Y_i$, its risk is $E (Y_i - \theta_i)^2 = \frac{1}{n}$ (this is just the variance of $Y_i$, which is $\frac{1}{n}$ by the definition of our model). When $\hat{\theta}_i = 0$, its risk is $E (0 - \theta_i)^2 = \theta_i^2$ (note that the expectation is taken over the data, and $\theta$ is a deterministic value).

From this simple algebra, we observe that if $\theta^2_i \le \frac{1}{n}$, we want to use $\hat{\theta}_i = 0$, which is what the $i > I$ condition imposes.

The risk of this procedure is given by

$$R(\hat{\theta}) = E \sum_{i=1}^{\infty} (\hat{\theta}_i - \theta_i)^2 = E \sum_{i=1}^I (Y_i - \theta_i)^2 + \sum_{i=I+1}^{\infty} \theta^2_i$$

Risk upper bound

We immediately see that the first term is equal to $\frac{I}{n}$.
The question is how to upper bound the second term.
Recall the condition in our parameter space definition: $\sum_{i=1}^{\infty} i^{2 \alpha} \theta_i^2 \le M$. From this, we observe the following inequalities:

$$\sum_{i=1}^{\infty} i^{2 \alpha} \theta_i^2 \le M \\ \Rightarrow \sum_{i = I + 1}^{\infty} i^{2 \alpha} \theta_i^2 \le M \\ \Rightarrow (I+1)^{2\alpha} \sum_{i = I + 1}^{\infty} \theta_i^2 \le M \\ \Rightarrow \sum_{i = I + 1}^{\infty} \theta_i^2 \le \frac{M}{(I+1)^{2\alpha}} \le \frac{M}{I^{2 \alpha}}$$

So our upper bound is

$$R(\hat{\theta}) \le \frac{I}{n} + \frac{M}{I^{2 \alpha}}$$

To find an optimal $I$, we will treat $\frac{M}{I^{2 \alpha}}$ as $Bias^2$, and pick $I$ to achieve the optimal variance-bias tradeoff. This leads to

$$\frac{I}{n} = \frac{M}{I^{2 \alpha}} \iff I^{1+2\alpha} = Mn \iff I = (Mn)^{\frac{1}{1+2\alpha}}$$

Therefore,

$$R(\hat{\theta}) \le \frac{I}{n} + \frac{M}{I^{2 \alpha}} = \frac{2}{n} (Mn)^{\frac{1}{1+2\alpha}} = 2 M^{\frac{1}{1+2\alpha}} n^{-\frac{2\alpha}{1+2\alpha}}$$

This leads to our risk upper bound:

$$sup_{\Theta} E || \hat{\theta} - \theta ||^2 \le 2 M^{\frac{1}{1+2\alpha}} n^{-\frac{2\alpha}{1+2\alpha}}$$
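A quick simulation (my own sketch, with one particular smooth $\theta$) confirming that the risk of the truncation estimator stays below the bound $2 M^{\frac{1}{1+2\alpha}} n^{-\frac{2\alpha}{1+2\alpha}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, alpha = 10_000, 2.0

# a smooth truth theta_i = i^{-(alpha + 1)}; compute the M it corresponds to
i = np.arange(1, 20 * n + 1, dtype=float)
theta = i ** (-(alpha + 1.0))
M = np.sum(i ** (2 * alpha) * theta ** 2)

I = int((M * n) ** (1.0 / (1 + 2 * alpha)))           # optimal truncation level

reps, risks = 200, []
for _ in range(reps):
    Y = theta[:n] + rng.standard_normal(n) / np.sqrt(n)
    theta_hat = np.where(np.arange(1, n + 1) <= I, Y, 0.0)
    # squared-error loss, including the bias from coordinates beyond i = n
    risks.append(np.sum((theta_hat - theta[:n]) ** 2) + np.sum(theta[n:] ** 2))

bound = 2 * M ** (1 / (1 + 2 * alpha)) * n ** (-2 * alpha / (1 + 2 * alpha))
print(np.mean(risks), bound)                          # empirical risk stays below the bound
```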

(We will try to find a better rate in some later posts.)

Risk lower bound

To find a lower bound, we resort to Le Cam's idea.

We first construct a sub-parameter space so that the resulting risk calculation over the sub-parameter space is simple enough. Considering a sub-parameter space is useful because, for any sub-parameter space $\Theta_0 \subset \Theta$, we have the following lower bound on the sup risk over $\Theta$:

$$sup_{\Theta} E || \hat{\theta} - \theta||^2 \ge sup_{\Theta_0} E || \hat{\theta} - \theta ||^2$$

In our case, let our sub-parameter space be

$$\Theta_0 = \{ \theta : \theta_i = 0 \text{ if } i \ge I + 1, \theta_i = \frac{1}{\sqrt{n}} \text{ if } i \le I \}$$

where $I$ is the previous optimal value : $I = (Mn)^{\frac{1}{1+2\alpha}}$.

We can confirm that $\Theta_0$ is indeed in $\Theta$ by checking the condition $\sum_{i=1}^{\infty} i^{2\alpha} \theta_i^2 \le M$:

$$\sum_{i=1}^I i^{2 \alpha} \frac{1}{n} = \frac{1}{n} \sum_{i=1}^I i^{2\alpha} \le \frac{1}{n} I I^{2\alpha} = \frac{1}{n} Mn = M$$

Now, we want to lower bound $sup_{\Theta_0} E || \hat{\theta} - \theta ||^2$:

$$sup_{\Theta_0} E || \hat{\theta} - \theta ||^2 \ge sup_{\Theta_0} E \sum_{i=1}^I (\hat{\theta}_i - \theta_i)^2$$

Now assume a prior on $\theta_i$ such that

$$\theta_i = \begin{cases} 0 & \text{ w.p. } \frac{1}{2} \\ \frac{1}{\sqrt{n}} & \text{ w.p. } \frac{1}{2} \end{cases}$$

Then, by observing that $sup_x f(x) \ge E_x f(x)$, we have

$$sup_{\Theta_0} E \sum_{i=1}^I (\hat{\theta}_i - \theta_i)^2 \ge E_{\theta} E_{Y|\theta} \left[ \sum_{i=1}^I (\hat{\theta}_i - \theta_i )^2 \right]$$

Now recall that the Bayes estimator attains the smallest average (Bayes) risk. This gives us the following further lower bound:

$$E_{\theta} E_{Y|\theta} [ \sum_{i=1}^I (\hat{\theta}_i - \theta_i )^2 ] \ge E_{\theta} E_{Y|\theta} [ \sum_{i=1}^I (\hat{\theta}_{i,Bayes} - \theta_i )^2 ]$$

Note that $\hat{\theta}_{i,Bayes}$ is a function of $Y_i$ only, due to the independence across coordinates. Therefore, using our definition of the prior,

$$E_{\theta} E_{Y|\theta} [ \sum_{i=1}^I (\hat{\theta}_{i,Bayes} - \theta_i )^2 ] = \sum_{i=1}^I [ \frac{1}{2} E_{Y_i|\theta_i=0} (\hat{\theta}_{i,Bayes} - 0)^2 + \frac{1}{2} E_{Y_i|\theta_i=\frac{1}{\sqrt{n}}} (\hat{\theta}_{i,Bayes} - \frac{1}{\sqrt{n}})^2 ]$$

Rewriting the above equation with $\phi_{a,b}(x)$ being the pdf of $N(a,b)$ yields

$$\sum_{i=1}^I \left[ \frac{1}{2} E_{Y_i|\theta_i=0} (\hat{\theta}_{i,Bayes} - 0)^2 + \frac{1}{2} E_{Y_i| \theta_i=\frac{1}{\sqrt{n}}} (\hat{\theta}_{i,Bayes} - \frac{1}{\sqrt{n}})^2 \right] \\ = \sum_{i=1}^I \frac{1}{2} \left[ \int (\hat{\theta}_{i,Bayes} - 0)^2 \phi_{0,\frac{1}{n}}(x)dx + \int (\hat{\theta}_{i,Bayes} - \frac{1}{\sqrt{n}})^2 \phi_{\frac{1}{\sqrt{n}},\frac{1}{n}}(x) dx \right]$$

Let $g(x) = \min \{ \phi_{0,\frac{1}{n}}(x), \phi_{\frac{1}{\sqrt{n}},\frac{1}{n}}(x) \}$. Then using $g(x)$, the above equation can be lower bounded:

$$\ge \sum_{i=1}^I \frac{1}{2} \int \left( ( \hat{\theta}_{i,Bayes} - 0)^2 + (\hat{\theta}_{i,Bayes} - \frac{1}{\sqrt{n}})^2 \right) g(x) dx$$

Observing that $a^2 + b^2 \ge \frac{1}{2} (a-b)^2$, we can lower bound this by

$$\ge \sum_{i=1}^I \frac{1}{2} \int \frac{1}{2} (\frac{1}{\sqrt{n}})^2 g(x) dx = \frac{I}{n} \frac{1}{4} \int g(x) dx$$

Since $\int g(x) dx$ is some constant $c$, our risk lower bound is

$$c \frac{I}{n} = c n^{- \frac{2 \alpha}{2 \alpha + 1}}$$

This establishes our risk lower bound for this particular estimation procedure under this model.