Nonparametric Regression 02

In the last post, we investigated the Gaussian sequence model and derived the minimax rate:

$$\inf_{\hat{\theta}} \sup_{\Theta} E || \hat{\theta} - \theta ||^2 \asymp M^{\frac{1}{2\alpha + 1}} n^{-\frac{2\alpha}{2\alpha+1}}$$

The crucial assumption in deriving this bound was that $\alpha$ and $M$ are known; together they define the Sobolev ellipsoid that serves as our parameter space.

In this post, we remove that assumption and attempt to achieve the same risk bound (up to a constant factor).

To recap, the following are our model description and parameter space.

Model

$$Y_i = \theta_i + \frac{1}{\sqrt{n}} \sigma Z_i, \text{ where } Z_i \sim^{iid} N(0,1)$$

Parameter space

$$\Theta = \{ \theta : \sum_{i=1}^{\infty} i^{2 \alpha} \theta^2_i \le M \}$$
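To make the setup concrete, here is a minimal simulation sketch of this sequence model. The coefficient choice $\theta_i \propto i^{-(\alpha+1)}$ is an illustrative assumption only (one convenient way to land inside the ellipsoid); nothing in the post depends on it.

```python
import numpy as np

def simulate_sequence_model(n, alpha=1.0, M=1.0, sigma=1.0, seed=0):
    """Draw Y_i = theta_i + sigma/sqrt(n) * Z_i for i = 1..n.

    theta_i ~ i^{-(alpha+1)}, rescaled so that sum_i i^{2 alpha} theta_i^2 = M,
    i.e. one convenient member of the Sobolev ellipsoid (an illustrative choice).
    """
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1)
    theta = i ** (-(alpha + 1.0))
    # Rescale so the ellipsoid constraint sum_i i^{2 alpha} theta_i^2 <= M holds.
    theta *= np.sqrt(M / np.sum(i ** (2.0 * alpha) * theta ** 2))
    Y = theta + sigma / np.sqrt(n) * rng.standard_normal(n)
    return theta, Y

theta, Y = simulate_sequence_model(n=1000, alpha=1.0, M=1.0)
```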

Adaptive Estimation

Recall that the naive estimation procedure in the previous post uses the observation $Y_i$ itself as the estimator for small $i$ and 0 for large $i$.
It worked because we could optimize the cutoff index $I$ based on $\alpha$ and $M$.
Since we no longer have access to $\alpha$ and $M$, we need an alternative estimation procedure.
The only information available is the data, so we want a procedure that uses the data in a meaningful way.

To get a hint, we will look at how the James-Stein estimator was conceived.

James-Stein Estimator

The James-Stein estimator has the form $\delta(X) = c X$ (a linear procedure) where the scalar $c$ is data-dependent.
(That is, the JS estimator adapts to $|| \theta ||^2$.)

If we have $m$ observations such that

$$X_i = \theta_i + \sigma Z_i , Z_i \sim^{i.i.d} N(0,1)$$

then the JS estimator is

$$\hat{\theta}_{JS} = (1 - \frac{(m-2)\sigma^2}{\sum_{i=1}^m X_i^2} ) X$$

where $X$ denotes the $m$-dimensional vector collecting all $m$ observations.
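A direct implementation sketch of this estimator (the noise level $\sigma$ is assumed known, and the plain rather than positive-part version is used to match the analysis below):

```python
import numpy as np

def james_stein(X, sigma=1.0):
    """James-Stein estimator of theta given X ~ N(theta, sigma^2 I_m), m >= 3."""
    m = X.size
    shrinkage = 1.0 - (m - 2) * sigma ** 2 / np.sum(X ** 2)
    return shrinkage * X
```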

We say that the JS estimator mimics the best linear (shrinkage) estimator because the difference of the two risks is bounded by a constant:

$$E || \hat{\theta}_{JS} - \theta ||^2 - \inf_c E || c X - \theta ||^2 \le \text{constant}$$

To see this, we will investigate each term and show the difference is bounded.
(Assume $\sigma=1$ without loss of generality.)

The First Term

$$E || \hat{\theta}_{JS} - \theta ||^2 = E_{X|\theta} \sum_{i=1}^m \left(X_i - \frac{m-2}{||X||^2} X_i - \theta_i \right)^2 \\ = E_{X|\theta} \sum_{i=1}^m \left( (X_i - \theta_i)^2 + \frac{(m-2)^2}{(||X||^2)^2} X_i^2 - 2(X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right) \\ = m + E_{X|\theta} \left(\frac{(m-2)^2}{||X||^2} \right) - 2 \sum_{i=1}^m E_{X|\theta} \left[ (X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right]$$

To go further, we will need Stein’s identity (simplified version):

$$\text{For } Y \sim N(\theta, 1), \text{ we have } E_{Y|\theta} (Y - \theta) g(Y) = E_{Y|\theta} g'(Y)$$

under some regularity assumptions on $g$.
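As a quick sanity check of the identity, here is a Monte Carlo sketch with the illustrative choice $g(y) = y^3$ (so $g'(y) = 3y^2$); the specific $g$ and $\theta$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = 1.5
Y = theta_0 + rng.standard_normal(10**6)  # Y ~ N(theta_0, 1)

lhs = np.mean((Y - theta_0) * Y ** 3)  # estimate of E[(Y - theta) g(Y)]
rhs = np.mean(3 * Y ** 2)              # estimate of E[g'(Y)]
print(lhs, rhs)  # both should be close to 3(1 + theta_0^2), up to Monte Carlo error
```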

In our case, $Y=X_i$ and $g(X_i) = \frac{m-2}{||X||^2} X_i$.

By the above identity, we have

$$\sum_{i=1}^m E_{X|\theta} \left[ (X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right] = \sum_{i=1}^m E_{X|\theta} \frac{\partial}{\partial X_i}\left(\frac{m-2}{||X||^2}X_i\right) \\ = \sum_{i=1}^m E_{X|\theta}\left( \frac{m-2}{||X||^2} - \frac{2(m-2) X_i^2}{(||X||^2)^2} \right) \\ = E_{X|\theta} \left(\frac{m(m-2)}{||X||^2} - \frac{2 (m-2)}{||X||^2} \right) \\ = E_{X|\theta} \left(\frac{(m-2)^2}{||X||^2} \right)$$

This gives us

$$E || \hat{\theta}_{JS} - \theta ||^2 = m + E_{X|\theta} \left(\frac{(m-2)^2}{||X||^2} \right) - 2 \sum_{i=1}^m E_{X|\theta} \left[ (X_i - \theta_i) \frac{m-2}{||X||^2} X_i \right] \\ = m - E_{X|\theta} \frac{(m-2)^2}{||X||^2}$$

To get an upper bound on this, we will apply Jensen's inequality, but first we use the following property of the noncentral chi-squared distribution to simplify (noncentral because each $X_i$ has a nonzero mean $\theta_i$):

$$||X||^2 = \sum_{i=1}^m X_i^2 \sim \chi^2_{m+2N}, \text{ where } N \sim \text{Poisson} \left(\frac{||\theta||^2}{2} \right)$$
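This mixture representation (a central $\chi^2$ whose degrees of freedom are randomized by a Poisson draw) can be checked numerically; a small sketch with an arbitrary $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
theta = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
n_draws = 10**6

# Direct simulation of ||X||^2 with X ~ N(theta, I_m).
X = theta + rng.standard_normal((n_draws, m))
direct = np.sum(X ** 2, axis=1)

# Mixture representation: chi^2_{m + 2N} with N ~ Poisson(||theta||^2 / 2).
N = rng.poisson(np.sum(theta ** 2) / 2, size=n_draws)
mixture = rng.chisquare(m + 2 * N)

print(direct.mean(), mixture.mean())  # both ≈ m + ||theta||^2
print(direct.var(), mixture.var())    # both ≈ 2m + 4||theta||^2
```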

By conditioning on $N$, we have

$$m - E_{X|\theta} \frac{(m-2)^2}{||X||^2} = m - E_N \left[ E_{X|\theta} \left( \frac{(m-2)^2}{||X||^2} \Big| N \right) \right] \\ = m - E_N \frac{(m-2)^2}{m+2N-2} \text{ (since } E\frac{1}{\chi^2_d} = \frac{1}{d-2} \text{)} \\ \le m - \frac{(m-2)^2}{m-2 + 2E[N]} \text{ (by Jensen's inequality: } E \frac{1}{Y} \ge \frac{1}{EY} \text{)} \\ = m - \frac{(m-2)^2}{m-2 + ||\theta||^2} \\ = 2 + (m - 2) - \frac{(m-2)^2}{m-2 + ||\theta||^2} \\ = 2 + \frac{(m-2)||\theta||^2}{m-2 + ||\theta||^2} \\ \le 2 + \frac{m||\theta||^2}{m + ||\theta||^2}$$

(We used $2E[N] = ||\theta||^2$ in the fourth line; the last inequality holds because $x \mapsto \frac{x ||\theta||^2}{x + ||\theta||^2}$ is increasing in $x$.)

When $\sigma \neq 1$, the bound becomes

$$E ||\hat{\theta}_{JS} - \theta||^2 \le 2\sigma^2 + \frac{m \sigma^2 ||\theta||^2}{m\sigma^2 + ||\theta||^2}$$
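A quick Monte Carlo check of this bound; the particular $m$, $\sigma$, and $\theta$ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 10, 2.0
theta = np.linspace(-1.0, 1.0, m)
n_draws = 10**5

# Draws of X ~ N(theta, sigma^2 I_m) and the corresponding JS estimates.
X = theta + sigma * rng.standard_normal((n_draws, m))
shrink = 1.0 - (m - 2) * sigma ** 2 / np.sum(X ** 2, axis=1, keepdims=True)
theta_js = shrink * X

risk_js = np.mean(np.sum((theta_js - theta) ** 2, axis=1))
bound = 2 * sigma ** 2 + m * sigma ** 2 * np.sum(theta ** 2) / (m * sigma ** 2 + np.sum(theta ** 2))
print(risk_js, bound)  # the empirical risk should sit below the bound
```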

The Second Term

We can see that
$$\inf_c E || c X - \theta ||^2 = \frac{m \sigma^2 ||\theta||^2}{||\theta||^2 + m \sigma^2}$$

by optimizing over $c$.

First, consider the bias-variance decomposition for squared loss:

$$E || cX - \theta ||^2 = (c-1)^2 ||\theta||^2 + c^2 m \sigma^2$$

By taking the derivative w.r.t. $c$ and setting it equal to zero, we have

$$2(c-1) ||\theta||^2 + 2cm\sigma^2 = 0 \\ c = \frac{||\theta||^2}{||\theta||^2 + m \sigma^2}$$

By plugging this value into $E||cX - \theta||^2$, we have

$$(c-1)^2 ||\theta||^2 + c^2 m \sigma^2 = (\frac{-m\sigma^2}{||\theta||^2 + m \sigma^2})^2 ||\theta||^2 + (\frac{||\theta||^2}{||\theta||^2 + m \sigma^2})^2 m \sigma^2 \\ = \frac{m \sigma^2 ||\theta||^2 }{||\theta||^2 + m \sigma^2 }$$
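This minimization is straightforward to confirm symbolically; a sketch using sympy, with theta2 standing for $||\theta||^2$ and sigma2 for $\sigma^2$:

```python
import sympy as sp

c, t2, m, s2 = sp.symbols('c theta2 m sigma2', positive=True)
risk = (c - 1) ** 2 * t2 + c ** 2 * m * s2  # bias-variance decomposition

c_star = sp.solve(sp.diff(risk, c), c)[0]
min_risk = sp.simplify(risk.subs(c, c_star))
print(c_star)    # theta2/(theta2 + m*sigma2)
print(min_risk)  # m*sigma2*theta2/(theta2 + m*sigma2)
```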

By combining the results from First Term and Second Term, we have

$$E || \hat{\theta}_{JS} - \theta ||^2 - \inf_c E || c X - \theta ||^2 \le 2 \sigma^2$$

which shows that the risk difference is bounded by $2 \sigma^2$, a constant that does not depend on $\theta$ or the dimension $m$.

Shrinkage Estimator

Now let’s go back to our original problem of constructing an estimation procedure that achieves

$$E || \hat{\theta} - \theta ||^2 \asymp M^\frac{1}{1+2 \alpha} n^{- \frac{2 \alpha}{2 \alpha+1}}$$

By leveraging the idea of James-Stein estimator, we can think of a form of shrinkage estimator for this problem as well.
However, simply using $\hat{\theta} = \hat{\theta}_{JS}$ doesn't work, because JS implicitly assumes that the $\theta_i$ ($i=1,\dots,m$) are all of comparable magnitude; when some $\theta_i$ are large and others are small, its performance degrades badly.
This makes sense because every coordinate of $X$ is shrunk by the same factor $(1 - \frac{c}{||X||^2})$.

The natural extension of the JS procedure is to divide the observations into blocks and apply a different shrinkage factor to each block, according to its magnitude. (WLOG, we can assume the observations are sorted in decreasing order.)

Risk Calculation

The modified JS-like estimation procedure leads to the following risk bound:

$$E || \hat{\theta} - \theta ||^2 = E \sum_{i=1}^n (\hat{\theta}_i - \theta_i)^2 + \sum_{i=n+1}^{\infty} \theta_i^2 \\ \le \sum_{k=1}^K \left[ \frac{|B_k| \frac{1}{n} ||\theta_{B_k}||^2}{|B_k|\frac{1}{n}+||\theta_{B_k}||^2} + 2 \frac{1}{n} \right] + O(n^{-\frac{2\alpha}{2\alpha+1}})$$

where $K$ is the number of blocks, $B_k$ denotes the $k$-th block, and $|B_k|$ is its size.
We can naively set the block sizes to powers of 2 (sizes $1, 2, 4, \dots$), which gives $K \approx \log_2 n$, where $n$ is the total number of observations.
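A minimal sketch of this blockwise procedure with dyadic blocks; the positive-part shrinkage factor and the exact handling of tiny blocks are illustrative choices, not the only possible ones. It reuses the hypothetical simulate_sequence_model from the earlier sketch.

```python
import numpy as np

def block_james_stein(Y, sigma2):
    """Apply James-Stein shrinkage separately to each dyadic block of Y.

    sigma2 is the per-coordinate noise variance (sigma^2 / n in our model).
    """
    n = Y.size
    theta_hat = np.zeros(n)
    start, size = 0, 1
    while start < n:
        end = min(start + size, n)
        block = Y[start:end]
        m = block.size
        if m > 2:
            # Positive-part shrinkage factor, one per block.
            shrink = max(0.0, 1.0 - (m - 2) * sigma2 / np.sum(block ** 2))
        else:
            shrink = 1.0  # JS needs m >= 3; keep tiny blocks unshrunk
        theta_hat[start:end] = shrink * block
        start, size = end, 2 * size
    return theta_hat

# Example: estimate theta from one draw of the sequence model.
theta, Y = simulate_sequence_model(n=1024, alpha=1.0, M=1.0)
theta_hat = block_james_stein(Y, sigma2=1.0 / 1024)
print(np.sum((theta_hat - theta) ** 2))
```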

The correspondence with the James-Stein risk upper bound is:

$$\sigma^2 = \frac{1}{n}, ||\theta||^2 = ||\theta_{B_k}||^2, m = |B_k|$$

Recall that we introduced the threshold index $I$ by considering the bias-variance trade-off.
This time, we choose $K_0$ (the number of leading blocks we keep) such that $I \le |B_1| + \cdots + |B_{K_0}| \le 2I$ holds.

Now observe that for any positive real numbers $a$ and $b$, we have $\frac{ab}{a+b} \le a$ and $\frac{ab}{a+b} \le b$.
Therefore,

$$\sum_{k=1}^K \left[ \frac{|B_k| \frac{1}{n} ||\theta_{B_k}||^2}{|B_k|\frac{1}{n}+||\theta_{B_k}||^2} + 2 \frac{1}{n} \right] + O(n^{-\frac{2\alpha}{2\alpha+1}}) \le \sum_{k=1}^{K_0} \frac{|B_k|}{n} + \sum_{k=K_0+1}^K ||\theta_{B_k}||^2 + \frac{2K}{n} + O(n^{-\frac{2 \alpha}{2\alpha+1}}) \\ \le \frac{2I}{n} + \sum_{i > I} \theta_i^2 + \frac{2 \log_2 n}{n} + O(n^{-\frac{2 \alpha}{2\alpha+1}})$$

Since $\frac{2 \log_2 n}{n}$ is of smaller order than $n^{-\frac{2\alpha}{2\alpha+1}}$, it can be absorbed into the $O(\cdot)$ term.

From the calculation in the last post, this is further bounded above by

$$\le c M^{\frac{1}{1+2\alpha}} n^{-\frac{2\alpha}{2\alpha+1}}$$
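As a rough empirical sanity check of this rate, here is a sketch reusing the simulate_sequence_model and block_james_stein sketches from above: on a log-log scale the risk should fall roughly linearly in $n$ with slope about $-\frac{2\alpha}{2\alpha+1}$ (about $-2/3$ for $\alpha = 1$).

```python
import numpy as np

alpha = 1.0
ns = [2 ** k for k in range(8, 14)]  # n = 256, ..., 8192
risks = []
for n in ns:
    mse = []
    for seed in range(50):
        theta, Y = simulate_sequence_model(n, alpha=alpha, M=1.0, seed=seed)
        theta_hat = block_james_stein(Y, sigma2=1.0 / n)
        mse.append(np.sum((theta_hat - theta) ** 2))
    risks.append(np.mean(mse))

# Fit a line to (log n, log risk); the slope should be roughly -2*alpha/(2*alpha+1).
slope = np.polyfit(np.log(ns), np.log(risks), 1)[0]
print(slope, -2 * alpha / (2 * alpha + 1))
```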

One criticism of this procedure is that the constant $c$ can be large.
Is there a way to reduce it? We will explore an alternative blockwise procedure next time.