Nonparametric Regression 01

The goal of this post is to walk through one derivation of a risk bound for nonparametric regression.

Model

For simplicity, we will focus on the Gaussian sequence model. That is, our model of interest has the following form:

$$Y_i = \theta_i + \frac{1}{\sqrt{n}} \sigma Z_i, \text{ where } Z_i \sim^{iid} N(0,1)$$

for $i = 1, \ldots, n$, where we assume $\sigma = 1$. In other words, we observe a noisy version $Y$ of the parameter, and we are interested in denoising $Y$ to recover the true parameter $\theta$.
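To make the model concrete, here is a minimal simulation sketch (the function name `sample_sequence_model` and the example $\theta$ are just for illustration):

```python
import numpy as np

def sample_sequence_model(theta, n, sigma=1.0, rng=None):
    """Draw Y_i = theta_i + (sigma / sqrt(n)) * Z_i with Z_i iid N(0, 1)."""
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta, dtype=float)
    return theta + (sigma / np.sqrt(n)) * rng.standard_normal(theta.shape)

# Example: a polynomially decaying true sequence observed with noise.
theta = 1.0 / np.arange(1, 51) ** 2
Y = sample_sequence_model(theta, n=100, rng=0)
```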

Parameter space

Our parameter space is a type of Sobolev ellipsoid:
$$\Theta = \{ \theta : \sum_{i=1}^{\infty} i^{2 \alpha} \theta^2_i \le M \}$$

This choice of parameter space implicitly imposes a smoothness assumption on the function space we might be interested in. For example, if we view our estimation problem as a function estimation problem, the task becomes estimating a function $f(x)$ from a set of discrete sample pairs $(X_i, Y_i)$.
By Fourier basis expansion,

$$f(x) = \sum_{i=1}^{\infty} \langle f, \varphi_i \rangle \varphi_i(x) = \sum_{i=1}^{\infty} \theta_i \varphi_i(x)$$

where $\theta_i = \langle f, \varphi_i \rangle$ and

$$\varphi_i(x) = \begin{cases} \cos(cix) \\ \sin(cix) \end{cases}$$

Imposing a smoothness assumption on $f$ by bounding the integral of the square of $f$'s second derivative is equivalent (up to constants) to the condition defining our parameter space. That is,

$$f''(x) = \sum_{i=1}^{\infty} \theta_i \varphi_i''(x) \\ = -\sum_{i=1}^{\infty} \theta_i (ci)^2 \varphi_i(x)$$

So, by Parseval's identity (the $\varphi_i$ are orthonormal), and since the minus sign disappears after squaring,

$$\int (f''(x))^2 dx \le M' \iff \sum_{i=1}^{\infty} [\theta_i (ci)^2]^2 \le M' \iff \sum_{i=1}^{\infty} c^4 i^4 \theta_i^2 \le M'$$

(This derivation corresponds to $\alpha = 2$, with $M = M'/c^4$.)
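As an optional numerical sanity check (our own addition, not part of the derivation), assume the concrete orthonormal basis $\varphi_i(x) = \sqrt{2}\cos(\pi i x)$ on $[0,1]$, i.e. $c = \pi$. Parseval's identity then says $\int (f''(x))^2 dx = \sum_i (\pi i)^4 \theta_i^2$, which we can verify on a grid:

```python
import numpy as np
from scipy.integrate import trapezoid

# Assumed basis for illustration: phi_i(x) = sqrt(2) * cos(pi * i * x) on [0, 1],
# which is orthonormal in L2([0, 1]); this corresponds to c = pi.
i = np.arange(1, 21)                          # truncate the expansion at 20 terms
theta = 1.0 / i ** 3                          # some rapidly decaying coefficients
x = np.linspace(0.0, 1.0, 20001)

phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(i, x))                      # phi_i(x)
f_dd = -((np.pi * i[:, None]) ** 2 * theta[:, None] * phi).sum(axis=0)   # f''(x)

lhs = trapezoid(f_dd ** 2, x)                 # integral of (f'')^2
rhs = np.sum((np.pi * i) ** 4 * theta ** 2)   # sum_i (pi i)^4 theta_i^2
print(lhs, rhs)                               # the two agree up to grid error
```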

Note that over an ellipsoid parameter space such as this one, linear procedures are asymptotically minimax. We can show this through the Pinsker bound, which we might go over in future posts under the decision-theory tag.

Goal of the inference problem

Estimate $\theta$. Our loss function is

$$|| \hat{\theta} - \theta ||^2 = \sum_{i=1}^{\infty} (\hat{\theta}_i - \theta_i)^2$$

Estimation Procedure

Intuitively, due to our parameter space restriction, $\theta_i$ has to become small as $i$ grows, and beyond some point it becomes negligible. This observation leads to the following naive truncation estimator:

$$\hat{\theta}_i = \begin{cases} Y_i & i \le I \\ 0 & i > I \end{cases}$$

where $I$ is the truncation level, which we will optimize soon (so as to obtain a tighter risk upper bound).
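A minimal sketch of this truncation estimator in code (the name `truncation_estimator` is ours; arrays are 0-indexed, so the first `I` entries are the ones kept):

```python
import numpy as np

def truncation_estimator(Y, I):
    """theta_hat_i = Y_i for the first I coordinates, and 0 afterwards."""
    Y = np.asarray(Y, dtype=float)
    theta_hat = np.zeros_like(Y)
    theta_hat[:I] = Y[:I]
    return theta_hat
```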

Risk Calculation

With this procedure, we can easily calculate the risk of each coordinate. When $\hat{\theta}_i = Y_i$, its risk is $E (Y_i - \theta_i)^2 = \frac{1}{n}$ (this is just the variance of $Y_i$, which equals $\frac{1}{n}$ by the definition of our model, since $\sigma = 1$). When $\hat{\theta}_i = 0$, its risk is $E (0 - \theta_i)^2 = \theta_i^2$ (note that the expectation is taken over the data, and $\theta$ is a deterministic value).

From this simple algebra, we see that whenever $\theta^2_i \le \frac{1}{n}$ we would rather use $\hat{\theta}_i = 0$, which is what the $i > I$ condition is meant to capture.

The risk of this procedure is given by

$$R(\hat{\theta}) = E \sum_{i=1}^{\infty} (\hat{\theta}_i - \theta_i)^2 = E \sum_{i=1}^I (Y_i - \theta_i)^2 + \sum_{i=I+1}^{\infty} \theta^2_i$$

Risk upper bound

We immediately see that the first term equals $\frac{I}{n}$, since each of the $I$ retained coordinates contributes its variance $\frac{1}{n}$.
The question is how to upper bound the second term.
Recall the condition in our parameter space definition: $\sum_{i=1}^{\infty} i^{2 \alpha} \theta_i^2 \le M$. From this, we obtain the following chain of inequalities:

$$\sum_{i=1}^{\infty} i^{2 \alpha} \theta_i^2 \le M \\ \Rightarrow \sum_{i = I + 1}^{\infty} i^{2 \alpha} \theta_i^2 \le M \\ \Rightarrow (I+1)^{2\alpha} \sum_{i = I + 1}^{\infty} \theta_i^2 \le M \\ \Rightarrow \sum_{i = I + 1}^{\infty} \theta_i^2 \le \frac{M}{(I+1)^{2\alpha}} \le \frac{M}{I^{2 \alpha}}$$

So our upper bound is

$$R(\hat{\theta}) \le \frac{I}{n} + \frac{M}{I^{2 \alpha}}$$

To find an optimal $I$, we treat $\frac{M}{I^{2 \alpha}}$ as the squared bias and pick $I$ to balance the bias-variance tradeoff. This leads to

$$\frac{I}{n} = \frac{M}{I^{2 \alpha}} \iff I^{1+2\alpha} = Mn \iff I = (Mn)^{\frac{1}{1+2\alpha}}$$

Therefore,

$$R(\hat{\theta}) \le \frac{I}{n} + \frac{M}{I^{2 \alpha}} = \frac{2}{n} (Mn)^{\frac{1}{1+2\alpha}} = 2 M^{\frac{1}{1+2\alpha}} n^{-\frac{2\alpha}{1+2\alpha}}$$

This leads to our risk upper bound:

$$\sup_{\Theta} E || \hat{\theta} - \theta ||^2 \le 2 M^{\frac{1}{1+2\alpha}} n^{-\frac{2\alpha}{1+2\alpha}}$$

(We will try to sharpen this bound in later posts.)
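As an informal check of this rate (a simulation sketch with our own choice of $\theta$ inside the ellipsoid, not a proof), we can compare the Monte Carlo risk of the truncation estimator against the bound $2 M^{1/(1+2\alpha)} n^{-2\alpha/(1+2\alpha)}$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, M = 2.0, 1.0

# An example theta inside the ellipsoid: sum_i i^{2 alpha} theta_i^2 <= M.
i = np.arange(1, 2001)
theta = 0.5 / i ** (alpha + 1)
assert np.sum(i ** (2 * alpha) * theta ** 2) <= M

for n in [100, 1000, 10000]:
    I = int((M * n) ** (1.0 / (1 + 2 * alpha)))        # optimized truncation level
    risks = []
    for _ in range(200):                               # Monte Carlo replications
        Y = theta + rng.standard_normal(theta.size) / np.sqrt(n)
        theta_hat = np.where(i <= I, Y, 0.0)
        risks.append(np.sum((theta_hat - theta) ** 2))
    bound = 2 * M ** (1 / (1 + 2 * alpha)) * n ** (-2 * alpha / (1 + 2 * alpha))
    print(n, np.mean(risks), bound)                    # empirical risk vs. the upper bound
```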

Risk lower bound

To find a matching lower bound, we resort to Le Cam's idea.

We first construct a sub-parameter space over which the risk calculation is simple enough. Considering a sub-parameter space is useful because, for any sub-parameter space $\Theta_0 \subset \Theta$ and any estimator, we have the following lower bound on the sup risk over $\Theta$:

$$\sup_{\Theta} E || \hat{\theta} - \theta||^2 \ge \sup_{\Theta_0} E || \hat{\theta} - \theta ||^2$$

In our case, let our sub-parameter space be

$$\Theta_0 = \{ \theta : \theta_i \in \{ 0, \tfrac{1}{\sqrt{n}} \} \text{ if } i \le I, \ \theta_i = 0 \text{ if } i \ge I + 1 \}$$

where $I$ is the previously optimized value $I = (Mn)^{\frac{1}{1+2\alpha}}$.

We can confirm that $\Theta_0 \subset \Theta$ by checking the condition $\sum_{i=1}^{\infty} i^{2\alpha} \theta_i^2 \le M$ at the extreme point of $\Theta_0$ (the one with $\theta_i = \frac{1}{\sqrt{n}}$ for all $i \le I$), which dominates every other point of $\Theta_0$:

$$\sum_{i=1}^I i^{2 \alpha} \frac{1}{n} = \frac{1}{n} \sum_{i=1}^I i^{2\alpha} \le \frac{1}{n} I \cdot I^{2\alpha} = \frac{1}{n} I^{1+2\alpha} = \frac{1}{n} Mn = M$$
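The same membership check in code (a small sketch with arbitrary example values of $M$, $n$, and $\alpha$):

```python
import numpy as np

alpha, M, n = 2.0, 1.0, 1000                   # arbitrary example values
I = int((M * n) ** (1.0 / (1 + 2 * alpha)))
i = np.arange(1, I + 1)
corner_sum = np.sum(i ** (2 * alpha) / n)      # theta_i = 1/sqrt(n) for i <= I
print(corner_sum, "<=", M)                     # the ellipsoid condition holds
```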

Now, we want to lower bound $\sup_{\Theta_0} E || \hat{\theta} - \theta ||^2$:

$$\sup_{\Theta_0} E || \hat{\theta} - \theta ||^2 \ge \sup_{\Theta_0} E \sum_{i=1}^I (\hat{\theta}_i - \theta_i)^2$$

Now place a prior on the $\theta_i$, independently for each $i \le I$ (so that the prior is supported on $\Theta_0$), such that

$$\theta_i = \begin{cases} 0 & \text{ w.p. } \frac{1}{2} \\ \frac{1}{\sqrt{n}} & \text{ w.p. } \frac{1}{2} \end{cases}$$

Then, by observing that $\sup_x f(x) \ge E_x f(x)$, we have

$$\sup_{\Theta_0} E \sum_{i=1}^I (\hat{\theta}_i - \theta_i)^2 \ge E_{\theta} E_{Y|\theta} \left[ \sum_{i=1}^I (\hat{\theta}_i - \theta_i )^2 \right]$$

Now recall that the Bayes estimator attains the smallest average (Bayes) risk among all estimators. This gives us the following further lower bound:

$$E_{\theta} E_{Y|\theta} [ \sum_{i=1}^I (\hat{\theta}_i - \theta_i )^2 ] \ge E_{\theta} E_{Y|\theta} [ \sum_{i=1}^I (\hat{\theta}_{i,Bayes} - \theta_i )^2 ]$$

Note that $\hat{\theta}_{i,Bayes}$ is a function of $Y_i$ only: the coordinates are independent and the prior is a product prior, so the posterior for $\theta_i$ depends only on $Y_i$. Therefore, using our definition of the prior,

$$E_{\theta} E_{Y|\theta} [ \sum_{i=1}^I (\hat{\theta}_{i,Bayes} - \theta_i )^2 ] = \sum_{i=1}^I [ \frac{1}{2} E_{Y_i|\theta_i=0} (\hat{\theta}_{i,Bayes} - 0)^2 + \frac{1}{2} E_{Y_i|\theta_i=\frac{1}{\sqrt{n}}} (\hat{\theta}_{i,Bayes} - \frac{1}{\sqrt{n}})^2 ]$$
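For squared-error loss, the Bayes estimator is the posterior mean, which has a simple closed form per coordinate under our two-point prior. A small sketch (the function name and the explicit formula are our own additions):

```python
import numpy as np
from scipy.stats import norm

def bayes_estimator(y, n):
    """Posterior mean of theta_i given Y_i = y, under the two-point prior
    theta_i in {0, 1/sqrt(n)} with probability 1/2 each and noise N(0, 1/n)."""
    mu = 1.0 / np.sqrt(n)                      # the nonzero prior atom
    s = 1.0 / np.sqrt(n)                       # noise standard deviation
    p0 = norm.pdf(y, loc=0.0, scale=s)         # likelihood under theta_i = 0
    p1 = norm.pdf(y, loc=mu, scale=s)          # likelihood under theta_i = 1/sqrt(n)
    return mu * p1 / (p0 + p1)
```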

Writing $\phi_{a,b}(x)$ for the pdf of $N(a,b)$, this Bayes risk can be expressed as

$$\sum_{i=1}^I \left[ \frac{1}{2} E_{Y_i|\theta_i=0} (\hat{\theta}_{i,Bayes} - 0)^2 + \frac{1}{2} E_{Y_i| \theta_i=\frac{1}{\sqrt{n}}} (\hat{\theta}_{i,Bayes} - \frac{1}{\sqrt{n}})^2 \right] \\ = \sum_{i=1}^I \frac{1}{2} \left[ \int (\hat{\theta}_{i,Bayes}(x) - 0)^2 \phi_{0,\frac{1}{n}}(x) dx + \int (\hat{\theta}_{i,Bayes}(x) - \frac{1}{\sqrt{n}})^2 \phi_{\frac{1}{\sqrt{n}},\frac{1}{n}}(x) dx \right]$$

Let $g(x) = \min \{ \phi_{0,\frac{1}{n}}(x), \phi_{\frac{1}{\sqrt{n}},\frac{1}{n}}(x) \}$. Since each of the two densities is at least $g(x)$, the expression above can be lower bounded:

$$\ge \sum_{i=1}^I \frac{1}{2} \int \left( ( \hat{\theta}_{i,Bayes}(x) - 0)^2 + (\hat{\theta}_{i,Bayes}(x) - \frac{1}{\sqrt{n}})^2 \right) g(x) dx$$

Observing that $a^2 + b^2 \ge \frac{1}{2} (a-b)^2$, applied with $a = \hat{\theta}_{i,Bayes}(x)$ and $b = \hat{\theta}_{i,Bayes}(x) - \frac{1}{\sqrt{n}}$ (so that $a - b = \frac{1}{\sqrt{n}}$), we can lower bound this further:

$$\ge \sum_{i=1}^I \frac{1}{2} \int \frac{1}{2} (\frac{1}{\sqrt{n}})^2 g(x) dx = \frac{I}{n} \frac{1}{4} \int g(x) dx$$

Finally, $\int g(x) dx$ is a constant that does not depend on $n$: the two densities have means $0$ and $\frac{1}{\sqrt{n}}$ and common standard deviation $\frac{1}{\sqrt{n}}$, so their overlap equals $2\Phi(-\frac{1}{2}) \approx 0.62$ for every $n$. Writing $c$ for $\frac{1}{4} \int g(x) dx$, our risk lower bound is

$$c \frac{I}{n} = c\, M^{\frac{1}{1+2\alpha}}\, n^{- \frac{2 \alpha}{2 \alpha + 1}}$$

Since nothing in this argument used the particular form of $\hat{\theta}$, the bound holds for every estimator: the minimax risk over $\Theta$ is at least of order $n^{-\frac{2\alpha}{2\alpha+1}}$, which matches the rate of our upper bound. In particular, the simple truncation estimator is rate-optimal under this model.
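Finally, a quick numerical check (our own addition) that $\int g(x) dx$ does not depend on $n$: the two densities are separated by exactly one noise standard deviation, so the overlap equals $2\Phi(-\tfrac{1}{2}) \approx 0.617$ for every $n$.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

for n in [10, 100, 10000]:
    s = 1.0 / np.sqrt(n)                                 # noise standard deviation
    x = np.linspace(-10.0 * s, 11.0 * s, 200001)         # grid covering both densities
    g = np.minimum(norm.pdf(x, 0.0, s), norm.pdf(x, s, s))
    print(n, trapezoid(g, x), 2 * norm.cdf(-0.5))        # both are about 0.617
```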