Supplementary Notes for Dropout as a Bayesian Approximation

(notes to myself)

Summary

The dropout objective minimises the KL divergence between an approximate distribution and the posterior of a deep Gaussian process (marginalized over its finite rank covariance function parameters).

Gaussian Processes

  • models distributions over functions.
  • offers a way to get uncertainty estimates over the function values, robustness to over-fitting, and principled ways of hyper-parameter tuning.

How to get uncertainty information from these deep learning models for free – without changing a thing

  • it’s long been known that these deep tools can be related to Gaussian processes
  • What are the uncertainty estimates that we can get from Gaussian processes?
    – Variance of a predictive distribution.

  • So when we say we can get uncertainty information from a dropout network, we mean that we can simply calculate the variance of the predictive distribution induced by that network.

How exactly do we calculate the variance of the predictive distribution?

    1. Define a prior length-scale $l$, which captures our belief over the function frequency. A short length-scale $l$ corresponds to high frequency data, and a long length-scale corresponds to low frequency data.
    1. Take the length-scale squared, and divide it by the weight decay $\lambda$. We then scale the result by half the keep probability $p$ (one minus the dropout probability) over the number of data points $N$. That is,
$$\tau = \frac{l^2 p}{2 N \lambda}$$

where $p$ indicates the probability of the units not being dropped. (usually $p$ is the probability of units being dropped, so be careful!)

    1. This $\tau$ is the Gaussian process precision. (To see why, see the derivation section.)
    1. Next, run the network forward on a test point $x^*$ with dropout on. Repeat this $T$ times with different units dropped, and collect the results $\{ \hat{y}^*_t(x^*) \}_{t=1}^T$. These are empirical samples from our approximate predictive posterior distribution.
    1. We can get an empirical estimator for the predictive mean of our approximate posterior as well as the predictive variance (our uncertainty) from these samples as follows:
$$\begin{aligned} E[y^*] &\approx \frac{1}{T} \sum_{t=1}^T \hat{y}^*_t(x^*) \\ Var[y^*] &\approx \tau^{-1} I + \frac{1}{T} \sum_{t=1}^T \hat{y}^*_t(x^*)^\top \hat{y}^*_t(x^*) - E[y^*]^\top E[y^*] \end{aligned}$$
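
The steps above can be sketched in a few lines of Python. This is only a toy illustration, not the paper's implementation: the one-hidden-layer ReLU network, its random weights, and the names `gp_precision`, `forward_with_dropout` and `mc_dropout_predict` are all made up for this note, and the output is scalar so that $\tau^{-1} I$ reduces to $1/\tau$. It uses the standard library only, to stay dependency-free.

```python
import math
import random

def gp_precision(length_scale, keep_prob, n_data, weight_decay):
    """tau = l^2 * p / (2 * N * lambda), with p the probability of KEEPING a unit."""
    return length_scale ** 2 * keep_prob / (2.0 * n_data * weight_decay)

def forward_with_dropout(x, w1, b1, w2, keep_prob, rng):
    """One stochastic pass through a toy 1-hidden-layer ReLU net (scalar in, scalar out)."""
    out = 0.0
    for w_in, b, w_out in zip(w1, b1, w2):
        hidden = max(0.0, x * w_in + b)
        if rng.random() < keep_prob:  # keep this hidden unit with probability p
            out += hidden * w_out
    return out

def mc_dropout_predict(x, w1, b1, w2, keep_prob, tau, n_samples, rng):
    """Predictive mean and variance from n_samples stochastic forward passes."""
    samples = [forward_with_dropout(x, w1, b1, w2, keep_prob, rng)
               for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    second_moment = sum(s * s for s in samples) / n_samples
    variance = 1.0 / tau + second_moment - mean ** 2
    return mean, variance

# Hypothetical settings, chosen only for illustration.
rng = random.Random(0)
n_hidden = 50
w1 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
b1 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
w2 = [rng.gauss(0.0, 1.0) / math.sqrt(n_hidden) for _ in range(n_hidden)]

tau = gp_precision(length_scale=1.0, keep_prob=0.9, n_data=1000, weight_decay=1e-4)
mean, variance = mc_dropout_predict(0.5, w1, b1, w2, 0.9, tau, 200, rng)
print(tau, mean, variance)
```

Note that the variance can never fall below $1/\tau$: the sample spread of the forward passes is added on top of the noise term implied by the precision.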

Bayesian approach to function approximation

Given a training set $\{ X_i, Y_i \}_{i=1}^n$, we want to estimate a function $y = f(x)$ that is likely to describe the relationship between $X$ and $Y$.

Following the Bayesian approach, we can put some prior distribution over the space of functions $p(f)$.
We then look for the posterior distribution given data $(X,Y)$:

$$p(f|X,Y) \propto p(Y|X, f) p(f)$$

This distribution captures the most likely functions given $(X,Y)$.
A prediction can be done in the following way:

$$\begin{aligned} p(y^*|x^*, X,Y) &= \int p(y^*|x^*,f) \, p(f|X,Y) \, df \\ &= \int p(y^*|f^*) \, p(f^*|x^*,X,Y) \, df^* \end{aligned}$$

where $f^* = f(x^*)$ is the function value at the test point.

A covariance function is like a kernel function: it takes two inputs and returns a measure of their similarity. Therefore, given some $n \times p$ data matrix, this function induces an $n \times n$ covariance matrix.
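
As a small sketch of how a covariance function induces a covariance matrix, the snippet below evaluates a kernel on every pair of rows of a data matrix. The squared-exponential kernel is my choice for illustration (the notes don't fix a particular covariance function), and `squared_exponential` and `covariance_matrix` are made-up names.

```python
import math

def squared_exponential(x1, x2, length_scale=1.0):
    """Similarity of two input vectors; equals 1 when they coincide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def covariance_matrix(X, kernel):
    """Evaluate the kernel on every pair of rows of X."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

# n = 3 data points with p = 2 features each
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = covariance_matrix(X, squared_exponential)
for row in K:
    print(row)
```

The result is $3 \times 3$ regardless of the number of features, symmetric, and has ones on the diagonal because each point is maximally similar to itself under this kernel.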

How do these functions correspond to the Gaussian process?

Why can we think of the dropout noise as approximate integration?

  • Because averaging forward passes through the network is equivalent to Monte Carlo integration over a Gaussian process posterior approximation.

Derivation

  • averaging forward passes:
$$p(y^* | x^*, X, Y) = \int p(y^*|x^*,\omega) p(\omega|X,Y) d\omega$$

where one forward pass through a network with $K$ hidden units (given all the weights $\omega = \{W_1, W_2, b\}$ and the test input $x^*$) computes the mean of the distribution $p(y^*|x^*,\omega)$, which we define as

$$p(y^*|x^*,\omega) = N(y^*; \sqrt{\frac{1}{K}} \sigma(x^* W_1 + b) W_2, \tau^{-1} I_n)$$
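
To connect this integral to the sample averages used earlier, here is the Monte Carlo step written out. The symbol $q(\omega)$ is notation I'm adding for the approximate distribution mentioned in the summary (the distribution over weights induced by the dropout masks), and $\hat{\omega}_t$ denotes the weights after the $t$-th random mask is applied:

$$p(y^* | x^*, X, Y) \approx \int p(y^*|x^*,\omega) \, q(\omega) \, d\omega \approx \frac{1}{T} \sum_{t=1}^T p(y^*|x^*,\hat{\omega}_t), \qquad \hat{\omega}_t \sim q(\omega)$$

Taking the mean of the right-hand side gives $\frac{1}{T} \sum_{t=1}^T \hat{y}^*_t(x^*)$, since each Gaussian $p(y^*|x^*,\hat{\omega}_t)$ is centred on the output of one stochastic forward pass. The $\tau^{-1} I$ term in the predictive variance comes from the covariance of those same Gaussians.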

Gaussian Process

  • A Gaussian process is completely specified by its mean function and covariance function.

  • The predictions from a GP model take the form of a full predictive distribution

  • A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

  • The random variables represent the value of the function $f(x)$ at location $x$.

  • Often, GPs are defined over time, i.e. the index set of the random variables is time, but that is not normally the case here.

  • Instead, the index set $\mathcal{X}$ is usually the set of all possible inputs, e.g. $\mathbb{R}^D$.

Question: When we say “distribution over functions”, we mean that we put some probability mass on every possible function in the function space of interest. How exactly do we assign that mass?

-> We can see this from the fact that we can draw samples from the distribution of functions evaluated at any number of points; i.e. we choose a number of input points $X_*$ and write out the corresponding covariance matrix elementwise using a pre-defined covariance function. Then we generate a random Gaussian vector with this covariance matrix:

$$f_* \sim N(0, K(X_*, X_*))$$
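
A minimal sketch of this sampling step, under assumptions of my own: one-dimensional inputs, a squared-exponential covariance function, and a small diagonal jitter added for numerical stability (none of which the notes prescribe). The sample is produced as $f_* = L z$, where $L L^\top = K$ is a Cholesky factorisation and $z$ is standard normal. The Cholesky routine is hand-rolled only to keep the example within the standard library.

```python
import math
import random

def squared_exponential(x1, x2, length_scale=1.0):
    """Covariance between function values at scalar inputs x1 and x2."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sample_gp_prior(xs, kernel, rng, jitter=1e-6):
    """Draw one sample f_* ~ N(0, K(xs, xs)) as L z with z ~ N(0, I)."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

xs = [0.5 * i for i in range(8)]  # input points X_*: 0.0, 0.5, ..., 3.5
rng = random.Random(0)
f_star = sample_gp_prior(xs, squared_exponential, rng)
print(f_star)
```

Each call with a fresh random state gives a different function evaluated at the same inputs; plotting several draws against `xs` shows how the length-scale controls how quickly the sampled functions vary.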

(WIP)

Reference:

Yarin Gal’s blog post about the paper

The paper:
“Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”

Textbook: Gaussian Processes for Machine Learning