Ch19: Approximate Inference
Notes for myself: Ch19 Approximate Inference http://www.deeplearningbook.org/contents/inference.html
Introduction
Inference = compute $P(h|v)$.
There are many situations where we want to compute the posterior distribution $P(h|v)$ (e.g., sparse coding). Here, $h$ is a set of hidden variables and $v$ is a set of observed variables. In general, when the model is complicated, computing $P(h|v)$ exactly is intractable. This chapter introduces several methods that approximate $P(h|v)$.
19.1 Inference as Optimization
The core idea of approximate inference.
Given a probabilistic model with latent variables $h$ and observed variables $v$, we want to know how good the model is. One measure of goodness is $\log p(v; \theta)$, often called the model evidence or marginal likelihood.
When integrating out $h$ is difficult, we instead seek an alternative. Is there a way to lower-bound $\log p(v; \theta)$?
Consider the following:
$$
L(v, \theta, q) = \log p(v; \theta) - D_{KL}(q(h|v) \,\|\, p(h|v; \theta))
$$
Since the KL divergence is always non-negative, $L(v, \theta, q)$ is a lower bound on $\log p(v; \theta)$ (the evidence lower bound, or ELBO).
By modifying the above equation, we have
$$L(v,\theta,q) = \log p(v;\theta) - E_{h \sim q}\log \frac{q(h|v)}{p(h|v;\theta)} \\ = \log p(v; \theta) - E_{h \sim q}\log\frac{q(h|v)}{\frac{p(v,h; \theta)}{p(v; \theta)}} \\ = \log p(v; \theta) - E_{h\sim q}[\log q(h|v) - \log p(v,h; \theta) + \log p(v; \theta)] \\ = -E_{h\sim q}[\log q(h|v) - \log p(v,h; \theta)] \\ = -E_{h\sim q}[\log q(h|v)] + E_{h \sim q}[\log p(v,h; \theta)] \\ = E_{h \sim q}[\log p(v,h; \theta)] + H(q)$$where $H(q) = -E_{h\sim q}[\log q(h|v)]$. (In the third step, $\log p(v;\theta)$ pulls out of the expectation because it does not depend on $h$, and cancels with the leading $\log p(v;\theta)$.)
For an appropriate choice of $q$, $L$ is tractable.
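As a quick sanity check on the bound, here is a minimal numeric sketch (the joint table below is a made-up toy model, not from the book): for a discrete $h$ we can compute $\log p(v)$ exactly and verify that $L(v,q) \le \log p(v)$ for an arbitrary $q$, with equality at $q = p(h|v)$.

```python
import numpy as np

# Hypothetical toy model: h takes 3 discrete states, v is a fixed observation.
p_vh = np.array([0.1, 0.25, 0.05])   # p(v, h=k) for k = 0, 1, 2

log_pv = np.log(p_vh.sum())          # exact evidence log p(v)

def elbo(q):
    """L(v, q) = E_{h~q}[log p(v,h)] + H(q)."""
    q = np.asarray(q)
    return np.sum(q * (np.log(p_vh) - np.log(q)))

q_arbitrary = np.array([0.5, 0.3, 0.2])
posterior = p_vh / p_vh.sum()        # p(h|v), the optimal q

assert elbo(q_arbitrary) <= log_pv           # any q gives a lower bound
assert np.isclose(elbo(posterior), log_pv)   # bound is tight at q = p(h|v)
```

The gap between `elbo(q)` and `log_pv` is exactly the KL divergence from the first equation above.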
The following sections will show how to derive different forms of approximate inference by using approximate optimization to find $q$.
19.2 Expectation Maximization
(I thought it's easier to view this section through an example, so I'll add some Mixture-of-Gaussians flavor to my notes)
- The goal is to maximize the lower bound $L(v, \theta, q)$. Expanding $L$ as in the previous section gives two terms: $E_{h \sim q}[\log p(v,h;\theta)]$, which depends on both $\theta$ and $q$, and $H(q)$, which depends only on $q$. EM performs coordinate ascent on $L$: the E-step maximizes $L$ w.r.t. $q$ by setting $q(h|v) = p(h|v; \theta)$ (driving the KL term to zero), and the M-step maximizes $L$ w.r.t. $\theta$ with $q$ held fixed. For a mixture of Gaussians, the E-step computes each point's responsibilities and the M-step re-estimates the mixture weights, means, and variances in closed form.
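The alternating updates can be sketched for a 1-D mixture of two Gaussians (the synthetic data, initial values, and iteration count are my own choices for illustration): the E-step sets $q(h|v)$ to the exact responsibilities $p(h|x;\theta)$, and the M-step re-estimates $\theta$ in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians; the component label is the hidden h.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initial guesses for theta = (mixture weights pi, means mu, variances var).
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: q(h|v) = p(h|x; theta), the posterior responsibilities.
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)   # shape (n, 2)
    # M-step: maximize E_q[log p(v, h; theta)] w.r.t. theta, in closed form.
    n_k = r.sum(axis=0)
    pi = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
```

After a few dozen iterations the recovered means should sit near the true values $-2$ and $3$, and the weights near $0.5$ each.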
19.3 MAP Inference and Sparse Coding
(will skip this for the time being)
19.4 Variational Inference and Learning
- The core idea: maximize $L$ over a restricted family of distributions $q$, e.g. the mean-field family, where $q(h|v) = \prod_i q(h_i|v)$ factorizes over the hidden variables.
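A minimal sketch of mean-field coordinate ascent (the $2 \times 2$ joint table is a made-up example with binary $h_1, h_2$ and a fixed $v$): each factor update follows the standard rule $q_j(h_j) \propto \exp(E_{q_{-j}}[\log p(v, h)])$, and the resulting ELBO stays below $\log p(v)$ because the correlated posterior is not in the factorized family.

```python
import numpy as np

# Hypothetical joint p(v, h1, h2) for a fixed observation v; h1, h2 binary.
p_joint = np.array([[0.30, 0.05],
                    [0.05, 0.20]])      # rows index h1, columns index h2
log_p = np.log(p_joint)
log_pv = np.log(p_joint.sum())          # exact evidence

def normalize(logits):
    """Softmax: exponentiate and renormalize a vector of log-scores."""
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Mean-field family q(h1, h2) = q1(h1) q2(h2); coordinate ascent on each factor.
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
for _ in range(100):
    q1 = normalize(log_p @ q2)          # E_{q2}[log p(v, h1, h2)] over h2
    q2 = normalize(q1 @ log_p)          # E_{q1}[log p(v, h1, h2)] over h1

q = np.outer(q1, q2)                    # full factorized distribution
elbo = np.sum(q * (log_p - np.log(q)))
assert elbo <= log_pv                   # still a valid lower bound
```

Because the true posterior here has correlated $h_1, h_2$, the converged ELBO is strictly below $\log p(v)$; the gap is the KL divergence from the best factorized $q$ to the true posterior.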
19.5 Learned Approximate Inference
- The core idea: we can view the iterative optimization that maximizes $L(v, q)$ w.r.t. $q$ as a function $f$ mapping an input $v$ to an approximate distribution $q^* = \arg\max_q L(v, q)$. Once we take this view, we can learn an approximation $f(v; \theta)$ with a neural network (amortized inference).
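A minimal sketch of the idea (the toy model, the logistic form of $f$, and all constants are assumptions for illustration): take $h \sim \mathrm{Bernoulli}(0.5)$ and $v|h \sim N(2h-1, 1)$, so the exact posterior is $p(h{=}1|v) = \sigma(2v)$. A learned inference function $f(v) = \sigma(wv + b)$, trained by gradient ascent on the average ELBO over many observations, should then recover $w \approx 2$, $b \approx 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(0.0, 1.5, 500)            # training observations
# For this model, log p(v, h=1) - log p(v, h=0) = 2v.
log_gap = 2.0 * v

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w, b = 0.0, 0.0                          # parameters of the inference network

for _ in range(5000):
    z = w * v + b
    q = sigmoid(z)                       # q(h=1|v) predicted by the network
    # Per-example ELBO for a binary latent has gradient dL/dz = q(1-q)(log_gap - z).
    g = q * (1.0 - q) * (log_gap - z)
    w += 0.1 * np.mean(g * v)            # ascend the average ELBO
    b += 0.1 * np.mean(g)
```

A single forward pass of the trained $f$ now replaces a per-example optimization over $q$, which is the amortization.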
Ch14 : Autoencoders
14.4 Stochastic Encoders and Decoders
Ch3 : Probability and Information Theory
http://www.deeplearningbook.org/contents/prob.html
3.13 Information Theory
Shannon entropy = $H(x)$ = $-E_{x \sim P}[\log P(x)]$ (also denoted $H(P)$)
In words, the Shannon entropy of a distribution $P$ is the expected amount of information in an event drawn from that distribution.
KL divergence = $D_{KL}(P \| Q)$ = $E_{x \sim P}[\log P(x) - \log Q(x)]$
Cross entropy = $H(P, Q)$ = $H(P) + D_{KL}(P \| Q)$ = $-E_{x \sim P}[\log Q(x)]$
Minimizing the cross-entropy w.r.t. $Q$ is equivalent to minimizing the KL divergence, because $H(P)$ does not depend on $Q$.
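These identities are easy to check numerically (the two distributions below are arbitrary examples):

```python
import numpy as np

# Two hypothetical discrete distributions over the same 3 outcomes.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

H_P = -np.sum(P * np.log(P))                 # Shannon entropy H(P)
D_KL = np.sum(P * (np.log(P) - np.log(Q)))   # D_KL(P || Q)
H_PQ = -np.sum(P * np.log(Q))                # cross entropy H(P, Q)

assert np.isclose(H_PQ, H_P + D_KL)          # H(P, Q) = H(P) + D_KL(P || Q)
assert D_KL >= 0                             # KL divergence is non-negative
```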
Ch5 : Machine Learning Basics
5.5 Maximum Likelihood Estimation