Ch19: Approximate Inference

Notes for myself: Ch19 Approximate Inference http://www.deeplearningbook.org/contents/inference.html

Introduction

  • Inference = compute $P(h|v)$.

  • There are many situations where you want to compute the posterior distribution $P(h|v)$ (e.g., sparse coding). Here, $h$ is a set of hidden variables and $v$ is a set of observed variables. In general, when the model is complicated, computing $P(h|v)$ exactly is intractable. This chapter introduces several approaches to the inference problem that approximate $P(h|v)$.

19.1 Inference as Optimization

  • The core idea of approximate inference.

  • Given a probabilistic model with latent variables $h$ and observed variables $v$, we want to measure how well it explains the data. One such measure is $\log p(v; \theta)$, often called the model evidence or marginal likelihood.

  • When integrating out $h$ is intractable, we look for an alternative. Is there a way to lower-bound $\log p(v; \theta)$?

  • Consider the following:
    $$
    L(v, \theta, q) = \log p(v; \theta) - D_{KL}\big(q(h|v) \,\|\, p(h|v; \theta)\big)
    $$

  • Since the KL divergence is always non-negative, $L(v, \theta, q)$ is a lower bound on $\log p(v; \theta)$, known as the evidence lower bound (ELBO).

  • Expanding the KL term, we have

    $$
    \begin{aligned}
    L(v,\theta,q) &= \log p(v;\theta) - E_{h \sim q}\left[\log \frac{q(h|v)}{p(h|v;\theta)}\right] \\
    &= \log p(v;\theta) - E_{h \sim q}\left[\log \frac{q(h|v)}{p(v,h;\theta)/p(v;\theta)}\right] \\
    &= \log p(v;\theta) - E_{h \sim q}\left[\log q(h|v) - \log p(v,h;\theta) + \log p(v;\theta)\right] \\
    &= -E_{h \sim q}\left[\log q(h|v) - \log p(v,h;\theta)\right] \\
    &= E_{h \sim q}[\log p(v,h;\theta)] + H(q)
    \end{aligned}
    $$

    where $H(q) = -E_{h \sim q}[\log q(h|v)]$. The leading $\log p(v;\theta)$ cancels in the fourth line because it is constant under $E_{h \sim q}$.
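As a numerical sanity check on this identity, here is a minimal sketch assuming a toy conjugate model $h \sim \mathcal{N}(0,1)$, $v|h \sim \mathcal{N}(h,1)$ with a Gaussian $q$ (none of this model is from the book; it is chosen so the exact posterior and evidence are known). It estimates $L = E_{h \sim q}[\log p(v,h)] + H(q)$ by sampling from $q$:

```python
import numpy as np
from scipy.stats import norm

# Toy model (an illustrative assumption, not from the book):
#   p(h)   = N(h; 0, 1)           prior over the latent
#   p(v|h) = N(v; h, 1)           likelihood
#   q(h|v) = N(h; mu_q, sigma_q)  Gaussian approximate posterior

def elbo_estimate(v, mu_q, sigma_q, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.normal(mu_q, sigma_q, size=n_samples)            # h ~ q
    log_joint = norm.logpdf(h, 0, 1) + norm.logpdf(v, h, 1)  # log p(v, h)
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma_q**2)    # H(q), closed form
    return log_joint.mean() + entropy

v = 1.5
# For this conjugate model the exact posterior is N(v/2, 1/2), so q at
# those values maximizes L, and L should then equal log p(v) = log N(v; 0, 2).
print(elbo_estimate(v, mu_q=v / 2, sigma_q=np.sqrt(0.5)))
print(norm.logpdf(v, 0, np.sqrt(2)))  # log p(v), for comparison
```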

  • For an appropriate choice of $q$, $L$ is tractable.

  • The following sections will show how to derive different forms of approximate inference by using approximate optimization to find $q$.

19.2 Expectation Maximization

(I find it easier to view this section through an example, so I'll add some Mixture-of-Gaussians flavor to these notes.)

  • The goal is to maximize the lower bound $L(v, \theta, q)$. In the expansion above there are two things to optimize: the parameters $\theta$, which appear in $E_{h \sim q}[\log p(v,h;\theta)]$, and the distribution $q$. EM alternates between the E-step, which maximizes $L$ w.r.t. $q$ by setting $q(h|v) = p(h|v;\theta)$ (exactly computable for a mixture of Gaussians), and the M-step, which maximizes $L$ w.r.t. $\theta$ with $q$ fixed, as in the sketch below.
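To give the promised Mixture-of-Gaussians taste in code: a minimal EM sketch for a 1-D two-component mixture (the synthetic data and all parameter values are illustrative assumptions, not from the book). The E-step computes the exact posterior responsibilities $q(h|v) = p(h|v;\theta)$; the M-step re-estimates $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (illustrative only)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

K = 2
pi = np.full(K, 1 / K)      # mixing weights
mu = np.array([-1.0, 1.0])  # component means
var = np.ones(K)            # component variances

def log_gauss(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

for _ in range(50):
    # E-step: q(h|x) = p(h|x; theta), the exact posterior over assignments
    log_r = np.log(pi) + log_gauss(x[:, None], mu, var)  # shape (N, K)
    log_r -= log_r.max(axis=1, keepdims=True)            # for numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)                    # responsibilities

    # M-step: maximize E_{h~q}[log p(x, h; theta)] w.r.t. theta
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)  # should recover roughly (0.3, 0.7), (-2, 3), (1, 1)
```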

19.3 MAP Inference and Sparse Coding

(will skip this for the time being)

19.4 Variational Inference and Learning

  • The core idea: maximize $L$ over a restricted family of distributions $q$.
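The book's canonical example of such a restricted family is the mean-field approach, where $q$ factorizes over the individual hidden variables:

$$
q(h|v) = \prod_i q(h_i|v)
$$

This turns maximizing $L$ into a coordinate-ascent problem over the individual factors $q(h_i|v)$.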

19.5 Learned Approximate Inference

  • The core idea: we can view the iterative optimization that maximizes $L(v, q)$ w.r.t. $q$ as a function $f$ that maps an input $v$ to an approximate distribution $q^* = \arg\max_q L(v, q)$. Once we take this view, we can learn an approximation to $f$ with a neural network $f(v; \theta)$.
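A minimal sketch of this amortized-inference idea, assuming a hypothetical one-hidden-layer encoder that outputs the mean and log-variance of a Gaussian $q(h|v)$, VAE-style (the architecture and all names here are my illustration, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder f(v; theta): maps v to the parameters of q(h|v),
# replacing the per-example iterative maximization of L(v, q).
D_v, D_hid, D_h = 10, 32, 4
W1 = rng.normal(0, 0.1, (D_hid, D_v))
b1 = np.zeros(D_hid)
W2 = rng.normal(0, 0.1, (2 * D_h, D_hid))
b2 = np.zeros(2 * D_h)

def encode(v):
    """One forward pass: v -> (mu, sigma) of a Gaussian q(h|v)."""
    hidden = np.tanh(W1 @ v + b1)
    out = W2 @ hidden + b2
    mu, log_var = out[:D_h], out[D_h:]
    return mu, np.exp(0.5 * log_var)

v = rng.normal(size=D_v)
mu, sigma = encode(v)
h_sample = mu + sigma * rng.normal(size=D_h)  # a sample h ~ q(h|v)
```

In practice the encoder weights are trained by gradient ascent on $L$ itself, so inference at test time costs a single forward pass.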

Ch14: Autoencoders

14.4 Stochastic Encoders and Decoders

Ch3: Probability and Information Theory

http://www.deeplearningbook.org/contents/prob.html

3.13 Information Theory

Shannon entropy = $H(x)$ = $-E_{x \sim P}[\log P(x)]$ (also denoted $H(P)$)

In words, the Shannon entropy of a distribution $P$ is the expected amount of information in an event drawn from that distribution.

KL divergence = $D_{KL}(P \| Q)$ = $E_{x \sim P}[\log P(x) - \log Q(x)]$

Cross-entropy = $H(P, Q)$ = $H(P) + D_{KL}(P \| Q)$ = $-E_{x \sim P}[\log Q(x)]$

Minimizing the cross-entropy w.r.t. $Q$ is equivalent to minimizing the KL divergence, since $H(P)$ does not depend on $Q$.
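These identities are easy to verify numerically; a minimal sketch with two arbitrary discrete distributions (values chosen purely for illustration):

```python
import numpy as np

# Two arbitrary discrete distributions (illustrative values)
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

H_P = -np.sum(P * np.log(P))              # Shannon entropy H(P)
KL = np.sum(P * (np.log(P) - np.log(Q)))  # D_KL(P || Q)
H_PQ = -np.sum(P * np.log(Q))             # cross-entropy H(P, Q)

assert np.isclose(H_PQ, H_P + KL)  # H(P,Q) = H(P) + D_KL(P||Q)
print(H_P, KL, H_PQ)
```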

Ch5: Machine Learning Basics

5.5 Maximum Likelihood Estimation

-