ELBO for diffusion models

I was always told that I needed to know what an ELBO was, yet most of the models I actually trained never showed one. Diffusion models are different: there, the variational lower bound is tied directly to how the training objective is derived. So, in this review, we will:

Derive the ELBO from first principles
Connect the ELBO to diffusion models

Introduction

Modern Bayesian statistics is about approximating posteriors that are not easy to compute. Variational inference is a method that approximates difficult-to-compute probability densities through optimization, rather than through sampling. Let’s formalize the general problem of ‘modern Bayesian statistics’. Consider a joint density of latent variables $z = z_{1:m}$ and observations $x = x_{1:n}$, $p(z, x) = p(z) p(x \mid z)$.

That is, draw the latent variables from a prior density $p(z)$ and relate them to the observations through the likelihood $p(x \mid z)$. Inference in a Bayesian model then amounts to conditioning on the data and computing the posterior $p(z \mid x)$. For contrast with variational methods, let’s first get into the details of a sampling-based method for approximating the posterior.

MCMC (sampling-based) methods for approximate inference

First, construct an ergodic Markov chain on $z$ whose stationary distribution is the posterior $p(z \mid x)$. Ergodic here means that the chain is irreducible (every state of $z$ can eventually be reached from every other state) and aperiodic (it does not get trapped in repeating cycles). If the chain has transition matrix $P$, a stationary distribution $π$ satisfies $π = π P$. We cannot sample from $p(z \mid x)$ directly, but we can evaluate it pointwise up to a normalizing constant, since $p(z \mid x) \propto p(x \mid z) p(z)$. MCMC turns this pointwise evaluator into a sampler: the histogram of the sampled points converges to $p(z \mid x)$. (So in some sense, after sampling you can construct a categorical distribution if you want.)

Algorithm MCMC (Metropolis-Hastings)

1. From the current state z, propose z' from some proposal distribution q(z'|z), e.g. a Gaussian centered at z
2. Compute the acceptance ratio
   acceptance = min(1, [p(x|z') p(z') q(z|z')] / [p(x|z) p(z) q(z'|z)])
3. Accept z' with probability acceptance; otherwise stay at z

# the numerator uses the reverse proposal q(z|z') and the denominator uses the forward proposal q(z'|z)
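The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions that are not in the text: a toy model with $z \sim N(0, 1)$ and $x \mid z \sim N(z, 1)$, a symmetric Gaussian proposal (so the $q$ terms cancel in the ratio), and hypothetical names `log_unnormalized_posterior` and `metropolis_hastings`.

```python
import math
import random

def log_unnormalized_posterior(z, x):
    """log p(x | z) + log p(z) for the assumed toy model, up to constants."""
    log_prior = -0.5 * z * z                # z ~ N(0, 1)
    log_likelihood = -0.5 * (x - z) ** 2    # x | z ~ N(z, 1)
    return log_prior + log_likelihood

def metropolis_hastings(x, n_samples=20000, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings targeting p(z | x)."""
    rng = random.Random(seed)
    z = 0.0
    samples = []
    for _ in range(n_samples):
        # 1. Propose z' from a Gaussian centered at the current state.
        z_new = rng.gauss(z, step)
        # 2. Log acceptance ratio. The proposal is symmetric, so
        #    q(z | z') / q(z' | z) = 1 and only the target terms remain.
        log_ratio = (log_unnormalized_posterior(z_new, x)
                     - log_unnormalized_posterior(z, x))
        # 3. Accept with probability min(1, ratio); otherwise stay at z.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            z = z_new
        samples.append(z)
    return samples
```

For this toy model the exact posterior is $N(x/2, 1/2)$, so after discarding an initial burn-in the sample mean and variance should land near those values. Note that the normalizing constant $p(x)$ is never needed.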

So, what’s the issue with MCMC? The chain needs many steps before its samples represent the posterior well, and we would like to approximate $p(z \mid x)$ faster than that. Let’s see how we can optimize for the posterior instead. Assume a family $F$ of approximate densities over the latent variables. Then, find the member of that family that minimizes the KL divergence to the exact posterior, i.e., $q^{*}(z) = \arg\min_{q(z) \in F} \mathrm{KL}(q(z) \| p(z \mid x))$.

Here are the quantities we’ll be working with.

$p(z \mid x) = \frac{p(z, x)}{p(x)}$
$p(x)$, the marginal density of the observations, is called the evidence
$p(x) = \int p(z, x) \, dz$ is intractable

Let’s get familiar with the evidence lower bound

The variational inference objective is $q^{*}(z) = \arg\min_{q(z) \in F} \mathrm{KL}(q(z) \| p(z \mid x))$.

Recall the formula for the KL divergence (written here for discrete distributions; for densities the sum becomes an integral):

$D_{\mathrm{KL}}(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$

Now, let’s expand the KL divergence.

$\mathrm{KL}(q(z) \| p(z \mid x)) = \mathbb{E}[\log q(z)] - \mathbb{E}[\log p(z \mid x)] = \mathbb{E}[\log q(z)] - \mathbb{E}[\log p(z, x)] + \log p(x)$

(All expectations are taken with respect to $q(z)$. We can drop the expectation around $\log p(x)$ because it does not depend on $z$.)

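Now, let’s name the negative of the first two terms the ELBO:

$\mathrm{ELBO}(q) = \mathbb{E}[\log p(z, x)] - \mathbb{E}[\log q(z)]$

Rearranging the expansion above gives $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q(z) \| p(z \mid x))$. Since a KL divergence is never negative, the ELBO lower bounds $\log p(x)$, the evidence. And since $\log p(x)$ does not depend on $q$, maximizing the ELBO over $q$ is the same as minimizing the KL divergence to the posterior.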
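To make the bound concrete, here is a small numerical check on a discrete model. The numbers are made up for illustration (a latent $z$ with three states and one fixed observation $x$), and the helper names `elbo` and `kl` are hypothetical. The sketch verifies that the gap between the log evidence and the ELBO equals the KL divergence to the exact posterior.

```python
import math

# Assumed toy model: z takes values in {0, 1, 2}; x is one fixed observation.
prior = [0.5, 0.3, 0.2]        # p(z)
likelihood = [0.1, 0.6, 0.3]   # p(x | z), evaluated at the observed x

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(z, x)
evidence = sum(joint)                                   # p(x)
posterior = [j / evidence for j in joint]               # p(z | x)

def elbo(q):
    """E_q[log p(z, x)] - E_q[log q(z)] for a discrete q."""
    return sum(qz * (math.log(j) - math.log(qz))
               for qz, j in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q = [0.2, 0.5, 0.3]                   # an arbitrary approximate posterior
gap = math.log(evidence) - elbo(q)    # should equal kl(q, posterior)
```

Plugging in `q = posterior` closes the gap entirely, which is the sense in which maximizing the ELBO recovers the exact posterior when the family $F$ contains it.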

Here’s an alternative way to write the ELBO.

$\mathrm{ELBO}(q) = \mathbb{E}[\log p(z, x)] - \mathbb{E}[\log q(z)] = \mathbb{E}[\log p(x \mid z)] + \mathbb{E}[\log p(z)] - \mathbb{E}[\log q(z)] = \mathbb{E}[\log p(x \mid z)] - \mathrm{KL}(q(z) \| p(z))$
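The "expected log-likelihood minus KL to the prior" form can be evaluated in closed form for a simple Gaussian model. The sketch below assumes a model that is not in the text, $z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$ and $q(z) = N(m, s^2)$, and uses hypothetical function names; it is meant only to show the two terms side by side.

```python
import math

def elbo_gaussian(x, m, s):
    """ELBO = E_q[log p(x | z)] - KL(q(z) || p(z)) for the assumed model."""
    # E_q[log N(x; z, 1)] with z ~ N(m, s^2)
    expected_log_lik = (-0.5 * math.log(2 * math.pi)
                        - 0.5 * ((x - m) ** 2 + s ** 2))
    # KL(N(m, s^2) || N(0, 1))
    kl_to_prior = 0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))
    return expected_log_lik - kl_to_prior

def log_evidence(x):
    """log p(x); marginally x ~ N(0, 2) in this model."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0
```

In this model the exact posterior is $N(x/2, 1/2)$, so calling `elbo_gaussian(x, x / 2, math.sqrt(0.5))` should match `log_evidence(x)`, while any other choice of `m` and `s` gives a strictly smaller value.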

Connection of ELBO to Diffusion Models

Let’s map some terms from variational inference to diffusion model terminology.

$x_{0}$, the clean latent, is the observation
$z = x_{1:T}$ are the latents
$p_{θ}(x_{0} \mid x_{1})$, the step that produces the clean latent, is the ‘likelihood’
$p_{θ}(x_{t-1} \mid x_{t})$ are latent-to-latent transitions. These can be thought of as analogous to the prior $p(z)$
$q(z)$ is the noising process $q(x_{1:T} \mid x_{0})$
- $q(x_{t} \mid x_{t-1}) = N(x_{t}; \sqrt{1 - β_{t}} \, x_{t-1}, β_{t} I)$
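As a sanity check on the noising kernel, the sketch below pushes the mean and variance of a scalar $x_0$ through the Gaussian steps $x_t = \sqrt{1 - β_t} \, x_{t-1} + \sqrt{β_t} \, ε$ and compares the result with the closed-form marginal $q(x_t \mid x_0) = N(\sqrt{\bar{α}_t} \, x_0, (1 - \bar{α}_t))$, where $\bar{α}_t = \prod_{s \leq t} (1 - β_s)$. The linear schedule from 1e-4 to 0.02 over 1000 steps is an assumption borrowed from common practice, not something the text specifies, and the function names are hypothetical.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """An assumed linear noise schedule beta_1, ..., beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def marginal_by_recursion(x0, betas):
    """Propagate mean and variance of x_t through each forward step."""
    mean, var = x0, 0.0
    for beta in betas:
        mean = math.sqrt(1.0 - beta) * mean   # mean is scaled by sqrt(1 - beta)
        var = (1.0 - beta) * var + beta       # old variance shrinks, beta is added
    return mean, var

def marginal_closed_form(x0, betas):
    """Mean and variance of q(x_T | x_0) using the cumulative product alpha_bar."""
    alpha_bar = math.prod(1.0 - b for b in betas)
    return math.sqrt(alpha_bar) * x0, 1.0 - alpha_bar
```

Both routes agree, and under this schedule the final variance is very close to 1 with a mean close to 0, which is why $x_T$ can be treated as nearly pure noise.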

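What is learned in a diffusion model is the entire generative model, i.e. the parameters $θ$ of every reverse transition.

$p_{θ}(x_{0:T}) = p(x_{T}) \prod_{t=1}^{T} p_{θ}(x_{t-1} \mid x_{t})$

Notice that we’re optimizing $p$ while $q$ is the fixed Gaussian Markov process. This is the reverse of the variational inference setup above, where $p$ was given and $q$ was optimized.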

How is each step of the diffusion process aligned?

Let’s start with this form of the ELBO.

$\log p(x) = \mathrm{ELBO} + \mathrm{KL}(q(z) \| p(z \mid x)) \geq \mathrm{ELBO}$

$\log p(x) \geq \mathbb{E}[\log p(x, z)] - \mathbb{E}[\log q(z)]$

$\log p(x_{0}) \geq \mathbb{E}_{q(x_{1:T} \mid x_{0})}[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_{0})}] := L$ is the diffusion version

Per-step generative: $p(x_{0:T}) = p(x_{T}) \prod_{t=1}^{T} p(x_{t-1} \mid x_{t})$

Per-step forward: $q(x_{1:T} \mid x_{0}) = \prod_{t=1}^{T} q(x_{t} \mid x_{t-1})$

When we write out $L$, notice that we can’t yet align the $p$ and $q$ transition kernels, since they go in opposite directions: $p$ steps from $x_{t}$ to $x_{t-1}$, while $q$ steps from $x_{t-1}$ to $x_{t}$.

$L = \mathbb{E}[\log p(x_{T}) + \sum_{t=1}^{T} \log p(x_{t-1} \mid x_{t}) - \sum_{t=1}^{T} \log q(x_{t} \mid x_{t-1})]$

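We can use Bayes’ rule to invert the direction of $q$. The Markov property means that we can additionally condition on $x_{0}$ without changing the kernel, $q(x_{t} \mid x_{t-1}) = q(x_{t} \mid x_{t-1}, x_{0})$, so for $t \geq 2$ we have $q(x_{t} \mid x_{t-1}) = q(x_{t-1} \mid x_{t}, x_{0}) \frac{q(x_{t} \mid x_{0})}{q(x_{t-1} \mid x_{0})}$. We’ll split off the $t = 1$ factor $q(x_{1} \mid x_{0})$, which cancels when the ratios telescope.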

$\prod_{t=1}^{T} q(x_{t} \mid x_{t-1}) = q(x_{1} \mid x_{0}) \prod_{t=2}^{T} q(x_{t} \mid x_{t-1})$
$= q(x_{1} \mid x_{0}) \prod_{t=2}^{T} q(x_{t-1} \mid x_{t}, x_{0}) \frac{q(x_{t} \mid x_{0})}{q(x_{t-1} \mid x_{0})}$
$= q(x_{1} \mid x_{0}) \frac{q(x_{T} \mid x_{0})}{q(x_{1} \mid x_{0})} \prod_{t=2}^{T} q(x_{t-1} \mid x_{t}, x_{0})$
$= q(x_{T} \mid x_{0}) \prod_{t=2}^{T} q(x_{t-1} \mid x_{t}, x_{0})$
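The Bayes inversion of a single forward step can be checked numerically for scalar Gaussians. The sketch below assumes the forward kernel $q(x_t \mid x_{t-1}) = N(\sqrt{α_t} \, x_{t-1}, β_t)$ with $α_t = 1 - β_t$, and uses the standard closed form for the Gaussian posterior $q(x_{t-1} \mid x_t, x_0)$, which is stated here without derivation. The function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def check_bayes_inversion(x0, x_prev, x_t, beta_t, alpha_bar_prev):
    """Return both sides of q(x_t|x_prev) = q(x_prev|x_t,x0) q(x_t|x0) / q(x_prev|x0)."""
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alpha_t * alpha_bar_prev
    # Posterior q(x_{t-1} | x_t, x_0): Gaussian with these moments.
    post_var = beta_t * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    post_mean = (math.sqrt(alpha_bar_prev) * beta_t * x0
                 + math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) * x_t) / (1.0 - alpha_bar_t)
    # Left-hand side: the forward kernel.
    forward = normal_pdf(x_t, math.sqrt(alpha_t) * x_prev, beta_t)
    # Right-hand side: posterior times the ratio of marginals given x_0.
    inverted = (normal_pdf(x_prev, post_mean, post_var)
                * normal_pdf(x_t, math.sqrt(alpha_bar_t) * x0, 1.0 - alpha_bar_t)
                / normal_pdf(x_prev, math.sqrt(alpha_bar_prev) * x0, 1.0 - alpha_bar_prev))
    return forward, inverted
```

The two returned values should agree at any choice of points, which is the per-factor identity that the telescoping product relies on.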

$L = \mathbb{E}[\log p(x_{T}) + \log p(x_{0} \mid x_{1}) + \sum_{t=2}^{T} \log p(x_{t-1} \mid x_{t}) - \sum_{t=2}^{T} \log q(x_{t-1} \mid x_{t}, x_{0}) - \log q(x_{T} \mid x_{0})]$
$= \mathbb{E}[\log p(x_{0} \mid x_{1}) - \sum_{t=2}^{T} D_{\mathrm{KL}}(q(x_{t-1} \mid x_{t}, x_{0}) \| p(x_{t-1} \mid x_{t})) - D_{\mathrm{KL}}(q(x_{T} \mid x_{0}) \| p(x_{T}))]$

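Here the first term is the reconstruction term, the middle terms align each learned reverse step $p(x_{t-1} \mid x_{t})$ with the posterior $q(x_{t-1} \mid x_{t}, x_{0})$ at the same time step (this is what becomes noise prediction), and the last term aligns the noise prior $p(x_{T})$ with $q(x_{T} \mid x_{0})$. Maximizing $L$ means maximizing the reconstruction term while minimizing the KL terms.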
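Because every distribution in the KL terms is Gaussian, each term has a closed form. As a rough sketch for the scalar case, the code below implements the KL divergence between two univariate Gaussians and evaluates the prior-matching term $D_{\mathrm{KL}}(q(x_T \mid x_0) \| p(x_T))$ with $p(x_T) = N(0, 1)$. The standard normal prior and the linear schedule (1e-4 to 0.02, 1000 steps) are assumptions for illustration, and the function names are hypothetical.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for scalar Gaussians with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def prior_matching_term(x0, betas):
    """KL(q(x_T | x_0) || N(0, 1)) using the closed-form marginal of q."""
    alpha_bar_T = math.prod(1.0 - b for b in betas)
    return kl_gauss(math.sqrt(alpha_bar_T) * x0, 1.0 - alpha_bar_T, 0.0, 1.0)

# Assumed linear schedule; not specified in the text.
betas = [1e-4 + (0.02 - 1e-4) * t / 999 for t in range(1000)]
```

With a schedule like this the prior-matching term is tiny and contains no learnable parameters, so the training signal comes from the reconstruction term and the per-step KL terms, where the same `kl_gauss` formula would apply with the model's predicted mean and variance in the second slot.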

Sources

https://arxiv.org/abs/1601.00670

Ann He
