We go over DDPM (Denoising Diffusion Probabilistic Models), a popular generative model introduced by Ho et al. 2020. The underlying math was mainly worked out by Sohl-Dickstein et al. 2015. This post by 苏剑林 is very helpful.

Goal: generate samples $\sim p_{\rm data}(x).$

Idea: $X_0\sim p_{\rm data}\to X_1\to \cdots \to X_T\approx \mathcal{N}(0,I).$

  • Forward/encoding process: $p(x_t\mid x_{t-1})$ gradually destroys structure by adding Gaussian noise with variance schedule $\beta_t^2$; it is not trainable.

  • Reverse/decoding process: $q_\theta(x_{t-1}\mid x_t)$ restores structure and recovers $p_{\rm data}$; the parameters $\theta$ are to be trained.

  • Note: in the DDPM paper, $\beta_t^{\rm DDPM}=\beta_t^2$. Also, $p$ and $q$ are swapped relative to that paper so that the notation matches VAE: $q$ is the decoder.

DDPM assumptions

  • Forward and reverse processes are Markov.

  • \[X_{t-1} \mid X_t \sim \mathcal{N}(\mu_\theta(X_t,t), \sigma_t^2I).\]
  • $X_t=\sqrt{1-\beta_t^2} X_{t-1}+\beta_t\varepsilon_t,$ where $\varepsilon_t\sim \mathcal{N}(0,I).$

Some direct consequences

Fact 1. $p(x_t\mid x_0)$ is an explicit Gaussian: writing $\alpha_t=\sqrt{1-\beta_t^2}$,

\[\begin{align*} X_t &= \alpha_t X_{t-1}+\beta_t\varepsilon_t = \alpha_t\alpha_{t-1}X_{t-2} +\beta_t\varepsilon_t + \alpha_t\beta_{t-1}\varepsilon_{t-1}\\ &= \alpha_t\cdots\alpha_1 X_0 + \sum_{s=1}^t \beta_s \alpha_{s+1}\cdots\alpha_t\varepsilon_s. \end{align*}\]

Since \({\rm Var} \sum_{s=1}^t \beta_s \alpha_{s+1}\cdots\alpha_t\varepsilon_s = \sum_{s=1}^t \beta_s^2\alpha_{s+1}^2\cdots\alpha_t^2 = 1-\alpha_1^2\cdots\alpha_t^2\) (the sum telescopes because $\beta_s^2=1-\alpha_s^2$), if $\bar\alpha_t=\alpha_1\cdots\alpha_t,\bar\beta_t=\sqrt{1-\bar\alpha_t^2}$, then \(X_t \mid X_0 \sim \mathcal{N}(\bar\alpha_t X_0, \bar\beta_t^2 I).\)
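As a sanity check, here is a minimal PyTorch sketch of Fact 1. It assumes (this is my choice, not stated in this post) the DDPM paper's linear schedule for $\beta_t^{\rm DDPM}=\beta_t^2$ from $10^{-4}$ to $0.02$ over $T=1000$ steps; arrays are padded at index 0 so that position $t$ holds the step-$t$ quantity.

```python
# Sketch of Fact 1: X_t | X_0 ~ N(alpha_bar_t X_0, beta_bar_t^2 I).
# Assumes a linear schedule for beta_t^DDPM = beta_t^2, as in the DDPM paper.
import torch

T = 1000
# Pad index 0 so that position t stores the step-t quantity (t = 1..T).
beta_sq = torch.cat([torch.zeros(1), torch.linspace(1e-4, 0.02, T)])  # beta_t^2
alpha = torch.sqrt(1.0 - beta_sq)            # alpha_t = sqrt(1 - beta_t^2), alpha_0 = 1
alpha_bar = torch.cumprod(alpha, dim=0)      # \bar\alpha_t = alpha_1 * ... * alpha_t
beta_bar = torch.sqrt(1.0 - alpha_bar**2)    # \bar\beta_t = sqrt(1 - \bar\alpha_t^2)

def q_sample(x0, t, eps=None):
    """Draw X_t | X_0 in one shot (Fact 1) instead of iterating t forward steps."""
    if eps is None:
        eps = torch.randn_like(x0)
    return alpha_bar[t] * x0 + beta_bar[t] * eps
```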

Fact 2. \(X_{t-1}\mid X_t,X_0\sim \mathcal{N}(\tilde\mu_t(X_t,X_0),\tilde\beta_t^2 I),\) where

\[\tilde\beta_t = \frac{\bar\beta_{t-1}}{\bar\beta_t}\beta_t, \quad \tilde\mu_t(x_t,x_0) = \frac{\bar\alpha_{t-1}\beta_t^2}{\bar\beta_t^2}x_0 + \frac{\alpha_t\bar\beta_{t-1}^2}{\bar\beta_t^2} x_t.\]

This is elementary but messy. By Bayes and $p(x_t\mid x_{t-1},x_0)=p(x_t\mid x_{t-1})$ (Markov),

\[\begin{align*} &\log p(x_{t-1}\mid x_t,x_0) = \log\frac{p(x_{t}\mid x_{t-1})p(x_{t-1}\mid x_0)}{p(x_t\mid x_0)}\\ &= - \frac{1}{2\beta_t^2}\left| x_t - \alpha_t x_{t-1} \right|^2 - \frac{1}{2\bar\beta^2_{t-1}}|x_{t-1}-\bar\alpha_{t-1} x_0|^2 + \frac{1}{2\bar\beta_t^2} |x_t-\bar\alpha_t x_0|^2 + C\\ &= - \frac{\bar\beta_t^2}{2\beta_t^2\bar\beta_{t-1}^2}|x_{t-1}|^2 + x_{t-1}\cdot \left( \frac{\alpha_t}{\beta_t^2}x_t + \frac{\bar\alpha_{t-1}}{\bar\beta_{t-1}^2} x_0 \right) + C(x_t,x_0). \end{align*}\]

Completing the square in $x_{t-1}$ yields the stated $\tilde\mu_t$ and $\tilde\beta_t^2$.
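Continuing the sketch above, the posterior parameters of Fact 2 can be computed directly from the schedule arrays; the index padding makes $t=1$ work out, since $\bar\alpha_0=1$ and $\bar\beta_0=0$.

```python
def posterior_params(x_t, x0, t):
    """Mean and variance of X_{t-1} | X_t, X_0 from Fact 2, for t = 1..T."""
    mu = (alpha_bar[t - 1] * beta_sq[t] / beta_bar[t] ** 2) * x0 \
       + (alpha[t] * beta_bar[t - 1] ** 2 / beta_bar[t] ** 2) * x_t
    var = (beta_bar[t - 1] / beta_bar[t]) ** 2 * beta_sq[t]   # \tilde\beta_t^2
    return mu, var
```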

Training

Writing $z=x_{1:T}$, as in VAE we form the joint $p(x_{0:T})=p_{\rm data}(x_0)\,p(z\mid x_0)$ with marginal $p_{\rm data}(x_0)$. The KL divergence between the joints is

\[\begin{align*} &D_{\rm KL}(p(x_{0:T})\,\|\, q(x_{0:T})) = \int p_{\rm data}(x_0)\,dx_0 \int p(z\mid x_0)\log\frac{p(z\mid x_0)p_{\rm data}(x_0)}{q(x_0,z)}\, dz\\ &= -H(p_{\rm data}) + \mathbf{E}_{p_{\rm data}} \int p(x_T\mid x_{T-1})\cdots p(x_1\mid x_0) \log \frac{p(x_T\mid x_{T-1})\cdots p(x_1\mid x_0)}{q(x_T)\,q(x_{T-1}\mid x_T)\cdots q(x_0\mid x_1)} \,d x_{1:T}. \end{align*}\]

Note: this is different from VAE or EM, since here $p$ goes first in the KL and KL is asymmetric.

As the forward process $p$ is not trainable and $q(x_T)=\mathcal{N}(0,I)$ is fixed, we can absorb the corresponding terms into a constant:

\[\begin{align*} D_{\rm KL}(p(x_{0:T})\,\|\, q(x_{0:T})) =C - \mathbf{E}_{p_{\rm data}} \int p(x_T\mid x_{T-1})\cdots p(x_1\mid x_0) \sum_{t=1}^T \log {q(x_{t-1}\mid x_t)} \,d x_{1:T} \end{align*}\]

Call the $t$-th term of this sum, including the minus sign, $L_t$.

Recall $D_{\rm KL}(p\,\|\,q)=-H(p)-\int p\log q.$ If $p= \mathcal{N}(\mu_1,\sigma_1^2I)$ and $q= \mathcal{N}(\mu_2,\sigma_2^2I)$ in dimension $d$, then

\[-\int p\log q = C + \int \frac{|x-\mu_2|^2}{2\sigma_2^2}\, \mathcal{N}(x\mid \mu_1,\sigma_1^2I)\,dx = C + \frac{d\sigma_1^2+|\mu_1-\mu_2|^2}{2\sigma_2^2}.\]
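For completeness (a standard formula, not spelled out in the original), the full KL between these two isotropic Gaussians in dimension $d$ is

\[D_{\rm KL}\big(\mathcal{N}(\mu_1,\sigma_1^2I)\,\big\|\,\mathcal{N}(\mu_2,\sigma_2^2I)\big) = \frac{d}{2}\log\frac{\sigma_2^2}{\sigma_1^2} + \frac{d\sigma_1^2+|\mu_1-\mu_2|^2}{2\sigma_2^2} - \frac{d}{2},\]

and only the $|\mu_1-\mu_2|^2/(2\sigma_2^2)$ term depends on $\mu_2$; this is the term kept below.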

By marginalization and the identity $p(x_t\mid x_{t-1})\,p(x_{t-1}\mid x_0)=p(x_t\mid x_0)\,p(x_{t-1}\mid x_t,x_0)$,

\[\begin{align*} L_t&= -\mathbf{E}_{p_{\rm data}}\int p(x_t\mid x_{t-1})p(x_{t-1}\mid x_0) \log q(x_{t-1}\mid x_t)\,d x_{t-1}dx_t\\ &= C + \mathbf{E}_p D_{\rm KL}(p(x_{t-1}\mid x_t,x_0)\,\|\, q_\theta(x_{t-1}\mid x_t))\\ &= C' + \frac{1}{2\sigma_t^2}\mathbf{E}_p |\tilde\mu_t(X_t,X_0) - \mu_\theta(X_t,t)|^2. \end{align*}\]

So, in this most straightforward parametrization, we are matching $\mu_\theta$ to $\tilde\mu_t$ in the $L^2$ loss.
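In code (a sketch continuing the ones above, with a hypothetical network `mu_model` that predicts $\mu_\theta(x_t,t)$ directly), this straightforward parametrization would read:

```python
def mu_matching_loss(mu_model, x0, t, sigma_t):
    """L_t up to additive constants: (1 / (2 sigma_t^2)) E |tilde_mu_t - mu_theta|^2."""
    x_t = q_sample(x0, t)                      # X_t | X_0 via Fact 1
    tilde_mu, _ = posterior_params(x_t, x0, t) # posterior mean via Fact 2
    return ((tilde_mu - mu_model(x_t, t)) ** 2).mean() / (2 * sigma_t ** 2)
```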

Parametrizing the noise

We consider a different parametrization. Since $x_t=\bar\alpha_t x_0 + \bar\beta_t\varepsilon$ for some standard Gaussian noise $\varepsilon\sim\mathcal{N}(0,I)$,

\[\tilde \mu_t(x_t,x_0) = \frac{\bar\alpha_{t-1}\beta_t^2}{\bar\beta_t^2}x_0 + \frac{\alpha_t\bar\beta_{t-1}^2}{\bar\beta_t^2} x_t = \frac{\bar\alpha_{t-1}\beta_t^2}{\bar\alpha_t\bar\beta_t^2}(x_t-\bar\beta_t\varepsilon) + \frac{\alpha_t\bar\beta_{t-1}^2}{\bar\beta_t^2} x_t = \frac{x_t}{\alpha_t} - \frac{\beta_t^2\varepsilon}{\alpha_t\bar\beta_t}.\]

The loss becomes

\[\frac{1}{2\sigma_t^2} \mathbf{E}_p \left|\frac{1}{\alpha_t}\left(X_t-\frac{\beta_t^2}{\bar\beta_t}\varepsilon \right) -\mu_\theta(X_t,t) \right|^2,\]

where $X_0\sim p_{\rm data}, \varepsilon\sim \mathcal{N}(0,I),X_t=\bar\alpha_t X_0+\bar\beta_t\varepsilon.$ Accordingly, define

\[\mu_\theta(x_t,t) =\frac{1}{\alpha_t}\left(x_t-\frac{\beta_t^2}{\bar\beta_t}\varepsilon_\theta(x_t,t) \right).\]
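In code (continuing the earlier sketch, with `eps_model` a placeholder network standing in for $\varepsilon_\theta$), this parametrization is simply:

```python
def mu_theta(eps_model, x_t, t):
    """mu_theta(x_t, t) expressed through the noise predictor eps_theta(x_t, t)."""
    return (x_t - (beta_sq[t] / beta_bar[t]) * eps_model(x_t, t)) / alpha[t]
```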

DDPM simply ignores the multiplicative constant $\frac{\beta_t^4}{2\sigma_t^2\alpha_t^2\bar\beta_t^2}$ and the loss is

\[\boxed{ \mathbf{E} \left|\varepsilon - \varepsilon_\theta(\bar\alpha_t X_0+\bar\beta_t\varepsilon,t) \right|^2}\]

where $t\sim {\rm Unif}\{1,\dots,T\}$, $X_0\sim p_{\rm data}$, $\varepsilon\sim \mathcal{N}(0,I).$ Minimizing this loss by backprop/SGD is Algorithm 1 in their paper.
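A sketch of this training objective in the style of Algorithm 1, reusing the schedule arrays above; `eps_model` is again a placeholder noise-prediction network whose architecture is not specified here.

```python
def ddpm_loss(eps_model, x0):
    """Monte Carlo estimate (up to a constant factor) of the boxed loss, for a batch x0."""
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,))                     # t ~ Unif{1, ..., T}
    eps = torch.randn_like(x0)
    shape = (b,) + (1,) * (x0.dim() - 1)                  # broadcast per-sample scalars
    x_t = alpha_bar[t].view(shape) * x0 + beta_bar[t].view(shape) * eps
    return ((eps - eps_model(x_t, t)) ** 2).mean()
```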

Sampling

Algorithm 2:

\[X_T,Z_{T},\cdots,Z_{2}\text{ iid } \sim \mathcal{N}(0,I), \quad Z_1=0, \quad X_{t-1}=\mu_\theta(X_t,t)+\sigma_t Z_t.\]

The final iterate $X_0$ is the generated sample.
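A sketch of this sampler using the pieces above. Here $\sigma_t$ is taken to be $\tilde\beta_t$, one of the two choices discussed in the DDPM paper ($\sigma_t=\beta_t$ being the other); `eps_model` is the trained noise predictor.

```python
@torch.no_grad()
def ddpm_sample(eps_model, shape):
    x = torch.randn(shape)                                         # X_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)  # Z_1 = 0
        sigma_t = beta_bar[t - 1] / beta_bar[t] * beta_sq[t] ** 0.5  # sigma_t = tilde_beta_t
        x = mu_theta(eps_model, x, t) + sigma_t * z  # t is passed to the placeholder model as an int
    return x                                                       # the generated sample X_0
```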