To save time, I read these brilliant lecture notes. I use slightly different notation to stay consistent with previous posts.

Idea: Similar to DDPM, $X_0\sim p_{\rm data}$ and $X_1\sim p_{\rm noise}$, e.g. $\mathcal{N}(0,I)$. Different from DDPM, use a continuous-time process $X_t$ solving an ODE or SDE:

\[\begin{equation} \tag{ODE} dX_t = V^{\theta}_t(X_t)\,dt \end{equation}\] \[\begin{equation} \tag{SDE} dX_t = V^{\theta}_t(X_t)\,dt + \sigma_t\,dW_t \end{equation}\]

$V^{\theta}_t={\rm NN}(\theta,t)$ is a neural network to be learned. $W_t$ denotes a Brownian motion in $\mathbb{R}^d$, where $X_0\in \mathbb{R}^d.$
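To make the setup concrete, here is a minimal sketch (my own illustration, not from the lecture notes) of a time-conditioned network that could play the role of $V^\theta_t$; the class name `VectorField` and the architecture are arbitrary choices, and the same parameterization can later serve as a score network $s^\theta_t$.

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Minimal time-conditioned MLP playing the role of V^theta_t(x) (or s^theta_t(x))."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim), t: (batch,) with values in [0, 1];
        # simplest conditioning: append t as an extra input feature
        return self.net(torch.cat([x, t[:, None]], dim=-1))
```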

We need to find a target vector field \(V^{\rm tgt}_t\) so that if $X_t$ is the flow of $V^{\rm tgt}_t$ then \(X_0\sim p_{\rm data},X_1\sim p_{\rm noise}\). Then set up a loss of the form

\[|V^\theta-V^{\rm tgt}|^2.\]

To find $V^{\rm tgt}$, prescribe a conditional path $p_t(x\mid x_0)$ (defined below) which marginalizes to $p_t(x)$. To sample, solve the ODE or SDE backward in time.

Theory

We write $X\sim p$ if $p$ is the pdf of $X$.

Fokker-Planck Theorem

  • Suppose $dX_t=V_t(X_t)dt$ for some general $V_t$. Then $X_t\sim p_t$ iff.
\[\partial_t p = -{\rm div}(Vp).\]
  • Suppose $dX_t=V_t(X_t)dt+\sigma_t(X_t)dW_t$. Then $X_t\sim p_t$ iff.
\[\partial_t p = \tfrac{1}{2}\Delta(\sigma^2p) - {\rm div}(Vp).\]

We only prove the SDE case. For any test function $f=f(x)$ with compact support, by the Itô formula (see, e.g., my last post),

\[\begin{align*} df(X_t) &=\nabla f\cdot dX_t + \tfrac{1}{2}\langle dX_t, \nabla^2f\cdot dX_t\rangle\\ &= \nabla f\cdot V \,dt + \sigma \nabla f\cdot dW_t + \tfrac{1}{2}\sigma^2 \Delta f \,dt. \end{align*}\]

Recall that the $dW_t$ term contributes a martingale whose expectation is 0.

Integrating by parts, $X_t\sim p_t$ iff.

\[\begin{align*} \int &f(x)p_t(x)\,dx = \mathbf{E} \,f(X_t) = \mathbf{E}\, f(X_0) + \mathbf{E}\int_0^t (\nabla f\cdot V_s +\tfrac{1}{2}\sigma_s^2 \Delta f )(X_s)\,ds\\ &=\mathbf{E}\, f(X_0) +\int_0^tds\int(\nabla f(x)\cdot V_s(x) +\tfrac{1}{2}\sigma^2_s(x) \Delta f(x) ) p_s(x)\,dx \\ &=\mathbf{E}\, f(X_0) + \int_0^tds\int f\,(-{\rm div}(V_sp_s) +\tfrac{1}{2}\Delta(\sigma_s^2p_s))\,dx. \end{align*}\]

The conclusion follows by taking $\partial_t$, since $f$ is arbitrary.
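As a quick sanity check of the SDE case (a standard example, not taken from the notes): for the Ornstein-Uhlenbeck drift $V_t(x)=-x$ and $\sigma_t=\sqrt{2}$, the Fokker-Planck equation reads

\[\partial_t p = \Delta p + {\rm div}(xp) = {\rm div}(\nabla p + xp),\]

and $p=\mathcal{N}(0,I)$ is a stationary solution since $\nabla p=-xp$.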

Conditional path

Now back to flows. The idealized process $X_t\sim p_t$ satisfies $p_0=p_{\rm data}$, $p_1=\mathcal{N}(0,I)$, but such a $p_t$ is not directly accessible since $p_{\rm data}$ is unknown. Instead, we use a conditional path connecting real data to noise: $p_t(x\mid x_0)$ satisfying

\[p_0(x\mid x_0) = \delta_{x_0},\quad p_1(\cdot\mid x_0)=p_{\rm noise}\]

A common choice is Gaussian:

\[p_t(x\mid x_0) = \mathcal{N}(x\mid \alpha_tx_0, \beta_t^2I),\]

where $\alpha_0=\beta_1=1$, $\alpha_1=\beta_0=0$ (or approximately). Equivalently, $X_t=\alpha_t X_0 + \beta_t \varepsilon$ with $\varepsilon\sim\mathcal{N}(0,I)$, which is not too different from DDPM.

We will use the Gaussian path in the rest of this note.

The conditional vector field is

\[\begin{align*} V_t(x\mid x_0) &= \partial_t (\alpha_t x_0+\beta_t\varepsilon) = \dot\alpha_t x_0 + \dot\beta_t\varepsilon = \dot\alpha_t x_0 + \dot\beta_t \beta_t^{-1}(x-\alpha_t x_0), \end{align*}\]

where we substituted $\varepsilon=\beta_t^{-1}(x-\alpha_t x_0)$.
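In code, a concrete instance of the conditional path and its target vector field looks as follows (a sketch assuming the linear schedule $\alpha_t=1-t$, $\beta_t=t$, which satisfies the boundary conditions above; the function names are mine):

```python
import torch

# One concrete schedule with alpha_0 = beta_1 = 1 and alpha_1 = beta_0 = 0
# (the linear choice); any other schedule works the same way.
def alpha(t):  return 1.0 - t
def beta(t):   return t
def dalpha(t): return -torch.ones_like(t)   # d/dt alpha_t
def dbeta(t):  return torch.ones_like(t)    # d/dt beta_t

def sample_conditional_path(x0: torch.Tensor, t: torch.Tensor):
    """Draw X_t = alpha_t X_0 + beta_t eps and return the conditional target
    V_t(X_t | X_0) = alpha'_t X_0 + beta'_t eps."""
    eps = torch.randn_like(x0)
    t_ = t[:, None]                         # broadcast over the data dimension
    xt = alpha(t_) * x0 + beta(t_) * eps
    v_target = dalpha(t_) * x0 + dbeta(t_) * eps
    return xt, v_target, eps
```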

Target vector field

$V^{\rm tgt}$ can be obtained by marginalization:

\[\begin{align*} V^{\rm tgt}_t(x) &= \int V_t(x\mid x_0) p_t(x_0\mid x)\,dx_0 \\ &= \int V_t(x\mid x_0) \frac{p_t(x\mid x_0)p_{\rm data}(x_0)}{p_t(x)}dx_0. \end{align*}\]

If $dX_t=V^{\rm tgt}_t(X_t)\,dt$ with $X_0\sim p_{\rm data}$, we claim that $X_t$ is an idealized/target process.

By FP, it suffices to observe

\[\begin{align*} \partial_t p_t(x) &= \int \partial_t p_t(x\mid x_0)p_{\rm data}(x_0)\,dx_0\\ &= - \int {\rm div}_x(V_t(x\mid x_0)p_t(x\mid x_0)) p_{\rm data}(x_0)\,dx_0 \\ &= -{\rm div}_x \left(p_t(x)\int V_t(x\mid x_0) \frac{p_t(x\mid x_0)p_{\rm data}(x_0)}{p_t(x)}dx_0\right)\\ &= -{\rm div}(V_t^{\rm tgt}p_t)(x). \end{align*}\]

So, we have found $V^{\rm tgt}$ and we then set up a loss to optimize.

Training

Minimize

\[\begin{align*} &\mathbf{E}_{t\sim U,X\sim p_t} |V_t^\theta(X)-V_t^{\rm tgt}(X)|^2 \\ % = & \ \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot\mid X_0)} |V_t^\theta(X)-V_t^{\rm tgt}(X)|^2\\ = &\ \mathbf{E} \left[ |V_t^\theta(X)|^2 - 2 V_t^\theta(X)\cdot V_t^{\rm tgt}(X)\right] + C \\ =&\ C + \mathbf{E}|V_t^\theta(X)|^2 - 2 \int_0^1dt\int p_t(x)\,dx \int V_t^\theta(x)\cdot V_t(x\mid x_0)\frac{p_t(x\mid x_0)p_{\rm data}(x_0)}{p_t(x)}dx_0\\ =&\ C + \mathbf{E}|V_t^\theta(X)|^2 - \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot|X_0)} \left[2V_t^\theta(X)\cdot V_t(X\mid X_0)\right] \\ =&\ C' +\mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot|X_0)} \left|V_t^\theta(X)-V_t(X\mid X_0)\right|^2\\ =:&\ C' + \mathcal{L}_{\rm CFM}(\theta). \end{align*}\]

CFM is Conditional Flow Matching. Explicitly,

\[\mathcal{L}_{\rm CFM} = \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},\varepsilon\sim \mathcal{N}(0,I)} |V_t^\theta(\alpha_t X_0+\beta_t\varepsilon) - \dot\alpha_t X_0 - \dot\beta_t \varepsilon|^2.\]

This is similar to the direct loss of DDPM.
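A minimal training-loss sketch for $\mathcal{L}_{\rm CFM}$, reusing `VectorField` and `sample_conditional_path` from the sketches above (again my own illustration, not the notes' code):

```python
import torch

def cfm_loss(model: "VectorField", x0: torch.Tensor) -> torch.Tensor:
    """Conditional Flow Matching loss:
    E | V^theta_t(alpha_t X0 + beta_t eps) - (alpha'_t X0 + beta'_t eps) |^2."""
    t = torch.rand(x0.shape[0], device=x0.device)      # t ~ Uniform[0, 1]
    xt, v_target, _ = sample_conditional_path(x0, t)
    return ((model(xt, t) - v_target) ** 2).sum(dim=-1).mean()

# Typical usage inside a training loop, with x0 a batch from the data set:
#   loss = cfm_loss(model, x0); loss.backward(); optimizer.step()
```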

Score matching

The score is $s_t(x):=\nabla \log p_t(x)$.

Here is an SDE trick. If $dY_t=V_t(Y_t)\,dt$ and $Y_t\sim p_t$, then for any $\sigma_t\ge 0$ (depending only on time), the SDE

\[dX_t = \left[V_t(X_t)+\tfrac{1}{2}\sigma_t^2s_t(X_t)\right]dt + \sigma_t\,dW_t,\quad X_0\sim p_0,\]

satisfies $X_t\sim p_t$.

By FP and the identity $p\nabla\log p=\nabla p$, it suffices to verify

\[\partial_t p -\tfrac{1}{2}\sigma^2\Delta p + {\rm div}(Vp+\tfrac{1}{2}\sigma^2p\nabla\log p) =\partial_t p + {\rm div}(Vp)=0.\]

Now, instead of matching $V$, we match the score $s_t^{\rm tgt}:=s_t=\nabla\log p_t$ and minimize the following.

\[\begin{align*} \mathbf{E}|s^\theta_t(X_t)- s_t^{\rm tgt}(X_t)|^2 &= C + \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot\mid X_0)} \Big|s^\theta_t(X) - \nabla\log p_t(X\mid X_0)\Big|^2\\ &=:C + \mathcal{L}_{\rm CSM}(\theta), \end{align*}\]

where the argument is the same as above. CSM is Conditional Score Matching.

Explicitly, if $X_t=\alpha_t X_0 + \beta_t \varepsilon,$

\[s_t(X\mid X_0):= \nabla\log p_t(X\mid X_0) = - \beta_t^{-2}(X-\alpha_t X_0) = -\beta^{-1}_t \varepsilon.\]

Then

\[\mathcal{L}_{\rm CSM}(\theta) = \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},\varepsilon\sim \mathcal{N}} \Big| s_t^\theta(\alpha_tX_0+\beta_t\varepsilon) + \beta^{-1}_t \varepsilon \Big|^2.\]

This almost coincides with the DDPM loss.
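Correspondingly, a sketch of $\mathcal{L}_{\rm CSM}$ under the same linear schedule (reusing `alpha`/`beta` from the path sketch above; here $t$ is clamped away from 0 because $\beta_0=0$ makes the target $-\beta_t^{-1}\varepsilon$ blow up):

```python
import torch

def csm_loss(score_model: "VectorField", x0: torch.Tensor, t_min: float = 1e-3) -> torch.Tensor:
    """Conditional Score Matching loss:
    E | s^theta_t(alpha_t X0 + beta_t eps) + eps / beta_t |^2."""
    t = torch.rand(x0.shape[0], device=x0.device).clamp_(min=t_min)
    eps = torch.randn_like(x0)
    t_ = t[:, None]
    xt = alpha(t_) * x0 + beta(t_) * eps
    target = -eps / beta(t_)                 # = grad_x log p_t(x | x0)
    return ((score_model(xt, t) - target) ** 2).sum(dim=-1).mean()
```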

Sampling

We can recover $V$ from $s$:

\[\begin{align*} V_t(x\mid x_0) &= \dot\alpha_t x_0 + \dot\beta_t \varepsilon =\frac{\dot\alpha}{\alpha}(x-\beta\varepsilon) + \dot\beta\varepsilon\\ &=\frac{\dot\alpha}{\alpha}x + \left(\frac{\dot\alpha\beta^2}{\alpha}-\beta\dot\beta\right)s_t(x\mid x_0)\\ &=: a_t x + b_t s_t(x\mid x_0). \end{align*}\]

Here we substituted $\varepsilon=-\beta_t s_t(x\mid x_0)$. Since $\int \nabla_x\log p_t(x\mid x_0)\,p_t(x_0\mid x)\,dx_0=\nabla\log p_t(x)$, the same linear relation holds after marginalization: $V_t^{\rm tgt}(x)=a_tx+b_ts_t(x)$. So after training $s_t^\theta$, set $V_t^\theta(x)=a_tx+b_ts_t^\theta(x).$
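For the linear schedule this conversion is explicit: $a_t=-1/(1-t)$ and $b_t=-t^2/(1-t)-t$. A small sketch (same assumptions and naming as before):

```python
def v_from_score(score_model, x, t):
    # V^theta_t(x) = a_t x + b_t s^theta_t(x) for the linear schedule alpha_t = 1 - t, beta_t = t.
    # Note a_t and b_t are singular at t = 1, so keep t away from that endpoint.
    a = -1.0 / (1.0 - t)
    b = -t**2 / (1.0 - t) - t
    return a[:, None] * x + b[:, None] * score_model(x, t)
```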

Tune $\sigma_t\ge 0$ and solve the backward SDE: for a Brownian motion $\bar W_t$,

\[\bar X_1\sim \mathcal{N}(0,I),\quad d \bar X_t = \left[V_t^\theta(\bar X_t)-\tfrac{1}{2}\sigma_t^2 s_t^\theta(\bar X_t)\right]\,dt + \sigma_t \,d\bar W_t.\]

Any sample of $\bar X_0$ is a generated data point.
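Here is an Euler-Maruyama sketch of this backward SDE (my illustration: constant $\sigma$ for simplicity, endpoints trimmed to avoid the singular schedule ends; `v_model` can be a CFM-trained field or `lambda x, t: v_from_score(score_model, x, t)` from the previous sketch; setting `sigma=0` gives the deterministic ODE sampler):

```python
import torch

@torch.no_grad()
def sample(v_model, score_model, n: int, dim: int, n_steps: int = 500,
           sigma: float = 0.5, t_min: float = 1e-3, device: str = "cpu"):
    """Integrate dX = [V^theta_t(X) - 0.5 sigma^2 s^theta_t(X)] dt + sigma dW
    backward from t ~ 1 to t ~ 0 with the Euler-Maruyama scheme."""
    x = torch.randn(n, dim, device=device)                     # X_1 ~ N(0, I)
    ts = torch.linspace(1.0 - t_min, t_min, n_steps + 1, device=device)
    for i in range(n_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]                       # dt > 0, stepping backward in time
        tb = torch.full((n,), float(t), device=device)
        drift = v_model(x, tb) - 0.5 * sigma ** 2 * score_model(x, tb)
        x = x - drift * dt + sigma * dt.sqrt() * torch.randn_like(x)
    return x                                                   # approximate samples of X_0
```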

In a nutshell, score matching and DDPM have almost the same loss but reconstruct samples differently.