To save time, I read these brilliant lecture notes. I use slightly different notation to stay consistent with previous posts.

Idea: Similar to DDPM, $X_0\sim p_{\rm data}$ and $X_1\sim p_{\rm noise}$, e.g. $\mathcal{N}(0,I)$. Different from DDPM, use a continuous-time process $X_t$ solving an ODE or SDE:

\[\begin{equation} \tag{ODE} dX_t = V^{\theta}_t(X_t)\,dt \end{equation}\] \[\begin{equation} \tag{SDE} dX_t = V^{\theta}_t(X_t)\,dt + \sigma_t\,dW_t \end{equation}\]

$V^{\theta}_t={\rm NN}(\theta,t)$ is a neural network to be learned. $W_t$ denotes a Brownian motion in $\mathbb{R}^d$, where $X_0\in \mathbb{R}^d.$
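To make the setup concrete, here is a minimal sketch (my own illustration, not from the lecture notes) of a time-conditioned network that could play the role of $V^\theta_t$; the class name `VectorField` and the architecture are arbitrary choices, and the same parameterization can later serve as a score network $s^\theta_t$.

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Minimal time-conditioned MLP playing the role of V^theta_t(x) (or s^theta_t(x))."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim), t: (batch,) with values in [0, 1];
        # simplest conditioning: append t as an extra input feature
        return self.net(torch.cat([x, t[:, None]], dim=-1))
```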

We need to find a target vector field \(V^{\rm tgt}_t\) so that if $X_t$ is the flow of $V^{\rm tgt}_t$ then \(X_0\sim p_{\rm data},X_1\sim p_{\rm noise}\). Then set up a loss of the form

\[|V^\theta-V^{\rm tgt}|^2.\]

To find $V^{\rm tgt}$, prescribe a conditional path $p_t(x\mid x_0)$ (defined below) which marginalizes to $p_t(x)$. To sample, solve the ODE or SDE backward in time.

Theory

We write $X\sim p$ if $p$ is the pdf of $X$.

Fokker-Planck Theorem

  • Suppose $dX_t=V_t(X_t)dt$ for some general $V_t$. Then $X_t\sim p_t$ iff.
\[\partial_t p = -{\rm div}(Vp).\]
  • Suppose $dX_t=V_t(X_t)dt+\sigma_t(X_t)dW_t$. Then $X_t\sim p_t$ iff.
\[\partial_t p = \tfrac{1}{2}\Delta(\sigma^2p) - {\rm div}(Vp).\]

We only prove the SDE case. For any test function $f=f(x)$ with compact support, by the Itô formula (see, e.g., my last post),

\[\begin{align*} df(X_t) &=\nabla f\cdot dX_t + \tfrac{1}{2}\langle dX_t, \nabla^2f\cdot dX_t\rangle\\ &= \nabla f\cdot V \,dt + \sigma \nabla f\cdot dW_t + \tfrac{1}{2}\sigma^2 \Delta f \,dt. \end{align*}\]

Recall that the $dW_t$ term contributes a martingale whose expectation is 0.

Integrating by parts, $X_t\sim p_t$ iff.

\[\begin{align*} \int &f(x)p_t(x)\,dx = \mathbf{E} \,f(X_t) = \mathbf{E}\, f(X_0) + \mathbf{E}\int_0^t (\nabla f\cdot V_s +\tfrac{1}{2}\sigma_s^2 \Delta f )(X_s)\,ds\\ &=\mathbf{E}\, f(X_0) +\int_0^tds\int(\nabla f(x)\cdot V_s(x) +\tfrac{1}{2}\sigma^2_s(x) \Delta f(x) ) p_s(x)\,dx \\ &=\mathbf{E}\, f(X_0) + \int_0^tds\int f\,(-{\rm div}(V_sp_s) +\tfrac{1}{2}\Delta(\sigma_s^2p_s))\,dx. \end{align*}\]

The conclusion follows by taking $\partial_t$, since $f$ is arbitrary.
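As a quick sanity check of the SDE case (a standard example, not taken from the notes): for the Ornstein-Uhlenbeck drift $V_t(x)=-x$ and $\sigma_t=\sqrt{2}$, the Fokker-Planck equation reads

\[\partial_t p = \Delta p + {\rm div}(xp) = {\rm div}(\nabla p + xp),\]

and $p=\mathcal{N}(0,I)$ is a stationary solution since $\nabla p=-xp$.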

Conditional path

Now back to flows. The idealized process $X_t\sim p_t$ satisfies $p_0=p_{\rm data}$, $p_1=\mathcal{N}(0,I)$, but such a $p_t$ is not directly accessible since $p_{\rm data}$ is unknown. Instead, we use a conditional path connecting real data to noise: $p_t(x\mid x_0)$ satisfying

\[p_0(x\mid x_0) = \delta_{x_0},\quad p_1(\cdot\mid x_0)=p_{\rm noise}\]

A common choice is Gaussian:

\[p_t(x\mid x_0) = \mathcal{N}(x\mid \alpha_tx_0, \beta_t^2I),\]

where $\alpha_0=\beta_1=1$, $\alpha_1=\beta_0=0$ (or approximately). Equivalently, $X_t=\alpha_t X_0 + \beta_t \varepsilon$ with $\varepsilon\sim\mathcal{N}(0,I)$, which is not too different from DDPM.

We will use the Gaussian path in the rest of this note.

The conditional vector field is

\[\begin{align*} V_t(x\mid x_0) &= \partial_t (\alpha_t x_0+\beta_t\varepsilon) = \dot\alpha_t x_0 + \dot\beta_t\varepsilon = \dot\alpha_t x_0 + \dot\beta_t \beta_t^{-1}(x-\alpha_t x_0), \end{align*}\]

where we substituted $\varepsilon=\beta_t^{-1}(x-\alpha_t x_0)$.
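In code, a concrete instance of the conditional path and its target vector field looks as follows (a sketch assuming the linear schedule $\alpha_t=1-t$, $\beta_t=t$, which satisfies the boundary conditions above; the function names are mine):

```python
import torch

# One concrete schedule with alpha_0 = beta_1 = 1 and alpha_1 = beta_0 = 0
# (the linear choice); any other schedule works the same way.
def alpha(t):  return 1.0 - t
def beta(t):   return t
def dalpha(t): return -torch.ones_like(t)   # d/dt alpha_t
def dbeta(t):  return torch.ones_like(t)    # d/dt beta_t

def sample_conditional_path(x0: torch.Tensor, t: torch.Tensor):
    """Draw X_t = alpha_t X_0 + beta_t eps and return the conditional target
    V_t(X_t | X_0) = alpha'_t X_0 + beta'_t eps."""
    eps = torch.randn_like(x0)
    t_ = t[:, None]                         # broadcast over the data dimension
    xt = alpha(t_) * x0 + beta(t_) * eps
    v_target = dalpha(t_) * x0 + dbeta(t_) * eps
    return xt, v_target, eps
```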

Target vector field

$V^{\rm tgt}$ can be obtained by marginalization:

\[\begin{align*} V^{\rm tgt}_t(x) &= \int V_t(x\mid x_0) p_t(x_0\mid x)\,dx_0 \\ &= \int V_t(x\mid x_0) \frac{p_t(x\mid x_0)p_{\rm data}(x_0)}{p_t(x)}dx_0. \end{align*}\]

If $dX_t=V^{\rm tgt}_t(X_t)\,dt$ with $X_0\sim p_{\rm data}$, we claim that $X_t$ is an idealized/target process.

By FP, it suffices to observe

\[\begin{align*} \partial_t p_t(x) &= \int \partial_t p_t(x\mid x_0)p_{\rm data}(x_0)\,dx_0\\ &= - \int {\rm div}_x(V_t(x\mid x_0)p_t(x\mid x_0)) p_{\rm data}(x_0)\,dx_0 \\ &= -{\rm div}_x \left(p_t(x)\int V_t(x\mid x_0) \frac{p_t(x\mid x_0)p_{\rm data}(x_0)}{p_t(x)}dx_0\right)\\ &= -{\rm div}(V_t^{\rm tgt}p_t)(x). \end{align*}\]

So, we have found $V^{\rm tgt}$ and we then set up a loss to optimize.

Training

Minimize

\[\begin{align*} &\mathbf{E}_{t\sim U,X\sim p_t} |V_t^\theta(X)-V_t^{\rm tgt}(X)|^2 \\ % = & \ \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot\mid X_0)} |V_t^\theta(X)-V_t^{\rm tgt}(X)|^2\\ = &\ \mathbf{E} \left[ |V_t^\theta(X)|^2 - 2 V_t^\theta(X)\cdot V_t^{\rm tgt}(X)\right] + C \\ =&\ C + \mathbf{E}|V_t^\theta(X)|^2 - 2 \int_0^1dt\int p_t(x)\,dx \int V_t^\theta(x)\cdot V_t(x\mid x_0)\frac{p_t(x\mid x_0)p_{\rm data}(x_0)}{p_t(x)}dx_0\\ =&\ C + \mathbf{E}|V_t^\theta(X)|^2 - \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot|X_0)} \left[2V_t^\theta(X)\cdot V_t(X\mid X_0)\right] \\ =&\ C' +\mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot|X_0)} \left|V_t^\theta(X)-V_t(X\mid X_0)\right|^2\\ =:&\ C' + \mathcal{L}_{\rm CFM}(\theta). \end{align*}\]

CFM is Conditional Flow Matching. Explicitly,

\[\mathcal{L}_{\rm CFM} = \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},\varepsilon\sim \mathcal{N}(0,I)} |V_t^\theta(\alpha_t X_0+\beta_t\varepsilon) - \dot\alpha_t X_0 - \dot\beta_t \varepsilon|^2.\]

This is similar to the direct loss of DDPM.
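A minimal training-loss sketch for $\mathcal{L}_{\rm CFM}$, reusing `VectorField` and `sample_conditional_path` from the sketches above (again my own illustration, not the notes' code):

```python
import torch

def cfm_loss(model: "VectorField", x0: torch.Tensor) -> torch.Tensor:
    """Conditional Flow Matching loss:
    E | V^theta_t(alpha_t X0 + beta_t eps) - (alpha'_t X0 + beta'_t eps) |^2."""
    t = torch.rand(x0.shape[0], device=x0.device)      # t ~ Uniform[0, 1]
    xt, v_target, _ = sample_conditional_path(x0, t)
    return ((model(xt, t) - v_target) ** 2).sum(dim=-1).mean()

# Typical usage inside a training loop, with x0 a batch from the data set:
#   loss = cfm_loss(model, x0); loss.backward(); optimizer.step()
```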

Score matching

The score is $s_t(x):=\nabla \log p_t(x)$.

Here is an SDE trick. If $dY_t=V_t(Y_t)\,dt$ and $Y_t\sim p_t$, then for any $\sigma_t\ge 0$ (depending only on time), the SDE

\[dX_t = \left[V_t(X_t)+\tfrac{1}{2}\sigma_t^2s_t(X_t)\right]dt + \sigma_t\,dW_t,\quad X_0\sim p_0,\]

satisfies $X_t\sim p_t$.

By FP and the identity $p\nabla\log p=\nabla p$, it suffices to verify

\[\partial_t p -\tfrac{1}{2}\sigma^2\Delta p + {\rm div}(Vp+\tfrac{1}{2}\sigma^2p\nabla\log p) =\partial_t p + {\rm div}(Vp)=0.\]

Now, instead of matching $V$, we match the score $s_t^{\rm tgt}:=s_t=\nabla\log p_t$ and minimize the following.

\[\begin{align*} \mathbf{E}|s^\theta_t(X_t)- s_t^{\rm tgt}(X_t)|^2 &= C + \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},X\sim p_t(\cdot\mid X_0)} \Big|s^\theta_t(X) - \nabla\log p_t(X\mid X_0)\Big|^2\\ &=:C + \mathcal{L}_{\rm CSM}(\theta), \end{align*}\]

where the argument is the same as above. CSM is Conditional Score Matching.

Explicitly, if $X_t=\alpha_t X_0 + \beta_t \varepsilon,$

\[s_t(X\mid X_0):= \nabla\log p_t(X\mid X_0) = - \beta_t^{-2}(X-\alpha_t X_0) = -\beta^{-1}_t \varepsilon.\]

Then

\[\mathcal{L}_{\rm CSM}(\theta) = \mathbf{E}_{t\sim U,X_0\sim p_{\rm data},\varepsilon\sim \mathcal{N}} \Big| s_t^\theta(\alpha_tX_0+\beta_t\varepsilon) + \beta^{-1}_t \varepsilon \Big|^2.\]

This almost coincides with the DDPM loss.
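Correspondingly, a sketch of $\mathcal{L}_{\rm CSM}$ under the same linear schedule (reusing `alpha`/`beta` from the path sketch above; here $t$ is clamped away from 0 because $\beta_0=0$ makes the target $-\beta_t^{-1}\varepsilon$ blow up):

```python
import torch

def csm_loss(score_model: "VectorField", x0: torch.Tensor, t_min: float = 1e-3) -> torch.Tensor:
    """Conditional Score Matching loss:
    E | s^theta_t(alpha_t X0 + beta_t eps) + eps / beta_t |^2."""
    t = torch.rand(x0.shape[0], device=x0.device).clamp_(min=t_min)
    eps = torch.randn_like(x0)
    t_ = t[:, None]
    xt = alpha(t_) * x0 + beta(t_) * eps
    target = -eps / beta(t_)                 # = grad_x log p_t(x | x0)
    return ((score_model(xt, t) - target) ** 2).sum(dim=-1).mean()
```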

Sampling

We can recover $V$ from $s$:

\[\begin{align*} V_t(x\mid x_0) &= \dot\alpha_t x_0 + \dot\beta_t \varepsilon =\frac{\dot\alpha}{\alpha}(x-\beta\varepsilon) + \dot\beta\varepsilon\\ &=\frac{\dot\alpha}{\alpha}x + \left(\frac{\dot\alpha\beta^2}{\alpha}-\beta\dot\beta\right)s_t(x\mid x_0)\\ &=: a_t x + b_t s_t(x\mid x_0). \end{align*}\]

Here we substituted $\varepsilon=-\beta_t s_t(x\mid x_0)$. Since $\int \nabla_x\log p_t(x\mid x_0)\,p_t(x_0\mid x)\,dx_0=\nabla\log p_t(x)$, the same linear relation holds after marginalization: $V_t^{\rm tgt}(x)=a_tx+b_ts_t(x)$. So after training $s_t^\theta$, set $V_t^\theta(x)=a_tx+b_ts_t^\theta(x).$
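For the linear schedule this conversion is explicit: $a_t=-1/(1-t)$ and $b_t=-t^2/(1-t)-t$. A small sketch (same assumptions and naming as before):

```python
def v_from_score(score_model, x, t):
    # V^theta_t(x) = a_t x + b_t s^theta_t(x) for the linear schedule alpha_t = 1 - t, beta_t = t.
    # Note a_t and b_t are singular at t = 1, so keep t away from that endpoint.
    a = -1.0 / (1.0 - t)
    b = -t**2 / (1.0 - t) - t
    return a[:, None] * x + b[:, None] * score_model(x, t)
```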

Tune $\sigma_t\ge 0$ and solve the backward SDE: for a Brownian motion $\bar W_t$,

\[\bar X_1\sim \mathcal{N}(0,I),\quad d \bar X_t = \left[V_t^\theta(\bar X_t)-\tfrac{1}{2}\sigma_t^2 s_t^\theta(\bar X_t)\right]\,dt + \sigma_t \,d\bar W_t.\]

Any sample of $\bar X_0$ is a generated data point.
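Here is an Euler-Maruyama sketch of this backward SDE (my illustration: constant $\sigma$ for simplicity, endpoints trimmed to avoid the singular schedule ends; `v_model` can be a CFM-trained field or `lambda x, t: v_from_score(score_model, x, t)` from the previous sketch; setting `sigma=0` gives the deterministic ODE sampler):

```python
import torch

@torch.no_grad()
def sample(v_model, score_model, n: int, dim: int, n_steps: int = 500,
           sigma: float = 0.5, t_min: float = 1e-3, device: str = "cpu"):
    """Integrate dX = [V^theta_t(X) - 0.5 sigma^2 s^theta_t(X)] dt + sigma dW
    backward from t ~ 1 to t ~ 0 with the Euler-Maruyama scheme."""
    x = torch.randn(n, dim, device=device)                     # X_1 ~ N(0, I)
    ts = torch.linspace(1.0 - t_min, t_min, n_steps + 1, device=device)
    for i in range(n_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]                       # dt > 0, stepping backward in time
        tb = torch.full((n,), float(t), device=device)
        drift = v_model(x, tb) - 0.5 * sigma ** 2 * score_model(x, tb)
        x = x - drift * dt + sigma * dt.sqrt() * torch.randn_like(x)
    return x                                                   # approximate samples of X_0
```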

In a nutshell, score matching and DDPM have almost the same loss but reconstruct samples differently.