# Diffusion Models


So far, there are many models for generating images, such as GANs, VAEs, etc., but each has its own drawbacks. For example, GANs are difficult to train and produce less diverse samples, while VAEs struggle to generate high-quality images.

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps that slowly adds noise to the data, then learn to reverse the process, denoising step by step to construct the desired images. The different models are shown in Fig.1.

DDPM (Denoising Diffusion Probabilistic Models) was the first widely adopted diffusion model (Ho et al., 2020). It generates high-quality images, but its inference is slow.

Sample $\mathbf{x}_0$ from a real data distribution $q(\mathbf{x})$, then define a forward diffusion process that adds Gaussian noise over $T$ steps, producing a sequence of noisy images $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T$. $T$ is a hyperparameter, usually set to 1000; as $T\to\infty$, $\mathbf{x}_T$ approaches an isotropic Gaussian distribution. The diffusion process is shown in Fig.2.

Given a variance schedule $\beta_1$, $\beta_2$, $\cdots$, $\beta_T$ with $\beta_t\in(0,1)$, the forward process is defined as follows:

$$q(\mathbf{x}_{1:T}| \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}) \tag{1}$$ $$q(\mathbf{x}_{t}| \mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t ; \sqrt{1-\beta_t}\mathbf{x}_{t-1},\beta_t\mathbf{I}) \tag{2}$$ So one sampled step is $\mathbf{x}_t=\sqrt{1-\beta_t}\mathbf{x}_{t-1}+\sqrt{\beta_t}\epsilon$, where the random noise variables $\{\epsilon_t, \epsilon_t^*\}_{t=0}^{T}\sim\mathcal{N}(\epsilon;0,\mathbf{I})$. Define $\alpha_t:=1-\beta_t$ and $\bar{\alpha}_t:=\prod_{i=1}^{t}\alpha_i$; then we get the following equation: \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1}+\sqrt{1-\alpha_{t}}\epsilon_{t-1}^* \\ &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} +\sqrt{1-\alpha_{t-1}}\epsilon_{t-2}^*\right)+\sqrt{1-\alpha_{t}}\epsilon_{t-1}^* \\ &= \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\epsilon_{t-2}^*+\sqrt{1-\alpha_t}\epsilon_{t-1}^* \\ &= \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{(\alpha_t-\alpha_t\alpha_{t-1})+(1-\alpha_t)}\,\epsilon_{t-2} \text{\ \ \ \ \ merge two Gaussians} \\ &= \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\epsilon_{t-2} \\ &= \cdots \\ &= \textcolor{red}{\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\epsilon_0} \\ \end{aligned} \tag{3} Then we can generate $\mathbf{x}_t$ directly from $\mathbf{x}_0$: $$q(\mathbf{x}_t|\mathbf{x}_{0}) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}) \tag{4}$$ which means we can obtain $\mathbf{x}_t$ directly, given $t$ and $\mathbf{x}_0$; $\mathbf{x}_t$ is explicitly expressed by Equation (3).
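As a concrete illustration, Equation (3)/(4) lets us sample $\mathbf{x}_t$ in closed form without looping over steps. A minimal sketch, assuming a linear $\beta$ schedule; the name `q_sample` is illustrative, not from any library:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # variance schedule beta_1..beta_T
alphas = 1.0 - betas                 # alpha_t := 1 - beta_t
alphas_bar = np.cumprod(alphas)      # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form via Eq. (3)/(4); t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    ab = alphas_bar[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8, 3))  # a toy 3-channel "image"
xt = q_sample(x0, 500, rng)          # one call replaces 500 sequential noising steps
```

For this schedule, `alphas_bar[-1]` is nearly zero, so $\mathbf{x}_T$ is close to pure Gaussian noise, as the text describes.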

Understanding

The sampled noise $\epsilon$ added to the image can be generated from a 3-D standard Gaussian distribution.

Why 3-D?

Because the image is a 3-D array composed of RGB channels, the noise must be 3-D as well.

Repeating this for $T$ steps completes the noising process.

After the noising process, we have a sequence of noisy images $\textcolor{blue}{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_T}$ and the original image $\textcolor{blue}{\mathbf{x}_0}$. Now we need to learn a denoising process to generate new images.

Naturally, we could reverse the above process by sampling from $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$. But this distribution is intractable to estimate, because it depends on the entire dataset, so we instead learn a denoising model $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ to approximate the conditional distribution (Lil’Log, 2021).

The reverse diffusion process is also a Markov chain, with learned Gaussian transitions starting at $p(\mathbf{x}_T)=\mathcal{N}(\mathbf{x}_T; \mathbf{0},\mathbf{I})$:

$$p_{\theta}(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t) \tag{5}$$ $$p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \textcolor{red}{\mu_{\theta}(\mathbf{x}_t,t)}, \Sigma_{\theta}(\mathbf{x}_t,t)) \tag{6}$$

The training objective is to minimize the negative log-likelihood of the data distribution:

$$\mathbb{E}\left[-\log p_{\theta}(\mathbf{x}_0)\right] \leqslant \mathbb{E}_q\left[-\log\frac{p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \right]=\mathbb{E}_q\left[-\log p(\mathbf{x}_T)-\sum_{t\geqslant1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_t|\mathbf{x}_{t-1})} \right] := L \tag{7}$$
Detailed Derivation
\begin{aligned} \log p(\mathbf{x})& \geq\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)\prod_{t=1}^Tp_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{\prod_{t=1}^Tq(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=2}^Tp_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_1|\mathbf{x}_0)\prod_{t=2}^Tq(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=2}^Tp_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_1|\mathbf{x}_0)\prod_{t=2}^Tq(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}+\log\prod_{t=2}^T\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}+\log\prod_{t=2}^T\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)q(\mathbf{x}_t|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)} }\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)}{q(\mathbf{x}_1|\mathbf{x}_0)}+\log\frac{q(\mathbf{x}_1|\mathbf{x}_0)}{q(\mathbf{x}_T|\mathbf{x}_0)}+\log\prod_{t=2}^T\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)}{q(\mathbf{x}_T|\mathbf{x}_0)}+\sum_{t=2}^T\log\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\right]+\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T|\mathbf{x}_0)}\right]+\sum_{t=2}^{T}\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right] \\ &=\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\right]+\mathbb{E}_{q(\mathbf{x}_T|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T|\mathbf{x}_0)}\right]+\sum_{t=2}^T\mathbb{E}_{q(\mathbf{x}_t,\mathbf{x}_{t-1}|\mathbf{x}_0)}\left[\log\frac{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right] \\ &=\underbrace{\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\right]}_{\text{reconstruction term}}-\underbrace{D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0)\parallel p(\mathbf{x}_T))}_{\text{prior matching term}}-\sum_{t=2}^{T}\underbrace{\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\parallel p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t))\right]}_{\text{denoising matching term}} \end{aligned}

A further refinement rewrites the loss function as follows: $$\mathbb{E}_q\left[\underbrace{D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)||p(\mathbf{x}_T))}_{L_T}+\sum_{t\geqslant1}\underbrace{D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)||p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{L_{t-1}}\underbrace{-\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)}_{L_0} \right] \tag{8}$$
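Each KL term in Equation (8) is between Gaussians, so it has a closed form rather than requiring Monte Carlo estimation. A small sketch evaluating the prior matching term $L_T$; the helper name `kl_normal` and the numeric values are illustrative assumptions:

```python
import numpy as np

def kl_normal(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1,var1) || N(mu2,var2)) for diagonal Gaussians,
    summed over dimensions -- the form used for L_T and L_{t-1} in Eq. (8)."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Prior matching term L_T: KL(q(x_T|x_0) || N(0, I)) for a toy x_0.
alpha_bar_T = 4e-5                      # roughly alpha_bar_T for a linear schedule
x0 = np.ones(16)
mu_q = np.sqrt(alpha_bar_T) * x0        # mean of q(x_T | x_0), Eq. (4)
var_q = (1.0 - alpha_bar_T) * np.ones(16)
L_T = kl_normal(mu_q, var_q, np.zeros(16), np.ones(16))
```

Because $\bar{\alpha}_T$ is tiny, $q(\mathbf{x}_T|\mathbf{x}_0)$ is already close to the prior and $L_T$ is nearly zero; this term has no trainable parameters and is usually dropped.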

Thanks to the Markov property, the reverse conditional becomes tractable when conditioned on $\mathbf{x}_0$:

$$q(\mathbf{x}_{t-1}| \mathbf{x}_t, \mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1}; \textcolor{green}{\widetilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)}, \textcolor{orange}{\widetilde{\beta}_t\mathbf{I}}) \tag{9}$$ Using Bayes' rule, we get: \begin{aligned} q(\mathbf{x}_{t-1}| \mathbf{x}_t, \mathbf{x}_0) &= \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)} \tag{10}\\ &\propto \text{exp}\left(-\frac{1}{2}\left(\frac{(\mathbf{x}_t-\sqrt{\alpha_t}\mathbf{x}_{t-1})^2}{\beta_t}+\frac{(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}}-\frac{(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^2}{1-\bar{\alpha}_t}\right)\right) \\ &= \text{exp}\left(-\frac{1}{2}\left(\frac{\mathbf{x}_t^2-2\sqrt{\alpha_t}\mathbf{x}_t\textcolor{green}{\mathbf{x}_{t-1}}+\alpha_t\textcolor{orange}{\mathbf{x}_{t-1}^2}}{\beta_t}+\frac{\textcolor{orange}{\mathbf{x}_{t-1}^2}-2\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0\textcolor{green}{\mathbf{x}_{t-1}}+\bar{\alpha}_{t-1}\mathbf{x}_0^2}{1-\bar{\alpha}_{t-1}}-\frac{(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^2}{1-\bar{\alpha}_t}\right)\right) \\ &= \text{exp}\left(-\frac{1}{2}\left(\textcolor{orange}{\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)\mathbf{x}_{t-1}^2}-\textcolor{green}{\left(\frac{2\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t+\frac{2\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\mathbf{x}_0\right)\mathbf{x}_{t-1}}+C(\mathbf{x}_t, \mathbf{x}_0)\right)\right) \end{aligned}
Detailed Derivation
\begin{aligned} q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)& =\frac{q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)} \\ &=\frac{\mathcal{N}(\mathbf{x}_t;\sqrt{\alpha_t}\mathbf{x}_{t-1},(1-\alpha_t)\mathbf{I})\mathcal{N}(\mathbf{x}_{t-1};\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0,(1-\bar{\alpha}_{t-1})\mathbf{I})}{\mathcal{N}(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I})} \\ &\propto\exp\left\{-\left[\frac{(\mathbf{x}_t-\sqrt{\alpha_t}\mathbf{x}_{t-1})^2}{2(1-\alpha_t)}+\frac{(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^2}{2(1-\bar{\alpha}_{t-1})}-\frac{(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^2}{2(1-\bar{\alpha}_t)}\right]\right\} \\ &=\exp\left\{-\frac12\left[\frac{(\mathbf{x}_t-\sqrt{\alpha_t}\mathbf{x}_{t-1})^2}{1-\alpha_t}+\frac{(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}}-\frac{(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^2}{1-\bar{\alpha}_t}\right]\right\} \\ &=\exp\left\{-\frac12\left[\frac{(-2\sqrt{\alpha_t}\mathbf{x}_t\mathbf{x}_{t-1}+\alpha_t\mathbf{x}_{t-1}^2)}{1-\alpha_t}+\frac{(\mathbf{x}_{t-1}^2-2\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_{t-1}\mathbf{x}_0)}{1-\bar{\alpha}_{t-1}}+C(\mathbf{x}_t,\mathbf{x}_0)\right]\right\} \\ &\propto\exp\left\{-\frac12\left[-\frac{2\sqrt{\alpha_t}\mathbf{x}_t\mathbf{x}_{t-1}}{1-\alpha_t}+\frac{\alpha_t\mathbf{x}_{t-1}^2}{1-\alpha_t}+\frac{\mathbf{x}_{t-1}^2}{1-\bar{\alpha}_{t-1}}-\frac{2\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_{t-1}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}}\right]\right\} \\ &=\exp\left\{-\frac12\left[(\frac{\alpha_t}{1-\alpha_t}+\frac1{1-\bar{\alpha}_{t-1}})\mathbf{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t}\mathbf{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}}\right)\mathbf{x}_{t-1}\right]\right\} \\ 
&=\exp\left\{-\frac12\left[\frac{\alpha_t(1-\bar{\alpha}_{t-1})+1-\alpha_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}\mathbf{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t}\mathbf{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}}\right)\mathbf{x}_{t-1}\right]\right\} \\ &=\exp\left\{-\frac12\left[\frac{\alpha_t-\bar{\alpha}_t+1-\alpha_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}\mathbf{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t}\mathbf{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}}\right)\mathbf{x}_{t-1}\right]\right\} \\ &=\exp\left\{-\frac12\left[\frac{1-\bar{\alpha}_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}\mathbf{x}_{t-1}^2-2\left(\frac{\sqrt{\alpha_t}\mathbf{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}}\right)\mathbf{x}_{t-1}\right]\right\} \\ &=\exp\left\{-\frac12\left(\frac{1-\bar{\alpha}_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}\right)\left[\mathbf{x}_{t-1}^2-2\frac{\left(\frac{\sqrt{\alpha_t}\mathbf{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}}\right)}{\frac{1-\bar{\alpha}_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}}\mathbf{x}_{t-1}\right]\right\} \\ &=\exp\left\{-\frac12\left(\frac{1-\bar{\alpha}_t}{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}\right)\left[\mathbf{x}_{t-1}^2-2\frac{\left(\frac{\sqrt{\alpha_t}\mathbf{x}_t}{1-\alpha_t}+\frac{\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}}\right)(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_{t-1}\right]\right\} \\ &=\exp\left\{-\frac12\left(\frac1{\frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}}\right)\left[\mathbf{x}_{t-1}^2-2\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})\mathbf{x}_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\mathbf{x}_0}{1-\bar{\alpha}_t}\mathbf{x}_{t-1}\right]\right\} \\ 
&\propto\mathcal{N}(\mathbf{x}_{t-1};\underbrace{\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})\mathbf{x}_t+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\mathbf{x}_0}{1-\bar{\alpha}_t}}_{\mu_t(\mathbf{x}_t, \mathbf{x}_0)},\underbrace{\frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{I}}_{\widetilde{\beta}_t}) \end{aligned}

Now, set $\Sigma_{\theta}(\mathbf{x}_t,t)=\sigma_t^2\mathbf{I}$ with time-dependent constants; in practice either $\sigma_t^2=\beta_t$ or $\sigma_t^2=\widetilde{\beta}_t$ is used, and the two choices give similar results.

Following the standard Gaussian density function, the mean and variance can be parameterized as follows:

\begin{aligned} \widetilde{\beta}_t = 1/\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right) = 1/\left(\frac{\alpha_t-\bar{\alpha}_t+\beta_t}{\beta_t(1-\bar{\alpha}_{t-1})} \right) = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\cdot\beta_t \end{aligned} \tag{11} \begin{aligned} \widetilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0) &= \left(\frac{\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\mathbf{x}_0\right)/\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right) \\ &= \left(\frac{\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\mathbf{x}_0\right)\frac{{1-\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_t} \cdot \beta_t \\ &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t+\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 \end{aligned} \tag{12}
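The simplification in Equation (11) can be checked numerically. A quick sketch, assuming a linear $\beta$ schedule; the step index is an arbitrary illustration:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

t = 500                                        # an arbitrary 1-indexed step
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = alphas_bar[t - 1], alphas_bar[t - 2]

# Harmonic form of beta_tilde_t from completing the square (left side of Eq. 11).
beta_tilde_lhs = 1.0 / (a_t / b_t + 1.0 / (1.0 - ab_prev))
# Simplified closed form (right side of Eq. 11).
beta_tilde_rhs = (1.0 - ab_prev) / (1.0 - ab_t) * b_t
```

The two values agree to machine precision, and $\widetilde{\beta}_t$ is always slightly smaller than $\beta_t$, since $1-\bar{\alpha}_{t-1} < 1-\bar{\alpha}_t$.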

From Equation (3), we can get $\mathbf{x}_0=\frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\epsilon_t)$, plug it into Equation (12):

$$\textcolor{red}{\mu_\theta(\mathbf{x}_t, t)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(\mathbf{x}_t,t)\right)} \tag{13}$$

where $\epsilon_\theta(\mathbf{x}_t,t)$ is a learnable model (e.g., a U-Net) that, given $\mathbf{x}_t$ and $t$, predicts the noise $\epsilon^*$.

Training and Sampling

Training the model means taking gradient descent steps on $\nabla_{\theta}\lVert\epsilon-\epsilon_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\epsilon,t)\rVert^2$, where $\mathbf{x}_0\sim q(\mathbf{x}_0)$, $t\sim \text{Uniform}(\{1,\dots, T\})$ and $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
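The loss computation for one training step can be sketched as below; `eps_model` is a hypothetical stand-in for a real noise-prediction network such as a U-Net, not a trained model:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    # Placeholder for epsilon_theta(x_t, t); a real implementation is a neural net.
    return np.zeros_like(x_t)

def training_loss(x0, rng):
    """Sample t ~ Uniform{1..T} and eps ~ N(0, I), then return the simplified
    loss ||eps - eps_theta(sqrt(ab_t) x0 + sqrt(1 - ab_t) eps, t)||^2."""
    t = int(rng.integers(1, T + 1))
    eps = rng.standard_normal(x0.shape)
    ab = alphas_bar[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps   # closed-form noising, Eq. (3)
    return np.sum((eps - eps_model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
loss = training_loss(rng.standard_normal(64), rng)
```

A real trainer would backpropagate this loss through the network parameters; the gradient step itself is omitted here.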

Sampling $\mathbf{x}_{t-1}$ from $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ amounts to computing $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_{\theta}(\mathbf{x}_t,t)\right)+\sigma_t \mathbf{z}$, where $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
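Iterating that update from $t=T$ down to $t=1$ gives the full ancestral sampling loop. A sketch using the $\sigma_t^2=\beta_t$ choice; `eps_model` is again a hypothetical placeholder for a trained network:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def eps_model(x_t, t):
    return np.zeros_like(x_t)              # placeholder for a trained epsilon_theta

def ddpm_sample(shape, rng):
    x = rng.standard_normal(shape)         # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_t, b_t, ab_t = alphas[t - 1], betas[t - 1], alphas_bar[t - 1]
        mean = (x - b_t / np.sqrt(1.0 - ab_t) * eps_model(x, t)) / np.sqrt(a_t)
        z = rng.standard_normal(shape) if t > 1 else 0.0   # no noise on the last step
        x = mean + np.sqrt(b_t) * z        # sigma_t^2 = beta_t
    return x

rng = np.random.default_rng(0)
sample = ddpm_sample((4, 4), rng)
```

Note the loop makes $T$ sequential calls to `eps_model`, which is exactly the slow-inference problem discussed later.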

To summarize: we want to optimize the joint distribution in Equation (5), so we use the log-likelihood to define our target (loss function); after several steps of derivation we arrive at Equation (8), which consists of KL divergences. In particular, the term $L_{t-1}$ is the KL divergence between $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ and $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$, so we use Bayes’ rule to compute the posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ in Equation (10), which takes the Gaussian form of Equation (9) once $\widetilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0)$ is known. Unfortunately, $\mathbf{x}_0$ is unknown when reversing the diffusion process, so we must predict an estimate $\mathbf{x}_0^*$; Equation (13) then shows that predicting $\mathbf{x}_0^*$ is equivalent to predicting the noise $\epsilon^*$.

1. Why predict the noise?

Precisely because the model predicts the noise $\epsilon^*$, the sampling process remains stochastic and produces diverse outputs, which distinguishes it from GANs. If the model directly predicted the image $\mathbf{x}_0^*$, it would behave as a deterministic model.

2. Why use the KL divergence?

Actually, there are many ways to describe the difference between two distributions, such as the Bhattacharyya distance [$D_B\left(p(x),q(x)\right)=-\ln\int\sqrt{p(x)q(x)}dx$]. The Bhattacharyya distance is not only symmetric, but also avoids the infinities that KL divergence can produce. However, we still choose the KL divergence, because it can be written in the desired form as an expectation, which allows us to estimate it by sampling; the Bhattacharyya distance is not so convenient. Moreover, the KL divergence connects directly with the log-likelihood and expectation, which is the training objective (Jianlin, 2018).

3. Why use the Gaussian distribution?

The Gaussian is the most convenient choice: it is easy to sample and reparameterize, sums of independent Gaussians are still Gaussian (which enables the merging step in Equation (3)), and the KL divergence between two Gaussians has a closed form, which keeps the loss in Equation (8) tractable.

4. Can we reduce $T$?

No. According to Equation (3), setting ${\alpha}_t$ close to 1 (but not equal to 1) keeps each step's variance small, so that each $\mathbf{x}_t$ stays as close as possible to the distribution at step $t-1$. The goal of the noising process is to reach an isotropic Gaussian $\mathbf{x}_T$, and only as $T\to\infty$ does $\bar{\alpha}_T \to 0$.
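This can be seen numerically: with the same $\beta$ range but fewer steps, $\bar{\alpha}_T$ stays far from 0, so $\mathbf{x}_T$ still retains signal from $\mathbf{x}_0$. A sketch assuming the linear schedule:

```python
import numpy as np

def alpha_bar_final(T, beta_min=1e-4, beta_max=0.02):
    """Compute alpha_bar_T = prod_t (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)[-1]

ab_1000 = alpha_bar_final(1000)   # nearly 0: x_T is close to pure noise
ab_50 = alpha_bar_final(50)       # far from 0: x_T still carries the image
```

So simply shrinking $T$ without also rescaling the $\beta$ schedule breaks the assumption that $\mathbf{x}_T$ is (approximately) an isotropic Gaussian.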

5. Why does the model predict step by step?

Because the model in Equation (9) is a Markov chain, we cannot speed up inference with 'leapfrog' sampling, which is why the inference process of DDPM is so slow.

After DDPM was proposed, its biggest problem remained the slow inference process mentioned in Q5 above.

DDIM (Denoising Diffusion Implicit Models) addresses this by transforming the sampling (denoising) process from a Markov chain into a non-Markov chain (Song et al., 2022). This is an important exploration: it speeds up inference by roughly $10\times$ to $50\times$ while reusing the same models trained by DDPM, at the cost of some diversity.

Following DDPM, our target is Equation (10), which rests on the Markov property. To remove the Markov assumption, we can generalize Equation (10) to:

\begin{aligned} q(\mathbf{x}_{m}| \mathbf{x}_n, \mathbf{x}_0) = \frac{q(\mathbf{x}_n|\mathbf{x}_{m}, \mathbf{x}_0)q(\mathbf{x}_{m}|\mathbf{x}_0)}{q(\mathbf{x}_n|\mathbf{x}_0)} \tag{14} \end{aligned}

where $m\leq n-1$. The terms $q(\mathbf{x}_m|\mathbf{x}_0)$ and $q(\mathbf{x}_n|\mathbf{x}_0)$ can be calculated via Equations (3)(4) thanks to Gaussian additivity, and the term $q(\mathbf{x}_n|\mathbf{x}_m, \mathbf{x}_0)$ can be ignored because training uses $\mathbf{x}_0 \to \mathbf{x}_t$ directly rather than going through $\mathbf{x}_{t-1}$.

Therefore, we can assume that:

\begin{aligned} q_{\sigma}(\mathbf{x}_{m}| \mathbf{x}_n, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{m}; k\mathbf{x}_0+l\mathbf{x}_n, \sigma^2\mathbf{I}) \tag{15} \end{aligned}

whose mean depends on $\mathbf{x}_n$ and $\mathbf{x}_0$, analogous to $\mathbf{x}_t, \mathbf{x}_0$ in Equation (12). So $\mathbf{x}_m = k\mathbf{x}_0+l\mathbf{x}_n+\sigma\epsilon$, and from Equation (3), $\mathbf{x}_n = \sqrt{\bar{\alpha}_n}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_n}\epsilon^{\prime}$; then we get:

\begin{aligned} \mathbf{x}_m &= k\mathbf{x}_0+l\left(\sqrt{\bar{\alpha}_n}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_n}\epsilon^{\prime}\right)+\sigma\epsilon \\ &= (k+l\sqrt{\bar{\alpha}_n})\mathbf{x}_0+l\sqrt{1-\bar{\alpha}_n}\epsilon^{\prime}+\sigma\epsilon \\ &= (k+l\sqrt{\bar{\alpha}_n})\mathbf{x}_0+\sqrt{l^2(1-\bar{\alpha}_n)+ \sigma^2}\,\epsilon \\ \end{aligned} \tag{16}

The method of undetermined coefficients is applied: requiring $k+l\sqrt{\bar{\alpha}_n} = \sqrt{\bar{\alpha}_m}$ and $l^2(1-\bar{\alpha}_n)+ \sigma^2 = 1-\bar{\alpha}_m$, we get:

\begin{aligned} l &= \frac{\sqrt{1-\bar{\alpha}_m-\sigma^2}}{\sqrt{1-\bar{\alpha}_n}} \\ k &= \sqrt{\bar{\alpha}_m}-\sqrt{1-\bar{\alpha}_m-\sigma^2}\frac{\sqrt{\bar{\alpha}_n}}{\sqrt{1-\bar{\alpha}_n}} \tag{17} \end{aligned}
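The two constraints $k+l\sqrt{\bar{\alpha}_n}=\sqrt{\bar{\alpha}_m}$ and $l^2(1-\bar{\alpha}_n)+\sigma^2=1-\bar{\alpha}_m$ can be solved and checked numerically; the concrete $\bar{\alpha}$ and $\sigma^2$ values below are arbitrary illustrations:

```python
import numpy as np

# Arbitrary but valid values: ab_m > ab_n (less noise at the earlier step m),
# and sigma^2 small enough that 1 - ab_m - sigma^2 > 0.
ab_m, ab_n, sigma2 = 0.6, 0.3, 0.05

# Coefficient of x_n in the mean, then coefficient of x_0 from the first constraint.
l = np.sqrt(1.0 - ab_m - sigma2) / np.sqrt(1.0 - ab_n)
k = np.sqrt(ab_m) - l * np.sqrt(ab_n)
```

Both constraints hold to machine precision, confirming that the mean of $q_\sigma(\mathbf{x}_m|\mathbf{x}_n,\mathbf{x}_0)$ takes the form used in Equation (18).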

So, we can get the Equation (15) as follows:

\begin{aligned} q_{\sigma}(\mathbf{x}_{m}| \mathbf{x}_n, \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_{m}; \sqrt{\bar{\alpha}_m}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_m-\sigma^2}\frac{\mathbf{x}_n-\sqrt{\bar{\alpha}_n}\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_n}}, \sigma^2\mathbf{I}\right) \tag{18} \end{aligned} \begin{aligned} \mathbf{x}_m = \sqrt{\bar{\alpha}_m}\left(\frac{\mathbf{x}_n-\sqrt{1-\bar{\alpha}_n}\epsilon_{\theta}(\mathbf{x}_n)}{\sqrt{\bar{\alpha}_n}}\right)+\sqrt{1-\bar{\alpha}_m-\sigma^2}\epsilon_{\theta}(\mathbf{x}_n) + \sigma\epsilon \tag{19} \end{aligned}
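Equation (19) translates directly into a strided sampler: because $m$ need not equal $n-1$, the loop below visits only a subset of the $T$ steps. A sketch with a hypothetical `eps_model` placeholder; taking $\sigma=0$ gives the deterministic variant:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_model(x, n):
    return np.zeros_like(x)                  # placeholder for a trained DDPM network

def ddim_sample(shape, rng, stride=20, sigma=0.0):
    steps = list(range(T, 0, -stride))       # e.g. 1000, 980, ..., 20: skipped steps
    x = rng.standard_normal(shape)           # x_T ~ N(0, I)
    for i, n in enumerate(steps):
        m = steps[i + 1] if i + 1 < len(steps) else 0
        ab_n = alphas_bar[n - 1]
        ab_m = alphas_bar[m - 1] if m > 0 else 1.0
        eps = eps_model(x, n)
        # Predicted x_0, then the Eq. (19) update from step n to step m.
        x0_pred = (x - np.sqrt(1.0 - ab_n) * eps) / np.sqrt(ab_n)
        x = (np.sqrt(ab_m) * x0_pred
             + np.sqrt(max(1.0 - ab_m - sigma ** 2, 0.0)) * eps
             + sigma * rng.standard_normal(shape))
    return x

rng = np.random.default_rng(0)
img = ddim_sample((4, 4), rng)               # only T/stride = 50 network calls
```

With `stride=20`, the network is called 50 times instead of 1000, which is the source of the speedup claimed above.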

From Equation (19), set $m = t-1$, $n = t$, and choose $\sigma^2 = \widetilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}(1-\alpha_t)$; then we get:

\begin{aligned} \sqrt{1-\bar{\alpha}_{t-1}-\sigma^2}& =\frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{1-\bar{\alpha}_t}}\sqrt{1-\bar{\alpha}_{t-1}-\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}(1-\alpha_t)} \\ &=\frac{\sqrt{(1-\bar{\alpha}_{t-1})(1-\frac1{1-\bar{\alpha}_t}(1-\alpha_t))(1-\bar{\alpha}_t)}}{\sqrt{1-\bar{\alpha}_t}} \\ &=\frac{\sqrt{(1-\bar{\alpha}_{t-1})(1-\bar{\alpha}_t-1+\alpha_t)}}{\sqrt{1-\bar{\alpha}_t}} \\ &=\frac{\sqrt{(1-\bar{\alpha}_{t-1})(\alpha_t-\bar{\alpha}_t)}}{\sqrt{1-\bar{\alpha}_t}} \\ &=\frac{\sqrt{(1-\bar{\alpha}_{t-1})(1-\bar{\alpha}_{t-1})\alpha_t}}{\sqrt{1-\bar{\alpha}_t}} \\ &=\frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}}{\sqrt{1-\bar{\alpha}_t}} \tag{20} \end{aligned} \begin{aligned} \mu_{\theta}(\mathbf{x}_{t-1}) &= \frac1{\sqrt{\alpha_t}}\mathbf{x}_t-\frac1{\sqrt{\alpha_t}}\sqrt{1-\bar{\alpha}_t}\epsilon_\theta(\mathbf{x}_t)+\frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}}{\sqrt{1-\bar{\alpha_t}}}\epsilon_\theta(\mathbf{x}_t) \\ &= \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t-(\frac{1}{\sqrt{\alpha_t}}\sqrt{1-\bar{\alpha}_t}-\frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}}{\sqrt{1-\bar{\alpha}_t}})\epsilon_\theta(\mathbf{x}_t) \\ &= \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t-(\frac{(1-\bar{\alpha}_t)-(1-\bar{\alpha}_{t-1})\alpha_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}})\epsilon_\theta(\mathbf{x}_t) \\ &= \frac1{\sqrt{\alpha_t}}\mathbf{x}_t-(\frac{1-\bar{\alpha}_t-\alpha_t+\bar{\alpha}_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}})\epsilon_\theta(\mathbf{x}_t) \\ &= \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t-(\frac{1-\alpha_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}})\epsilon_\theta(\mathbf{x}_t) \\ &= \textcolor{red}{\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(\mathbf{x}_t)\right)} \tag{21} \end{aligned}
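The identity established in Equation (20) can be checked numerically, with $\sigma^2=\widetilde{\beta}_t$ taken from Equation (11) and an arbitrary step index:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

t = 700                                                  # arbitrary 1-indexed step
a_t = alphas[t - 1]
ab_t, ab_prev = alphas_bar[t - 1], alphas_bar[t - 2]
sigma2 = (1.0 - ab_prev) / (1.0 - ab_t) * (1.0 - a_t)    # beta_tilde_t, Eq. (11)

lhs = np.sqrt(1.0 - ab_prev - sigma2)                    # left side of Eq. (20)
rhs = (1.0 - ab_prev) * np.sqrt(a_t) / np.sqrt(1.0 - ab_t)   # simplified form
```

The agreement confirms that with this choice of $\sigma^2$, the DDIM update collapses to the DDPM posterior mean.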

which is the same as Equation (13) in DDPM. Conversely, because Equation (19) only requires $m\leq n-1$, the sampler may skip steps, and setting $\sigma=0$ makes the denoising process deterministic.

Classifier Guidance & Classifier-free Guidance

Continue…

1. Ho, J., Jain, A., Abbeel, P., 2020. Denoising Diffusion Probabilistic Models. https://doi.org/10.48550/arXiv.2006.11239

2. Luo, C., 2022. Understanding Diffusion Models: A Unified Perspective. https://doi.org/10.48550/arXiv.2208.11970

3. Lil’Log, 2021. What are Diffusion Models? https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

4. Song, J., Meng, C., Ermon, S., 2022. Denoising diffusion implicit models. https://doi.org/10.48550/arXiv.2010.02502

5. 苏剑林 (Su, Jianlin), Mar. 28, 2018. 《变分自编码器（二）：从贝叶斯观点出发》 [Variational Autoencoders (II): From a Bayesian Perspective]. https://kexue.fm/archives/5343