class DDPMScheduler(SchedulerMixin, ConfigMixin):
                        """
                        `DDPMScheduler` explores the connections between denoising score matching and Langevin dynamics sampling.

                        This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. Check the superclass documentation for the generic
                        methods the library implements for all schedulers such as loading and saving.

                        Args:
                            num_train_timesteps (`int`, defaults to 1000):
                                The number of diffusion steps to train the model.
                            beta_start (`float`, defaults to 0.0001):
                                The starting `beta` value of inference.
                            beta_end (`float`, defaults to 0.02):
                                The final `beta` value.
                            beta_schedule (`str`, defaults to `"linear"`):
                                The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from
                                `linear`, `scaled_linear`, `squaredcos_cap_v2`, or `sigmoid`.
                            trained_betas (`np.ndarray`, *optional*):
                                An array of betas to pass directly to the constructor without using `beta_start` and `beta_end`.
                            variance_type (`str`, defaults to `"fixed_small"`):
                                Clip the variance when adding noise to the denoised sample. Choose from `fixed_small`, `fixed_small_log`,
                                `fixed_large`, `fixed_large_log`, `learned` or `learned_range`.
                            clip_sample (`bool`, defaults to `True`):
                                Clip the predicted sample for numerical stability.
                            clip_sample_range (`float`, defaults to 1.0):
                                The maximum magnitude for sample clipping. Valid only when `clip_sample=True`.
                            prediction_type (`str`, defaults to `epsilon`, *optional*):
                                Prediction type of the scheduler function; can be `epsilon` (predicts the noise of the diffusion process),
                                `sample` (directly predicts the noisy sample) or `v_prediction` (see section 2.4 of the [Imagen
                                Video](https://imagen.research.google/video/paper.pdf) paper).
                            thresholding (`bool`, defaults to `False`):
                                Whether to use the "dynamic thresholding" method. This is unsuitable for latent-space diffusion models such
                                as Stable Diffusion.
                            dynamic_thresholding_ratio (`float`, defaults to 0.995):
                                The ratio for the dynamic thresholding method. Valid only when `thresholding=True`.
                            sample_max_value (`float`, defaults to 1.0):
                                The threshold value for dynamic thresholding. Valid only when `thresholding=True`.
                            timestep_spacing (`str`, defaults to `"leading"`):
                                The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and
                                Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) paper for more information.
                            steps_offset (`int`, defaults to 0):
                                An offset added to the inference steps. You can use a combination of `offset=1` and
                                `set_alpha_to_one=False` to make the last step use step 0 for the previous alpha product like in Stable
                                Diffusion.
                            rescale_betas_zero_snr (`bool`, defaults to `False`):
                                Whether to rescale the betas to have zero terminal SNR. This enables the model to generate very bright and
                                dark samples instead of limiting it to samples with medium brightness. Loosely related to
                                [`--offset_noise`](https://github.com/huggingface/diffusers/blob/74fd735eb073eb1d774b1ab4154a0876eb82f055/examples/dreambooth/train_dreambooth.py#L506).
                        """

                        _compatibles = [e.name for e in KarrasDiffusionSchedulers]
                        order = 1

                

DDPMScheduler code walkthrough


                    @register_to_config
                    def __init__(
                        self,
                        num_train_timesteps: int = 1000,
                        beta_start: float = 0.0001,
                        beta_end: float = 0.02,
                        beta_schedule: str = "linear",
                        trained_betas: Optional[Union[np.ndarray, List[float]]] = None,
                        variance_type: str = "fixed_small",
                        clip_sample: bool = True,
                        prediction_type: str = "epsilon",
                        thresholding: bool = False,
                        dynamic_thresholding_ratio: float = 0.995,
                        clip_sample_range: float = 1.0,
                        sample_max_value: float = 1.0,
                        timestep_spacing: str = "leading",
                        steps_offset: int = 0,
                        rescale_betas_zero_snr: bool = False,
                    ):
                        if trained_betas is not None:
                            self.betas = torch.tensor(trained_betas, dtype=torch.float32)
                        elif beta_schedule == "linear":
                            self.betas = torch.linspace(beta_start, beta_end, num_train_timesteps, dtype=torch.float32)
                        elif beta_schedule == "scaled_linear":
                            # this schedule is very specific to the latent diffusion model.
                            self.betas = torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
                        elif beta_schedule == "squaredcos_cap_v2":
                            # Glide cosine schedule
                            self.betas = betas_for_alpha_bar(num_train_timesteps)
                        elif beta_schedule == "sigmoid":
                            # GeoDiff sigmoid schedule
                            betas = torch.linspace(-6, 6, num_train_timesteps)
                            self.betas = torch.sigmoid(betas) * (beta_end - beta_start) + beta_start
                        else:
                            raise NotImplementedError(f"{beta_schedule} is not implemented for {self.__class__}")

                

  Let \( \bf x_0 \) denote an image. We repeatedly add a small amount of noise to the current image; after \( T \) steps this yields \( T \) intermediate images \( \bf x_1, \bf x_2, \cdots, \bf x_T \). Noise is added according to \( \bf x_t = \alpha_t \bf x_{t-1} + \beta_t \bf \epsilon_t \), where \( \bf \epsilon_t \sim \mathcal N(0,\mit I) \) and \( \alpha_t^2 + \beta_t^2 = 1, \alpha_t, \beta_t > 0 \). To keep each noising step small, we sample from a standard normal distribution and scale by a small coefficient \( \beta_t \).
  This is why \( \beta_t \) is kept small, with its range restricted to [0.0001, 0.02].
  The code provides several concrete ways to construct the beta sequence; the standalone sketch below computes three of them.
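
  A standalone sketch (the constants are assumptions mirroring the defaults above, not part of the scheduler):

    import torch

    num_train_timesteps, beta_start, beta_end = 1000, 0.0001, 0.02

    # "linear": interpolate betas directly between beta_start and beta_end
    linear = torch.linspace(beta_start, beta_end, num_train_timesteps)

    # "scaled_linear": interpolate sqrt(beta) and square it (the Stable Diffusion schedule)
    scaled_linear = torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps) ** 2

    # "sigmoid": squash a linspace through a sigmoid, then rescale into [beta_start, beta_end]
    sigmoid = torch.sigmoid(torch.linspace(-6, 6, num_train_timesteps)) * (beta_end - beta_start) + beta_start

    print(linear[0].item(), linear[-1].item())                # 0.0001 ... 0.02
    print(scaled_linear[0].item(), scaled_linear[-1].item())  # same endpoints, but smaller early betas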



                # Rescale for zero SNR
                if rescale_betas_zero_snr:
                    self.betas = rescale_zero_terminal_snr(self.betas)

            

  This addresses the fact that the last step of the DDPM forward (noising) process does not quite reach zero SNR. rescale_zero_terminal_snr is one fix: it rescales the betas so that the final noising step yields pure noise with no remaining trace of the original data. See the paper https://arxiv.org/pdf/2305.08891.pdf for details.
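
  For reference, a sketch following Algorithm 1 of that paper (diffusers ships its own rescale_zero_terminal_snr; this version only illustrates the idea): shift and rescale \( \sqrt{\bar{\alpha}_t} \) so its terminal value becomes exactly zero, then recover per-step betas:

    import torch

    def rescale_zero_terminal_snr_sketch(betas: torch.Tensor) -> torch.Tensor:
        # Illustrative reimplementation of Algorithm 1 from arXiv:2305.08891.
        alphas_bar_sqrt = torch.cumprod(1.0 - betas, dim=0).sqrt()

        # Shift so the last sqrt(alpha_bar) becomes 0, then rescale so the first is unchanged.
        first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
        alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)

        # Convert the rescaled cumulative products back to per-step alphas and betas.
        alphas_bar = alphas_bar_sqrt**2
        alphas = torch.cat([alphas_bar[0:1], alphas_bar[1:] / alphas_bar[:-1]])
        return 1.0 - alphas

    betas = torch.linspace(0.0001, 0.02, 1000)
    new_betas = rescale_zero_terminal_snr_sketch(betas)
    print(torch.cumprod(1 - new_betas, dim=0)[-1])  # tensor(0.) => zero terminal SNR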


        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.one = torch.tensor(1.0)
                
  For consistency between the rest of this walkthrough and the code, we restate the formula as
\( \bf x_t = \sqrt{\alpha_t} \bf x_{t-1} + \sqrt{\beta_t} \bf \epsilon_t, \quad \alpha_t + \beta_t = 1 \)
Unrolling the recursion gives
\( \begin{aligned} \bf x_t & = \sqrt{\alpha_t} \bf x_{t-1} + \sqrt{\beta_t} \epsilon_t \\ & = \sqrt{\alpha_t} (\sqrt{\alpha_{t-1}} \bf x_{t-2} + \sqrt{\beta_{t-1}} \epsilon_{t-1}) + \sqrt{\beta_t} \epsilon_t \\ & = \cdots \\ & = \sqrt{(\alpha_t\cdots\alpha_1)}\,\bf x_0 + \sqrt{(\alpha_t\cdots\alpha_2)\beta_1}\,\epsilon_1 + \sqrt{(\alpha_t\cdots\alpha_3)\beta_2}\,\epsilon_2 + \cdots + \sqrt{\alpha_t\beta_{t-1}}\,\epsilon_{t-1} + \sqrt{\beta_t}\,\epsilon_t \end{aligned} \)
Apart from the \( \bf x_0 \) term, every term in the last line is an independent Gaussian noise, so their sum is equivalent to a single sample from a normal distribution with mean 0 and variance \( (\alpha_t\cdots\alpha_2)\beta_1 + (\alpha_t\cdots\alpha_3)\beta_2 + \cdots + \alpha_t\beta_{t-1} + \beta_t \). Since \( \alpha_t\cdots\alpha_1 + (\alpha_t\cdots\alpha_2)\beta_1 + (\alpha_t\cdots\alpha_3)\beta_2 + \cdots + \alpha_t\beta_{t-1} + \beta_t = 1 \), we can rewrite this as
\( \bf x_t = \sqrt{\alpha_t\cdots\alpha_1}\, \bf x_0 + \sqrt{1-\alpha_t\cdots\alpha_1}\, \bf{\bar \epsilon_t}, \quad \bar \epsilon_t \sim \mathcal N(0, I) \)
or
\( \bf x_t = \sqrt{\bar{\alpha}_t} \bf x_0 + \sqrt{1-\bar{\alpha}_t} \bf{\bar \epsilon_t} \), where \( \bar{\alpha}_t = \alpha_t\cdots\alpha_1 \)

This fully accounts for self.alphas, self.betas, and self.alphas_cumprod in the code.
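
  As a quick numerical sanity check of the closed form (a standalone sketch): noising step by step should match the single-jump formula in distribution:

    import torch

    torch.manual_seed(0)
    betas = torch.linspace(0.0001, 0.02, 1000)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    t = 500
    x0 = torch.randn(100_000)  # treat each element as an independent pixel

    # Iterative noising: x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) eps_t
    x = x0.clone()
    for i in range(t + 1):
        x = alphas[i].sqrt() * x + betas[i].sqrt() * torch.randn_like(x)

    # Closed form: x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps
    x_direct = alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * torch.randn_like(x0)

    # Both retain the same fraction of x_0: correlation with x0 should be sqrt(alpha_bar_t)
    print(alphas_cumprod[t].sqrt())
    print(torch.corrcoef(torch.stack([x, x0]))[0, 1], torch.corrcoef(torch.stack([x_direct, x0]))[0, 1])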

                # standard deviation of the initial noise distribution
                self.init_noise_sigma = 1.0
        
                # setable values
                self.custom_timesteps = False
                self.num_inference_steps = None
                self.timesteps = torch.from_numpy(np.arange(0, num_train_timesteps)[::-1].copy())
        
                self.variance_type = variance_type
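
  A minimal usage sketch of the constructor, assuming diffusers is installed:

    from diffusers import DDPMScheduler

    scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule="linear")
    print(scheduler.betas.shape)         # torch.Size([1000])
    print(scheduler.alphas_cumprod[-1])  # terminal alpha_bar: close to, but not exactly, 0
    print(scheduler.timesteps[:3])       # tensor([999, 998, 997])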
            

                def set_timesteps(
                    self,
                    num_inference_steps: Optional[int] = None,
                    device: Union[str, torch.device] = None,
                    timesteps: Optional[List[int]] = None,
                ):
                    """
                    Sets the discrete timesteps used for the diffusion chain (to be run before inference).
            
                    Args:
                        num_inference_steps (`int`):
                            The number of diffusion steps used when generating samples with a pre-trained model. If used,
                            `timesteps` must be `None`.
                        device (`str` or `torch.device`, *optional*):
                            The device to which the timesteps should be moved. If `None`, the timesteps are not moved.
                        timesteps (`List[int]`, *optional*):
                            Custom timesteps used to support arbitrary spacing between timesteps. If `None`, then the default
                            timestep spacing strategy of equal spacing between timesteps is used. If `timesteps` is passed,
                            `num_inference_steps` must be `None`.
            
                    """

            

            if num_inference_steps is not None and timesteps is not None:
                raise ValueError("Can only pass one of `num_inference_steps` or `custom_timesteps`.")
    
            if timesteps is not None:
                for i in range(1, len(timesteps)):
                    if timesteps[i] >= timesteps[i - 1]:
                        raise ValueError("`custom_timesteps` must be in descending order.")
    
                if timesteps[0] >= self.config.num_train_timesteps:
                    raise ValueError(
                        f"`timesteps` must start before `self.config.train_timesteps`:"
                        f" {self.config.num_train_timesteps}."
                    )
    
                timesteps = np.array(timesteps, dtype=np.int64)
                self.custom_timesteps = True
            else:
                if num_inference_steps > self.config.num_train_timesteps:
                    raise ValueError(
                        f"`num_inference_steps`: {num_inference_steps} cannot be larger than `self.config.train_timesteps`:"
                        f" {self.config.num_train_timesteps} as the unet model trained with this scheduler can only handle"
                        f" maximal {self.config.num_train_timesteps} timesteps."
                    )
    
                self.num_inference_steps = num_inference_steps
                self.custom_timesteps = False
            

                # "linspace", "leading", "trailing" corresponds to annotation of Table 2. of https://arxiv.org/abs/2305.08891
            if self.config.timestep_spacing == "linspace":
                timesteps = (
                    np.linspace(0, self.config.num_train_timesteps - 1, num_inference_steps)
                    .round()[::-1]
                    .copy()
                    .astype(np.int64)
                )
            elif self.config.timestep_spacing == "leading":
                step_ratio = self.config.num_train_timesteps // self.num_inference_steps
                # creates integer timesteps by multiplying by ratio
                # casting to int to avoid issues when num_inference_steps is a power of 3
                timesteps = (np.arange(0, num_inference_steps) * step_ratio).round()[::-1].copy().astype(np.int64)
                timesteps += self.config.steps_offset
            elif self.config.timestep_spacing == "trailing":
                step_ratio = self.config.num_train_timesteps / self.num_inference_steps
                # creates integer timesteps by multiplying by ratio
                # casting to int to avoid issues when num_inference_steps is a power of 3
                timesteps = np.round(np.arange(self.config.num_train_timesteps, 0, -step_ratio)).astype(np.int64)
                timesteps -= 1
            else:
                raise ValueError(
                    f"{self.config.timestep_spacing} is not supported. Please make sure to choose one of 'linspace', 'leading' or 'trailing'."
                )

        self.timesteps = torch.from_numpy(timesteps).to(device)
            
  Following the paper Common Diffusion Noise Schedules and Sample Steps are Flawed: when no explicit timesteps are given and only num_inference_steps is specified, the code offers three spacing strategies: linspace, leading, and trailing, illustrated by the sketch below.
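
  A standalone sketch of the three strategies for num_train_timesteps=1000 and num_inference_steps=10 (expected values in the comments):

    import numpy as np

    T, N = 1000, 10  # num_train_timesteps, num_inference_steps

    # "linspace": spread evenly over [0, T-1], hitting both endpoints
    linspace = np.linspace(0, T - 1, N).round()[::-1].copy().astype(np.int64)
    # [999 888 777 666 555 444 333 222 111   0]

    # "leading": multiples of T // N starting from 0 (never reaches the final training step 999)
    leading = (np.arange(0, N) * (T // N))[::-1].copy().astype(np.int64)
    # [900 800 700 600 500 400 300 200 100   0]

    # "trailing": step back from T in increments of T / N, then shift by -1
    trailing = np.round(np.arange(T, 0, -T / N)).astype(np.int64) - 1
    # [999 899 799 699 599 499 399 299 199  99]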

                def add_noise(
                    self,
                    original_samples: torch.FloatTensor,
                    noise: torch.FloatTensor,
                    timesteps: torch.IntTensor,
                ) -> torch.FloatTensor:
                    # Make sure alphas_cumprod and timestep have same device and dtype as original_samples
                    # Move the self.alphas_cumprod to device to avoid redundant CPU to GPU data movement
                    # for the subsequent add_noise calls
                    self.alphas_cumprod = self.alphas_cumprod.to(device=original_samples.device)
                    alphas_cumprod = self.alphas_cumprod.to(dtype=original_samples.dtype)
                    timesteps = timesteps.to(original_samples.device)
            
                    sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5
                    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
                    while len(sqrt_alpha_prod.shape) < len(original_samples.shape):
                        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)
            
                    sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5
                    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
                    while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape):
                        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)
            
                    noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
                    return noisy_samples
            
This computes the value \( x_t \) at an arbitrary timestep \( t \) using the formula
\( \bf x_t = \sqrt{\alpha_t\cdots\alpha_1}\, \bf x_0 + \sqrt{1-\alpha_t\cdots\alpha_1}\, \bf{\bar \epsilon_t} \)
or
\( \bf x_t = \sqrt{\bar{\alpha}_t} \bf x_0 + \sqrt{1-\bar{\alpha}_t} \bf{\bar \epsilon_t} \)
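
A minimal training-style usage sketch of add_noise (shapes and values are illustrative assumptions):

    import torch
    from diffusers import DDPMScheduler

    scheduler = DDPMScheduler(num_train_timesteps=1000)

    clean_images = torch.randn(4, 3, 64, 64)  # stand-in for a training batch
    noise = torch.randn_like(clean_images)    # epsilon ~ N(0, I)
    timesteps = torch.randint(0, 1000, (4,))  # one random t per sample

    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
    print(noisy_images.shape)  # torch.Size([4, 3, 64, 64])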

                def step(
                    self,
                    model_output: torch.FloatTensor,
                    timestep: int,
                    sample: torch.FloatTensor,
                    generator=None,
                    return_dict: bool = True,
                ) -> Union[DDPMSchedulerOutput, Tuple]:
                    """
                    Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
                    process from the learned model outputs (most often the predicted noise).
            
                    Args:
                        model_output (`torch.FloatTensor`):
                            The direct output from learned diffusion model.
                        timestep (`int`):
                            The current discrete timestep in the diffusion chain.
                        sample (`torch.FloatTensor`):
                            A current instance of a sample created by the diffusion process.
                        generator (`torch.Generator`, *optional*):
                            A random number generator.
                        return_dict (`bool`, *optional*, defaults to `True`):
                            Whether or not to return a [`~schedulers.scheduling_ddpm.DDPMSchedulerOutput`] or `tuple`.
            
                    Returns:
                        [`~schedulers.scheduling_ddpm.DDPMSchedulerOutput`] or `tuple`:
                            If return_dict is `True`, [`~schedulers.scheduling_ddpm.DDPMSchedulerOutput`] is returned, otherwise a
                            tuple is returned where the first element is the sample tensor.
            
                    """
                    t = timestep
            
                    prev_t = self.previous_timestep(t)
            
                    if model_output.shape[1] == sample.shape[1] * 2 and self.variance_type in ["learned", "learned_range"]:
                        model_output, predicted_variance = torch.split(model_output, sample.shape[1], dim=1)
                    else:
                        predicted_variance = None
            
                    # 1. compute alphas, betas
                    alpha_prod_t = self.alphas_cumprod[t]
                    alpha_prod_t_prev = self.alphas_cumprod[prev_t] if prev_t >= 0 else self.one
                    beta_prod_t = 1 - alpha_prod_t
                    beta_prod_t_prev = 1 - alpha_prod_t_prev
                    current_alpha_t = alpha_prod_t / alpha_prod_t_prev
                    current_beta_t = 1 - current_alpha_t
            
                    # 2. compute predicted original sample from predicted noise also called
                    # "predicted x_0" of formula (15) from https://arxiv.org/pdf/2006.11239.pdf
                    if self.config.prediction_type == "epsilon":
                        pred_original_sample = (sample - beta_prod_t ** (0.5) * model_output) / alpha_prod_t ** (0.5)
                    elif self.config.prediction_type == "sample":
                        pred_original_sample = model_output
                    elif self.config.prediction_type == "v_prediction":
                        pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
                    else:
                        raise ValueError(
                            f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, `sample` or"
                            " `v_prediction`  for the DDPMScheduler."
                        )
            
                    # 3. Clip or threshold "predicted x_0"
                    if self.config.thresholding:
                        pred_original_sample = self._threshold_sample(pred_original_sample)
                    elif self.config.clip_sample:
                        pred_original_sample = pred_original_sample.clamp(
                            -self.config.clip_sample_range, self.config.clip_sample_range
                        )
            
                    # 4. Compute coefficients for pred_original_sample x_0 and current sample x_t
                    # See formula (7) from https://arxiv.org/pdf/2006.11239.pdf
                    pred_original_sample_coeff = (alpha_prod_t_prev ** (0.5) * current_beta_t) / beta_prod_t
                    current_sample_coeff = current_alpha_t ** (0.5) * beta_prod_t_prev / beta_prod_t
            
                    # 5. Compute predicted previous sample µ_t
                    # See formula (7) from https://arxiv.org/pdf/2006.11239.pdf
                    pred_prev_sample = pred_original_sample_coeff * pred_original_sample + current_sample_coeff * sample
            
                    # 6. Add noise
                    variance = 0
                    if t > 0:
                        device = model_output.device
                        variance_noise = randn_tensor(
                            model_output.shape, generator=generator, device=device, dtype=model_output.dtype
                        )
                        if self.variance_type == "fixed_small_log":
                            variance = self._get_variance(t, predicted_variance=predicted_variance) * variance_noise
                        elif self.variance_type == "learned_range":
                            variance = self._get_variance(t, predicted_variance=predicted_variance)
                            variance = torch.exp(0.5 * variance) * variance_noise
                        else:
                            variance = (self._get_variance(t, predicted_variance=predicted_variance) ** 0.5) * variance_noise
            
                    pred_prev_sample = pred_prev_sample + variance
            
                    if not return_dict:
                        return (pred_prev_sample,)
            
                    return DDPMSchedulerOutput(prev_sample=pred_prev_sample, pred_original_sample=pred_original_sample)
            
            
The argument \( t = \) timestep is a given discrete timestep.
prev_t = self.previous_timestep(t) is the timestep immediately preceding \( t \).
model_output may consist of two parts: when variance_type is `learned` or `learned_range`, the model predicts both the sample (or noise) and the variance, which are split apart along the channel dimension.

# 1. compute alphas, betas
From self.alphas_cumprod and the timesteps t and prev_t we read off \( \bar {\alpha}_t \) and \( \bar {\alpha}_{t_{pre}} \). Using \( \bar {\alpha}_t + \bar {\beta}_{t} = 1 \) we get \( \bar {\beta}_t = 1 - \bar {\alpha}_t, \bar {\beta}_{t_{pre}} = 1 - \bar {\alpha}_{t_{pre}} \), and then the per-step values \( \alpha_t = \frac{\bar {\alpha}_t}{\bar {\alpha}_{t_{pre}}}, \beta_t = 1 - \alpha_t \).

# 2. compute predicted original sample from predicted noise also called
# "predicted x_0" of formula (15) from https://arxiv.org/pdf/2006.11239.pdf
Depending on the UNet's prediction type, recover the predicted original sample \( x_0 \).
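
For example, for `epsilon` the closed form \( \bf x_t = \sqrt{\bar{\alpha}_t} \bf x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon \) is solved for \( \bf x_0 \):

\( \bf x_0 = \frac{\bf x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon}{\sqrt{\bar{\alpha}_t}} \)

which is exactly pred_original_sample = (sample - beta_prod_t ** 0.5 * model_output) / alpha_prod_t ** 0.5. For `v_prediction`, with \( v = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1-\bar{\alpha}_t}\, \bf x_0 \), substituting the closed form for \( \bf x_t \) gives \( \bf x_0 = \sqrt{\bar{\alpha}_t}\, \bf x_t - \sqrt{1-\bar{\alpha}_t}\, v \), matching the code's third branch.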

# 4. Compute coefficients for pred_original_sample x_0 and current sample x_t
# See formula (7) from https://arxiv.org/pdf/2006.11239.pdf
Compute the corresponding coefficients according to formula (7) of the paper cited above; the formula is

\( \tilde \mu_t (x_t, x_0)= \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t\)

# 5. Compute predicted previous sample µ_t
# See formula (7) from https://arxiv.org/pdf/2006.11239.pdf
Applying the formula above gives \( \tilde \mu_t \), the mean of the distribution from which the previous timestep's sample is drawn.

# 6. Add noise
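For \( t > 0 \), Gaussian noise scaled by the standard deviation from _get_variance is added to the predicted mean \( \tilde \mu_t \); for the default `fixed_small` variance type this is the posterior variance from formula (7) of the DDPM paper, \( \tilde \beta_t = \frac{1-\bar{\alpha}_{t_{pre}}}{1-\bar{\alpha}_t}\beta_t \). At \( t = 0 \) no noise is added and the predicted mean itself is returned as the final sample.

Putting set_timesteps, the model, and step together, a bare-bones sampling loop looks like the sketch below (fake_unet is a hypothetical stand-in for a trained noise-prediction model; a real pipeline would call its UNet here):

    import torch
    from diffusers import DDPMScheduler

    # Hypothetical stand-in for a trained UNet: it "predicts" zero noise so the
    # loop can run end to end without any weights.
    def fake_unet(sample: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return torch.zeros_like(sample)

    scheduler = DDPMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_inference_steps=50)

    sample = torch.randn(1, 3, 64, 64) * scheduler.init_noise_sigma  # start from pure noise

    for t in scheduler.timesteps:
        model_output = fake_unet(sample, t)                           # epsilon prediction
        sample = scheduler.step(model_output, t, sample).prev_sample  # one reverse-diffusion step
    print(sample.shape)  # torch.Size([1, 3, 64, 64])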