Paper [ControlNet] #

Paper: Adding Conditional Control to Text-to-Image Diffusion Models
Code: ControlNet git

ControlNet[10] #

ControlNet is a type of model for controlling image diffusion models by conditioning them on an additional input image. **Many types of conditioning input (Canny edges, user sketches, human pose, depth maps, and more)** can be used to control a diffusion model. This gives you much finer control over image generation, making it easier to produce a specific image without as much experimentation with text prompts or denoising settings.

Method [2][3] #

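ControlNet takes an approach similar to fine-tuning. As shown in the figure below, a trainable copy is added alongside the original model, whose weights stay locked. The trainable copy takes the original input x plus the condition c as its input, and the outputs of the two models are then added together. Both the input and the output of the trainable copy pass through a zero convolution, a convolution layer whose weights and bias are initialized to zero. Because these layers output zero at the start of training, the combined model initially behaves exactly like the original one, which keeps training stable.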

{% asset_img ’’ %}
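The structure described above can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists instead of tensors, a single 1x1 convolution as a stand-in for a pretrained network block, and the names `Conv1x1`, `ControlledBlock`, `zero_in`, and `zero_out` are hypothetical. It only shows why zero-initialized convolutions leave the original output untouched before any training has happened.

```python
import copy
import random


class Conv1x1:
    """A 1x1 convolution over a feature map stored as [channel][row][col] lists."""

    def __init__(self, channels, zero_init=False, seed=0):
        rng = random.Random(seed)
        init = (lambda: 0.0) if zero_init else (lambda: rng.uniform(-1.0, 1.0))
        self.weight = [[init() for _ in range(channels)] for _ in range(channels)]
        self.bias = [init() for _ in range(channels)]

    def __call__(self, x):
        height, width = len(x[0]), len(x[0][0])
        return [
            [
                [
                    sum(w * x[c][i][j] for c, w in enumerate(row)) + b
                    for j in range(width)
                ]
                for i in range(height)
            ]
            for row, b in zip(self.weight, self.bias)
        ]


def add(a, b):
    """Element-wise sum of two feature maps with the same shape."""
    return [
        [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


class ControlledBlock:
    """Locked block plus a trainable copy wrapped in two zero convolutions."""

    def __init__(self, block):
        self.locked = block                    # original weights, never updated
        self.trainable = copy.deepcopy(block)  # copy that training would update
        channels = len(block.bias)
        self.zero_in = Conv1x1(channels, zero_init=True)   # applied to condition c
        self.zero_out = Conv1x1(channels, zero_init=True)  # applied to the copy's output

    def __call__(self, x, c):
        controlled_input = add(x, self.zero_in(c))
        residual = self.zero_out(self.trainable(controlled_input))
        return add(self.locked(x), residual)


# Stand-in for a pretrained block (assumption: any callable block would do).
block = Conv1x1(channels=2, seed=1)
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, -1.0], [2.0, 0.0]]]
c = [[[9.0, 9.0], [9.0, 9.0]], [[-9.0, 9.0], [9.0, -9.0]]]

controlled = ControlledBlock(block)
print(controlled(x, c) == block(x))  # True
```

Before training, both zero convolutions output zeros, so the condition c has no effect and the result equals the locked block's output. Once training moves the zero-convolution weights away from zero, the trainable copy starts to contribute.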

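The concrete ControlNet architecture for Stable Diffusion is shown in the figure below. Only the UNet's encoder blocks and middle block are copied (both structure and weights). The conditioning image c first passes through several convolution layers and is then added to the original UNet input z_t to form the ControlNet input. The output of each ControlNet block is added to the input of the corresponding decoder block of the original UNet. In practical implementations, the ControlNet outputs can also be multiplied by a scale factor that controls how strongly they influence the result. Note that ControlNet receives the same prompt and timestep inputs as the original UNet.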

{% asset_img ’’ %}
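The scaling step can be sketched as follows. This is an illustrative assumption rather than code from any particular implementation: features are flat lists of floats standing in for tensors, and `apply_control`, `skip_features`, `control_skips`, and `scale` are hypothetical names. The idea is that each ControlNet block output is multiplied by the scale and added to the matching feature that the original UNet passes on to its decoder.

```python
def apply_control(skip_features, mid_feature, control_skips, control_mid, scale=1.0):
    """Add scaled ControlNet outputs to the frozen UNet's encoder and middle features.

    Each feature is a flat list of floats standing in for a tensor.
    """
    if len(skip_features) != len(control_skips):
        raise ValueError("expected one ControlNet output per encoder skip feature")

    def add_scaled(feature, residual):
        return [f + scale * r for f, r in zip(feature, residual)]

    new_skips = [add_scaled(f, r) for f, r in zip(skip_features, control_skips)]
    new_mid = add_scaled(mid_feature, control_mid)
    return new_skips, new_mid


# Toy features from the original UNet and the matching ControlNet outputs.
skips = [[1.0, 2.0], [3.0, 4.0]]
mid = [5.0, 6.0]
control_skips = [[0.5, -0.5], [1.0, 1.0]]
control_mid = [2.0, -2.0]

print(apply_control(skips, mid, control_skips, control_mid, scale=0.0))
# ([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0])
print(apply_control(skips, mid, control_skips, control_mid, scale=0.5))
# ([[1.25, 1.75], [3.5, 4.5]], [6.0, 5.0])
```

With a scale of 0 the original UNet features pass through unchanged, so the condition is ignored; larger values make the condition weigh more heavily on the generated image.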