Imagic: Text-Based Real Image Editing with Diffusion Models

Paper Title

Imagic: Text-Based Real Image Editing with Diffusion Models

Authors

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani

Abstract

Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, which we call "Imagic", leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.
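The abstract says the method produces a text embedding that aligns with both the input image and the target text, but it does not say how. One simple way to get such an embedding, assumed here purely for illustration, is to interpolate linearly between an embedding optimized to reconstruct the input image and the embedding of the target text. The function name, the variable names `e_opt` and `e_tgt`, and the mixing weight `eta` are hypothetical, and plain Python lists stand in for real embedding tensors.

```python
def interpolate_embeddings(e_opt, e_tgt, eta):
    """Blend two same-length embedding vectors.

    e_opt: embedding assumed to reconstruct the input image (hypothetical name)
    e_tgt: embedding of the target text (hypothetical name)
    eta:   mixing weight in [0, 1]; 0 returns e_opt, 1 returns e_tgt
    """
    if len(e_opt) != len(e_tgt):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    # Element-wise convex combination of the two vectors.
    return [eta * t + (1.0 - eta) * o for o, t in zip(e_opt, e_tgt)]


if __name__ == "__main__":
    image_like = [0.0, 2.0]   # toy stand-in for the image-aligned embedding
    text_like = [1.0, 4.0]    # toy stand-in for the target-text embedding
    print(interpolate_embeddings(image_like, text_like, 0.5))
```

In this sketch, a small `eta` stays close to the input image and a large `eta` moves toward the requested edit. The blended embedding would then condition the fine-tuned diffusion model during generation.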
