Paper Title
M-VADER: A Model for Diffusion with Multimodal Context
Paper Authors
Paper Abstract
We introduce M-VADER: a diffusion model (DM) for image generation where the output can be specified using arbitrary combinations of images and text. We show how M-VADER enables the generation of images specified using combinations of image and text, and combinations of multiple images. Previously, a number of successful DM image generation algorithms have been introduced that make it possible to specify the output image using a text prompt. Inspired by the success of those models, and led by the notion that language was already developed to describe the elements of visual contexts that humans find most important, we introduce an embedding model closely related to a vision-language model. Specifically, we introduce the embedding model S-MAGMA: a 13 billion parameter multimodal decoder combining components from an autoregressive vision-language model MAGMA and biases finetuned for semantic search.
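To make the conditioning idea concrete, here is a minimal sketch of how a multimodal embedder can turn an arbitrary mix of image and text tokens into a single context sequence that a diffusion denoiser cross-attends to. This is not the authors' implementation: the class names, dimensions, and the toy transformer stand-ins for S-MAGMA and the denoising U-Net are all hypothetical, chosen only to illustrate the data flow described in the abstract.

```python
import torch
import torch.nn as nn

class MultimodalEmbedder(nn.Module):
    """Toy stand-in for an S-MAGMA-style embedder: maps a mixed sequence
    of image features and text tokens to one context embedding sequence."""
    def __init__(self, dim=64):
        super().__init__()
        self.image_proj = nn.Linear(128, dim)      # hypothetical image-feature dim
        self.text_embed = nn.Embedding(1000, dim)  # hypothetical vocab size
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image_feats, text_ids):
        # Concatenate image and text tokens into one sequence, mirroring the
        # "arbitrary combinations of images and text" in the prompt.
        tokens = torch.cat(
            [self.image_proj(image_feats), self.text_embed(text_ids)], dim=1
        )
        return self.backbone(tokens)  # (batch, seq, dim) conditioning context

class TinyConditionedDenoiser(nn.Module):
    """Toy denoiser that cross-attends to the multimodal context, standing in
    for the noise-prediction network of a conditioned diffusion model."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, context):
        q = self.to_q(noisy_latents)
        attended, _ = self.attn(q, context, context)  # condition on image+text
        return self.out(attended)                     # predicted noise

# Usage: one denoising step conditioned on a mixed image+text prompt.
embedder = MultimodalEmbedder()
denoiser = TinyConditionedDenoiser()
image_feats = torch.randn(1, 4, 128)        # 4 hypothetical image tokens
text_ids = torch.randint(0, 1000, (1, 8))   # 8 hypothetical text tokens
context = embedder(image_feats, text_ids)
noisy_latents = torch.randn(1, 16, 64)      # toy latent "image"
eps_pred = denoiser(noisy_latents, context)
print(eps_pred.shape)  # torch.Size([1, 16, 64])
```

The design point this illustrates is that the diffusion model itself stays unimodal: all multimodal reasoning is pushed into the embedder, and the denoiser only sees a generic context sequence, so images and text (or several images) condition generation through the same cross-attention interface.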