Learning Disentangled Latent Factors from Paired Data in Cross-Modal Retrieval: An Implicit Identifiable VAE Approach

Paper Title

Learning Disentangled Latent Factors from Paired Data in Cross-Modal Retrieval: An Implicit Identifiable VAE Approach

Authors

Kim, Minyoung; Guerrero, Ricardo; Pavlovic, Vladimir

Abstract

We address the problem of learning the underlying disentangled latent factors shared between paired bi-modal data in cross-modal retrieval. We assume that the data in both modalities are complex, structured, and high-dimensional (e.g., image and text), a setting in which conventional deep auto-encoding latent variable models such as the Variational Autoencoder (VAE) often struggle with accurate decoder training or realistic synthesis. A suboptimally trained decoder can harm the model's ability to identify the true factors. In this paper we propose the novel idea of an implicit decoder, which completely removes the ambient data decoding module from a latent variable model via implicit encoder inversion, achieved by Jacobian regularization of the low-dimensional embedding function. Motivated by the recent Identifiable VAE (IVAE) model, we modify it to incorporate the query modality data as a conditioning auxiliary input, which allows us to prove that the true parameters of the model can be identified under some regularity conditions. Tested on various datasets where the true factors are fully or partially available, our model is shown to identify the factors accurately, significantly outperforming conventional encoder-decoder latent variable models. We also test our model on Recipe1M, a large-scale food image/recipe dataset, where the factors learned by our approach closely coincide with the most pronounced and widely agreed-upon food factors, including savoriness, wateriness, and greenness.
