Paper Title
Retrieval-based Disentangled Representation Learning with Natural Language Supervision
Paper Authors
Paper Abstract
Disentangled representation learning remains challenging as the underlying factors of variation in the data do not naturally exist. The inherent complexity of real-world data makes it infeasible to exhaustively enumerate and encapsulate all its variations within a finite set of factors. However, it is worth noting that most real-world data have linguistic equivalents, typically in the form of textual descriptions. These linguistic counterparts can represent the data and be effortlessly decomposed into distinct tokens. In light of this, we present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as a proxy for the underlying data variation to drive disentangled representation learning. Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish the dimensions that capture intrinsic characteristics of the data through its natural language counterpart, thus facilitating disentanglement. We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results compellingly demonstrate the superiority of VDR over previous bi-encoder retrievers of comparable model size and training cost, achieving an impressive 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, the results from human evaluation indicate that the interpretability of our method is on par with that of SOTA captioning models.
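To make "representing data in a vocabulary space" concrete, here is a minimal sketch of a bi-encoder that maps text to sparse, vocabulary-sized vectors and scores query–document pairs by inner product, so each active dimension corresponds to a human-readable token. The SPLADE-style max pooling over masked-language-model logits, the `bert-base-uncased` backbone, and the function names are illustrative assumptions, not the exact VDR formulation described in the paper.

```python
# Minimal sketch (assumed, not the paper's exact method): embed text into a
# vocabulary-sized "lexical" space so that each dimension is one token.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-uncased"  # assumed backbone, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModelForMaskedLM.from_pretrained(model_name)

def encode_to_vocab_space(texts):
    """Map a batch of texts to sparse non-negative vectors over the vocabulary."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = encoder(**batch).logits                   # (B, L, |V|) MLM logits
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding positions
    weights = torch.log1p(torch.relu(logits)) * mask   # non-negative term weights
    return weights.max(dim=1).values                   # (B, |V|) vocabulary-space vector

# Relevance is an inner product between query and document vocabulary vectors.
q = encode_to_vocab_space(["a dog catching a frisbee"])
d = encode_to_vocab_space(["A brown dog leaps to catch a flying disc in the park."])
score = (q * d).sum(-1)

# Each active dimension names a vocabulary token, which is what makes the
# representation inspectable along interpretable factors.
top = q[0].topk(5)
print(score.item(), [tokenizer.convert_ids_to_tokens([i])[0] for i in top.indices.tolist()])
```

In such a setup, inspecting the top-weighted dimensions of an embedding directly reveals the tokens the model considers descriptive of the input, which is the kind of interpretability the human evaluation above measures against captioning models.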