Paper title
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Paper authors
Paper abstract
To solve video-and-language grounding tasks, the network must understand the connection between the two modalities. For a video and its language description, their semantic relation is reflected in the similarity of their encodings. A good multi-modal encoder should capture the semantics of both inputs and encode them in a shared feature space where embedding distance properly reflects semantic similarity. In this work, we focused on this semantic connection between video and language, and developed a multi-level alignment training scheme to directly shape the encoding process. We designed video-language alignment pairs at the global and segment levels, based on information similarity ranging from high-level context to fine-grained semantics. A contrastive loss was used to contrast the encoding similarities of positive and negative alignment pairs, so that the network is trained to encode similar information closely in the shared feature space while keeping information with different semantics apart. Our multi-level alignment training can be applied to various video-and-language grounding tasks. Combined with a task-specific training loss, our framework achieved performance comparable to previous state-of-the-art methods on multiple video QA and retrieval datasets.
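
The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything here is an assumption made for illustration: an InfoNCE-style loss over cosine similarities, the temperature value, the weighted sum of a global-level and a segment-level term, and all function and parameter names (`cosine`, `contrastive_loss`, `multi_level_loss`, `segment_weight`).

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(video_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    video_embs[i] and text_embs[i] form a positive alignment pair;
    every other pairing in the batch is treated as a negative pair.
    """
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(video_embs[i], text_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive pair.
        total += log_denom - logits[i]
    return total / n


def multi_level_loss(global_pairs, segment_pairs, segment_weight=1.0):
    """Hypothetical combination of global-level and segment-level terms."""
    global_term = contrastive_loss(*global_pairs)
    segment_term = contrastive_loss(*segment_pairs)
    return global_term + segment_weight * segment_term


# Toy batch: two videos and two descriptions in a 2-D shared space.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(videos, aligned_texts)
shuffled = contrastive_loss(videos, shuffled_texts)
```

In this toy batch the loss is lower when each video is paired with its matching description than when the descriptions are shuffled, which is the behavior the alignment training relies on. In a full training setup such a term would presumably be added to the task-specific loss; the weighting between the two is not stated in the abstract.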