Paper Title

ActBERT: Learning Global-Local Video-Text Representations

Paper Authors

Linchao Zhu, Yi Yang

Paper Abstract

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. It enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state of the art, demonstrating its superiority in video-text representation learning.
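The abstract describes the ENtangled Transformer block only at a high level. Below is a minimal, hedged sketch of how a block of this kind could fuse the three streams (global action features, local region features, and word embeddings) with cross-attention, where the action tokens help mediate the interaction between the regional and linguistic streams. This is not the authors' implementation; the class name `EntangledBlock`, the dimensions, and all parameter names are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the ActBERT implementation): a transformer block
# that fuses three streams -- global action features, local region features, and
# word embeddings -- via cross-attention. All names and sizes are assumptions.
import torch
import torch.nn as nn

class EntangledBlock(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Cross-attention: each of the text/region streams attends to a context
        # that always includes the global action tokens.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_r = nn.LayerNorm(dim)
        self.ffn_t = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_r = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, actions, regions, words):
        # Text attends to [action; region] context; regions attend to
        # [action; word] context, so global action cues catalyze the
        # interaction between the two local streams.
        visual_ctx = torch.cat([actions, regions], dim=1)
        lingual_ctx = torch.cat([actions, words], dim=1)

        t, _ = self.text_attn(words, visual_ctx, visual_ctx)
        words = self.norm_t(words + t)
        words = words + self.ffn_t(words)

        r, _ = self.region_attn(regions, lingual_ctx, lingual_ctx)
        regions = self.norm_r(regions + r)
        regions = regions + self.ffn_r(regions)
        return regions, words

# Toy usage: batch of 2 clips, 4 action tokens, 10 region tokens, 16 word tokens.
actions = torch.randn(2, 4, 768)
regions = torch.randn(2, 10, 768)
words = torch.randn(2, 16, 768)
regions_out, words_out = EntangledBlock()(actions, regions, words)
print(regions_out.shape, words_out.shape)  # (2, 10, 768), (2, 16, 768)
```

In this sketch the action tokens appear in both cross-attention contexts, mirroring the abstract's idea of using global action information to catalyze the interaction between linguistic texts and local regional objects; the exact wiring inside ActBERT's tangled transformer is described in the paper itself.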
