Paper Title
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding
Paper Authors
Paper Abstract
Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to a given sentence. Most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. This challenging task requires capturing critical object relations in order to identify the queried target. However, existing approaches cannot distinguish the notable objects and are left with ineffective relation modeling among unnecessary objects. We therefore propose a novel object-aware multi-branch relation network for object-aware relation discovery. Concretely, we first devise multiple branches for object-aware region modeling, where each branch focuses on one crucial object mentioned in the sentence. We then propose multi-branch relation reasoning to capture critical object relationships between the main branch and the auxiliary branches. Moreover, we apply a diversity loss so that each branch attends only to its corresponding object, which boosts multi-branch learning. Extensive experiments show the effectiveness of the proposed method.
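The abstract does not give the form of the diversity loss. As a rough illustration of the idea only, the sketch below penalizes overlap between the attention distributions of different branches, which is one common way to push each branch toward a distinct object. The function name `overlap_penalty`, the `attn[b][r]` layout (weight that branch `b` places on region `r`), and the pairwise dot-product form are all assumptions made for this example, not the authors' formulation.

```python
from typing import Sequence


def overlap_penalty(attn: Sequence[Sequence[float]]) -> float:
    """Sum of pairwise dot products between branch attention rows.

    attn[b][r] is a hypothetical attention weight that branch b places on
    region r. Rows are assumed to be non-negative and to sum to 1, so the
    penalty is 0 exactly when no two branches attend to a shared region.
    """
    total = 0.0
    for i in range(len(attn)):
        for j in range(i + 1, len(attn)):
            # Dot product measures how much branches i and j overlap.
            total += sum(a * b for a, b in zip(attn[i], attn[j]))
    return total


# Two branches on disjoint regions: no overlap to penalize.
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Two branches collapsed onto the same region: maximal overlap.
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

print(overlap_penalty(disjoint))
print(overlap_penalty(collapsed))
```

Under these assumptions, adding such a term to the training objective would discourage several branches from collapsing onto the same object; how the paper actually defines and weights its diversity loss is not stated in the abstract.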