Paper Title

Location-aware Graph Convolutional Networks for Video Question Answering

Authors

Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, Chuang Gan

Abstract

We address the challenging task of video question answering, which requires machines to answer questions about videos in natural language form. Previous state-of-the-art methods attempt to apply spatio-temporal attention mechanisms to video frame features without explicitly modeling the locations of, and relations among, the object interactions occurring in videos. However, the relations between object interactions and their location information are critical for both action recognition and question reasoning. In this work, we propose to represent the contents of a video as a location-aware graph by incorporating the location information of objects into the graph construction. Here, each node is associated with an object represented by its appearance and location features. Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action. As the graph is built on objects, our method is able to focus on the foreground action contents for better video question answering. Lastly, we leverage an attention mechanism to combine the output of the graph convolution with the encoded question features for final answer reasoning. Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, our method significantly outperforms state-of-the-art methods on the TGIF-QA, YouTube2Text-QA, and MSVD-QA datasets. Code and pre-trained models are publicly available at: https://github.com/SunDoge/L-GCN
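
The abstract outlines a three-part pipeline: object nodes carrying appearance and location features, a graph convolution over those nodes, and an attention step that fuses the graph output with the encoded question. The following is a minimal, hedged sketch of that idea; the class and variable names (LocationAwareGCN, appearance_feats, location_feats, etc.) and the similarity-based adjacency are illustrative assumptions, not the authors' actual implementation, which is available at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareGCN(nn.Module):
    """Sketch of a location-aware graph convolution over detected objects.

    Each node is an object represented by the concatenation of its
    appearance feature and its location (bounding-box) feature, as
    described in the abstract. The adjacency here is derived from
    pairwise node similarity; this is an assumed design choice, not
    necessarily the paper's exact formulation.
    """

    def __init__(self, app_dim, loc_dim, hidden_dim, num_layers=2):
        super().__init__()
        self.node_proj = nn.Linear(app_dim + loc_dim, hidden_dim)
        self.gcn_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )

    def forward(self, appearance_feats, location_feats):
        # appearance_feats: (B, N, app_dim); location_feats: (B, N, loc_dim)
        nodes = self.node_proj(
            torch.cat([appearance_feats, location_feats], dim=-1)
        )
        # Soft adjacency from pairwise node similarity, row-normalized.
        adj = F.softmax(torch.bmm(nodes, nodes.transpose(1, 2)), dim=-1)
        for layer in self.gcn_layers:
            nodes = F.relu(layer(torch.bmm(adj, nodes)))
        return nodes  # (B, N, hidden_dim) location-aware node features


def fuse_with_question(node_feats, question_feat):
    """Attention-style fusion of graph output with an encoded question (sketch)."""
    # node_feats: (B, N, H); question_feat: (B, H)
    scores = F.softmax(
        torch.bmm(node_feats, question_feat.unsqueeze(-1)).squeeze(-1), dim=-1
    )
    # Weighted sum of node features, used downstream for answer reasoning.
    return torch.bmm(scores.unsqueeze(1), node_feats).squeeze(1)  # (B, H)
```

This is only a structural illustration of "object graph + graph convolution + question-guided attention"; details such as how bounding boxes are encoded and how the adjacency matrix is defined should be taken from the paper and the released code.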
