Paper Title

In Defense of Grid Features for Visual Question Answering

Paper Authors

Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

Abstract

Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this paper, we revisit grid features for VQA, and find they can work surprisingly well - running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion). Through extensive experiments, we verify that this observation holds true across different VQA models (reporting a state-of-the-art accuracy on VQA 2.0 test-std, 72.71), datasets, and generalizes well to other tasks like image captioning. As grid features make the model design and training process much simpler, this enables us to train them end-to-end and also use a more flexible network design. We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training. We hope our findings help further improve the scientific understanding and the practical application of VQA. Code and features will be made available.
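To make the core idea concrete, below is a minimal sketch (not the authors' released code) of how grid features can be used for VQA: the final convolutional feature map of a CNN backbone is flattened into an H x W grid of feature vectors, and a question-guided attention head pools over the grid cells before answer classification. The class names `GridFeatureExtractor` and `SimpleVQAHead`, the feature dimensions, and the answer-vocabulary size are illustrative assumptions; in practice the backbone would be pre-trained as described in the paper.

```python
# Minimal sketch, assuming a ResNet-50 trunk and a simple attention-based VQA head.
# Class names and dimensions are hypothetical and for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models


class GridFeatureExtractor(nn.Module):
    """Use the final conv feature map of a ResNet as an H x W grid of features."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50()  # random weights here; load pre-trained weights in practice
        # Drop global pooling and the classifier; keep only the convolutional trunk.
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images):                      # images: (B, 3, H, W)
        fmap = self.trunk(images)                   # (B, 2048, H/32, W/32)
        return fmap.flatten(2).transpose(1, 2)      # (B, H*W, 2048): one vector per grid cell


class SimpleVQAHead(nn.Module):
    """Question-guided attention over grid cells, then answer classification."""

    def __init__(self, q_dim=1024, v_dim=2048, n_answers=3129):
        super().__init__()
        self.att = nn.Linear(q_dim + v_dim, 1)
        self.cls = nn.Linear(q_dim + v_dim, n_answers)

    def forward(self, grid_feats, q_feat):          # grid_feats: (B, N, v_dim), q_feat: (B, q_dim)
        q_exp = q_feat.unsqueeze(1).expand(-1, grid_feats.size(1), -1)
        att = torch.softmax(self.att(torch.cat([grid_feats, q_exp], -1)), dim=1)
        v_att = (att * grid_feats).sum(1)           # attended visual feature
        return self.cls(torch.cat([v_att, q_feat], -1))


if __name__ == "__main__":
    extractor, head = GridFeatureExtractor(), SimpleVQAHead()
    images = torch.randn(2, 3, 448, 448)            # larger inputs give a finer grid (here 14 x 14)
    q_feat = torch.randn(2, 1024)                   # stand-in for an encoded question
    logits = head(extractor(images), q_feat)
    print(logits.shape)                             # torch.Size([2, 3129])
```

Compared with region-based pipelines, this sketch needs no object detector at inference time: the grid features come directly from the backbone, which is what allows the end-to-end, pixels-to-answers training described in the abstract.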
