Paper Title
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
Paper Authors
Paper Abstract
In recent years, multi-modal transformers have shown significant progress in Vision-Language tasks, such as Visual Question Answering (VQA), outperforming previous architectures by a considerable margin. This improvement in VQA is often attributed to the rich interactions between vision and language streams. In this work, we investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question. We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers. We evaluate the effect of the following critical components on visual attention of a state-of-the-art VQA model: (i) number of object region proposals, (ii) question part of speech (POS) tags, (iii) question semantics, (iv) number of co-attention layers, and (v) answer accuracy. We compare the neural network attention maps against human attention maps both qualitatively and quantitatively. Our findings indicate that co-attention transformer modules are crucial in attending to relevant regions of the image given a question. Importantly, we observe that the semantic meaning of the question is not what drives visual attention, but specific keywords in the question do. Our work sheds light on the function and interpretation of co-attention transformer layers, highlights gaps in current networks, and can guide the development of future VQA models and networks that simultaneously process visual and language streams.
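To make the abstract's pipeline concrete, below is a minimal sketch (not the authors' released code) of how question-conditioned image attention scores from a co-attention layer might be aggregated into a visual attention map over region proposals and compared against a human attention map. The cross-attention tensor shape, the averaging over heads and question tokens, the box-painting step, and the use of Spearman rank correlation are all illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: extract a question-conditioned visual attention map
# from a ViLBERT-style co-attention layer and compare it to a human attention map.
import numpy as np
import torch
from scipy.stats import spearmanr


def visual_attention_from_coattention(cross_attn_weights: torch.Tensor) -> np.ndarray:
    """Aggregate question-to-image cross-attention into one score per region.

    cross_attn_weights: assumed shape (num_heads, num_question_tokens, num_regions),
    attention probabilities from a single co-attention layer.
    """
    # Average over heads and question tokens to get one weight per region proposal.
    region_scores = cross_attn_weights.mean(dim=(0, 1))   # (num_regions,)
    region_scores = region_scores / region_scores.sum()   # normalize to sum to 1
    return region_scores.detach().cpu().numpy()


def scores_to_attention_map(region_scores: np.ndarray,
                            boxes: np.ndarray,
                            image_hw: tuple[int, int]) -> np.ndarray:
    """Paint per-region scores onto the image plane (boxes are x1, y1, x2, y2 pixels)."""
    h, w = image_hw
    attn_map = np.zeros((h, w), dtype=np.float32)
    for score, (x1, y1, x2, y2) in zip(region_scores, boxes.astype(int)):
        attn_map[y1:y2, x1:x2] += score
    if attn_map.max() > 0:
        attn_map /= attn_map.max()
    return attn_map


def rank_correlation(model_map: np.ndarray, human_map: np.ndarray) -> float:
    """Spearman rank correlation between flattened model and human attention maps."""
    rho, _ = spearmanr(model_map.ravel(), human_map.ravel())
    return rho
```

In practice, one such map can be computed per co-attention layer, which is one way to study how the number of co-attention layers and the question wording affect where the network looks.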