Paper Title

WeaQA: Weak Supervision via Captions for Visual Question Answering

Paper Authors

Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral

Paper Abstract

Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors, along with biases and errors due to annotator subjectivity, have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to the dense and costly object bounding-box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
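
To picture the procedural Q-A generation the abstract refers to, the Python sketch below derives question-answer pairs from a single caption with hand-written templates. The regex patterns and the helper name generate_qa_from_caption are illustrative assumptions, not the authors' actual generation rules.

import re

def generate_qa_from_caption(caption):
    """Derive simple synthetic Q-A pairs from one caption (illustrative templates only)."""
    qa_pairs = []
    # Look for "<article> <subject> <verb>ing <article> <object>", e.g. "a man riding a horse".
    m = re.search(r"(?:a|an|the)\s+(\w+)\s+(\w+ing)\s+(?:a|an|the)\s+(\w+)", caption, re.I)
    if m:
        subj, verb, obj = m.groups()
        qa_pairs.append((f"What is the {subj} {verb}?", obj))
        qa_pairs.append((f"Who is {verb} the {obj}?", subj))
        qa_pairs.append((f"What is the {subj} doing?", verb))
    # The caption itself supports a yes-answer existence question.
    qa_pairs.append((f"Is this a picture of {caption.rstrip('.').lower()}?", "yes"))
    return qa_pairs

print(generate_qa_from_caption("A man riding a horse on the beach."))

A real system would likely rely on parsing rather than regexes; the point is that captions alone can supply Q-A supervision without human-written questions.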
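Similarly, the spatial-pyramid image patches can be thought of as uniform grids of crops taken at several resolutions, standing in for detector bounding boxes. A minimal sketch follows; the grid levels (1x1, 2x2, 3x3) and the helper name spatial_pyramid_patches are assumptions for illustration.

import numpy as np

def spatial_pyramid_patches(image, levels=(1, 2, 3)):
    """Split an H x W x C image into uniform grid patches at several pyramid levels."""
    h, w = image.shape[:2]
    patches = []
    for n in levels:
        ys = np.linspace(0, h, n + 1, dtype=int)  # n + 1 row boundaries -> n rows
        xs = np.linspace(0, w, n + 1, dtype=int)  # n + 1 column boundaries -> n columns
        for i in range(n):
            for j in range(n):
                patches.append(image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]])
    return patches

img = np.zeros((224, 224, 3), dtype=np.uint8)
print(len(spatial_pyramid_patches(img)))  # 1 + 4 + 9 = 14 patches

Each patch would then be encoded by an image backbone, avoiding the dense bounding-box annotations the abstract contrasts against.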
