Paper Title

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

Authors

Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen

Abstract

Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images that contain multiple scene texts. In most cases, the texts are naturally attached to the surfaces of objects, so spatial reasoning between texts and objects is crucial in TextVQA. However, existing approaches are constrained to 2D spatial information learned from the input images and rely on transformer-based architectures to reason implicitly during the fusion process. Under this setting, these 2D spatial reasoning approaches cannot distinguish the fine-grained spatial relations between visual objects and scene texts on the same image plane, which impairs the interpretability and performance of TextVQA models. In this paper, we introduce 3D geometric information into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step. Specifically, to enhance the model's understanding of 3D spatial relationships, (i) we propose a relation prediction module for accurately locating the region of interest of critical objects; (ii) we design a depth-aware attention calibration module for calibrating the OCR tokens' attention according to critical objects. Extensive experiments show that our method achieves state-of-the-art performance on the TextVQA and ST-VQA datasets. More encouragingly, our model surpasses others by clear margins of 5.7% and 12.1% on questions that involve spatial reasoning in the TextVQA and ST-VQA validation splits. We also verify the generalizability of our model on the text-based image captioning task.
