Paper Title
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Authors
Abstract
Image-to-text tasks, such as open-ended image captioning and controllable image description, have received extensive attention for decades. Here, we advance this line of work by presenting Visual Spatial Description (VSD), a new task that steers image-to-text generation toward spatial semantics. Given an image and two objects inside it, VSD aims to produce a description focusing on the spatial relationship between the two objects. Accordingly, we manually annotate a dataset to facilitate investigation of the newly introduced task and build several benchmark encoder-decoder models using VL-BART and VL-T5 as backbones. In addition, we investigate pipeline and joint end-to-end architectures for incorporating visual spatial relationship classification (VSRC) information into our models. Finally, we conduct experiments on our benchmark dataset to evaluate all the models. Results show that our models produce accurate, human-like, spatially oriented text descriptions; that VSRC holds great potential for VSD; and that the joint end-to-end architecture is the better choice for integrating the two. We make the dataset and code publicly available for research purposes.
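To make the task setup concrete, below is a minimal sketch of a plausible VSD interface and of the difference between the pipeline and joint end-to-end architectures described in the abstract. All names (`VSDExample`, `pipeline_vsd`, `joint_vsd`) and the toy stand-in models are hypothetical illustrations, not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical data structure for one VSD instance: an image, two object
# bounding boxes, and a gold spatial-oriented description.
@dataclass
class VSDExample:
    image_path: str
    object1: Tuple[int, int, int, int]  # (x1, y1, x2, y2) box of the first object
    object2: Tuple[int, int, int, int]  # box of the second object
    description: str                    # gold spatial-oriented description

# Pipeline architecture: first classify the spatial relation (VSRC),
# then condition the text generator on the predicted relation label.
def pipeline_vsd(image_path, box1, box2, vsrc_classifier, generator):
    relation = vsrc_classifier(image_path, box1, box2)   # e.g. "to the left of"
    return generator(image_path, box1, box2, hint=relation)

# Joint end-to-end architecture: a single model is trained with both the
# VSRC classification loss and the generation loss, so the relation signal
# is learned internally rather than passed as a hard label.
def joint_vsd(image_path, box1, box2, joint_model):
    relation_logits, description = joint_model(image_path, box1, box2)
    return description

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real system would use
    # a vision-language backbone such as VL-BART or VL-T5 here.
    ex = VSDExample("example.jpg", (0, 0, 50, 50), (60, 0, 110, 50),
                    "The cup is to the left of the laptop.")
    vsrc = lambda img, b1, b2: "to the left of"
    gen = lambda img, b1, b2, hint: f"The first object is {hint} the second object."
    joint = lambda img, b1, b2: ([0.1, 0.8, 0.1], "The cup sits to the left of the laptop.")

    print(pipeline_vsd(ex.image_path, ex.object1, ex.object2, vsrc, gen))
    print(joint_vsd(ex.image_path, ex.object1, ex.object2, joint))
```

In the pipeline variant, an error in the VSRC prediction propagates directly into the generated text; the joint variant avoids this hard dependency, which is consistent with the abstract's finding that the end-to-end architecture integrates VSRC more effectively.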