Visual News: Benchmark and Challenges in News Image Captioning

Paper Title

Visual News: Benchmark and Challenges in News Image Captioning

Authors

Fuxiao Liu, Yinghan Wang, Tianlu Wang, Vicente Ordonez

Abstract

We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method utilizes much fewer parameters while achieving slightly better prediction results than competing methods. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.
