Paper Title
Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling
Paper Authors
Abstract
As a technically challenging topic, visual storytelling aims to generate an imaginative and coherent story of multiple narrative sentences from a group of related images. Existing methods often produce direct and rigid descriptions of the apparent image content, because they cannot explore implicit information beyond the images. Consequently, these schemes fail to capture consistent dependencies from a holistic representation, impairing the generation of a reasonable and fluent story. To address these problems, a novel knowledge-enriched attention network with a group-wise semantic model is proposed. Three main novel components are designed, and their practical advantages are supported by substantial experiments. First, a knowledge-enriched attention network is designed to extract implicit concepts from an external knowledge system; these concepts are then fed into a cascaded cross-modal attention mechanism to characterize both imaginative and concrete representations. Second, a group-wise semantic module with second-order pooling is developed to provide globally consistent guidance. Third, a unified one-stage story generation model with an encoder-decoder structure is proposed to train and infer the knowledge-enriched attention network, the group-wise semantic module, and the multi-modal story generation decoder simultaneously in an end-to-end fashion. Extensive experiments on the popular Visual Storytelling dataset with both objective and subjective evaluation metrics demonstrate the superior performance of the proposed scheme compared with other state-of-the-art methods.
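To make the first component concrete, below is a minimal PyTorch sketch of what a cascaded cross-modal attention between visual regions and externally retrieved concept embeddings could look like. The two-stage ordering, the dimensions, and the module name `CascadeCrossModalAttention` are assumptions for illustration, not the paper's actual specification.

```python
import torch
import torch.nn as nn

class CascadeCrossModalAttention(nn.Module):
    """Hypothetical sketch of a cascaded cross-modal attention:
    stage 1 lets image regions attend over knowledge-concept embeddings,
    stage 2 lets the concepts re-attend over the enriched regions."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.img_to_kg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.kg_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions: torch.Tensor, concepts: torch.Tensor):
        # regions:  (B, R, D) visual region features from an image encoder
        # concepts: (B, K, D) embeddings of implicit concepts retrieved from
        #           an external knowledge base (e.g., ConceptNet)
        # stage 1: each region gathers relevant implicit concepts
        enriched, _ = self.img_to_kg(regions, concepts, concepts)
        # stage 2: concepts are grounded back in the enriched visual evidence
        grounded, _ = self.kg_to_img(concepts, enriched, enriched)
        return enriched, grounded

# usage: a batch of 2 images, 36 regions each, 20 retrieved concepts
attn = CascadeCrossModalAttention()
enriched, grounded = attn(torch.randn(2, 36, 512), torch.randn(2, 20, 512))
```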
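For the second component, second-order pooling over a story's image group can be sketched as averaging outer products of per-image features, yielding one covariance-like statistic that serves as globally consistent guidance. The feature size, the linear projection, and the class name are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupWiseSecondOrderPooling(nn.Module):
    """Hypothetical sketch: pool a group of image features into a single
    guidance vector via second-order (outer-product) pooling."""

    def __init__(self, feat_dim: int = 512, out_dim: int = 512):
        super().__init__()
        # project the flattened second-order statistic to a compact vector
        self.proj = nn.Linear(feat_dim * feat_dim, out_dim)

    def forward(self, group_feats: torch.Tensor) -> torch.Tensor:
        # group_feats: (N, D) -- one feature vector per image in the group
        n, _ = group_feats.shape
        # second-order statistics: outer products averaged over the group
        cov = group_feats.t() @ group_feats / n       # (D, D)
        return self.proj(cov.flatten())               # (out_dim,)

# usage: five images per story, each with a 512-d encoder feature
pool = GroupWiseSecondOrderPooling()
guidance = pool(torch.randn(5, 512))  # shared by all sentence decoders
print(guidance.shape)                 # torch.Size([512])
```

The design intuition is that the outer product captures pairwise feature correlations across the whole image group, so every generated sentence can be conditioned on the same holistic summary rather than on one image at a time.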