Paper Title
MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
Paper Authors
Paper Abstract
Visual-semantic embedding enables various tasks such as image-text retrieval, image captioning, and visual question answering. The key to successful visual-semantic embedding is to express visual and textual data properly by accounting for their intricate relationship. While previous studies have made substantial progress by encoding visual and textual data into a joint space where similar concepts are located close together, they often represent the data by a single vector, ignoring the presence of multiple important components in an image or text. Thus, in addition to the joint embedding space, we propose a novel multi-head self-attention network that captures the various components of visual and textual data by attending to important parts of the data. Our approach achieves new state-of-the-art results in image-text retrieval tasks on the MS-COCO and Flickr30K datasets. Through visualization of the attention maps, which capture distinct semantic components at multiple positions in the image and the text, we demonstrate that our method yields an effective and interpretable visual-semantic joint space.
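The abstract describes pooling local features (image regions or words) into several embedding vectors, one per attention head, so that each head can focus on a different semantic component. Below is a minimal sketch of that multi-head self-attention pooling idea; the module name, layer sizes, and number of heads are illustrative assumptions, not the authors' exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttentionPooling(nn.Module):
    """Sketch: pool a sequence of local features (image regions or word
    vectors) into one embedding per attention head. Names and sizes are
    hypothetical, not taken from the paper."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_heads: int):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, hidden_dim, bias=False)   # feature projection
        self.w2 = nn.Linear(hidden_dim, num_heads, bias=False)  # one score per head

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_positions, feat_dim)
        scores = self.w2(torch.tanh(self.w1(feats)))         # (batch, positions, heads)
        attn = F.softmax(scores, dim=1)                       # attention over positions
        # Each head aggregates the input into its own embedding vector.
        heads = torch.einsum('bph,bpf->bhf', attn, feats)     # (batch, heads, feat_dim)
        return F.normalize(heads, dim=-1)                     # unit norm for cosine similarity

# Usage: embed 36 region features of dim 2048 into 8 head-specific vectors.
regions = torch.randn(4, 36, 2048)
pool = MultiHeadSelfAttentionPooling(feat_dim=2048, hidden_dim=512, num_heads=8)
print(pool(regions).shape)  # torch.Size([4, 8, 2048])
```

A symmetric module on the text side would produce head-specific sentence embeddings, and image-text similarity could then be computed between the two sets of head vectors in the joint space; how the heads are matched and trained is specified in the paper, not in this sketch.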