Paper Title

A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences

Authors

Andrew Merritt, Chenhui Chu, Yuki Arase

Abstract

Multimodal neural machine translation (NMT) has become an increasingly important area of research over the years because additional modalities, such as image data, can provide more context to textual data. Furthermore, the viability of training multimodal NMT models without a large parallel corpus continues to be investigated due to low availability of parallel sentences with images, particularly for English-Japanese data. However, this void can be filled with comparable sentences that contain bilingual terms and parallel phrases, which are naturally created through media such as social network posts and e-commerce product descriptions. In this paper, we propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets. In addition, we supplement our comparable sentences with a smaller parallel corpus for validation and test purposes. To test the performance of this comparable sentence translation scenario, we train several baseline NMT models with our comparable corpus and evaluate their English-Japanese translation performance. Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data. Despite this, we hope for our corpus to be used to further research into multimodal NMT with comparable sentences.
