Paper Title

Towards Better Chinese-centric Neural Machine Translation for Low-resource Languages

Authors

Bin Li, Yixuan Weng, Fei Xia, Hanjun Deng

Abstract

The last decade has witnessed enormous improvements in science and technology, stimulating the growing demand for economic and cultural exchanges among various countries. Building a neural machine translation (NMT) system has become an urgent trend, especially in the low-resource setting. However, recent work tends to study NMT systems for low-resource languages centered on English, while few works focus on low-resource NMT systems centered on other languages such as Chinese. To this end, the low-resource multilingual translation challenge of the 2021 iFLYTEK AI Developer Competition provides Chinese-centric multilingual low-resource NMT tasks, where participants are required to build NMT systems based on the provided low-resource samples. In this paper, we present the winning competition system, which leverages monolingual word embedding data enhancement, bilingual curriculum learning, and contrastive re-ranking. In addition, a new Incomplete-Trust (In-trust) loss function is proposed to replace the traditional cross-entropy loss during training. The experimental results demonstrate that the implementation of these ideas leads to better performance than other state-of-the-art methods. All the experimental code is released at: https://github.com/WENGSYX/Low-resource-text-translation.
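The abstract describes replacing cross-entropy with an "incomplete trust" loss that is more robust to noisy low-resource training pairs. The sketch below is an illustration of that general idea, not the paper's exact formulation: it mixes standard cross-entropy with a second term that only partially trusts the one-hot label, blending it with the model's own predicted distribution. The weights `alpha`, `beta`, and the trust factor `delta` are hypothetical names introduced here for the example.

```python
import math

def in_trust_style_loss(probs, target, alpha=1.0, beta=0.8, delta=0.5, eps=1e-12):
    """Illustrative noise-robust loss over a single token prediction.

    probs  : model's predicted probability distribution over the vocabulary
    target : index of the reference (possibly noisy) label
    delta  : how much to trust the hard label vs. the model's own prediction
    """
    # Standard cross-entropy against the one-hot reference label.
    ce = -math.log(probs[target] + eps)

    # "Incomplete trust": blend the one-hot label with the model distribution,
    # so a confidently different model prediction is penalized less harshly.
    blended = [delta * (1.0 if i == target else 0.0) + (1.0 - delta) * q
               for i, q in enumerate(probs)]
    dce = -sum(q * math.log(b + eps) for q, b in zip(probs, blended))

    return alpha * ce + beta * dce
```

With a confidently correct prediction (e.g. `probs=[0.9, 0.05, 0.05]`, `target=0`) the loss stays small, while a confidently wrong one is still penalized, just less steeply than pure cross-entropy would for a possibly mislabeled sample.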
