Paper Title
On Learning Language-Invariant Representations for Universal Machine Translation
Paper Authors
Paper Abstract
The goal of universal machine translation is to learn to translate between any pair of languages, given a corpus of paired translated documents for \emph{a small subset} of all pairs of languages. Despite impressive empirical results and an increasing interest in massively multilingual models, theoretical analysis on translation errors made by such universal machine translation models is only nascent. In this paper, we formally prove certain impossibilities of this endeavour in general, as well as prove positive results in the presence of additional (but natural) structure of data. For the former, we derive a lower bound on the translation error in the many-to-many translation setting, which shows that any algorithm aiming to learn shared sentence representations among multiple language pairs has to make a large translation error on at least one of the translation tasks, if no assumption on the structure of the languages is made. For the latter, we show that if the paired documents in the corpus follow a natural \emph{encoder-decoder} generative process, we can expect a natural notion of ``generalization'': a linear number of language pairs, rather than quadratic, suffices to learn a good representation. Our theory also explains what kinds of connection graphs between pairs of languages are better suited: ones with longer paths result in worse sample complexity in terms of the total number of documents per language pair needed. We believe our theoretical insights and implications contribute to the future algorithmic design of universal machine translation.
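The counting argument behind "linear rather than quadratic" and the remark about path length can be illustrated with a small sketch. It is not taken from the paper: the helper names (`all_pairs`, `star_graph`, `chain_graph`, `longest_shortest_path`) and the choice of six languages are assumptions made for illustration, and the code only counts language pairs and measures path lengths in a connection graph. It does not reproduce the paper's sample-complexity bound.

```python
from collections import deque
from itertools import combinations


def all_pairs(k):
    """Every unordered language pair among k languages: k*(k-1)/2 of them."""
    return list(combinations(range(k), 2))


def star_graph(k):
    """Hypothetical layout: language 0 is paired with every other language."""
    return [(0, i) for i in range(1, k)]


def chain_graph(k):
    """Hypothetical layout: each language is paired only with its neighbour."""
    return [(i, i + 1) for i in range(k - 1)]


def longest_shortest_path(k, edges):
    """Return the graph diameter (longest shortest path), computed with BFS."""
    adjacency = {i: [] for i in range(k)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    diameter = 0
    for source in range(k):
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in distance:
                    distance[v] = distance[u] + 1
                    queue.append(v)
        if len(distance) < k:
            raise ValueError("connection graph is not connected")
        diameter = max(diameter, max(distance.values()))
    return diameter


if __name__ == "__main__":
    k = 6  # number of languages, chosen arbitrarily for the illustration
    print("all pairs:", len(all_pairs(k)))
    for name, edges in (("star", star_graph(k)), ("chain", chain_graph(k))):
        print(name, "pairs:", len(edges),
              "longest path:", longest_shortest_path(k, edges))
```

Both layouts connect all six languages with five parallel corpora instead of fifteen, but the chain has a much longer longest path than the star. According to the abstract, connection graphs with longer paths are the ones that need more documents per language pair.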