Paper Title

"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation

Authors

Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

Abstract

End-to-end speech-to-text translation (ST) takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of parallel corpora available. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals that decouples the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT utilizes a pre-trained BERT model to enforce that the upper encoder produces as much semantic information as possible, without requiring extra data. We perform experiments on a diverse set of speech translation benchmarks, including LibriSpeech English-French, IWSLT English-German, and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.
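To make the triple-supervision idea concrete, below is a minimal PyTorch sketch of an ST model trained with three losses: a CTC loss on the acoustic encoder ("listen"), a distance loss pulling the semantic encoder's states toward frozen BERT embeddings of the transcript ("understand"), and a cross-entropy translation loss on the decoder ("translate"). This is an illustrative sketch only, not the authors' implementation (see the linked repository for that): the choice of CTC and MSE, the mean pooling, the equal loss weighting, and all module names and sizes (`TripleSupervisionST`, `sem_proj`, `d_model=512`, etc.) are assumptions.

```python
# Sketch of a triple-supervision ST objective in the spirit of LUT (assumed
# details: CTC acoustic loss, MSE-to-BERT semantic loss, CE translation loss).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TripleSupervisionST(nn.Module):
    def __init__(self, n_mels=80, d_model=512, bert_dim=768,
                 src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        self.subsample = nn.Linear(n_mels, d_model)  # stand-in for conv subsampling
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.acoustic_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.semantic_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.ctc_head = nn.Linear(d_model, src_vocab)   # "listen": ASR supervision
        self.sem_proj = nn.Linear(d_model, bert_dim)    # "understand": match BERT
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.out_proj = nn.Linear(d_model, tgt_vocab)   # "translate": CE loss

    def forward(self, feats, feat_lens, src_tokens, src_lens,
                bert_embs, tgt_in, tgt_out):
        """feats: (N, T, n_mels); src_tokens: (N, S) transcript ids for CTC;
        bert_embs: (N, bert_dim) frozen BERT sentence embeddings of transcripts;
        tgt_in / tgt_out: shifted target token ids, each (N, T_tgt)."""
        h_ac = self.acoustic_encoder(self.subsample(feats))

        # 1) Acoustic supervision: CTC over the source transcript.
        log_probs = F.log_softmax(self.ctc_head(h_ac), -1).transpose(0, 1)
        loss_ctc = F.ctc_loss(log_probs, src_tokens, feat_lens, src_lens)

        h_sem = self.semantic_encoder(h_ac)

        # 2) Semantic supervision: pull (mean-pooled) semantic states toward
        #    frozen BERT embeddings (pooling is a simplification).
        loss_sem = F.mse_loss(self.sem_proj(h_sem.mean(dim=1)), bert_embs)

        # 3) Translation supervision: teacher-forced cross-entropy.
        T = tgt_in.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tgt_in.device), diagonal=1)
        dec = self.decoder(self.tgt_embed(tgt_in), h_sem, tgt_mask=causal)
        loss_mt = F.cross_entropy(self.out_proj(dec).transpose(1, 2), tgt_out)

        return loss_ctc + loss_sem + loss_mt  # equal weighting is an assumption
```

The design point the abstract emphasizes is that the second loss needs no data beyond the existing ST triples: the BERT targets are computed from the transcripts already present in the parallel corpus, so the semantic encoder receives dense supervision without any additional annotation.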
