Paper Title

"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation

Authors

Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

Abstract

End-to-end speech-to-text translation (ST) takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of parallel corpora available. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals that decouples the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT utilizes a pre-trained BERT model to enforce that the upper encoder produces as much semantic information as possible, without requiring extra data. We perform experiments on a diverse set of speech translation benchmarks, including LibriSpeech English-French, IWSLT English-German, and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.
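To make the triple-supervision idea concrete, below is a minimal PyTorch sketch of an ST model trained with three losses: a CTC loss on the acoustic encoder ("listen"), a distance loss pulling the semantic encoder's states toward frozen BERT embeddings of the transcript ("understand"), and a cross-entropy translation loss on the decoder ("translate"). This is an illustrative sketch only, not the authors' implementation (see the linked repository for that): the choice of CTC and MSE, the mean pooling, the equal loss weighting, and all module names and sizes (`TripleSupervisionST`, `sem_proj`, `d_model=512`, etc.) are assumptions.

```python
# Sketch of a triple-supervision ST objective in the spirit of LUT (assumed
# details: CTC acoustic loss, MSE-to-BERT semantic loss, CE translation loss).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TripleSupervisionST(nn.Module):
    def __init__(self, n_mels=80, d_model=512, bert_dim=768,
                 src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        self.subsample = nn.Linear(n_mels, d_model)  # stand-in for conv subsampling
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.acoustic_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.semantic_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.ctc_head = nn.Linear(d_model, src_vocab)   # "listen": ASR supervision
        self.sem_proj = nn.Linear(d_model, bert_dim)    # "understand": match BERT
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.out_proj = nn.Linear(d_model, tgt_vocab)   # "translate": CE loss

    def forward(self, feats, feat_lens, src_tokens, src_lens,
                bert_embs, tgt_in, tgt_out):
        """feats: (N, T, n_mels); src_tokens: (N, S) transcript ids for CTC;
        bert_embs: (N, bert_dim) frozen BERT sentence embeddings of transcripts;
        tgt_in / tgt_out: shifted target token ids, each (N, T_tgt)."""
        h_ac = self.acoustic_encoder(self.subsample(feats))

        # 1) Acoustic supervision: CTC over the source transcript.
        log_probs = F.log_softmax(self.ctc_head(h_ac), -1).transpose(0, 1)
        loss_ctc = F.ctc_loss(log_probs, src_tokens, feat_lens, src_lens)

        h_sem = self.semantic_encoder(h_ac)

        # 2) Semantic supervision: pull (mean-pooled) semantic states toward
        #    frozen BERT embeddings (pooling is a simplification).
        loss_sem = F.mse_loss(self.sem_proj(h_sem.mean(dim=1)), bert_embs)

        # 3) Translation supervision: teacher-forced cross-entropy.
        T = tgt_in.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tgt_in.device), diagonal=1)
        dec = self.decoder(self.tgt_embed(tgt_in), h_sem, tgt_mask=causal)
        loss_mt = F.cross_entropy(self.out_proj(dec).transpose(1, 2), tgt_out)

        return loss_ctc + loss_sem + loss_mt  # equal weighting is an assumption
```

The design point the abstract emphasizes is that the second loss needs no data beyond the existing ST triples: the BERT targets are computed from the transcripts already present in the parallel corpus, so the semantic encoder receives dense supervision without any additional annotation.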
