Paper Title

Bridging the Modality Gap for Speech-to-Text Translation

Paper Authors

Yuchen Liu, Junnan Zhu, Jiajun Zhang, Chengqing Zong

Paper Abstract

End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously, which ignores the modality difference between speech and text and overloads the encoder, making such a model difficult to learn. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model, which aims to improve end-to-end model performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism to match the length of the speech representation with that of the corresponding text transcription. To obtain better semantic representations, we fully integrate a text-based translation model into STAST so that the two tasks can be trained in the same latent space. Furthermore, we introduce a cross-modal adaptation method to close the distance between speech and text representations. Experimental results on English-French and English-German speech translation corpora show that our model significantly outperforms strong baselines and achieves new state-of-the-art performance.
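The abstract does not spell out how the shrink mechanism or the cross-modal adaptation are implemented, so the sketch below is only one plausible reading, not the authors' code: it assumes a CTC-style collapse of blank and repeated frames to shorten the speech representation, and a mean-pooled MSE distance between speech-side and text-side encoder outputs as the adaptation signal. The function names (`shrink_speech_representation`, `cross_modal_distance`), the use of PyTorch, and the CTC/MSE choices are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): a CTC-based "shrink" step and a
# mean-pooled cross-modal distance, as one possible reading of the abstract.
import torch
import torch.nn.functional as F


def shrink_speech_representation(frames: torch.Tensor,
                                 ctc_logits: torch.Tensor,
                                 blank_id: int = 0) -> torch.Tensor:
    """Collapse frame-level acoustic states toward token-level length.

    frames:     (T, d) encoder outputs for one utterance
    ctc_logits: (T, V) frame-level CTC logits over the vocabulary
    Returns a (T', d) tensor with T' <= T, keeping one averaged state per run
    of identical non-blank CTC predictions (an assumed realization of the
    shrink mechanism, not a detail confirmed by the paper).
    """
    preds = ctc_logits.argmax(dim=-1)  # (T,) frame-level greedy labels
    kept, start = [], 0
    for t in range(1, len(preds) + 1):
        # A run of identical predictions ends here (or the sequence ends).
        if t == len(preds) or preds[t] != preds[start]:
            if preds[start] != blank_id:  # drop blank segments entirely
                kept.append(frames[start:t].mean(dim=0))
            start = t
    if not kept:  # all-blank edge case: fall back to one pooled state
        return frames.mean(dim=0, keepdim=True)
    return torch.stack(kept)  # (T', d)


def cross_modal_distance(speech_repr: torch.Tensor,
                         text_repr: torch.Tensor) -> torch.Tensor:
    """Assumed adaptation loss: pull sentence-level speech and text
    representations (mean-pooled over positions) toward each other."""
    return F.mse_loss(speech_repr.mean(dim=0), text_repr.mean(dim=0))


if __name__ == "__main__":
    T, V, d = 50, 100, 256
    frames = torch.randn(T, d)          # toy acoustic encoder output
    ctc_logits = torch.randn(T, V)      # toy frame-level CTC logits
    shrunk = shrink_speech_representation(frames, ctc_logits)
    text_states = torch.randn(12, d)    # toy text-encoder output
    loss = cross_modal_distance(shrunk, text_states)
    print(shrunk.shape, loss.item())
```

In this reading, the shrunk speech states would be fed to the shared semantic encoder in place of the raw frame sequence, and the distance term would be added to the training objective alongside the translation loss; the actual weighting and placement are design choices the abstract does not specify.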
