Paper Title

Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition

Paper Authors

Zhang, Shuai; Yi, Jiangyan; Tian, Zhengkun; Bai, Ye; Tao, Jianhua; Wen, Zhengqi

Paper Abstract

Despite the significant advances recently witnessed in end-to-end (E2E) ASR systems for code-switching, the hunger for audio-text paired data limits further improvement of model performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network can learn acoustic patterns from large-scale monolingual paired data. Meanwhile, during training it generates multiple phoneme-sequence candidates for each audio sample in real time. The generated phoneme-text paired data is then used to train the P2T network, which can also be pre-trained with large amounts of external unpaired text data. By exploiting monolingual data and unpaired text data, the decoupled transformer model reduces, to a certain extent, the E2E model's heavy dependency on code-switching paired training data. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on a public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves an 18.14% relative mix error rate reduction.
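The data flow described in the abstract (audio → A2P network → phoneme posteriors / candidate sequences → P2T network → text tokens) can be sketched minimally as below. This is only an illustrative NumPy toy, not the paper's transformer architecture: the single linear layers, all dimensions, and the top-1 candidate selection are assumptions made purely to show the two-stage decoupling.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (assumptions, not from the paper).
N_FRAMES, N_MEL = 50, 80        # audio: 50 frames of 80-dim features
N_PHONEMES, N_TOKENS = 60, 100  # phoneme inventory and text vocabulary sizes

# A2P stage: audio features -> per-frame phoneme distribution.
# Trainable on large-scale monolingual paired data in the paper;
# here it is just a random linear map standing in for the network.
W_a2p = rng.standard_normal((N_MEL, N_PHONEMES)) * 0.1

def a2p(audio):
    return softmax(audio @ W_a2p)  # shape: (frames, phonemes)

# P2T stage: phoneme posteriors -> per-frame text-token distribution.
# In the paper this part can be pre-trained on external unpaired text.
W_p2t = rng.standard_normal((N_PHONEMES, N_TOKENS)) * 0.1

def p2t(phoneme_post):
    return softmax(phoneme_post @ W_p2t)  # shape: (frames, tokens)

audio = rng.standard_normal((N_FRAMES, N_MEL))
phoneme_post = a2p(audio)

# The paper's A2P stage emits multiple phoneme-sequence candidates during
# training; this sketch keeps only the frame-wise argmax as one candidate.
candidate = phoneme_post.argmax(axis=-1)

token_post = p2t(phoneme_post)
print(phoneme_post.shape, token_post.shape, candidate.shape)
```

Feeding the A2P posteriors (rather than a hard phoneme decision) into the P2T stage mirrors the joint optimization idea: gradients from the text objective could flow back into the acoustic stage, which a hard argmax would block.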
