Paper Title

Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages

Paper Authors

Haitong Zhang, Yue Lin

Paper Abstract

Recently, sequence-to-sequence models with attention have been successfully applied to text-to-speech (TTS). These models can generate near-human speech given a large, accurately transcribed speech corpus. However, preparing such a large dataset is both expensive and laborious. To alleviate the problem of heavy data demand, we propose a novel unsupervised pre-training mechanism in this paper. Specifically, we first use a Vector-Quantized Variational Autoencoder (VQ-VAE) to extract unsupervised linguistic units from large-scale, publicly available, and untranscribed speech. We then pre-train the sequence-to-sequence TTS model using the <unsupervised linguistic unit, audio> pairs. Finally, we fine-tune the model with a small amount of <text, audio> paired data from the target speaker. Both objective and subjective evaluations show that our proposed method can synthesize more intelligible and natural speech with the same amount of paired training data. In addition, we extend our proposed method to hypothesized low-resource languages and verify its effectiveness using objective evaluation.
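To make the unit-extraction step concrete, the sketch below shows the nearest-neighbour quantization at the core of a VQ-VAE: continuous frame-level encoder features are mapped to discrete codebook indices, which then stand in for text in the <unsupervised linguistic unit, audio> pre-training pairs. This is a minimal illustration only, not the authors' implementation; the names and hyperparameters (`CODEBOOK_SIZE`, `DIM`, `quantize`) are assumptions for the sake of the example.

```python
# Minimal sketch of VQ-VAE nearest-neighbour quantization (illustrative, not
# the paper's code). Hyperparameters below are assumed, not from the paper.
import torch

CODEBOOK_SIZE = 512   # number of discrete units (assumed)
DIM = 64              # latent dimensionality per frame (assumed)

# Learned codeword embeddings e_1..e_K; randomly initialized here for the demo.
codebook = torch.randn(CODEBOOK_SIZE, DIM)

def quantize(z: torch.Tensor) -> torch.Tensor:
    """Map each frame-level latent z_t to the index of its nearest codeword.

    z: (T, DIM) encoder outputs for one utterance.
    returns: (T,) integer unit IDs that replace phonemes/text when building
             <unit sequence, audio> pairs for TTS pre-training.
    """
    dists = torch.cdist(z, codebook)   # (T, CODEBOOK_SIZE) pairwise L2 distances
    return dists.argmin(dim=-1)        # nearest-neighbour codebook lookup

if __name__ == "__main__":
    z = torch.randn(100, DIM)          # 100 frames of dummy encoder features
    units = quantize(z)
    print(units[:10])                  # discrete unit IDs fed to the TTS model
```

After pre-training on such unit/audio pairs, the text encoder's input vocabulary is swapped from unit IDs to characters or phonemes and the model is fine-tuned on the small <text, audio> set from the target speaker.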
