Parallel WaveNet conditioned on VAE latent vectors

Paper title

Parallel WaveNet conditioned on VAE latent vectors

Authors

Jonas Rohnke, Tom Merritt, Jaime Lorenzo-Trueba, Adam Gabrys, Vatsal Aggarwal, Alexis Moinet, Roberto Barra-Chicote

Abstract

Recently the state-of-the-art text-to-speech synthesis systems have shifted to a two-model approach: a sequence-to-sequence model to predict a representation of speech (typically mel-spectrograms), followed by a 'neural vocoder' model which produces the time-domain speech waveform from this intermediate speech representation. This approach is capable of synthesizing speech that is confusable with natural speech recordings. However, the inference speed of neural vocoder approaches represents a major obstacle for deploying this technology for commercial applications. Parallel WaveNet is one approach which has been developed to address this issue, trading off some synthesis quality for significantly faster inference speed. In this paper we investigate the use of a sentence-level conditioning vector to improve the signal quality of a Parallel WaveNet neural vocoder. We condition the neural vocoder with the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model. With this, we are able to significantly improve the quality of vocoded speech.
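The abstract does not say how the sentence-level latent vector is fed to the vocoder. As a minimal sketch of one common way to apply an utterance-level conditioning vector, the vector can be repeated for every frame and concatenated with the frame-level mel-spectrogram features. The function name `broadcast_condition`, the 80 mel bins, and the 64-dimensional latent are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def broadcast_condition(mel, z):
    """Attach one sentence-level vector to every frame of a mel-spectrogram.

    mel: array of shape (n_frames, n_mels), frame-level features
    z:   array of shape (latent_dim,), one vector per utterance
    Returns an array of shape (n_frames, n_mels + latent_dim).

    Hypothetical helper: the paper's actual conditioning mechanism
    is not described in the abstract.
    """
    mel = np.asarray(mel, dtype=np.float32)
    z = np.asarray(z, dtype=np.float32)
    if mel.ndim != 2 or z.ndim != 1:
        raise ValueError("expected mel of shape (T, M) and z of shape (D,)")
    # Repeat the same latent vector along the time axis.
    z_tiled = np.broadcast_to(z, (mel.shape[0], z.shape[0]))
    # Concatenate along the feature axis so each frame sees the latent.
    return np.concatenate([mel, z_tiled], axis=1)


# Example with assumed sizes: 200 frames, 80 mel bins, 64-dim latent.
rng = np.random.default_rng(0)
mel = rng.standard_normal((200, 80)).astype(np.float32)
z = rng.standard_normal(64).astype(np.float32)
cond = broadcast_condition(mel, z)
```

The resulting `cond` array would then stand in for the plain mel-spectrogram as the vocoder's conditioning input; because the latent is constant over time, it can only carry utterance-wide information.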

Paper title