Paper Title

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

Authors

Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

Abstract

This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion. A recognizer is used to transform acoustic features into linguistic representations while a synthesizer recovers output features from the recognizer outputs together with the speaker identity. By separating the speaker characteristics from the linguistic representations, voice conversion can be achieved by replacing the speaker identity with the target one. In our proposed method, a speaker adversarial loss is adopted in order to obtain speaker-independent linguistic representations using the recognizer. Furthermore, discriminators are introduced and a generative adversarial network (GAN) loss is used to prevent the predicted features from being over-smoothed. For training model parameters, a strategy of pre-training on a multi-speaker dataset and then fine-tuning on the source-target speaker pair is designed. Our method achieved higher similarity than the baseline model that obtained the best performance in Voice Conversion Challenge 2018.
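The conversion pipeline described above — a recognizer that strips speaker identity from acoustic features, and a synthesizer that recombines the resulting linguistic representation with a (possibly different) speaker identity — can be sketched with toy linear layers. This is a minimal illustration only, not the paper's model: the dimensions, the linear maps, and the speaker-embedding lookup are all hypothetical, and the adversarial losses and pre-training/fine-tuning strategy are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
ACOUSTIC_DIM, LINGUISTIC_DIM, SPEAKER_DIM = 80, 64, 16

# Toy linear "recognizer": acoustic features -> linguistic representation.
W_rec = rng.standard_normal((ACOUSTIC_DIM, LINGUISTIC_DIM)) * 0.1

# Toy linear "synthesizer": [linguistic rep ; speaker embedding] -> acoustic features.
W_syn = rng.standard_normal((LINGUISTIC_DIM + SPEAKER_DIM, ACOUSTIC_DIM)) * 0.1

# One learned embedding per speaker (here: random placeholders).
speaker_embeddings = {
    "source": rng.standard_normal(SPEAKER_DIM),
    "target": rng.standard_normal(SPEAKER_DIM),
}

def recognize(acoustic):
    """Map a frame sequence (T, ACOUSTIC_DIM) to linguistic representations.

    In the paper, a speaker adversarial loss pushes this output to be
    speaker-independent; that training signal is not modeled here.
    """
    return acoustic @ W_rec

def synthesize(linguistic, speaker):
    """Recover acoustic features from linguistic reps plus a speaker identity."""
    emb = np.tile(speaker_embeddings[speaker], (linguistic.shape[0], 1))
    return np.concatenate([linguistic, emb], axis=1) @ W_syn

def convert(acoustic_src, target_speaker="target"):
    """Voice conversion: keep the linguistic content, swap the speaker identity."""
    return synthesize(recognize(acoustic_src), target_speaker)

utterance = rng.standard_normal((100, ACOUSTIC_DIM))  # 100 toy frames
converted = convert(utterance)
print(converted.shape)  # (100, 80)
```

The key design point the sketch mirrors is that speaker identity enters only at the synthesizer input, so conversion is just a change of the speaker embedding; everything that makes this work in practice (adversarial speaker classifier, GAN loss against over-smoothing, multi-speaker pre-training) lives in the training objectives, not in this forward pass.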
