Paper Title

Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training

Authors

Haohan Guo, Heng Lu, Na Hu, Chunlei Zhang, Shan Yang, Lei Xie, Dan Su, Dong Yu

Abstract

This paper describes an end-to-end adversarial singing voice conversion (EA-SVC) approach. It directly generates an arbitrary singing waveform from a phonetic posteriorgram (PPG) representing content, F0 representing pitch, and a speaker embedding representing timbre. The proposed system is composed of three modules: the generator $G$, the audio generation discriminator $D_{A}$, and the feature disentanglement discriminator $D_F$. The generator $G$ encodes the features in parallel and inversely transforms them into the target waveform. To make timbre conversion more stable and controllable, the speaker embedding is further decomposed into a weighted sum of a group of trainable vectors representing different timbre clusters. Further, to realize more robust and accurate singing conversion, the disentanglement discriminator $D_F$ is proposed to remove pitch- and timbre-related information that remains in the encoded PPG. Finally, two-stage training is conducted to keep the adversarial training process stable and effective. Subjective evaluation results demonstrate the effectiveness of the proposed methods. The proposed system outperforms the conventional cascade approach and the WaveNet-based end-to-end approach in terms of both singing quality and singer similarity. Further objective analysis reveals that the model trained with the proposed two-stage training strategy produces smoother and sharper formants, which leads to higher audio quality.
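
The timbre decomposition mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that the speaker embedding is expressed as a weighted sum of trainable timbre-cluster vectors; how the weights are obtained is not given there. The dot-product scoring, the softmax normalization, the dimensions, and every name below (`decompose_speaker_embedding`, `cluster_vectors`, `DIM`, `NUM_CLUSTERS`) are assumptions made for illustration and may differ from the authors' implementation. The cluster vectors are random stand-ins for parameters that would be learned during training.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decompose_speaker_embedding(speaker_emb, cluster_vectors):
    """Re-express a speaker embedding as a weighted sum of cluster vectors.

    Assumption: weights come from a softmax over dot-product scores.
    The abstract does not specify the scoring function.
    """
    scores = [sum(a * b for a, b in zip(speaker_emb, c)) for c in cluster_vectors]
    weights = softmax(scores)
    dim = len(cluster_vectors[0])
    timbre = [
        sum(w * c[d] for w, c in zip(weights, cluster_vectors)) for d in range(dim)
    ]
    return weights, timbre


# Hypothetical sizes; random values stand in for trained parameters.
random.seed(0)
DIM, NUM_CLUSTERS = 4, 3
cluster_vectors = [
    [random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_CLUSTERS)
]
speaker_emb = [random.gauss(0.0, 1.0) for _ in range(DIM)]

weights, timbre = decompose_speaker_embedding(speaker_emb, cluster_vectors)
```

Because the weights are non-negative and sum to one in this sketch, the resulting timbre vector always lies inside the region spanned by the cluster vectors, which is one plausible reading of why such a decomposition would make timbre conversion more stable and controllable.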
