Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains

Paper title

Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains

Authors

Won Jang, Dan Lim, Jaesam Yoon

Abstract

We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains. To preserve sound quality when the MelGAN-based structure is trained with a dataset of hundreds of speakers, we added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms. This enables the model to generate realistic waveforms for multiple speakers by alleviating the over-smoothing problem in the high-frequency band of the large-footprint model. Our structure generates signals close to ground-truth data without reducing the inference speed, by discriminating both the waveform and the spectrogram during training. The model achieved the best mean opinion score (MOS) in most scenarios using ground-truth mel-spectrograms as input. In particular, it showed superior performance in unseen domains with regard to speaker, emotion, and language. Moreover, in a multi-speaker text-to-speech scenario using mel-spectrograms generated by a transformer model, it synthesized high-fidelity speech with a MOS of 4.22. These results, achieved without external domain information, highlight the potential of the proposed model as a universal vocoder.
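
To make the phrase "multi-resolution spectrogram" concrete, the sketch below computes magnitude spectrograms of one waveform at several STFT resolutions using NumPy. It is only an illustration of the input representation such discriminators could look at, not the authors' implementation: the discriminator networks themselves are omitted, and the names `stft_magnitude`, `multi_resolution_spectrograms`, the `RESOLUTIONS` values, and the 22050 Hz sample rate are all assumptions, since the abstract does not state them.

```python
import numpy as np


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Result shape: (n_frames, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))


# Hypothetical (n_fft, hop) pairs; the abstract does not give the real settings.
RESOLUTIONS = [(512, 128), (1024, 256), (2048, 512)]


def multi_resolution_spectrograms(signal, resolutions=RESOLUTIONS):
    """Return one magnitude spectrogram per (n_fft, hop) resolution."""
    return [stft_magnitude(signal, n_fft, hop) for n_fft, hop in resolutions]


# Usage: a one-second 440 Hz sine at an assumed 22050 Hz sample rate.
SAMPLE_RATE = 22050
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
wave = np.sin(2 * np.pi * 440.0 * t)
specs = multi_resolution_spectrograms(wave)
for (n_fft, hop), spec in zip(RESOLUTIONS, specs):
    print(n_fft, hop, spec.shape)
```

Short windows give finer time resolution and long windows give finer frequency resolution, so looking at several resolutions at once exposes different kinds of spectral error. That is one plausible reading of why the abstract credits these discriminators with sharper high-frequency detail.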
