Paper Title

Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

Authors

Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li

Abstract

Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is therefore more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion and use a linear transformation for F0 transfer. However, F0 is an important prosodic factor and is inherently hierarchical, so a linear method alone is insufficient for its conversion. We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody at different time resolutions. We also propose to train two CycleGAN pipelines, one for spectrum mapping and one for prosody mapping. In this way, we eliminate the need for parallel data in any two languages and for any alignment technique. Experimental results show that our proposed Spectrum-Prosody-CycleGAN framework outperforms the Spectrum-CycleGAN baseline in subjective evaluation. To the best of our knowledge, this is the first study of prosody in cross-lingual voice conversion.
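
As a rough illustration of the two F0 treatments contrasted in the abstract, the following Python sketch shows (1) a linear F0 transfer, here assumed to be the common log-Gaussian mean-variance normalization, and (2) a CWT decomposition of an F0 contour into several temporal scales. The abstract does not specify the wavelet, the number of scales, or any preprocessing, so the Mexican hat wavelet, the ten dyadic scales, and the function names `linear_f0_transform` and `cwt_decompose` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def linear_f0_transform(f0, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Linear F0 transfer in the log domain (assumed baseline form).

    Voiced frames (f0 > 0) are shifted and scaled so that their log-F0
    statistics move from the source to the target speaker. Unvoiced
    frames (f0 == 0) stay at zero.
    """
    f0 = np.asarray(f0, dtype=float)
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    log_f0 = np.log(f0[voiced])
    converted[voiced] = np.exp((log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
    return converted


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet, an assumed choice."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)


def cwt_decompose(signal, scales):
    """Decompose a 1-D contour into one component per temporal scale.

    Each row is the contour convolved with the wavelet dilated by that
    scale (measured in frames). No edge padding is applied, so values
    near the boundaries are affected by truncation.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    components = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        half = int(np.ceil(5 * s))                 # wavelet support of about +/- 5 scales
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)       # 1/sqrt(scale) normalization
        full = np.convolve(signal, kernel)         # length n + 2 * half
        components[i] = full[half:half + n]        # keep the centered n samples
    return components


# Synthetic, fully voiced F0 contour (Hz) with a slow and a fast component.
frames = np.arange(400)
f0 = 120.0 + 20.0 * np.sin(2 * np.pi * frames / 200) + 5.0 * np.sin(2 * np.pi * frames / 20)

# Normalize log-F0 to zero mean and unit variance before the decomposition.
log_f0 = np.log(f0)
norm_log_f0 = (log_f0 - log_f0.mean()) / log_f0.std()

scales = 2.0 ** np.arange(10)                      # ten dyadic scales (assumption)
components = cwt_decompose(norm_log_f0, scales)
print(components.shape)
```

Small scales respond to fast, local F0 movements and large scales to slow, phrase-level trends, which is the sense in which the decomposition exposes prosody at different time resolutions. The linear transform, by contrast, applies a single global shift and scale to the whole contour.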
