Paper title
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
Paper authors
Paper abstract
This paper presents a novel neural network system design for fine-grained style modeling, transfer, and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied to disentangle the content and style factors in speech effectively and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. Objective and subjective evaluation results show that our system performs better than other fine-grained speech style transfer models, especially in content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples for system demonstration are available online: https://daxintan-cuhk.github.io/pl-csd-speech
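To make the idea of phone-level style embeddings concrete, the following is a minimal sketch and not the authors' implementation. It assumes that phone boundaries (in frames) are already available, for example from a forced aligner, and it replaces the learned style encoder with a mean pool followed by a fixed linear projection. The names `mean_pool`, `project`, `phone_level_style_embeddings`, and the `weights` matrix are all hypothetical and introduced only for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one mel-spectrogram frame (n_mels values)


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Average a non-empty run of frames into a single n_mels vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def project(vec: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Linear map (d x n_mels) standing in for a learned style encoder (assumption)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def phone_level_style_embeddings(
    mel: Sequence[Frame],
    boundaries: Sequence[Tuple[int, int]],
    weights: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return one style embedding per phone segment.

    mel:        T frames of n_mels values each.
    boundaries: (start, end) frame indices per phone, end exclusive.
    weights:    hypothetical d x n_mels projection matrix.
    """
    embeddings = []
    for start, end in boundaries:
        if not 0 <= start < end <= len(mel):
            raise ValueError(f"invalid phone segment ({start}, {end})")
        segment = mel[start:end]  # frames belonging to this phone
        embeddings.append(project(mean_pool(segment), weights))
    return embeddings


# Toy example: 3 frames, 2 mel bins, 2 phones, 3-dimensional embeddings.
mel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
boundaries = [(0, 2), (2, 3)]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = phone_level_style_embeddings(mel, boundaries, weights)
print(embeddings)  # [[2.0, 3.0, 5.0], [5.0, 6.0, 11.0]]
```

The point of the sketch is only the granularity: one embedding per phone segment rather than one per utterance. In the system the abstract describes, the encoder would be trained, and the collaborative and adversarial objectives would discourage these embeddings from carrying phone identity; neither is modeled here.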