Paper title
Generic Indic Text-to-speech Synthesisers with Rapid Adaptation in an End-to-end Framework
Paper authors
Paper abstract
Building text-to-speech (TTS) synthesisers for Indian languages is a difficult task owing to the large number of active languages. Indian languages can be classified into a finite set of families, the most prominent being Indo-Aryan and Dravidian. The proposed work exploits this property to build a generic TTS system using multiple languages from the same family in an end-to-end framework. Generic systems are quite robust, as they are capable of capturing a variety of phonotactics across languages. These systems are then adapted to a new language in the same family using small amounts of adaptation data. Experiments indicate that good-quality TTS systems can be built using only 7 minutes of adaptation data. An average degradation mean opinion score of 3.98 is obtained for the adapted TTSes. An extensive analysis of systematic interactions between languages in the generic TTSes is carried out. x-vectors are included as speaker embeddings to synthesise text in a particular speaker's voice. An interesting observation is that the prosody of the target speaker's voice is preserved. These results are quite promising, as they indicate that generic TTSes can handle speaker and language switching seamlessly and can be adapted easily to a new language.