Paper Title

Audiovisual Speech Synthesis using Tacotron2

Authors

Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright, Gabriele Fanelli, Justin Binder, Yannis Stylianou, Sachin Kajarekar

Abstract


Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2. The reconstructed acoustic speech signal is then used to drive the facial controls of the face model using an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the required prosody to generate emotional audiovisual speech. We analyze the performance of the two systems and compare them to the ground truth videos using subjective evaluation tests. The end-to-end and modular systems are able to synthesize close to human-like audiovisual speech with mean opinion scores (MOS) of 4.1 and 3.9, respectively, compared to a MOS of 4.1 for the ground truth generated from professionally recorded videos. While the end-to-end system gives a better overall quality, the modular approach is more flexible and the quality of acoustic speech and visual speech synthesis is almost independent of each other.
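The abstract contrasts two data flows: the end-to-end AVTacotron2 emits acoustic features and face controllers jointly from phonemes, while the modular system first runs a standard Tacotron2 and then derives face controllers from the synthesized audio. The sketch below illustrates only that wiring; all function names, dimensions, and the stub "models" are illustrative stand-ins, not the authors' actual implementation or API.

```python
# Toy sketch of the two pipelines described in the abstract.
# All names and dimensions are hypothetical; real models would be neural
# networks operating on tensors, not the deterministic stubs used here.
import random

ACOUSTIC_DIM = 80  # e.g. mel-spectrogram bins (assumed)
FACE_DIM = 51      # e.g. blendshape controllers of a 3D face model (assumed)

def text_to_phonemes(text):
    """Toy front end: one pseudo-phoneme per letter."""
    return [c for c in text.lower() if c.isalpha()]

def avtacotron2(phonemes, emotion):
    """End-to-end model (stub): phonemes plus an emotion embedding map to
    time-aligned acoustic frames AND face-controller frames."""
    rng = random.Random(len(phonemes) + len(emotion))  # deterministic stub
    acoustic = [[rng.random() for _ in range(ACOUSTIC_DIM)] for _ in phonemes]
    face = [[rng.random() for _ in range(FACE_DIM)] for _ in phonemes]
    return acoustic, face

def tacotron2(phonemes, emotion):
    """Audio-only model used by the modular system (stub)."""
    acoustic, _ = avtacotron2(phonemes, emotion)
    return acoustic

def audio_to_face(acoustic):
    """Independently trained audio-to-facial-animation network (stub):
    one face-controller frame per acoustic frame."""
    return [frame[:FACE_DIM] for frame in acoustic]

def wavernn(acoustic):
    """Vocoder stub: acoustic frames -> waveform samples."""
    return [s for frame in acoustic for s in frame]

def synthesize_end_to_end(text, emotion="neutral"):
    """Pipeline 1: one model produces both modalities; WaveRNN vocodes."""
    acoustic, face = avtacotron2(text_to_phonemes(text), emotion)
    return wavernn(acoustic), face

def synthesize_modular(text, emotion="neutral"):
    """Pipeline 2: audio is synthesized first, then drives the face model."""
    acoustic = tacotron2(text_to_phonemes(text), emotion)
    return wavernn(acoustic), audio_to_face(acoustic)
```

The structural point the sketch makes is the one the abstract draws: in the modular path, `audio_to_face` depends only on the acoustic output, so the two synthesis stages can be trained and swapped independently, whereas the end-to-end path couples them inside one model.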
