Paper Title

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Paper Authors

Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng

Paper Abstract

Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations to effectively control the speaker identity of generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained-vocabulary and open-vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility and speaker similarity. Our demo page is released here: https://wendison.github.io/VCVTS-demo/
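The abstract's core idea is that continuous content features are vector-quantized into a sequence of discrete acoustic-unit indices, which Lip2Ind then learns to predict from video. The following is a minimal, self-contained sketch of that quantization step only (nearest-codebook-entry lookup); the function name, toy codebook, and feature values are illustrative assumptions, not the paper's actual VQCPC implementation.

```python
import math

def quantize_to_indices(features, codebook):
    """Map each frame-level feature vector to the index of its nearest
    codebook entry (Euclidean distance), producing the kind of discrete
    acoustic-unit index sequence that Lip2Ind is trained to infer.
    NOTE: a toy illustration, not the paper's VQCPC module."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))
        for frame in features
    ]

# Hypothetical toy example: 4 frames of 2-dim content features,
# a codebook with 3 entries (real systems use far larger sizes).
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
features = [(0.1, -0.1), (0.9, 1.1), (1.9, 0.2), (0.2, 0.1)]
print(quantize_to_indices(features, codebook))  # → [0, 1, 2, 0]
```

In the full system, these indices (rather than the continuous features) carry the spoken content, so speaker identity can be supplied separately by the speaker encoder.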
