Paper Title

Robust One Shot Audio to Video Generation

Paper Authors

Neeraj Kumar, Srishti Goel, Ankur Narang, Mujtaba Hasan

Paper Abstract

Audio to Video generation is an interesting problem that has numerous applications across industry verticals including film making, multimedia, marketing, education, and others. High-quality video generation with expressive facial movements is a challenging problem that involves complex learning steps for generative adversarial networks. Further, enabling one-shot learning from a single unseen image increases the complexity of the problem while simultaneously making it more applicable to practical scenarios. In this paper, we propose a novel approach, OneShotA2V, to synthesize a talking-person video of arbitrary length from two inputs: an audio signal and a single unseen image of a person. OneShotA2V leverages curriculum learning to learn the movements of expressive facial components and hence generates a high-quality talking-head video of the given person. Further, it feeds the features generated from the audio input directly into a generative adversarial network, and it adapts to any given unseen selfie by applying few-shot learning with only a few output update epochs. OneShotA2V uses an architecture with a spatially adaptive normalization based multi-level generator and multiple multi-level discriminators. The input audio clip is not restricted to any specific language, which gives the method multilingual applicability. Experimental evaluation demonstrates the superior performance of OneShotA2V compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [43], Speech2Vid [8], and other approaches on multiple quantitative metrics, including SSIM (structural similarity index), PSNR (peak signal-to-noise ratio), and CPBD (image sharpness). Further, qualitative evaluation and online Turing tests demonstrate the efficacy of our approach.
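For readers unfamiliar with the building block the abstract names, "spatially adaptive normalization" ordinarily refers to SPADE-style conditional normalization (Park et al., 2019), in which activations are normalized and then modulated by scale and shift maps predicted from a conditioning input. The abstract does not specify layer sizes or the exact conditioning signal, so the sketch below is a minimal, hypothetical PyTorch illustration: the `SPADE` class name, channel counts, and the assumption that the condition is an identity- or audio-derived feature map are ours, not the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Minimal spatially adaptive normalization block (illustrative sketch).

    Activations are normalized with parameter-free batch norm, then modulated
    element-wise by gamma/beta maps predicted from a conditioning input
    (assumed here to be identity- or audio-derived feature maps).
    """
    def __init__(self, feat_channels: int, cond_channels: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Match the conditioning map to the spatial size of the activations.
        cond = F.interpolate(cond, size=x.shape[2:], mode="nearest")
        h = self.shared(cond)
        # Denormalize with spatially varying scale and shift.
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```

Of the metrics the abstract reports, PSNR has a simple closed form, PSNR = 10 log10(MAX^2 / MSE). A reference NumPy implementation is sketched below (ours, not the paper's evaluation code):

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two same-shape images."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```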
