Iterative Text-based Editing of Talking-heads Using Neural Retargeting

Paper Title

Iterative Text-based Editing of Talking-heads Using Neural Retargeting

Authors

Xinwei Yao, Ohad Fried, Kayvon Fatahalian, Maneesh Agrawala

Abstract

We present a text-based tool for editing talking-head video that enables an iterative editing workflow. On each iteration users can edit the wording of the speech, further refine mouth motions if necessary to reduce artifacts and manipulate non-verbal aspects of the performance by inserting mouth gestures (e.g. a smile) or changing the overall performance style (e.g. energetic, mumble). Our tool requires only 2-3 minutes of the target actor video and it synthesizes the video for each iteration in about 40 seconds, allowing users to quickly explore many editing possibilities as they iterate. Our approach is based on two key ideas. (1) We develop a fast phoneme search algorithm that can quickly identify phoneme-level subsequences of the source repository video that best match a desired edit. This enables our fast iteration loop. (2) We leverage a large repository of video of a source actor and develop a new self-supervised neural retargeting technique for transferring the mouth motions of the source actor to the target actor. This allows us to work with relatively short target actor videos, making our approach applicable in many real-world editing scenarios. Finally, our refinement and performance controls give users the ability to further fine-tune the synthesized results.
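
The abstract does not spell out how the phoneme search works, so the following is only an illustrative sketch of the general idea: covering an edited phoneme sequence with contiguous runs taken from a source repository. The function names (`longest_match_at`, `cover_query`), the ARPAbet-style symbols, and the greedy exact-match strategy are all assumptions made for illustration; the authors' algorithm is presumably faster and also accounts for timing and visual similarity.

```python
from typing import List, Tuple


def longest_match_at(repo: List[str], query: List[str], q_start: int) -> Tuple[int, int]:
    """Return (repo_start, length) of the longest contiguous run in `repo`
    that matches `query` starting at index `q_start`. Length 0 means no match."""
    best_start, best_len = -1, 0
    for r in range(len(repo)):
        k = 0
        # Extend the match while both sequences agree and stay in bounds.
        while (r + k < len(repo)
               and q_start + k < len(query)
               and repo[r + k] == query[q_start + k]):
            k += 1
        if k > best_len:
            best_start, best_len = r, k
    return best_start, best_len


def cover_query(repo: List[str], query: List[str]) -> List[Tuple[int, int]]:
    """Greedily cover `query` with the longest matching runs from `repo`.

    Returns a list of (repo_start, length) segments in query order.
    Raises ValueError if some query phoneme never occurs in the repository.
    This is a naive exact-match stand-in, not the paper's algorithm.
    """
    segments = []
    q = 0
    while q < len(query):
        start, length = longest_match_at(repo, query, q)
        if length == 0:
            raise ValueError(f"phoneme {query[q]!r} not found in repository")
        segments.append((start, length))
        q += length
    return segments


if __name__ == "__main__":
    # Hypothetical repository phonemes ("hello world") and an edited query.
    repo = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
    query = ["W", "ER", "L", "OW"]
    print(cover_query(repo, query))
```

Each returned segment identifies a span of source frames that could be stitched together for the edit; fewer and longer segments generally mean fewer seams to blend.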
