Paper Title
Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning
Paper Authors
Paper Abstract
Existing unsupervised video-to-video translation methods fail to produce translated videos that are frame-wise realistic, preserve semantic information, and remain consistent at the video level. In this work, we propose UVIT, a novel unsupervised video-to-video translation model. Our model decomposes style and content, uses a specialized encoder-decoder structure, and propagates inter-frame information through bidirectional recurrent neural network (RNN) units. The style-content decomposition mechanism enables us to achieve style-consistent video translation results and provides a good interface for modality-flexible translation. In addition, by varying the input frames and style codes incorporated in the translation, we propose a video interpolation loss that captures temporal information within the sequence, allowing us to train our building blocks in a self-supervised manner. Our model can produce photo-realistic, spatio-temporally consistent translated videos in a multimodal way. Subjective and objective experimental results validate the superiority of our model over existing methods. More details can be found on our project website: https://uvit.netlify.com
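The abstract names three reusable building blocks: a style/content decomposition, an encoder-decoder that propagates inter-frame information through bidirectional RNN units, and a self-supervised video interpolation loss. The snippet below is a minimal PyTorch-style sketch of how such pieces could fit together. It is not the authors' released implementation: the module sizes, the plain GRU standing in for the paper's recurrent unit, the mean-fusion of the two temporal directions, and all names (`ContentEncoder`, `StyleEncoder`, `BiRNN`, `video_interpolation_loss`) are illustrative assumptions.

```python
# Hypothetical sketch of UVIT-style building blocks; shapes and names are assumptions.
import torch
import torch.nn as nn

FEAT = 128  # assumed feature width

class ContentEncoder(nn.Module):
    """Maps a frame to a spatial content feature map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, FEAT, 4, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, frame):                # (B, 3, H, W) -> (B, FEAT, H/4, W/4)
        return self.net(frame)

class StyleEncoder(nn.Module):
    """Maps a frame to a global style code (the decomposed 'style')."""
    def __init__(self, style_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, style_dim),
        )
    def forward(self, frame):                # (B, 3, H, W) -> (B, style_dim)
        return self.net(frame)

class Decoder(nn.Module):
    """Reassembles a frame from a content feature map and a style code."""
    def __init__(self, style_dim=8):
        super().__init__()
        self.fuse = nn.Conv2d(FEAT + style_dim, FEAT, 1)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(FEAT, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, content, style):
        # Broadcast the style code over the spatial grid, then fuse and decode.
        s = style[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.net(self.fuse(torch.cat([content, s], dim=1)))

class BiRNN(nn.Module):
    """Propagates inter-frame information in both temporal directions
    (a plain GRU per spatial location, standing in for the paper's unit)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT, FEAT, batch_first=True, bidirectional=True)
        self.merge = nn.Linear(2 * FEAT, FEAT)
    def forward(self, feats):                # (B, T, FEAT, h, w)
        B, T, C, h, w = feats.shape
        x = feats.permute(0, 3, 4, 1, 2).reshape(B * h * w, T, C)
        out, _ = self.rnn(x)                 # (B*h*w, T, 2*FEAT)
        out = self.merge(out)                # back to FEAT channels
        return out.reshape(B, h, w, T, C).permute(0, 3, 4, 1, 2)

def video_interpolation_loss(enc_c, birnn, dec, enc_s, clip):
    """Self-supervised objective in the spirit of the abstract: reconstruct
    the middle frame of a triplet from its two neighbours, reusing the
    same encoders/decoder and the frame's own style code."""
    prev_f, mid_f, next_f = clip[:, 0], clip[:, 1], clip[:, 2]
    feats = torch.stack([enc_c(prev_f), enc_c(next_f)], dim=1)  # (B, 2, FEAT, h, w)
    prop = birnn(feats)                      # bidirectional temporal propagation
    fused = prop.mean(dim=1)                 # assumed fusion of both directions
    pred = dec(fused, enc_s(mid_f))
    return nn.functional.l1_loss(pred, mid_f)

# Usage sketch on random data:
enc_c, enc_s, dec, birnn = ContentEncoder(), StyleEncoder(), Decoder(), BiRNN()
clip = torch.randn(2, 3, 3, 64, 64)          # (batch, 3 frames, C, H, W)
loss = video_interpolation_loss(enc_c, birnn, dec, enc_s, clip)
loss.backward()
```

Because the interpolation target is a real frame from the same video, this loss needs no paired cross-domain data, which is what makes the training self-supervised; swapping the style code at inference time is what would give the modality-flexible, multimodal outputs the abstract describes.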