Paper Title

An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal

Paper Authors

Sifeng Xia, Kunchangtai Liang, Wenhan Yang, Ling-Yu Duan, Jiaying Liu

Paper Abstract

In this paper, we study a new problem arising from the emerging MPEG standardization effort Video Coding for Machines (VCM), which aims to bridge the gap between visual feature compression and classical video coding. VCM is committed to addressing the requirement of compact signal representation for both machine and human vision in a more or less scalable way. To this end, we make endeavors in leveraging the strength of predictive and generative models to support advanced compression techniques for both machine and human vision tasks simultaneously, in which visual features serve as a bridge to connect signal-level and task-level compact representations in a scalable manner. Specifically, we employ a conditional deep generative network to reconstruct video frames with the guidance of learned motion patterns. By learning to extract sparse motion patterns via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames via a generative model, relying on the appearance of the coded key frames. Meanwhile, the sparse motion patterns are compact and highly effective for high-level vision tasks, e.g., action recognition. Experimental results demonstrate that our method yields much better reconstruction quality than traditional video codecs (a 0.0063 gain in SSIM), as well as state-of-the-art action recognition performance on highly compressed videos (a 9.4% gain in recognition accuracy), which showcases a promising paradigm of coding signals for both human and machine vision.
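
The abstract describes a scalable pipeline: a predictive model extracts sparse motion patterns per frame, and a conditional generative network reconstructs to-be-coded frames from a coded key frame plus those patterns, while the same compact patterns can feed machine-vision tasks such as action recognition. The sketch below is only a minimal PyTorch illustration of that idea; the module names, network sizes, and the way motion is injected into the generator are assumptions for clarity, not the authors' actual architecture.

```python
# Minimal sketch of the "sparse motion as a bridge" pipeline from the abstract.
# All layer choices and shapes below are illustrative assumptions.
import torch
import torch.nn as nn

class SparseMotionExtractor(nn.Module):
    """Predictive model: estimates a compact set of keypoint-like motion cues
    for a frame (the sparse motion pattern that the bitstream would carry)."""
    def __init__(self, num_points: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_points * 2)  # (x, y) per keypoint

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        b = frame.size(0)
        return self.head(self.backbone(frame).flatten(1)).view(b, -1, 2)

class ConditionalFrameGenerator(nn.Module):
    """Generative model: synthesizes the to-be-coded frame from the coded key
    frame, conditioned on the sparse motion of the key and current frames."""
    def __init__(self, num_points: int = 10):
        super().__init__()
        self.fuse = nn.Linear(num_points * 4, 64)
        self.decoder = nn.Sequential(
            nn.Conv2d(3 + 64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, key_frame, key_motion, cur_motion):
        b, _, h, w = key_frame.shape
        # Fuse the two sparse motion patterns into a conditioning vector,
        # broadcast it spatially, and decode the current frame's appearance.
        cond = self.fuse(torch.cat([key_motion, cur_motion], 1).flatten(1))
        cond_map = cond.view(b, -1, 1, 1).expand(-1, -1, h, w)
        return self.decoder(torch.cat([key_frame, cond_map], 1))

# Usage: the decoder needs only a coded key frame plus per-frame sparse motion;
# the same motion stream could feed an action-recognition head for machine
# vision without reconstructing pixels at all.
extractor = SparseMotionExtractor()
generator = ConditionalFrameGenerator()
key, cur = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
recon = generator(key, extractor(key), extractor(cur))
print(recon.shape)  # torch.Size([1, 3, 64, 64])
```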
