Paper Title
Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations
Paper Authors
Paper Abstract
Previous work on action representation learning has focused on global representations of short video clips. In contrast, many practical applications, such as video alignment, strongly demand learning dense, frame-wise representations of long videos. In this paper, we introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representations, especially for long videos, in a self-supervised or weakly-supervised manner. Specifically, we introduce a simple yet effective video encoder that considers both spatial and temporal context by combining convolution and transformer. Inspired by recent progress in self-supervised learning, we propose a novel sequence contrastive loss (SCL) applied to two correlated views obtained through a series of spatio-temporal data augmentations, and develop it in two versions. One is the self-supervised version, which optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of the two augmented views and a prior Gaussian distribution over timestamp distances. The other is the weakly-supervised version, which uses dynamic time warping (DTW) together with video-level labels to build additional sample pairs across videos. Experiments on the FineGym, PennAction, and Pouring datasets show that our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification, while offering faster inference. Surprisingly, although it is not trained on paired videos as in previous works, our self-supervised version also achieves outstanding performance on video alignment and fine-grained frame retrieval tasks.
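The abstract's description of SCL, minimizing the KL-divergence between the cross-view frame similarity distribution and a Gaussian prior over timestamp distances, can be made concrete with a short sketch. Below is a minimal PyTorch illustration under stated assumptions: the function name `sequence_contrastive_loss`, the temperature `tau`, the prior width `sigma`, and the per-frame row normalization are hypothetical choices for illustration, not the authors' reference implementation.

```python
# Minimal sketch of the sequence contrastive loss (SCL) idea from the abstract.
# All names and default hyperparameters here are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, t1, t2, sigma=1.0, tau=0.1):
    """KL-divergence between the frame-to-frame similarity of two augmented
    views and a Gaussian prior over their timestamp distances.

    z1, z2: (T1, D), (T2, D) frame-wise embeddings of the two views
    t1, t2: (T1,), (T2,) float timestamps of the sampled frames in the original video
    sigma:  std of the Gaussian prior over timestamp distance (hypothetical default)
    tau:    softmax temperature (hypothetical default)
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)

    # Predicted distribution: softmax over cosine similarities of each frame
    # in view 1 against all frames in view 2.
    logits = z1 @ z2.T / tau                      # (T1, T2)
    log_pred = F.log_softmax(logits, dim=-1)

    # Prior: Gaussian over timestamp distance, normalized per frame of view 1.
    dist2 = (t1[:, None] - t2[None, :]) ** 2      # (T1, T2)
    prior = torch.exp(-dist2 / (2 * sigma ** 2))
    prior = prior / prior.sum(dim=-1, keepdim=True)

    # KL(prior || predicted), averaged over the frames of view 1.
    return F.kl_div(log_pred, prior, reduction="batchmean")

# Toy usage: 32 frames per view, 128-d embeddings, timestamps in [0, 1].
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
t1, t2 = torch.rand(32), torch.rand(32)
loss = sequence_contrastive_loss(z1, z2, t1, t2)
```

As the abstract suggests, the Gaussian prior pulls frames that are temporally close in the original video together in embedding space, which is what allows frame-wise structure to be learned without training on paired videos.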