Paper Title
Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for Self-Supervised Skeleton-Based Action Recognition
Paper Authors
Paper Abstract
Self-supervised skeleton-based action recognition with contrastive learning has attracted much attention. Recent literature shows that data augmentation and large sets of contrastive pairs are crucial in learning such representations. In this paper, we find that directly extending contrastive pairs based on normal augmentations brings limited returns in terms of performance, because the contribution of contrastive pairs from normal data augmentation to the loss gets smaller as training progresses. Therefore, we delve into hard contrastive pairs for contrastive learning. Motivated by the success of mixing augmentation strategies, which improve the performance of many tasks by synthesizing novel samples, we propose SkeleMixCLR: a contrastive learning framework with a spatio-temporal skeleton mixing augmentation (SkeleMix) that complements current contrastive learning approaches by providing hard contrastive samples. First, SkeleMix utilizes the topological information of skeleton data to mix two skeleton sequences by randomly combining the cropped skeleton fragments (the trimmed view) with the remaining skeleton sequences (the truncated view). Second, a spatio-temporal mask pooling is applied to separate these two views at the feature level. Third, we extend contrastive pairs with these two views. SkeleMixCLR leverages the trimmed and truncated views to provide abundant hard contrastive pairs, since each view involves some context information from the other due to the graph convolution operations, which allows the model to learn better motion representations for action recognition. Extensive experiments on the NTU-RGB+D, NTU120-RGB+D, and PKU-MMD datasets show that SkeleMixCLR achieves state-of-the-art performance. Code is available at https://github.com/czhaneva/SkeleMixCLR.
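The two core steps described in the abstract — mixing a cropped spatio-temporal skeleton region from one sequence into another, then separating the trimmed and truncated views with mask pooling at the feature level — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the (frames, joints, channels) array layout, and the explicit crop parameters are assumptions made for clarity.

```python
import numpy as np


def skelemix(seq_a, seq_b, joint_ids, t_start, t_len):
    """Mix a spatio-temporal crop of seq_a into seq_b (SkeleMix-style sketch).

    seq_a, seq_b: arrays of shape (T, V, C) -- frames, joints, channels.
    joint_ids:    joints cropped from seq_a (spatial extent of the trimmed view).
    t_start, t_len: temporal extent of the crop.
    Returns the mixed sequence and a boolean (T, V) mask marking the
    trimmed region (True = from seq_a, False = from seq_b).
    """
    mixed = seq_b.copy()
    mask = np.zeros(seq_b.shape[:2], dtype=bool)
    t_end = t_start + t_len
    for v in joint_ids:
        mixed[t_start:t_end, v] = seq_a[t_start:t_end, v]
        mask[t_start:t_end, v] = True
    return mixed, mask


def mask_pool(features, mask):
    """Spatio-temporal mask pooling (sketch): average per-location features
    separately over the trimmed (mask True) and truncated (mask False)
    regions, yielding one embedding for each view.

    features: array of shape (T, V, D) -- per-frame, per-joint features.
    """
    flat_feats = features.reshape(-1, features.shape[-1])
    flat_mask = mask.reshape(-1)
    trimmed_emb = flat_feats[flat_mask].mean(axis=0)
    truncated_emb = flat_feats[~flat_mask].mean(axis=0)
    return trimmed_emb, truncated_emb
```

In the actual framework the pooled trimmed/truncated embeddings would then serve as additional (hard) positives and negatives in the contrastive loss; here the crop region is passed in explicitly, whereas the paper samples it randomly and uses the skeleton topology to pick connected joint subsets.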