Paper Title

Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition

Paper Authors

Mei Chee Leong, Haosong Zhang, Hui Li Tan, Liyuan Li, Joo Hwee Lim

Paper Abstract

Fine-grained action recognition is a challenging task in computer vision. Because fine-grained datasets have small inter-class variations in both spatial and temporal space, fine-grained action recognition models require good temporal reasoning and discrimination of attribute action semantics. Leveraging the CNN's ability to capture high-level spatial-temporal feature representations and the Transformer's modeling efficiency in capturing latent semantics and global dependencies, we investigate two frameworks that combine a CNN vision backbone with a Transformer encoder to enhance fine-grained action recognition: 1) a vision-based encoder to learn latent temporal semantics, and 2) a multi-modal video-text cross encoder to exploit additional text input and learn the cross association between visual and text semantics. Our experimental results show that both Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality association, improving recognition performance over the CNN vision model. We achieve new state-of-the-art performance on the FineGym benchmark dataset with both proposed architectures.
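To make the first framework (CNN backbone + Transformer encoder over temporal features) concrete, below is a minimal PyTorch sketch. It is an illustration of the general idea only, not the authors' implementation: the per-frame ResNet-18 backbone, the 2-layer encoder, the learnable [CLS] token used for classification, and all layer sizes are assumptions chosen for brevity (the paper's actual backbone is a spatio-temporal CNN).

```python
# Illustrative sketch: CNN features per frame, then a Transformer encoder
# that performs self-attention over time. All hyperparameters are assumed.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class CNNTransformerClassifier(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 512, num_layers: int = 2):
        super().__init__()
        # Per-frame 2D ResNet-18 as a stand-in vision backbone; its
        # global-average-pooled output dimension is 512 = d_model.
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop FC head
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Learnable classification token prepended to the frame-feature sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, H, W)
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)  # (B, T, 512)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), feats], dim=1)
        encoded = self.encoder(tokens)   # self-attention over the temporal tokens
        return self.head(encoded[:, 0])  # classify from the [CLS] token


# Smoke test: FineGym's Gym99 setting has 99 action classes.
model = CNNTransformerClassifier(num_classes=99)
logits = model(torch.randn(2, 8, 3, 224, 224))  # 2 clips of 8 frames each
print(logits.shape)  # torch.Size([2, 99])
```

The design point the sketch captures is the division of labor described in the abstract: the CNN supplies high-level spatial features per frame, while the Transformer encoder models latent temporal semantics and global dependencies across frames before classification.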
