Paper Title
View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose
Paper Authors
Paper Abstract
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can exhibit significant appearance variation across viewpoints, making recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing work. Here, we propose an approach for learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses arising from projection and occlusion are difficult to represent through a deterministic mapping, so we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy than 3D pose estimation models when retrieving similar poses across different camera views. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and substantially reduce the embedding dimensionality relative to stacking frame-based embeddings, enabling efficient large-scale retrieval. Furthermore, to enable our embeddings to work with partially visible input, we investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings, without any additional training, achieves competitive performance relative to other models trained specifically for each task.
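The abstract does not spell out the implementation, but the probabilistic formulation it describes can be realized as a network head that maps 2D keypoints to the mean and variance of a Gaussian embedding, with retrieval scored by a distance on distributions rather than points. The sketch below is a minimal PyTorch illustration of that idea, not the authors' architecture: the module name `ProbEmbeddingHead`, the layer sizes, the diagonal-Gaussian choice, and the expected-squared-distance similarity are all assumptions.

```python
# Minimal sketch of a probabilistic pose embedding, assuming a diagonal
# Gaussian over the embedding space. Illustrative only; names, sizes, and
# the similarity measure are hypothetical, not the paper's exact design.
import torch
import torch.nn as nn

class ProbEmbeddingHead(nn.Module):
    """Maps flattened 2D keypoints to a Gaussian embedding (mean, variance)."""

    def __init__(self, num_keypoints: int = 17, embed_dim: int = 16, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(num_keypoints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, embed_dim)
        # Predict log-variance for numerical stability; variance = exp(logvar).
        self.logvar_head = nn.Linear(hidden, embed_dim)

    def forward(self, keypoints_2d: torch.Tensor):
        # keypoints_2d: (batch, num_keypoints, 2) in normalized image coordinates.
        h = self.backbone(keypoints_2d.flatten(start_dim=1))
        return self.mean_head(h), self.logvar_head(h).exp()

def expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """E||z_a - z_b||^2 for independent diagonal Gaussians; a simple
    similarity proxy (lower = more similar) usable for cross-view retrieval."""
    return ((mu_a - mu_b) ** 2 + var_a + var_b).sum(dim=-1)

if __name__ == "__main__":
    model = ProbEmbeddingHead()
    poses = torch.randn(4, 17, 2)  # e.g., the same pose seen from two views
    mu, var = model(poses)
    print(expected_sq_distance(mu[0], var[0], mu[1], var[1]))
```

Predicted variances let the model express the 2D-to-3D ambiguity the abstract mentions: projections consistent with many 3D poses can be assigned broad distributions instead of a single point.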
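Similarly, the keypoint occlusion augmentation can be sketched as randomly masking a subset of keypoints during training, so the embedding learns to handle partially visible input. Which keypoints are dropped and how masked entries are encoded are design choices the abstract leaves open; the function below is one hypothetical variant that zeroes out dropped keypoints and returns a visibility mask.

```python
# Hypothetical keypoint occlusion augmentation: randomly hide keypoints at
# training time. The drop strategy (independent per-keypoint) and the
# masked-value encoding (zeros plus a visibility flag) are assumptions,
# not the paper's exact scheme.
import numpy as np

def occlude_keypoints(keypoints_2d: np.ndarray, drop_prob: float = 0.2,
                      rng: np.random.Generator = None):
    """keypoints_2d: (num_keypoints, 2). Returns (augmented, visibility)."""
    rng = rng or np.random.default_rng()
    visible = rng.random(keypoints_2d.shape[0]) >= drop_prob  # (num_keypoints,)
    augmented = np.where(visible[:, None], keypoints_2d, 0.0)
    return augmented, visible.astype(np.float32)

# Example: augment one pose; the visibility mask can be fed to the model
# alongside the (partially zeroed) coordinates.
pose = np.random.randn(17, 2)
aug, vis = occlude_keypoints(pose, drop_prob=0.25)
```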