AIMusicGuru: Music Assisted Human Pose Correction

Paper Title

AIMusicGuru: Music Assisted Human Pose Correction

Authors

Shrestha, Snehesh; Fermüller, Cornelia; Huang, Tianyu; Win, Pyone Thant; Zukerman, Adam; Parameshwara, Chethan M.; Aloimonos, Yiannis

Abstract

Pose estimation techniques rely on visual cues available through observations represented in the form of pixels. Their performance is bounded by the frame rate of the video and suffers from motion blur, occlusions, and poor temporal coherence. These issues are magnified when people interact with objects and instruments, for example when playing the violin. Standard postprocessing approaches use interpolation and smoothing functions to filter noise and fill gaps, but they cannot model highly non-linear motion. We present a method that leverages our understanding of the strong causal relationship between the sounds produced and the motion that produces them. We use the audio signature to refine and predict accurate human body pose motion models. We propose MAPnet (Music Assisted Pose network) for generating a fine-grained motion model from sparse input pose sequences and continuous audio. To accelerate further research in this domain, we also open-source MAPdat, a new multi-modal dataset of 3D violin playing motion with music. We compare different standard machine learning models and analyze input modalities, sampling techniques, and audio and motion features. Experiments on MAPdat suggest that multi-modal approaches like ours are a promising direction for tasks previously approached with visual methods only. Our results show, both qualitatively and quantitatively, how audio can be combined with visual observation to help improve any pose estimation method.
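
For readers unfamiliar with the postprocessing baseline the abstract contrasts against, the following is a minimal sketch of gap filling by linear interpolation followed by moving-average smoothing. It is an illustration only, not the authors' code: the function names, the window size, and the toy track are assumptions, and a real pipeline would apply this per joint and per coordinate.

```python
# Hypothetical baseline sketch: none of these names come from the paper.

def interpolate_gaps(track):
    """Linearly fill None entries in a 1D joint-coordinate track."""
    known = [i for i, v in enumerate(track) if v is not None]
    if not known:
        raise ValueError("track has no observed frames")
    filled = list(track)
    # Hold the nearest observation before the first and after the last one.
    for i in range(known[0]):
        filled[i] = track[known[0]]
    for i in range(known[-1] + 1, len(track)):
        filled[i] = track[known[-1]]
    # Linear interpolation between consecutive observed frames.
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = (1 - t) * track[a] + t * track[b]
    return filled


def moving_average(track, window=3):
    """Smooth with a centered moving average, shrinking the window at the edges."""
    half = window // 2
    out = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        out.append(sum(track[lo:hi]) / (hi - lo))
    return out


# One coordinate of one joint, observed at a low frame rate (None = missing frame).
sparse = [0.0, None, None, 3.0, None, 1.0]
dense = interpolate_gaps(sparse)
smooth = moving_average(dense, window=3)
```

Between observed frames the filled values lie on straight lines, and the smoothing step flattens the observed peak further. Any fast, non-linear movement that happens between keyframes is therefore lost, which is the limitation the abstract attributes to this kind of postprocessing.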
