Paper Title

Understanding Embodied Reference with Touch-Line Transformer

Authors

Li, Yang; Chen, Xiaoxue; Zhao, Hao; Gong, Jiangtao; Zhou, Guyue; Rossano, Federico; Zhu, Yixin

Abstract

We study embodied reference understanding, the task of locating referents using embodied gestural signals and language references. Human studies have revealed that, contrary to a common misconception, objects referred to or pointed to do not lie on the elbow-wrist line; instead, they lie on the so-called virtual touch line. However, existing human pose representations fail to incorporate the virtual touch line. To tackle this problem, we devise the touch-line transformer: it takes tokenized visual and textual features as input and simultaneously predicts the referent's bounding box and a touch-line vector. Leveraging this touch-line prior, we further devise a geometric consistency loss that encourages collinearity between referents and touch lines. Using the touch line as gestural information significantly improves model performance. Experiments on the YouRefIt dataset show that our method achieves a +25.0% accuracy improvement under the 0.75 IoU criterion, closing 63.6% of the gap between model and human performance. Furthermore, we computationally verify prior human studies by showing that computational models locate referents more accurately when using the virtual touch line than when using the elbow-wrist line.
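The geometric consistency idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the loss, so everything below is an assumption made for illustration: the touch line is taken to run from the eye to the fingertip, the referent is represented by the center of its bounding box, and collinearity is scored as one minus the cosine similarity of the two direction vectors. The function name `collinearity_loss` and its parameters are hypothetical and are not taken from the paper's code.

```python
import math


def collinearity_loss(eye, fingertip, box):
    """Illustrative collinearity penalty (an assumed form, not the paper's loss).

    eye, fingertip: (x, y) points defining the assumed touch line.
    box: (x_min, y_min, x_max, y_max) predicted referent bounding box.
    Returns 1 - cos(angle) between the touch-line direction and the
    direction from the eye to the box center: 0 when the box center lies
    on the ray from the eye through the fingertip, up to 2 when it lies
    in the opposite direction.
    """
    # Center of the predicted referent box.
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0

    # Touch-line direction (eye -> fingertip) and referent direction (eye -> box center).
    tx, ty = fingertip[0] - eye[0], fingertip[1] - eye[1]
    rx, ry = cx - eye[0], cy - eye[1]

    norm = math.hypot(tx, ty) * math.hypot(rx, ry)
    if norm == 0.0:
        # Degenerate input (zero-length vector): no direction to compare.
        return 0.0

    cos_sim = (tx * rx + ty * ry) / norm
    return 1.0 - cos_sim


# A box centered on the extension of the touch line versus one off to the side.
aligned = collinearity_loss((0.0, 0.0), (1.0, 1.0), (4.0, 4.0, 6.0, 6.0))
off_line = collinearity_loss((0.0, 0.0), (1.0, 1.0), (4.0, -6.0, 6.0, -4.0))
```

In a training setup, a term of this kind would typically be added to the usual box regression losses, so that predicted boxes are nudged toward the direction indicated by the predicted touch-line vector. How the paper weights or combines such terms is not stated in the abstract.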
