Paper Title

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Authors

Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

Abstract

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e. relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU).
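
As a rough illustration of the two mechanisms the abstract highlights, the PyTorch sketch below shows (1) a message-passing refinement over detected object proposal features and (2) an attention step over that refined context while emitting a caption token. This is not the authors' implementation; the module names, dimensions, single-round message passing, and GRU-based update are all assumptions chosen for brevity.

# Illustrative sketch only (not the Scan2Cap code). Hypothetical modules for:
# (1) message passing over a fully connected graph of object proposals, and
# (2) one attention-based caption decoding step over the refined proposals.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalMessagePassing(nn.Module):
    """One round of message passing over a fully connected proposal graph."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Messages are computed from (sender, receiver) feature pairs.
        self.message_fn = nn.Linear(2 * feat_dim, feat_dim)
        self.update_fn = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, props: torch.Tensor) -> torch.Tensor:
        # props: (N, D) features of N detected object proposals.
        n, d = props.shape
        senders = props.unsqueeze(0).expand(n, n, d)    # senders[i][j] = props[j]
        receivers = props.unsqueeze(1).expand(n, n, d)  # receivers[i][j] = props[i]
        pair = torch.cat([senders, receivers], dim=-1)  # (N, N, 2D)
        # Each receiver i aggregates messages from every sender j.
        messages = F.relu(self.message_fn(pair)).mean(dim=1)  # (N, D)
        return self.update_fn(messages, props)          # relation-aware features


class AttentiveCaptionStep(nn.Module):
    """One decoding step: attend to context objects, then predict a token."""

    def __init__(self, feat_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # hidden: (1, H) decoder state; context: (N, D) refined proposal features.
        ctx = self.proj(context).unsqueeze(0)            # (1, N, H)
        attended, _ = self.attn(hidden.unsqueeze(0), ctx, ctx)
        return self.out(attended.squeeze(0))             # (1, vocab) token logits


if __name__ == "__main__":
    props = torch.randn(8, 128)                          # 8 proposals, 128-d each
    refined = RelationalMessagePassing(128)(props)       # (8, 128)
    step = AttentiveCaptionStep(128, 256, vocab_size=1000)
    logits = step(torch.randn(1, 256), refined)          # (1, 1000)
    print(refined.shape, logits.shape)

In the actual method the proposals would come from a 3D detection backbone on the input point cloud, and the decoding step would run autoregressively per token; the sketch only isolates the relational and attention components.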
