Paper Title

Segmenting Moving Objects via an Object-Centric Layered Representation

Authors

Junyu Xie, Weidi Xie, Andrew Zisserman

Abstract

The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video. We make four contributions: First, we introduce an object-centric segmentation model with a depth-ordered layer representation. This is implemented using a variant of the transformer architecture that ingests optical flow, where each query vector specifies an object and its layer for the entire video. The model can effectively discover multiple moving objects and handle mutual occlusions; Second, we introduce a scalable pipeline for generating multi-object synthetic training data via layer composition, which is used to train the proposed model, significantly reducing the requirements for labour-intensive annotations, and supporting Sim2Real generalisation; Third, we conduct thorough ablation studies, showing that the model is able to learn object permanence and temporal shape consistency, and is able to predict amodal segmentation masks; Fourth, we evaluate our model, trained only on synthetic data, on standard video segmentation benchmarks (DAVIS, MoCA, SegTrack, FBMS-59) and achieve state-of-the-art performance among existing methods that do not rely on any manual annotations. With test-time adaptation, we observe further performance boosts.
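To make the abstract's query-based layered design more concrete, below is a minimal, hypothetical PyTorch sketch: learned query vectors cross-attend to optical-flow features, and each query emits an object mask plus a depth-order score for the whole clip; masks can then be alpha-composited back to front, the same layer-composition idea the synthetic-data pipeline relies on. All class names, heads, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a query-based layered segmenter.
# Names, dimensions, and heads are illustrative assumptions only;
# they are not the authors' actual architecture.
import torch
import torch.nn as nn

class LayeredFlowSegmenter(nn.Module):
    """Each learned query cross-attends to optical-flow features and
    predicts one object layer: a mask plus a depth-order score."""

    def __init__(self, num_queries: int = 3, dim: int = 256):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # one query per object layer
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.mask_embed = nn.Linear(dim, dim)           # query -> mask kernel
        self.order_head = nn.Linear(dim, 1)             # query -> depth-order logit

    def forward(self, flow_feats: torch.Tensor):
        # flow_feats: (B, T*H*W, dim) flattened flow features for a whole clip,
        # so each query commits to one object identity across all frames.
        B = flow_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, Q, dim)
        q = self.decoder(q, flow_feats)                          # cross-attention
        kernels = self.mask_embed(q)                             # (B, Q, dim)
        masks = torch.einsum("bqc,bnc->bqn", kernels, flow_feats).sigmoid()
        order = self.order_head(q).squeeze(-1)                   # (B, Q)
        return masks, order

def composite_back_to_front(rgb_layers, alphas, order_logits):
    """Alpha-composite depth-ordered layers; the same operation can assemble
    synthetic training clips from single-object layers.
    rgb_layers: (Q, 3, H, W), alphas: (Q, 1, H, W) in [0, 1],
    order_logits: (Q,) where larger means nearer the camera."""
    canvas = torch.zeros_like(rgb_layers[0])
    for i in torch.argsort(order_logits):       # paint from back to front
        canvas = alphas[i] * rgb_layers[i] + (1 - alphas[i]) * canvas
    return canvas
```

In this sketch the number of queries caps how many objects can be discovered at once, and occlusion handling falls out of the ordering: a nearer layer's alpha overwrites whatever lies behind it, which is what lets the full (amodal) mask of an occluded object persist in its own layer.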
