A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach

Paper title

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach

Authors

Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma, Wenwu Zhu

Abstract

Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years. However, recent studies have found that current benchmark datasets may have obvious moment annotation biases, enabling several simple baselines even without training to achieve SOTA performance. In this paper, we take a closer look at existing evaluation protocols, and find both the prevailing dataset and evaluation metrics are the devils that lead to untrustworthy benchmarking. Therefore, we propose to re-organize the two widely-used datasets, making the ground-truth moment distributions different in the training and test splits, i.e., out-of-distribution (OOD) test. Meanwhile, we introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets. New benchmarking results indicate that our proposed evaluation protocols can better monitor the research progress. Furthermore, we propose a novel causality-based Multi-branch Deconfounding Debiasing (MDD) framework for unbiased moment prediction. Specifically, we design a multi-branch deconfounder to eliminate the effects caused by multiple confounders with causal intervention. In order to help the model better align the semantics between sentence queries and video moments, we enhance the representations during feature encoding. Specifically, for textual information, the query is parsed into several verb-centered phrases to obtain a more fine-grained textual feature. For visual information, the positional information has been decomposed from moment features to enhance representations of moments with diverse locations. Extensive experiments demonstrate that our proposed approach can achieve competitive results among existing SOTA approaches and outperform the base model with great gains.
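
The abstract names the metric "dR@n,IoU@m" but does not give its formula. As a rough illustration only, the sketch below assumes one plausible form of the discount: a query counts as recalled when one of its top-n predictions reaches IoU of at least m with the ground truth, and that hit is then scaled by two factors that shrink as the predicted start and end times drift from the annotated ones (with the drift normalized by video duration). The function names, the `>=` threshold, the product of two boundary factors, and the choice of the best-scoring hit among the top n are all assumptions made for this example, not details confirmed by the text above.

```python
from typing import Sequence, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def discounted_recall(
    preds: Sequence[Sequence[Moment]],  # ranked predictions per query
    gts: Sequence[Moment],              # one ground-truth moment per query
    durations: Sequence[float],         # video length per query, in seconds
    n: int = 1,
    m: float = 0.5,
) -> float:
    """Hypothetical discounted recall: hits are down-weighted by boundary error."""
    total = 0.0
    for ranked, gt, duration in zip(preds, gts, durations):
        best = 0.0
        for pred in ranked[:n]:
            if temporal_iou(pred, gt) >= m:
                # Assumed discount: 1 minus the normalized boundary distance.
                alpha_start = 1.0 - abs(pred[0] - gt[0]) / duration
                alpha_end = 1.0 - abs(pred[1] - gt[1]) / duration
                best = max(best, alpha_start * alpha_end)
        total += best
    return total / len(gts)
```

Under these assumptions, a prediction that exactly matches the annotation contributes 1, a miss contributes 0, and a loose hit contributes something in between, so a baseline that always outputs a long, dataset-typical span scores lower than it would under plain recall.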
