Paper Title
Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching
Paper Authors
Paper Abstract
Discriminatively localizing sounding objects in cocktail-party scenarios, i.e., mixed sound scenes, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we learn robust object representations by aggregating candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are selected by matching the audio and visual object category distributions, where audiovisual consistency serves as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and localizing sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
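The core of the second stage is the distribution-matching self-supervision described in the abstract. Below is a minimal PyTorch-style sketch of that idea, not the authors' implementation (see the linked repository for that): all tensor shapes and function names are illustrative, and the specific choices of cosine similarity against a class dictionary, global max pooling, and a KL-divergence loss are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def class_aware_localization(visual_feat, object_dict):
    """Correlate per-pixel visual features with pre-learned object
    representations to obtain one localization map per object class.

    visual_feat: (B, C, H, W) feature map from a vision backbone.
    object_dict: (K, C) dictionary of K class representations learned
                 in stage one (hypothetical shapes for this sketch).
    Returns:     (B, K, H, W) class-aware localization maps.
    """
    v = F.normalize(visual_feat, dim=1)   # unit-norm spatial features
    d = F.normalize(object_dict, dim=1)   # unit-norm class prototypes
    # Cosine similarity between every spatial position and every class.
    return torch.einsum('bchw,kc->bkhw', v, d)

def audiovisual_matching_loss(audio_logits, loc_maps):
    """Self-supervised matching loss: the category distribution implied
    by the visual localization maps should agree with the distribution
    predicted from the mixed audio.

    audio_logits: (B, K) class scores from an audio network.
    loc_maps:     (B, K, H, W) class-aware localization maps.
    """
    # Visual category distribution: global-max-pool each class map,
    # then normalize over the K classes.
    vis_scores = loc_maps.flatten(2).max(dim=2).values   # (B, K)
    p_visual = F.softmax(vis_scores, dim=1)
    log_p_audio = F.log_softmax(audio_logits, dim=1)
    # KL divergence acts as the audiovisual consistency signal.
    return F.kl_div(log_p_audio, p_visual, reduction='batchmean')
```

The max pooling treats the strongest peak of each class map as evidence that an object of that class is visible, so classes that sound but have no visual peak (or appear but are silent) produce a mismatch that the loss penalizes.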