Paper Title
A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis
Paper Authors
Paper Abstract
This paper introduces a curated dataset of urban scenes for audio-visual scene analysis, which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of the audio and visual modalities brings a significant performance gain compared to state-of-the-art uni-modal systems. Our approach obtained 84.8% accuracy, compared to 75.8% for the audio-only and 68.4% for the video-only equivalent systems.