Paper Title


Semi-Supervised NMF-CNN For Sound Event Detection

Authors

Teck Kai Chan, Cheng Siong Chin, Ye Li

Abstract


In this paper, a combinative approach using Nonnegative Matrix Factorization (NMF) and Convolutional Neural Network (CNN) is proposed for audio clip Sound Event Detection (SED). The main idea begins with the use of NMF to approximate strong labels for the weakly labeled data. Subsequently, using the approximated strongly labeled data, two different CNNs are trained in a semi-supervised framework, where one CNN is used for clip-level prediction and the other for frame-level prediction. Based on this idea, our model achieves an event-based F1-score of 45.7% on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge Task 4 validation dataset. By ensembling models through averaging their posterior outputs, the event-based F1-score can be increased to 48.6%. Compared with the baseline model, our proposed models outperform it by over 8%. On the DCASE 2020 Challenge Task 4 test set, our single model achieves an event-based F1-score of 44.4%, while our ensembled system achieves an event-based F1-score of 46.3%. These results hold a minimum margin of 7% over the baseline system, which demonstrates the robustness of our proposed method across different datasets.
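The two key ideas in the abstract can be sketched in code. The following is a minimal illustration, not the authors' exact pipeline: scikit-learn's `NMF` decomposes a nonnegative spectrogram into spectral bases and temporal activations (the kind of activations one could use to approximate frame-level strong labels from weak clip labels), and ensembling is done by averaging posterior outputs before thresholding. All shapes and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Illustrative (mel_bins x frames) magnitude spectrogram; real input would be
# computed from audio. NMF requires nonnegative entries, hence np.abs.
rng = np.random.default_rng(0)
spectrogram = np.abs(rng.standard_normal((64, 100)))

# Decompose: W holds spectral bases, H holds temporal activations. In the
# paper's idea, per-class activations over time serve as approximate
# frame-level (strong) labels for weakly labeled clips.
nmf = NMF(n_components=10, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(spectrogram)  # shape (64, 10)
H = nmf.components_                 # shape (10, 100)

# Ensembling as described in the abstract: average the posterior outputs of
# several models, then threshold into binary frame-level event decisions.
posteriors = [rng.random((100, 10)) for _ in range(3)]  # 3 models: frames x classes
averaged = np.mean(posteriors, axis=0)                  # shape (100, 10)
events = averaged > 0.5                                 # hypothetical threshold

print(W.shape, H.shape, averaged.shape)
```

Averaging posteriors before thresholding (rather than voting on per-model decisions) keeps the ensemble's confidence information, which is typically why it is preferred for score-based fusion.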
