Paper Title
Knowledge Distillation for Singing Voice Detection
Paper Authors
Paper Abstract
Singing Voice Detection (SVD) has been an active area of research in music information retrieval (MIR). Currently, two deep neural network-based methods, one based on a CNN and the other on an RNN, exist in the literature; they learn optimized features for the voice detection (VD) task and achieve state-of-the-art performance on common datasets. Both models have a large number of parameters (1.4M for the CNN and 65.7K for the RNN) and are hence not suitable for deployment on devices such as smartphones or embedded sensors with limited memory and computation power. The most popular method to address this issue in the deep learning literature, besides model compression, is knowledge distillation, in which a large pre-trained network, known as the teacher, is used to train a smaller student network. Despite the wide applications of SVD in music information retrieval, to the best of our knowledge, model compression for its practical deployment has not yet been explored. In this paper, we investigate this issue using both conventional and ensemble knowledge distillation techniques.
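
To make the teacher-student setup described above concrete, the following is a minimal sketch of a conventional knowledge distillation loss in a PyTorch-style setting. The function name, temperature, and weighting factor alpha are illustrative assumptions and not taken from the paper; the paper's actual architectures and ensemble variant are not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Conventional knowledge distillation loss: a weighted sum of
    (i) KL divergence between temperature-softened teacher and student
    distributions and (ii) cross-entropy with the ground-truth labels.
    Hyperparameter values are illustrative, not from the paper."""
    # Soft targets from the (frozen) pre-trained teacher.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)
    # Hard-label supervision on the student's own predictions.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Example: SVD framed as a 2-class (vocal / non-vocal) problem.
student_logits = torch.randn(8, 2)   # outputs of the small student network
teacher_logits = torch.randn(8, 2)   # outputs of the large pre-trained teacher
labels = torch.randint(0, 2, (8,))   # ground-truth frame labels
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In an ensemble variant, the teacher logits would typically be replaced by an average of the outputs of several pre-trained teacher networks before the same loss is applied.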