Non-local Convolutional Neural Networks (NLCNN) for Speaker Recognition

Paper title

Non-local Convolutional Neural Networks (NLCNN) for Speaker Recognition

Authors

Haici Yang, Hongda Mao, Ruirui Li, Chelsea J. T. Ju, Oguz Elibol

Abstract

Speaker recognition is the process of identifying a speaker from their voice. The technology has attracted growing attention with the rising popularity of smart voice assistants such as Amazon Alexa. In the past few years, various convolutional neural network (CNN) based speaker recognition algorithms have been proposed and have achieved satisfactory performance. However, convolutional operations are building blocks that typically operate on one local neighborhood at a time, and therefore fail to capture global, long-range interactions at the feature level, which are critical for understanding the patterns in a speaker's voice. In this work, we propose to apply Non-local Convolutional Neural Networks (NLCNN) to improve the capability of capturing long-range dependencies at the feature level, thereby improving speaker recognition performance. Specifically, we introduce non-local blocks in which the output response at a position is computed as a weighted sum of the input features at all positions. Combining non-local blocks with pre-defined CNN networks, we investigate the effectiveness of NLCNN models. Without extensive tuning, the proposed NLCNN models outperform state-of-the-art speaker recognition algorithms on the public VoxCeleb dataset. In addition, we investigate different types of non-local operations, applied to the frequency-time domain, the time domain, the frequency domain, and the frame level, respectively. Among them, the time domain is the most effective for speaker recognition applications.
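As an illustration of the non-local operation described in the abstract, the following minimal sketch computes each output frame as a weighted sum of the input features at all time positions. The abstract does not state how the weights are obtained, so the dot-product similarity, the softmax normalization, the residual connection, and the names softmax and non_local_time are assumptions made for this example, not the authors' implementation. Learned projections, which a trained block would normally include, are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_time(x):
    """Non-local operation over the time axis (illustrative sketch).

    x is a list of T frames, each a list of C feature values.
    Returns T frames; frame i is x[i] plus a weighted sum of all frames.
    """
    out = []
    for xi in x:
        # Similarity of frame i to every frame j (assumed: plain dot product).
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        # Normalize so the weights over all positions sum to 1 (assumed: softmax).
        weights = softmax(scores)
        # Weighted sum of the input features at all positions.
        mixed = [
            sum(w * xj[c] for w, xj in zip(weights, x))
            for c in range(len(xi))
        ]
        # Residual connection (assumed) so the block can sit inside an existing CNN.
        out.append([a + m for a, m in zip(xi, mixed)])
    return out
```

Under the same assumptions, a frequency-domain variant could reuse this function on the transposed input, so that each position is a frequency bin instead of a time frame.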

Paper title