Paper Title

Understanding Audio Features via Trainable Basis Functions

Authors

Kwan Yee Heung, Kin Wai Cheuk, Dorien Herremans

Abstract

In this paper, we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two different tasks, namely keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are typically fine-tuned and optimized in experiments; the input features, however, are often treated as fixed. In the case of audio, signals can be represented in two main ways: as raw waveforms (time domain) or as spectrograms (time-frequency domain). In addition, different spectrogram types are often used and tailored to fit different applications. In our experiments, we allow for this tailoring directly as part of the network. Our experimental results show that using trainable basis functions can boost the accuracy of keyword spotting (KWS) by 14.2 percentage points and lower the phone error rate (PER) by 9.5 percentage points. Although models using trainable basis functions become less effective as model complexity increases, the trained filter shapes can still provide insights into which frequency bins are important for the specific task. From our experiments, we conclude that trainable basis functions are a useful tool for boosting performance when model complexity is limited.
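The central idea, making the spectrogram basis functions trainable, can be illustrated with a minimal sketch: the short-time Fourier transform is expressed as 1-D convolutions whose kernels are initialized to the Fourier basis and then updated by backpropagation together with the rest of the network. The PyTorch sketch below is an illustrative assumption, not the authors' released code; the class name `TrainableSTFT` and the values `n_fft=512` and `hop_length=256` are chosen only for this example.

```python
# Minimal sketch (illustrative, not the paper's exact implementation):
# a spectrogram front end whose Fourier basis functions are ordinary
# trainable parameters of the network.
import numpy as np
import torch
import torch.nn as nn


class TrainableSTFT(nn.Module):
    """Magnitude spectrogram computed with 1-D convolutions whose kernels
    are initialized to the DFT basis and then refined by backpropagation."""

    def __init__(self, n_fft=512, hop_length=256, trainable=True):
        super().__init__()
        n_bins = n_fft // 2 + 1
        # Build the real (cosine) and imaginary (sine) DFT kernels.
        k = np.arange(n_bins)[:, None]        # frequency bin index
        n = np.arange(n_fft)[None, :]         # sample index within a frame
        window = np.hanning(n_fft)
        cos_kernel = np.cos(2 * np.pi * k * n / n_fft) * window
        sin_kernel = -np.sin(2 * np.pi * k * n / n_fft) * window

        # One Conv1d per component; the stride plays the role of the hop length.
        self.conv_real = nn.Conv1d(1, n_bins, n_fft, stride=hop_length, bias=False)
        self.conv_imag = nn.Conv1d(1, n_bins, n_fft, stride=hop_length, bias=False)
        self.conv_real.weight.data = torch.tensor(cos_kernel, dtype=torch.float32).unsqueeze(1)
        self.conv_imag.weight.data = torch.tensor(sin_kernel, dtype=torch.float32).unsqueeze(1)

        # Making the basis trainable is a single flag: gradients now flow
        # into the kernels, so the filter shapes adapt to the downstream task.
        for p in self.parameters():
            p.requires_grad = trainable

    def forward(self, waveform):
        # waveform: (batch, samples) -> magnitude spectrogram (batch, n_bins, frames)
        x = waveform.unsqueeze(1)
        real, imag = self.conv_real(x), self.conv_imag(x)
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)


if __name__ == "__main__":
    frontend = TrainableSTFT()
    spec = frontend(torch.randn(4, 16000))    # one second of 16 kHz audio
    print(spec.shape)                         # torch.Size([4, 257, 61])
```

With `trainable=False`, the module reduces to a fixed STFT front end, which makes it easy to compare fixed and trainable basis functions under an otherwise identical downstream model, as the experiments in the paper do.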
