Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks

Paper Title

Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks

Authors

Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, Michael Mandel

Abstract

Recent works have shown that Deep Recurrent Neural Networks using the LSTM architecture can achieve strong single-channel speech enhancement by estimating time-frequency masks. However, these models do not naturally generalize to multi-channel inputs from varying microphone configurations. In contrast, spatial clustering techniques can achieve such generalization but lack a strong signal model. Our work proposes a combination of the two approaches. By using LSTMs to enhance spatial clustering based time-frequency masks, we achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance and generality of multi-channel spatial clustering. We compare our proposed system to several baselines on the CHiME-3 dataset. We evaluate the quality of the audio from each system using SDR from the BSS_eval toolkit and PESQ. We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
