Why Did the x-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch Upon Target Score on VoxCeleb Data

Paper title

Why Did the x-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch Upon Target Score on VoxCeleb Data

Authors

Rosa González Hautamäki, Tomi Kinnunen

Abstract

Modern automatic speaker verification (ASV) relies heavily on machine learning implemented through deep neural networks. It can be difficult to interpret the output of these black boxes. In line with interpretative machine learning, we model the dependency of ASV detection score upon acoustic mismatch of the enrollment and test utterances. We aim to identify mismatch factors that explain target speaker misses (false rejections). We use distance in the first- and second-order statistics of selected acoustic features as the predictors in a linear mixed effects model, while a standard Kaldi x-vector system forms our ASV black-box. Our results on the VoxCeleb data reveal the most prominent mismatch factor to be in F0 mean, followed by mismatches associated with formant frequencies. Our findings indicate that x-vector systems lack robustness to intra-speaker variations.
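
To make the notion of acoustic-mismatch predictors concrete, the following is a minimal sketch of how one such pair of predictors could be computed for F0. It is not the authors' implementation: the function name, the use of an absolute difference as the distance, and the use of standard deviation as the second-order statistic are all assumptions made for illustration. Fitting the linear mixed effects model itself is not shown.

```python
from statistics import mean, pstdev


def mismatch_predictors(enroll_f0, test_f0):
    """Return hypothetical mismatch predictors for one enrollment/test trial.

    Assumption: distance is the absolute difference between the
    utterance-level mean (first-order) and standard deviation
    (second-order) of F0, given in Hz for voiced frames only.
    """
    return {
        "f0_mean_dist": abs(mean(enroll_f0) - mean(test_f0)),
        "f0_std_dist": abs(pstdev(enroll_f0) - pstdev(test_f0)),
    }


# Made-up F0 tracks (Hz), for illustration only.
enroll = [110.0, 120.0, 130.0]
test = [140.0, 150.0, 160.0]
predictors = mismatch_predictors(enroll, test)
```

In a setup like the one the abstract describes, values of this kind would be computed per trial for each selected acoustic feature and then used as predictors of the ASV detection score.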
