Paper Title
Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping
Paper Authors
Paper Abstract
Automatic Speech Recognition (ASR) systems are known to exhibit difficulties when transcribing children's speech. This can mainly be attributed to the absence of large children's speech corpora to train robust ASR models and the resulting domain mismatch when decoding children's speech with systems trained on adult data. In this paper, we propose multiple enhancements to alleviate these issues. First, we propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech. This enables us to leverage the data availability of adult speech corpora by making these samples perceptually similar to children's speech. Second, using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data. This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora to learn general and robust acoustic frame-level representations. Adopting this model for the ASR task using adult data augmented with the proposed source-filter warping strategy and a limited amount of in-domain children's speech significantly outperforms previous state-of-the-art results on the PF-STAR British English Children's Speech corpus with a 4.86% WER on the official test set.
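The abstract describes the source-filter warping augmentation only at a high level. As a rough illustration of the underlying idea, the Python sketch below performs a frame-wise LPC source-filter decomposition of adult speech and stretches the spectral envelope along the frequency axis so that formants land closer to where they would for a child speaker. The function name, parameter values, and the zero-phase resynthesis are illustrative assumptions, not the paper's actual augmentation pipeline.

```python
# Minimal, illustrative sketch of source-filter-style warping for data
# augmentation (NOT the paper's exact implementation). Each frame of adult
# speech is split into an LPC excitation ("source") and spectral envelope
# ("filter"); the envelope is stretched along the frequency axis to push
# formants upward, and the frame is resynthesised. All parameter values
# (warp factor, LPC order, frame/hop sizes) are assumptions for illustration.
import numpy as np
import librosa
from scipy.signal import lfilter


def source_filter_warp(wav, alpha=1.2, lpc_order=16, frame_len=1024, hop=256):
    """Shift the formants of `wav` upward by roughly a factor `alpha` (> 1)."""
    window = np.hanning(frame_len)
    n_bins = frame_len // 2 + 1
    # For alpha > 1, bin k of the warped envelope takes its value from
    # bin k / alpha of the original envelope, i.e. formants move upward.
    src_bins = np.minimum((np.arange(n_bins) / alpha).astype(int), n_bins - 1)
    out = np.zeros(len(wav) + frame_len)

    for start in range(0, len(wav) - frame_len, hop):
        frame = wav[start:start + frame_len] * window
        if not np.any(frame):
            continue  # skip silent frames, where LPC is ill-conditioned
        a = librosa.lpc(frame, order=lpc_order)     # filter coefficients A(z)
        residual = lfilter(a, [1.0], frame)         # source: e[n] = A(z) x[n]
        envelope = 1.0 / (np.abs(np.fft.rfft(a, frame_len)) + 1e-8)
        warped_envelope = envelope[src_bins]
        # Zero-phase resynthesis: impose the warped envelope on the
        # (approximately flat) residual spectrum, then overlap-add.
        spec = np.fft.rfft(residual, frame_len) * warped_envelope
        out[start:start + frame_len] += np.fft.irfft(spec, frame_len)

    out = out[:len(wav)]
    peak = np.max(np.abs(out))
    return out / peak * np.max(np.abs(wav)) if peak > 0 else out


if __name__ == "__main__":
    # Hypothetical usage: make an adult utterance perceptually more
    # child-like before using it to fine-tune the acoustic model.
    wav, sr = librosa.load("adult_utterance.wav", sr=16000)
    augmented = source_filter_warp(wav, alpha=1.2)
```

In this sketch the warp factor plays a role similar to vocal tract length perturbation: raising `alpha` emulates a shorter vocal tract and therefore higher formant frequencies, which is the perceptual direction from adult toward children's speech that the abstract refers to.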