Paper Title


Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR

Paper Authors

Balázs Tarján, György Szaszák, Tibor Fegyó, Péter Mihajlik

Paper Abstract


Recently, Deep Transformer models have proven to be particularly powerful in language modeling tasks for ASR. Their high complexity, however, makes them very difficult to apply in the first (single) pass of an online system. Recent studies showed that a considerable part of the knowledge of neural network Language Models (LM) can be transferred to traditional n-grams by using neural text generation based data augmentation. In our paper, we pre-train a GPT-2 Transformer LM on a general text corpus and fine-tune it on our Hungarian conversational call center ASR task. We show that although data augmentation with Transformer-generated text works well for isolating languages, it causes a vocabulary explosion in a morphologically rich language. Therefore, we propose a new method called subword-based neural text augmentation, where we retokenize the generated text into statistically derived subwords. We compare Morfessor and BPE statistical subword tokenizers and show that both methods can significantly improve the WER while greatly reducing vocabulary size and memory requirements. Finally, we also demonstrate that subword-based neural text augmentation outperforms the word-based approach not only in terms of overall WER but also in recognition of OOV words.
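
The abstract describes a three-step pipeline: sample text from a fine-tuned GPT-2 LM, retokenize the generated text into statistically derived subwords, and use the result as additional n-gram training data. Below is a minimal Python sketch of that idea, assuming the Hugging Face `transformers` and `sentencepiece` libraries are available. The checkpoint name, file paths, and hyperparameters are illustrative placeholders, not the paper's actual setup; the paper also evaluates Morfessor as an alternative to BPE, which this sketch does not cover.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import sentencepiece as spm

# 1) Sample augmentation text from a GPT-2 LM.
#    "gpt2" is a placeholder; the paper pre-trains on a general
#    (Hungarian) corpus and fine-tunes on call-center transcripts.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

bos = torch.tensor([[tokenizer.bos_token_id]])  # unconditional generation
with torch.no_grad():
    out = model.generate(
        bos,
        do_sample=True,
        top_k=50,
        max_length=64,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )
generated = [tokenizer.decode(seq, skip_special_tokens=True) for seq in out]

# 2) Train a BPE subword model on the original in-domain corpus
#    (hypothetical path and vocabulary size).
spm.SentencePieceTrainer.train(
    input="in_domain_corpus.txt",
    model_prefix="bpe",
    vocab_size=8000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# 3) Retokenize the generated text into subword units, keeping the
#    effective vocabulary bounded even for a morphologically rich language.
with open("augmented_subword_corpus.txt", "w") as f:
    for line in generated:
        f.write(" ".join(sp.encode(line, out_type=str)) + "\n")
```

The retokenized file can then be concatenated with the original in-domain corpus and passed to a standard n-gram toolkit such as SRILM or KenLM to build the first-pass LM for the online decoder.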
