Paper Title
A review of on-device fully neural end-to-end automatic speech recognition algorithms
Paper Authors
Paper Abstract
In this paper, we review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications. Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text normalizer, an inverse text normalizer, a decoder based on a Weighted Finite State Transducer (WFST), and so on. To obtain sufficiently high speech recognition accuracy with such conventional systems, a very large language model (up to 100 GB) is usually needed. Hence, the corresponding WFST becomes enormous, which prohibits on-device implementation. Recently, fully neural end-to-end speech recognition algorithms have been proposed. Examples include speech recognition systems based on Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder models (AED), Monotonic Chunk-wise Attention (MoChA), Transformer-based architectures, and so on. These fully neural network-based systems require much smaller memory footprints than conventional algorithms, and therefore their on-device implementation becomes feasible. In this paper, we review such end-to-end speech recognition models and extensively discuss their structures, performance, and advantages compared to conventional algorithms.
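As brief background on the simplest of the objectives named above (standard material, not taken from the paper itself): the CTC criterion marginalizes over all frame-level alignments that collapse, after removing blanks and repeated symbols, to the target label sequence. A minimal sketch in LaTeX notation, assuming an input sequence x of length T, a label sequence y, blank-augmented alignments \pi, and the collapsing map \mathcal{B}:

P(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x), \qquad \mathcal{L}_{\mathrm{CTC}} = -\log P(y \mid x)

RNN-T and attention-based models replace this conditionally independent, per-frame factorization with decoders that also condition on previously emitted labels, which is one of the trade-offs the review compares.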