Paper title
Streaming End-to-End Bilingual ASR Systems with Joint Language Identification
Paper authors
Paper abstract
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
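The two mechanisms named in the abstract (LID embeddings on the input side, language targets modeled jointly with ASR targets on the output side) can be illustrated with a minimal data-preparation sketch. Everything below is an assumption for illustration, not the authors' implementation: the abstract does not say how the embeddings are combined with the acoustic features (per-frame concatenation is assumed here), nor where the language target sits in the output sequence (a prepended tag token is assumed here). The names `LANG_TAGS`, `augment_features`, `add_language_target`, and `split_hypothesis` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical language tag tokens added to the ASR output vocabulary.
LANG_TAGS = {"en": "<en>", "es": "<es>"}


def augment_features(
    frames: Sequence[Sequence[float]],
    lid_embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Input side (assumed): concatenate a per-frame LID embedding
    to each acoustic feature frame before it enters the encoder."""
    if len(frames) != len(lid_embeddings):
        raise ValueError("need one LID embedding per acoustic frame")
    return [list(f) + list(e) for f, e in zip(frames, lid_embeddings)]


def add_language_target(tokens: Sequence[str], lang: str) -> List[str]:
    """Output side (assumed): prepend a language tag so that a single
    target sequence carries both the LID label and the transcript."""
    return [LANG_TAGS[lang]] + list(tokens)


def split_hypothesis(hypothesis: Sequence[str]) -> Tuple[str, List[str]]:
    """Recover (language, transcript tokens) from a decoded sequence.
    Returns "unknown" if the first token is not a language tag."""
    tag_to_lang = {tag: lang for lang, tag in LANG_TAGS.items()}
    if hypothesis and hypothesis[0] in tag_to_lang:
        return tag_to_lang[hypothesis[0]], list(hypothesis[1:])
    return "unknown", list(hypothesis)
```

With this layout, a downstream component can read the language identity from the same decoded sequence that carries the transcript, rather than querying a separate classifier.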