Paper Title
A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability
Authors
Abstract
In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in the target language. The backbone of SM2 is a Transformer Transducer, which has high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs that are not seen during training.
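The weak-supervision step described in the abstract, turning speech recognition transcriptions into ST labels with a machine translation service, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the names `AsrExample`, `StExample`, `build_weak_st_data`, and `toy_translate` are hypothetical, the `translate` callable stands in for whatever MT service is used, and reusing the transcript unchanged when the source and target languages match is an assumption made here to cover the transcription case.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AsrExample:
    """One utterance from a speech recognition corpus (hypothetical schema)."""
    audio_path: str   # path to the source-language audio
    transcript: str   # human transcription from the ASR corpus
    language: str     # source language code, e.g. "de"


@dataclass(frozen=True)
class StExample:
    """A weakly supervised {source-speech, target-text} pair."""
    audio_path: str
    target_text: str  # machine-translated text used as a weak label
    source_language: str
    target_language: str


def build_weak_st_data(
    asr_corpus: Iterable[AsrExample],
    translate: Callable[[str, str, str], str],
    target_language: str,
) -> List[StExample]:
    """Pair each utterance with a machine translation of its transcript."""
    pairs = []
    for ex in asr_corpus:
        if ex.language == target_language:
            # Assumption: same-language pairs keep the transcript (plain ASR).
            text = ex.transcript
        else:
            text = translate(ex.transcript, ex.language, target_language)
        pairs.append(
            StExample(ex.audio_path, text, ex.language, target_language)
        )
    return pairs


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for a real MT service: a tiny lookup table for illustration."""
    table = {("de", "en", "guten morgen"): "good morning"}
    # Fall back to the input text when the toy table has no entry.
    return table.get((src, tgt, text.lower()), text)


if __name__ == "__main__":
    corpus = [
        AsrExample("a.wav", "Guten Morgen", "de"),
        AsrExample("b.wav", "hello", "en"),
    ]
    for pair in build_weak_st_data(corpus, toy_translate, "en"):
        print(pair)
```

The sketch needs no human-labeled translations: the audio and transcript come from the ASR corpus, and only the target text is machine-generated, which is why the abstract calls the supervision weak.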